Unverified Commit 318bd988 authored by Wang, Yi's avatar Wang, Yi Committed by GitHub

Merge branch 'EleutherAI:master' into fix_ptun

parents 35f1b5a7 25dfd3f6
......@@ -3,3 +3,5 @@ env
data/
lm_cache
.idea
*.egg-info/
* @jon-tow @StellaAthena
* @haileyschoelkopf @lintangsutawika
FROM nvidia/cuda:11.2.0-cudnn8-runtime-ubuntu20.04
### Install python 3.10 and set it as default python interpreter
RUN apt update && apt install software-properties-common -y && \
add-apt-repository ppa:deadsnakes/ppa -y && apt update && \
apt install curl -y && \
apt install python3.10 -y && \
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 && \
update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 && \
apt install python3.10-venv python3.10-dev -y && \
curl -Ss https://bootstrap.pypa.io/get-pip.py | python3.10 && \
apt-get clean && rm -rf /var/lib/apt/lists/
### Copy files
COPY . /lm-evaluation-harness/
### Set working directory
WORKDIR /lm-evaluation-harness
### Install requirements
RUN pip install --no-cache-dir -e .
### Run bash
CMD ["/bin/bash"]
# Language Model Evaluation Harness
## We're Refactoring LM-Eval!
(as of 6/15/23)
We have a revamp of the Evaluation Harness library internals staged on the [big-refactor](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) branch! It is far along in progress, but before we start to move the `master` branch of the repository over to this new design with a new version release, we'd like to ensure that it's been tested by outside users and there are no glaring bugs.
We’d like your help to test it out! You can help by:
1. Trying out your current workloads on the big-refactor branch, and seeing if anything breaks or is counterintuitive,
2. Porting tasks supported in the previous version of the harness to the new YAML configuration format. Please check out our [task implementation guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md) for more information.
If you choose to port a task not yet completed according to [our checklist](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/README.md), you can contribute it by opening a PR with [Refactor] in its title that includes:
- A shell command to run the task in the `master` branch, and the resulting score
- A shell command to run the task in your PR branch to `big-refactor`, and the resulting score, to show that the two implementations produce matching results.
Lastly, as we carry out this switch to the new version over the next week, we will not be accepting new feature requests for the `master` branch beyond those already open, though we will continue to accept bugfixes to the `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EleutherAI Discord for more information.
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
......@@ -142,7 +157,7 @@ When reporting eval harness results, please also report the version of each task
## Test Set Decontamination
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points nto found in the model training set. Unfortunately, outside of models trained on the Pile and C4, its very rare that people who train models disclose the contents of the training data. However this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).
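The indexing pipeline itself lives behind the decontamination utilities, but the core idea is simple n-gram overlap. A minimal sketch, assuming nothing about the harness's actual index format (the helper names below are hypothetical):

```python
# Illustrative 13-gram exact-match contamination check (not the harness's implementation).
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as hashable tuples."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_doc_tokens, train_ngram_index, n=13):
    """Flag an evaluation document if any of its 13-grams appears verbatim in the training set."""
    return any(gram in train_ngram_index for gram in ngrams(eval_doc_tokens, n))
```

In practice the training-set side is precomputed once (the indices we distribute), so only the evaluation documents need to be tokenized and hashed at eval time.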
......
......@@ -202,7 +202,7 @@ def construct_requests(self, doc, ctx):
#### What's a `Request`? What's a `doc`?
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers, and anything else you think will be needed in order to assess your model for a given task. Keep in mind that its fields can be basically whatever you want (you can sort this out in `training_docs` / `validation_docs` / `test_docs` if you need to customise things - see above); just remember to be consistent with them throughout the rest of the `Task` you write up.
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text, you'll need to return an `rf.greedy_until` request; otherwise an `rf.loglikelihood` across all labels in a classification task will do.
The function `construct_requests` can return a list of `Request`s or an iterable; it's perfectly fine to `yield` them from something or other. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
The function `construct_requests` returns a list or tuple of `Request`s, or a single `Request`. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
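As a concrete sketch, the two common return shapes look like this (the task classes below are hypothetical and only mirror patterns used elsewhere in this repository):

```python
from lm_eval.base import Task, rf

class MyClassificationTask(Task):  # hypothetical, for illustration only
    def construct_requests(self, doc, ctx):
        # One loglikelihood Request per candidate label; they are consumed in
        # the same order by process_results.
        return [rf.loglikelihood(ctx, " {}".format(choice))[0] for choice in doc["choices"]]

class MyGenerationTask(Task):  # hypothetical, for illustration only
    def construct_requests(self, doc, ctx):
        # A single generation Request that stops at the first newline.
        return rf.greedy_until(ctx, ["\n"])
```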
```python
......
......@@ -286,6 +286,13 @@
|reversed_words | |✓ | | 10000|acc |
|rte |✓ |✓ | | 277|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|scrolls_contractnli |✓ |✓ | | 1037|em, acc, acc_norm |
|scrolls_govreport |✓ |✓ | | 972|rouge1, rouge2, rougeL |
|scrolls_narrativeqa |✓ |✓ | | 3425|f1 |
|scrolls_qasper |✓ |✓ | | 984|f1 |
|scrolls_qmsum |✓ |✓ | | 272|rouge1, rouge2, rougeL |
|scrolls_quality |✓ |✓ | | 2086|em, acc, acc_norm |
|scrolls_summscreenfd |✓ |✓ | | 338|rouge1, rouge2, rougeL |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|sst |✓ |✓ | | 872|acc |
|swag |✓ |✓ | | 20006|acc, acc_norm |
......
......@@ -119,6 +119,12 @@ class LM(abc.ABC):
class BaseLM(LM):
def __init__(self):
super().__init__()
self.batch_schedule = 1
self.batch_sizes = {}
self.max_batch_size = 512
@property
@abstractmethod
def eot_token_id(self):
......@@ -167,19 +173,52 @@ class BaseLM(LM):
"""
pass
def _detect_batch_size(self, requests=None, pos=0):
if requests:
_, context_enc, continuation_enc = requests[pos]
max_length = len(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1]
)
else:
max_length = self.max_length
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_length), device=self.device).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
utils.clear_torch_cache()
return batch_size
# subclass must implement properties vocab_size, eot_token_id, max_gen_toks, batch_size, device, max_length.
# TODO: enforce this somehow
def _encode_pair(self, context, continuation):
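# Move trailing whitespace from the context onto the continuation before tokenizing:
# BPE-style tokenizers fold a leading space into the following token, so this keeps the
# context/continuation token boundary consistent with encoding the concatenated string.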
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests):
new_reqs = []
for context, continuation in requests:
if context == "":
# end of text as context
context_enc = [self.eot_token_id]
context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(
continuation
)
else:
context_enc = self.tok_encode(context)
continuation_enc = self.tok_encode(continuation)
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
......@@ -193,19 +232,7 @@ class BaseLM(LM):
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
@find_executable_batch_size(
starting_batch_size=512
) # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
test_batch = torch.ones(
(batch_size, self.max_length), device=self.device
).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
......@@ -258,34 +285,32 @@ class BaseLM(LM):
re_ord = utils.Reorderer(requests, _collate)
reordered_requests = re_ord.get_reordered()
n_reordered_requests = len(reordered_requests)
# automatic (variable) batch size detection for vectorization
# pull longest context sample from request
if len(re_ord.get_reordered()) > 0:
_, context_enc, continuation_enc = re_ord.get_reordered()[0]
max_context = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
if (self.batch_size == 'auto'):
if override_bs is None:
print('Passed argument batch_size = auto. Detecting largest batch size')
@find_executable_batch_size(starting_batch_size=512) # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_context), device=self.device).long()
for _ in range(5):
out = F.log_softmax(self._model_call(test_batch), dim = -1).cpu()
return batch_size
batch_size = forward_batch()
print(f"Determined largest batch size: {batch_size}")
adaptive_batch_size = batch_size
else:
adaptive_batch_size = override_bs
else:
adaptive_batch_size = 0 if override_bs is None else override_bs
def _batch_scheduler(pos):
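# With batch_size = "auto:N", re-detect the executable batch size N times across the
# length-sorted request list; later chunks contain shorter sequences, so they can
# usually fit a larger batch.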
sched = pos // int(n_reordered_requests / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
print(
f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
)
self.batch_sizes[sched] = self._detect_batch_size(reordered_requests, pos)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
for chunk in utils.chunks(
tqdm(re_ord.get_reordered(), disable=disable_tqdm),
self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
tqdm(reordered_requests, disable=disable_tqdm),
n=self.batch_size
if self.batch_size != "auto"
else override_bs
if override_bs is not None
else 0,
fn=_batch_scheduler
if self.batch_size == "auto" and n_reordered_requests > 0 and not override_bs
else None,
):
inps = []
cont_toks_list = []
......@@ -387,18 +412,34 @@ class BaseLM(LM):
res = []
def _collate(x):
# the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over- rather than underestimates, which is more useful for planning
# - to know the size of a batch when going through the list, you know the first one is always the batch
# padded context length. this is useful to simplify the batching logic and more importantly to make
# automatic adaptive batches much much easier to implement
# - any OOMs will happen right away rather than near the end
toks = self.tok_encode(x[0])
return len(toks), x[0]
return -len(toks), x[0]
re_ord = utils.Reorderer(requests, _collate)
warn_stop_seq = False
for context, request_args in tqdm(re_ord.get_reordered()):
until = request_args["until"]
if isinstance(until, str):
until = [until]
if until:
(primary_until,) = self.tok_encode(until[0])
try:
(primary_until,) = self.tok_encode(until[0])
except ValueError:
if not warn_stop_seq:
print(
"Warning: a primary stop sequence is multi-token! Will default to EOS token for this tokenizer. Consider using `hf-causal-experimental` for multi-token stop sequence support for the time being."
)
warn_stop_seq = True
primary_until = self.eot_token_id
else:
primary_until = None
......@@ -856,6 +897,10 @@ class CachingLM:
lm.set_cache_hook(self.get_cache_hook())
def __getattr__(self, attr):
lm_attr = getattr(self.lm, attr)
if not callable(lm_attr):
return lm_attr
def fn(requests):
res = []
remaining_reqs = []
......
---
dataset_info:
features:
- name: question_id
dtype: string
- name: question_source
dtype: string
- name: question
dtype: string
- name: answer
struct:
- name: aliases
sequence: string
- name: value
dtype: string
- name: search_results
sequence:
- name: description
dtype: string
- name: filename
dtype: string
- name: rank
dtype: int32
- name: title
dtype: string
- name: url
dtype: string
- name: search_context
dtype: string
config_name: triviaqa
splits:
- name: train
num_bytes: 1270894387
num_examples: 87622
- name: validation
num_bytes: 163755044
num_examples: 11313
download_size: 632549060
dataset_size: 1434649431
---
{"triviaqa": {"description": "TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence\ntriples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts\nand independently gathered evidence documents, six per question on average, that provide\nhigh quality distant supervision for answering the questions.\n", "citation": "@InProceedings{JoshiTriviaQA2017,\n author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},\n title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},\n booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},\n month = {July},\n year = {2017},\n address = {Vancouver, Canada},\n publisher = {Association for Computational Linguistics},\n}\n", "homepage": "https://nlp.cs.washington.edu/triviaqa/", "license": "Apache License 2.0", "features": {"question_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_source": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"aliases": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "value": {"dtype": "string", "id": null, "_type": "Value"}}, "search_results": {"feature": {"description": {"dtype": "string", "id": null, "_type": "Value"}, "filename": {"dtype": "string", "id": null, "_type": "Value"}, "rank": {"dtype": "int32", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "url": {"dtype": "string", "id": null, "_type": "Value"}, "search_context": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "triviaqa", "config_name": "triviaqa", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 1271393601, "num_examples": 87622, "dataset_name": "triviaqa"}, "validation": {"name": "validation", "num_bytes": 163819509, "num_examples": 11313, "dataset_name": "triviaqa"}}, "download_checksums": {"http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz": {"num_bytes": 546481381, "checksum": "adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e"}}, "download_size": 546481381, "post_processing_size": null, "dataset_size": 1435213110, "size_in_bytes": 1981694491}}
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Custom TriviaQA because HF version sanitizes the dataset differently.
# https://github.com/huggingface/datasets/blob/9977ade72191ff0b6907ec63935448c6269a91a1/datasets/trivia_qa/trivia_qa.py#L285
"""TriviaQA (Unfiltered Raw) dataset."""
import json
import os
import datasets
_CITATION = """\
@InProceedings{JoshiTriviaQA2017,
author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},
title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
month = {July},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
}
"""
_DESCRIPTION = """\
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence
triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts
and independently gathered evidence documents, six per question on average, that provide
high quality distant supervision for answering the questions.
"""
_HOMEPAGE = "https://nlp.cs.washington.edu/triviaqa/"
_LICENSE = "Apache License 2.0"
_URLS = "https://nlp.cs.washington.edu/triviaqa/data/triviaqa-unfiltered.tar.gz"
class Triviaqa(datasets.GeneratorBasedBuilder):
"""TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples"""
VERSION = datasets.Version("0.0.2")
BUILDER_CONFIGS = [
datasets.BuilderConfig(
name="triviaqa", version=VERSION, description="The TriviaQA dataset"
),
]
def _info(self):
features = datasets.Features(
{
"question_id": datasets.Value("string"),
"question_source": datasets.Value("string"),
"question": datasets.Value("string"),
"answer": {
"aliases": datasets.features.Sequence(
datasets.Value("string"),
),
"value": datasets.Value("string"),
},
"search_results": datasets.features.Sequence(
{
"description": datasets.Value("string"),
"filename": datasets.Value("string"),
"rank": datasets.Value("int32"),
"title": datasets.Value("string"),
"url": datasets.Value("string"),
"search_context": datasets.Value("string"),
}
),
}
)
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=features,
homepage=_HOMEPAGE,
license=_LICENSE,
citation=_CITATION,
)
def _split_generators(self, dl_manager):
urls = _URLS
data_dir = dl_manager.download_and_extract(urls)
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": os.path.join(
data_dir, "triviaqa-unfiltered", "unfiltered-web-train.json"
),
},
),
datasets.SplitGenerator(
name=datasets.Split.VALIDATION,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": os.path.join(
data_dir, "triviaqa-unfiltered", "unfiltered-web-dev.json"
),
},
),
]
# method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
def _generate_examples(self, filepath):
with open(filepath, encoding="utf-8") as f:
json_data = json.load(f)["Data"]
for key, data in enumerate(json_data):
search_results = []
for search_result in data["SearchResults"]:
search_results.append(
{
"description": search_result["Description"]
if "Description" in search_result
else "",
"filename": search_result["Filename"]
if "Filename" in search_result
else "",
"rank": search_result["Rank"]
if "Rank" in search_result
else -1,
"title": search_result["Title"]
if "Title" in search_result
else "",
"url": search_result["Url"]
if "Url" in search_result
else "",
"search_context": search_result["SearchContext"]
if "SearchContext" in search_result
else "",
}
)
yield key, {
"question_id": data["QuestionId"],
"question_source": data["QuestionSource"],
"question": data["Question"],
"answer": {
"aliases": data["Answer"]["Aliases"],
"value": data["Answer"]["Value"],
},
"search_results": search_results,
}
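For reference, the builder above can be exercised directly with the `datasets` API (the local path is hypothetical, and loading dataset scripts requires a `datasets` version that still supports them):

```python
import datasets

# Load the custom TriviaQA builder from this script; the path below is an assumption.
triviaqa = datasets.load_dataset("lm_eval/datasets/triviaqa/triviaqa.py", name="triviaqa")
sample = triviaqa["train"][0]
print(sample["question"], "->", sample["answer"]["value"])
```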
import collections
import itertools
import numpy as np
import random
import lm_eval.metrics
import lm_eval.models
import lm_eval.tasks
import lm_eval.base
from lm_eval.utils import positional_deprecated, run_task_tests
from lm_eval.models.gpt2 import HFLM
import numpy as np
import transformers
@positional_deprecated
......@@ -16,6 +20,7 @@ def simple_evaluate(
tasks=[],
num_fewshot=0,
batch_size=None,
max_batch_size=None,
device=None,
no_cache=False,
limit=None,
......@@ -29,7 +34,7 @@ def simple_evaluate(
"""Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM]
Name of model or LM object, see lm_eval.models.get_model
Name of model, transformers.PreTrainedModel object, or LM object, see lm_eval.models.get_model
:param model_args: Optional[str]
String arguments for each model class, see LM.create_from_arg_string.
Ignored if `model` argument is a LM object.
......@@ -37,8 +42,10 @@ def simple_evaluate(
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int, optional
:param batch_size: int or str, optional
Batch size for model
:param max_batch_size: int, optional
Maximal batch size to try with automatic batch size detection
:param device: str, optional
PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param no_cache: bool
......@@ -67,8 +74,15 @@ def simple_evaluate(
if model_args is None:
model_args = ""
lm = lm_eval.models.get_model(model).create_from_arg_string(
model_args, {"batch_size": batch_size, "device": device}
model_args, {"batch_size": batch_size, "max_batch_size": max_batch_size, "device": device}
)
elif isinstance(model, transformers.PreTrainedModel):
lm = lm_eval.models.get_model("hf-causal")(
pretrained=model,
batch_size=batch_size,
max_batch_size=max_batch_size,
)
no_cache = True
else:
assert isinstance(model, lm_eval.base.LM)
lm = model
......@@ -77,7 +91,7 @@ def simple_evaluate(
lm = lm_eval.base.CachingLM(
lm,
"lm_cache/"
+ model
+ (model if isinstance(model, str) else model.model.config._name_or_path)
+ "_"
+ model_args.replace("=", "-").replace(",", "_").replace("/", "-")
+ ".db",
......@@ -101,11 +115,17 @@ def simple_evaluate(
)
# add info about the model and few shot config
model_name = None
if isinstance(model, str):
model_name = model
elif isinstance(model, transformers.PreTrainedModel):
model_name = "pretrained=" + model.config._name_or_path
results["config"] = {
"model": model,
"model": model_name,
"model_args": model_args,
"num_fewshot": num_fewshot,
"batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else [],
"device": device,
"no_cache": no_cache,
"limit": limit,
......
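The net effect of the evaluator changes above is that an already-instantiated `transformers.PreTrainedModel` can now be passed straight to `simple_evaluate`. A minimal sketch, with the model, task, and limit chosen purely for illustration:

```python
import transformers
from lm_eval import evaluator

# Pass an in-memory HF model; the harness wraps it as `hf-causal` and forces no_cache=True.
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

results = evaluator.simple_evaluate(
    model=model,
    tasks=["lambada_openai"],
    num_fewshot=0,
    batch_size="auto",
    max_batch_size=64,
    limit=16,  # small limit, just a smoke test
)
print(results["config"]["model"])   # "pretrained=gpt2", per the config block above
print(results["results"].keys())
```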
from . import gpt2
from . import gpt3
from . import anthropic_llms
from . import huggingface
from . import textsynth
from . import dummy
......@@ -11,6 +12,7 @@ MODEL_REGISTRY = {
"hf-seq2seq": huggingface.AutoSeq2SeqLM,
"gpt2": gpt2.GPT2LM,
"gpt3": gpt3.GPT3LM,
"anthropic": anthropic_llms.AnthropicLM,
"textsynth": textsynth.TextSynthLM,
"dummy": dummy.DummyLM,
}
......
import os
from lm_eval.base import BaseLM
from tqdm import tqdm
import time
def anthropic_completion(client, model, prompt, max_tokens_to_sample, temperature, stop):
"""Query Anthropic API for completion.
Retry with back-off until they respond
"""
import anthropic
backoff_time = 3
while True:
try:
response = client.completion(
prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
model=model,
# NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
# (e.g. gsm8k's ":") may truncate a lot of the input.
stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
max_tokens_to_sample=max_tokens_to_sample,
temperature=temperature,
)
print(response)
return response["completion"]
except RuntimeError:
# TODO: I don't actually know what error Anthropic raises when it times out
# so update this exception type once we find out.
import traceback
traceback.print_exc()
time.sleep(backoff_time)
backoff_time *= 1.5
class AnthropicLM(BaseLM):
REQ_CHUNK_SIZE = 20
def __init__(self, model):
"""
:param model: str
Anthropic model e.g. claude-instant-v1
"""
super().__init__()
import anthropic
self.model = model
self.client = anthropic.Client(os.environ['ANTHROPIC_API_KEY'])
@property
def eot_token_id(self):
raise NotImplementedError("No idea about anthropic tokenization.")
@property
def max_length(self):
return 2048
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
@property
def device(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def tok_encode(self, string: str):
raise NotImplementedError("No idea about anthropic tokenization.")
def tok_decode(self, tokens):
raise NotImplementedError("No idea about anthropic tokenization.")
def _loglikelihood_tokens(self, requests, disable_tqdm=False):
raise NotImplementedError("No support for logits.")
def greedy_until(self, requests):
if not requests:
return []
res = []
for request in tqdm(requests):
inp = request[0]
request_args = request[1]
until = request_args["until"]
response = anthropic_completion(
client=self.client,
model=self.model,
prompt=inp,
max_tokens_to_sample=self.max_gen_toks,
temperature=0.0,
stop=until,
)
res.append(response)
return res
def _model_call(self, inps):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until
raise NotImplementedError()
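For completeness, a minimal sketch of using the new backend programmatically (the model name is an example; the constructor reads `ANTHROPIC_API_KEY` from the environment, and the `anthropic` package must be installed):

```python
import os
from lm_eval.models.anthropic_llms import AnthropicLM

assert "ANTHROPIC_API_KEY" in os.environ, "set your Anthropic API key first"

lm = AnthropicLM(model="claude-instant-v1")
# greedy_until takes (prompt, request_args) pairs, as elsewhere in the harness.
completions = lm.greedy_until(
    [("Q: What is the capital of France?\nA:", {"until": ["\n"]})]
)
print(completions[0])
```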
......@@ -17,6 +17,9 @@ def _get_dtype(
class HFLM(BaseLM):
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
device="cuda",
......@@ -26,67 +29,93 @@ class HFLM(BaseLM):
subfolder=None,
tokenizer=None,
batch_size=1,
max_batch_size=512,
max_length=None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
dtype: Optional[Union[str, torch.dtype]]="auto",
):
super().__init__()
assert isinstance(device, str)
assert isinstance(pretrained, str)
assert isinstance(batch_size, (int, str))
device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list:
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specified")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = (
torch.device("cuda")
if torch.cuda.is_available()
else torch.device("cpu")
# Initialize model
if isinstance(pretrained, transformers.PreTrainedModel):
self.model = pretrained
self._device = self.model.device
if tokenizer:
assert isinstance(
tokenizer,
transformers.PreTrainedTokenizer
) or isinstance(
tokenizer,
transformers.PreTrainedTokenizerFast
)
self.tokenizer = tokenizer
else:
# Get tokenizer
model_name = self.model.name_or_path
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
model_name,
revision=revision,
trust_remote_code=trust_remote_code,
)
elif isinstance(pretrained, str):
# Initialize device
assert isinstance(device, str)
device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list:
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specified")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = (
torch.device("cuda")
if torch.cuda.is_available()
else torch.device("cpu")
)
revision = revision + ("/" + subfolder if subfolder is not None else "")
# Initialize new model and tokenizer instances
self.model = transformers.AutoModelForCausalLM.from_pretrained(
pretrained,
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).to(self.device)
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer if tokenizer else pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
)
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self.gpt2 = transformers.AutoModelForCausalLM.from_pretrained(
pretrained,
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).to(self.device)
self.gpt2.eval()
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision,
trust_remote_code=trust_remote_code,
)
else:
raise TypeError('Parameter pretrained should be of type str or transformers.PreTrainedModel')
self.model.eval()
self.vocab_size = self.tokenizer.vocab_size
if isinstance(
self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)
):
assert self.tokenizer.encode("hello\n\nhello") == [
31373,
198,
198,
31373,
], self.tokenizer.encode("hello\n\nhello")
# Validate batch_size
assert isinstance(batch_size, (int, str))
# setup for automatic batch size detection
if batch_size == "auto":
self.batch_size_per_gpu = batch_size
if str(batch_size).startswith("auto"):
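# batch_size may be "auto" or "auto:N"; N (default 1) controls how many times the batch
# size is re-detected while working through the requests (see _batch_scheduler in base.py).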
batch_size = batch_size.split(":")
self.batch_size_per_gpu = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self.batch_size_per_gpu = int(batch_size)
self.max_batch_size = max_batch_size
self._max_length = max_length
@property
def eot_token_id(self):
......@@ -95,11 +124,18 @@ class HFLM(BaseLM):
@property
def max_length(self):
try:
return self.gpt2.config.n_ctx
except AttributeError:
# gptneoconfig doesn't have n_ctx apparently
return self.gpt2.config.max_position_embeddings
if self._max_length: # if max length manually set, return it
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.model.config, attr):
return getattr(self.model.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
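# 1000000000000000019884624838656 is int(1e30), the placeholder transformers assigns to
# model_max_length when the tokenizer does not define a real maximum.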
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
......@@ -130,14 +166,14 @@ class HFLM(BaseLM):
logits returned from the model
"""
with torch.no_grad():
return self.gpt2(inps)[0]
return self.model(inps)[0]
def _model_generate(self, context, max_length, eos_token_id):
generation_kwargs = {"do_sample": False, "max_length": max_length}
if eos_token_id is not None:
generation_kwargs['eos_token_id'] = eos_token_id
generation_kwargs['pad_token_id'] = eos_token_id # setting eos_token_id as pad token
return self.gpt2.generate(context, **generation_kwargs)
return self.model.generate(context, **generation_kwargs)
# for backwards compatibility
......
......@@ -3,12 +3,12 @@ import torch
import torch.nn.functional as F
import transformers
import peft
from peft import __version__ as PEFT_VERSION
from pathlib import Path
from typing import List, Mapping, NewType, Optional, Tuple, Union
from tqdm import tqdm
from transformers import BatchEncoding
from accelerate import find_executable_batch_size
from lm_eval import utils
from lm_eval.base import BaseLM
......@@ -75,6 +75,7 @@ class HuggingFaceAutoLM(BaseLM):
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
max_batch_size: Optional[int] = 512,
max_gen_toks: Optional[int] = 256,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
......@@ -87,8 +88,11 @@ class HuggingFaceAutoLM(BaseLM):
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
gptq_use_triton: Optional[bool] = False,
bnb_4bit_quant_type: Optional[str] = None,
bnb_4bit_compute_dtype: Optional[Union[str, torch.dtype]] = None,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
Args:
......@@ -142,11 +146,21 @@ class HuggingFaceAutoLM(BaseLM):
`adapter_model.bin`. Compatible with [PEFT](https://github.com/huggingface/peft)
load_in_8bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-8bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.load_in_8bit
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-8bit
load_in_4bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-4bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-4bit
trust_remote_code (bool, optional, defaults to False):
If True, will trust the remote code when loading the model.
gptq_use_triton (bool, optional, defaults to False):
Use Triton for GPTQ inference.
bnb_4bit_quant_type (str, optional, defaults to None):
The quantization type to use for BnB 4bit quantization. See:
https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L77
bnb_4bit_compute_dtype (Union[str, torch.dtype], optional, defaults to None):
The compute dtype to use for BnB 4bit quantization. See:
https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L74
"""
super().__init__()
......@@ -167,10 +181,13 @@ class HuggingFaceAutoLM(BaseLM):
), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection
if batch_size == "auto":
self._batch_size = batch_size
if str(batch_size).startswith("auto"):
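# Same "auto" / "auto:N" convention as in gpt2.HFLM above: N controls how often the
# batch size is re-detected during evaluation.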
batch_size = batch_size.split(":")
self._batch_size = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self._batch_size = int(batch_size)
self.max_batch_size = max_batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
......@@ -186,6 +203,7 @@ class HuggingFaceAutoLM(BaseLM):
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
trust_remote_code=trust_remote_code,
)
self.tokenizer.model_max_length = self.max_length
......@@ -197,7 +215,6 @@ class HuggingFaceAutoLM(BaseLM):
max_cpu_memory,
offload_folder,
)
model_kwargs["load_in_8bit"] = load_in_8bit
self.model = self._create_auto_model(
pretrained=pretrained,
quantized=quantized,
......@@ -206,6 +223,10 @@ class HuggingFaceAutoLM(BaseLM):
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
gptq_use_triton=gptq_use_triton,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
**model_kwargs,
)
# note: peft_path can be different than pretrained model path
......@@ -215,8 +236,7 @@ class HuggingFaceAutoLM(BaseLM):
peft=peft,
revision=revision,
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
**model_kwargs,
load_in_4bit=load_in_4bit,
)
self.model.eval()
torch.set_grad_enabled(False)
......@@ -227,8 +247,11 @@ class HuggingFaceAutoLM(BaseLM):
# the user specified one so we force `self._device` to be the same as
# `lm_head`'s.
self._device = self.model.hf_device_map["lm_head"]
if not use_accelerate:
self.model.to(self._device)
if not use_accelerate and not (load_in_4bit or load_in_8bit):
try:
self.model.to(self._device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
def _create_auto_model(
self,
......@@ -241,12 +264,25 @@ class HuggingFaceAutoLM(BaseLM):
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
gptq_use_triton: Optional[bool] = False,
bnb_4bit_quant_type: Optional[str] = None,
bnb_4bit_compute_dtype: Optional[Union[str, torch.dtype]] = None,
) -> transformers.AutoModel:
"""Returns a pre-trained pytorch model from a pre-trained model configuration."""
if not quantized:
if load_in_4bit:
assert transformers.__version__ >= "4.30.0", "load_in_4bit requires transformers >= 4.30.0"
model_kwargs = {}
if transformers.__version__ >= "4.30.0":
model_kwargs["load_in_4bit"] = load_in_4bit
if load_in_4bit:
if bnb_4bit_quant_type:
model_kwargs["bnb_4bit_quant_type"] = bnb_4bit_quant_type
if bnb_4bit_compute_dtype:
model_kwargs["bnb_4bit_compute_dtype"] = _get_dtype(bnb_4bit_compute_dtype)
model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
......@@ -256,6 +292,7 @@ class HuggingFaceAutoLM(BaseLM):
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
**model_kwargs,
)
else:
from auto_gptq import AutoGPTQForCausalLM
......@@ -278,23 +315,14 @@ class HuggingFaceAutoLM(BaseLM):
peft: str,
revision: str,
subfolder: str,
device_map: Optional[Union[str, _DeviceMapping]] = None,
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
load_in_4bit: Optional[bool] = False,
):
if load_in_4bit:
assert PEFT_VERSION >= "0.4.0", "load_in_4bit requires peft >= 0.4.0"
model = self.AUTO_PEFT_CLASS.from_pretrained(
model,
peft,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
device_map=device_map,
max_memory=max_memory,
offload_folder=offload_folder,
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
)
return model
......@@ -305,11 +333,13 @@ class HuggingFaceAutoLM(BaseLM):
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
) -> transformers.PreTrainedTokenizer:
"""Returns a pre-trained tokenizer from a pre-trained tokenizer configuration."""
tokenizer = self.AUTO_TOKENIZER_CLASS.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
trust_remote_code=trust_remote_code,
)
tokenizer.pad_token = tokenizer.eos_token
return tokenizer
......@@ -351,7 +381,7 @@ class HuggingFaceAutoLM(BaseLM):
"""Return the maximum sequence length of the model.
NOTE: Different model configurations have different max sequence length
attribute names.
- n_positions: (CTRLConfig)
- n_positions: (CTRLConfig, T5Config)
- max_position_embeddings: (BartConfig, RoFormerConfig)
- n_ctx: (GPT2Config)
NOTE: For relative position encoded models you should specify the max
......@@ -408,19 +438,7 @@ class HuggingFaceAutoLM(BaseLM):
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
@find_executable_batch_size(
starting_batch_size=512
) # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
test_batch = torch.ones(
(batch_size, self.max_length), device=self.device
).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
......@@ -485,12 +503,14 @@ class AutoCausalLM(HuggingFaceAutoLM):
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
) -> transformers.PreTrainedTokenizer:
tokenizer = super()._create_auto_tokenizer(
pretrained=pretrained,
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
trust_remote_code=trust_remote_code,
)
tokenizer.padding_side = "left"
return tokenizer
......@@ -543,15 +563,6 @@ class AutoSeq2SeqLM(HuggingFaceAutoLM):
AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
AUTO_PEFT_CLASS = peft.PeftModel
@property
def max_length(self) -> int:
"""Return the maximum sequence length of the model.
TODO: Currently only works for relative position encoded Seq2Seq models.
"""
if self._max_length is not None:
return self._max_length
return self._DEFAULT_MAX_LENGTH
def loglikelihood(
self, requests: List[Tuple[str, str]]
) -> List[Tuple[float, bool]]:
......
......@@ -4,6 +4,7 @@ from typing import List, Union
import sacrebleu
import lm_eval.base
from . import babi
from . import superglue
from . import glue
from . import arc
......@@ -60,6 +61,7 @@ from . import xwinograd
from . import pawsx
from . import xnli
from . import mgsm
from . import scrolls
########################################
# Translation tasks
......@@ -92,6 +94,7 @@ all_translation_benchmarks = {
TASK_REGISTRY = {
"babi": babi.Babi,
# GLUE
"cola": glue.CoLA,
"mnli": glue.MNLI,
......@@ -325,6 +328,7 @@ TASK_REGISTRY = {
**pawsx.construct_tasks(),
**xnli.construct_tasks(),
**mgsm.construct_tasks(),
**scrolls.construct_tasks()
}
......
"""
Inspired by https://github.com/stanford-crfm/helm/blob/0eaaa62a2263ddb94e9850ee629423b010f57e4a/src/helm/benchmark/scenarios/babi_qa_scenario.py
"""
import numpy as np
from collections import defaultdict
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
_CITATION = """
@article{weston2015towards,
title={Towards ai-complete question answering: A set of prerequisite toy tasks},
author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1502.05698},
year={2015}
}
"""
class Babi(Task):
VERSION = 0
DATASET_PATH = "Muennighoff/babi"
DATASET_NAME = None
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
def training_docs(self):
if self.has_training_docs():
return self.dataset["train"]
def validation_docs(self):
if self.has_validation_docs():
return self.dataset["valid"]
def test_docs(self):
if self.has_test_docs():
return self.dataset["test"]
def doc_to_text(self, doc):
return (
doc['passage'] + doc['question']
)
def should_decontaminate(self):
return False # TODO Necessary?
def doc_to_decontamination_query(self, doc):
return f"Passage: {doc['passage']}\nQuestion: {doc['question']}\nAnswer:"
def doc_to_target(self, doc):
return " " + doc['answer']
def construct_requests(self, doc, ctx):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param ctx: str
The context string, generated by fewshot_context. This includes the natural
language description, as well as the few shot examples, and the question
part of the document for `doc`.
"""
return rf.greedy_until(ctx, ["\n"])
def process_results(self, doc, results):
"""Take a single document and the LM results and evaluates, returning a
dict where keys are the names of submetrics and values are the values of
the metric for that one document
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param results:
The results of the requests created in construct_requests.
"""
gold = doc["answer"]
pred = gold.strip() == results[0].strip()
return {"em": pred}
def aggregation(self):
return {
"em": mean,
}
def higher_is_better(self):
return {
"em": True,
}
......@@ -14,7 +14,6 @@ Homepage: https://github.com/hendrycks/test
"""
from lm_eval.base import MultipleChoiceTask
_CITATION = """
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
......@@ -103,8 +102,8 @@ def create_task(subject):
class GeneralHendrycksTest(MultipleChoiceTask):
VERSION = 0
DATASET_PATH = "hendrycks_test"
VERSION = 1
DATASET_PATH = "cais/mmlu"
DATASET_NAME = None
def __init__(self, subject):
......@@ -112,7 +111,7 @@ class GeneralHendrycksTest(MultipleChoiceTask):
super().__init__()
def has_training_docs(self):
return False
return True
def has_validation_docs(self):
return True
......@@ -126,41 +125,50 @@ class GeneralHendrycksTest(MultipleChoiceTask):
def test_docs(self):
return map(self._process_doc, self.dataset["test"])
def _format_subject(self, subject):
words = subject.split("_")
return " ".join(words)
def fewshot_context(self, doc, num_fewshot, **kwargs):
subject = self.DATASET_NAME
description = f"The following are multiple choice questions (with answers) about {self._format_subject(subject)}."
kwargs["description"] = description
return super().fewshot_context(doc=doc, num_fewshot=num_fewshot, **kwargs)
def _process_doc(self, doc):
def format_example(doc, keys):
"""
Question: <prompt>
Choices:
<prompt>
A. <choice1>
B. <choice2>
C. <choice3>
D. <choice4>
Answer:
"""
prompt = "Question: " + doc["question"] + "\nChoices:\n"
prompt += "".join(
question = doc["question"].strip()
choices = "".join(
[f"{key}. {choice}\n" for key, choice in zip(keys, doc["choices"])]
)
prompt += "Answer:"
prompt = f"{question}\n{choices}Answer:"
return prompt
keys = ["A", "B", "C", "D"]
return {
"query": format_example(doc, keys),
"choices": doc["choices"],
"gold": keys.index(doc["answer"])
if isinstance(doc["answer"], str)
else doc["answer"],
"choices": keys,
"gold": doc["answer"],
}
def fewshot_examples(self, k, rnd):
# fewshot_examples is not just sampling from train_docs because dev is
# in the same distribution as val/test but auxiliary_train isn't
if self._fewshot_docs is None:
self._fewshot_docs = list(map(self._process_doc, self.dataset["dev"]))
return rnd.sample(list(self._fewshot_docs), k)
# use the unchanged order of the dev set without sampling,
# just as in the original code https://github.com/hendrycks/test/blob/master/evaluate.py#L28
return self._fewshot_docs[:k]
def doc_to_text(self, doc):
return doc["query"]
......
"""
SCROLLS: Standardized CompaRison Over Long Language Sequences
https://arxiv.org/abs/2201.03533
SCROLLS is a suite of datasets that require synthesizing information over long texts.
The benchmark includes seven natural language tasks across multiple domains,
including summarization, question answering, and natural language inference.
Homepage: https://www.scrolls-benchmark.com/
Since SCROLLS tasks are generally longer than the maximum sequence length of many models,
it is possible to create "subset" tasks that contain only those samples whose tokenized length
is less than some pre-defined limit. For example, to create a subset of "Qasper" that would
be suitable for a model using the GPTNeoX tokenizer and a 4K maximum sequence length:
```
class QasperGPTNeoX4K(Qasper):
PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped"]
PRUNE_MAX_TOKENS = 4096
PRUNE_NUM_PROC = _num_cpu_cores() # optional, to speed up pruning of large datasets like NarrativeQA
```
`PRUNE_TOKENIZERS` can contain more than one tokenizer; this will include only samples that are
less than `PRUNE_MAX_TOKENS` for ALL of the tokenizers. This can be useful for comparing models
that use different tokenizers but the same maximum sequence length.
Once the subset task class has been defined in this file, it can be used by adding the class
to `lm_eval/tasks/__init__.py`.
NOTE: GovReport may need `max_gen_toks` set larger for causal models.
"""
from abc import abstractmethod
from datasets import load_metric
from transformers import AutoTokenizer
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
from functools import reduce
import transformers.data.metrics.squad_metrics as squad_metrics
import numpy as np
import re
_CITATION = """
@inproceedings{shaham-etal-2022-scrolls,
title = "{SCROLLS}: Standardized {C}ompa{R}ison Over Long Language Sequences",
author = "Shaham, Uri and
Segal, Elad and
Ivgi, Maor and
Efrat, Avia and
Yoran, Ori and
Haviv, Adi and
Gupta, Ankit and
Xiong, Wenhan and
Geva, Mor and
Berant, Jonathan and
Levy, Omer",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.823",
pages = "12007--12021"
}
"""
# SCROLLS is formulated as a sequence-to-sequence task.
# To allow for evaluation of causal models, we'll
# reformulate these with appropriate prompts
def _download_metric():
import os
import shutil
from huggingface_hub import hf_hub_download
scrolls_metric_path = hf_hub_download(repo_id="tau/scrolls", repo_type="dataset", filename="metrics/scrolls.py")
updated_scrolls_metric_path = (
os.path.dirname(scrolls_metric_path) + os.path.basename(scrolls_metric_path).replace(".", "_") + ".py"
)
shutil.copy(scrolls_metric_path, updated_scrolls_metric_path)
return updated_scrolls_metric_path
def _process_doc_prepended_question(doc):
# "When a query is given in addition to the raw text (as
# in QMSum, Qasper, NarrativeQA, QuALITY, and ContractNLI),
# we prepend it to the text, using two newlines as a natural separator"
input = doc["input"]
split = input.find("\n\n")
return {
"id": doc["id"],
"pid": doc["pid"],
"input": input,
"outputs": doc["outputs"],
"question": input[0:split],
"text": input[split + 2:]
}
def _drop_duplicates_in_input(untokenized_dataset):
# from scrolls/evaluator/dataset_evaluator.py
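# Collapse rows that share an "id", gathering each duplicate's reference "output" string
# into a single "outputs" list column, so every document appears exactly once.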
indices_to_keep = []
id_to_idx = {}
outputs = []
for i, (id_, output) in enumerate(zip(untokenized_dataset["id"], untokenized_dataset["output"])):
if id_ in id_to_idx:
outputs[id_to_idx[id_]].append(output)
continue
indices_to_keep.append(i)
id_to_idx[id_] = len(outputs)
outputs.append([output])
untokenized_dataset = untokenized_dataset.select(indices_to_keep).flatten_indices()
untokenized_dataset = untokenized_dataset.remove_columns("output")
untokenized_dataset = untokenized_dataset.add_column("outputs", outputs)
return untokenized_dataset
def _num_cpu_cores():
# https://stackoverflow.com/questions/1006289/how-to-find-out-the-number-of-cpus-using-python/55423170#55423170
try:
import psutil
return psutil.cpu_count(logical=False)
except ImportError:
import os
return len(os.sched_getaffinity(0))
class _SCROLLSTask(Task):
VERSION = 0
DATASET_PATH = "tau/scrolls"
DATASET_NAME = None
PRUNE_TOKENIZERS = None
PRUNE_MAX_TOKENS = None
PRUNE_NUM_PROC = None
def __init__(self, no_metric=False):
super().__init__()
self.metric = load_metric(_download_metric(), config_name=self.DATASET_NAME) if not no_metric else None
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def has_test_docs(self):
return False
def training_docs(self):
for doc in self.dataset["train"]:
yield from self._process_doc(doc)
def validation_docs(self):
for doc in self.dataset["validation"]:
yield from self._process_doc(doc)
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["input"]
def download(self, *args, **kwargs):
super().download(*args, **kwargs)
del self.dataset["test"]
for split in self.dataset:
self.dataset[split] = _drop_duplicates_in_input(self.dataset[split])
if self.PRUNE_TOKENIZERS is not None and self.PRUNE_MAX_TOKENS is not None:
self.prune()
def _get_prune_text(self, sample):
return self.doc_to_text(self._process_doc(sample)[0])
def prune(self):
"""Create a pruned version of a SCROLLS task dataset containing only inputs
that are less than `max_tokens` when tokenized by each tokenizer
"""
tokenizers = [AutoTokenizer.from_pretrained(tokenizer) for tokenizer in self.PRUNE_TOKENIZERS]
cache = {}
def _filter(sample):
text = self._get_prune_text(sample)
cached = cache.get(text, None)
if cached is None:
for tokenizer in tokenizers:
if len(tokenizer(text).input_ids) > self.PRUNE_MAX_TOKENS:
cache[text] = False
return False
cache[text] = True
return True
else:
return cached
self.dataset = self.dataset.filter(_filter, num_proc=self.PRUNE_NUM_PROC)
def doc_to_target(self, doc):
return " " + ", ".join(doc["outputs"])
def doc_to_text(self, doc):
return f"{doc['text']}\n\nQuestion: {doc['question']}\nAnswer:"
def higher_is_better(self):
return {x: True for x in self._scrolls_metrics().keys()}
@abstractmethod
def _scrolls_metrics(self):
pass
def _make_compute_metrics(self, value):
def compute_metrics(samples):
predictions, references = zip(*samples) # unzip, if you will
computed = self.metric.compute(predictions=predictions, references=references)
return computed[value]
return compute_metrics
def aggregation(self):
return {
key: self._make_compute_metrics(value) for key, value in self._scrolls_metrics().items()
}
class _SCROLLSMultipleChoiceTask(_SCROLLSTask):
def __init__(self):
super().__init__(no_metric=True)
def _scrolls_metrics(self):
return None
def aggregation(self):
return {
"em": mean,
"acc": mean,
"acc_norm": mean
}
def higher_is_better(self):
return {
"em": True,
"acc": True,
"acc_norm": True
}
def process_results(self, doc, results):
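# `results` holds one loglikelihood per choice; acc takes the argmax directly, while
# acc_norm first normalizes each loglikelihood by the choice's character length.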
gold = doc["gold"]
acc = 1.0 if np.argmax(results) == gold else 0.0
completion_len = np.array([float(len(i)) for i in doc["choices"]])
acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0
return {
"acc": acc,
"acc_norm": acc_norm,
"em": acc_norm * 100.0,
}
def construct_requests(self, doc, ctx):
lls = [
rf.loglikelihood(ctx, " {}".format(choice))[0] for choice in doc["choices"]
]
return lls
class _SCROLLSSummaryTask(_SCROLLSTask):
def _process_doc(self, doc):
return [doc]
def _scrolls_metrics(self):
return {"rouge1": "rouge/rouge1", "rouge2": "rouge/rouge2", "rougeL": "rouge/rougeL"}
def process_results(self, doc, results):
return {
"rouge1": (results[0], doc["outputs"]),
"rouge2": (results[0], doc["outputs"]),
"rougeL": (results[0], doc["outputs"])
}
def construct_requests(self, doc, ctx):
return [rf.greedy_until(ctx, {'until': ["\n"]})]
def doc_to_text(self, doc):
return f"{doc['input']}\n\nQuestion: What is a summary of the preceding text?\nAnswer:"
class Qasper(_SCROLLSTask):
"""A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
https://arxiv.org/abs/2105.03011
"""
DATASET_NAME = "qasper"
def _process_doc(self, doc):
doc = _process_doc_prepended_question(doc)
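# Treat the question as yes/no only if every reference answer normalizes to "yes" or "no";
# such questions are scored by comparing the loglikelihoods of " yes" vs " no" rather than
# by free-form generation.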
doc["is_yes_no"] = reduce(lambda prev, cur: prev and squad_metrics.normalize_answer(cur)
in ["yes", "no"], doc["outputs"], True)
return [doc]
def _scrolls_metrics(self):
return {"f1": "f1"}
def process_results(self, doc, results):
if doc["is_yes_no"]:
prediction = " yes" if results[0] > results[1] else " no"
elif len(results[0].strip()) == 0:
prediction = "Unanswerable"
else:
prediction = results[0]
return {
"f1": (prediction, doc["outputs"])
}
def construct_requests(self, doc, ctx):
if doc["is_yes_no"]:
ll_yes, _ = rf.loglikelihood(ctx, " yes")
ll_no, _ = rf.loglikelihood(ctx, " no")
return [ll_yes, ll_no]
else:
return [rf.greedy_until(ctx, {'until': ["\n"]})]
class QuALITY(_SCROLLSMultipleChoiceTask):
"""QuALITY: Question Answering with Long Input Texts, Yes!
https://arxiv.org/abs/2112.08608
"""
DATASET_NAME = "quality"
_multiple_choice_pattern = re.compile(r" *\([A-D]\) *")
@staticmethod
def _normalize_answer(text):
return " ".join(text.split()).strip()
def _process_doc(self, doc):
doc = _process_doc_prepended_question(doc)
split = doc["text"].find("\n\n", doc["text"].find("(D)"))
choices_text = doc["text"][:split]
doc["text"] = doc["text"][split:].strip()
doc["choices"] = [QuALITY._normalize_answer(choice) for choice in re.split(
QuALITY._multiple_choice_pattern, choices_text)[1:]]
doc["gold"] = doc["choices"].index(QuALITY._normalize_answer(doc["outputs"][0]))
return [doc]
class NarrativeQA(_SCROLLSTask):
"""The NarrativeQA Reading Comprehension Challenge
https://arxiv.org/abs/1712.07040
"""
DATASET_NAME = "narrative_qa"
def _process_doc(self, doc):
return [_process_doc_prepended_question(doc)]
def _scrolls_metrics(self):
return {"f1": "f1"}
def _get_prune_text(self, doc):
# pruning narrativeqa takes forever -- let's cheat a bit
# and just cache on the text, not the question, since
# the dataset is different questions about the same large
# documents
return self._process_doc(doc)[0]["text"]
def process_results(self, doc, results):
return {
"f1": (results[0], doc["outputs"])
}
def construct_requests(self, doc, ctx):
return [rf.greedy_until(ctx, {'until': ["\n"]})]
class ContractNLI(_SCROLLSMultipleChoiceTask):
"""ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
https://arxiv.org/abs/2110.01799
"""
DATASET_NAME = "contract_nli"
CHOICES = ["Not mentioned", "Entailment", "Contradiction"]
def _process_doc(self, doc):
doc = _process_doc_prepended_question(doc)
doc["choices"] = ContractNLI.CHOICES
doc["gold"] = ContractNLI.CHOICES.index(doc["outputs"][0])
return [doc]
def doc_to_text(self, doc):
return f"{doc['text']}\n\nHypothesis: {doc['question']}\nConclusion:"
class GovReport(_SCROLLSSummaryTask):
"""Efficient Attentions for Long Document Summarization
https://arxiv.org/abs/2104.02112
Note: The average length of the reference summaries is ~3,000
characters, or ~600 tokens as tokenized by GPT-NeoX. For causal models,
it is recommended to set `max_gen_toks` sufficiently large (e.g. 1024)
to allow a full summary to be generated.
"""
DATASET_NAME = "gov_report"
class SummScreenFD(_SCROLLSSummaryTask):
"""SummScreen: A Dataset for Abstractive Screenplay Summarization
https://arxiv.org/abs/2104.07091
"""
DATASET_NAME = "summ_screen_fd"
class QMSum(_SCROLLSSummaryTask):
"""QMSum: A New Benchmark for Query-based Multi-domain
Meeting Summarization
https://arxiv.org/abs/2104.05938
"""
DATASET_NAME = "qmsum"
def _process_doc(self, doc):
return [_process_doc_prepended_question(doc)]
def doc_to_text(self, doc):
return f"{doc['text']}\n\nQuestion: {doc['question']}\nAnswer:"
def construct_tasks():
return {
"scrolls_qasper": Qasper,
"scrolls_quality": QuALITY,
"scrolls_narrativeqa": NarrativeQA,
"scrolls_contractnli": ContractNLI,
"scrolls_govreport": GovReport,
"scrolls_summscreenfd": SummScreenFD,
"scrolls_qmsum": QMSum
}
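With the registry hookup above, the new SCROLLS tasks can be addressed by name like any other task. A minimal sketch (model, task, and limit chosen only for illustration):

```python
from lm_eval import evaluator, tasks

# The SCROLLS entries registered by construct_tasks().
print([name for name in tasks.TASK_REGISTRY if name.startswith("scrolls_")])

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["scrolls_contractnli"],
    num_fewshot=0,
    limit=8,  # tiny limit, just a smoke test
)
print(results["results"]["scrolls_contractnli"])
```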