Commit 5ba0c5f9 authored by Oleh Shliazhko

Merge remote-tracking branch 'upstream/master' into mmlu_fix

parents c117e787 9d06c953
* @jon-tow @StellaAthena @haileyschoelkopf @lintangsutawika
# Language Model Evaluation Harness
![](https://github.com/EleutherAI/lm-evaluation-harness/workflows/Build/badge.svg)
[![codecov](https://codecov.io/gh/EleutherAI/lm-evaluation-harness/branch/master/graph/badge.svg?token=JSG3O2427J)](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
...@@ -10,7 +7,7 @@ This project provides a unified framework to test generative language models on
Features:
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
...@@ -32,6 +29,12 @@ To install additional multilingual tokenization and text segmentation packages,
pip install -e ".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `auto-gptq` extra:
```bash
pip install -e ".[auto-gptq]"
```
## Basic Usage
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
...@@ -49,12 +52,12 @@ python main.py \
--device cuda:0
```
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:
```bash
python main.py \
--model hf-causal \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0
```
...@@ -114,6 +117,14 @@ python main.py \
--device cuda:0
```
GPTQ quantized models can be loaded by specifying their file name via `,quantized=NAME` (or `,quantized=True` for the default name) in the `model_args` argument:
```bash
python main.py \
--model hf-causal-experimental \
--model_args pretrained=model-name-or-path,quantized=model.safetensors,gptq_use_triton=True \
--tasks hellaswag
```
We support wildcards in task names; for example, you can run all of the machine-translated lambada tasks via `--task lambada_openai_mt_*`.
...@@ -131,7 +142,7 @@ When reporting eval harness results, please also report the version of each task
## Test Set Decontamination
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).
...
...@@ -202,7 +202,7 @@ def construct_requests(self, doc, ctx):
#### What's a `Request`? What's a `doc`?
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers and anything else you think will be needed in order to assess your model for a given task. Keep in mind that the fields of this can be basically whatever you want (you can sort this out in `training_docs` \ `validation_docs` \ `test_docs` if you need to customise things - see above), just remember to be consistent with them throughout the rest of the `Task` you write up.
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text, you'll need to return a `rf.greedy_until` request; otherwise an `rf.loglikelihood` across all labels in a classification task will do.
The function `construct_requests` returns a list or tuple of `Request`s, or a single `Request`. Returning more than one request per `doc` is particularly handy if you are up to something like multi-task learning. The objects this function returns then get consumed one by one and turned into result objects.
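For illustration only, a minimal sketch of `construct_requests` for a classification-style task is shown below; the class name and the `doc["choices"]` field are assumptions for the example, not taken from any task in the harness:
```python
from lm_eval.base import Task, rf

class MyClassificationTask(Task):
    # ...dataset plumbing (training_docs, doc_to_text, etc.) omitted; see above...

    def construct_requests(self, doc, ctx):
        # `ctx` is the fully formatted prompt (description + few-shot examples
        # + this doc's query); request one loglikelihood per candidate label.
        return [rf.loglikelihood(ctx, " " + choice) for choice in doc["choices"]]
```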
```python
...@@ -271,6 +271,19 @@ python main.py \
--num_fewshot K
```
### Checking the Model Outputs
The `write_out.py` script mentioned previously can be used to verify that the prompts look as intended. If you also want to save model outputs, you can pass the `--write_out` flag to `main.py` to dump JSON files with prompts and completions. The output path can be chosen with `--output_base_path`. This is helpful for debugging and for exploring model outputs.
```sh
python main.py \
--model gpt2 \
--model_args device=<device-name> \
--tasks <task-name> \
--num_fewshot K \
--write_out \
--output_base_path <path>
```
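Each task's dump is written to `<task_name>_write_out_info.json` under the chosen output path (see the evaluator changes below) and records the `doc_id`, the rendered `prompt_<i>` strings, the raw `logit_<i>` responses, the reference `truth`, and the per-document metrics. As a rough sketch (the task name and directory here are placeholders), it can be skimmed with a few lines of Python:
```python
import json
from pathlib import Path

# Placeholder path and task name; point this at your --output_base_path.
dump_path = Path("write_out_dumps") / "hellaswag_write_out_info.json"
with dump_path.open(encoding="utf8") as fp:
    details = json.load(fp)

for entry in details[:3]:
    # Keys follow the evaluator code: doc_id, prompt_<i>, logit_<i>, truth, metrics.
    print(entry["doc_id"], entry.get("truth"))
    print(entry.get("prompt_0", "")[:200])
```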
### Running Unit Tests
To run the entire test suite, use:
...
...@@ -119,6 +119,12 @@ class LM(abc.ABC):
class BaseLM(LM):
def __init__(self):
super().__init__()
self.batch_schedule = 1
self.batch_sizes = {}
self.max_batch_size = 512
@property
@abstractmethod
def eot_token_id(self):
...@@ -167,19 +173,48 @@ class BaseLM(LM):
"""
pass
def _detect_batch_size(self, requests=None, pos=0):
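# Probe for the largest batch size that fits in memory: start from
# self.max_batch_size and let accelerate's find_executable_batch_size halve it
# on OOM. When `requests` is given, size the probe to the (truncated) context +
# continuation length of the request at `pos`; otherwise use the full max_length.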
if requests:
_, context_enc, continuation_enc = requests[pos]
max_length = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
else:
max_length = self.max_length
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_length), device=self.device).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
utils.clear_torch_cache()
return batch_size
# subclass must implement properties vocab_size, eot_token_id, max_gen_toks, batch_size, device, max_length.
# TODO: enforce this somehow
def _encode_pair(self, context, continuation):
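# Trailing whitespace is moved from the context onto the continuation before
# encoding, so the context tokens form a clean prefix of the tokens for
# context + continuation (many BPE tokenizers attach a leading space to the
# following token); the continuation tokens are then taken as the remaining
# tail of the jointly encoded string.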
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests):
new_reqs = []
for context, continuation in requests:
if context == "":
# end of text as context
context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(continuation)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
...@@ -193,19 +228,7 @@ class BaseLM(LM):
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
...@@ -258,34 +281,24 @@ class BaseLM(LM):
re_ord = utils.Reorderer(requests, _collate)
reordered_requests = re_ord.get_reordered()
n_reordered_requests = len(reordered_requests)
# automatic (variable) batch size detection for vectorization
# pull longest context sample from request
def _batch_scheduler(pos):
sched = pos // int(n_reordered_requests / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
print(f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size")
self.batch_sizes[sched] = self._detect_batch_size(reordered_requests, pos)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
for chunk in utils.chunks(
tqdm(reordered_requests, disable=disable_tqdm),
n=self.batch_size if self.batch_size != "auto" else override_bs if override_bs is not None else 0,
fn=_batch_scheduler if self.batch_size == "auto" and n_reordered_requests > 0 else None,
):
inps = []
cont_toks_list = []
...@@ -855,6 +868,10 @@ class CachingLM:
lm.set_cache_hook(self.get_cache_hook())
def __getattr__(self, attr):
lm_attr = getattr(self.lm, attr)
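# Pass non-callable attributes (e.g. `batch_sizes`) straight through to the
# underlying LM; only request-style methods are wrapped with caching below.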
if not callable(lm_attr):
return lm_attr
def fn(requests):
res = []
remaining_reqs = []
...
...@@ -16,6 +16,7 @@ def simple_evaluate(
tasks=[],
num_fewshot=0,
batch_size=None,
max_batch_size=None,
device=None,
no_cache=False,
limit=None,
...@@ -23,8 +24,9 @@ def simple_evaluate(
description_dict=None,
check_integrity=False,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM]
...@@ -36,20 +38,26 @@ def simple_evaluate(
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int or str, optional
Batch size for model
:param max_batch_size: int, optional
Maximal batch size to try with automatic batch size detection
:param device: str, optional
PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param no_cache: bool
Whether or not to cache
:param limit: int or float, optional
Limit the number of examples per task (only use this for testing). If <1, limit is a percentage of the total number of examples.
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param description_dict: dict[str, str]
Dictionary of custom task descriptions of the form: `task_name: description`
:param check_integrity: bool
Whether to run the relevant part of the test suite for the tasks
:param write_out: bool
If True, write details about prompts and logits to json for all tasks
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir.
:return
Dictionary of results
"""
...@@ -62,7 +70,7 @@ def simple_evaluate(
if model_args is None:
model_args = ""
lm = lm_eval.models.get_model(model).create_from_arg_string(
model_args, {"batch_size": batch_size, "max_batch_size": max_batch_size, "device": device}
)
else:
assert isinstance(model, lm_eval.base.LM)
...@@ -72,7 +80,7 @@ def simple_evaluate(
lm = lm_eval.base.CachingLM(
lm,
"lm_cache/"
+ (model if isinstance(model, str) else model.model.config._name_or_path)
+ "_"
+ model_args.replace("=", "-").replace(",", "_").replace("/", "-")
+ ".db",
...@@ -91,14 +99,17 @@ def simple_evaluate(
bootstrap_iters=bootstrap_iters,
description_dict=description_dict,
decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out,
output_base_path=output_base_path,
)
# add info about the model and few shot config
results["config"] = {
"model": (model if isinstance(model, str) else model.model.config._name_or_path),
"model_args": model_args,
"num_fewshot": num_fewshot,
"batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else [],
"device": device, "device": device,
"no_cache": no_cache, "no_cache": no_cache,
"limit": limit, "limit": limit,
...@@ -122,6 +133,8 @@ def evaluate( ...@@ -122,6 +133,8 @@ def evaluate(
bootstrap_iters=100000, bootstrap_iters=100000,
description_dict=None, description_dict=None,
decontamination_ngrams_path=None, decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
...@@ -139,6 +152,10 @@ def evaluate(
Number of iterations for bootstrap statistics
:param description_dict: dict[str, str]
Dictionary of custom task descriptions of the form: `task_name: description`
:param write_out: bool
If True, write all prompts, logits and metrics to json for offline analysis
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir
:return
Dictionary of results
"""
...@@ -175,6 +192,7 @@ def evaluate(
# TODO: we need unit tests & sanity checks or something to ensure that the return of `validation_docs` is stable
docs = {}
write_out_info = {}
docs_for_decontamination = collections.defaultdict(list)
...@@ -197,15 +215,20 @@ def evaluate(
rnd = random.Random()
rnd.seed(42)
rnd.shuffle(task_docs)
print(f"Task: {task_name}; number of docs: {len(task_docs)}")
if write_out:
prompt_details = []
description = (
description_dict[task_name]
if description_dict and task_name in description_dict
else ""
)
if limit is not None:
limit = int(len(task_docs) * limit) if limit < 1.0 else int(limit)
for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
if decontaminate and task.should_decontaminate():
docs_for_decontamination[(task_name, task_set)].append(
task.doc_to_decontamination_query(doc)
...@@ -216,6 +239,17 @@ def evaluate(
doc=doc, num_fewshot=num_fewshot, rnd=rnd, description=description
)
reqs = task.construct_requests(doc, ctx)
if write_out:
prompt_details.append({"doc_id": doc_id})
# print the prompt for the first few documents
if doc_id < 1:
print(
f"Task: {task_name}; document {doc_id}; context prompt (starting on next line):\n{ctx}\n(end of prompt on previous line)"
)
print("Requests:", reqs)
if not isinstance(reqs, (list, tuple)):
reqs = [reqs]
for i, req in enumerate(reqs):
...@@ -224,6 +258,14 @@ def evaluate(
# doc_id: unique id that we can get back to a doc using `docs`
requests_origin[req.request_type].append((i, task_name, doc, doc_id))
if write_out:
prompt_details[-1][f"prompt_{i}"] = "".join(
(map(lambda x: "".join(x), req.args))
)
if write_out:
write_out_info[task_name] = prompt_details
# Compare all tasks/sets at once to ensure a single training set scan
if decontaminate:
from lm_eval.decontamination.decontaminate import get_train_overlap
...@@ -252,6 +294,18 @@ def evaluate(
for resp, (i, task_name, doc, doc_id) in zip(resps, requests_origin[reqtype]):
process_res_queue[(task_name, doc_id)].append((i, resp))
if write_out:
write_out_info[task_name][doc_id][f"logit_{i}"] = resp
task = task_dict[task_name]
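# Record the reference answer alongside the prompt/response: the gold index
# for multiple-choice tasks, Winogrande's numeric answer, and otherwise the
# task's doc_to_target string.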
if isinstance(task, lm_eval.base.MultipleChoiceTask):
write_out_info[task_name][doc_id]["truth"] = doc["gold"]
elif isinstance(task, lm_eval.tasks.winogrande.Winogrande):
write_out_info[task_name][doc_id]["truth"] = task.answer_to_num[
doc["answer"]
]
else:
write_out_info[task_name][doc_id]["truth"] = task.doc_to_target(doc)
vals = collections.defaultdict(list)
# unpack results and sort back in order and return control to Task
...@@ -266,6 +320,9 @@ def evaluate(
for metric, value in metrics.items():
vals[(task_name, metric)].append(value)
if write_out:
write_out_info[task_name][doc_id][metric] = str(value)
# Re-use the evaluation for the decontaminated set by just ignoring the overlaps
if decontaminate and task_name in overlaps:
if doc_id not in overlaps[task_name]:
...@@ -294,6 +351,28 @@ def evaluate(
if stderr is not None:
results[task_name][metric + "_stderr"] = stderr(items)
if write_out:
import json
import pathlib
output_base_path = (
pathlib.Path(output_base_path)
if output_base_path is not None
else pathlib.Path(".")
)
try:
output_base_path.mkdir(parents=True, exist_ok=False)
except FileExistsError:
pass
for task_name, _ in task_dict_items:
with open(
output_base_path.joinpath(f"{task_name}_write_out_info.json"),
"w",
encoding="utf8",
) as fp:
json.dump(write_out_info[task_name], fp, indent=4, ensure_ascii=False)
return {"results": dict(results), "versions": dict(versions)} return {"results": dict(results), "versions": dict(versions)}
......
from . import gpt2
from . import gpt3
from . import anthropic_llms
from . import huggingface
from . import textsynth
from . import dummy
...@@ -11,6 +12,7 @@ MODEL_REGISTRY = {
"hf-seq2seq": huggingface.AutoSeq2SeqLM,
"gpt2": gpt2.GPT2LM,
"gpt3": gpt3.GPT3LM,
"anthropic": anthropic_llms.AnthropicLM,
"textsynth": textsynth.TextSynthLM, "textsynth": textsynth.TextSynthLM,
"dummy": dummy.DummyLM, "dummy": dummy.DummyLM,
} }
......
import os
from lm_eval.base import BaseLM
from tqdm import tqdm
import time
def anthropic_completion(client, model, prompt, max_tokens_to_sample, temperature, stop):
"""Query Anthropic API for completion.
Retry with back-off until they respond
"""
import anthropic
backoff_time = 3
while True:
try:
response = client.completion(
prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
model=model,
# NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
# (e.g. gsm8k's ":") may truncate a lot of the input.
stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
max_tokens_to_sample=max_tokens_to_sample,
temperature=temperature,
)
print(response)
return response["completion"]
except RuntimeError:
# TODO: I don't actually know what error Anthropic raises when it times out
# So err update this error when we find out.
import traceback
traceback.print_exc()
time.sleep(backoff_time)
backoff_time *= 1.5
class AnthropicLM(BaseLM):
REQ_CHUNK_SIZE = 20
def __init__(self, model):
"""
:param model: str
Anthropic model e.g. claude-instant-v1
"""
super().__init__()
import anthropic
self.model = model
self.client = anthropic.Client(os.environ['ANTHROPIC_API_KEY'])
@property
def eot_token_id(self):
raise NotImplementedError("No idea about anthropic tokenization.")
@property
def max_length(self):
return 2048
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
@property
def device(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def tok_encode(self, string: str):
raise NotImplementedError("No idea about anthropic tokenization.")
def tok_decode(self, tokens):
raise NotImplementedError("No idea about anthropic tokenization.")
def _loglikelihood_tokens(self, requests, disable_tqdm=False):
raise NotImplementedError("No support for logits.")
def greedy_until(self, requests):
if not requests:
return []
res = []
for request in tqdm(requests):
inp = request[0]
request_args = request[1]
until = request_args["until"]
response = anthropic_completion(
client=self.client,
model=self.model,
prompt=inp,
max_tokens_to_sample=self.max_gen_toks,
temperature=0.0,
stop=until,
)
res.append(response)
return res
def _model_call(self, inps):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until
raise NotImplementedError()
import torch
import transformers
from typing import Optional, Union
from lm_eval.base import BaseLM
def _get_dtype(
dtype: Union[str, torch.dtype]
) -> torch.dtype:
"""Converts `dtype` from `str` to torch.dtype when possible. Does not use an instantiated HF AutoConfig"""
if isinstance(dtype, str) and dtype != "auto":
# Convert `str` args torch dtype: `float16` -> `torch.float16`
_torch_dtype = getattr(torch, dtype)
else:
_torch_dtype = dtype
return _torch_dtype
class HFLM(BaseLM):
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
device="cuda",
...@@ -14,8 +29,10 @@ class HFLM(BaseLM):
subfolder=None,
tokenizer=None,
batch_size=1,
max_length=None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
dtype: Optional[Union[str, torch.dtype]]="auto",
):
super().__init__()
...@@ -46,10 +63,14 @@ class HFLM(BaseLM):
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).eval()
if not load_in_8bit:
try:
self.gpt2.to(self.device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision,
...@@ -58,22 +79,14 @@ class HFLM(BaseLM):
self.vocab_size = self.tokenizer.vocab_size
# setup for automatic batch size detection
if batch_size == "auto":
self.batch_size_per_gpu = batch_size
else:
self.batch_size_per_gpu = int(batch_size)
self._max_length = max_length
@property
def eot_token_id(self):
# we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
...@@ -81,11 +94,18 @@ class HFLM(BaseLM):
@property
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.gpt2.config, attr):
return getattr(self.gpt2.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
...
...@@ -3,11 +3,12 @@ import torch
import torch.nn.functional as F
import transformers
import peft
from peft import __version__ as PEFT_VERSION
from pathlib import Path
from typing import List, Mapping, NewType, Optional, Tuple, Union
from tqdm import tqdm
from transformers import BatchEncoding
from lm_eval import utils
from lm_eval.base import BaseLM
...@@ -69,10 +70,12 @@ class HuggingFaceAutoLM(BaseLM): ...@@ -69,10 +70,12 @@ class HuggingFaceAutoLM(BaseLM):
def __init__(
self,
pretrained: str,
quantized: Optional[Union[bool, str]] = False,
tokenizer: Optional[str] = None,
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
max_batch_size: Optional[int] = 512,
max_gen_toks: Optional[int] = 256,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
...@@ -85,7 +88,9 @@ class HuggingFaceAutoLM(BaseLM):
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
gptq_use_triton: Optional[bool] = False,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
Args:
...@@ -93,6 +98,9 @@ class HuggingFaceAutoLM(BaseLM):
The HuggingFace Hub model ID name or the path to a pre-trained
model to load. This is effectively the `pretrained_model_name_or_path`
argument of `from_pretrained` in the HuggingFace `transformers` API.
quantized (str or bool, optional, defaults to False):
File name of a GPTQ quantized model to load. Set to `True` to use the
default name of the quantized model.
add_special_tokens (bool, optional, defaults to True):
Whether to add special tokens to the input sequences. If `None`, the
default value will be set to `True` for seq2seq models (e.g. T5) and
...@@ -136,9 +144,14 @@ class HuggingFaceAutoLM(BaseLM):
`adapter_model.bin`. Compatible with [PEFT](https://github.com/huggingface/peft)
load_in_8bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-8bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-8bit
load_in_4bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-4bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-4bit
trust_remote_code (bool, optional, defaults to False):
If True, will trust the remote code when loading the model.
gptq_use_triton (bool, optional, defaults to False):
Use Triton for GPTQ inference.
""" """
super().__init__() super().__init__()
...@@ -159,10 +172,13 @@ class HuggingFaceAutoLM(BaseLM): ...@@ -159,10 +172,13 @@ class HuggingFaceAutoLM(BaseLM):
), "Evaluating causal models with `add_special_tokens=True` is currently not supported." ), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection # setup for automatic batch size detection
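# `batch_size` may be "auto" or "auto:N"; the optional N (the batch schedule)
# controls how many times the batch size is re-detected over the sorted
# requests (see `_batch_scheduler` in base.py).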
if batch_size == "auto": if str(batch_size).startswith("auto"):
self._batch_size = batch_size batch_size = batch_size.split(":")
self._batch_size = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self._batch_size = int(batch_size)
self.max_batch_size = max_batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
...@@ -189,13 +205,16 @@ class HuggingFaceAutoLM(BaseLM):
max_cpu_memory,
offload_folder,
)
self.model = self._create_auto_model(
pretrained=pretrained,
quantized=quantized,
trust_remote_code=trust_remote_code,
revision=revision,
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
gptq_use_triton=gptq_use_triton,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
**model_kwargs,
)
# note: peft_path can be different than pretrained model path
...@@ -205,8 +224,7 @@ class HuggingFaceAutoLM(BaseLM):
peft=peft,
revision=revision,
subfolder=subfolder,
load_in_4bit=load_in_4bit,
)
self.model.eval()
torch.set_grad_enabled(False)
...@@ -217,23 +235,35 @@ class HuggingFaceAutoLM(BaseLM):
# the user specified one so we force `self._device` to be the same as
# `lm_head`'s.
self._device = self.model.hf_device_map["lm_head"]
if not use_accelerate and not (load_in_4bit or load_in_8bit):
try:
self.model.to(self._device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
def _create_auto_model(
self,
*,
pretrained: str,
quantized: Optional[Union[bool, str]] = False,
revision: str,
subfolder: str,
device_map: Optional[Union[str, _DeviceMapping]] = None,
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
gptq_use_triton: Optional[bool] = False,
) -> transformers.AutoModel:
"""Returns a pre-trained pytorch model from a pre-trained model configuration."""
if not quantized:
if load_in_4bit:
assert transformers.__version__ >= "4.30.0", "load_in_4bit requires transformers >= 4.30.0"
model_kwargs = {}
if transformers.__version__ >= "4.30.0":
model_kwargs["load_in_4bit"] = load_in_4bit
model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
...@@ -243,6 +273,19 @@ class HuggingFaceAutoLM(BaseLM):
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
**model_kwargs,
)
else:
from auto_gptq import AutoGPTQForCausalLM
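# `quantized` is either True (use AutoGPTQ's default basename and assume
# safetensors) or a checkpoint file name such as "model.safetensors", from
# which the basename and weight format are derived below.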
model = AutoGPTQForCausalLM.from_quantized(
pretrained,
model_basename=None if quantized == True else Path(quantized).stem,
device_map=device_map,
max_memory=max_memory,
trust_remote_code=trust_remote_code,
use_safetensors=True if quantized == True else quantized.endswith('.safetensors'),
use_triton=gptq_use_triton,
warmup_triton=gptq_use_triton,
) )
return model
...@@ -253,23 +296,14 @@ class HuggingFaceAutoLM(BaseLM):
peft: str,
revision: str,
subfolder: str,
load_in_4bit: Optional[bool] = False,
):
if load_in_4bit:
assert PEFT_VERSION >= "0.4.0", "load_in_4bit requires peft >= 0.4.0"
model = self.AUTO_PEFT_CLASS.from_pretrained(
model,
peft,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
return model
...@@ -326,7 +360,7 @@ class HuggingFaceAutoLM(BaseLM):
"""Return the maximum sequence length of the model.
NOTE: Different model configurations have different max sequence length
attribute names.
- n_positions: (CTRLConfig, T5Config)
- max_position_embeddings: (BartConfig, RoFormerConfig)
- n_ctx: (GPT2Config)
NOTE: For relative position encoded models you should specify the max
...@@ -383,19 +417,7 @@ class HuggingFaceAutoLM(BaseLM):
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
...@@ -518,15 +540,6 @@ class AutoSeq2SeqLM(HuggingFaceAutoLM):
AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
AUTO_PEFT_CLASS = peft.PeftModel
def loglikelihood(
self, requests: List[Tuple[str, str]]
) -> List[Tuple[float, bool]]:
...
...@@ -52,6 +52,7 @@ from . import gsm8k
from . import storycloze
from . import toxigen
from . import crowspairs
from . import json
from . import xcopa
from . import bigbench
from . import xstorycloze
...@@ -329,9 +330,42 @@ TASK_REGISTRY = {
ALL_TASKS = sorted(list(TASK_REGISTRY))
_EXAMPLE_JSON_PATH = "split:key:/absolute/path/to/data.json"
def add_json_task(task_name):
"""Add a JSON perplexity task if the given task name matches the
JSON task specification.
See `json.JsonPerplexity`.
"""
if not task_name.startswith("json"):
return
def create_json_task():
splits = task_name.split("=", 1)
if len(splits) != 2 or not splits[1]:
raise ValueError(
"json tasks need a path argument pointing to the local "
"dataset, specified like this: json="
+ _EXAMPLE_JSON_PATH
+ ' (if there are no splits, use "train")'
)
json_path = splits[1]
if json_path == _EXAMPLE_JSON_PATH:
raise ValueError(
"please do not copy the example path directly, but substitute "
"it with a path to your local dataset"
)
return lambda: json.JsonPerplexity(json_path)
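# The registry normally stores Task classes (zero-argument callables); here a
# factory that builds the JSON perplexity task with the parsed path is
# registered under the full "json=..." task name.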
TASK_REGISTRY[task_name] = create_json_task()
def get_task(task_name):
try:
add_json_task(task_name)
return TASK_REGISTRY[task_name]
except KeyError:
print("Available tasks:")
...
import datasets
from lm_eval.base import PerplexityTask
from lm_eval.utils import escaped_split
class JsonPerplexity(PerplexityTask):
VERSION = 0
DATASET_NAME = "json"
def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
"""
:param data_dir: str
Use this to specify the path to manually downloaded JSON test data.
This also needs to include the split key and text key for the data
in the following format:
```
split:text:/absolute/path/to/data.json
```
If you do not have splits inside the JSON file, it should be "train".
Colons in the split or text key can be escaped by backslashes.
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self._split, self._key, data_file = escaped_split(data_dir, ":", 2)
self.load(data_file)
self._training_docs = None
self._fewshot_docs = None
def download(self, data_dir=None, cache_dir=None, download_mode=None):
raise TypeError("cannot download an arbitrary JSON dataset")
def load(self, data_file):
self.dataset = datasets.load_dataset("json", data_files=data_file)
def has_validation_docs(self):
return False
def has_test_docs(self):
return True
def test_docs(self):
return map(self._process_doc, self.dataset[self._split])
def _process_doc(self, doc):
return doc[self._key]
...@@ -26,6 +26,7 @@ Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
"""
from lm_eval.base import Task, rf
from lm_eval.metrics import mean
from lm_eval import utils
_CITATION = """ _CITATION = """
@inproceedings{yang-etal-2019-paws, @inproceedings{yang-etal-2019-paws,
...@@ -85,6 +86,11 @@ class PAWSXBase(Task): ...@@ -85,6 +86,11 @@ class PAWSXBase(Task):
def doc_to_target(self, doc): def doc_to_target(self, doc):
return " " + [self.YES, self.NO][doc["label"]] return " " + [self.YES, self.NO][doc["label"]]
def doc_to_fewshot_prompt(self, doc):
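# Few-shot examples show the completed prompt: the "[MASK]" placeholder in the
# question is replaced with the gold label (doc_to_target without its leading
# space).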
prompt = self.doc_to_text(doc)
return prompt.replace("[MASK]", self.doc_to_target(doc)[1:])
def construct_requests(self, doc, ctx):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
...@@ -136,6 +142,76 @@ class PAWSXBase(Task):
def higher_is_better(self):
return {"acc": True}
@utils.positional_deprecated
def fewshot_context(
self, doc, num_fewshot, provide_description=None, rnd=None, description=None
):
"""Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example.
:param doc: str
The document as returned from training_docs, validation_docs, or test_docs.
:param num_fewshot: int
The number of fewshot examples to provide in the returned context string.
:param provide_description: bool
Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method
:param rnd: random.Random
The pseudo-random number generator used to randomly sample examples.
WARNING: This is currently a required arg although it's optionalized with a default `None`.
:param description: str
The task's description that will be prepended to the fewshot examples.
:returns: str
The fewshot context.
"""
assert (
rnd is not None
), "A `random.Random` generator argument must be provided to `rnd`"
assert not provide_description, (
"The `provide_description` arg will be removed in future versions. To prepend "
"a custom description to the context, supply the corresponding string via the "
"`description` arg."
)
if provide_description is not None:
# nudge people to not specify it at all
print(
"WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict"
)
description = description + "\n\n" if description else ""
if num_fewshot == 0:
labeled_examples = ""
else:
# for sets with no training docs, draw from other set *but ensure no overlap with current doc*
if self.has_training_docs():
fewshotex = self.fewshot_examples(k=num_fewshot, rnd=rnd)
else:
if self._fewshot_docs is None:
self._fewshot_docs = list(
self.validation_docs()
if self.has_validation_docs()
else self.test_docs()
)
fewshotex = rnd.sample(self._fewshot_docs, num_fewshot + 1)
# get rid of the doc that's the one we're evaluating, if it's in the fewshot
fewshotex = [x for x in fewshotex if x != doc][:num_fewshot]
labeled_examples = (
"\n\n".join(
[
# self.doc_to_text(doc) + self.doc_to_target(doc)
self.doc_to_fewshot_prompt(doc)
for doc in fewshotex
]
)
+ "\n\n"
)
example = self.doc_to_text(doc)
return description + labeled_examples + example
class PAWSX_en(PAWSXBase):
DATASET_NAME = "en"
...
...@@ -18,6 +18,7 @@ Homepage: https://github.com/facebookresearch/XNLI
import numpy as np
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
from lm_eval import utils
_CITATIONS = """ _CITATIONS = """
@InProceedings{conneau2018xnli, @InProceedings{conneau2018xnli,
...@@ -89,6 +90,11 @@ class XNLIBase(Task): ...@@ -89,6 +90,11 @@ class XNLIBase(Task):
] ]
) )
def doc_to_fewshot_prompt(self, doc):
prompt = self.doc_to_text(doc)
return prompt.replace("[MASK]", self.doc_to_target(doc)[1:])
def construct_requests(self, doc, ctx):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
...@@ -138,6 +144,76 @@ class XNLIBase(Task):
"""
return {"acc": True}
@utils.positional_deprecated
def fewshot_context(
self, doc, num_fewshot, provide_description=None, rnd=None, description=None
):
"""Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example.
:param doc: str
The document as returned from training_docs, validation_docs, or test_docs.
:param num_fewshot: int
The number of fewshot examples to provide in the returned context string.
:param provide_description: bool
            Not implemented. This option is deprecated and will be removed in a future version in favor of a different description-providing method.
:param rnd: random.Random
The pseudo-random number generator used to randomly sample examples.
            WARNING: This is currently a required argument, even though it defaults to `None`.
:param description: str
The task's description that will be prepended to the fewshot examples.
:returns: str
The fewshot context.
"""
assert (
rnd is not None
), "A `random.Random` generator argument must be provided to `rnd`"
assert not provide_description, (
"The `provide_description` arg will be removed in future versions. To prepend "
"a custom description to the context, supply the corresponding string via the "
"`description` arg."
)
if provide_description is not None:
# nudge people to not specify it at all
print(
"WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict"
)
description = description + "\n\n" if description else ""
if num_fewshot == 0:
labeled_examples = ""
else:
# for sets with no training docs, draw from other set *but ensure no overlap with current doc*
if self.has_training_docs():
fewshotex = self.fewshot_examples(k=num_fewshot, rnd=rnd)
else:
if self._fewshot_docs is None:
self._fewshot_docs = list(
self.validation_docs()
if self.has_validation_docs()
else self.test_docs()
)
fewshotex = rnd.sample(self._fewshot_docs, num_fewshot + 1)
# get rid of the doc that's the one we're evaluating, if it's in the fewshot
fewshotex = [x for x in fewshotex if x != doc][:num_fewshot]
labeled_examples = (
"\n\n".join(
[
# self.doc_to_text(doc) + self.doc_to_target(doc)
self.doc_to_fewshot_prompt(doc)
for doc in fewshotex
]
)
+ "\n\n"
)
example = self.doc_to_text(doc)
return description + labeled_examples + example
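For orientation, here is a minimal sketch of how this override is exercised. It is illustrative only (not part of the diff) and assumes the English task is registered as `xnli_en` and that `doc` is one of the dicts yielded by `validation_docs()`:

```python
import random

from lm_eval import tasks

# Illustrative only: build a 2-shot XNLI context for a single validation
# document. `rnd` must be passed explicitly; the evaluator normally seeds it
# once per run. Instantiating the task downloads the XNLI data on first use.
task = tasks.get_task("xnli_en")()
doc = next(iter(task.validation_docs()))
ctx = task.fewshot_context(
    doc=doc,
    num_fewshot=2,
    rnd=random.Random(1234),
    description="",  # a non-empty description is prepended, followed by a blank line
)
print(ctx)
```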
class XNLI_en(XNLIBase):  # English
    DATASET_NAME = "en"
...
...@@ -5,8 +5,10 @@ import collections
import functools
import inspect
import sys
import fnmatch
from typing import List, Union
import gc
import torch
from omegaconf import OmegaConf
...@@ -21,6 +23,29 @@ def sh(x):
        raise ExitCodeError()
def escaped_split(text, sep_char, maxsplit=-1):
"""Split text into a list on occurrences of the given separation
character `sep_char`. The separation character may be escaped by a
backslash to avoid splitting at that location.
The separation character must be a string of size 1.
If `maxsplit` is given, at most `maxsplit` splits are done (thus,
the list will have at most `maxsplit + 1` elements). If `maxsplit`
is not specified or less than 0, then there is no limit on the
number of splits (all possible splits are made).
"""
assert (
len(sep_char) == 1
), "separation string must be a single character for escaped splitting"
if maxsplit == 0:
return text
maxsplit = max(0, maxsplit)
return re.split(r"(?<!\\)" + sep_char, text, maxsplit)
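A quick illustrative check of the escaping behaviour (not part of the diff; it assumes `re` is already imported at the top of `utils.py`, which `escaped_split` relies on):

```python
# An unescaped separator splits as usual ...
assert escaped_split("a:b:c", ":") == ["a", "b", "c"]

# ... while a backslash-escaped separator is kept intact.
assert escaped_split(r"key\:pair:value", ":") == ["key\\:pair", "value"]

# `maxsplit` caps the number of splits.
assert escaped_split("a:b:c", ":", maxsplit=1) == ["a", "b:c"]
```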
def simple_parse_args_string(args_string):
    """
    Parses something like
...@@ -40,11 +65,11 @@ def join_iters(iters):
        yield from iter
def chunks(iter, n=0, fn=None):
    arr = []
    for i, x in enumerate(iter):
        arr.append(x)
        if len(arr) == (fn(i) if fn else n):
            yield arr
            arr = []
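The reworked `chunks` keeps its fixed-size behaviour when only `n` is given, while the new `fn` hook lets the caller pick a chunk length from the running element index (presumably to serve the `--batch_size auto` path). An illustrative check; handling of a trailing partial chunk lies outside the lines shown here:

```python
# Fixed-size chunks of 3, as before:
assert list(chunks(range(9), n=3)) == [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Variable-size chunks: `fn` maps the running element index to the length of
# the chunk currently being filled.
assert list(chunks(range(8), fn=lambda i: 2 if i < 2 else 3)) == [
    [0, 1],
    [2, 3, 4],
    [5, 6, 7],
]
```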
...@@ -61,6 +86,42 @@ def group(arr, fn):
    return list(res.values())
def _is_json_task(task_name):
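    # Task selections of the form "json" or "json=<spec>" are treated specially: they bypass the wildcard matching used for registered task names.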
return task_name == "json" or task_name.startswith("json=")
class MultiChoice:
def __init__(self, choices):
self.choices = choices
# Simple wildcard support (linux filename patterns)
def __contains__(self, values):
for value in values.split(","):
if len(fnmatch.filter(self.choices, value)) == 0 and not _is_json_task(
value
):
return False
return True
def __iter__(self):
for choice in self.choices:
yield choice
# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
task_names = set()
for pattern in patterns:
if _is_json_task(pattern):
task_names.add(pattern)
for matching in fnmatch.filter(source_list, pattern):
task_names.add(matching)
return sorted(list(task_names))
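These helpers move here from `main.py` and gain the `json=` escape hatch. An illustrative check, with a hypothetical task list standing in for `tasks.ALL_TASKS`:

```python
# Hypothetical stand-in for tasks.ALL_TASKS:
all_tasks = ["hellaswag", "pawsx_en", "xnli_en", "xnli_fr"]

# argparse-style membership check: every comma-separated value must match a
# registered task name by wildcard, or be a "json=" task spec.
assert "xnli_*,hellaswag" in MultiChoice(all_tasks)
assert "json=./my_task.json" in MultiChoice(all_tasks)  # hypothetical path

# Expansion of the same patterns into concrete, sorted task names:
assert pattern_match(["xnli_*", "hellaswag"], all_tasks) == [
    "hellaswag",
    "xnli_en",
    "xnli_fr",
]
```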
def general_detokenize(string):
    string = string.replace(" n't", "n't")
    string = string.replace(" )", ")")
...@@ -223,3 +284,8 @@ def run_task_tests(task_list: List[str]):
        raise ValueError(
            f"Not all tests for the specified tasks ({task_list}) ran successfully! Error code: {pytest_return_val}"
        )
def clear_torch_cache():
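    # Drop unreferenced Python objects first, then release PyTorch's cached CUDA blocks back to the driver.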
gc.collect()
torch.cuda.empty_cache()
import argparse
import json
import logging
import os
from lm_eval import tasks, evaluator, utils
logging.getLogger("openai").setLevel(logging.WARNING)
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--model_args", default="")
    parser.add_argument("--tasks", default=None, choices=utils.MultiChoice(tasks.ALL_TASKS))
    parser.add_argument("--provide_description", action="store_true")
    parser.add_argument("--num_fewshot", type=int, default=0)
    parser.add_argument("--batch_size", type=str, default=None)
    parser.add_argument("--max_batch_size", type=int, default=None,
                        help="Maximal batch size to try with --batch_size auto")
    parser.add_argument("--device", type=str, default=None)
    parser.add_argument("--output_path", default=None)
    parser.add_argument("--limit", type=float, default=None,
                        help="Limit the number of examples per task. "
                             "If <1, limit is a percentage of the total number of examples.")
    parser.add_argument("--data_sampling", type=float, default=None)
    parser.add_argument("--no_cache", action="store_true")
    parser.add_argument("--decontamination_ngrams_path", default=None)
    parser.add_argument("--description_dict_path", default=None)
    parser.add_argument("--check_integrity", action="store_true")
    parser.add_argument("--write_out", action="store_true", default=False)
    parser.add_argument("--output_base_path", type=str, default=None)
    return parser.parse_args()
def main():
    args = parse_args()
...@@ -67,7 +47,7 @@ def main():
    if args.tasks is None:
        task_names = tasks.ALL_TASKS
    else:
        task_names = utils.pattern_match(args.tasks.split(","), tasks.ALL_TASKS)
    print(f"Selected Tasks: {task_names}")
...@@ -82,24 +62,29 @@ def main():
        tasks=task_names,
        num_fewshot=args.num_fewshot,
        batch_size=args.batch_size,
        max_batch_size=args.max_batch_size,
        device=args.device,
        no_cache=args.no_cache,
        limit=args.limit,
        description_dict=description_dict,
        decontamination_ngrams_path=args.decontamination_ngrams_path,
        check_integrity=args.check_integrity,
        write_out=args.write_out,
        output_base_path=args.output_base_path,
    )
    dumped = json.dumps(results, indent=2)
    print(dumped)
    if args.output_path:
        os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
        with open(args.output_path, "w") as f:
            f.write(dumped)
    batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
    print(
        f"{args.model} ({args.model_args}), limit: {args.limit}, provide_description: {args.provide_description}, "
        f"num_fewshot: {args.num_fewshot}, batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
    )
    print(evaluator.make_table(results))
...
# bloom-1b1
## bloom-1b1_common_sense_reasoning_0-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |23.63|± | 1.24|
| | |acc_norm|25.68|± | 1.28|
|arc_easy | 0|acc |51.47|± | 1.03|
| | |acc_norm|45.45|± | 1.02|
|boolq | 1|acc |59.08|± | 0.86|
|copa | 0|acc |68.00|± | 4.69|
|hellaswag | 0|acc |34.63|± | 0.47|
| | |acc_norm|41.77|± | 0.49|
|mc_taco | 0|em |14.49| | |
| | |f1 |32.43| | |
|openbookqa | 0|acc |19.60|± | 1.78|
| | |acc_norm|29.40|± | 2.04|
|piqa | 0|acc |67.14|± | 1.10|
| | |acc_norm|67.14|± | 1.10|
|prost | 0|acc |23.41|± | 0.31|
| | |acc_norm|30.50|± | 0.34|
|swag | 0|acc |43.43|± | 0.35|
| | |acc_norm|58.28|± | 0.35|
|winogrande | 0|acc |54.93|± | 1.40|
|wsc273 | 0|acc |68.50|± | 2.82|
## bloom-1b1_gsm8k_8-shot.json
|Task |Version|Metric|Value| |Stderr|
|-----|------:|------|----:|---|-----:|
|gsm8k| 0|acc | 0.83|± | 0.25|
## bloom-1b1_mathematical_reasoning_few_shot_5-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop | 1|em | 1.38|± | 0.12|
| | |f1 | 4.01|± | 0.15|
|gsm8k | 0|acc | 0.00|± | 0.00|
|math_algebra | 1|acc | 0.00|± | 0.00|
|math_counting_and_prob | 1|acc | 0.21|± | 0.21|
|math_geometry | 1|acc | 0.21|± | 0.21|
|math_intermediate_algebra| 1|acc | 0.00|± | 0.00|
|math_num_theory | 1|acc | 0.19|± | 0.19|
|math_prealgebra | 1|acc | 0.11|± | 0.11|
|math_precalc | 1|acc | 0.00|± | 0.00|
|mathqa | 0|acc |23.55|± | 0.78|
| | |acc_norm|23.62|± | 0.78|
## bloom-1b1_pawsx_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|pawsx_de| 0|acc |46.95|± | 1.12|
|pawsx_en| 0|acc |52.45|± | 1.12|
|pawsx_es| 0|acc |51.50|± | 1.12|
|pawsx_fr| 0|acc |46.15|± | 1.11|
|pawsx_ja| 0|acc |48.40|± | 1.12|
|pawsx_ko| 0|acc |49.90|± | 1.12|
|pawsx_zh| 0|acc |48.95|± | 1.12|
## bloom-1b1_question_answering_0-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|------------|----:|---|-----:|
|headqa_en | 0|acc |26.44|± | 0.84|
| | |acc_norm |30.49|± | 0.88|
|headqa_es | 0|acc |24.43|± | 0.82|
| | |acc_norm |28.30|± | 0.86|
|logiqa | 0|acc |18.89|± | 1.54|
| | |acc_norm |25.65|± | 1.71|
|squad2 | 1|exact | 4.17| | |
| | |f1 | 6.60| | |
| | |HasAns_exact| 2.19| | |
| | |HasAns_f1 | 7.05| | |
| | |NoAns_exact | 6.14| | |
| | |NoAns_f1 | 6.14| | |
| | |best_exact |50.07| | |
| | |best_f1 |50.07| | |
|triviaqa | 1|acc | 2.68|± | 0.15|
|truthfulqa_mc| 1|mc1 |25.34|± | 1.52|
| | |mc2 |41.80|± | 1.46|
|webqs | 0|acc | 1.38|± | 0.26|
## bloom-1b1_reading_comprehension_0-shot.json
|Task|Version|Metric|Value| |Stderr|
|----|------:|------|----:|---|-----:|
|coqa| 1|f1 |45.57|± | 1.88|
| | |em |32.98|± | 1.95|
|drop| 1|em | 3.31|± | 0.18|
| | |f1 | 8.63|± | 0.22|
|race| 1|acc |32.63|± | 1.45|
## bloom-1b1_xcopa_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|xcopa_et| 0|acc | 50.6|± | 2.24|
|xcopa_ht| 0|acc | 53.0|± | 2.23|
|xcopa_id| 0|acc | 64.8|± | 2.14|
|xcopa_it| 0|acc | 50.8|± | 2.24|
|xcopa_qu| 0|acc | 51.2|± | 2.24|
|xcopa_sw| 0|acc | 54.4|± | 2.23|
|xcopa_ta| 0|acc | 57.0|± | 2.22|
|xcopa_th| 0|acc | 53.2|± | 2.23|
|xcopa_tr| 0|acc | 53.0|± | 2.23|
|xcopa_vi| 0|acc | 62.4|± | 2.17|
|xcopa_zh| 0|acc | 59.4|± | 2.20|
## bloom-1b1_xnli_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|-------|------:|------|----:|---|-----:|
|xnli_ar| 0|acc |33.93|± | 0.67|
|xnli_bg| 0|acc |34.13|± | 0.67|
|xnli_de| 0|acc |39.64|± | 0.69|
|xnli_el| 0|acc |34.03|± | 0.67|
|xnli_en| 0|acc |51.48|± | 0.71|
|xnli_es| 0|acc |47.98|± | 0.71|
|xnli_fr| 0|acc |47.15|± | 0.71|
|xnli_hi| 0|acc |42.32|± | 0.70|
|xnli_ru| 0|acc |40.46|± | 0.69|
|xnli_sw| 0|acc |35.29|± | 0.68|
|xnli_th| 0|acc |33.75|± | 0.67|
|xnli_tr| 0|acc |34.79|± | 0.67|
|xnli_ur| 0|acc |37.33|± | 0.68|
|xnli_vi| 0|acc |44.45|± | 0.70|
|xnli_zh| 0|acc |36.23|± | 0.68|
## bloom-1b1_xstory_cloze_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|---------------|------:|------|----:|---|-----:|
|xstory_cloze_ar| 0|acc |52.88|± | 1.28|
|xstory_cloze_en| 0|acc |62.54|± | 1.25|
|xstory_cloze_es| 0|acc |58.31|± | 1.27|
|xstory_cloze_eu| 0|acc |54.33|± | 1.28|
|xstory_cloze_hi| 0|acc |55.53|± | 1.28|
|xstory_cloze_id| 0|acc |57.91|± | 1.27|
|xstory_cloze_my| 0|acc |46.19|± | 1.28|
|xstory_cloze_ru| 0|acc |48.25|± | 1.29|
|xstory_cloze_sw| 0|acc |50.56|± | 1.29|
|xstory_cloze_te| 0|acc |56.39|± | 1.28|
|xstory_cloze_zh| 0|acc |58.04|± | 1.27|
## bloom-1b1_xwinograd_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|------------|------:|------|----:|---|-----:|
|xwinograd_en| 0|acc |69.98|± | 0.95|
|xwinograd_fr| 0|acc |66.27|± | 5.22|
|xwinograd_jp| 0|acc |52.87|± | 1.61|
|xwinograd_pt| 0|acc |63.12|± | 2.98|
|xwinograd_ru| 0|acc |54.29|± | 2.81|
|xwinograd_zh| 0|acc |69.25|± | 2.06|
{
"results": {
"boolq": {
"acc": 0.5908256880733945,
"acc_stderr": 0.008599563442397352
},
"arc_easy": {
"acc": 0.5147306397306397,
"acc_stderr": 0.010255329977562096,
"acc_norm": 0.45454545454545453,
"acc_norm_stderr": 0.010217299762709435
},
"openbookqa": {
"acc": 0.196,
"acc_stderr": 0.017770751227744862,
"acc_norm": 0.294,
"acc_norm_stderr": 0.020395095484936614
},
"hellaswag": {
"acc": 0.3463453495319657,
"acc_stderr": 0.004748324319714264,
"acc_norm": 0.4177454690300737,
"acc_norm_stderr": 0.004921798492608764
},
"swag": {
"acc": 0.43431970408877335,
"acc_stderr": 0.0035044592489844794,
"acc_norm": 0.5828251524542637,
"acc_norm_stderr": 0.0034862531772295617
},
"arc_challenge": {
"acc": 0.2363481228668942,
"acc_stderr": 0.012414960524301834,
"acc_norm": 0.2568259385665529,
"acc_norm_stderr": 0.0127669237941168
},
"mc_taco": {
"em": 0.1448948948948949,
"f1": 0.32425976796237205
},
"wsc273": {
"acc": 0.684981684981685,
"acc_stderr": 0.028165854394193602
},
"winogrande": {
"acc": 0.5493291239147593,
"acc_stderr": 0.013983928869040239
},
"prost": {
"acc": 0.23409479077711356,
"acc_stderr": 0.003093545711826552,
"acc_norm": 0.3049743808710504,
"acc_norm_stderr": 0.003363606918420179
},
"copa": {
"acc": 0.68,
"acc_stderr": 0.04688261722621504
},
"piqa": {
"acc": 0.6713819368879217,
"acc_stderr": 0.010959127105167048,
"acc_norm": 0.6713819368879217,
"acc_norm_stderr": 0.010959127105167044
}
},
"versions": {
"boolq": 1,
"arc_easy": 0,
"openbookqa": 0,
"hellaswag": 0,
"swag": 0,
"arc_challenge": 0,
"mc_taco": 0,
"wsc273": 0,
"winogrande": 0,
"prost": 0,
"copa": 0,
"piqa": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 0,
"batch_size": "auto",
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"gsm8k": {
"acc": 0.008339651250947688,
"acc_stderr": 0.002504942226860508
}
},
"versions": {
"gsm8k": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 8,
"batch_size": "auto",
"device": "cuda",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"mathqa": {
"acc": 0.2355108877721943,
"acc_stderr": 0.007767687364650971,
"acc_norm": 0.23618090452261306,
"acc_norm_stderr": 0.0077753193787470495
},
"gsm8k": {
"acc": 0.0,
"acc_stderr": 0.0
},
"drop": {
"em": 0.013842281879194632,
"em_stderr": 0.001196510970060749,
"f1": 0.040085989932885986,
"f1_stderr": 0.0014841664758736023
},
"math_geometry": {
"acc": 0.0020876826722338203,
"acc_stderr": 0.0020876826722338315
},
"math_counting_and_prob": {
"acc": 0.002109704641350211,
"acc_stderr": 0.002109704641350211
},
"math_prealgebra": {
"acc": 0.001148105625717566,
"acc_stderr": 0.0011481056257175708
},
"math_num_theory": {
"acc": 0.001851851851851852,
"acc_stderr": 0.0018518518518518448
},
"math_precalc": {
"acc": 0.0,
"acc_stderr": 0.0
},
"math_algebra": {
"acc": 0.0,
"acc_stderr": 0.0
},
"math_intermediate_algebra": {
"acc": 0.0,
"acc_stderr": 0.0
}
},
"versions": {
"mathqa": 0,
"gsm8k": 0,
"drop": 1,
"math_geometry": 1,
"math_counting_and_prob": 1,
"math_prealgebra": 1,
"math_num_theory": 1,
"math_precalc": 1,
"math_algebra": 1,
"math_intermediate_algebra": 1
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 5,
"batch_size": "auto",
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"pawsx_es": {
"acc": 0.515,
"acc_stderr": 0.011178102477052804
},
"pawsx_zh": {
"acc": 0.4895,
"acc_stderr": 0.011180669867648657
},
"pawsx_fr": {
"acc": 0.4615,
"acc_stderr": 0.011149934327957058
},
"pawsx_ko": {
"acc": 0.499,
"acc_stderr": 0.01118311365477017
},
"pawsx_de": {
"acc": 0.4695,
"acc_stderr": 0.011162310405413175
},
"pawsx_ja": {
"acc": 0.484,
"acc_stderr": 0.011177408788874897
},
"pawsx_en": {
"acc": 0.5245,
"acc_stderr": 0.011169702598013186
}
},
"versions": {
"pawsx_es": 0,
"pawsx_zh": 0,
"pawsx_fr": 0,
"pawsx_ko": 0,
"pawsx_de": 0,
"pawsx_ja": 0,
"pawsx_en": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1",
"num_fewshot": 0,
"batch_size": "auto",
"device": "cuda",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}