@@ -10,9 +10,11 @@ This project provides a unified framework to test generative language models on
Features:
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [Hugging Face's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
- Task versioning to ensure reproducibility when tasks are updated.
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
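
As a rough illustration of the note above, the snippet below pulls the task versions out of a results dictionary so they can be reported next to the scores. The `results` dictionary is a hand-written stand-in with made-up numbers, and `format_versions` is a hypothetical helper that is not part of the harness; only the `"versions"` key comes from the note.

```python
# Hypothetical stand-in for the dictionary produced by an evaluation run.
# The score and the version number are made up for illustration.
results = {
    "results": {"hellaswag": {"acc": 0.5}},
    "versions": {"hellaswag": 0},
}


def format_versions(results):
    """Return one 'task (version N)' string per task, sorted by task name."""
    return [
        f"{task} (version {version})"
        for task, version in sorted(results["versions"].items())
    ]


for line in format_versions(results):
    print(line)
```
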

### Hugging Face `transformers`

To evaluate a model hosted on the [Hugging Face Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag`, you can use the following command:

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0
```
...
@@ -59,16 +63,9 @@ To evaluate models that are loaded via `AutoSeq2SeqLM` in Huggingface, you inste
> **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring.
To use with [PEFT](https://github.com/huggingface/peft), take the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument.
Our library also supports language models served via the OpenAI API:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
...
@@ -90,7 +87,9 @@ python main.py \
--check_integrity
```

### Other Frameworks

A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py). Use the mesh-transformer-jax script to evaluate mesh-transformer-jax models that are not available on the Hugging Face Hub.

💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
...
@@ -104,6 +103,22 @@ python write_out.py \
This will write out one text file for each task.
## Advanced Usage
For models loaded with the Hugging Face `transformers` library, any arguments provided via `--model_args` are passed directly to the relevant constructor. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=`, or evaluate a model finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument.
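
To make the shape of that argument concrete, here is a minimal sketch of splitting a comma-separated `key=value` string into keyword arguments. `parse_model_args` is a hypothetical helper written for this illustration and may differ from how the harness actually parses the string; `PATH` is the same placeholder used in the text.

```python
def parse_model_args(arg_string):
    """Split 'key1=value1,key2=value2' into a dict of strings.

    Hypothetical helper for illustration; not the harness's own parser.
    """
    args = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        key, value = pair.split("=", 1)
        args[key] = value
    return args


base = "pretrained=EleutherAI/gpt-j-6B"
with_peft = base + ",peft=PATH"  # PATH is a placeholder, not a real adapter

print(parse_model_args(with_peft))
```

Each key in the resulting dictionary would then be forwarded as a keyword argument, which is why adding `,peft=PATH` is enough to switch from the base model to the adapted one.
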

We support wildcards in task names. For example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`.

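
The wildcard behaviour can be sketched with Python's standard `fnmatch` module, assuming shell-style pattern matching. The task list below is a small illustrative subset, and `match_tasks` is a hypothetical helper rather than the harness's own function.

```python
import fnmatch

# Illustrative task names only; the real task registry is much larger.
available_tasks = [
    "hellaswag",
    "lambada_openai",
    "lambada_openai_mt_de",
    "lambada_openai_mt_fr",
]


def match_tasks(patterns, task_names):
    """Return the sorted task names that match any of the given patterns."""
    matched = set()
    for pattern in patterns:
        matched.update(fnmatch.filter(task_names, pattern))
    return sorted(matched)


print(match_tasks(["lambada_openai_mt_*", "hellaswag"], available_tasks))
# → ['hellaswag', 'lambada_openai_mt_de', 'lambada_openai_mt_fr']
```

When passing a pattern on the command line, quoting it (for example `--tasks 'lambada_openai_mt_*'`) keeps the shell from expanding the `*` itself.
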
We currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `main`.
## Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
Language Models are Multilingual Chain-of-Thought Reasoners
https://arxiv.org/abs/2210.03057
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057).
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) were each translated by human annotators into 10 languages. The 10 languages are:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu
GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
You can find the input and targets for each of the ten languages (and English) as `.tsv` files.
We also include few-shot exemplars, likewise manually translated, for each language in `exemplars.py`.
title={Training Verifiers to Solve Math Word Problems},
author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
year={2021},
eprint={2110.14168},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shi2022language,
title={Language Models are Multilingual Chain-of-Thought Reasoners},
author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei},
title = "{PAWS}-{X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification",
author = "Yang, Yinfei and
Zhang, Yuan and
Tar, Chris and
Baldridge, Jason",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D19-1382",
doi = "10.18653/v1/D19-1382",
pages = "3687--3692",
}"""
class PAWSXBase(Task):
    VERSION = 0
    DATASET_PATH = "paws-x"
    DATASET_NAME = None  # 'en'

    YES = None  # 'Yes'
    NO = None  # 'No'
    QUESTION_WORD = None  # 'right'

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        # same as in mGPT paper
        return (
            doc["sentence1"]
            + ", "
            + self.QUESTION_WORD
            + "? [MASK], "
            + doc["sentence2"]
        )

    def doc_to_target(self, doc):
        return " " + [self.YES, self.NO][doc["label"]]

    def construct_requests(self, doc, ctx):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or
            test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few shot examples, and the question