This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
### Features
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for the Hugging Face [transformers](https://github.com/huggingface/transformers) library, [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com/), [goose.ai](https://goose.ai/), [Anthropic](https://www.anthropic.com/), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Support for GPTQ quantized models via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
- Task versioning to ensure reproducibility when tasks are updated.
### Evaluation Overview
`Task` and `Prompt` classes contain information that, when combined, produces the input to the language model. The language model is then queried to obtain an output. One or more `Filters` can then be applied to perform arbitrary operations on the model's raw output, such as selecting the final answer (for chain of thought) or calling an external API. This final output is then evaluated using a `Metric` to obtain the final result.
...

```mermaid
graph LR;
    O --> F
    Me --> R:::empty
    F --> R
```
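As a conceptual sketch of this flow (the class and method names below are illustrative, not the library's exact API):

```python
# Illustrative only: the real Task/Prompt/Filter/Metric classes have their own APIs.
def evaluate_one(doc, prompt, lm, filters, metric, reference):
    model_input = prompt.render(doc)        # Task doc + Prompt -> LM input
    output = lm.generate(model_input)       # query the language model
    for f in filters:                       # e.g. extract the final answer
        output = f.apply(output)
    return metric.score(output, reference)  # score the filtered output
```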
## Install
...

To install additional multilingual tokenization and text segmentation packages, install the package with the `multilingual` extra:

```bash
pip install -e ".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `auto-gptq` extra:
```bash
pip install -e ".[auto-gptq]"
```
## Basic Usage
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
### HuggingFace `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `lambada_openai` and `hellaswag`, you can use the following command:
```bash
python main.py \
    ...
    --device cuda:0
```
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:
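For example (the model name, revision, and dtype shown here are illustrative placeholders):

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks lambada_openai,hellaswag \
    --device cuda:0
```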
To evaluate models that are loaded via `AutoSeq2SeqLM`, you instead use `hf-seq2seq`.
> **Warning**: Choosing the wrong model type may produce erroneous outputs without raising an error.
Arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library.
To use with [PEFT](https://github.com/huggingface/peft), take the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument as shown below:
```bash
python main.py \
    --model hf-causal \
    ...
    --device cuda:0
```
GPTQ quantized models can be loaded by specifying their file names in `,quantized=NAME` (or `,quantized=True` for default names) in the `model_args` argument:
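For example, a sketch of such an invocation (the model path, checkpoint file name, and model type are placeholders; adjust them for your setup):

```bash
# Placeholder paths and file names; `quantized` points at the GPTQ checkpoint.
python main.py \
    --model hf-causal \
    --model_args pretrained=/path/to/gptq-model,quantized=model.safetensors \
    --tasks hellaswag \
    --device cuda:0
```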
Our library also supports language models served via the OpenAI API:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    ...
    --check_integrity
```
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness from their own codebases. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:

...

This will write out one text file for each task.
## Multi-GPU Evaluation
Multi-GPU evaluation is supported through [accelerate](https://github.com/huggingface/accelerate). To initialize the distributed environment, run `accelerate config` in your terminal and follow the prompts. Once the environment is configured, evaluations can be launched with:
```bash
accelerate launch main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-12b \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```
**Warning**: Distributed evaluation requires launching multiple processes of the evaluation script. Running `python main.py *args*` instead of `accelerate launch main.py *args*` on a machine with multiple GPUs will only run the evaluations on a single device (unless you instead pass `use_accelerate=True` in `--model_args`).
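For example, a single-process run that relies on `use_accelerate=True` instead of `accelerate launch` might look like the following sketch (the model name is a placeholder):

```bash
# Sketch only: multi-GPU use is requested through --model_args rather than
# through `accelerate launch`.
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-12b,use_accelerate=True \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```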
## Implementing new tasks
...
## Test Set Decontamination
To address concerns about train/test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).
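As a rough sketch, a decontaminated run points the harness at the precomputed n-gram indices; the flag name and directory layout below are assumptions, so check the decontamination guide for the exact invocation:

```bash
# Assumed flag name and index layout; see docs/decontamination.md for details.
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-12b \
    --tasks lambada_openai \
    --decontamination_ngrams_path /path/to/pile_13grams
```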
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers, and anything else you think will be needed in order to assess your model for a given task. Keep in mind that the fields of a `doc` can be basically whatever you want (you can sort this out in `training_docs`/`validation_docs`/`test_docs` if you need to customise things - see above); just remember to be consistent with them throughout the rest of the `Task` you write up.
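For illustration only, a `doc` for a simple multiple-choice task might look like this (the field names are arbitrary and entirely up to you):

```python
# Hypothetical `doc`: the field names are your choice, as long as the rest of
# your `Task` uses them consistently.
doc = {
    "query": "The capital of France is",
    "choices": [" Paris", " London", " Berlin"],
    "gold": 0,  # index of the correct choice
}
```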
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text, you'll need to return an `rf.greedy_until` request; otherwise, an `rf.loglikelihood` request across all labels in a classification task will do.
The function `construct_requests` returns a single `Request`, or a list or tuple of `Request`s. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns are then consumed one by one and turned into result objects.
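As a rough sketch only (the exact `rf` call signatures and return handling can vary between harness versions, so consult an existing task before copying this), a yes/no classification task might construct its requests like so:

```python
def construct_requests(self, doc, ctx):
    # Sketch: one loglikelihood request per candidate label. `ctx` is the
    # fully formatted (possibly few-shot) context; the " yes"/" no"
    # continuations are illustrative and should match your task's labels.
    ll_yes, _ = rf.loglikelihood(ctx, " yes")
    ll_no, _ = rf.loglikelihood(ctx, " no")
    return ll_yes, ll_no
```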
...

```bash
python main.py \
    ...
    --num_fewshot K
```
### Checking the Model Outputs
The `write_out.py` script mentioned previously can be used to verify that the prompts look as intended. If you also want to save model outputs, you can use the `--write_out` parameter in `main.py` to dump JSON with prompts and completions. The output path can be chosen with `--output_base_path`. This is helpful for debugging and for exploring model outputs.
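For example (a sketch; the model, task, and output path are placeholders):

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-12b \
    --tasks lambada_openai \
    --write_out \
    --output_base_path /path/to/output/folder
```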