@@ -49,6 +48,13 @@ To support loading GPTQ quantized models, install the package with the `gptq` ex
pip install -e ".[gptq]"
```
To install the package with all extras, run
```bash
pip install -e ".[all]"
```
## Support
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
...
...
@@ -93,6 +99,8 @@ python main.py \
--batch_size auto:4
```
Alternatively, you can use `lm-eval` instead of `python main.py` to invoke the eval harness from anywhere.
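As a rough sketch, an invocation via the `lm-eval` entry point might look like the following; the `hf` model type, checkpoint, and task list here are illustrative placeholders rather than values from the original text:
```bash
# Hypothetical example: evaluate a small Hugging Face checkpoint via the lm-eval entry point.
# The model type, checkpoint, and task names are illustrative placeholders.
lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai,hellaswag \
    --batch_size auto:4
```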
### Multi-GPU Evaluation with Hugging Face `accelerate`
To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation.
...
...
@@ -128,30 +136,43 @@ Using this setting helps for massive models like BLOOM which require, or to avoi
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**
To use `accelerate` with the `lm-eval` command, use
```
accelerate launch --no_python lm-eval --model ...
```
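Concretely, a full command might look like the sketch below; the model, checkpoint, and task choices are illustrative assumptions, not taken from the original text:
```bash
# Hypothetical example: launching the lm-eval entry point through accelerate.
# Model, checkpoint, and tasks are illustrative placeholders.
accelerate launch --no_python lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --batch_size 16
```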
### Commercial APIs
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for common performant local/self-hosted inference servers.
A full accounting of the supported and planned libraries + APIs can be seen below:
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|-------------------------|--------------|----------------------|-------------------|----------------|
| vLLM | :x: Not yet - needs help! | N/A | All HF models | `greedy_until` (no logprobs) |
| Your inference server here! | ... | ... | ... | ... |
It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared against the generation performance of open-source models.
Our library supports language models served via the OpenAI Completions API as follows:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
--model openai-completions \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag
```
While this functionality is only officially maintained for the official OpenAI API, it tends to also work for other hosting services that use the same API, such as [goose.ai](https://goose.ai), with minor modification. We also have an implementation for the [TextSynth](https://textsynth.com/index.html) API, using `--model textsynth`.
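A TextSynth run might look roughly like the sketch below; the API-key environment variable and engine name shown here are assumptions, so check the `textsynth` model implementation for the exact names:
```bash
# Hypothetical sketch of a TextSynth evaluation; the environment variable and
# engine name are assumptions, consult the textsynth model code for the exact ones.
export TEXTSYNTH_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model textsynth \
    --model_args engine=gptj_6B \
    --tasks lambada_openai
```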
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model openai \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag \
--check_integrity
```
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
...
...
@@ -172,6 +193,16 @@ python write_out.py \
This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model openai \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag \
--check_integrity
```
## Advanced Usage
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
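A hypothetical invocation might look like the sketch below; the `hf` model type, base checkpoint, adapter path, and task/batch choices are illustrative placeholders rather than values from the original text:
```bash
# Hypothetical sketch: evaluate a base model with a PEFT (e.g. LoRA) adapter applied.
# The checkpoint path and adapter path are illustrative placeholders.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,peft=/path/to/your/peft-adapter \
    --tasks lambada_openai \
    --batch_size 8
```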
...
...
@@ -201,6 +232,14 @@ To implement a new task in the eval harness, see [this guide](./docs/new_task_gu
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` which returns a tuple of (context, continuation).
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
## Benchmarks
When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it can be cumbersome to have to list the full set of tasks, or to add a new group name to the YAML file of each individual task.
To solve this, we can create a benchmark YAML config: a config that contains the names of the tasks that should be included in a particular benchmark. The config consists of two main keys: `group`, which denotes the name of the benchmark, and `task`, under which we list the tasks. The tasks listed under `task` must be registered task names. A good example is the list of tasks used to evaluate the Pythia Suite.
```yaml
group: pythia
task:
  - lambada_openai
  - wikitext
  - piqa
  - sciq
  - wsc
  - winogrande
  - arc
  - logiqa
  - blimp
  - hendrycksTest*
```
Alternatively, a benchmark can contain tasks whose configuration is customized per task. These are defined in the same way a YAML task is usually written.
```yaml
group: t0_eval
task:
  # Coreference Resolution
  - dataset_path: super_glue
    dataset_name: wsc.fixed
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  # Coreference Resolution
  - dataset_path: winogrande
    dataset_name: winogrande_xl
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
...
```
If the benchmark contains the same dataset but with different configurations, use `task` to differentiate between them. For example, T0-Eval evaluates on 3 versions of ANLI, but the Hugging Face dataset collects them in a single dataset.
```yaml
group: t0_eval
task:
...
- task: anli_r1
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r1
validation_split: dev_r1
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r2
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r2
validation_split: dev_r2
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
```
Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/benchmarks/`.
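For example, the `pythia` benchmark group defined above could be run roughly as follows; the model type and checkpoint are illustrative placeholders:
```bash
# Hypothetical sketch: run the `pythia` benchmark group like any other task.
# The model type and checkpoint are illustrative placeholders.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks pythia
```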
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
@@ -3,41 +3,41 @@ This list keeps track of which tasks' implementations have been ported to YAML /
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if they have been checked *against the original introducing paper's* implementation or a popularizing implementation. (WIP) denotes that a PR or person is already working on this task.
- [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
- [x] TruthfulQA (mc1)
- [x] TruthfulQA (mc2)
- [x] TruthfulQA (gen)
- [ ] MuTual
- [ ] Hendrycks Math (Hailey)
- [ ] Asdiv
- [ ] GSM8k
- [x] Arithmetic
...
...
@@ -45,20 +45,20 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] Translation (WMT) suite (Hailey)
- [x] Unscramble
- [x] ~~Pile (perplexity)~~
- [x] BLiMP
- [x] ToxiGen
- [x] StoryCloze
- [ ] NaturalQs (Hailey)
- [x] CrowS-Pairs
- [x] XCopa
- [ ] BIG-Bench (Hailey)
- [x] XStoryCloze
- [x] XWinograd
- [x] PAWS-X
- [x] XNLI
- [ ] MGSM (Lintang)
- [ ] SCROLLS
- [x] Babi
# Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.