Unverified commit 1d8107bf, authored by Stella Biderman, committed by GitHub

Merge pull request #362 from EleutherAI/cleanup-for-release

Cleanup `README.md` and package deps
Parents: fdd3dbc3, 1e5d55d9
@@ -32,7 +32,9 @@ jobs:
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest pytest-cov
        pip install -e .[dev,multilingual]
        # Install optional git dependencies
        pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
...
@@ -9,9 +9,9 @@ This project provides a unified framework to test autoregressive language models

Features:

- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for GPT-2, GPT-3, GPT-Neo, GPT-NeoX, and GPT-J, with a flexible tokenization-agnostic interface.
- Task versioning to ensure reproducibility.

## Install
@@ -19,26 +19,36 @@ Features:

```bash
pip install lm-eval
```

To install additional multilingual tokenization and text segmentation packages, you must install the package with the `multilingual` extra:
```bash
pip install "lm-eval[multilingual]"
```
## Basic Usage

> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.

To evaluate a model (e.g. GPT-2) on tasks such as LAMBADA and HellaSwag, you can run the following command:
```bash
python main.py \
    --model gpt2 \
    --tasks lambada_openai,hellaswag \
    --device 0
```
This example uses gpt2-117M by default, as per HF defaults.

Additional arguments can be provided to the model constructor using the `--model_args` flag. Most importantly, the `gpt2` model can be used to load an arbitrary HuggingFace CausalLM. For example, to run GPT-Neo use the following:
```bash
python main.py \
    --model gpt2 \
    --model_args pretrained=EleutherAI/gpt-neo-2.7B \
    --tasks lambada_openai,hellaswag \
    --device 0
```

If you have access to the OpenAI API, you can also evaluate GPT-3:
@@ -48,7 +58,7 @@

```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model gpt3 \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag
```

And if you want to verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
@@ -57,14 +67,47 @@

```bash
python main.py \
    --model gpt3 \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag \
    --check_integrity
```

To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
```bash
python write_out.py \
--tasks all_tasks \
--num_fewshot 5 \
--num_examples 10 \
--output_base_path /path/to/output/folder
```
This will write out one text file for each task.
## Implementing new tasks

To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
## Task Versioning
To help improve reproducibility, all tasks have a `VERSION` field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict. The purpose of the version is so that if the task definition changes (i.e., to fix a bug), we can know exactly which metrics were computed using the old buggy implementation to avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tasks remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.
When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by reporting the task name with the version appended, e.g. `taskname-v0`.
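For example, a minimal sketch of pulling the versions out of the evaluator's return dict, assuming the `lm_eval.evaluator.simple_evaluate` entry point (argument names may differ between releases):

```python
# A minimal sketch, assuming the `simple_evaluate` entry point;
# adapt model and task names to your setup.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["lambada_openai", "hellaswag"],
)

# Report each task together with its version, e.g. "hellaswag-v0".
for task_name, version in results["versions"].items():
    print(f"{task_name}-v{version}:", results["results"][task_name])
```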
## Test Set Decontamination
For details on test set decontamination, see the [decontamination guide](./docs/decontamination.md).

Note that the directory provided to the `--decontamination_ngrams_path` argument should contain the ngram files and info.json. See the above guide for ngram generation for the Pile; this can be adapted for other training sets.
```bash
python main.py \
--model gpt2 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams \
--device 0
```
## Cite as

@@ -96,371 +139,3 @@

    url = {https://doi.org/10.5281/zenodo.5371628}
}
```
### Full Task List
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |
|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|cola |✓ |✓ | | 1043|mcc |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mrpc |✓ |✓ | | 408|acc, f1 |
|rte |✓ |✓ | | 277|acc |
|qnli |✓ |✓ | | 5463|acc |
|qqp |✓ |✓ | | 40430|acc, f1 |
|sst |✓ |✓ | | 872|acc |
|wnli |✓ |✓ | | 71|acc |
|boolq |✓ |✓ | | 3270|acc |
|cb |✓ |✓ | | 56|acc, f1 |
|copa |✓ |✓ | | 100|acc |
|multirc |✓ |✓ | | 4848|acc |
|record |✓ |✓ | | 10000|f1, em |
|wic |✓ |✓ | | 638|acc |
|wsc |✓ |✓ | | 104|acc |
|coqa |✓ |✓ | | 500|f1, em |
|drop |✓ |✓ | | 9536|em, f1 |
|lambada | |✓ | | 5153|ppl, acc |
|lambada_cloze | |✓ | | 5153|ppl, acc |
|lambada_mt_en | |✓ | | 5153|ppl, acc |
|lambada_mt_fr | |✓ | | 5153|ppl, acc |
|lambada_mt_de | |✓ | | 5153|ppl, acc |
|lambada_mt_it | |✓ | | 5153|ppl, acc |
|lambada_mt_es | |✓ | | 5153|ppl, acc |
|wikitext | |✓ |✓ | 62|word_perplexity, byte_perplexity, bits_per_byte |
|piqa |✓ |✓ | | 1838|acc, acc_norm |
|prost | | |✓ | 18736|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|pubmedqa | | |✓ | 1000|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|qa4mre_2011 | | |✓ | 120|acc, acc_norm |
|qa4mre_2012 | | |✓ | 160|acc, acc_norm |
|qa4mre_2013 | | |✓ | 284|acc, acc_norm |
|triviaqa |✓ |✓ | | 11313|acc |
|arc_easy |✓ |✓ |✓ | 2376|acc, acc_norm |
|arc_challenge |✓ |✓ |✓ | 1172|acc, acc_norm |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
|hellaswag |✓ |✓ | | 10042|acc, acc_norm |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|race |✓ |✓ |✓ | 1045|acc |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|webqs |✓ | |✓ | 2032|acc |
|wsc273 | | |✓ | 273|acc |
|winogrande |✓ |✓ | | 1267|acc |
|anli_r1 |✓ |✓ |✓ | 1000|acc |
|anli_r2 |✓ |✓ |✓ | 1000|acc |
|anli_r3 |✓ |✓ |✓ | 1200|acc |
|ethics_cm |✓ | |✓ | 3885|acc |
|ethics_deontology |✓ | |✓ | 3596|acc, em |
|ethics_justice |✓ | |✓ | 2704|acc, em |
|ethics_utilitarianism_original | | |✓ | 4808|acc |
|ethics_utilitarianism |✓ | |✓ | 4808|acc |
|ethics_virtue |✓ | |✓ | 4975|acc, em |
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|math_algebra |✓ | |✓ | 1187|acc |
|math_counting_and_prob |✓ | |✓ | 474|acc |
|math_geometry |✓ | |✓ | 479|acc |
|math_intermediate_algebra |✓ | |✓ | 903|acc |
|math_num_theory |✓ | |✓ | 540|acc |
|math_prealgebra |✓ | |✓ | 871|acc |
|math_precalc |✓ | |✓ | 546|acc |
|math_asdiv | |✓ | | 2305|acc |
|arithmetic_2da | |✓ | | 2000|acc |
|arithmetic_2ds | |✓ | | 2000|acc |
|arithmetic_3da | |✓ | | 2000|acc |
|arithmetic_3ds | |✓ | | 2000|acc |
|arithmetic_4da | |✓ | | 2000|acc |
|arithmetic_4ds | |✓ | | 2000|acc |
|arithmetic_5da | |✓ | | 2000|acc |
|arithmetic_5ds | |✓ | | 2000|acc |
|arithmetic_2dm | |✓ | | 2000|acc |
|arithmetic_1dc | |✓ | | 2000|acc |
|hendrycksTest-abstract_algebra |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-anatomy |✓ |✓ |✓ | 135|acc, acc_norm |
|hendrycksTest-astronomy |✓ |✓ |✓ | 152|acc, acc_norm |
|hendrycksTest-business_ethics |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-clinical_knowledge |✓ |✓ |✓ | 265|acc, acc_norm |
|hendrycksTest-college_biology |✓ |✓ |✓ | 144|acc, acc_norm |
|hendrycksTest-college_chemistry |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_computer_science |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_mathematics |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_medicine |✓ |✓ |✓ | 173|acc, acc_norm |
|hendrycksTest-college_physics |✓ |✓ |✓ | 102|acc, acc_norm |
|hendrycksTest-computer_security |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-conceptual_physics |✓ |✓ |✓ | 235|acc, acc_norm |
|hendrycksTest-econometrics |✓ |✓ |✓ | 114|acc, acc_norm |
|hendrycksTest-electrical_engineering |✓ |✓ |✓ | 145|acc, acc_norm |
|hendrycksTest-elementary_mathematics |✓ |✓ |✓ | 378|acc, acc_norm |
|hendrycksTest-formal_logic |✓ |✓ |✓ | 126|acc, acc_norm |
|hendrycksTest-global_facts |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_biology |✓ |✓ |✓ | 310|acc, acc_norm |
|hendrycksTest-high_school_chemistry |✓ |✓ |✓ | 203|acc, acc_norm |
|hendrycksTest-high_school_computer_science |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_european_history |✓ |✓ |✓ | 165|acc, acc_norm |
|hendrycksTest-high_school_geography |✓ |✓ |✓ | 198|acc, acc_norm |
|hendrycksTest-high_school_government_and_politics |✓ |✓ |✓ | 193|acc, acc_norm |
|hendrycksTest-high_school_macroeconomics |✓ |✓ |✓ | 390|acc, acc_norm |
|hendrycksTest-high_school_mathematics |✓ |✓ |✓ | 270|acc, acc_norm |
|hendrycksTest-high_school_microeconomics |✓ |✓ |✓ | 238|acc, acc_norm |
|hendrycksTest-high_school_physics |✓ |✓ |✓ | 151|acc, acc_norm |
|hendrycksTest-high_school_psychology |✓ |✓ |✓ | 545|acc, acc_norm |
|hendrycksTest-high_school_statistics |✓ |✓ |✓ | 216|acc, acc_norm |
|hendrycksTest-high_school_us_history |✓ |✓ |✓ | 204|acc, acc_norm |
|hendrycksTest-high_school_world_history |✓ |✓ |✓ | 237|acc, acc_norm |
|hendrycksTest-human_aging |✓ |✓ |✓ | 223|acc, acc_norm |
|hendrycksTest-human_sexuality |✓ |✓ |✓ | 131|acc, acc_norm |
|hendrycksTest-international_law |✓ |✓ |✓ | 121|acc, acc_norm |
|hendrycksTest-jurisprudence |✓ |✓ |✓ | 108|acc, acc_norm |
|hendrycksTest-logical_fallacies |✓ |✓ |✓ | 163|acc, acc_norm |
|hendrycksTest-machine_learning |✓ |✓ |✓ | 112|acc, acc_norm |
|hendrycksTest-management |✓ |✓ |✓ | 103|acc, acc_norm |
|hendrycksTest-marketing |✓ |✓ |✓ | 234|acc, acc_norm |
|hendrycksTest-medical_genetics |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-miscellaneous |✓ |✓ |✓ | 783|acc, acc_norm |
|hendrycksTest-moral_disputes |✓ |✓ |✓ | 346|acc, acc_norm |
|hendrycksTest-moral_scenarios |✓ |✓ |✓ | 895|acc, acc_norm |
|hendrycksTest-nutrition |✓ |✓ |✓ | 306|acc, acc_norm |
|hendrycksTest-philosophy |✓ |✓ |✓ | 311|acc, acc_norm |
|hendrycksTest-prehistory |✓ |✓ |✓ | 324|acc, acc_norm |
|hendrycksTest-professional_accounting |✓ |✓ |✓ | 282|acc, acc_norm |
|hendrycksTest-professional_law |✓ |✓ |✓ | 1534|acc, acc_norm |
|hendrycksTest-professional_medicine |✓ |✓ |✓ | 272|acc, acc_norm |
|hendrycksTest-professional_psychology |✓ |✓ |✓ | 612|acc, acc_norm |
|hendrycksTest-public_relations |✓ |✓ |✓ | 110|acc, acc_norm |
|hendrycksTest-security_studies |✓ |✓ |✓ | 245|acc, acc_norm |
|hendrycksTest-sociology |✓ |✓ |✓ | 201|acc, acc_norm |
|hendrycksTest-us_foreign_policy |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-virology |✓ |✓ |✓ | 166|acc, acc_norm |
|hendrycksTest-world_religions |✓ |✓ |✓ | 171|acc, acc_norm |
|wmt14-en-fr | | |✓ | 3003|bleu, chrf, ter |
|wmt14-fr-en | | |✓ | 3003|bleu, chrf, ter |
|wmt16-en-ro | | |✓ | 1999|bleu, chrf, ter |
|wmt16-ro-en | | |✓ | 1999|bleu, chrf, ter |
|wmt16-de-en | | |✓ | 2999|bleu, chrf, ter |
|wmt16-en-de | | |✓ | 2999|bleu, chrf, ter |
|wmt20-cs-en | | |✓ | 664|bleu, chrf, ter |
|wmt20-de-en | | |✓ | 785|bleu, chrf, ter |
|wmt20-de-fr | | |✓ | 1619|bleu, chrf, ter |
|wmt20-en-cs | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-de | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-iu | | |✓ | 2971|bleu, chrf, ter |
|wmt20-en-ja | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-km | | |✓ | 2320|bleu, chrf, ter |
|wmt20-en-pl | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-ps | | |✓ | 2719|bleu, chrf, ter |
|wmt20-en-ru | | |✓ | 2002|bleu, chrf, ter |
|wmt20-en-ta | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-zh | | |✓ | 1418|bleu, chrf, ter |
|wmt20-fr-de | | |✓ | 1619|bleu, chrf, ter |
|wmt20-iu-en | | |✓ | 2971|bleu, chrf, ter |
|wmt20-ja-en | | |✓ | 993|bleu, chrf, ter |
|wmt20-km-en | | |✓ | 2320|bleu, chrf, ter |
|wmt20-pl-en | | |✓ | 1001|bleu, chrf, ter |
|wmt20-ps-en | | |✓ | 2719|bleu, chrf, ter |
|wmt20-ru-en | | |✓ | 991|bleu, chrf, ter |
|wmt20-ta-en | | |✓ | 997|bleu, chrf, ter |
|wmt20-zh-en | | |✓ | 2000|bleu, chrf, ter |
|iwslt17-en-ar | | |✓ | 1460|bleu, chrf, ter |
|iwslt17-ar-en | | |✓ | 1460|bleu, chrf, ter |
|anagrams1 | |✓ | | 10000|acc |
|anagrams2 | |✓ | | 10000|acc |
|cycle_letters | |✓ | | 10000|acc |
|random_insertion | |✓ | | 10000|acc |
|reversed_words | |✓ | | 10000|acc |
|pile_arxiv | |✓ |✓ | 2407|word_perplexity, byte_perplexity, bits_per_byte |
|pile_books3 | |✓ |✓ | 269|word_perplexity, byte_perplexity, bits_per_byte |
|pile_bookcorpus2 | |✓ |✓ | 28|word_perplexity, byte_perplexity, bits_per_byte |
|pile_dm-mathematics | |✓ |✓ | 1922|word_perplexity, byte_perplexity, bits_per_byte |
|pile_enron | |✓ |✓ | 1010|word_perplexity, byte_perplexity, bits_per_byte |
|pile_europarl | |✓ |✓ | 157|word_perplexity, byte_perplexity, bits_per_byte |
|pile_freelaw | |✓ |✓ | 5101|word_perplexity, byte_perplexity, bits_per_byte |
|pile_github | |✓ |✓ | 18195|word_perplexity, byte_perplexity, bits_per_byte |
|pile_gutenberg | |✓ |✓ | 80|word_perplexity, byte_perplexity, bits_per_byte |
|pile_hackernews | |✓ |✓ | 1632|word_perplexity, byte_perplexity, bits_per_byte |
|pile_nih-exporter | |✓ |✓ | 1884|word_perplexity, byte_perplexity, bits_per_byte |
|pile_opensubtitles | |✓ |✓ | 642|word_perplexity, byte_perplexity, bits_per_byte |
|pile_openwebtext2 | |✓ |✓ | 32925|word_perplexity, byte_perplexity, bits_per_byte |
|pile_philpapers | |✓ |✓ | 68|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pile-cc | |✓ |✓ | 52790|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-abstracts | |✓ |✓ | 29895|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-central | |✓ |✓ | 5911|word_perplexity, byte_perplexity, bits_per_byte |
|pile_stackexchange | |✓ |✓ | 30378|word_perplexity, byte_perplexity, bits_per_byte |
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_youtubesubtitles | |✓ |✓ | 342|word_perplexity, byte_perplexity, bits_per_byte |
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
|blimp_animate_subject_passive | |✓ | | 1000|acc |
|blimp_animate_subject_trans | |✓ | | 1000|acc |
|blimp_causative | |✓ | | 1000|acc |
|blimp_complex_NP_island | |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc |
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc |
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc |
|blimp_drop_argument | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc |
|blimp_existential_there_object_raising | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc |
|blimp_existential_there_subject_raising | |✓ | | 1000|acc |
|blimp_expletive_it_object_raising | |✓ | | 1000|acc |
|blimp_inchoative | |✓ | | 1000|acc |
|blimp_intransitive | |✓ | | 1000|acc |
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc |
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc |
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc |
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc |
|blimp_npi_present_1 | |✓ | | 1000|acc |
|blimp_npi_present_2 | |✓ | | 1000|acc |
|blimp_only_npi_licensor_present | |✓ | | 1000|acc |
|blimp_only_npi_scope | |✓ | | 1000|acc |
|blimp_passive_1 | |✓ | | 1000|acc |
|blimp_passive_2 | |✓ | | 1000|acc |
|blimp_principle_A_c_command | |✓ | | 1000|acc |
|blimp_principle_A_case_1 | |✓ | | 1000|acc |
|blimp_principle_A_case_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_1 | |✓ | | 1000|acc |
|blimp_principle_A_domain_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_3 | |✓ | | 1000|acc |
|blimp_principle_A_reconstruction | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc |
|blimp_sentential_subject_island | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc |
|blimp_transitive | |✓ | | 1000|acc |
|blimp_wh_island | |✓ | | 1000|acc |
|blimp_wh_questions_object_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc |
## Usage
### Evaluate a task
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most importantly, the `gpt2` model can be used to load an arbitrary HuggingFace model as follows:
```bash
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--device 0 \
--tasks lambada,hellaswag \
--num_fewshot 2
```
To inspect what the LM inputs look like, you can run the following command:
```bash
python write_out.py \
--tasks all_tasks \
--num_fewshot 5 \
--num_examples 10 \
--output_base_path /path/to/output/folder
```
This will write out one text file for each task.
### Test Set Decontamination
For more details see the [decontamination guide](./docs/decontamination.md).
The directory provided to the `--decontamination_ngrams_path` argument should contain the ngram files and info.json. See the above guide for ngram generation for the Pile; this can be adapted for other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure
There are two major components of the library:
1. LMs (language models), e.g. GPT-2, GPT-3
2. Tasks, e.g. MNLI, RTE, SQuAD (coming soon)
Both LMs (`lm_eval.models`) and Tasks (`lm_eval.tasks`) are kept in a registry data structure, for easy CLI instantiation.
**If you want to extend either models or tasks, simply add a new LM or Task subclass, and decorate with the registry decorator**.
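As an illustrative sketch of that registry pattern (the `register_task` decorator and `TASK_REGISTRY` names here are hypothetical stand-ins for the idea, not necessarily the library's exact API):

```python
# Hypothetical sketch of a name-keyed registry with a decorator;
# the harness keeps similar registry dicts internally.
TASK_REGISTRY = {}


def register_task(name):
    def decorator(cls):
        TASK_REGISTRY[name] = cls
        return cls

    return decorator


@register_task("my_new_task")
class MyNewTask:
    def has_validation_docs(self):
        return True


# The CLI can then instantiate any registered task by name:
task = TASK_REGISTRY["my_new_task"]()
```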
The [GPT-3 Evaluations Project](https://github.com/EleutherAI/lm_evaluation_harness/projects/1) tracks our progress implementing new tasks. Right now, we are focused on getting all the datasets loaded so that we can dedupe against the training data. Implementing the actual evaluations is nice but not necessary at the current moment.
### Task Versioning
To help improve reproducibility, all tasks have a VERSION field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict. The purpose of the version is so that if the task definition changes (i.e., to fix a bug), we can know exactly which metrics were computed using the old buggy implementation to avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tasks remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.
When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by reporting the task name with the version appended, e.g. taskname-v0.
## Description
### 1. LM Evaluation
Given an LM, we want to evaluate it on a wide range of NLU tasks. We should at least cover the set of tasks in the GPT-3 paper, and any other tasks/benchmarks that are relevant. We will follow the GPT-3 format of a) zero-shot, b) one-shot, c) few-shot evaluation.
To do this, we need 3 components:
* Data downloader (shared with later sections, potentially needs to be directly linked to the latter 2 components)
* Task formatter
* Task evaluator
The **data downloader** should download data for the relevant tasks.
* We should heavily rely on Hugging Face's NLP for this. They are already doing most of the work with handling data scripts/caching.
* Optionally, we can rely directly on HF-NLP's caching, but that makes it awkward to handle non-HF-NLP datasets. Otherwise, we can just write them out to .jsonl. My feeling is that NLU data storage will be a drop in the bucket compared to LM data.
* Where we're not using HF-NLP, we can keep the data in the raw format (.jsonl, tsv, etc) and let the other components handle transforming it.
The **task formatter** formats the task input data into an LM-usable format.
* We should potentially support multiple formats for a given task, e.g. some formats may be better or worse suited for LM evaluation. See also: prompt-engineering
* The task formatter should also support zero/one/few-shot packing of training examples into an input. This may require weird interactions with the tokenizer for dealing with max-token issues (a sketch of this packing follows below).
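A minimal sketch of that packing, assuming newline-joined examples and a hypothetical `tokenizer.encode` used only for length checks (real prompts need task-specific formatting):

```python
def pack_fewshot(train_examples, eval_example, k, tokenizer, max_tokens):
    """Prepend up to k training examples, dropping shots while the
    prompt overflows the context window. `tokenizer` is a
    hypothetical stand-in for the model's tokenizer."""
    while True:
        shots = train_examples[:k]
        prompt = "\n\n".join(shots + [eval_example])
        if k == 0 or len(tokenizer.encode(prompt)) <= max_tokens:
            return prompt
        k -= 1  # too long: retry with one fewer shot
```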
The **task evaluator** scores a task.
* In essence, we want to generate output predictions for all our input examples and feed them into some function that pops out a score (or scores). An alternative approach is to collect the output logits and score them against the expected set of outputs (sketched after this list).
* Some tasks have weird evaluation schemes, so we should make this as general as possible.
* Will thus likely have to be closely tied with the formatter.
* Likewise, we should take advantage of HF-NLP's metrics.
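For illustration, a sketch of the logit-scoring alternative mentioned above, assuming a hypothetical `loglikelihood(context, continuation)` callable that returns the summed log-probability of the continuation given the context:

```python
def score_multiple_choice(loglikelihood, context, choices, gold_index):
    """Pick the candidate continuation the model assigns the highest
    total log-probability, and score 1/0 against the gold answer."""
    scores = [loglikelihood(context, choice) for choice in choices]
    prediction = max(range(len(choices)), key=lambda i: scores[i])
    return float(prediction == gold_index)
```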
We might as well provide a sufficiently general API for the model to support the OpenAI API as well. This can double as an effort to reproduce the OpenAI NLU results.
### 2. Removing val/test data from LM training set
With the data downloader in place, we simply need to (1) expose the val/test examples, and (2) remove them from the training set.
* Arguably, (2) should be handled by LM preprocessing in a more general way. There are probably non-NLU-eval cases where we want to remove some specific data from training.
* Depending on how exactly we do the val/test removal, we may want to format the same example multiple ways to ensure that they don't get leaked into the training set in a slightly tweaked format.
* Thought experiment: SQuAD is based largely on Wikipedia. What exactly would we want to remove from the LM?
* [GPT-3]: In GPT-3, they attempted to remove val/test from their LM set, but there was a bug that caused leakage. So they ended up doing the opposite: removing overlaps from the LM set from the val/test. Funky.
* [GPT-3]: See page 30 and Appendix C for details. They do some funky n-gram based search and removal. We should think about whether we want to follow their protocol exactly (a toy sketch of the n-gram check follows).
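As a toy illustration of that kind of n-gram check (the 13-gram window is borrowed from the GPT-3 paper's description; this is not the harness's actual decontamination code):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def overlaps(train_text, eval_text, n=13):
    """True if any n-gram is shared between training and eval text."""
    return bool(ngrams(train_text.split(), n) & ngrams(eval_text.split(), n))
```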
### 3. Adding task training data to LM training set
This part is the easiest. I guess we just write out some text files containing the training data? We can let the usual LM preprocessing pipeline handle it from there.
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |
|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|anagrams1 | |✓ | | 10000|acc |
|anagrams2 | |✓ | | 10000|acc |
|anli_r1 |✓ |✓ |✓ | 1000|acc |
|anli_r2 |✓ |✓ |✓ | 1000|acc |
|anli_r3 |✓ |✓ |✓ | 1200|acc |
|arc_challenge |✓ |✓ |✓ | 1172|acc, acc_norm |
|arc_easy |✓ |✓ |✓ | 2376|acc, acc_norm |
|arithmetic_1dc | |✓ | | 2000|acc |
|arithmetic_2da | |✓ | | 2000|acc |
|arithmetic_2dm | |✓ | | 2000|acc |
|arithmetic_2ds | |✓ | | 2000|acc |
|arithmetic_3da | |✓ | | 2000|acc |
|arithmetic_3ds | |✓ | | 2000|acc |
|arithmetic_4da | |✓ | | 2000|acc |
|arithmetic_4ds | |✓ | | 2000|acc |
|arithmetic_5da | |✓ | | 2000|acc |
|arithmetic_5ds | |✓ | | 2000|acc |
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
|blimp_animate_subject_passive | |✓ | | 1000|acc |
|blimp_animate_subject_trans | |✓ | | 1000|acc |
|blimp_causative | |✓ | | 1000|acc |
|blimp_complex_NP_island | |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc |
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc |
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc |
|blimp_drop_argument | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc |
|blimp_existential_there_object_raising | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc |
|blimp_existential_there_subject_raising | |✓ | | 1000|acc |
|blimp_expletive_it_object_raising | |✓ | | 1000|acc |
|blimp_inchoative | |✓ | | 1000|acc |
|blimp_intransitive | |✓ | | 1000|acc |
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc |
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc |
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc |
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc |
|blimp_npi_present_1 | |✓ | | 1000|acc |
|blimp_npi_present_2 | |✓ | | 1000|acc |
|blimp_only_npi_licensor_present | |✓ | | 1000|acc |
|blimp_only_npi_scope | |✓ | | 1000|acc |
|blimp_passive_1 | |✓ | | 1000|acc |
|blimp_passive_2 | |✓ | | 1000|acc |
|blimp_principle_A_c_command | |✓ | | 1000|acc |
|blimp_principle_A_case_1 | |✓ | | 1000|acc |
|blimp_principle_A_case_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_1 | |✓ | | 1000|acc |
|blimp_principle_A_domain_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_3 | |✓ | | 1000|acc |
|blimp_principle_A_reconstruction | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc |
|blimp_sentential_subject_island | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc |
|blimp_transitive | |✓ | | 1000|acc |
|blimp_wh_island | |✓ | | 1000|acc |
|blimp_wh_questions_object_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc |
|boolq |✓ |✓ | | 3270|acc |
|cb |✓ |✓ | | 56|acc, f1 |
|cola |✓ |✓ | | 1043|mcc |
|copa |✓ |✓ | | 100|acc |
|coqa |✓ |✓ | | 500|f1, em |
|cycle_letters | |✓ | | 10000|acc |
|drop |✓ |✓ | | 9536|em, f1 |
|ethics_cm |✓ | |✓ | 3885|acc |
|ethics_deontology |✓ | |✓ | 3596|acc, em |
|ethics_justice |✓ | |✓ | 2704|acc, em |
|ethics_utilitarianism |✓ | |✓ | 4808|acc |
|ethics_utilitarianism_original | | |✓ | 4808|acc |
|ethics_virtue |✓ | |✓ | 4975|acc, em |
|gsm8k |✓ | |✓ | 1319|acc |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|hellaswag |✓ |✓ | | 10042|acc, acc_norm |
|hendrycksTest-abstract_algebra | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-anatomy | |✓ |✓ | 135|acc, acc_norm |
|hendrycksTest-astronomy | |✓ |✓ | 152|acc, acc_norm |
|hendrycksTest-business_ethics | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-clinical_knowledge | |✓ |✓ | 265|acc, acc_norm |
|hendrycksTest-college_biology | |✓ |✓ | 144|acc, acc_norm |
|hendrycksTest-college_chemistry | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_computer_science | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_mathematics | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_medicine | |✓ |✓ | 173|acc, acc_norm |
|hendrycksTest-college_physics | |✓ |✓ | 102|acc, acc_norm |
|hendrycksTest-computer_security | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-conceptual_physics | |✓ |✓ | 235|acc, acc_norm |
|hendrycksTest-econometrics | |✓ |✓ | 114|acc, acc_norm |
|hendrycksTest-electrical_engineering | |✓ |✓ | 145|acc, acc_norm |
|hendrycksTest-elementary_mathematics | |✓ |✓ | 378|acc, acc_norm |
|hendrycksTest-formal_logic | |✓ |✓ | 126|acc, acc_norm |
|hendrycksTest-global_facts | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_biology | |✓ |✓ | 310|acc, acc_norm |
|hendrycksTest-high_school_chemistry | |✓ |✓ | 203|acc, acc_norm |
|hendrycksTest-high_school_computer_science | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_european_history | |✓ |✓ | 165|acc, acc_norm |
|hendrycksTest-high_school_geography | |✓ |✓ | 198|acc, acc_norm |
|hendrycksTest-high_school_government_and_politics | |✓ |✓ | 193|acc, acc_norm |
|hendrycksTest-high_school_macroeconomics | |✓ |✓ | 390|acc, acc_norm |
|hendrycksTest-high_school_mathematics | |✓ |✓ | 270|acc, acc_norm |
|hendrycksTest-high_school_microeconomics | |✓ |✓ | 238|acc, acc_norm |
|hendrycksTest-high_school_physics | |✓ |✓ | 151|acc, acc_norm |
|hendrycksTest-high_school_psychology | |✓ |✓ | 545|acc, acc_norm |
|hendrycksTest-high_school_statistics | |✓ |✓ | 216|acc, acc_norm |
|hendrycksTest-high_school_us_history | |✓ |✓ | 204|acc, acc_norm |
|hendrycksTest-high_school_world_history | |✓ |✓ | 237|acc, acc_norm |
|hendrycksTest-human_aging | |✓ |✓ | 223|acc, acc_norm |
|hendrycksTest-human_sexuality | |✓ |✓ | 131|acc, acc_norm |
|hendrycksTest-international_law | |✓ |✓ | 121|acc, acc_norm |
|hendrycksTest-jurisprudence | |✓ |✓ | 108|acc, acc_norm |
|hendrycksTest-logical_fallacies | |✓ |✓ | 163|acc, acc_norm |
|hendrycksTest-machine_learning | |✓ |✓ | 112|acc, acc_norm |
|hendrycksTest-management | |✓ |✓ | 103|acc, acc_norm |
|hendrycksTest-marketing | |✓ |✓ | 234|acc, acc_norm |
|hendrycksTest-medical_genetics | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-miscellaneous | |✓ |✓ | 783|acc, acc_norm |
|hendrycksTest-moral_disputes | |✓ |✓ | 346|acc, acc_norm |
|hendrycksTest-moral_scenarios | |✓ |✓ | 895|acc, acc_norm |
|hendrycksTest-nutrition | |✓ |✓ | 306|acc, acc_norm |
|hendrycksTest-philosophy | |✓ |✓ | 311|acc, acc_norm |
|hendrycksTest-prehistory | |✓ |✓ | 324|acc, acc_norm |
|hendrycksTest-professional_accounting | |✓ |✓ | 282|acc, acc_norm |
|hendrycksTest-professional_law | |✓ |✓ | 1534|acc, acc_norm |
|hendrycksTest-professional_medicine | |✓ |✓ | 272|acc, acc_norm |
|hendrycksTest-professional_psychology | |✓ |✓ | 612|acc, acc_norm |
|hendrycksTest-public_relations | |✓ |✓ | 110|acc, acc_norm |
|hendrycksTest-security_studies | |✓ |✓ | 245|acc, acc_norm |
|hendrycksTest-sociology | |✓ |✓ | 201|acc, acc_norm |
|hendrycksTest-us_foreign_policy | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-virology | |✓ |✓ | 166|acc, acc_norm |
|hendrycksTest-world_religions | |✓ |✓ | 171|acc, acc_norm |
|iwslt17-ar-en | | |✓ | 1460|bleu, chrf, ter |
|iwslt17-en-ar | | |✓ | 1460|bleu, chrf, ter |
|lambada_openai | |✓ | | 5153|ppl, acc |
|lambada_openai_cloze | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_de | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_en | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_es | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_fr | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_it | |✓ | | 5153|ppl, acc |
|lambada_standard | |✓ |✓ | 5153|ppl, acc |
|lambada_standard_cloze | |✓ |✓ | 5153|ppl, acc |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
|math_algebra |✓ | |✓ | 1187|acc |
|math_asdiv | |✓ | | 2305|acc |
|math_counting_and_prob |✓ | |✓ | 474|acc |
|math_geometry |✓ | |✓ | 479|acc |
|math_intermediate_algebra |✓ | |✓ | 903|acc |
|math_num_theory |✓ | |✓ | 540|acc |
|math_prealgebra |✓ | |✓ | 871|acc |
|math_precalc |✓ | |✓ | 546|acc |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mrpc |✓ |✓ | | 408|acc, f1 |
|multirc |✓ |✓ | | 4848|acc |
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|pile_arxiv | |✓ |✓ | 2407|word_perplexity, byte_perplexity, bits_per_byte |
|pile_bookcorpus2 | |✓ |✓ | 28|word_perplexity, byte_perplexity, bits_per_byte |
|pile_books3 | |✓ |✓ | 269|word_perplexity, byte_perplexity, bits_per_byte |
|pile_dm-mathematics | |✓ |✓ | 1922|word_perplexity, byte_perplexity, bits_per_byte |
|pile_enron | |✓ |✓ | 1010|word_perplexity, byte_perplexity, bits_per_byte |
|pile_europarl | |✓ |✓ | 157|word_perplexity, byte_perplexity, bits_per_byte |
|pile_freelaw | |✓ |✓ | 5101|word_perplexity, byte_perplexity, bits_per_byte |
|pile_github | |✓ |✓ | 18195|word_perplexity, byte_perplexity, bits_per_byte |
|pile_gutenberg | |✓ |✓ | 80|word_perplexity, byte_perplexity, bits_per_byte |
|pile_hackernews | |✓ |✓ | 1632|word_perplexity, byte_perplexity, bits_per_byte |
|pile_nih-exporter | |✓ |✓ | 1884|word_perplexity, byte_perplexity, bits_per_byte |
|pile_opensubtitles | |✓ |✓ | 642|word_perplexity, byte_perplexity, bits_per_byte |
|pile_openwebtext2 | |✓ |✓ | 32925|word_perplexity, byte_perplexity, bits_per_byte |
|pile_philpapers | |✓ |✓ | 68|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pile-cc | |✓ |✓ | 52790|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-abstracts | |✓ |✓ | 29895|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-central | |✓ |✓ | 5911|word_perplexity, byte_perplexity, bits_per_byte |
|pile_stackexchange | |✓ |✓ | 30378|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_youtubesubtitles | |✓ |✓ | 342|word_perplexity, byte_perplexity, bits_per_byte |
|piqa |✓ |✓ | | 1838|acc, acc_norm |
|prost | | |✓ | 18736|acc, acc_norm |
|pubmedqa | | |✓ | 1000|acc |
|qa4mre_2011 | | |✓ | 120|acc, acc_norm |
|qa4mre_2012 | | |✓ | 160|acc, acc_norm |
|qa4mre_2013 | | |✓ | 284|acc, acc_norm |
|qasper |✓ |✓ | | 1764|f1_yesno, f1_abstractive |
|qnli |✓ |✓ | | 5463|acc |
|qqp |✓ |✓ | | 40430|acc, f1 |
|race |✓ |✓ |✓ | 1045|acc |
|random_insertion | |✓ | | 10000|acc |
|record |✓ |✓ | | 10000|f1, em |
|reversed_words | |✓ | | 10000|acc |
|rte |✓ |✓ | | 277|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|sst |✓ |✓ | | 872|acc |
|swag |✓ |✓ | | 20006|acc, acc_norm |
|triviaqa |✓ |✓ | | 11313|acc |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
|webqs |✓ | |✓ | 2032|acc |
|wic |✓ |✓ | | 638|acc |
|wikitext |✓ |✓ |✓ | 62|word_perplexity, byte_perplexity, bits_per_byte |
|winogrande |✓ |✓ | | 1267|acc |
|wmt14-en-fr | | |✓ | 3003|bleu, chrf, ter |
|wmt14-fr-en | | |✓ | 3003|bleu, chrf, ter |
|wmt16-de-en | | |✓ | 2999|bleu, chrf, ter |
|wmt16-en-de | | |✓ | 2999|bleu, chrf, ter |
|wmt16-en-ro | | |✓ | 1999|bleu, chrf, ter |
|wmt16-ro-en | | |✓ | 1999|bleu, chrf, ter |
|wmt20-cs-en | | |✓ | 664|bleu, chrf, ter |
|wmt20-de-en | | |✓ | 785|bleu, chrf, ter |
|wmt20-de-fr | | |✓ | 1619|bleu, chrf, ter |
|wmt20-en-cs | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-de | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-iu | | |✓ | 2971|bleu, chrf, ter |
|wmt20-en-ja | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-km | | |✓ | 2320|bleu, chrf, ter |
|wmt20-en-pl | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-ps | | |✓ | 2719|bleu, chrf, ter |
|wmt20-en-ru | | |✓ | 2002|bleu, chrf, ter |
|wmt20-en-ta | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-zh | | |✓ | 1418|bleu, chrf, ter |
|wmt20-fr-de | | |✓ | 1619|bleu, chrf, ter |
|wmt20-iu-en | | |✓ | 2971|bleu, chrf, ter |
|wmt20-ja-en | | |✓ | 993|bleu, chrf, ter |
|wmt20-km-en | | |✓ | 2320|bleu, chrf, ter |
|wmt20-pl-en | | |✓ | 1001|bleu, chrf, ter |
|wmt20-ps-en | | |✓ | 2719|bleu, chrf, ter |
|wmt20-ru-en | | |✓ | 991|bleu, chrf, ter |
|wmt20-ta-en | | |✓ | 997|bleu, chrf, ter |
|wmt20-zh-en | | |✓ | 2000|bleu, chrf, ter |
|wnli |✓ |✓ | | 71|acc |
|wsc |✓ |✓ | | 104|acc |
|wsc273 | | |✓ | 273|acc |
@@ -16,6 +16,20 @@ from lm_eval import metrics
from lm_eval.base import Task, rf
from typing import List

try:
    import nagisa

    HAS_NAGISA = True
except ImportError:
    HAS_NAGISA = False

try:
    import jieba

    HAS_JIEBA = True
except ImportError:
    HAS_JIEBA = False

_CITATION = """
@inproceedings{post-2018-call,
@@ -63,14 +77,22 @@ def create_tasks_from_benchmarks(benchmark_dict):

def zh_split(zh_text: List[str]) -> List[str]:
    """Chinese splitting"""
    if not HAS_JIEBA:
        raise ImportError(
            "Chinese text splitting requires the `jieba` package. "
            "Please install it with:\npip install jieba"
        )
    return [" ".join(jieba.cut(txt.strip())) for txt in zh_text]


def ja_split(ja_text: List[str]) -> List[str]:
    """Japanese splitting"""
    if not HAS_NAGISA:
        raise ImportError(
            "Japanese text splitting requires the `nagisa` package. "
            "Please install it with:\npip install nagisa"
        )
    return [" ".join(nagisa.tagging(txt.strip()).words) for txt in ja_text]
...
@@ -27,6 +27,14 @@ from lm_eval.base import rf, Task
from lm_eval.metrics import mean

try:
    import bleurt

    HAS_BLEURT = True
except ImportError:
    HAS_BLEURT = False

_CITATION = """
@misc{lin2021truthfulqa,
    title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
@@ -164,6 +172,12 @@ class TruthfulQAGeneration(Task):
    def __init__(self):
        super().__init__()
        if not HAS_BLEURT:
            raise ImportError(
                "`TruthfulQAGeneration` requires the `bleurt` package. Please install it with:\n"
                "pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt"
                "\nWARNING: Installing any other version of bleurt may result in different results."
            )
        self.bleurt = datasets.load_metric("bleurt")

    def has_training_docs(self):
...
@@ -5,7 +5,6 @@ import collections
import functools
import inspect
import sys
from typing import List

@@ -187,6 +186,8 @@ def run_task_tests(task_list: List[str]):
    """
    Find the package root and run the tests for the given tasks
    """
    import pytest

    package_root = find_test_root(start_path=pathlib.Path(__file__))
    task_string = " or ".join(task_list)
    args = [
...
"""
Usage:
python make_table_tasks.py --output <markdown_filename>
"""
import argparse
import logging
from lm_eval import tasks from lm_eval import tasks
from pytablewriter import MarkdownTableWriter from pytablewriter import MarkdownTableWriter
writer = MarkdownTableWriter()
writer.headers = ["Task Name", "Train", "Val", "Test", "Val/Test Docs", "Metrics"]
values = [] logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def chk(tf): def check(tf):
if tf: if tf:
return "✓" return "✓"
else: else:
return " " return " "
for tname, Task in tasks.TASK_REGISTRY.items(): if __name__ == "__main__":
task = Task() parser = argparse.ArgumentParser()
parser.add_argument("--output", type=str, default="task_table.md")
args = parser.parse_args()
writer = MarkdownTableWriter()
writer.headers = ["Task Name", "Train", "Val", "Test", "Val/Test Docs", "Metrics"]
values = []
tasks = tasks.TASK_REGISTRY.items()
tasks = sorted(tasks, key=lambda x: x[0])
for tname, Task in tasks:
task = Task()
v = [ v = [
tname, tname,
chk(task.has_training_docs()), check(task.has_training_docs()),
chk(task.has_validation_docs()), check(task.has_validation_docs()),
chk(task.has_test_docs()), check(task.has_test_docs()),
len(list(task.test_docs() if task.has_test_docs() else task.validation_docs())), len(
list(
task.test_docs() if task.has_test_docs() else task.validation_docs()
)
),
", ".join(task.aggregation().keys()), ", ".join(task.aggregation().keys()),
] ]
print(v) logger.info(v)
values.append(v) values.append(v)
writer.value_matrix = values
writer.value_matrix = values table = writer.dumps()
with open(args.output, "w") as f:
print(writer.dumps()) f.write(table)
@@ -14,6 +14,7 @@ setuptools.setup(
    url="https://github.com/EleutherAI/lm-evaluation-harness",
    packages=setuptools.find_packages(),
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",

@@ -21,29 +22,23 @@ setuptools.setup(
    python_requires=">=3.6",
    install_requires=[
        "datasets>=2.0.0",
        "jsonlines",
        "numexpr",
        "openai>=0.6.4",
        "pybind11>=2.6.2",
        "pycountry",
        "pytablewriter",
        "rouge-score>=0.0.4",
        "sacrebleu==1.5.0",
        "scikit-learn>=0.24.1",
        "sqlitedict",
        "torch>=1.7",
        "tqdm-multiprocess",
        "transformers>=4.1",
        "zstandard",
    ],
    extras_require={
        "dev": ["black", "flake8", "pre-commit", "pytest", "pytest-cov"],
        "multilingual": ["nagisa>=0.2.7", "jieba>=0.42.1"],
    },
)
@@ -258,8 +258,9 @@ def textsynth_mock_completion(**kwargs):
    import requests

    os.makedirs("tests/testdata", exist_ok=True)
    hash_kwargs = {k: v for k, v in kwargs.items() if k != "headers"}
    hash = hashlib.sha256(
        json.dumps(hash_kwargs, sort_keys=True).encode("utf-8")
    ).hexdigest()
    fname = f"tests/testdata/textsynth_test_{hash}.pkl"
...
@@ -7,10 +7,7 @@ from itertools import islice

@pytest.mark.parametrize("taskname,task_class", tasks.TASK_REGISTRY.items())
def test_basic_interface(taskname, task_class):
    print("Evaluating task", taskname)
    task = task_class()
    assert task.has_training_docs() in [True, False]
    assert task.has_validation_docs() in [True, False]
...
@@ -51,7 +51,7 @@ def flatten(d, parent_key="", sep="."):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
...