Unverified commit 1d8107bf, authored by Stella Biderman, committed by GitHub

Merge pull request #362 from EleutherAI/cleanup-for-release

Cleanup `README.md` and package deps
Parents: fdd3dbc3, 1e5d55d9
@@ -32,7 +32,9 @@ jobs:
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest pytest-cov
        pip install -e .[dev,multilingual]
        # Install optional git dependencies
        pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
...
@@ -9,9 +9,9 @@ This project provides a unified framework to test autoregressive language models

Features:

- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for GPT-2, GPT-3, GPT-Neo, GPT-NeoX, and GPT-J, with a flexible tokenization-agnostic interface.
- Task versioning to ensure reproducibility.

## Install
@@ -19,26 +19,36 @@ Features:

```bash
pip install lm-eval
```

To install additional multilingual tokenization and text segmentation packages, you must install the package with the `multilingual` extra:
```bash
pip install "lm-eval[multilingual]"
```
## Basic Usage

> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.

To evaluate a model (e.g. GPT-2) on tasks such as LAMBADA and HellaSwag, you can run the following command:
```bash
python main.py \
    --model gpt2 \
    --tasks lambada_openai,hellaswag \
    --device 0
```
This example uses gpt2-117M by default, as per HF defaults.

Additional arguments can be provided to the model constructor using the `--model_args` flag. Most importantly, the `gpt2` model can be used to load an arbitrary HuggingFace CausalLM. For example, to run GPT-Neo use the following:
```bash
python main.py \
    --model gpt2 \
    --model_args pretrained=EleutherAI/gpt-neo-2.7B \
    --tasks lambada_openai,hellaswag \
    --device 0
```

If you have access to the OpenAI API, you can also evaluate GPT-3:
@@ -48,7 +58,7 @@

```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model gpt3 \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag
```

And if you want to verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
@@ -57,14 +67,47 @@

```bash
python main.py \
    --model gpt3 \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag \
    --check_integrity
```

To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
```bash
python write_out.py \
--tasks all_tasks \
--num_fewshot 5 \
--num_examples 10 \
--output_base_path /path/to/output/folder
```
This will write out one text file for each task.
## Implementing new tasks

To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
## Task Versioning
To help improve reproducibility, all tasks have a `VERSION` field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict. The purpose of the version is so that if the task definition changes (i.e., to fix a bug), we can know exactly which metrics were computed using the old buggy implementation to avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tasks remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.
When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by reporting the task name with the version appended, e.g. `taskname-v0`.
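For example, a minimal sketch of pulling the versions out of the evaluator's return dict, assuming the `lm_eval.evaluator.simple_evaluate` entry point (argument names may differ between releases):

```python
# A minimal sketch, assuming the `simple_evaluate` entry point;
# adapt model and task names to your setup.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["lambada_openai", "hellaswag"],
)

# Report each task together with its version, e.g. "hellaswag-v0".
for task_name, version in results["versions"].items():
    print(f"{task_name}-v{version}:", results["results"][task_name])
```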
## Test Set Decontamination
For details on test set decontamination, see the [decontamination guide](./docs/decontamination.md).

Note that the directory provided to the `--decontamination_ngrams_path` argument should contain the ngram files and info.json. See the above guide for ngram generation for the Pile; this can be adapted for other training sets.
```bash
python main.py \
--model gpt2 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams \
--device 0
```
## Cite as

@@ -96,371 +139,3 @@

    url = {https://doi.org/10.5281/zenodo.5371628}
}
```
### Full Task List
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |
|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|cola |✓ |✓ | | 1043|mcc |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mrpc |✓ |✓ | | 408|acc, f1 |
|rte |✓ |✓ | | 277|acc |
|qnli |✓ |✓ | | 5463|acc |
|qqp |✓ |✓ | | 40430|acc, f1 |
|sst |✓ |✓ | | 872|acc |
|wnli |✓ |✓ | | 71|acc |
|boolq |✓ |✓ | | 3270|acc |
|cb |✓ |✓ | | 56|acc, f1 |
|copa |✓ |✓ | | 100|acc |
|multirc |✓ |✓ | | 4848|acc |
|record |✓ |✓ | | 10000|f1, em |
|wic |✓ |✓ | | 638|acc |
|wsc |✓ |✓ | | 104|acc |
|coqa |✓ |✓ | | 500|f1, em |
|drop |✓ |✓ | | 9536|em, f1 |
|lambada | |✓ | | 5153|ppl, acc |
|lambada_cloze | |✓ | | 5153|ppl, acc |
|lambada_mt_en | |✓ | | 5153|ppl, acc |
|lambada_mt_fr | |✓ | | 5153|ppl, acc |
|lambada_mt_de | |✓ | | 5153|ppl, acc |
|lambada_mt_it | |✓ | | 5153|ppl, acc |
|lambada_mt_es | |✓ | | 5153|ppl, acc |
|wikitext | |✓ |✓ | 62|word_perplexity, byte_perplexity, bits_per_byte |
|piqa |✓ |✓ | | 1838|acc, acc_norm |
|prost | | |✓ | 18736|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|pubmedqa | | |✓ | 1000|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|qa4mre_2011 | | |✓ | 120|acc, acc_norm |
|qa4mre_2012 | | |✓ | 160|acc, acc_norm |
|qa4mre_2013 | | |✓ | 284|acc, acc_norm |
|triviaqa |✓ |✓ | | 11313|acc |
|arc_easy |✓ |✓ |✓ | 2376|acc, acc_norm |
|arc_challenge |✓ |✓ |✓ | 1172|acc, acc_norm |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
|hellaswag |✓ |✓ | | 10042|acc, acc_norm |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|race |✓ |✓ |✓ | 1045|acc |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|webqs |✓ | |✓ | 2032|acc |
|wsc273 | | |✓ | 273|acc |
|winogrande |✓ |✓ | | 1267|acc |
|anli_r1 |✓ |✓ |✓ | 1000|acc |
|anli_r2 |✓ |✓ |✓ | 1000|acc |
|anli_r3 |✓ |✓ |✓ | 1200|acc |
|ethics_cm |✓ | |✓ | 3885|acc |
|ethics_deontology |✓ | |✓ | 3596|acc, em |
|ethics_justice |✓ | |✓ | 2704|acc, em |
|ethics_utilitarianism_original | | |✓ | 4808|acc |
|ethics_utilitarianism |✓ | |✓ | 4808|acc |
|ethics_virtue |✓ | |✓ | 4975|acc, em |
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|math_algebra |✓ | |✓ | 1187|acc |
|math_counting_and_prob |✓ | |✓ | 474|acc |
|math_geometry |✓ | |✓ | 479|acc |
|math_intermediate_algebra |✓ | |✓ | 903|acc |
|math_num_theory |✓ | |✓ | 540|acc |
|math_prealgebra |✓ | |✓ | 871|acc |
|math_precalc |✓ | |✓ | 546|acc |
|math_asdiv | |✓ | | 2305|acc |
|arithmetic_2da | |✓ | | 2000|acc |
|arithmetic_2ds | |✓ | | 2000|acc |
|arithmetic_3da | |✓ | | 2000|acc |
|arithmetic_3ds | |✓ | | 2000|acc |
|arithmetic_4da | |✓ | | 2000|acc |
|arithmetic_4ds | |✓ | | 2000|acc |
|arithmetic_5da | |✓ | | 2000|acc |
|arithmetic_5ds | |✓ | | 2000|acc |
|arithmetic_2dm | |✓ | | 2000|acc |
|arithmetic_1dc | |✓ | | 2000|acc |
|hendrycksTest-abstract_algebra |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-anatomy |✓ |✓ |✓ | 135|acc, acc_norm |
|hendrycksTest-astronomy |✓ |✓ |✓ | 152|acc, acc_norm |
|hendrycksTest-business_ethics |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-clinical_knowledge |✓ |✓ |✓ | 265|acc, acc_norm |
|hendrycksTest-college_biology |✓ |✓ |✓ | 144|acc, acc_norm |
|hendrycksTest-college_chemistry |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_computer_science |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_mathematics |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_medicine |✓ |✓ |✓ | 173|acc, acc_norm |
|hendrycksTest-college_physics |✓ |✓ |✓ | 102|acc, acc_norm |
|hendrycksTest-computer_security |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-conceptual_physics |✓ |✓ |✓ | 235|acc, acc_norm |
|hendrycksTest-econometrics |✓ |✓ |✓ | 114|acc, acc_norm |
|hendrycksTest-electrical_engineering |✓ |✓ |✓ | 145|acc, acc_norm |
|hendrycksTest-elementary_mathematics |✓ |✓ |✓ | 378|acc, acc_norm |
|hendrycksTest-formal_logic |✓ |✓ |✓ | 126|acc, acc_norm |
|hendrycksTest-global_facts |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_biology |✓ |✓ |✓ | 310|acc, acc_norm |
|hendrycksTest-high_school_chemistry |✓ |✓ |✓ | 203|acc, acc_norm |
|hendrycksTest-high_school_computer_science |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_european_history |✓ |✓ |✓ | 165|acc, acc_norm |
|hendrycksTest-high_school_geography |✓ |✓ |✓ | 198|acc, acc_norm |
|hendrycksTest-high_school_government_and_politics |✓ |✓ |✓ | 193|acc, acc_norm |
|hendrycksTest-high_school_macroeconomics |✓ |✓ |✓ | 390|acc, acc_norm |
|hendrycksTest-high_school_mathematics |✓ |✓ |✓ | 270|acc, acc_norm |
|hendrycksTest-high_school_microeconomics |✓ |✓ |✓ | 238|acc, acc_norm |
|hendrycksTest-high_school_physics |✓ |✓ |✓ | 151|acc, acc_norm |
|hendrycksTest-high_school_psychology |✓ |✓ |✓ | 545|acc, acc_norm |
|hendrycksTest-high_school_statistics |✓ |✓ |✓ | 216|acc, acc_norm |
|hendrycksTest-high_school_us_history |✓ |✓ |✓ | 204|acc, acc_norm |
|hendrycksTest-high_school_world_history |✓ |✓ |✓ | 237|acc, acc_norm |
|hendrycksTest-human_aging |✓ |✓ |✓ | 223|acc, acc_norm |
|hendrycksTest-human_sexuality |✓ |✓ |✓ | 131|acc, acc_norm |
|hendrycksTest-international_law |✓ |✓ |✓ | 121|acc, acc_norm |
|hendrycksTest-jurisprudence |✓ |✓ |✓ | 108|acc, acc_norm |
|hendrycksTest-logical_fallacies |✓ |✓ |✓ | 163|acc, acc_norm |
|hendrycksTest-machine_learning |✓ |✓ |✓ | 112|acc, acc_norm |
|hendrycksTest-management |✓ |✓ |✓ | 103|acc, acc_norm |
|hendrycksTest-marketing |✓ |✓ |✓ | 234|acc, acc_norm |
|hendrycksTest-medical_genetics |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-miscellaneous |✓ |✓ |✓ | 783|acc, acc_norm |
|hendrycksTest-moral_disputes |✓ |✓ |✓ | 346|acc, acc_norm |
|hendrycksTest-moral_scenarios |✓ |✓ |✓ | 895|acc, acc_norm |
|hendrycksTest-nutrition |✓ |✓ |✓ | 306|acc, acc_norm |
|hendrycksTest-philosophy |✓ |✓ |✓ | 311|acc, acc_norm |
|hendrycksTest-prehistory |✓ |✓ |✓ | 324|acc, acc_norm |
|hendrycksTest-professional_accounting |✓ |✓ |✓ | 282|acc, acc_norm |
|hendrycksTest-professional_law |✓ |✓ |✓ | 1534|acc, acc_norm |
|hendrycksTest-professional_medicine |✓ |✓ |✓ | 272|acc, acc_norm |
|hendrycksTest-professional_psychology |✓ |✓ |✓ | 612|acc, acc_norm |
|hendrycksTest-public_relations |✓ |✓ |✓ | 110|acc, acc_norm |
|hendrycksTest-security_studies |✓ |✓ |✓ | 245|acc, acc_norm |
|hendrycksTest-sociology |✓ |✓ |✓ | 201|acc, acc_norm |
|hendrycksTest-us_foreign_policy |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-virology |✓ |✓ |✓ | 166|acc, acc_norm |
|hendrycksTest-world_religions |✓ |✓ |✓ | 171|acc, acc_norm |
|wmt14-en-fr | | |✓ | 3003|bleu, chrf, ter |
|wmt14-fr-en | | |✓ | 3003|bleu, chrf, ter |
|wmt16-en-ro | | |✓ | 1999|bleu, chrf, ter |
|wmt16-ro-en | | |✓ | 1999|bleu, chrf, ter |
|wmt16-de-en | | |✓ | 2999|bleu, chrf, ter |
|wmt16-en-de | | |✓ | 2999|bleu, chrf, ter |
|wmt20-cs-en | | |✓ | 664|bleu, chrf, ter |
|wmt20-de-en | | |✓ | 785|bleu, chrf, ter |
|wmt20-de-fr | | |✓ | 1619|bleu, chrf, ter |
|wmt20-en-cs | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-de | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-iu | | |✓ | 2971|bleu, chrf, ter |
|wmt20-en-ja | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-km | | |✓ | 2320|bleu, chrf, ter |
|wmt20-en-pl | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-ps | | |✓ | 2719|bleu, chrf, ter |
|wmt20-en-ru | | |✓ | 2002|bleu, chrf, ter |
|wmt20-en-ta | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-zh | | |✓ | 1418|bleu, chrf, ter |
|wmt20-fr-de | | |✓ | 1619|bleu, chrf, ter |
|wmt20-iu-en | | |✓ | 2971|bleu, chrf, ter |
|wmt20-ja-en | | |✓ | 993|bleu, chrf, ter |
|wmt20-km-en | | |✓ | 2320|bleu, chrf, ter |
|wmt20-pl-en | | |✓ | 1001|bleu, chrf, ter |
|wmt20-ps-en | | |✓ | 2719|bleu, chrf, ter |
|wmt20-ru-en | | |✓ | 991|bleu, chrf, ter |
|wmt20-ta-en | | |✓ | 997|bleu, chrf, ter |
|wmt20-zh-en | | |✓ | 2000|bleu, chrf, ter |
|iwslt17-en-ar | | |✓ | 1460|bleu, chrf, ter |
|iwslt17-ar-en | | |✓ | 1460|bleu, chrf, ter |
|anagrams1 | |✓ | | 10000|acc |
|anagrams2 | |✓ | | 10000|acc |
|cycle_letters | |✓ | | 10000|acc |
|random_insertion | |✓ | | 10000|acc |
|reversed_words | |✓ | | 10000|acc |
|pile_arxiv | |✓ |✓ | 2407|word_perplexity, byte_perplexity, bits_per_byte |
|pile_books3 | |✓ |✓ | 269|word_perplexity, byte_perplexity, bits_per_byte |
|pile_bookcorpus2 | |✓ |✓ | 28|word_perplexity, byte_perplexity, bits_per_byte |
|pile_dm-mathematics | |✓ |✓ | 1922|word_perplexity, byte_perplexity, bits_per_byte |
|pile_enron | |✓ |✓ | 1010|word_perplexity, byte_perplexity, bits_per_byte |
|pile_europarl | |✓ |✓ | 157|word_perplexity, byte_perplexity, bits_per_byte |
|pile_freelaw | |✓ |✓ | 5101|word_perplexity, byte_perplexity, bits_per_byte |
|pile_github | |✓ |✓ | 18195|word_perplexity, byte_perplexity, bits_per_byte |
|pile_gutenberg | |✓ |✓ | 80|word_perplexity, byte_perplexity, bits_per_byte |
|pile_hackernews | |✓ |✓ | 1632|word_perplexity, byte_perplexity, bits_per_byte |
|pile_nih-exporter | |✓ |✓ | 1884|word_perplexity, byte_perplexity, bits_per_byte |
|pile_opensubtitles | |✓ |✓ | 642|word_perplexity, byte_perplexity, bits_per_byte |
|pile_openwebtext2 | |✓ |✓ | 32925|word_perplexity, byte_perplexity, bits_per_byte |
|pile_philpapers | |✓ |✓ | 68|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pile-cc | |✓ |✓ | 52790|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-abstracts | |✓ |✓ | 29895|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-central | |✓ |✓ | 5911|word_perplexity, byte_perplexity, bits_per_byte |
|pile_stackexchange | |✓ |✓ | 30378|word_perplexity, byte_perplexity, bits_per_byte |
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_youtubesubtitles | |✓ |✓ | 342|word_perplexity, byte_perplexity, bits_per_byte |
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
|blimp_animate_subject_passive | |✓ | | 1000|acc |
|blimp_animate_subject_trans | |✓ | | 1000|acc |
|blimp_causative | |✓ | | 1000|acc |
|blimp_complex_NP_island | |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc |
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc |
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc |
|blimp_drop_argument | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc |
|blimp_existential_there_object_raising | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc |
|blimp_existential_there_subject_raising | |✓ | | 1000|acc |
|blimp_expletive_it_object_raising | |✓ | | 1000|acc |
|blimp_inchoative | |✓ | | 1000|acc |
|blimp_intransitive | |✓ | | 1000|acc |
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc |
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc |
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc |
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc |
|blimp_npi_present_1 | |✓ | | 1000|acc |
|blimp_npi_present_2 | |✓ | | 1000|acc |
|blimp_only_npi_licensor_present | |✓ | | 1000|acc |
|blimp_only_npi_scope | |✓ | | 1000|acc |
|blimp_passive_1 | |✓ | | 1000|acc |
|blimp_passive_2 | |✓ | | 1000|acc |
|blimp_principle_A_c_command | |✓ | | 1000|acc |
|blimp_principle_A_case_1 | |✓ | | 1000|acc |
|blimp_principle_A_case_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_1 | |✓ | | 1000|acc |
|blimp_principle_A_domain_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_3 | |✓ | | 1000|acc |
|blimp_principle_A_reconstruction | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc |
|blimp_sentential_subject_island | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc |
|blimp_transitive | |✓ | | 1000|acc |
|blimp_wh_island | |✓ | | 1000|acc |
|blimp_wh_questions_object_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc |
## Usage
### Evaluate a task
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most importantly, the `gpt2` model can be used to load an arbitrary HuggingFace model as follows:
```bash
python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--device 0 \
--tasks lambada,hellaswag \
--num_fewshot 2
```
To inspect what the LM inputs look like, you can run the following command:
```bash
python write_out.py \
--tasks all_tasks \
--num_fewshot 5 \
--num_examples 10 \
--output_base_path /path/to/output/folder
```
This will write out one text file for each task.
### Test Set Decontamination
For more details see the [decontamination guide](./docs/decontamination.md).
The directory provided to the `--decontamination_ngrams_path` argument should contain the ngram files and info.json. See the above guide for ngram generation for the Pile; this can be adapted for other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure
There are two major components of the library:
1. LMs (language models), e.g. GPT-2, GPT-3
2. Tasks, e.g. MNLI, RTE, SQuAD (coming soon)
Both LMs (`lm_eval.models`) and Tasks (`lm_eval.tasks`) are kept in a registry data structure, for easy CLI instantiation.
**If you want to extend either models or tasks, simply add a new LM or Task subclass, and decorate with the registry decorator**.
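As an illustrative sketch of that registry pattern (the `register_task` decorator and `TASK_REGISTRY` names here are hypothetical stand-ins for the idea, not necessarily the library's exact API):

```python
# Hypothetical sketch of a name-keyed registry with a decorator;
# the harness keeps similar registry dicts internally.
TASK_REGISTRY = {}


def register_task(name):
    def decorator(cls):
        TASK_REGISTRY[name] = cls
        return cls

    return decorator


@register_task("my_new_task")
class MyNewTask:
    def has_validation_docs(self):
        return True


# The CLI can then instantiate any registered task by name:
task = TASK_REGISTRY["my_new_task"]()
```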
The [GPT-3 Evaluations Project](https://github.com/EleutherAI/lm_evaluation_harness/projects/1) tracks our progress implementing new tasks. Right now, we are focused on getting all the datasets loaded so that we can dedupe against the training data. Implementing the actual evaluations is nice but not necessary at the current moment.
### Task Versioning
To help improve reproducibility, all tasks have a VERSION field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict. The purpose of the version is so that if the task definition changes (i.e., to fix a bug), we can know exactly which metrics were computed using the old buggy implementation to avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tasks remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.
When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by reporting the task name with the version appended, e.g. taskname-v0.
## Description
### 1. LM Evaluation
Given an LM, we want to evaluate it on a wide range of NLU tasks. We should at least cover the set of tasks in the GPT-3 paper, and any other tasks/benchmarks that are relevant. We will follow the GPT-3 format of a) zero-shot, b) one-shot, c) few-shot evaluation.
To do this, we need 3 components:
* Data downloader (shared with later sections, potentially needs to be directly linked to the latter 2 components)
* Task formatter
* Task evaluator
The **data downloader** should download data for the relevant tasks.
* We should heavily rely on Hugging Face's NLP for this. They are already doing most of the work with handling data scripts/caching.
* Optionally, we can rely directly on HF-NLP's caching, but that makes it awkward to handle non-HF-NLP datasets. Otherwise, we can just write them out to .jsonl. My feeling is that NLU data storage will be a drop in the bucket compared to LM data.
* Where we're not using HF-NLP, we can keep the data in the raw format (.jsonl, tsv, etc) and let the other components handle transforming it.
The **task formatter** formats the task input data into an LM-usable format.
* We should potentially support multiple formats for a given task, e.g. some formats may be better or worse suited for LM evaluation. See also: prompt-engineering
* The task formatter should also support zero/one/few-shot packing of training examples into an input. This may require weird interactions with the tokenizer for dealing with max-token issues (a sketch of this packing follows below).
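A minimal sketch of that packing, assuming newline-joined examples and a hypothetical `tokenizer.encode` used only for length checks (real prompts need task-specific formatting):

```python
def pack_fewshot(train_examples, eval_example, k, tokenizer, max_tokens):
    """Prepend up to k training examples, dropping shots while the
    prompt overflows the context window. `tokenizer` is a
    hypothetical stand-in for the model's tokenizer."""
    while True:
        shots = train_examples[:k]
        prompt = "\n\n".join(shots + [eval_example])
        if k == 0 or len(tokenizer.encode(prompt)) <= max_tokens:
            return prompt
        k -= 1  # too long: retry with one fewer shot
```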
The **task evaluator** scores a task.
* In essence, we want to generate output predictions for all our input examples and feed them into some function that pops out a score (or scores). An alternative approach is to collect the output logits and score them against the expected set of outputs (sketched after this list).
* Some tasks have weird evaluation schemes, so we should make this as general as possible.
* Will thus likely have to be closely tied with the formatter.
* Likewise, we should take advantage of HF-NLP's metrics.
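For illustration, a sketch of the logit-scoring alternative mentioned above, assuming a hypothetical `loglikelihood(context, continuation)` callable that returns the summed log-probability of the continuation given the context:

```python
def score_multiple_choice(loglikelihood, context, choices, gold_index):
    """Pick the candidate continuation the model assigns the highest
    total log-probability, and score 1/0 against the gold answer."""
    scores = [loglikelihood(context, choice) for choice in choices]
    prediction = max(range(len(choices)), key=lambda i: scores[i])
    return float(prediction == gold_index)
```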
We might as well provide a sufficiently general API for the model to support the OpenAI API as well. This can double as an effort to reproduce the OpenAI NLU results.
### 2. Removing val/test data from LM training set
With the data downloader in place, we simply need to (1) expose the val/test examples, and (2) remove them from the training set.
* Arguably, (2) should be handled by LM preprocessing in a more general way. There are probably non-NLU-eval cases where we want to remove some specific data from training.
* Depending on how exactly we do the val/test removal, we may want to format the same example multiple ways to ensure that they don't get leaked into the training set in a slightly tweaked format.
* Thought experiment: SQuAD is based largely on Wikipedia. What exactly would we want to remove from the LM?
* [GPT-3]: In GPT-3, they attempted to remove val/test from their LM set, but there was a bug that caused leakage. So they ended up doing the opposite: removing overlaps from the LM set from the val/test. Funky.
* [GPT-3]: See page 30 and Appendix C for details. They do some funky n-gram based search and removal. We should think about whether we want to follow their protocol exactly (a toy sketch of the n-gram check follows).
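As a toy illustration of that kind of n-gram check (the 13-gram window is borrowed from the GPT-3 paper's description; this is not the harness's actual decontamination code):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def overlaps(train_text, eval_text, n=13):
    """True if any n-gram is shared between training and eval text."""
    return bool(ngrams(train_text.split(), n) & ngrams(eval_text.split(), n))
```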
### 3. Adding task training data to LM training set
This part is the easiest. I guess we just write out some text files containing the training data? We can let the usual LM preprocessing pipeline handle it from there.
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |
|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|anagrams1 | |✓ | | 10000|acc |
|anagrams2 | |✓ | | 10000|acc |
|anli_r1 |✓ |✓ |✓ | 1000|acc |
|anli_r2 |✓ |✓ |✓ | 1000|acc |
|anli_r3 |✓ |✓ |✓ | 1200|acc |
|arc_challenge |✓ |✓ |✓ | 1172|acc, acc_norm |
|arc_easy |✓ |✓ |✓ | 2376|acc, acc_norm |
|arithmetic_1dc | |✓ | | 2000|acc |
|arithmetic_2da | |✓ | | 2000|acc |
|arithmetic_2dm | |✓ | | 2000|acc |
|arithmetic_2ds | |✓ | | 2000|acc |
|arithmetic_3da | |✓ | | 2000|acc |
|arithmetic_3ds | |✓ | | 2000|acc |
|arithmetic_4da | |✓ | | 2000|acc |
|arithmetic_4ds | |✓ | | 2000|acc |
|arithmetic_5da | |✓ | | 2000|acc |
|arithmetic_5ds | |✓ | | 2000|acc |
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
|blimp_animate_subject_passive | |✓ | | 1000|acc |
|blimp_animate_subject_trans | |✓ | | 1000|acc |
|blimp_causative | |✓ | | 1000|acc |
|blimp_complex_NP_island | |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc |
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc |
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc |
|blimp_drop_argument | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc |
|blimp_existential_there_object_raising | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc |
|blimp_existential_there_subject_raising | |✓ | | 1000|acc |
|blimp_expletive_it_object_raising | |✓ | | 1000|acc |
|blimp_inchoative | |✓ | | 1000|acc |
|blimp_intransitive | |✓ | | 1000|acc |
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc |
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc |
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc |
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc |
|blimp_npi_present_1 | |✓ | | 1000|acc |
|blimp_npi_present_2 | |✓ | | 1000|acc |
|blimp_only_npi_licensor_present | |✓ | | 1000|acc |
|blimp_only_npi_scope | |✓ | | 1000|acc |
|blimp_passive_1 | |✓ | | 1000|acc |
|blimp_passive_2 | |✓ | | 1000|acc |
|blimp_principle_A_c_command | |✓ | | 1000|acc |
|blimp_principle_A_case_1 | |✓ | | 1000|acc |
|blimp_principle_A_case_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_1 | |✓ | | 1000|acc |
|blimp_principle_A_domain_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_3 | |✓ | | 1000|acc |
|blimp_principle_A_reconstruction | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc |
|blimp_sentential_subject_island | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc |
|blimp_transitive | |✓ | | 1000|acc |
|blimp_wh_island | |✓ | | 1000|acc |
|blimp_wh_questions_object_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc |
|boolq |✓ |✓ | | 3270|acc |
|cb |✓ |✓ | | 56|acc, f1 |
|cola |✓ |✓ | | 1043|mcc |
|copa |✓ |✓ | | 100|acc |
|coqa |✓ |✓ | | 500|f1, em |
|cycle_letters | |✓ | | 10000|acc |
|drop |✓ |✓ | | 9536|em, f1 |
|ethics_cm |✓ | |✓ | 3885|acc |
|ethics_deontology |✓ | |✓ | 3596|acc, em |
|ethics_justice |✓ | |✓ | 2704|acc, em |
|ethics_utilitarianism |✓ | |✓ | 4808|acc |
|ethics_utilitarianism_original | | |✓ | 4808|acc |
|ethics_virtue |✓ | |✓ | 4975|acc, em |
|gsm8k |✓ | |✓ | 1319|acc |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|hellaswag |✓ |✓ | | 10042|acc, acc_norm |
|hendrycksTest-abstract_algebra | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-anatomy | |✓ |✓ | 135|acc, acc_norm |
|hendrycksTest-astronomy | |✓ |✓ | 152|acc, acc_norm |
|hendrycksTest-business_ethics | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-clinical_knowledge | |✓ |✓ | 265|acc, acc_norm |
|hendrycksTest-college_biology | |✓ |✓ | 144|acc, acc_norm |
|hendrycksTest-college_chemistry | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_computer_science | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_mathematics | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-college_medicine | |✓ |✓ | 173|acc, acc_norm |
|hendrycksTest-college_physics | |✓ |✓ | 102|acc, acc_norm |
|hendrycksTest-computer_security | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-conceptual_physics | |✓ |✓ | 235|acc, acc_norm |
|hendrycksTest-econometrics | |✓ |✓ | 114|acc, acc_norm |
|hendrycksTest-electrical_engineering | |✓ |✓ | 145|acc, acc_norm |
|hendrycksTest-elementary_mathematics | |✓ |✓ | 378|acc, acc_norm |
|hendrycksTest-formal_logic | |✓ |✓ | 126|acc, acc_norm |
|hendrycksTest-global_facts | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_biology | |✓ |✓ | 310|acc, acc_norm |
|hendrycksTest-high_school_chemistry | |✓ |✓ | 203|acc, acc_norm |
|hendrycksTest-high_school_computer_science | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_european_history | |✓ |✓ | 165|acc, acc_norm |
|hendrycksTest-high_school_geography | |✓ |✓ | 198|acc, acc_norm |
|hendrycksTest-high_school_government_and_politics | |✓ |✓ | 193|acc, acc_norm |
|hendrycksTest-high_school_macroeconomics | |✓ |✓ | 390|acc, acc_norm |
|hendrycksTest-high_school_mathematics | |✓ |✓ | 270|acc, acc_norm |
|hendrycksTest-high_school_microeconomics | |✓ |✓ | 238|acc, acc_norm |
|hendrycksTest-high_school_physics | |✓ |✓ | 151|acc, acc_norm |
|hendrycksTest-high_school_psychology | |✓ |✓ | 545|acc, acc_norm |
|hendrycksTest-high_school_statistics | |✓ |✓ | 216|acc, acc_norm |
|hendrycksTest-high_school_us_history | |✓ |✓ | 204|acc, acc_norm |
|hendrycksTest-high_school_world_history | |✓ |✓ | 237|acc, acc_norm |
|hendrycksTest-human_aging | |✓ |✓ | 223|acc, acc_norm |
|hendrycksTest-human_sexuality | |✓ |✓ | 131|acc, acc_norm |
|hendrycksTest-international_law | |✓ |✓ | 121|acc, acc_norm |
|hendrycksTest-jurisprudence | |✓ |✓ | 108|acc, acc_norm |
|hendrycksTest-logical_fallacies | |✓ |✓ | 163|acc, acc_norm |
|hendrycksTest-machine_learning | |✓ |✓ | 112|acc, acc_norm |
|hendrycksTest-management | |✓ |✓ | 103|acc, acc_norm |
|hendrycksTest-marketing | |✓ |✓ | 234|acc, acc_norm |
|hendrycksTest-medical_genetics | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-miscellaneous | |✓ |✓ | 783|acc, acc_norm |
|hendrycksTest-moral_disputes | |✓ |✓ | 346|acc, acc_norm |
|hendrycksTest-moral_scenarios | |✓ |✓ | 895|acc, acc_norm |
|hendrycksTest-nutrition | |✓ |✓ | 306|acc, acc_norm |
|hendrycksTest-philosophy | |✓ |✓ | 311|acc, acc_norm |
|hendrycksTest-prehistory | |✓ |✓ | 324|acc, acc_norm |
|hendrycksTest-professional_accounting | |✓ |✓ | 282|acc, acc_norm |
|hendrycksTest-professional_law | |✓ |✓ | 1534|acc, acc_norm |
|hendrycksTest-professional_medicine | |✓ |✓ | 272|acc, acc_norm |
|hendrycksTest-professional_psychology | |✓ |✓ | 612|acc, acc_norm |
|hendrycksTest-public_relations | |✓ |✓ | 110|acc, acc_norm |
|hendrycksTest-security_studies | |✓ |✓ | 245|acc, acc_norm |
|hendrycksTest-sociology | |✓ |✓ | 201|acc, acc_norm |
|hendrycksTest-us_foreign_policy | |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-virology | |✓ |✓ | 166|acc, acc_norm |
|hendrycksTest-world_religions | |✓ |✓ | 171|acc, acc_norm |
|iwslt17-ar-en | | |✓ | 1460|bleu, chrf, ter |
|iwslt17-en-ar | | |✓ | 1460|bleu, chrf, ter |
|lambada_openai | |✓ | | 5153|ppl, acc |
|lambada_openai_cloze | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_de | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_en | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_es | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_fr | |✓ | | 5153|ppl, acc |
|lambada_openai_mt_it | |✓ | | 5153|ppl, acc |
|lambada_standard | |✓ |✓ | 5153|ppl, acc |
|lambada_standard_cloze | |✓ |✓ | 5153|ppl, acc |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
|math_algebra |✓ | |✓ | 1187|acc |
|math_asdiv | |✓ | | 2305|acc |
|math_counting_and_prob |✓ | |✓ | 474|acc |
|math_geometry |✓ | |✓ | 479|acc |
|math_intermediate_algebra |✓ | |✓ | 903|acc |
|math_num_theory |✓ | |✓ | 540|acc |
|math_prealgebra |✓ | |✓ | 871|acc |
|math_precalc |✓ | |✓ | 546|acc |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mrpc |✓ |✓ | | 408|acc, f1 |
|multirc |✓ |✓ | | 4848|acc |
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|pile_arxiv | |✓ |✓ | 2407|word_perplexity, byte_perplexity, bits_per_byte |
|pile_bookcorpus2 | |✓ |✓ | 28|word_perplexity, byte_perplexity, bits_per_byte |
|pile_books3 | |✓ |✓ | 269|word_perplexity, byte_perplexity, bits_per_byte |
|pile_dm-mathematics | |✓ |✓ | 1922|word_perplexity, byte_perplexity, bits_per_byte |
|pile_enron | |✓ |✓ | 1010|word_perplexity, byte_perplexity, bits_per_byte |
|pile_europarl | |✓ |✓ | 157|word_perplexity, byte_perplexity, bits_per_byte |
|pile_freelaw | |✓ |✓ | 5101|word_perplexity, byte_perplexity, bits_per_byte |
|pile_github | |✓ |✓ | 18195|word_perplexity, byte_perplexity, bits_per_byte |
|pile_gutenberg | |✓ |✓ | 80|word_perplexity, byte_perplexity, bits_per_byte |
|pile_hackernews | |✓ |✓ | 1632|word_perplexity, byte_perplexity, bits_per_byte |
|pile_nih-exporter | |✓ |✓ | 1884|word_perplexity, byte_perplexity, bits_per_byte |
|pile_opensubtitles | |✓ |✓ | 642|word_perplexity, byte_perplexity, bits_per_byte |
|pile_openwebtext2 | |✓ |✓ | 32925|word_perplexity, byte_perplexity, bits_per_byte |
|pile_philpapers | |✓ |✓ | 68|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pile-cc | |✓ |✓ | 52790|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-abstracts | |✓ |✓ | 29895|word_perplexity, byte_perplexity, bits_per_byte |
|pile_pubmed-central | |✓ |✓ | 5911|word_perplexity, byte_perplexity, bits_per_byte |
|pile_stackexchange | |✓ |✓ | 30378|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_youtubesubtitles | |✓ |✓ | 342|word_perplexity, byte_perplexity, bits_per_byte |
|piqa |✓ |✓ | | 1838|acc, acc_norm |
|prost | | |✓ | 18736|acc, acc_norm |
|pubmedqa | | |✓ | 1000|acc |
|qa4mre_2011 | | |✓ | 120|acc, acc_norm |
|qa4mre_2012 | | |✓ | 160|acc, acc_norm |
|qa4mre_2013 | | |✓ | 284|acc, acc_norm |
|qasper |✓ |✓ | | 1764|f1_yesno, f1_abstractive |
|qnli |✓ |✓ | | 5463|acc |
|qqp |✓ |✓ | | 40430|acc, f1 |
|race |✓ |✓ |✓ | 1045|acc |
|random_insertion | |✓ | | 10000|acc |
|record |✓ |✓ | | 10000|f1, em |
|reversed_words | |✓ | | 10000|acc |
|rte |✓ |✓ | | 277|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|sst |✓ |✓ | | 872|acc |
|swag |✓ |✓ | | 20006|acc, acc_norm |
|triviaqa |✓ |✓ | | 11313|acc |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
|webqs |✓ | |✓ | 2032|acc |
|wic |✓ |✓ | | 638|acc |
|wikitext |✓ |✓ |✓ | 62|word_perplexity, byte_perplexity, bits_per_byte |
|winogrande |✓ |✓ | | 1267|acc |
|wmt14-en-fr | | |✓ | 3003|bleu, chrf, ter |
|wmt14-fr-en | | |✓ | 3003|bleu, chrf, ter |
|wmt16-de-en | | |✓ | 2999|bleu, chrf, ter |
|wmt16-en-de | | |✓ | 2999|bleu, chrf, ter |
|wmt16-en-ro | | |✓ | 1999|bleu, chrf, ter |
|wmt16-ro-en | | |✓ | 1999|bleu, chrf, ter |
|wmt20-cs-en | | |✓ | 664|bleu, chrf, ter |
|wmt20-de-en | | |✓ | 785|bleu, chrf, ter |
|wmt20-de-fr | | |✓ | 1619|bleu, chrf, ter |
|wmt20-en-cs | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-de | | |✓ | 1418|bleu, chrf, ter |
|wmt20-en-iu | | |✓ | 2971|bleu, chrf, ter |
|wmt20-en-ja | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-km | | |✓ | 2320|bleu, chrf, ter |
|wmt20-en-pl | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-ps | | |✓ | 2719|bleu, chrf, ter |
|wmt20-en-ru | | |✓ | 2002|bleu, chrf, ter |
|wmt20-en-ta | | |✓ | 1000|bleu, chrf, ter |
|wmt20-en-zh | | |✓ | 1418|bleu, chrf, ter |
|wmt20-fr-de | | |✓ | 1619|bleu, chrf, ter |
|wmt20-iu-en | | |✓ | 2971|bleu, chrf, ter |
|wmt20-ja-en | | |✓ | 993|bleu, chrf, ter |
|wmt20-km-en | | |✓ | 2320|bleu, chrf, ter |
|wmt20-pl-en | | |✓ | 1001|bleu, chrf, ter |
|wmt20-ps-en | | |✓ | 2719|bleu, chrf, ter |
|wmt20-ru-en | | |✓ | 991|bleu, chrf, ter |
|wmt20-ta-en | | |✓ | 997|bleu, chrf, ter |
|wmt20-zh-en | | |✓ | 2000|bleu, chrf, ter |
|wnli |✓ |✓ | | 71|acc |
|wsc |✓ |✓ | | 104|acc |
|wsc273 | | |✓ | 273|acc |
@@ -16,6 +16,20 @@ from lm_eval import metrics
from lm_eval.base import Task, rf
from typing import List

try:
    import nagisa

    HAS_NAGISA = True
except ImportError:
    HAS_NAGISA = False

try:
    import jieba

    HAS_JIEBA = True
except ImportError:
    HAS_JIEBA = False

_CITATION = """
@inproceedings{post-2018-call,
@@ -63,14 +77,22 @@ def create_tasks_from_benchmarks(benchmark_dict):

def zh_split(zh_text: List[str]) -> List[str]:
    """Chinese splitting"""
    if not HAS_JIEBA:
        raise ImportError(
            "Chinese text splitting requires the `jieba` package. "
            "Please install it with:\npip install jieba"
        )
    return [" ".join(jieba.cut(txt.strip())) for txt in zh_text]


def ja_split(ja_text: List[str]) -> List[str]:
    """Japanese splitting"""
    if not HAS_NAGISA:
        raise ImportError(
            "Japanese text splitting requires the `nagisa` package. "
            "Please install it with:\npip install nagisa"
        )
    return [" ".join(nagisa.tagging(txt.strip()).words) for txt in ja_text]
...
@@ -27,6 +27,14 @@ from lm_eval.base import rf, Task
from lm_eval.metrics import mean

try:
    import bleurt

    HAS_BLEURT = True
except ImportError:
    HAS_BLEURT = False

_CITATION = """
@misc{lin2021truthfulqa,
    title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
@@ -164,6 +172,12 @@ class TruthfulQAGeneration(Task):
    def __init__(self):
        super().__init__()
        if not HAS_BLEURT:
            raise ImportError(
                "`TruthfulQAGeneration` requires the `bleurt` package. Please install it with:\n"
                "pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt"
                "\nWARNING: Installing any other version of bleurt may result in different results."
            )
        self.bleurt = datasets.load_metric("bleurt")

    def has_training_docs(self):
...
@@ -5,7 +5,6 @@ import collections
import functools
import inspect
import sys
from typing import List

@@ -187,6 +186,8 @@ def run_task_tests(task_list: List[str]):
    """
    Find the package root and run the tests for the given tasks
    """
    import pytest

    package_root = find_test_root(start_path=pathlib.Path(__file__))
    task_string = " or ".join(task_list)
    args = [
...
"""
Usage:
python make_table_tasks.py --output <markdown_filename>
"""
import argparse
import logging
from lm_eval import tasks from lm_eval import tasks
from pytablewriter import MarkdownTableWriter from pytablewriter import MarkdownTableWriter
writer = MarkdownTableWriter()
writer.headers = ["Task Name", "Train", "Val", "Test", "Val/Test Docs", "Metrics"]
values = [] logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def chk(tf): def check(tf):
if tf: if tf:
return "✓" return "✓"
else: else:
return " " return " "
for tname, Task in tasks.TASK_REGISTRY.items(): if __name__ == "__main__":
task = Task() parser = argparse.ArgumentParser()
parser.add_argument("--output", type=str, default="task_table.md")
args = parser.parse_args()
writer = MarkdownTableWriter()
writer.headers = ["Task Name", "Train", "Val", "Test", "Val/Test Docs", "Metrics"]
values = []
tasks = tasks.TASK_REGISTRY.items()
tasks = sorted(tasks, key=lambda x: x[0])
for tname, Task in tasks:
task = Task()
v = [ v = [
tname, tname,
chk(task.has_training_docs()), check(task.has_training_docs()),
chk(task.has_validation_docs()), check(task.has_validation_docs()),
chk(task.has_test_docs()), check(task.has_test_docs()),
len(list(task.test_docs() if task.has_test_docs() else task.validation_docs())), len(
list(
task.test_docs() if task.has_test_docs() else task.validation_docs()
)
),
", ".join(task.aggregation().keys()), ", ".join(task.aggregation().keys()),
] ]
print(v) logger.info(v)
values.append(v) values.append(v)
writer.value_matrix = values
writer.value_matrix = values table = writer.dumps()
with open(args.output, "w") as f:
print(writer.dumps()) f.write(table)
@@ -14,6 +14,7 @@ setuptools.setup(
    url="https://github.com/EleutherAI/lm-evaluation-harness",
    packages=setuptools.find_packages(),
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",

@@ -21,29 +22,23 @@ setuptools.setup(
    python_requires=">=3.6",
    install_requires=[
        "datasets>=2.0.0",
        "jsonlines",
        "numexpr",
        "openai>=0.6.4",
        "pybind11>=2.6.2",
        "pycountry",
        "pytablewriter",
        "rouge-score>=0.0.4",
        "sacrebleu==1.5.0",
        "scikit-learn>=0.24.1",
        "sqlitedict",
        "torch>=1.7",
        "tqdm-multiprocess",
        "transformers>=4.1",
        "zstandard",
    ],
    extras_require={
        "dev": ["black", "flake8", "pre-commit", "pytest", "pytest-cov"],
        "multilingual": ["nagisa>=0.2.7", "jieba>=0.42.1"],
    },
)
@@ -258,8 +258,9 @@ def textsynth_mock_completion(**kwargs):
    import requests

    os.makedirs("tests/testdata", exist_ok=True)
    hash_kwargs = {k: v for k, v in kwargs.items() if k != "headers"}
    hash = hashlib.sha256(
        json.dumps(hash_kwargs, sort_keys=True).encode("utf-8")
    ).hexdigest()
    fname = f"tests/testdata/textsynth_test_{hash}.pkl"
...
@@ -7,10 +7,7 @@ from itertools import islice

@pytest.mark.parametrize("taskname,task_class", tasks.TASK_REGISTRY.items())
def test_basic_interface(taskname, task_class):
    print("Evaluating task", taskname)
    task = task_class()
    assert task.has_training_docs() in [True, False]
    assert task.has_validation_docs() in [True, False]
...
@@ -51,7 +51,7 @@ def flatten(d, parent_key="", sep="."):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
...