"vscode:/vscode.git/clone" did not exist on "9c42a83a26e46ab414d854c6acceb3560d4adf52"
Unverified commit a6c640d3, authored by Lintang Sutawika, committed by GitHub

Merge branch 'big-refactor' into seq2seq-refactor

parents 55eccc29 24e3e3fa
@@ -9,5 +9,5 @@ jobs:
       - uses: actions/checkout@v3
       - uses: actions/setup-python@v4
         with:
-          python-version: 3.8
+          python-version: 3.9
       - uses: pre-commit/action@v2.0.3
@@ -12,7 +12,7 @@ repos:
       - id: check-merge-conflict
       - id: check-symlinks
       - id: check-yaml
-        args: ['--unsafe']
+        args: ["--unsafe"]
       - id: destroyed-symlinks
       - id: detect-private-key
       - id: end-of-file-fixer
@@ -33,7 +33,7 @@ repos:
     rev: 22.3.0
     hooks:
       - id: black
-        language_version: python3.8
+        language_version: python3.9
   - repo: https://github.com/codespell-project/codespell
     rev: v2.1.0
     hooks:
-* @jon-tow @StellaAthena
+* @haileyschoelkopf @lintangsutawika
# Language Model Evaluation Harness
## Notice to Users
(as of 6/15/23)
We have a revamp of the Evaluation Harness library internals staged on the [big-refactor](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) branch! It is far along in progress, but before we start to move the `master` branch of the repository over to this new design with a new version release, we'd like to ensure that it's been tested by outside users and there are no glaring bugs.
We'd like your help to test it out! You can help by:
1. Trying out your current workloads on the big-refactor branch, and seeing if anything breaks or is counterintuitive,
2. Porting tasks supported in the previous version of the harness to the new YAML configuration format. Please check out our [task implementation guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md) for more information.
If you choose to port a task not yet completed according to [our checklist](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/README.md), then you can contribute it by opening a PR containing [Refactor] in the name with:
- A command of the form `python main.py --model hf-causal --model_args ..... --tasks <task name> ...` which will run the task in the `master` branch, and what the score is
- A command of the form `python main.py --model hf-causal --model_args ..... --tasks <task name> ...` to run the task in your PR branch to `big-refactor`, and what the resulting score is, to show that we achieve equality between the two implementations.
Lastly, we'll no longer be accepting new feature requests beyond those that are already open to the master branch as we carry out this switch to the new version over the next week, though we will be accepting bugfixes to `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EAI discord for more information.
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features:
- Many tasks implemented: 200+ tasks from the old framework, which require porting to the new setup as described in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
**Evaluation Overview**
`Task` and `Prompt` classes contain information that, when combined, produces the input to the language model. The language model is then queried to obtain an output. One or more `Filters` can then be applied to perform arbitrary operations on the model's raw output, such as selecting the final answer (for chain of thought) or calling an external API. This final output is then evaluated using a `Metric` to obtain the final result.
```mermaid
graph LR;
classDef empty width:0px,height:0px;
T[Task]
I[Input]
F[Filter]
M[Model]
O[Output]:::empty
P[Prompt]
Me[Metric]
R[Result]
T --- I:::empty
P --- I
I --> M
M --> O
O --> F
Me --> R:::empty
F --> R
```
## Install
To install the `lm-eval` refactor branch from the github repository, run:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout big-refactor
pip install -e .
```
@@ -55,50 +44,64 @@
To install additional multilingual tokenization and text segmentation packages, run:
```bash
pip install -e ".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `auto-gptq` extra:
```bash
pip install -e ".[auto-gptq]"
```
## Basic Usage
### Hugging Face `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks lambada_openai,hellaswag \
    --device cuda:0 \
    --batch_size 8
```
### Multi-GPU Evaluation with Hugging Face `transformers`
To parallelize evaluation across multiple GPUs, we allow for launching evaluation via the `accelerate` library as follows:
```bash
accelerate launch main.py \
    --model hf-causal \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```
### Evaluation of Seq2Seq Models
To evaluate models that are loaded via `AutoSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface, you instead use `--model hf-seq2seq`. Support for this model type is currently pending.
> **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring.
### Commercial APIs
Our library also supports language models served via the OpenAI API:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model openai \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag
```
@@ -109,13 +112,15 @@
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
    --model openai \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag \
    --check_integrity
```
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
@@ -129,44 +134,34 @@ python write_out.py \
This will write out one text file for each task.
## Advanced Usage
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq
```
GPTQ quantized models can be loaded by specifying their file names in `,quantized=NAME` (or `,quantized=True` for default names) in the `model_args` argument:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=model-name-or-path,quantized=model.safetensors,gptq_use_triton=True \
    --tasks hellaswag
```
We support wildcards in task names, for example you can run all of the machine-translated lambada tasks via `--task lambada_openai_mt_*`.
## Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md, and welcome contributions of novel task templates and task variants.
## Cite as
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
## Desired Pages
* [ ] YAML explainer
* [ ] Explainer on filters + advanced features
* [ ] Walkthrough start-to-finish of adding a new task to codebase
* [ ] Explaining registries + decorators
* [ ] model_guide.md for adding new model API
* [ ] guide to writing an adapter to new advanced codebase (e.g. NeoX)
* [ ] Parallelism guide (?)
# Advanced Task Configuration
The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable such that providing the YAML config enables another researcher to precisely replicate the evaluation setup used by another, in the case that the prompt or setup differs from standard `lm-eval` task implementations.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups require extra configuration. Here we'll provide a crash course on the more advanced logic implementable in YAML form available to users.
If your intended task relies on features beyond what are described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on Github, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI discord.
## Configurations
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task; a minimal example config is sketched after the field list.
### Parameters
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot examples from. Must be set if `num_fewshot` > 0, and should generally not be the same split that is being evaluated.
- **template_aliases** (`str`, *optional*) — Jinja2 variable definitions (`{% ... %}` clauses) prepended to both `doc_to_text` and `doc_to_target`; useful for renaming dataset columns or defining `answer_choices`.
- **aliases**: (`Union[str, list]`, *optional*) —
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs for each sample. can be used for cases such as self-consistency.
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that is scored against. Useful when `doc_to_target` must be the "target string" appended to each fewshot exemplar's input, while the `gold` value passed to the metric function should come from `gold_alias` instead.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) — Whether to decontaminate this task's documents against a training corpus.
- **doc_to_decontamination_query** (`str`, *optional*) — Which portion of each document (e.g. the question text) to use as the query when checking for contamination.
- **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use, if defined will overwrite doc_to_text and doc_to_target.
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
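To make these fields concrete, below is a minimal sketch of a task config using a handful of the fields above. The dataset path, column names, and metric choice are illustrative assumptions rather than a shipped task:
```yaml
task: example_qa                      # hypothetical task name
group:
  - qa
dataset_path: my_org/my_qa_dataset    # hypothetical HF Hub dataset
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
num_fewshot: 0
output_type: greedy_until
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```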
## Filters
A key component of the `lm-evaluation-harness` library is the `Filter` object. In a typical evaluation run of the harness, we take the formatted inputs and run them through our LM, with the appropriate output type (greedy or free-form generation, or loglikelihood-based comparative scoring).
After getting scores or output text from our LM on each `Instance` or document in the dataset, we then need to feed these responses into a metric or scoring function to return scores to a user.
However, certain tasks may require more complex behavior than directly turning over model outputs to a metric function. For example, we may want to post-process our output text by truncating it or extracting a model's answer, we may want to ensemble over multiple "takes" on a different document, et cetera.
**Detailed Aside**:
We do such post-processing by operating on *responses*, which are stored after running an LM on an `Instance` from the task in `Instance.resps`.
`resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`.
Our filters, after completing a pipeline, must return a `List[<expected return type from model>]` which we then unpack and store each element of in `Instance.filtered_resps` for the corresponding instance. Thus, we take as input a list of returns from our model for each doc, and must return a return from our model *without it being wrapped in a list* for each doc.
**End Aside**
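To make those types concrete, here is a small plain-Python sketch of the data flow (illustration only, not the library's actual classes), using a terminal take-first style step:
```python
from typing import List

def take_first(resps: List[List[str]]) -> List[str]:
    # Input: one list of model responses per document.
    # Output: a single response per document, unwrapped from its list.
    return [doc_resps[0] for doc_resps in resps]

# e.g. two documents, each with three sampled responses:
resps = [["42", "41", "42"], ["7", "7", "8"]]
filtered = take_first(resps)  # ["42", "7"], stored per-instance in `filtered_resps`
```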
A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome!
### Multiple Filter Pipelines
Tasks need not be limited to a single filter pipeline. We enable users to run multiple, distinct, filter pipelines on *the same model outputs* generated in one run on a task.
As a case study, let's look at an implementation of solving the Gsm8k math word problem benchmark in `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`. Here, we are emulating the setup used by [Self-Consistency Improves Chain of Thought Prompting](https://arxiv.org/abs/2203.11171), in which evaluation is performed by generating N chain-of-thought outputs from a model via temperature-based sampling, then selecting the answers output by the model at the end of the chains of thought, then majority voting across all those numeric answers.
Within our YAML file:
```yaml
...
repeats: 64
filter_list:
- name: "score-first"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "take_first"
- name: "maj@64"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "majority_vote"
- function: "take_first"
- name: "maj@8"
filter:
- function: "take_first_k"
k: 8
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "majority_vote"
- function: "take_first"
```
We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence.
Our first filter pipeline implements
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first out of the 64 model answers
Then scoring this single answer.
```yaml
- name: "score-first"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "take_first"
```
Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
- applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem
- applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each
- taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document.
```yaml
- name: "maj@64"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "majority_vote"
- function: "take_first"
```
Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml
- name: "maj@8"
filter:
- function: "take_first_k"
k: 8
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "majority_vote"
- function: "take_first"
```
Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines.
## Embedded Python Code
You can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<python function name>` (see the sketch after this list). This feature can be used for the following arguments:
1. `doc_to_text`
2. `doc_to_target`
3. `gold_alias`
4. `aggregation` for a `metric` in `metric_list`
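For example, assuming a hypothetical `utils.py` sitting next to the YAML file:
```yaml
# utils.py (in the same directory) is assumed to define doc_to_text(doc) and doc_to_target(doc)
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
```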
## (No Longer Recommended) Direct `Task` Subclassing
The prior implementation method of new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the Task class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`.
## Including a Base YAML
You can base a YAML on another YAML file as a template. This can be handy when you need to change only the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory; otherwise, you will need to provide the full path.
```
include: <YAML filename or with full path>
...
```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml), which is based on [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).
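As a minimal sketch, a variant YAML that reuses a base task but swaps out the prompt might look like the following (file and task names are illustrative):
```yaml
include: my_base_task.yaml
task: my_base_task_alt_prompt
doc_to_text: "Q: {{question}}\nA:"
```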
## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```yaml
metric_list:
- metric: acc
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
```
### Natively Supported Metrics
Here we list all metrics currently supported natively in `lm-eval`:
Metrics:
* `acc` (accuracy)
* `acc_norm` (length-normalized accuracy)
* `acc_mutual_info` (baseline loglikelihood - normalized accuracy)
* `perplexity`
* `word_perplexity` (perplexity per word)
* `byte_perplexity` (perplexity per byte)
* `bits_per_byte`
* `matthews_corrcoef` (Matthews correlation coefficient)
* `f1` (F1 score)
* `bleu`
* `chrf`
* `ter`
Aggregation functions:
* `mean`
* `median`
* `perplexity`
* `weighted_perplexity`
* `bits_per_byte`
## Good Reference Tasks
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:
Multiple choice tasks:
- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)
Corpus perplexity evaluations:
- Wikitext (`lm_eval/tasks/wikitext/wikitext.yaml`)
Generative tasks:
- GSM8k (`lm_eval/tasks/gsm8k/gsm8k.yaml`)
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
# New Task Guide
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will become v0.5.0 in the future).
## Setup
If you haven't already, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <task-name>
pip install -e ".[dev]"
```
As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (a *generative* task which requires sampling text from a model) and the `sciq` benchmark (a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices).
## Creating a YAML file
To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,
```sh
touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
```
Or, copy the template subfolder we provide from `templates/new_yaml_task`:
```sh
cp -r templates/new_yaml_task lm_eval/tasks/
```
and rename the folders and YAML file(s) as desired.
### Selecting and configuring a dataset
All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
Once you have a HuggingFace dataset prepared for your task, we want to assign our new YAML to use this dataset:
```yaml
dataset_path: ... # the name of the dataset on the HF Hub.
dataset_name: ... # the dataset configuration to use. Leave `null` if your dataset does not require a config to be passed. See https://huggingface.co/docs/datasets/load_hub#configurations for more info.
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:
```yaml
training_split: <split name of training set, or `null`>
validation_split: <split name of val. set, or `null`>
test_split: <split name of test set, or `null`>
```
Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`.
We can also specify from which split the task should retrieve few-shot examples via:
```yaml
fewshot_split: <split name to draw fewshot examples from, or `null`>
```
though if this is not set, we will default to train/validation/test sets, in that order.
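Putting these fields together for the SciQ example (assuming the standard `sciq` dataset on the HF Hub, which ships train/validation/test splits):
```yaml
dataset_path: sciq
dataset_name: null
training_split: train
validation_split: validation
test_split: test
```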
### Writing a prompt with Jinja 2
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users are required to write two YAML fields in Jinja as strings:
```yaml
doc_to_text:
doc_to_target:
```
Suppose our dataset has a `"question"` field, and an `"answer"` field, which are both strings. We want the model to see, if given a `document` object that is a row of our dataset:
```
Question: {document[question]}
Answer:
```
We do this by writing
```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
```
Here, `{{question}}` will be replaced by `doc["question"]` when the prompt template is rendered.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml
doc_to_target: "{{answer}}"
```
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` should not end with trailing whitespace, and `doc_to_target` should not begin with leading whitespace.
Users can also fill out the optional `template_aliases` YAML field, which is prepended to both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while keeping the main prompt fields easy to parse visually.
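For instance, a dataset whose answer lives in a clumsily named column (hypothetical here) could be renamed once in `template_aliases` and then used cleanly in the prompt fields:
```yaml
template_aliases: "{% set answer = answer_text_field_1 %}"
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```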
### Using Python Functions for Prompts
There may be cases where the prompt we want to implement is more easily expressed in Python than in Jinja 2. For this, we can use Python helper functions that are referenced in the YAML config. Note that the function's script must be in the same directory as the YAML file.
A good example is WikiText, which requires a lot of regex rules to clean the samples.
```python
import re

def wikitext_detokenizer(doc):
    string = doc["page"]
    # contractions
    string = string.replace("s '", "s'")
    string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string)
    ...
    string = string.replace(" 's", "'s")
    return string
```
We can load this function in `doc_to_target` by using the `!function` operator after `doc_to_target`, followed by `<file name>.<function name>`. In the file [wikitext.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/6ae376e3a43caa58b95bb8aa73054a94827bf560/lm_eval/tasks/wikitext/wikitext.yaml) we write:
```yaml
doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
```
### Importing a Prompt from Promptsource
[Promptsource](https://github.com/bigscience-workshop/promptsource/tree/main/promptsource) is a great repository for crowdsourced prompts for many datasets. We can load these prompts easily by using the `use_prompt` argument and filling it with the format `"promptsource:<name of prompt template>"`. To use this, `doc_to_text` and `doc_to_target` should be left undefined. This will fetch the template of the dataset defined in the YAML file.
For example, for SuperGLUE BoolQ, if we want to use the prompt template `GPT-3 Style`, we can add this to the YAML file:
```yaml
use_prompt: "promptsource:GPT-3 Style"
```
#### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words for each document) and evaluated via comparing loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
An annotated example in the case of SciQ is as follows:
```yaml
template_aliases: "{% set answer_choices = [distractor1, distractor2, distractor3, correct_answer] %}{% set gold = 3 %}" # `template_aliases` must set the list of possible answer choices to the jinja variable `answer_choices` (List[str]), and set what the index within `answer_choices` of this doc's gold label (correct answer choice).
doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as target for each choice in answer_choices.
doc_to_target: "{{gold}}" # this must be castable to an integer. It must output only the index within `answer_choices` that is the correct label.
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
### Setting metrics
You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice?
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*?
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
```yaml
metric_list:
- metric: <name of the metric here>
aggregation: <name of the aggregation fn here>
higher_is_better: <true or false>
- metric: ...
aggregation: ...
higher_is_better: ...
```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults, if using a natively supported metric.
For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.
### Optional, More Advanced Setup
Some tasks may require more advanced processing logic than is described in this guide.
As a heuristic check:
* Does your task require generating multiple free-form outputs per input document?
* Does your task require complex, multi-step post-processing of generated model outputs?
* Does your task require subsetting documents on the fly based on their content?
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see `docs/advanced_task_guide.md` . If none of the above sound like they apply to your task, it's time to continue onto checking your task performance!
### Task name + groups (registering a task)
To test a task conveniently, it helps to *register* the task--that is, to give it a name and make the `lm-eval` library aware it exists!
If you're writing your YAML file inside the `lm_eval/tasks` folder, you just need to give your task a name! You can do this inside your YAML file:
```yaml
task: <name of the task>
```
Including a task name is mandatory.
It is often also convenient to label your task with several `groups`, or tags, though this field is optional:
```yaml
group:
- group1
- group2
```
This will add your task to the `group1` and `group2` groups, letting people know how to categorize your task and, if desired, run all tasks in one of these groups at once, your task included.
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.
You can do this via adding the Python snippet
```python
from lm_eval.tasks import include_task_folder
include_task_folder("/path/to/yaml/parent/folder")
```
to the top of any Python file that is run or imported when performing evaluation, such as `main.py`.
Passing `--tasks /path/to/yaml/file` is also accepted.
## Checking validity
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
```bash
python -m scripts.write_out \
--output_base_path <path> \
--tasks <your-task-name> \
--sets <train | val | test> \
--num_fewshot K \
--num_examples N
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented.
### Task impl. checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
# `Task` Guide
The `Task` class is the foundation of all natural language tasks in the `lm-evaluation-harness` (harness). It encompasses everything you’d need to perform few-shot evaluation of an autoregressive language model. Here we’ll provide a step-by-step guide on how to subclass `Task` to create your very own task/s.
## Setup
If you haven't already, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <task-name>
pip install -e ".[dev]"
```
## Creating Your Task File
From the `lm-evaluation-harness` project root, copy the `new_task.py` template into `lm_eval/tasks`.
```sh
cp templates/new_task.py lm_eval/tasks/<task-name>.py
```
or if your task is **multiple-choice**, the `new_multiple_choice_task.py`:
```sh
cp templates/new_multiple_choice_task.py lm_eval/tasks/<task-name>.py
```
This will set you up with a few `TODO`s to fill in, which we'll now go over in detail.
## Task Heading
Open the file you've just created and add a multiline docstring on the first line with the following contents:
```python
"""
<Paper title>
<Paper PDF URL>
<Short description of task>
Homepage: <URL to task's homepage>
"""
```
For example, take the QuAC dataset. We have:
```python
"""
QuAC: Question Answering in Context
https://arxiv.org/abs/1808.07036
Question Answering in Context (QuAC) is a dataset for modeling, understanding, and
participating in information seeking dialog. Data instances consist of an interactive
dialog between two crowd workers: (1) a student who poses a sequence of freeform
questions to learn as much as possible about a hidden Wikipedia text, and (2)
a teacher who answers the questions by providing short excerpts (spans) from the text.
Homepage: https://quac.ai/
"""
```
Next, at the module-level, create a constant variable named
`_CITATION` that contains the citation information for your task in BibTeX format.
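For QuAC this might look as follows (entry abbreviated here; double-check against the paper's official BibTeX):
```python
_CITATION = """
@article{choi2018quac,
    title={QuAC: Question Answering in Context},
    author={Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke},
    journal={arXiv preprint arXiv:1808.07036},
    year={2018}
}
"""
```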
Now let's walk through the actual implementation - from data handling to evaluation.
## Data Handling
### Downloading your Data
All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md)
.
Now, that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:
```python
class TaskName(...):
DATASET_PATH = "..."
DATASET_NAME = "..."
```
where `DATASET_PATH` is the name of the dataset as listed by HF in the `datasets` Hub and `DATASET_NAME` is the name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just set `DATASET_NAME = None`.
(If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
Next up, we have to set some “flags”:
```python
def has_training_docs(self):
return # True/False
def has_validation_docs(self):
return # True/False
def has_test_docs(self):
return # True/False
```
These methods return `True`/`False` depending on whether your task dataset provides documents for each split type. __Note__: if the test set does not have publicly available answer labels, please do not mark the task as having a test set; return `False`.
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
return #...
def validation_docs(self):
return #...
def test_docs(self):
return #...
```
These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples.
#### Processing Documents
At this point, you can also process each individual document to, for example, strip whitespace or "detokenize" its fields. Put the processing logic into `_process_doc` and map the functions across training/validation/test docs inside of the respective functions.
🔠 If your task is **multiple-choice**, we require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/6caa0afd96a7a7efb2ec4c1f24ad1756e48f3aa7/lm_eval/tasks/sat.py#L60) for an example. 🔠
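A minimal sketch of this pattern, assuming the HF dataset has already been loaded into `self.dataset` (as the base `Task`'s default `download` does) and that the column names are illustrative:
```python
def training_docs(self):
    if self.has_training_docs():
        # map the per-document processing function across the raw split
        return map(self._process_doc, self.dataset["train"])

def _process_doc(self, doc):
    # e.g. strip stray whitespace from the fields we will prompt with
    return {
        "question": doc["question"].strip(),
        "answer": doc["answer"].strip(),
    }
```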
### Formatting your Few-Shot Examples
The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples.
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.
```python
def doc_to_text(self, doc):
return ""
```
<br>
️🔠 **Multiple-Choice Formatting**
If your task is multiple-choice, you can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>.
️️🔠 **End Multiple-Choice Formatting**
<br>
Format the target answer from the contents of `doc`. Note that the prepended `" "` is required to space out the `doc_to_text` and `doc_to_target` strings.
```python
def doc_to_target(self, doc):
target = ""
return " " + target
```
Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
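For the question-answering document shape used as an example above, a minimal sketch might be:
```python
def doc_to_text(self, doc):
    return f"Question: {doc['question']}\nAnswer:"

def doc_to_target(self, doc):
    # the prepended space separates the prompt from the target
    return " " + doc["answer"]
```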
### Decontamination
For background on decontamination please see [this](./decontamination.md).
If you wish to support decontamination studies for your task, simply override the `should_decontaminate` method and return `True`.
You also need to override `doc_to_decontamination_query` and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request, and we leave this up to the implementor. For a multiple-choice evaluation, you could, for example, just return the question.
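A minimal sketch for a QA-style task, returning just the question text as the decontamination query:
```python
def should_decontaminate(self):
    return True

def doc_to_decontamination_query(self, doc):
    # compare only the question against the training corpus
    return doc["question"]
```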
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
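The registry entry is just a name-to-class mapping; here is a sketch with a hypothetical module and task name:
```python
# in lm_eval/tasks/__init__.py
from . import my_new_task  # hypothetical module containing TaskName

TASK_REGISTRY = {
    # ... existing tasks ...
    "my-task-name": my_new_task.TaskName,
}
```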
### Checking the Data
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
```bash
python -m scripts.write_out \
--output_base_path <path> \
--tasks <your-task> \
--sets <train | val | test> \
--num_fewshot K \
--num_examples N \
--description_dict_path <path>
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
## Evaluation
**🛑** If your task is a single-true multiple-choice task and you've correctly inherited from `MultipleChoiceTask` then your job here is done; <a href="#Checking-the-Task-Performance">go ‘head and check on the task performance!</a> 🛑
Now comes evaluation. The methods you'll need to implement are:
```python
def construct_requests(self, doc, ctx):
""" Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param ctx: str
The context string, generated by fewshot_context. This includes the natural
language description, as well as the few shot examples, and the question
part of the document for `doc`.
"""
return ...
```
#### What's a `Request`? What's a `doc`?
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers, and anything else you think will be needed in order to assess your model for a given task. Keep in mind that the fields of this can be basically whatever you want (you can sort this out in `training_docs` / `validation_docs` / `test_docs` if you need to customise things - see above), just remember to be consistent with them throughout the rest of the `Task` you write up.
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text you'll need to return an `rf.greedy_until` request; otherwise an `rf.loglikelihood` across all labels in a classification task will do.
The function `construct_requests` can return a list of `Request`s or an iterable; it's perfectly fine to `yield` them from something or other. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
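As a sketch, a yes/no classification task could construct one loglikelihood request per label, following the pattern used by existing tasks such as BoolQ (which import `rf` via `from lm_eval.base import rf, Task`):
```python
def construct_requests(self, doc, ctx):
    # one loglikelihood request per candidate continuation
    ll_yes, _ = rf.loglikelihood(ctx, " yes")
    ll_no, _ = rf.loglikelihood(ctx, " no")
    return ll_yes, ll_no
```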
```python
def process_results(self, doc, results):
"""Take a single document and the LM results and evaluates, returning a
dict where keys are the names of submetrics and values are the values of
the metric for that one document
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param results:
The results of the requests created in construct_requests.
"""
return {}
```
This is the next step in the chain after `construct_requests`. In between this function and the one above, the request is evaluated. The results of that request are returned in the `results` arg to this function. By processing results, what is meant is calculating the metric or metrics of interest for your dataset using the result and associated ground truth given to this function. It's possible to calculate and return multiple metrics in this function and the logic for it can be whatever you want - as long as you've made sure the ground truth was included in the `doc` object. The dict returned from this function should be of the format `{'metric_name': value}`. It is not necessary to have the same keys for every doc processed using `process_results`; this sort of thing can be handled in the next function, `aggregation`.
```python
def aggregation(self):
"""
:returns: {str: [float] -> float}
A dictionary where keys are the names of submetrics and values are
functions that aggregate a list of metrics
"""
return {}
```
In `process_results`, model outputs are converted into metrics. These metrics are per document metrics, however; the `aggregation` function is used to work out what to do with them to create a corpus-level metric. Imagine you have a bunch of documents, for each of which you have calculated an F1 score. What should that mean overall? Should they be summed, averaged, the min/max found? This function handles that problem.
The contents of the function itself are pretty straightforward; it should simply return a dict that maps from each metric label that could be returned by `process_results` to a function that can be used to aggregate that metric. That is to say, if the metrics that `process_results` could return are given by `{'a', 'b', 'c'}`, then all of these keys should be present in the dict returned by `aggregation`.
__NOTE__: See `lm_eval/metrics.py` for a few "built-in" aggregate metrics you can easily import. The standard metrics available in this package are generally based on `sklearn` functions, so if you are in any doubt for how to set things up the documentation over there can be of assistance. If you need to write a custom metric for some reason, start by looking at the existing ones in `lm_eval/metrics.py` for an idea about what the function signature needs to be.
```python
def higher_is_better(self):
"""
:returns: {str: bool}
A dictionary where keys are the names of submetrics and values are
whether a higher value of the submetric is better
"""
return {}
```
Finally, this function returns a dict with the same keys as `aggregation` and as it says in the description, simply tells us whether higher scores are better.
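Continuing the yes/no sketch from above, the three methods fit together like this (assuming the gold label is stored as a boolean under `doc["label"]`):
```python
from lm_eval.metrics import mean

def process_results(self, doc, results):
    ll_yes, ll_no = results
    gold = doc["label"]
    # correct iff the model prefers the gold label's continuation
    acc = 1.0 if (ll_yes > ll_no) == gold else 0.0
    return {"acc": acc}

def aggregation(self):
    return {"acc": mean}

def higher_is_better(self):
    return {"acc": True}
```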
Some tasks that are good examples of various ways evaluation can be implemented can be found here: [LAMBADA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py), [TriviaQA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/triviaqa.py), [SQuAD](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/squad.py).
Tip: Feel free to create your own helper-methods for your task!
### Checking the Task Performance
```sh
python main.py \
	--model gpt2 \
	--model_args device=<device-name> \
	--tasks <task-name> \
	--num_fewshot K \
	--limit N
```
Set the limit size, `N`, to a smallish number (e.g. 10) and try out the task under different `K`-shot settings. If you have an NVIDIA GPU at your disposal, add the argument `--model_args device=cuda:0`. If you have access to an OpenAI API key, you can also evaluate GPT-3 on various tasks with the following command:
```sh
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
--model gpt3 \
--tasks <task-name> \
--num_fewshot K
```
### Running Unit Tests
To run the entire test suite, use:
```sh
pytest
```
This is usually overkill; to run only the tests for your task, do:
```sh
pytest -k <task name>
```
## Versioning
Lastly, we need version control. Tasks in the harness evolve over time: metrics get updated, data sources change, and so on. It's important to mark each task with a version attribute so that users can document which implementation version was used to obtain their results. Add a `VERSION` attribute to your task right below the class name and set it to `0` (this is the first version/implementation of your task):
```python
class TaskName(...):
VERSION = 0
```
## Submitting your Task
You can format your changes and perform flake8 standard checks by running the following commands:
```sh
pre-commit install
pre-commit run --all-files
```
Now push your work and make a pull request! Thanks for the contribution 👍. If there are any questions, leave a message in the `#lm-thunderdome` channel on the EAI discord.
This folder contains instructions and task configurations needed to evaluate certain papers that use non-standard evaluation setups.
Tasks may already be supported in the main library under `lm_eval/tasks` or, if highly paper-specific, may remain as YAMLs in the respective `examples/paper-title` folder.
## Verified Papers:
* [WIP] [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
* Further details can be found in the `chain_of_thought` subfolder.
## Candidates to Support:
* Least-to-Most Prompting
* Algorithmic Prompting
* Other in-scope prompting techniques
* Multi-turn prompting strategies are likely out of scope for the repository.
* Pythia Suite: Term Frequencies over training
* All setups from GPT-3 Paper
* Varying few-shot orderings and selection; varying the label choices for multiple-choice tasks
* Your Paper Here!
# Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
## All Tasks in Paper
* ...
* ...
* ...
## Reproduction Scripts
* ...
 ROUGE
 rouge
 nin
+maka
+mor
+te
@@ -20,6 +20,8 @@ def median(arr):
     return arr[len(arr) // 2]

+# Certain metrics must be calculated across all documents in a benchmark.
+# We use them as aggregation metrics, paired with no-op passthrough metric fns.
 @register_aggregation("perplexity")
 def perplexity(items):
     return math.exp(-mean(items))
@@ -35,6 +37,25 @@ def bits_per_byte(items):
     return -weighted_mean(items) / math.log(2)

+@register_aggregation("f1")
+def f1_score(items):
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+    fscore = sklearn.metrics.f1_score(golds, preds)
+    return np.max(fscore)
+
+
+@register_aggregation("matthews_corrcoef")
+def matthews_corrcoef(items):
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+    # print(preds)
+    return sklearn.metrics.matthews_corrcoef(golds, preds)
+
+
 @register_metric(
     metric="acc",
     higher_is_better=True,
@@ -45,6 +66,16 @@ def acc_fn(items):  # This is a passthrough function
     return items

+@register_metric(
+    metric="acc_norm",
+    higher_is_better=True,
+    output_type=["loglikelihood", "multiple_choice"],
+    aggregation="mean",
+)
+def acc_norm_fn(items):  # This is a passthrough function
+    return items
+
+
 @register_metric(
     metric="acc_mutual_info",
     higher_is_better=True,
@@ -109,27 +140,24 @@ def mean_stderr(arr):
     return sample_stddev(arr) / math.sqrt(len(arr))

-@register_metric(metric="matthews_corrcoef", higher_is_better=True, aggregation="mean")
-def matthews_corrcoef(items):
-    unzipped_list = list(zip(*items))
-    golds = unzipped_list[0]
-    preds = unzipped_list[1]
-    return sklearn.metrics.matthews_corrcoef(golds, preds)
+@register_metric(
+    metric="mcc",
+    higher_is_better=True,
+    output_type="multiple_choice",
+    aggregation="matthews_corrcoef",
+)
+def mcc_fn(items):  # This is a passthrough function
+    return items

 @register_metric(
     metric="f1",
     higher_is_better=True,
     output_type="multiple_choice",
-    aggregation="mean",
+    aggregation="f1",
 )
-def f1_score(items):
-    unzipped_list = list(zip(*items))
-    golds = unzipped_list[0]
-    preds = unzipped_list[1]
-    fscore = sklearn.metrics.f1_score(golds, preds)
-    return np.max(fscore)
+def f1_fn(items):  # This is a passthrough function
+    return items

 @register_metric(
...
@@ -26,11 +26,17 @@ def register_model(*names):
 def get_model(model_name):
-    return MODEL_REGISTRY[model_name]
+    try:
+        return MODEL_REGISTRY[model_name]
+    except KeyError:
+        raise ValueError(
+            f"Attempted to load model '{model_name}', but no model for this name found! Supported model names: {', '.join(MODEL_REGISTRY.keys())}"
+        )

 TASK_REGISTRY = {}
 GROUP_REGISTRY = {}
+ALL_TASKS = set()
 func2task_index = {}
@@ -41,6 +47,7 @@ def register_task(name):
         ), f"task named '{name}' conflicts with existing registered task!"
         TASK_REGISTRY[name] = fn
+        ALL_TASKS.add(name)
         func2task_index[fn.__name__] = name
         return fn
@@ -49,15 +56,12 @@ def register_task(name):
 def register_group(name):
     def decorate(fn):
-        # assert (
-        #     name not in GROUP_REGISTRY
-        # ), f"group named '{name}' conflicts with existing registered group!"
         func_name = func2task_index[fn.__name__]
         if name in GROUP_REGISTRY:
             GROUP_REGISTRY[name].append(func_name)
         else:
             GROUP_REGISTRY[name] = [func_name]
+        ALL_TASKS.add(name)
         return fn

     return decorate
@@ -75,9 +79,7 @@ DEFAULT_METRIC_REGISTRY = {
         "acc",
     ],
     "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
-    "multiple_choice": [
-        "acc",
-    ],
+    "multiple_choice": ["acc", "acc_norm"],
     "greedy_until": ["exact_match"],
 }
@@ -135,7 +137,6 @@ searching in HF Evaluate library..."
 def register_aggregation(name):
-    # TODO: should we enforce a specific interface to aggregation metrics?
     def decorate(fn):
         assert (
             name not in AGGREGATION_REGISTRY
...
 import abc
-from dataclasses import dataclass, field
+from dataclasses import dataclass, field, asdict
 import re
 import ast
@@ -51,11 +51,9 @@ ALL_OUTPUT_TYPES = [
 class TaskConfig(dict):
     task: str = None
-    group: str = None
+    group: Union[str, list] = None
     reference: str = None
-    task_name: str = (
-        None  # TODO: deprecate this, it'll be set in __post_init__ to be names[0]
-    )
     dataset_path: str = None
     dataset_name: str = None
     dataset_kwargs: dict = None
@@ -68,6 +66,7 @@ class TaskConfig(dict):
     aliases: Union[str, list] = None
     doc_to_text: Union[Callable, str] = None
     doc_to_target: Union[Callable, str] = None
+    use_prompt: str = None
     num_fewshot: int = 0
     batch_size: int = 1
@@ -79,12 +78,8 @@ class TaskConfig(dict):
     generation_kwargs: dict = None
     delimiter: str = "\n\n"
     filter_list: Union[str, list] = None
-    normalization: str = (
-        None  # TODO: add length-normalization of various types, mutual info
-    )
     should_decontaminate: bool = False
     doc_to_decontamination_query: str = None
-    use_prompt: str = None
     metadata: str = None  # by default, not used in the code. allows for users to pass arbitrary info to tasks
@@ -102,13 +97,33 @@ class TaskConfig(dict):
         if type(self.gold_alias) == str:
             self.gold_alias = self.template_aliases + self.doc_to_target

-        if not self.generation_kwargs:
-            assert (
-                self.output_type == "greedy_until"
-            ), "passed `generation_kwargs`, but not using a generation request type!"
+        if self.generation_kwargs or self.output_type == "greedy_until":
             # ensure that we greedily generate in absence of explicit arguments otherwise
             self.generation_kwargs = {"do_sample": False, "temperature": 0.0}

     def __getitem__(self, item):
         return getattr(self, item)

+    def to_dict(self):
+        """dumps the current config as a dictionary object, as a printable format.
+        null fields will not be printed.
+        Used for dumping results alongside full task configuration
+
+        :return: dict
+            A printable dictionary version of the TaskConfig object.
+
+        # TODO: should any default value in the TaskConfig not be printed?
+        """
+        cfg_dict = asdict(self)
+        # remove values that are `None`
+        for k, v in list(cfg_dict.items()):
+            if v is None:
+                cfg_dict.pop(k)
+        return cfg_dict

 class Task(abc.ABC):
     """A task represents an entire benchmark including its dataset, problems,
@@ -420,7 +435,7 @@ class Task(abc.ABC):
         if num_fewshot == 0:
             labeled_examples = ""
         else:
-            labeled_examples = self.sampler.get_context(doc, self._config.num_fewshot)
+            labeled_examples = self.sampler.get_context(doc, num_fewshot)

         # for sets with no training docs, draw from other set *but ensure no overlap with current doc*
         # if self.has_training_docs():
@@ -460,10 +475,20 @@ class Task(abc.ABC):
             eval_logger.warning("No filter defined, passing through instances")
             return self._instances

+    def dump_config(self):
+        """Returns a dictionary representing the task's config.
+
+        :returns: str
+            The fewshot context.
+        """
+        # TODO: this should only return the overrides applied to a non-YAML task's configuration.
+        # (batch size, num_fewshot)
+        return self._config.to_dict()

 class ConfigurableTask(Task):
-    VERSION = "2.0"
+    VERSION = "Yaml"
     OUTPUT_TYPE = None
     CONFIG = None
@@ -503,7 +528,7 @@ class ConfigurableTask(Task):
         _metric_list = DEFAULT_METRIC_REGISTRY[self._config.output_type]

         if self._config.metric_list is None:
-            # TODO: handle this in TaskConfig.__post_init__ ?
             for metric_name in _metric_list:
                 self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
                 self._aggregation_list[metric_name] = DEFAULT_AGGREGATION_REGISTRY[
@@ -521,9 +546,9 @@ class ConfigurableTask(Task):
                     for key in metric_config
                     if key not in ["metric", "aggregation", "higher_is_better"]
                 }
-                if metric_name in _metric_list:
+                try:
                     self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
-                else:
+                except Exception:
                     eval_logger.warning(
                         f"Metric {metric_name} not found, "
                         "Searching from https://huggingface.co/evaluate-metric"
@@ -540,15 +565,25 @@ class ConfigurableTask(Task):
                     )

                 if "aggregation" in metric_config:
-                    self._aggregation_list[metric_name] = metric_config["aggregation"]
+                    agg_name = metric_config["aggregation"]
+                    if type(agg_name) == str:
+                        self._aggregation_list[metric_name] = AGGREGATION_REGISTRY[
+                            agg_name
+                        ]
+                    elif callable(agg_name):
+                        self._aggregation_list[metric_name] = metric_config[
+                            "aggregation"
+                        ]
                 else:
+                    INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
+                    metric_agg = DEFAULT_AGGREGATION_REGISTRY[metric_name]
                     eval_logger.warning(
-                        f"metric {metric_name} is defined, but aggregation is not"
-                        f"using default aggregation for {metric_name}"
+                        f"metric {metric_name} is defined, but aggregation is not. "
+                        f"using default "
+                        f"aggregation={INV_AGG_REGISTRY[metric_agg]}"
                     )
-                    self._aggregation_list[metric_name] = DEFAULT_AGGREGATION_REGISTRY[
-                        metric_name
-                    ]
+                    self._aggregation_list[metric_name] = metric_agg

                 if "higher_is_better" in metric_config:
                     self._higher_is_better[metric_name] = metric_config[
@@ -556,8 +591,9 @@ class ConfigurableTask(Task):
                     ]
                 else:
                     eval_logger.warning(
-                        f"metric {metric_name} is defined, but higher_is_better is not"
-                        f"using default higher_is_better for {metric_name}"
+                        f"metric {metric_name} is defined, but higher_is_better is not. "
+                        f"using default "
+                        f"higher_is_better={HIGHER_IS_BETTER_REGISTRY[metric_name]}"
                     )
                     self._higher_is_better[metric_name] = HIGHER_IS_BETTER_REGISTRY[
                         metric_name
@@ -579,13 +615,10 @@ class ConfigurableTask(Task):
                     key: function[key] for key in function if key != "function"
                 }
                 components.append([function["function"], kwargs])
             filter_pipeline = build_filter_ensemble(filter_name, components)
             self._filters.append(filter_pipeline)
         else:
-            self._filters = [
-                build_filter_ensemble("take_first", [["take_first", None]])
-            ]
+            self._filters = [build_filter_ensemble("none", [["take_first", None]])]

         if self._config.use_prompt is not None:
             eval_logger.info(f"loading prompt {self._config.use_prompt}")
@@ -598,7 +631,7 @@ class ConfigurableTask(Task):
         if self.fewshot_docs() is not None:
             self.sampler = samplers.Sampler(
                 list(self.fewshot_docs()), self, rnd=random.Random()
-            )  # TODO: pass the correct docs in here
+            )

     def download(self, dataset_kwargs=None):
@@ -639,15 +672,16 @@ class ConfigurableTask(Task):
             return self.dataset[self._config.test_split]

     def fewshot_docs(self):
-        if (self._config.num_fewshot > 0) and (self._config.fewshot_split is None):
-            eval_logger.warning(
-                "num_fewshot > 0 but fewshot_split is None. "
-                "using preconfigured rule."
-            )
-            return super().fewshot_docs()
-        elif self._config.fewshot_split is not None:
+        if self._config.fewshot_split is not None:
             return self.dataset[self._config.fewshot_split]
+        else:
+            if self._config.num_fewshot > 0:
+                eval_logger.warning(
+                    f"Task '{self._config.task}': "
+                    "num_fewshot > 0 but fewshot_split is None. "
+                    "using preconfigured rule."
+                )
+            return super().fewshot_docs()

     def should_decontaminate(self):
         return self._config.should_decontaminate
@@ -818,7 +852,7 @@ class ConfigurableTask(Task):
             )
             if (
                 2 * len(choices) == len(lls)
-                and "acc_mutual_info" in self._metric_list.keys()
+                and "acc_mutual_info" in self._metric_fn_list.keys()
             ):
                 # then we are doing mutual info.
                 # this stores the "dryrun" / unconditional answer loglikelihoods
@@ -833,7 +867,8 @@ class ConfigurableTask(Task):
             result_dict = {
                 **({"acc": acc} if "acc" in use_metric else {}),
-                **({"f1": (pred, gold)} if "f1" in use_metric else {}),
+                **({"f1": (gold, pred)} if "f1" in use_metric else {}),
+                **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
                 **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
             }
...
 import random
 import itertools
+import json
 import collections
+import logging
+import sys

 import torch
@@ -22,6 +25,10 @@ from lm_eval.utils import (
 from lm_eval.logger import eval_logger

+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+logger.addHandler(logging.StreamHandler(sys.stdout))

 @positional_deprecated
 def simple_evaluate(
@@ -30,14 +37,16 @@ def simple_evaluate(
     tasks=[],
     num_fewshot=0,
     batch_size=None,
+    max_batch_size=None,
     device=None,
     no_cache=False,
     limit=None,
     bootstrap_iters=100000,
     check_integrity=False,
     decontamination_ngrams_path=None,
+    write_out=False,
+    output_base_path=None,
 ):
     """Instantiate and evaluate a model on a list of tasks.
     :param model: Union[str, LM]
@@ -49,18 +58,24 @@ def simple_evaluate(
         List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
     :param num_fewshot: int
         Number of examples in few-shot context
-    :param batch_size: int, optional
+    :param batch_size: int or str, optional
         Batch size for model
+    :param max_batch_size: int, optional
+        Maximal batch size to try with automatic batch size detection
     :param device: str, optional
         PyTorch device (e.g. "cpu" or "cuda:0") for running models
     :param no_cache: bool
         Whether or not to cache
-    :param limit: int, optional
-        Limit the number of examples per task (only use this for testing)
+    :param limit: int or float, optional
+        Limit the number of examples per task (only use this for testing), If <1, limit is a percentage of the total number of examples.
     :param bootstrap_iters:
         Number of iterations for bootstrap statistics
     :param check_integrity: bool
         Whether to run the relevant part of the test suite for the tasks
+    :param write_out: bool
+        If True, write details about prompts and logits to json for all tasks
+    :param output_base_path: str, optional
+        Directory to which detailed eval info will be written. Defaults to present working dir.
     :return
         Dictionary of results
     """
@@ -73,7 +88,12 @@ def simple_evaluate(
         if model_args is None:
             model_args = ""
         lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
-            model_args, {"batch_size": batch_size, "device": device}
+            model_args,
+            {
+                "batch_size": batch_size,
+                "max_batch_size": max_batch_size,
+                "device": device,
+            },
         )
     else:
         assert isinstance(model, lm_eval.api.model.LM)
@@ -90,15 +110,22 @@ def simple_evaluate(
         limit=limit,
         bootstrap_iters=bootstrap_iters,
         decontamination_ngrams_path=decontamination_ngrams_path,
+        write_out=write_out,
+        output_base_path=output_base_path,
     )

     if lm.rank == 0:
         # add info about the model and few shot config
         results["config"] = {
-            "model": model,
+            "model": model
+            if isinstance(model, str)
+            else model.model.config._name_or_path,
             "model_args": model_args,
             "num_fewshot": num_fewshot,
             "batch_size": batch_size,
+            "batch_sizes": list(lm.batch_sizes.values())
+            if hasattr(lm, "batch_sizes")
+            else [],
             "device": device,
             "no_cache": no_cache,
             "limit": limit,
@@ -120,6 +147,8 @@ def evaluate(
     limit=None,
     bootstrap_iters=100000,
     decontamination_ngrams_path=None,
+    write_out=False,
+    output_base_path=None,
 ):
     """Instantiate and evaluate a model on a list of tasks.
@@ -133,6 +162,10 @@ def evaluate(
         Limit the number of examples per task (only use this for testing)
     :param bootstrap_iters:
         Number of iterations for bootstrap statistics
+    :param write_out: bool
+        If True, write all prompts, logits and metrics to json for offline analysis
+    :param output_base_path: str, optional
+        Directory to which detailed eval info will be written. Defaults to present working dir
     :return
         Dictionary of results
     """
@@ -141,21 +174,32 @@ def evaluate(
     results = collections.defaultdict(dict)
     versions = collections.defaultdict(dict)
+    configs = collections.defaultdict(dict)
+    samples = collections.defaultdict(list)
     requests = collections.defaultdict(list)
-    # requests_origin = collections.defaultdict(list)
     # docs = {}

     # get lists of each type of request
     for task_name, task in task_dict.items():
         versions[task_name] = task.VERSION
+        configs[task_name] = dict(
+            task.dump_config()
+        )  # TODO: don't access a private attribute here ; for non-YAML tasks handle this case

         # deterministically shuffle docs and chop off the first `limit` because sometimes docs are in some kind of order
         # task_docs = list(task_doc_func())
         # rnd = random.Random()
         # rnd.seed(42)
         # rnd.shuffle(task_docs)
+        if limit is not None:
+            if task.has_test_docs():
+                task_docs = task.test_docs()
+            elif task.has_validation_docs():
+                task_docs = task.validation_docs()
+            else:
+                raise RuntimeError("Task has neither test_docs nor validation_docs")
+            limit = int(len(task_docs) * limit) if limit < 1.0 else int(limit)

         task.build_all_requests(limit=limit, rank=lm.rank, world_size=lm.world_size)
@@ -222,6 +266,7 @@ def evaluate(
                 enumerate(task.validation_docs()), lm.rank, limit, lm.world_size
             )
         )
+
        for doc_id, doc in doc_iterator:
             # subset instances to only this document id ; sort by idx
             requests = list(filter(lambda x: x.doc_id == doc_id, task.instances))
@@ -229,6 +274,16 @@ def evaluate(
             metrics = task.process_results(
                 doc, [req.filtered_resps[key] for req in requests]
             )
+            target = task.doc_to_target(doc)
+            example = {
+                "doc_id": doc_id,
+                "doc": doc,
+                "target": target,
+                "resps": [req.resps for req in requests],
+                "filtered_resps": [req.filtered_resps[key] for req in requests],
+            }
+            example.update(metrics)
+            samples[task_name].append(example)
             for metric, value in metrics.items():
                 vals[(task_name, key, metric)].append(value)
@@ -289,7 +344,12 @@ def evaluate(
             if stderr is not None:
                 results[task_name][metric + "_stderr" + "," + key] = stderr(items)

-        return {"results": dict(results), "versions": dict(versions)}
+        return {
+            "results": dict(results),
+            "configs": dict(configs),
+            "versions": dict(versions),
+            "samples": samples,
+        }
     else:
         return None
 from . import hf_causal
-from . import gpt3
+from . import openai_completions
 from . import textsynth
 from . import dummy
 from . import seq2seq
...
import os
from lm_eval.base import BaseLM
from tqdm import tqdm
import time


def anthropic_completion(
    client, model, prompt, max_tokens_to_sample, temperature, stop
):
    """Query Anthropic API for completion.

    Retry with back-off until they respond
    """
    import anthropic

    backoff_time = 3
    while True:
        try:
            response = client.completion(
                prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
                model=model,
                # NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
                # (e.g. gsm8k's ":") may truncate a lot of the input.
                stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
                max_tokens_to_sample=max_tokens_to_sample,
                temperature=temperature,
            )
            print(response)
            return response["completion"]
        except RuntimeError:
            # TODO: I don't actually know what error Anthropic raises when it times out,
            # so update this error when we find out.
            import traceback

            traceback.print_exc()
            time.sleep(backoff_time)
            backoff_time *= 1.5


class AnthropicLM(BaseLM):
    REQ_CHUNK_SIZE = 20

    def __init__(self, model):
        """
        :param model: str
            Anthropic model e.g. claude-instant-v1
        """
        super().__init__()
        import anthropic

        self.model = model
        self.client = anthropic.Client(os.environ["ANTHROPIC_API_KEY"])

    @property
    def eot_token_id(self):
        raise NotImplementedError("No idea about anthropic tokenization.")

    @property
    def max_length(self):
        return 2048

    @property
    def max_gen_toks(self):
        return 256

    @property
    def batch_size(self):
        # Isn't used because we override _loglikelihood_tokens
        raise NotImplementedError()

    @property
    def device(self):
        # Isn't used because we override _loglikelihood_tokens
        raise NotImplementedError()

    def tok_encode(self, string: str):
        raise NotImplementedError("No idea about anthropic tokenization.")

    def tok_decode(self, tokens):
        raise NotImplementedError("No idea about anthropic tokenization.")

    def _loglikelihood_tokens(self, requests, disable_tqdm=False):
        raise NotImplementedError("No support for logits.")

    def greedy_until(self, requests):
        if not requests:
            return []

        res = []
        for request in tqdm(requests):
            inp = request[0]
            request_args = request[1]
            until = request_args["until"]
            response = anthropic_completion(
                client=self.client,
                model=self.model,
                prompt=inp,
                max_tokens_to_sample=self.max_gen_toks,
                temperature=0.0,
                stop=until,
            )
            res.append(response)
        return res

    def _model_call(self, inps):
        # Isn't used because we override _loglikelihood_tokens
        raise NotImplementedError()

    def _model_generate(self, context, max_length, eos_token_id):
        # Isn't used because we override greedy_until
        raise NotImplementedError()
@@ -15,7 +15,7 @@ from accelerate import Accelerator
 from itertools import islice

-@register_model("hf-causal", "gpt2")
+@register_model("hf-causal")
 class HFLM(LM):
     def __init__(
         self,
...
@@ -57,7 +57,7 @@ def oa_completion(**kwargs):
             backoff_time *= 1.5

-@register_model("openai")
+@register_model("openai", "openai-completions", "gooseai")
 class GPT3LM(LM):
     REQ_CHUNK_SIZE = 20
...