Commit a27e8ed1 authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into squadv2

parents fc329d31 4cda3a1c
......@@ -50,6 +50,7 @@ jobs:
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
          cache: 'pip'
      - name: Install dependencies
        if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
        run: |
......
name: Pull Request
on: [pull_request]
jobs:
  pre-commit:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - uses: pre-commit/action@v2.0.3
......@@ -6,10 +6,10 @@ name: Unit Tests
on:
  push:
    branches:
      - big-refactor
      - 'big-refactor*'
  pull_request:
    branches:
      - big-refactor
      - 'big-refactor*'
  workflow_dispatch:
# Jobs run concurrently and steps run sequentially within a job.
# jobs: linter and cpu_tests. Add more jobs/steps as required.
......@@ -26,8 +26,11 @@ jobs:
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
          cache: 'pip'
      - name: Install dependencies
        run: pip install -e '.[linting,testing]' --extra-index-url https://download.pytorch.org/whl/cpu
      - name: Pre-Commit
        uses: pre-commit/action@v3.0.0
      - name: Lint with pylint
        run: python -m pylint --disable=all -e W0311 --jobs=0 --indent-string=' ' **/*.py
      - name: Lint with flake8
......@@ -52,6 +55,7 @@ jobs:
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
          cache: 'pip'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
......@@ -60,4 +64,4 @@ jobs:
          # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
          # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
      - name: Test with pytest
        run: python -m pytest -s -v -n=auto --ignore=tests/tests_master --ignore=tests/extra
        run: python -m pytest --showlocals -s -vv -n=auto --ignore=tests/tests_master --ignore=tests/extra
......@@ -20,7 +20,7 @@ This project provides a unified framework to test generative language models on
Features:
- Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md).
- Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
......
......@@ -69,6 +69,8 @@ touch lm_eval/tasks/<dataset_name>/utils.py
```
Now, in `utils.py` we'll write a function to process each split of our dataset:
TODO: Change the example to one that's in the tasks/
```python
def process_docs(dataset: datasets.Dataset):
def _helper(doc):
......@@ -86,40 +88,53 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the
process_docs: !function utils.process_docs
```
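Since the Python snippet above is elided by the diff, here is a self-contained sketch of what a complete `process_docs` might look like (the `query` column and the cleanup applied to it are hypothetical, purely for illustration):

```python
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _helper(doc):
        # hypothetical per-document cleanup; "query" is a made-up column name
        doc["query"] = doc["query"].strip()
        return doc

    # apply the transform to every row of the split
    return dataset.map(_helper)
```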
### Writing a prompt with Jinja 2
## Writing a Prompt Template
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users fill in the fields `doc_to_text`, `doc_to_target`, and `doc_to_choice` (the last is optional when certain conditions are met).
`doc_to_text` defines the input string a model will be given, while `doc_to_target` and `doc_to_choice` are used to generate the target text. `doc_to_target` can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, `doc_to_choice` must also be set with the appropriate list of possible choice strings.
To write a prompt, users are required to write two or three YAML fields in Jinja as strings:
### Basic prompts
If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example, in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of one of the features.
```yaml
doc_to_text:
doc_to_target:
doc_to_choice:
doc_to_text: startphrase
doc_to_target: label
```
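Roughly speaking, when the field value matches a feature name, the feature's value for each document is used verbatim. A small illustrative sketch (not the harness's actual resolution code):

```python
# illustrative only: resolving a bare feature name such as "startphrase"
doc = {"startphrase": "A person on a horse jumps over a log. The person", "label": 1}

doc_to_text = "startphrase"
# if the string names a feature, use that feature's value; otherwise treat it as a template
prompt = doc[doc_to_text] if doc_to_text in doc else doc_to_text
print(prompt)  # -> "A person on a horse jumps over a log. The person"
```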
Suppose our dataset has a `"question"` field, and an `"answer"` field, which are both strings. We want the model to see, if given a `document` object that is a row of our dataset:
Hard-coding a value is also possible, as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11).
```yaml
doc_to_target: 3
```
Question: {document[question]}
`doc_to_choice` can be given a list of strings as options directly (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)).
```yaml
doc_to_choice: ['No', 'Yes']
```
### Writing a prompt with Jinja 2
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
Take, for example, `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together, so that for a sample line `doc` the model sees something in the format of:
```
doc["passage"]
Question: doc["question"]?
Answer:
```
We do this by writing
We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61)
```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
```
Such that {{question}} will be replaced by `doc["question"]` when rendering the prompt template.
Such that `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` with `doc["question"]` when rendering the prompt template.
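Behind the scenes this is ordinary Jinja rendering. A standalone sketch using the `jinja2` package directly (with a made-up document) shows the substitution:

```python
from jinja2 import Template

doc = {
    "passage": "Water boils at 100 degrees Celsius at sea level.",
    "question": "does water boil at 100 degrees celsius",
}

template = Template("{{passage}}\nQuestion: {{question}}?\nAnswer:")
print(template.render(**doc))
# Water boils at 100 degrees Celsius at sea level.
# Question: does water boil at 100 degrees celsius?
# Answer:
```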
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml
doc_to_target: "{{answer}}"
gold_alias: "{{answer}}"
```
where `doc_to_target` is *the string that will be appended to inputs for each few-shot example*, and `gold_alias` is *what is passed to our metric function as reference or gold answer to score against*. For example, for GSM8k word problems, `doc_to_target` should be the reference text reasoning chain given in the dataset culminating in the answer, and `gold_alias` should be **only the numeric answer** to the word problem that is given at the end of the reasoning chain, and which the evaluated model's answer will be compared against.
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
Users can also fill out the optional `template_aliases` YAML field, which is added ahead of both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, but only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while the main prompt fields remain easy to parse visually.
**Important**: we now add `target_delimiter` between input and target, which defaults to `" "`, such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
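As a concrete illustration of the concatenation order (a sketch, not the harness's internal code):

```python
# sketch: how one fewshot example string is assembled
doc_to_text = "Question: Is the sky blue?\nAnswer:"  # rendered input, no trailing whitespace
doc_to_target = "yes"                                # rendered target, no leading whitespace
target_delimiter = " "                               # the default

example = doc_to_text + target_delimiter + doc_to_target
print(repr(example))  # 'Question: Is the sky blue?\nAnswer: yes'
```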
#### Multiple choice format
......@@ -135,7 +150,13 @@ doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
The label index can also be sourced from a feature directly. For example, in `superglue/boolq`, the label index is defined in the feature `label`, so we can set `doc_to_target` to simply `label`. The options or verbalizers can then be written in the form of a list `["no", "yes"]` whose entries correspond to the label index.
```yaml
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
```
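Resolving an integer target against the choice list works as you would expect; a sketch with a made-up document:

```python
# sketch: how doc_to_target: label is resolved against doc_to_choice
doc = {"passage": "...", "question": "...", "label": 1}

choices = ["no", "yes"]                # from doc_to_choice
target_string = choices[doc["label"]]  # -> "yes"
```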
### Using Python Functions for Prompts
......@@ -168,6 +189,10 @@ For example, For Super Glue BoolQ, if we want to use the prompt template `GPT-3
use_prompt: "promptsource:GPT-3 Style"
```
If you would like to run the evaluation on all prompt templates, you can specify it this way:
```
use_prompt: "promptsource:*"
```
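Under the hood, these templates come from the [promptsource](https://github.com/bigscience-workshop/promptsource) package. Loading one directly looks roughly like this (assuming `promptsource` is installed; the document here is made up):

```python
from promptsource.templates import DatasetTemplates

# load the prompt templates written for super_glue/boolq
boolq_prompts = DatasetTemplates("super_glue", "boolq")
prompt = boolq_prompts["GPT-3 Style"]

doc = {
    "passage": "Water boils at 100 degrees Celsius at sea level.",
    "question": "does water boil at 100 degrees celsius",
    "label": 1,
}
input_text, target_text = prompt.apply(doc)
```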
### Setting metrics
......@@ -183,11 +208,11 @@ metric_list:
  - metric: <name of the metric here>
    aggregation: <name of the aggregation fn here>
    higher_is_better: <true or false>
  - metric: ...
  - metric: !function script.function
    aggregation: ...
    higher_is_better: ...
```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults, if using a natively supported metric.
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric; otherwise, they must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.
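For a metric supplied via `!function`, the callable should accept the keyword arguments the harness passes when scoring (as visible in `process_results` further down: `references`, `predictions`, plus any configured kwargs). A hedged sketch of such a function, referenced in YAML as `!function metrics.exact_match_strip` (the module and function names are hypothetical):

```python
# metrics.py (hypothetical module)
def exact_match_strip(references, predictions, **kwargs):
    """Score 1.0 if the stripped prediction equals the stripped reference, else 0.0."""
    return 1.0 if predictions[0].strip() == references[0].strip() else 0.0
```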
......
# Advanced Task Configuration
# Task Configuration
The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
......@@ -33,7 +33,6 @@ Prompting / in-context formatting options:
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the list of choices given by `doc_to_choice`.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks.
- **gold_alias** (`str`, *optional*, defaults to None) — if provided, used to generate the reference answer that is scored against. Used in cases where `doc_to_target` should be the "target string" format appended to each example's input for a fewshot exemplar, so doc_to_target is used for fewshot examples, but the input to the metric function as `gold` is from `gold_alias`.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
......
......@@ -4,3 +4,4 @@ nin
maka
mor
te
ond
......@@ -465,8 +465,11 @@ class Task(abc.ABC):
        elif type(example) == list:
            return [labeled_examples + ex for ex in example]
        elif type(example) == int:
            choices = self.doc_to_choice(doc)
            return labeled_examples + choices[example]
            if self._config.doc_to_choice is not None:
                choices = self.doc_to_choice(doc)
                return labeled_examples + choices[example]
            else:
                return labeled_examples + str(example)

    def apply_filters(self):
......@@ -649,9 +652,36 @@ class ConfigurableTask(Task):
            if type(test_text) is int:
                self.multiple_input = num_choice
        else:
            test_choice = None

        if type(test_target) is list:
            self.multiple_target = len(test_target)
        else:
            if (type(test_target) is int) and (test_choice is not None):
                test_target = test_choice[test_target]
            else:
                test_target = str(test_target)

        if test_choice is not None:
            check_choices = test_choice
        else:
            check_choices = [test_target]

        for choice in check_choices:
            choice_has_whitespace = True if " " in choice else False
            delimiter_has_whitespace = (
                True if " " in self._config.target_delimiter else False
            )

            if delimiter_has_whitespace and choice_has_whitespace:
                eval_logger.warning(
                    f'Both target_delimiter and target choice: "{choice}" have whitespace'
                )
            elif (not delimiter_has_whitespace) and (not choice_has_whitespace):
                eval_logger.warning(
                    f'Neither target_delimiter nor target choice: "{choice}" has whitespace; ignore this if the language you are evaluating does not require/use whitespace'
                )
def download(self, dataset_kwargs=None):
......@@ -790,7 +820,11 @@ class ConfigurableTask(Task):
        target_string = utils.apply_template(doc_to_target, doc)
        if target_string.isdigit():
            return ast.literal_eval(target_string)
        elif (target_string[0] == "[") and (target_string[-1] == "]"):
        elif (
            len(target_string) >= 2
            and (target_string[0] == "[")
            and (target_string[-1] == "]")
        ):
            return ast.literal_eval(target_string)
        else:
            return target_string
......@@ -1002,41 +1036,45 @@ class ConfigurableTask(Task):
        elif self.OUTPUT_TYPE == "greedy_until":
            gold = self.doc_to_target(doc)
            if type(gold) == int:
                if self._config.doc_to_choice is not None:
                    # If doc_to_choice is set, doc_to_target is assumed
                    # to return an index into the choice list.
                    choices = self.doc_to_choice(doc)
                    gold = choices[gold]
                else:
                    gold = str(gold)

            for key, result in zip(self._metric_fn_list.keys(), results):
            result = results[0]
            for metric in self._metric_fn_list.keys():
                if self.multiple_target:
                    # in the case where we have multiple targets,
                    # return true if any are true
                    # TODO: this may break for multiple_target, non zero-or-1 metrics
                    scores = []
                    for gold_option in gold:
                        res = self._metric_fn_list[key](
                        res = self._metric_fn_list[metric](
                            references=[gold_option],
                            predictions=[result],
                            **self._metric_fn_kwargs[key],
                            **self._metric_fn_kwargs[metric],
                        )
                        if isinstance(res, dict):
                            # TODO: this handles the case where HF evaluate returns a dict.
                            res = res[key]
                            res = res[metric]
                        scores.append(res)
                    if any(scores):
                        result_score = 1.0
                    else:
                        result_score = 0.0
                else:
                    result_score = self._metric_fn_list[key](
                    result_score = self._metric_fn_list[metric](
                        references=[gold],
                        predictions=[result],
                        **self._metric_fn_kwargs[key],
                        **self._metric_fn_kwargs[metric],
                    )
                    if isinstance(result_score, dict):
                        result_dict.update(result_score)
                    else:
                        result_dict[key] = result_score
                if isinstance(result_score, dict):
                    # TODO: this handles the case where HF evaluate returns a dict.
                    result_score = result_score[metric]
                result_dict[metric] = result_score
        else:
            raise ValueError(
                f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
......@@ -219,7 +219,6 @@ def evaluate(
    padding_requests = collections.defaultdict(int)
    # Stores group related keys and values for group-aggregation
    aggregate = collections.defaultdict(dict)
    task_groups = collections.defaultdict(dict)

    # get lists of each type of request
......@@ -228,6 +227,7 @@ def evaluate(
        if type(task) == tuple:
            group, task = task
            task_groups[task_name] = group
            aggregate[task_name] = {}

        versions[task_name] = task.VERSION
        configs[task_name] = dict(task.dump_config())
......@@ -407,12 +407,12 @@ def evaluate(
            # | word_perplexity
            # | byte_perplexity
            # | bits_per_byte
            if bool(task_groups):
            if task_name in task_groups:
                group_name = task_groups[task_name]
                if metric not in aggregate[group_name]:
                    aggregate[group_name][metric] = [task_score]
                else:
                if metric in list(aggregate[group_name].keys()):
                    aggregate[group_name][metric].append(task_score)
                else:
                    aggregate[group_name][metric] = [task_score]

            # hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
            # so we run them for fewer iterations. still looking for a cleaner way to do this
......
......@@ -13,7 +13,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Wikitext
- [x] PiQA
- [x] PROST
- [ ] MCTACO (Lintang)
- [x] MCTACO
- [x] Pubmed QA
- [x] SciQ
- [ ] QASPER
......@@ -33,9 +33,9 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Winogrande
- [x] ANLI
- [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
- [x] TruthfulQA (mc1) (Lintang)
- [ ] TruthfulQA (mc2) (Lintang)
- [ ] TruthfulQA (gen) (Lintang)
- [x] TruthfulQA (mc1)
- [x] TruthfulQA (mc2)
- [x] TruthfulQA (gen)
- [ ] MuTual
- [ ] Hendrycks Math (Hailey)
- [ ] Asdiv
......@@ -54,7 +54,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] BIG-Bench (Hailey)
- [x] XStoryCloze
- [x] XWinograd
- [ ] PAWS-X (Lintang)
- [x] PAWS-X
- [x] XNLI
- [ ] MGSM (Lintang)
- [ ] SCROLLS
......
# Task-name
# ANLI
### Paper
Title: `Adversarial NLI: A New Benchmark for Natural Language Understanding`
Abstract: `https://arxiv.org/pdf/1910.14599.pdf`
Paper Link: https://arxiv.org/abs/1910.14599
Adversarial NLI (ANLI) is a dataset collected via an iterative, adversarial
human-and-model-in-the-loop procedure. It consists of three rounds that progressively
increase in difficulty and complexity, and each question-answer includes annotator-
provided explanations.
Homepage: `https://github.com/facebookresearch/anli`
Homepage: https://github.com/facebookresearch/anli
### Citation
......@@ -31,13 +30,18 @@ Homepage: `https://github.com/facebookresearch/anli`
}
```
### Subtasks
### Groups and Tasks
#### Groups
List or describe tasks defined in this folder, and their names here:
* `anli`: Evaluates `anli_r1`, `anli_r2`, and `anli_r3`
#### Tasks
* `anli_r1`: The data collected adversarially in the first round.
* `anli_r2`: The data collected adversarially in the second round, after training on the previous round's data.
* `anli_r3`: The data collected adversarially in the third round, after training on the previous multiple rounds of data.
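To try these tasks programmatically, something like the following should work (a sketch; the model type name and arguments are illustrative assumptions, so check the main README for the exact interface):

```python
from lm_eval import evaluator

# hedged sketch: evaluate a small HuggingFace model on the first ANLI round
results = evaluator.simple_evaluate(
    model="hf",                    # model type name is an assumption
    model_args="pretrained=gpt2",  # illustrative model choice
    tasks=["anli_r1"],
)
print(results["results"])
```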
### Checklist
For adding novel benchmarks/datasets to the library:
......
group:
- multiple_choice
- natural_language_inference
- nli
- adverserial
- anli
task: anli_r1
dataset_path: anli
dataset_name: null
......
group:
- multiple_choice
- natural_language_inference
- nli
- adverserial
include: anli_r1.yaml
task: anli_r2
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r2
validation_split: dev_r2
test_split: test_r2
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
- "True"
- "Neither"
- "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group:
- multiple_choice
- natural_language_inference
- nli
- adverserial
include: anli_r1.yaml
task: anli_r3
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r3
validation_split: dev_r3
test_split: test_r3
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
- "True"
- "Neither"
- "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
# ARC
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
https://arxiv.org/pdf/1803.05457.pdf
### Paper
Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Abstract: https://arxiv.org/abs/1803.05457
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
......@@ -13,7 +16,9 @@ a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questi
Homepage: https://allenai.org/data/arc
### Citation
```
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
......@@ -23,3 +28,27 @@ Homepage: https://allenai.org/data/arc
volume={abs/1803.05457}
}
```
### Groups and Tasks
#### Groups
* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`
#### Tasks
* `arc_easy`
* `arc_challenge`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: arc_easy.yaml
group:
- ai2_arc
- multiple_choice
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
group:
- ai2_arc
- multiple_choice
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
......
# Arithmetic
### Paper
Title: `Language Models are Few-Shot Learners`
Abstract: https://arxiv.org/abs/2005.14165
A small battery of 10 tests that involve asking language models a simple arithmetic
problem in natural language.
Homepage: https://github.com/openai/gpt-3/tree/master/data
### Citation
```
@inproceedings{NEURIPS2020_1457c0d6,
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {1877--1901},
publisher = {Curran Associates, Inc.},
title = {Language Models are Few-Shot Learners},
url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
volume = {33},
year = {2020}
}
```
### Groups and Tasks
#### Groups
* `arithmetic`: Evaluates `1dc` to `5ds`
#### Tasks
* `arithmetic_1dc`
* `arithmetic_2da`
* `arithmetic_2dm`
* `arithmetic_2ds`
* `arithmetic_3da`
* `arithmetic_3ds`
* `arithmetic_4da`
* `arithmetic_4ds`
* `arithmetic_5da`
* `arithmetic_5ds`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# bAbI
### Paper
Title: Towards ai-complete question answering: A set of prerequisite toy tasks
Abstract: https://arxiv.org/abs/1502.05698
One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
Homepage: https://github.com/facebookarchive/bAbI-tasks
### Citation
```
@article{weston2015towards,
title={Towards ai-complete question answering: A set of prerequisite toy tasks},
author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1502.05698},
year={2015}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `babi`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- greedy_until
task: babi
dataset_path: Muennighoff/babi
dataset_name: null
......