Merge branch 'big-refactor' into update_docs

Commit 767c58b9 authored Aug 16, 2023 by lintangsutawika
20 changed files
--- a/README.md
+++ b/README.md
@@ -33,7 +33,6 @@ To install the `lm-eval` refactor branch from the github repository, run:
 ```bash
 git clone https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
-git checkout big-refactor
 pip install -e .
 ```

@@ -49,6 +48,13 @@ To support loading GPTQ quantized models, install the package with the `gptq` ex
 pip install -e ".[gptq]"
 ```

+
+To install the package with all extras, run:
+```bash
+pip install -e ".[all]"
+```
+
+
 ## Support

 The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
@@ -93,6 +99,8 @@ python main.py \
    --batch_size auto:4
 ```

+Alternatively, you can use the `lm-eval` command instead of `python main.py` to run the harness from any directory.
+
 ### Multi-GPU Evaluation with Hugging Face `accelerate`

 To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation.
@@ -128,30 +136,43 @@ Using this setting helps for massive models like BLOOM which require, or to avoi

 **Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**

+To use `accelerate` with the `lm-eval` command, run:
+```bash
+accelerate launch --no_python lm-eval --model ...
+```
+
 ### Commercial APIs

-Our library also supports language models served via the OpenAI API:
+Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for common performant local/self-hosted inference servers.
+
+A full accounting of the supported and planned libraries and APIs can be seen below:
+
+| API or Inference Server     | Implemented?                    | `--model <xxx>` name                                                             | Models supported:                    | Request Types:                                           |
+|-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------|
+| OpenAI Completions          | :heavy_check_mark:              | `openai`, `openai-completions`, `gooseai`                                        | up to `code-davinci-002`             | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| OpenAI ChatCompletions      | :x: Not yet - needs help!       | N/A                                                                              | (link here?)                         | `greedy_until` (no logprobs)                             |
+| Anthropic                   | :heavy_check_mark:              | `anthropic`                                                                      | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model)         | `greedy_until` (no logprobs)                             |
+| GooseAI                     | :heavy_check_mark: (not separately maintained)  | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) |                                      | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| Textsynth                   | Needs testing                   | `textsynth`                                                                      | ???                                  | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| Cohere                      | :hourglass: - blocked on Cohere API bug | N/A                                                                              | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| GGML                        | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617)              | N/A                                                                              | ???                                  | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| vLLM                        | :x: Not yet - needs help!       | N/A                                                                              | All HF models                        | `greedy_until` (no logprobs)                             |
+| Your inference server here! | ...                             | ...                                                                              | ...                                  | ...                                                      |
+
+It is on our roadmap to create task variants so that models which do not serve logprobs/loglikelihoods can be compared against open-source models on generation performance.
+
+Our library supports language models served via the OpenAI Completions API as follows:

 ```bash
 export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
 python main.py \
-    --model openai \
+    --model openai-completions \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag
 ```

 While this functionality is only officially maintained for the official OpenAI API, it tends to also work for other hosting services that use the same API such as [goose.ai](goose.ai) with minor modification. We also have an implementation for the [TextSynth](https://textsynth.com/index.html) API, using `--model textsynth`.

-To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
-
-```bash
-python main.py \
-    --model openai \
-    --model_args engine=davinci \
-    --tasks lambada_openai,hellaswag \
-    --check_integrity
-```
-
 ### Other Frameworks

 A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
@@ -172,6 +193,16 @@ python write_out.py \

 This will write out one text file for each task.

+To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
+
+```bash
+python main.py \
+    --model openai \
+    --model_args engine=davinci \
+    --tasks lambada_openai,hellaswag \
+    --check_integrity
+```
+
 ## Advanced Usage

 For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:
@@ -201,6 +232,14 @@ To implement a new task in the eval harness, see [this guide](./docs/new_task_gu

 As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.

+## How to Contribute or Learn More?
+
+For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
+
+
+You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
+
+
 ## Cite as

 ```

--- a/docs/model_guide.md
+++ b/docs/model_guide.md
@@ -36,15 +36,19 @@ The LM class enforces a common interface via which we can extract responses from
 ```python
 class MyCustomLM(LM):
    #...
-    def loglikelihood(self, requests):
+    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
+        #...


-    def loglikelihood_rolling(self, requests):
+    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
+        #...


-    def greedy_until(self, requests):
+    def greedy_until(self, requests: list[Instance]) -> list[str]:
+        #...
    #...
 ```
+Here, `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) whose `args` property returns the request arguments as a tuple, for example (context, continuation) for `loglikelihood` or (context, until) for `greedy_until`.

 We support


--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -235,3 +235,89 @@ Generative tasks:

 Tasks using complex filtering:
 - GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
+
+
+## Benchmarks
+
+When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. In that case, it can be cumbersome to list every task or to add a new group name to the yaml of each individual task.
+
+To solve this, we can create a benchmark yaml config. This is a config that contains the names of the tasks that should be included in a particular benchmark. The config consists of two main keys: `group`, which denotes the name of the benchmark, and `task`, which lists the tasks. The tasks listed in `task` are the task names that have been registered. A good example is the list of tasks used to evaluate the Pythia Suite.
+
+```yaml
+group: pythia
+task:
+  - lambada_openai
+  - wikitext
+  - piqa
+  - sciq
+  - wsc
+  - winogrande
+  - arc
+  - logiqa
+  - blimp
+  - hendrycksTest*
+```
+
+Alternatively, a benchmark can define its tasks inline, each with its own configuration, using the same keys as a regular task yaml.
+
+```yaml
+group: t0_eval
+task:
+  # Coreference Resolution
+  - dataset_path: super_glue
+    dataset_name: wsc.fixed
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Coreference Resolution
+  - dataset_path: winogrande
+    dataset_name: winogrande_xl
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  ...
+```
+
+If the benchmark uses the same dataset in several configurations, give each entry a distinct `task` name to tell them apart. For example, T0-Eval evaluates on 3 versions of ANLI, but the Hugging Face dataset collects them in one dataset.
+
+```yaml
+group: t0_eval
+task:
+  ...
+  - task: anli_r1
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r1
+    validation_split: dev_r1
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r2
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r2
+    validation_split: dev_r2
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+```
+
+Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/benchmarks/`.
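
Reviewer's sketch, not part of this commit: the `task` naming rule above can be checked before a benchmark config is loaded. The helper `find_ambiguous_configs` is hypothetical and assumes the yaml has already been parsed into plain Python lists and dicts; it only flags inline entries that share a dataset and carry no `task` name, and makes no claim about how the harness itself reacts to them.

```python
from collections import Counter


def find_ambiguous_configs(task_entries):
    """Return (dataset_path, dataset_name) pairs used by more than one
    inline config that has no explicit `task` name (hypothetical helper)."""
    unnamed = [
        (entry.get("dataset_path"), entry.get("dataset_name"))
        for entry in task_entries
        # Plain strings are registered task names, so only dicts are inspected.
        if isinstance(entry, dict) and "task" not in entry
    ]
    counts = Counter(unnamed)
    return [key for key, n in counts.items() if n > 1]


# Two ANLI rounds with distinct `task` names, as in the t0_eval example.
named = [
    {"task": "anli_r1", "dataset_path": "anli", "validation_split": "dev_r1"},
    {"task": "anli_r2", "dataset_path": "anli", "validation_split": "dev_r2"},
]
# The same entries without `task` names cannot be told apart by dataset alone.
unnamed = [
    {"dataset_path": "anli", "validation_split": "dev_r1"},
    {"dataset_path": "anli", "validation_split": "dev_r2"},
]

print(find_ambiguous_configs(named))    # []
print(find_ambiguous_configs(unnamed))  # [('anli', None)]
```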
--- a/lm_eval/api/model.py
+++ b/lm_eval/api/model.py
 import abc
 import os

-from typing import Union
+from typing import Union, List, Tuple
 from sqlitedict import SqliteDict
 import json
 import hashlib
@@ -25,31 +25,32 @@ class LM(abc.ABC):
        self.cache_hook = CacheHook(None)

    @abc.abstractmethod
-    def loglikelihood(self, requests):
+    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        """Compute log-likelihood of generating a continuation from a context.
        Downstream tasks should attempt to use loglikelihood instead of other
        LM calls whenever possible.

-        :param requests: list
-            A list of pairs (context, continuation)
-            context: str
+        :param requests: list[Instance]
+            A list of Instance objects, with property `args` which returns a tuple (context, continuation).
+            `context: str`
                Context string. Implementations of LM must be able to handle an
                empty context string.
-            continuation: str
+            `continuation: str`
                The continuation over which log likelihood will be calculated. If
                there is a word boundary, the space should be in the continuation.
                For example, context="hello" continuation=" world" is correct.
-        :return: list
+
+        :return: list[tuple[float, bool]]
            A list of pairs (logprob, isgreedy)
-            logprob: float
-                The log probability of `continuation`
-            isgreedy:
-                Whether `continuation` would be generated by greedy sampling from `context`
+            `logprob: float`
+                The log probability of `continuation`.
+            `isgreedy`:
+                Whether `continuation` would be generated by greedy sampling from `context`.
        """
        pass

    @abc.abstractmethod
-    def loglikelihood_rolling(self, requests):
+    def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
        """Compute full log-likelihood of a string, with no truncation, for perplexity computation
        - We will use the full max context length of the model.
        - For inputs that exceed the max context length, we divide the tokenized string into chunks of up to
@@ -77,11 +78,11 @@ class LM(abc.ABC):
            1. Each token is predicted exactly once
            2. For the last pair, we provide the full context, but only score the last two tokens

-        :param requests: list
-            A list of strings
+        :param requests: list[Instance]
+            A list of Instance objects with property `args` which returns a one-element tuple (string,).
            string: str
                String for which we are computing per-token loglikelihood
-        :return: list
+        :return: list[tuple[float, bool]]
            A list of pairs (logprob, isgreedy)
            logprob: float
                The log probability of `continuation`
@@ -92,17 +93,17 @@ class LM(abc.ABC):

    # TODO: Add an optional max length
    @abc.abstractmethod
-    def greedy_until(self, requests):
+    def greedy_until(self, requests) -> List[str]:
        """Generate greedily until a stopping sequence

-        :param requests: list
-            A list of pairs (context, until)
+        :param requests: list[Instance]
+            A list of Instance objects with property `args` which returns a tuple (context, until).
            context: str
                Context string
            until: [str]
                The string sequences to generate until. These string sequences
                may each span across multiple tokens, or may be part of one token.
-        :return: list
+        :return: list[str]
            A list of strings continuation
            continuation: str
                The generated continuation.
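
Reviewer's sketch, not part of this commit: the three request types documented above can be illustrated with a toy model. Both names here are assumptions made for the example: `Instance` is a stand-in dataclass that models only the `args` property, and `ToyLM` does not subclass the real `LM`. The scores and the "generation" are fabricated placeholders that only show the shape of the inputs and return values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """Stand-in for lm_eval.api.instance.Instance; only `args` is modeled."""
    args: tuple


class ToyLM:
    """Toy model with the same three methods; scores are placeholders."""

    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        results = []
        for request in requests:
            context, continuation = request.args
            # Placeholder score: one unit of negative log-probability per character.
            results.append((-float(len(continuation)), False))
        return results

    def loglikelihood_rolling(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
        # Assumption: the string to score is the first element of `args`.
        return [(-float(len(request.args[0])), False) for request in requests]

    def greedy_until(self, requests: List[Instance]) -> List[str]:
        outputs = []
        for request in requests:
            context, until = request.args
            text = "4\n\nQ: next question"  # placeholder "generation"
            for stop in until:
                if stop:  # str.split rejects an empty separator
                    text = text.split(stop)[0]
            outputs.append(text)
        return outputs


lm = ToyLM()
print(lm.loglikelihood([Instance(args=("hello", " world"))]))  # [(-6.0, False)]
print(lm.greedy_until([Instance(args=("Q: 2+2=", ["\n\n"]))]))  # ['4']
```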

--- a/lm_eval/api/samplers.py
+++ b/lm_eval/api/samplers.py
@@ -48,7 +48,9 @@ class Sampler:
                    )
                    + self.target_delimiter
                    + (
-                        self.doc_to_target(doc)
+                        self.doc_to_target(doc)[0]
+                        if type(self.doc_to_target(doc)) is list
+                        else self.doc_to_target(doc)
                        if (
                            self.config.doc_to_choice is None
                            or type(self.doc_to_target(doc)) is str

--- a/lm_eval/api/task.py
+++ b/lm_eval/api/task.py
@@ -13,7 +13,7 @@ from tqdm import tqdm
 import datasets
 import numpy as np

-from typing import Union
+from typing import Union, List, Any, Tuple, Literal
 from collections.abc import Callable

 from lm_eval import utils
@@ -465,8 +465,11 @@ class Task(abc.ABC):
        elif type(example) == list:
            return [labeled_examples + ex for ex in example]
        elif type(example) == int:
-            choices = self.doc_to_choice(doc)
-            return labeled_examples + choices[example]
+            if self._config.doc_to_choice is not None:
+                choices = self.doc_to_choice(doc)
+                return labeled_examples + choices[example]
+            else:
+                return labeled_examples + str(example)

    def apply_filters(self):

@@ -477,7 +480,7 @@ class Task(abc.ABC):
            eval_logger.warning("No filter defined, passing through instances")
            return self._instances

-    def dump_config(self):
+    def dump_config(self) -> dict:
        """Returns a dictionary representing the task's config.

        :returns: str
@@ -489,14 +492,13 @@ class Task(abc.ABC):


 class ConfigurableTask(Task):
-
    VERSION = "Yaml"
    OUTPUT_TYPE = None
    CONFIG = None

    def __init__(
        self, data_dir=None, cache_dir=None, download_mode=None, config: dict = None
-    ):
+    ):  # TODO no super() call here
        # Get pre-configured attributes
        self._config = self.CONFIG

@@ -662,25 +664,25 @@ class ConfigurableTask(Task):
            **dataset_kwargs if dataset_kwargs is not None else {},
        )

-    def has_training_docs(self):
+    def has_training_docs(self) -> bool:
        if self._config.training_split is not None:
            return True
        else:
            return False

-    def has_validation_docs(self):
+    def has_validation_docs(self) -> bool:
        if self._config.validation_split is not None:
            return True
        else:
            return False

-    def has_test_docs(self):
+    def has_test_docs(self) -> bool:
        if self._config.test_split is not None:
            return True
        else:
            return False

-    def training_docs(self):
+    def training_docs(self) -> datasets.Dataset:
        if self.has_training_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(
@@ -688,7 +690,7 @@ class ConfigurableTask(Task):
                )
            return self.dataset[self._config.training_split]

-    def validation_docs(self):
+    def validation_docs(self) -> datasets.Dataset:
        if self.has_validation_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(
@@ -696,7 +698,7 @@ class ConfigurableTask(Task):
                )
            return self.dataset[self._config.validation_split]

-    def test_docs(self):
+    def test_docs(self) -> datasets.Dataset:
        if self.has_test_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(self.dataset[self._config.test_split])
@@ -762,12 +764,17 @@ class ConfigurableTask(Task):
            return doc_to_text(doc)
        # Used when applying a Promptsource template
        elif hasattr(doc_to_text, "apply"):
-            return doc_to_text.apply(doc)[0]
+            applied_prompt = doc_to_text.apply(doc)
+            if len(applied_prompt) == 2:
+                return applied_prompt[0]
+            else:
+                eval_logger.warning("Applied prompt returns empty string")
+                return self._config.fewshot_delimiter
        else:
            print(type(doc_to_text))
            raise TypeError

-    def doc_to_target(self, doc):
+    def doc_to_target(self, doc: dict) -> Union[int, str, list]:

        if self.prompt is not None:
            doc_to_target = self.prompt
@@ -786,17 +793,30 @@ class ConfigurableTask(Task):
                target_string = utils.apply_template(doc_to_target, doc)
                if target_string.isdigit():
                    return ast.literal_eval(target_string)
+                elif (
+                    len(target_string) >= 2
+                    and (target_string[0] == "[")
+                    and (target_string[-1] == "]")
+                ):
+                    return ast.literal_eval(target_string)
                else:
                    return target_string
+        elif type(doc_to_target) == list:
+            return doc_to_target
        elif callable(doc_to_target):
            return doc_to_target(doc)
        # Used when applying a Promptsource template
        elif hasattr(doc_to_target, "apply"):
-            return doc_to_target.apply(doc)[1]
+            applied_prompt = doc_to_target.apply(doc)
+            if len(applied_prompt) == 2:
+                return applied_prompt[1]
+            else:
+                eval_logger.warning("Applied prompt returns empty string")
+                return self._config.fewshot_delimiter
        else:
            raise TypeError

-    def doc_to_choice(self, doc):
+    def doc_to_choice(self, doc: Any) -> List[str]:

        if self.prompt is not None:
            doc_to_choice = self.prompt
@@ -838,7 +858,9 @@ class ConfigurableTask(Task):
        else:
            raise TypeError

-    def construct_requests(self, doc, ctx, **kwargs):
+    def construct_requests(
+        self, doc: dict, ctx: str, **kwargs
+    ) -> Union[List[Instance], Instance]:

        if self.OUTPUT_TYPE == "loglikelihood":
            arguments = (ctx, self.doc_to_target(doc))
@@ -847,13 +869,14 @@ class ConfigurableTask(Task):
        elif self.OUTPUT_TYPE == "multiple_choice":

            choices = self.doc_to_choice(doc)
+            target_delimiter = self._config.target_delimiter
            if self.multiple_input:
                # If there are multiple inputs, choices are placed in the ctx
                cont = self.doc_to_target(doc)
-                arguments = [(ctx, " {}".format(cont)) for ctx in choices]
+                arguments = [(ctx, f"{target_delimiter}{cont}") for ctx in choices]
            else:
                # Otherwise they are placed in the continuation
-                arguments = [(ctx, " {}".format(cont)) for cont in choices]
+                arguments = [(ctx, f"{target_delimiter}{cont}") for cont in choices]

            request_list = [
                Instance(
@@ -986,9 +1009,13 @@ class ConfigurableTask(Task):
        elif self.OUTPUT_TYPE == "greedy_until":

            gold = self.doc_to_target(doc)
-            if type(gold) == int:
+            if self._config.doc_to_choice is not None:
+                # If you set doc_to_choice,
+                # it assumes that doc_to_target returns a number.
                choices = self.doc_to_choice(doc)
                gold = choices[gold]
+            else:
+                gold = str(gold)

            for key, result in zip(self._metric_fn_list.keys(), results):
                if self.multiple_target:
@@ -1007,20 +1034,20 @@ class ConfigurableTask(Task):
                            res = res[key]
                        scores.append(res)
                    if any(scores):
-                        result = 1.0
+                        result_score = 1.0
                    else:
-                        result = 0.0
+                        result_score = 0.0
                else:
-                    result = self._metric_fn_list[key](
+                    result_score = self._metric_fn_list[key](
                        references=[gold],
                        predictions=[result],
                        **self._metric_fn_kwargs[key],
                    )

-                if isinstance(result, dict):
-                    result_dict.update(result)
+                if isinstance(result_score, dict):
+                    result_dict.update(result_score)
                else:
-                    result_dict[key] = result
+                    result_dict[key] = result_score
        else:
            raise ValueError(
                f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
@@ -1037,13 +1064,12 @@ class ConfigurableTask(Task):


 class MultipleChoiceTask(Task):
-
    OUTPUT_TYPE: str = "loglikelihood"

-    def doc_to_target(self, doc):
+    def doc_to_target(self, doc: dict) -> str:
        return " " + doc["choices"][doc["gold"]]

-    def construct_requests(self, doc, ctx, **kwargs):
+    def construct_requests(self, doc: dict, ctx: str, **kwargs) -> List[Instance]:
        # TODO: add mutual info here?
        return [
            Instance(
@@ -1056,7 +1082,7 @@ class MultipleChoiceTask(Task):
            for i, choice in enumerate(doc["choices"])
        ]

-    def process_results(self, doc, results):
+    def process_results(self, doc: dict, results: List[Tuple[float, bool]]) -> dict:
        results = [
            res[0] for res in results
        ]  # only retain loglikelihoods, discard is_greedy TODO: do we need is_greedy anywhere?
@@ -1071,13 +1097,13 @@ class MultipleChoiceTask(Task):
            "acc_norm": acc_norm,
        }

-    def higher_is_better(self):
+    def higher_is_better(self) -> dict:
        return {
            "acc": True,
            "acc_norm": True,
        }

-    def aggregation(self):
+    def aggregation(self) -> dict:
        return {
            "acc": mean,
            "acc_norm": mean,
@@ -1085,24 +1111,23 @@ class MultipleChoiceTask(Task):


 class PerplexityTask(Task):
-
    OUTPUT_TYPE = "loglikelihood_rolling"

-    def has_training_docs(self):
+    def has_training_docs(self) -> bool:
        return False

-    def fewshot_examples(self, k, rnd):
+    def fewshot_examples(self, k: int, rnd) -> List:
        assert k == 0
        return []

-    def fewshot_context(self, doc, num_fewshot):
+    def fewshot_context(self, doc: dict, num_fewshot: int) -> Literal[""]:
        assert (
            num_fewshot == 0
        ), "The number of fewshot examples must be 0 for perplexity tasks."

        return ""

-    def higher_is_better(self):
+    def higher_is_better(self) -> dict:
        return {
            "word_perplexity": False,
            "byte_perplexity": False,
@@ -1118,7 +1143,7 @@ class PerplexityTask(Task):
    def doc_to_target(self, doc):
        return doc

-    def construct_requests(self, doc, ctx, **kwargs):
+    def construct_requests(self, doc: dict, ctx: Union[str, None], **kwargs):
        assert not ctx

        return Instance(
@@ -1129,7 +1154,7 @@ class PerplexityTask(Task):
            **kwargs,
        )

-    def process_results(self, doc, results):
+    def process_results(self, doc: dict, results: Tuple[float]) -> dict:
        (loglikelihood,) = results
        words = self.count_words(self.doc_to_target(doc))
        bytes_ = self.count_bytes(self.doc_to_target(doc))
@@ -1139,7 +1164,7 @@ class PerplexityTask(Task):
            "bits_per_byte": (loglikelihood, bytes_),
        }

-    def aggregation(self):
+    def aggregation(self) -> dict:
        return {
            "word_perplexity": weighted_perplexity,
            "byte_perplexity": weighted_perplexity,
@@ -1147,10 +1172,10 @@ class PerplexityTask(Task):
        }

    @classmethod
-    def count_bytes(cls, doc):
+    def count_bytes(cls, doc) -> int:
        return len(doc.encode("utf-8"))

    @classmethod
-    def count_words(cls, doc):
+    def count_words(cls, doc) -> int:
        """Downstream tasks with custom word boundaries should override this!"""
        return len(re.split(r"\s+", doc))
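
Reviewer's sketch, not part of this commit: the string handling added to `ConfigurableTask.doc_to_target` in the hunk above can be read in isolation as follows. `parse_target` is a hypothetical standalone name; the branches mirror the diff, where an all-digit string becomes an `int`, a bracketed string becomes a `list`, and anything else is returned unchanged.

```python
import ast
from typing import Union


def parse_target(target_string: str) -> Union[int, list, str]:
    """Mirror the rendered-template branches of doc_to_target (hypothetical name)."""
    if target_string.isdigit():
        # "3" -> 3, typically an index into the answer choices.
        return ast.literal_eval(target_string)
    if (
        len(target_string) >= 2
        and target_string[0] == "["
        and target_string[-1] == "]"
    ):
        # "['yes', 'no']" -> ['yes', 'no'], i.e. several acceptable targets.
        return ast.literal_eval(target_string)
    return target_string


print(parse_target("3"))              # 3
print(parse_target("['yes', 'no']"))  # ['yes', 'no']
print(parse_target("entailment"))     # entailment
```

One consequence worth noting for review: a bracketed string that is not a valid Python literal, such as `[abc]`, makes `ast.literal_eval` raise instead of falling back to the raw string.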
--- a/lm_eval/benchmarks/__init__.py
+++ b/lm_eval/benchmarks/__init__.py
+import os
+import yaml
+
+from lm_eval import utils
+from lm_eval.tasks import register_configurable_task, check_prompt_config
+from lm_eval.logger import eval_logger
+from lm_eval.api.registry import (
+    TASK_REGISTRY,
+    GROUP_REGISTRY,
+    ALL_TASKS,
+)
+
+
+def include_benchmarks(task_dir):
+
+    for root, subdirs, file_list in os.walk(task_dir):
+        if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
+            for f in file_list:
+                if f.endswith(".yaml"):
+                    try:
+                        benchmark_path = os.path.join(root, f)
+
+                        with open(benchmark_path, "rb") as file:
+                            yaml_config = yaml.full_load(file)
+
+                        assert "group" in yaml_config
+                        group = yaml_config["group"]
+                        all_task_list = yaml_config["task"]
+                        config_list = [
+                            task for task in all_task_list if type(task) != str
+                        ]
+                        task_list = [
+                            task for task in all_task_list if type(task) == str
+                        ]
+
+                        for task_config in config_list:
+                            var_configs = check_prompt_config(
+                                {
+                                    **task_config,
+                                    **{"group": group},
+                                }
+                            )
+                            for config in var_configs:
+                                register_configurable_task(config)
+
+                        task_names = utils.pattern_match(task_list, ALL_TASKS)
+                        for task in task_names:
+                            if task in TASK_REGISTRY:
+                                if group in GROUP_REGISTRY:
+                                    GROUP_REGISTRY[group].append(task)
+                                else:
+                                    GROUP_REGISTRY[group] = [task]
+                                    ALL_TASKS.add(group)
+                    except Exception as error:
+                        eval_logger.warning(
+                            "Failed to load benchmark in\n"
+                            f"                                 {benchmark_path}\n"
+                            "                                 Benchmark will not be added to registry\n"
+                            f"                                 Error: {error}"
+                        )
+
+
+task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
+include_benchmarks(task_dir)
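
Reviewer's sketch, not part of this commit: the grouping step of `include_benchmarks` can be reduced to the following. `register_group` is a hypothetical helper, and it assumes that `utils.pattern_match` performs shell-style wildcard matching, which is approximated here with the standard library's `fnmatch`. Inline dict configs are skipped because the real function registers them separately.

```python
import fnmatch
from typing import Dict, Iterable, List


def register_group(
    group: str,
    task_entries: list,
    known_tasks: Iterable[str],
    group_registry: Dict[str, List[str]],
) -> List[str]:
    """Expand wildcard task names and append the matches to the group."""
    known = list(known_tasks)
    # Only plain strings are names or patterns; dicts are inline configs.
    patterns = [entry for entry in task_entries if isinstance(entry, str)]
    matched = sorted(
        {name for pattern in patterns for name in fnmatch.filter(known, pattern)}
    )
    for name in matched:
        group_registry.setdefault(group, []).append(name)
    return matched


known = ["arc_easy", "arc_challenge", "piqa", "hendrycksTest-anatomy"]
registry: Dict[str, List[str]] = {}
entries = ["piqa", "arc_*", "hendrycksTest*", {"dataset_path": "anli"}]

print(register_group("pythia", entries, known, registry))
# ['arc_challenge', 'arc_easy', 'hendrycksTest-anatomy', 'piqa']
```

Note that a bare name such as `arc` matches nothing in this sketch unless a task with exactly that name is in `known`, whereas the pattern `arc_*` expands to both ARC variants.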
--- a/lm_eval/tasks/benchmarks/pythia.yaml
+++ b/lm_eval/tasks/benchmarks/pythia.yaml
@@ -6,7 +6,7 @@ task:
  - sciq
  - wsc
  - winogrande
-  - arc_*
-  # - logiqa
-  # - blimp_*
-  # - hendrycksTest*
+  - arc
+  - logiqa
+  - blimp
+  - hendrycksTest*
--- a/lm_eval/benchmarks/t0_eval.yaml
+++ b/lm_eval/benchmarks/t0_eval.yaml
+group: t0_eval
+task:
+  # Coreference Resolution
+  - dataset_path: super_glue
+    dataset_name: wsc.fixed
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Coreference Resolution
+  - dataset_path: winogrande
+    dataset_name: winogrande_xl
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Natural Language Inference
+  - dataset_path: super_glue
+    dataset_name: cb
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    output_type: greedy_until
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - dataset_path: super_glue
+    dataset_name: rte
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r1
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r1
+    validation_split: dev_r1
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r2
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r2
+    validation_split: dev_r2
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r3
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r3
+    validation_split: dev_r3
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Sentence Completion
+  - dataset_path: super_glue
+    dataset_name: copa
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Sentence Completion
+  - dataset_path: hellaswag
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Word Sense Disambiguation
+  - dataset_path: super_glue
+    dataset_name: wic
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
--- a/lm_eval/evaluator.py
+++ b/lm_eval/evaluator.py
@@ -11,6 +11,7 @@ import numpy as np

 import lm_eval.api
 import lm_eval.tasks
+import lm_eval.benchmarks
 import lm_eval.models
 import lm_eval.api.metrics
 import lm_eval.api.registry
@@ -85,7 +86,9 @@ def simple_evaluate(
        1234
    )  # TODO: this may affect training runs that are run with evaluation mid-run.

-    assert tasks != [], "No tasks specified"
+    assert (
+        tasks != []
+    ), "No tasks specified, or no tasks found. Please verify the task names."

    if isinstance(model, str):
        if model_args is None:
@@ -251,7 +254,7 @@ def evaluate(
                    eval_logger.info(
                        f"Task: {task_name}; document {inst.doc_id}; context prompt (starting on next line):\n{inst.args[0]}\n(end of prompt on previous line)"
                    )
-                    eval_logger.info("Request:", inst)
+                    eval_logger.info(f"Request: {str(inst)}")

        # aggregate Instances by LM method requested to get output.
        reqtype = (

--- a/lm_eval/filters/__init__.py
+++ b/lm_eval/filters/__init__.py
@@ -8,6 +8,7 @@ FILTER_REGISTRY = {
    "regex": extraction.RegexFilter,
    "majority_vote": selection.MajorityVoteFilter,
    "take_first_k": selection.TakeKFilter,
+    "remove_whitespace": extraction.WhitespaceFilter,
    # TODO: implement this filter. either it should take in an arbitrary "scoring"/reward function
    # that takes an input and returns a scalar and then should select the max reward,
    # or should implement different filters for different ways of handling a reward model's inference.

--- a/lm_eval/filters/extraction.py
+++ b/lm_eval/filters/extraction.py
@@ -36,3 +36,26 @@ class RegexFilter(Filter):
        # print(filtered_resps)

        return filtered_resps
+
+
+class WhitespaceFilter(Filter):
+    """Strip a single leading space from each response."""
+
+    def __init__(self):
+        pass
+
+    def apply(self, resps):
+        def filter_set(inst):
+
+            filtered_resp = []
+            for resp in inst:
+                if resp.startswith(" "):
+                    resp = resp[1:]
+
+                filtered_resp.append(resp)
+
+            return filtered_resp
+
+        filtered_resps = [filter_set(resp) for resp in resps]
+
+        return filtered_resps
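
Reviewer note: `WhitespaceFilter.apply` removes at most one leading space per response and leaves other whitespace untouched. The sketch below reproduces that behavior as a free function so it can be run without the `Filter` base class. `strip_leading_space` is a hypothetical name used only for illustration.

```python
from typing import List


def strip_leading_space(resps: List[List[str]]) -> List[List[str]]:
    """Drop one leading space from every response, like WhitespaceFilter.apply."""
    return [
        [resp[1:] if resp.startswith(" ") else resp for resp in inst]
        for inst in resps
    ]


# Each inner list holds the responses for one document.
example = [[" yes", "no"], ["  maybe", "\tperhaps"]]
cleaned = strip_leading_space(example)
print(cleaned)
```

Only the first space is removed, so a response with two leading spaces still starts with one, and tabs or newlines are kept as they are.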
--- a/lm_eval/models/anthropic_llms.py
+++ b/lm_eval/models/anthropic_llms.py
-import os
 from lm_eval.api.model import LM
 from lm_eval.api.registry import register_model
 from tqdm import tqdm
 import time
 from lm_eval.logger import eval_logger
-from typing import List, Literal, Any
+from typing import List, Any, Tuple


 def anthropic_completion(
@@ -15,10 +14,25 @@ def anthropic_completion(
    temperature: float,
    stop: List[str],
    **kwargs: Any,
-):
-    """Query Anthropic API for completion.
-
-    Retry with back-off until they respond
+) -> str:
+    """Wrapper function around the Anthropic completion API client with exponential back-off
+    in case of RateLimitError.
+
+    params:
+        client: anthropic.Anthropic
+            Anthropic API client
+        model: str
+            Anthropic model e.g. 'claude-instant-v1', 'claude-2'
+        prompt: str
+            Prompt to feed to the model
+        max_tokens_to_sample: int
+            Maximum number of tokens to sample from the model
+        temperature: float
+            Sampling temperature
+        stop: List[str]
+            List of stop sequences
+        kwargs: Any
+            Additional model_args to pass to the API client
    """

    try:
@@ -29,7 +43,7 @@ def anthropic_completion(
 please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e .[anthropic]`",
        )

-    backoff_time = 3
+    backoff_time: float = 3
    while True:
        try:
            response = client.completions.create(
@@ -94,15 +108,15 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e

    @property
    def eot_token_id(self):
-        # Not sure but anthropic.AI_PROMPT -> [203, 203, 50803, 30]
+        # Not sure but anthropic.HUMAN_PROMPT ?
        raise NotImplementedError("No idea about anthropic tokenization.")

    @property
-    def max_length(self):
+    def max_length(self) -> int:
        return 2048

    @property
-    def max_gen_toks(self):
+    def max_gen_toks(self) -> int:
        return self.max_tokens_to_sample

    @property
@@ -124,14 +138,15 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
    def _loglikelihood_tokens(self, requests, disable_tqdm=False):
        raise NotImplementedError("No support for logits.")

-    def greedy_until(self, requests):
+    def greedy_until(self, requests) -> List[str]:
+
        if not requests:
            return []

-        requests = [req.args for req in requests]
+        _requests: List[Tuple[str, dict]] = [req.args for req in requests]

        res = []
-        for request in tqdm(requests):
+        for request in tqdm(_requests):
            try:
                inp = request[0]
                request_args = request[1]
@@ -145,16 +160,16 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
                    prompt=inp,
                    max_tokens_to_sample=max_gen_toks,
                    temperature=temperature,  # TODO: implement non-greedy sampling for Anthropic
-                    stop=until,
+                    stop=until,  # type: ignore
                    **self.kwargs,
                )
                res.append(response)

                self.cache_hook.add_partial("greedy_until", request, response)
-            except anthropic.APIConnectionError as e:  # noqa: F821
+            except anthropic.APIConnectionError as e:  # type: ignore # noqa: F821
                eval_logger.critical(f"Server unreachable: {e.__cause__}")
                break
-            except anthropic.APIStatusError as e:  # noqa: F821
+            except anthropic.APIStatusError as e:  # type: ignore # noqa: F821
                eval_logger.critical(f"API error {e.status_code}: {e.message}")
                break
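
Reviewer note: the new docstring for `anthropic_completion` describes exponential back-off on `RateLimitError`, but the retry loop itself is outside this hunk. The following is a generic sketch of that pattern, not the code in this commit. `call_with_backoff`, the `multiplier` of 1.5 and the `max_attempts` cap are assumptions made for illustration. The diff only shows an initial `backoff_time` of 3 and a `while True` loop.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    initial_delay: float = 3.0,  # matches `backoff_time = 3` in the diff
    multiplier: float = 1.5,  # assumed growth factor, not taken from the diff
    max_attempts: int = 5,  # hypothetical cap; must be >= 1
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, sleeping for a growing delay after each retryable error."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= multiplier
    raise RuntimeError("max_attempts must be at least 1")


class FakeRateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception."""


attempts = {"n": 0}


def flaky() -> str:
    # Fails twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeRateLimitError()
    return "ok"


waits = []  # record delays instead of actually sleeping
result = call_with_backoff(flaky, (FakeRateLimitError,), sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the example fast and makes the delay sequence easy to inspect.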


--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -289,7 +289,9 @@ class HFLM(LM):
                        "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
                    )
            else:
-                self._model = accelerator.prepare(self.model)
+                self._model = accelerator.prepare_model(
+                    self.model, evaluation_mode=True
+                )
                self._device = torch.device(f"cuda:{accelerator.local_process_index}")
                self.accelerator = accelerator


--- a/lm_eval/models/openai_completions.py
+++ b/lm_eval/models/openai_completions.py
 import os
 import time
-import transformers
-
-import numpy as np
-
+from typing import List, Tuple
 from tqdm import tqdm
 from lm_eval import utils
 from lm_eval.api.model import LM
 from lm_eval.api.registry import register_model


-def get_result(response, ctxlen):
+def get_result(response: dict, ctxlen: int) -> Tuple[float, bool]:
    """Process results from OpenAI API response.

    :param response: dict
@@ -43,7 +40,13 @@ def oa_completion(**kwargs):

    Retry with back-off until they respond
    """
-    import openai
+    try:
+        import openai, tiktoken  # noqa: E401
+    except ModuleNotFoundError:
+        raise Exception(
+            "attempted to use 'openai' LM type, but package `openai` or `tiktoken` is not installed. \
+please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
+        )

    backoff_time = 3
    while True:
@@ -61,7 +64,12 @@ def oa_completion(**kwargs):
 class OpenaiCompletionsLM(LM):
    REQ_CHUNK_SIZE = 20

-    def __init__(self, engine, truncate=False):
+    def __init__(
+        self,
+        engine: str = "text-davinci-003",
+        truncate: bool = False,
+        batch_size: int = 1,
+    ):
        """

        :param engine: str
@@ -70,28 +78,25 @@ class OpenaiCompletionsLM(LM):
            Truncate input if too long (if False and input is too long, throw error)
        """
        super().__init__()
-
-        import openai
-
+        try:
+            import openai, tiktoken  # noqa: E401
+        except ModuleNotFoundError:
+            raise Exception(
+                "attempted to use 'openai' LM type, but package `openai` or `tiktoken` is not installed. \
+please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
+            )
        self.engine = engine
-        self.tokenizer = transformers.GPT2TokenizerFast.from_pretrained("gpt2")
-
-        self.vocab_size = self.tokenizer.vocab_size
-
-        # to make the annoying "Using pad_token, but it is not set yet." error go away
-        self.tokenizer.pad_token = "<|endoftext|>"
-        assert self.tokenizer.encode("hello\n\nhello") == [31373, 198, 198, 31373]
+        self.tokenizer = tiktoken.encoding_for_model(self.engine)
+        self.vocab_size = self.tokenizer.n_vocab
        self.truncate = truncate
-        self.end_of_text_token_id = self.tokenizer.convert_tokens_to_ids(
-            ["<|endoftext|>"]
-        )[0]
+        self.end_of_text_token_id = self.tokenizer.eot_token

        # Read from environment variable OPENAI_API_SECRET_KEY
        openai.api_key = os.environ["OPENAI_API_SECRET_KEY"]

    @property
    def eot_token_id(self):
-        return self.tokenizer.eos_token_id
+        return self.end_of_text_token_id

    @property
    def max_length(self):
@@ -112,19 +117,49 @@ class OpenaiCompletionsLM(LM):
        # Isn't used because we override _loglikelihood_tokens
        raise NotImplementedError()

-    def tok_encode(self, string: str):
-        return self.tokenizer.encode(string, add_special_tokens=False)
+    def tok_encode(self, string: str) -> List[int]:
+        return self.tokenizer.encode(string)

-    def tok_decode(self, tokens):
+    def tok_decode(self, tokens: List[int]) -> str:
        return self.tokenizer.decode(tokens)

-    def _loglikelihood_tokens(self, requests, disable_tqdm=False):
+    def _encode_pair(
+        self, context: str, continuation: str
+    ) -> Tuple[List[int], List[int]]:
+        n_spaces = len(context) - len(context.rstrip())
+        if n_spaces > 0:
+            continuation = context[-n_spaces:] + continuation
+            context = context[:-n_spaces]
+        whole_enc = self.tok_encode(context + continuation)
+        context_enc = self.tok_encode(context)
+        context_enc_len = len(context_enc)
+        continuation_enc = whole_enc[context_enc_len:]
+        return context_enc, continuation_enc
+
+    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
+        new_reqs = []
+        for context, continuation in [req.args for req in requests]:
+            if context == "":
+                # end of text as context
+                context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(
+                    continuation
+                )
+            else:
+                context_enc, continuation_enc = self._encode_pair(context, continuation)
+
+            new_reqs.append(((context, continuation), context_enc, continuation_enc))
+
+        return self._loglikelihood_tokens(new_reqs)
+
+    def _loglikelihood_tokens(
+        self, requests, disable_tqdm=False
+    ) -> List[Tuple[float, bool]]:
        res = []

        def _collate(x):
            # this doesn't efficiently handle last-token differences yet, but those are kinda annoying because
            # it's not guaranteed that the 100 or so logprobs we get to see actually contain all the continuations
-            # we care about and so we need some kind of backup for when it isn't
+            # we care about, and so we need some kind of backup for when it isn't
            toks = x[1] + x[2]
            return -len(toks), tuple(toks)

@@ -166,13 +201,13 @@ class OpenaiCompletionsLM(LM):
                # partial caching
                if cache_key is not None:
                    self.cache_hook.add_partial("loglikelihood", cache_key, answer)
-
        return re_ord.get_original(res)

-    def greedy_until(self, requests):
+    def greedy_until(self, requests) -> List[str]:
        if not requests:
            return []
        res = []
+        requests = [req.args for req in requests]

        def _collate(x):
            toks = self.tok_encode(x[0])
@@ -203,12 +238,7 @@ class OpenaiCompletionsLM(LM):
                inp = context_enc[-(self.max_length - self.max_gen_toks) :]
                inps.append(inp)

-            try:
-                until = request_args["until"][
-                    0
-                ]  # TODO: does this handle a list of stop seqs correctly?
-            except KeyError:
-                until = "<|endoftext|>"
+            until = request_args.get("until", ["<|endoftext|>"])

            response = oa_completion(
                engine=self.engine,
@@ -222,7 +252,7 @@ class OpenaiCompletionsLM(LM):
            for resp, (context, args_) in zip(response.choices, chunk):
                s = resp["text"]

-                until_ = args_.get(["until"], [])
+                until_ = args_.get("until", ["<|endoftext|>"])

                for term in until_:
                    if len(term) > 0:
@@ -234,7 +264,6 @@ class OpenaiCompletionsLM(LM):
                )

                res.append(s)
-
        return re_ord.get_original(res)

    def _model_call(self, inps):
@@ -244,3 +273,34 @@ class OpenaiCompletionsLM(LM):
    def _model_generate(self, context, max_length, eos_token_id):
        # Isn't used because we override greedy_until
        raise NotImplementedError()
+
+    def loglikelihood_rolling(self, requests) -> List[float]:
+        loglikelihoods = []
+
+        for (string,) in tqdm([req.args for req in requests]):
+            rolling_token_windows = list(
+                map(
+                    utils.make_disjoint_window,
+                    utils.get_rolling_token_windows(
+                        token_list=self.tok_encode(string),
+                        prefix_token=self.eot_token_id,
+                        max_seq_len=self.max_length,
+                        context_len=1,
+                    ),
+                )
+            )
+
+            # TODO: Right now, we pass single EOT token to the Encoder and the full context to the decoder, in seq2seq case
+            rolling_token_windows = [(None,) + x for x in rolling_token_windows]
+
+            string_nll = self._loglikelihood_tokens(
+                rolling_token_windows,
+                disable_tqdm=True,
+            )
+
+            # discard is_greedy
+            string_nll = [x[0] for x in string_nll]
+
+            string_nll = sum(string_nll)
+            loglikelihoods.append(string_nll)
+        return loglikelihoods
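
Reviewer note: `_encode_pair` moves any trailing whitespace from the context onto the continuation before tokenizing, then takes the continuation tokens as the tail of the jointly encoded string. The sketch below shows that logic with a toy character-level tokenizer in place of `tiktoken`. `toy_encode`, `toy_decode` and `encode_pair` are hypothetical names, and a real BPE tokenizer would give different token ids.

```python
from typing import List, Tuple


def toy_encode(text: str) -> List[int]:
    """Toy tokenizer: one token per character (stand-in for tiktoken)."""
    return [ord(ch) for ch in text]


def toy_decode(tokens: List[int]) -> str:
    return "".join(chr(tok) for tok in tokens)


def encode_pair(context: str, continuation: str) -> Tuple[List[int], List[int]]:
    """Mirror of `_encode_pair`: trailing context whitespace joins the continuation."""
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]
    whole_enc = toy_encode(context + continuation)
    context_enc = toy_encode(context)
    # The continuation is whatever follows the context tokens in the joint encoding.
    continuation_enc = whole_enc[len(context_enc):]
    return context_enc, continuation_enc


ctx_enc, cont_enc = encode_pair("Question: 2+2=? Answer: ", "4")
print(repr(toy_decode(ctx_enc)), repr(toy_decode(cont_enc)))
```

With a subword tokenizer the point of this step is that a space usually belongs to the token that follows it, so scoring `" 4"` rather than `"4"` matches how the model would naturally continue the prompt.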
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -3,41 +3,41 @@ This list keeps track of which tasks' implementations have been ported to YAML /

 Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation. (WIP) Denotes that there exists a PR or person working on this task already.

- [ ] Glue (Lintang)
+- [x] Glue
 - [x] SuperGlue
- [ ] CoQA
- [ ] DROP
+- [ ] CoQA (Lintang)
+- [ ] DROP (Lintang)
 - [x] ~~Lambada~~
 - [x] Lambada (Cloze variants)
 - [x] ~~Lambada (Multilingual)~~
 - [x] Wikitext
 - [x] PiQA
 - [x] PROST
- [ ] MCTACO
+- [x] MCTACO
 - [x] Pubmed QA
 - [x] SciQ
 - [ ] QASPER
 - [x] QA4MRE
- [ ] TriviaQA
+- [x] TriviaQA
 - [x] AI2 ARC
- [ ] LogiQA [(WIP)](https://github.com/EleutherAI/lm-evaluation-harness/pull/711)
+- [x] LogiQA
 - [x] HellaSwag
 - [x] SWAG
 - [x] OpenBookQA
- [ ] SQuADv2
+- [ ] SQuADv2 (Lintang)
 - [x] RACE
 - [x] HeadQA
 - [x] MathQA
- [ ] WebQs
- [ ] WSC273
+- [x] WebQs
+- [ ] WSC273 (Lintang)
 - [x] Winogrande
 - [x] ANLI
 - [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
 - [x] TruthfulQA (mc1)
- [ ] TruthfulQA (mc2)
- [ ] TruthfulQA (gen)
+- [x] TruthfulQA (mc2)
+- [x] TruthfulQA (gen)
 - [ ] MuTual
- [ ] Hendrycks Math
+- [ ] Hendrycks Math (Hailey)
 - [ ] Asdiv
 - [ ] GSM8k
 - [x] Arithmetic
@@ -45,20 +45,20 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [ ] Translation (WMT) suite (Hailey)
 - [x] Unscramble
 - [x] ~~Pile (perplexity)~~
- [ ] BLiMP (Lintang)
+- [x] BLiMP
 - [x] ToxiGen
- [ ] StoryCloze
- [ ] NaturalQs
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
- [ ] XStoryCloze
+- [x] StoryCloze
+- [ ] NaturalQs (Hailey)
+- [x] CrowS-Pairs
+- [x] XCopa
+- [ ] BIG-Bench (Hailey)
+- [x] XStoryCloze
 - [x] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM
+- [x] PAWS-X
+- [x] XNLI
+- [ ] MGSM (Lintang)
 - [ ] SCROLLS
- [ ] Babi
+- [x] Babi

 # Novel Tasks
 Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.

--- a/lm_eval/tasks/__init__.py
+++ b/lm_eval/tasks/__init__.py
@@ -44,7 +44,7 @@ def check_prompt_config(config):
        prompt_list = prompts.load_prompt_list(
            use_prompt=config["use_prompt"],
            dataset_name=config["dataset_path"],
-            subset_name=config["dataset_name"],
+            subset_name=config["dataset_name"] if "dataset_name" in config else None,
        )
        for idx, prompt_variation in enumerate(prompt_list):
            all_configs.append(
@@ -54,7 +54,9 @@ def check_prompt_config(config):
                    **{
                        "task": "_".join(
                            [
-                                get_task_name_from_config(config),
+                                config["task"]
+                                if "task" in config
+                                else get_task_name_from_config(config),
                                prompt_variation,
                            ]
                        )
@@ -98,58 +100,8 @@ def include_task_folder(task_dir):
                        )


-def include_benchmarks(task_dir, benchmark_dir="benchmarks"):
-
-    for root, subdirs, file_list in os.walk(os.path.join(task_dir, benchmark_dir)):
-        if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
-            for f in file_list:
-                if f.endswith(".yaml"):
-                    try:
-                        benchmark_path = os.path.join(root, f)
-
-                        with open(benchmark_path, "rb") as file:
-                            yaml_config = yaml.full_load(file)
-
-                        assert "group" in yaml_config
-                        group = yaml_config["group"]
-                        all_task_list = yaml_config["task"]
-                        config_list = [
-                            task for task in all_task_list if type(task) != str
-                        ]
-                        task_list = [
-                            task for task in all_task_list if type(task) == str
-                        ]
-
-                        for task_config in config_list:
-                            var_configs = check_prompt_config(
-                                {
-                                    **task_config,
-                                    **{"group": group},
-                                }
-                            )
-                            for config in var_configs:
-                                register_configurable_task(config)
-
-                        task_names = utils.pattern_match(task_list, ALL_TASKS)
-                        for task in task_names:
-                            if task in TASK_REGISTRY:
-                                if group in GROUP_REGISTRY:
-                                    GROUP_REGISTRY[group].append(task)
-                                else:
-                                    GROUP_REGISTRY[group] = [task]
-                                    ALL_TASKS.add(group)
-                    except Exception as error:
-                        eval_logger.warning(
-                            "Failed to load benchmark in\n"
-                            f"                                 {benchmark_path}\n"
-                            "                                 Benchmark will not be added to registry\n"
-                            f"                                 Error: {error}"
-                        )
-
-
 task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
 include_task_folder(task_dir)
-include_benchmarks(task_dir)


 def get_task(task_name, config):

--- a/lm_eval/tasks/anli/README.md
+++ b/lm_eval/tasks/anli/README.md
-# Task-name
+# ANLI

 ### Paper

 Title: `Adversarial NLI: A New Benchmark for Natural Language Understanding`

-Abstract: `https://arxiv.org/pdf/1910.14599.pdf`
+Paper Link: https://arxiv.org/abs/1910.14599

 Adversarial NLI (ANLI) is a dataset collected via an iterative, adversarial
 human-and-model-in-the-loop procedure. It consists of three rounds that progressively
 increase in difficulty and complexity, and each question-answer includes annotator-
 provided explanations.

-Homepage: `https://github.com/facebookresearch/anli`
-
+Homepage: https://github.com/facebookresearch/anli

 ### Citation

@@ -31,13 +30,18 @@ Homepage: `https://github.com/facebookresearch/anli`
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups

-List or describe tasks defined in this folder, and their names here:
+* `anli`: Evaluates `anli_r1`, `anli_r2`, and `anli_r3`
+
+#### Tasks
 * `anli_r1`: The data collected adversarially in the first round.
 * `anli_r2`: The data collected adversarially in the second round, after training on the previous round's data.
 * `anli_r3`: The data collected adversarially in the third round, after training on the previous multiple rounds of data.

+
 ### Checklist

 For adding novel benchmarks/datasets to the library:

--- a/lm_eval/tasks/anli/anli_r1.yaml
+++ b/lm_eval/tasks/anli/anli_r1.yaml
 group:
-  - multiple_choice
-  - natural_language_inference
-  - nli
-  - adverserial
+  - anli
 task: anli_r1
 dataset_path: anli
 dataset_name: null

--- a/lm_eval/tasks/anli/anli_r2.yaml
+++ b/lm_eval/tasks/anli/anli_r2.yaml
-group:
-  - multiple_choice
-  - natural_language_inference
-  - nli
-  - adverserial
+include: anli_r1.yaml
 task: anli_r2
-dataset_path: anli
-dataset_name: null
-output_type: multiple_choice
 training_split: train_r2
 validation_split: dev_r2
 test_split: test_r2
-doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
-# True = entailment
-# False = contradiction
-# Neither = neutral
-doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
-doc_to_choice:
-  - "True"
-  - "Neither"
-  - "False"
-should_decontaminate: true
-doc_to_decontamination_query: premise
-metric_list:
-  - metric: acc
-    aggregation: mean
-    higher_is_better: true