We’d like your help to test it out! You can help by:
2. Porting tasks supported in the previous version of the harness to the new YAML configuration format. Please check out our [task implementation guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md) for more information.
If you choose to port a task not yet completed according to [our checklist](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/README.md), then you can contribute it by opening a PR containing [Refactor] in the name with:
- A command of the form `python main.py --model hf --model_args ..... --tasks <task name> ...` which will run the task in the `master` branch, and what the score is
- A command of the form `python main.py --model hf --model_args ..... --tasks <task name> ...` to run the task in your PR branch to `big-refactor`, and what the resulting score is, to show that we achieve equality between the two implementations.
Lastly, as we carry out this switch to the new version over the next week, we'll no longer be accepting new feature requests to the `master` branch beyond those already open, though we will be accepting bugfixes to the `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EAI discord for more information.
...
...
To install additional multilingual tokenization and text segmentation packages, install the package with the `multilingual` extra:
pip install-e".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `gptq` extra:
```bash
pip install-e".[auto-gptq]"
pip install-e".[gptq]"
```
## Basic Usage
...
...
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B), you can use the following command:
```bash
python main.py \
--model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
...
...
Additional arguments can be provided to the model constructor using the `--model_args` flag.
Models that are loaded via either `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) or `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.
### Multi-GPU Evaluation with Hugging Face `accelerate`
To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation.
The first is performed by launching evaluation via the `accelerate` library as follows:
```bash
accelerate launch main.py \
--model hf \
--tasks lambada_openai,arc_easy \
    --batch_size 16
```
This will perform *data-parallel evaluation*: that is, placing a **single full copy** of your model onto each available GPU and *splitting batches across GPUs* to evaluate on K GPUs K times faster than on one.
However, if your model *is too large to fit on a single one of your GPUs*, then we provide an alternative method to run these large models: use of the `parallelize` argument (passed via `--model_args`), which splits the model's weights across all available GPUs.
To pass even more advanced keyword arguments to `accelerate`, we allow for the following arguments as well:
- `device_map_option`: How to split model weights across available GPUs. Defaults to `"auto"`.
- `max_memory_per_gpu`: The maximum GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: The maximum amount of CPU memory to use when offloading model weights to RAM.
- `offload_folder`: A folder where model weights will be offloaded to disk if needed.
Using this setting helps for massive models like BLOOM, which require multiple GPUs' worth of memory to load, or to avoid exceeding your total system RAM (by default, with `accelerate launch`, one copy of the model per GPU is initialized in RAM before being moved to GPU, resulting in large RAM usage spikes around the start of the script that may cause errors such as `Killed`). However, it naively splits the model across GPUs, so only a single GPU performs work at any point in time; it is therefore much slower than launching with `accelerate launch`, possibly by a factor of the total number of GPUs.
> **Warning**: Choosing the wrong model may result in erroneous outputs without raising an error.
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**
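For example, a sketch of such an invocation (the model, task, and batch size shown are illustrative):
```bash
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,parallelize=True \
    --tasks hellaswag \
    --batch_size 16
```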
### Commercial APIs
...
...
This will write out one text file for each task.
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument.
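For example (a sketch; the adapter path is a placeholder):
```bash
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,peft=PATH_TO_PEFT_ADAPTER \
    --tasks lambada_openai
```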
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,gptq=NAME` (or `,gptq=True` for default names) in the `model_args` argument.
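For instance, a sketch where the model directory and weights file name are placeholders:
```bash
python main.py \
    --model hf \
    --model_args pretrained=PATH_TO_QUANTIZED_MODEL_DIR,gptq=model.safetensors \
    --tasks lambada_openai
```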
# Eval Harness Documentation
Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn how to add a new model, API, or model type to the library, as well as a quick explainer on the different ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md).
## Progress on Revamp
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
The `lm-evaluation-harness` is intended to be a model-agnostic framework for evaluating language models. We provide first-class support for HuggingFace `AutoModelForCausalLM` and `AutoModelForSeq2SeqLM` type models, but any model can be hooked up to the harness by implementing the interface described in this guide.
This guide may be of special interest to users who are using the library outside of the repository, by installing it from PyPI and calling `lm_eval.evaluator.evaluate()` to evaluate an existing model.
In order to properly evaluate a given LM, we require implementation of a wrapper class subclassing the `lm_eval.api.model.LM` class that defines how the Evaluation Harness should interface with your model. This guide walks through how to write this `LM` subclass and add it to the library!
## Setup
To get started contributing, go ahead and fork the main repo, clone it, create a branch for your model, and install the project requirements in your environment:
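A sketch of those steps (your fork URL and branch name are placeholders):
```sh
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <my-model-branch>
pip install -e .
```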
Now, we'll create a new file where we'll be adding our model:
```sh
touch lm_eval/models/<my_model_filename>.py
```
**Tip: this filename should not shadow package names! For example, naming your file `anthropic.py` is disallowed since the API's name on pypi is `anthropic`, but naming it `anthropic_llms.py` works with no problems.**
## Interface
All models must subclass the `lm_eval.api.model.LM` class.
The LM class enforces a common interface via which we can extract responses from a model:
```python
class MyCustomLM(LM):
    #...
    def loglikelihood(self, requests):
        #...

    def loglikelihood_rolling(self, requests):
        #...

    def greedy_until(self, requests):
        #...
    #...
```
The `LM` interface is *tokenizer-agnostic*: the harness communicates with models purely through strings, so any tokenization scheme (or none at all, for APIs) can sit behind it. We support three types of requests, one per method above:

- `greedy_until`: each request contains an input string and stopping criteria; the model returns the greedily decoded continuation of the input, truncated at the stop sequences.
- `loglikelihood`: each request contains an input context and a target string; the model returns the log probability of the target conditioned on the context, along with whether the target would be produced by greedy decoding.
- `loglikelihood_rolling`: each request contains a single string; the model returns the total loglikelihood of that string, as used for perplexity-style evaluations.
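To make this concrete, here is a minimal sketch of a custom `LM` subclass. The constant return values and the request layouts assumed in the comments are illustrative only; consult `lm_eval.api.model` for the authoritative signatures.
```python
from lm_eval.api.model import LM


class DummyLM(LM):
    """A toy LM returning constant outputs -- useful only for smoke-testing."""

    def loglikelihood(self, requests):
        # assumed layout: one (logprob, is_greedy) pair per (context, continuation) request
        return [(-1.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # assumed layout: one total loglikelihood per input string
        return [-1.0 for _ in requests]

    def greedy_until(self, requests):
        # assumed layout: one generated string per (context, stopping criteria) request
        return ["" for _ in requests]
```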
## Registration
Congrats on implementing your model! Now it's time to test it out.
To make your model usable via the command line interface to `lm-eval` using `main.py`, you'll need to tell `lm-eval` what your model's name is.
This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, one can both alert `lm-eval` to the model's existence and tell the package which name(s) the model can be invoked with via `python main.py --model <name>`.
```python
from lm_eval.api.registry import register_model


@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
```
Using this decorator results in the class being added to an accounting of the usable LM types maintained internally to the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library!
## Other
**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models come with the built-in ability to serve data points in *descending order by total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!
## Conclusion
After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library!
Such that `{{question}}` will be replaced by `doc["question"]` when rendering the prompt template.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml
doc_to_target:"{{answer}}"
gold_alias:"{{answer}}"
```
where `doc_to_target` is *the string that will be appended to inputs for each few-shot example*, and `gold_alias` is *what is passed to our metric function as reference or gold answer to score against*. For example, for GSM8k word problems, `doc_to_target` should be the reference text reasoning chain given in the dataset culminating in the answer, and `gold_alias` should be **only the numeric answer** to the word problem that is given at the end of the reasoning chain, and which the evaluated model's answer will be compared against.
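As an illustration, such a GSM8k-style split might look like the following sketch (the Jinja expressions are hypothetical, assuming the dataset's `answer` column ends in `#### <number>`):
```yaml
doc_to_target: "{{answer}}"                            # full reasoning chain, ending in the final answer
gold_alias: "{{answer.split('####')[-1].strip()}}"     # only the numeric answer after the '####' delimiter
```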
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` should therefore not end with whitespace, and `doc_to_target` should not begin with it.
Users can also fill out the optional `template_aliases` YAML field, which is prepended to both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while keeping the main prompt fields easy to parse visually.
#### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words for each document) and evaluated by comparing loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
An annotated example in the case of SciQ is as follows:
```yaml
template_aliases:"{%setanswer_choices=[distractor1,distractor2,distractor3,correct_answer]%}{%setgold=3%}"# `template_aliases` must set the list of possible answer choices to the jinja variable `answer_choices` (List[str]), and set what the index within `answer_choices` of this doc's gold label (correct answer choice).
doc_to_text:"{{support.lstrip()}}\nQuestion:{{question}}\nAnswer:"# This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as target for each choice in answer_choices.
doc_to_target:"{{answer_choices[gold]}}"# this contains the gold-standard answer choice, selected via indexing to index `gold` in the answer choice list.
gold_alias:"{{gold}}"# this must be castable to an integer. It must output only the index within `answer_choices` that is the correct label.
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
### Using Python Functions for Prompts
There may be cases where the prompt we want to implement is more easily expressed in Python than in Jinja2. For this, we can use Python helper functions that are referenced in the YAML config. Note that the file defining these functions must be in the same directory as the YAML file.
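For example, a sketch assuming the `!function` YAML convention described in the task guide (the file name `utils.py` and the prompt contents are illustrative):
```yaml
doc_to_text: !function utils.doc_to_text
```
where `utils.py`, placed next to the YAML file, might contain:
```python
def doc_to_text(doc):
    # assemble the prompt from the document's fields in plain Python
    return f"Question: {doc['question']}\nAnswer:"
```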
### Setting metrics
You're almost done! Now we need to choose how to score our task.
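As a sketch of what this can look like in the YAML config (assuming the refactor's `metric_list` field; the particular metric and aggregation shown are illustrative):
```yaml
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```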
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against the original introducing paper's implementation* or a popularizing implementation. (WIP) denotes that a PR or person is already working on this task.
- [ ] Glue (WIP)
- [x] SuperGlue
...
...
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST (WIP)
- [ ] MCTACO
- [ ] Pubmed QA (WIP)
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA (WIP)
- [x] HellaSwag
- [ ] SWAG (WIP)
- [x] OpenBookQA
- [ ] SQuADv2 (WIP)
- [ ] RACE (WIP)
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande (WIP)
- [x] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math (WIP)
- [ ] Asdiv
- [ ] GSM8k
- [ ] Arithmetic (WIP)
- [ ] MMMLU
- [ ] Translation (WMT) suite
- [ ] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] StoryCloze
- [ ] NaturalQs
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
...
...
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
template_aliases:"{%setanswer_choices=choices['text']%}{%setgold=choices.label.index(answerKey)%}"# set the list of possible answer choices, and set what this doc's gold answer is (set what ds column used, and what)
doc_to_text:"Question:{{question}}\nAnswer:"
doc_to_target:"{{gold}}"# this will be cast to an int.
doc_to_target:"{{answer_choices[gold]}}"
gold_alias:"{{gold}}"# this will be cast to an int.
template_aliases:"{%setanswer_choices=choices['text']%}{%setgold=choices.label.index(answerKey)%}"# set the list of possible answer choices, and set what this doc's gold answer is (set what ds column used, and what)
doc_to_text:"Question:{{question}}\nAnswer:"
doc_to_target:"{{gold}}"# this will be cast to an int.
doc_to_target:"{{answer_choices[gold]}}"
gold_alias:"{{gold}}"# this will be cast to an int.
Title: `HellaSwag: Can a Machine Really Finish Your Sentence?`
Abstract: ```Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?
In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.```
Homepage: `https://rowanzellers.com/hellaswag/`
### Citation
```
@inproceedings{zellers2019hellaswag,
title={HellaSwag: Can a Machine Really Finish Your Sentence?},
author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
year={2019}
}
```
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?