Unverified Commit 4721379e authored by Hailey Schoelkopf, committed by GitHub

Merge branch 'big-refactor' into mypy

parents a551c789 cc7828dd
......@@ -20,7 +20,7 @@ This project provides a unified framework to test generative language models on
Features:
- Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md).
- Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
......@@ -116,8 +116,10 @@ accelerate launch main.py \
This will perform *data-parallel evaluation*: that is, placing a **single full copy** of your model onto each available GPU and *splitting batches across GPUs* to evaluate on K GPUs K times faster than on one.
However, if your model *is too large to be run on a single one of your GPUs*, then we provide an alternative method to run these large models: use of the `parallelize` argument.
If your model *is too large to be run on a single one of your GPUs*, then you can use `accelerate` with Fully Sharded Data Parallel (FSDP), which splits the weights of the model across your data parallel ranks. To enable this, ensure you select `YES` when asked ```Do you want to use FullyShardedDataParallel?``` when running `accelerate config`. To enable memory-efficient loading, select `YES` when asked `Do you want each individually wrapped FSDP unit to broadcast module parameters from rank 0 at the start?`. This ensures that only the rank 0 process loads the model and then broadcasts the parameters to the other ranks, instead of having each rank load all parameters, which can lead to large RAM usage spikes around the start of the script that may cause errors.
We also provide a second method to run these large models: use of the `parallelize` argument.
```
python main.py \
--model hf \
......@@ -132,7 +134,7 @@ To pass even more advanced keyword arguments to `accelerate`, we allow for the f
- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.
Using this setting helps for massive models like BLOOM, or to avoid exceeding your total system RAM (by default, with `accelerate launch`, one copy of the model for each GPU is initialized in RAM before moving it to GPU, resulting in large RAM usage spikes around the start of the script that may cause errors such as `Killed`). However, it naively splits models across GPUs, resulting in only a single GPU performing work at any point in time, and so is much slower than launching with `accelerate launch`, possibly by a factor of the total # of GPUs.
Note that this method naively splits models across GPUs, resulting in only a single GPU performing work at any point in time, and so is much slower than launching with `accelerate launch`, possibly by a factor of the total # of GPUs.
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**
......
......@@ -69,6 +69,8 @@ touch lm_eval/tasks/<dataset_name>/utils.py
```
Now, in `utils.py` we'll write a function to process each split of our dataset:
TODO: Change the example to one that's in the tasks/
```python
def process_docs(dataset: datasets.Dataset):
def _helper(doc):
......@@ -86,40 +88,53 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the
process_docs: !function utils.process_docs
```
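Tying these together, a complete `process_docs` in `utils.py` might look like the following sketch, which mirrors the `dataset.map` pattern used elsewhere in this branch (e.g. the DROP utils further down). The column names `choice1`, `choice2`, `label`, and the added `choices`/`gold` fields are purely illustrative:

```python
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _helper(doc):
        # modify each document (row) of the split here
        doc["choices"] = [doc["choice1"], doc["choice2"]]  # hypothetical column names
        doc["gold"] = int(doc["label"])                    # hypothetical column name
        return doc

    # map the helper over every document in the split and return the processed split
    return dataset.map(_helper)
```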
### Writing a prompt with Jinja 2
## Writing a Prompt Template
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_choice` (optional when certain conditions are met).
`doc_to_text` defines the input string a model will be given, while `doc_to_target` and `doc_to_choice` are used to generate the target text. `doc_to_target` can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, `doc_to_choice` must also be set with the appropriate list of possible choice strings.
To write a prompt, users are required to write two or three YAML fields in Jinja as strings:
### Basic prompts
If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of one of the dataset's features.
```yaml
doc_to_text:
doc_to_target:
doc_to_choice:
doc_to_text: startphrase
doc_to_target: label
```
Suppose our dataset has a `"question"` field, and an `"answer"` field, which are both strings. We want the model to see, if given a `document` object that is a row of our dataset:
Hard-coding the target is also possible, as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11).
```yaml
doc_to_target: 3
```
Question: {document[question]}
`doc_to_choice` can be directly given a list of strings as the options (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)).
```yaml
doc_to_choice: ['No', 'Yes']
```
### Writing a prompt with Jinja 2
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
Take for example `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together, so that for a sample line `doc`, the model sees something in the format of:
```
doc["passage"]
Question: doc["question"]?
Answer:
```
We do this by writing
We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61)
```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
```
Such that {{question}} will be replaced by `doc["question"]` when rendering the prompt template.
Such that `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` with `doc["question"]` when rendering the prompt template.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml
doc_to_target: "{{answer}}"
gold_alias: "{{answer}}"
```
where `doc_to_target` is *the string that will be appended to inputs for each few-shot example*, and `gold_alias` is *what is passed to our metric function as reference or gold answer to score against*. For example, for GSM8k word problems, `doc_to_target` should be the reference text reasoning chain given in the dataset culminating in the answer, and `gold_alias` should be **only the numeric answer** to the word problem that is given at the end of the reasoning chain, and which the evaluated model's answer will be compared against.
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
Users can also fill out the optional `template_aliases` YAML field, which is added ahead of both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, but only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while the main prompt fields remain easy to parse visually.
**Important**: we now add `target_delimiter` between input and target, which defaults to `" "`, such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
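For intuition, here is a minimal sketch (illustrative only, not the harness's internal rendering code) of how `doc_to_text`, `target_delimiter`, and `doc_to_target` combine into a single few-shot example string; the sample `doc` values are made up:

```python
from jinja2 import Template

# a made-up sample document with the fields used in the boolq-style prompt above
doc = {
    "passage": "Persian cats are a long-haired breed of cat.",
    "question": "are persian cats long haired",
    "answer": "yes",
}

doc_to_text = Template("{{passage}}\nQuestion: {{question}}?\nAnswer:")
doc_to_target = Template("{{answer}}")
target_delimiter = " "  # the default

# full input-output string for one few-shot example
fewshot_example = doc_to_text.render(**doc) + target_delimiter + doc_to_target.render(**doc)
print(fewshot_example)
# Persian cats are a long-haired breed of cat.
# Question: are persian cats long haired?
# Answer: yes
```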
#### Multiple choice format
......@@ -135,7 +150,13 @@ doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
The label index can also be sourced from a feature directly. For example in `superglue/boolq`, the label index is defined in the feature `label`. We can set `doc_to_target` simply to `label`. The options or verbalizers can be written in the form of a list `["no", "yes"]` that will correspond to the label index.
```yaml
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
```
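To make the indexing concrete, a tiny sketch (illustrative only; the field values are made up) of how the label index and `doc_to_choice` yield the gold answer string:

```python
# a made-up sample document; `label` indexes into the doc_to_choice list
doc = {"passage": "...", "question": "...", "label": 1}

choices = ["no", "yes"]            # doc_to_choice
gold_index = doc["label"]          # doc_to_target resolves to this index
gold_string = choices[gold_index]  # "yes" -- the answer string that gets scored
```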
### Using Python Functions for Prompts
......@@ -168,6 +189,10 @@ For example, For Super Glue BoolQ, if we want to use the prompt template `GPT-3
use_prompt: "promptsource:GPT-3 Style"
```
If you would like to run evaluation on all prompt templates, you can specify it as follows.
```
use_prompt: "promptsource:*"
```
### Setting metrics
......@@ -183,11 +208,11 @@ metric_list:
- metric: <name of the metric here>
aggregation: <name of the aggregation fn here>
higher_is_better: <true or false>
- metric: ...
- metric: !function script.function
aggregation: ...
higher_is_better: ...
```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults, if using a natively supported metric.
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric, otherwise it must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.
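As a rough sketch of what a custom metric passed via `!function` could look like: the function and module names below (`script.exact_match_ci`) are hypothetical, and the call convention is inferred from how metric functions are invoked elsewhere in this diff (`fn(references=[gold], predictions=[pred], **kwargs)`, returning a float or a dict):

```python
# hypothetical contents of script.py, referenced in the YAML as `!function script.exact_match_ci`
def exact_match_ci(references, predictions, ignore_case=True, **kwargs):
    # compare the first (and only) gold/prediction pair for this sample
    gold, pred = references[0], predictions[0]
    if ignore_case:
        gold, pred = gold.lower(), pred.lower()
    return 1.0 if pred.strip() == gold.strip() else 0.0
```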
......
# Advanced Task Configuration
# Task Configuration
The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
......@@ -33,7 +33,6 @@ Prompting / in-context formatting options:
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the list of possible answer choices.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks.
- **gold_alias** (`str`, *optional*, defaults to None) — if provided, used to generate the reference answer that is scored against. Used in cases where `doc_to_target` should be the "target string" format appended to each example's input for a fewshot exemplar, so doc_to_target is used for fewshot examples, but the input to the metric function as `gold` is from `gold_alias`.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
......
......@@ -4,3 +4,4 @@ nin
maka
mor
te
ond
......@@ -2,6 +2,7 @@ from dataclasses import dataclass
from typing import List
from lm_eval.api.instance import Instance
from datasets import Dataset
class Filter:
......@@ -18,7 +19,7 @@ class Filter:
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
def apply(self, resps):
def apply(self, resps, docs):
"""
Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
......@@ -40,13 +41,14 @@ class FilterEnsemble:
name: str
filters: List[Filter]
def apply(self, instances: List[Instance]) -> None:
def apply(self, instances: List[Instance], docs: List[Dataset]) -> None:
resps = [
inst.resps for inst in instances
] # operate just on the model responses
for f in self.filters:
# apply filters in sequence
resps = f.apply(resps)
resps = f.apply(resps, docs)
# add the end results after filtering to filtered_requests of their respective source instances.
# has key `self.name`: each FilterEnsemble applied in a given run should use a different name.
......
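For reference, a minimal custom filter following the updated `apply(self, resps, docs)` signature above might look like the following sketch (the class name is illustrative and the import path is an assumption for this branch):

```python
from lm_eval.api.filter import Filter  # module path assumed


class LowercaseFilter(Filter):
    """Toy filter: lowercases every model response."""

    def apply(self, resps, docs):
        # `resps` is a list (one entry per doc) of lists of model responses;
        # return the filtered lists in the same order they were given.
        return [[resp.lower() for resp in resp_set] for resp_set in resps]
```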
......@@ -88,7 +88,14 @@ class TaskConfig(dict):
metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
def __post_init__(self) -> None:
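# if `dataset_path` looks like a dotted module path, resolve it to the file that defines the dataset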
if "." in self.dataset_path:
import inspect
from importlib import import_module
self.dataset_path = inspect.getfile(import_module(self.dataset_path))
if self.generation_kwargs is not None:
if self.output_type != "greedy_until":
eval_logger.warning(
......@@ -181,7 +188,6 @@ class Task(abc.ABC):
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
......@@ -624,19 +630,19 @@ class ConfigurableTask(Task):
)
if self.has_test_docs():
docs = self.test_docs()
self.task_docs = self.test_docs()
elif self.has_validation_docs():
docs = self.validation_docs()
self.task_docs = self.validation_docs()
else:
assert (
False
), f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
# Test One Doc
self.features = list(docs.features.keys())
self.features = list(self.task_docs.features.keys())
self.multiple_input = 0
self.multiple_target = 0
test_doc = docs[0]
test_doc = self.task_docs[0]
test_text = self.doc_to_text(test_doc)
test_target = self.doc_to_target(test_doc)
......@@ -656,14 +662,14 @@ class ConfigurableTask(Task):
self.multiple_target = len(test_target)
else:
if (type(test_target) is int) and (test_choice is not None):
test_target = [self.doc_to_choice(test_target)[test_target]]
test_target = test_choice[test_target]
else:
test_target = [test_target]
test_target = str(test_target)
if test_choice is not None:
check_choices = test_choice
else:
check_choices = test_target
check_choices = [test_target]
for choice in check_choices:
choice_has_whitespace = True if " " in choice else False
......@@ -739,6 +745,15 @@ class ConfigurableTask(Task):
)
return super().fewshot_docs()
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances, self.task_docs)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
def should_decontaminate(self):
return self._config.should_decontaminate
......@@ -778,7 +793,7 @@ class ConfigurableTask(Task):
return doc[doc_to_text]
else:
text_string = utils.apply_template(doc_to_text, doc)
if text_string.isdigit():
if text_string.isdigit() and self._config.doc_to_choice is not None:
return ast.literal_eval(text_string)
else:
return text_string
......@@ -812,7 +827,7 @@ class ConfigurableTask(Task):
return doc[doc_to_target]
else:
target_string = utils.apply_template(doc_to_target, doc)
if target_string.isdigit():
if target_string.isdigit() and self._config.doc_to_choice is not None:
return ast.literal_eval(target_string)
elif (
len(target_string) >= 2
......@@ -994,18 +1009,36 @@ class ConfigurableTask(Task):
gold = self.doc_to_text(doc)
else:
gold = self.doc_to_target(doc)
if type(gold) is str:
gold = choices.index(gold)
gold_index_error = False
if type(gold) is list:
gold = [i if i < len(choices) else -100 for i in gold]
if -100 in gold:
gold_index_error = True
else:
if type(gold) is int:
gold = gold if gold < len(choices) else -100
elif type(gold) is str:
gold = choices.index(gold) if gold in choices else -100
if gold == -100:
gold_index_error = True
if gold_index_error:
eval_logger.warning(
f"Label index was not in within range of available choices,"
f"Sample:\n\n{doc}\n\n"
)
if self.multiple_target:
acc = 1.0 if pred in gold else 0.0
acc_norm = 1.0 if pred_norm in gold else 0.0
exact_match = int(any([is_greedy[i] for i in gold]))
exact_match = int(any([is_greedy[i] if i != -100 else 0 for i in gold]))
else:
acc = 1.0 if pred == gold else 0.0
acc_norm = 1.0 if pred_norm == gold else 0.0
# TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
exact_match = int(is_greedy[gold])
exact_match = int(is_greedy[gold]) if gold != -100 else 0
result_dict = {
**({"acc": acc} if "acc" in use_metric else {}),
......@@ -1032,37 +1065,37 @@ class ConfigurableTask(Task):
else:
gold = str(gold)
for key, result in zip(self._metric_fn_list.keys(), results):
result = results[0]
for metric in self._metric_fn_list.keys():
if self.multiple_target:
# in the case where we have multiple targets,
# return true if any are true
# TODO: this may break for multiple_target, non zero-or-1 metrics
scores = []
for gold_option in gold:
res = self._metric_fn_list[key](
res = self._metric_fn_list[metric](
references=[gold_option],
predictions=[result],
**self._metric_fn_kwargs[key],
**self._metric_fn_kwargs[metric],
)
if isinstance(res, dict):
# TODO: this handles the case where HF evaluate returns a dict.
res = res[key]
res = res[metric]
scores.append(res)
if any(scores):
result_score = 1.0
else:
result_score = 0.0
else:
result_score = self._metric_fn_list[key](
result_score = self._metric_fn_list[metric](
references=[gold],
predictions=[result],
**self._metric_fn_kwargs[key],
**self._metric_fn_kwargs[metric],
)
if isinstance(result_score, dict):
result_dict.update(result_score)
else:
result_dict[key] = result_score
if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict.
result_score = result_score[metric]
result_dict[metric] = result_score
else:
raise ValueError(
f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
......
......@@ -218,7 +218,6 @@ def evaluate(
padding_requests = collections.defaultdict(int)
# Stores group related keys and values for group-aggregation
aggregate = collections.defaultdict(dict)
task_groups = collections.defaultdict(dict)
# get lists of each type of request
......@@ -226,6 +225,7 @@ def evaluate(
if type(task) == tuple:
group, task = task
task_groups[task_name] = group
aggregate[task_name] = {}
versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config())
......@@ -403,12 +403,12 @@ def evaluate(
# | word_perplexity
# | byte_perplexity
# | bits_per_byte
if bool(task_groups):
if task_name in task_groups:
group_name = task_groups[task_name]
if metric not in aggregate[group_name]:
aggregate[group_name][metric] = [task_score]
else:
if metric in list(aggregate[group_name].keys()):
aggregate[group_name][metric].append(task_score)
else:
aggregate[group_name][metric] = [task_score]
# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
# so we run them less iterations. still looking for a cleaner way to do this
......
......@@ -17,14 +17,16 @@ FILTER_REGISTRY = {
def get_filter(filter_name):
return FILTER_REGISTRY[filter_name]
if filter_name in FILTER_REGISTRY:
return FILTER_REGISTRY[filter_name]
else:
return filter_name
def build_filter_ensemble(filter_name, components):
"""
Create a filtering pipeline.
"""
filters = []
for (function, kwargs) in components:
if kwargs is None:
......
......@@ -17,7 +17,7 @@ class DecontaminationFilter(Filter):
"""
self._decontam_results = None
def apply(self, reps) -> None:
def apply(self, resps, docs) -> None:
"""
Return {"no_contamination", "only_contamination"} keys for the 2 different subsets
"""
......
......@@ -17,7 +17,7 @@ class RegexFilter(Filter):
self.regex = re.compile(regex_pattern)
self.fallback = fallback
def apply(self, resps):
def apply(self, resps, docs):
# here, we assume we have a list, in which each element is
# a list of model responses for some particular input/target pair.
# so we process each of these (same input/target response sets)
......@@ -46,7 +46,7 @@ class WhitespaceFilter(Filter):
def __init__(self) -> None:
pass
def apply(self, resps):
def apply(self, resps, docs):
def filter_set(inst):
filtered_resp = []
for resp in inst:
......
......@@ -9,7 +9,7 @@ class TakeFirstFilter(Filter):
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
def apply(self, resps):
def apply(self, resps, docs):
"""
Assuming each entry of `resps` is a list of model responses, we discard all but the first response.
"""
......@@ -22,7 +22,7 @@ class TakeKFilter(Filter):
super().__init__(*args, **kwargs)
def apply(self, resps):
def apply(self, resps, docs):
# check we have at least k responses per doc, else we can't take the first k
assert (
len(resps[0]) >= self.k
......@@ -36,7 +36,7 @@ class MajorityVoteFilter(Filter):
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
def apply(self, resps):
def apply(self, resps, docs):
"""
Each entry of `resps` is a list of model responses.
We select the response that occurs most frequently in each entry of `resps`.
......
import os
import torch
import transformers
from transformers.models.auto.modeling_auto import (
......@@ -20,7 +22,7 @@ from lm_eval.api.registry import register_model
from lm_eval.utils import MultiTokenEOSCriteria, stop_sequences_criteria
from accelerate import Accelerator, find_executable_batch_size
from accelerate import Accelerator, find_executable_batch_size, DistributedType
from typing import List, Optional, Union
......@@ -67,6 +69,7 @@ class HFLM(LM):
revision: Optional[str] = "main",
subfolder: Optional[str] = None,
tokenizer: Optional[str] = None,
truncation: Optional[bool] = False,
max_length: Optional[int] = None,
device: Optional[str] = "cuda",
dtype: Optional[Union[str, torch.dtype]] = "auto",
......@@ -75,6 +78,7 @@ class HFLM(LM):
low_cpu_mem_usage: Optional[bool] = True,
trust_remote_code: Optional[bool] = False,
use_fast_tokenizer: Optional[bool] = True,
cache_dir: Optional[Union[str, os.PathLike]] = None,
# arguments used for splitting a model across GPUs naively.
# only used if `parallelize=True`.
parallelize: Optional[bool] = False,
......@@ -240,6 +244,8 @@ class HFLM(LM):
use_fast=use_fast_tokenizer,
)
self.truncation = truncation
self.vocab_size = self.tokenizer.vocab_size
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
......@@ -289,9 +295,16 @@ class HFLM(LM):
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
)
else:
self._model = accelerator.prepare_model(
self.model, evaluation_mode=True
)
assert accelerator.distributed_type in [
DistributedType.FSDP,
DistributedType.MULTI_GPU
], "Unsupported distributed type provided. Only DDP and FSDP are supported."
if accelerator.distributed_type == DistributedType.FSDP:
self._model = accelerator.prepare(self.model)
else:
self._model = accelerator.prepare_model(
self.model, evaluation_mode=True
)
self._device = torch.device(f"cuda:{accelerator.local_process_index}")
self.accelerator = accelerator
......@@ -419,7 +432,11 @@ class HFLM(LM):
return encoding
def tok_batch_encode(
self, strings: List[str], padding_side: str = "left", left_truncate_len=None
self,
strings: List[str],
padding_side: str = "left",
left_truncate_len: int = None,
truncation: bool = False,
):
# encode a batch of strings. converts to tensors and pads automatically, unlike tok_encode.
old_padding_side = self.tokenizer.padding_side
......@@ -432,6 +449,7 @@ class HFLM(LM):
encoding = self.tokenizer(
strings,
truncation=truncation,
padding="longest",
return_tensors="pt",
add_special_tokens=add_special_tokens,
......@@ -858,7 +876,9 @@ class HFLM(LM):
# encode, pad, and truncate contexts for this batch
context_enc, attn_masks = self.tok_batch_encode(
contexts, left_truncate_len=max_ctx_len
contexts,
left_truncate_len=max_ctx_len,
truncation=self.truncation,
)
context_enc = context_enc.to(self.device)
attn_masks = attn_masks.to(self.device)
......
......@@ -5,8 +5,8 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Glue
- [x] SuperGlue
- [ ] CoQA (Lintang)
- [ ] DROP (Lintang)
- [x] CoQA
- [x] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [x] ~~Lambada (Multilingual)~~
......@@ -29,7 +29,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] HeadQA
- [x] MathQA
- [x] WebQs
- [ ] WSC273 (Lintang)
- [x] WSC273
- [x] Winogrande
- [x] ANLI
- [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
......@@ -38,7 +38,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] TruthfulQA (gen)
- [ ] MuTual
- [ ] Hendrycks Math (Hailey)
- [ ] Asdiv
- [x] Asdiv
- [ ] GSM8k
- [x] Arithmetic
- [ ] MMMLU (Hailey)
......
task: asdiv
dataset_path: EleutherAI/asdiv
output_type: loglikelihood
validation_split: validation
doc_to_text: "{{body}}\nQuestion:{{question}}\nAnswer:"
doc_to_target: "{{answer.split(' (')[0]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{body}} {{question}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
# CoQA
### Paper
Title: `CoQA: A Conversational Question Answering Challenge`
Abstract: https://arxiv.org/pdf/1808.07042.pdf
CoQA is a large-scale dataset for building Conversational Question Answering
systems. The goal of the CoQA challenge is to measure the ability of machines to
understand a text passage and answer a series of interconnected questions that
appear in a conversation.
Homepage: https://stanfordnlp.github.io/coqa/
### Citation
```
BibTeX-formatted citation goes here
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `coqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: coqa
dataset_path: EleutherAI/coqa
output_type: greedy_until
training_split: train
validation_split: validation
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
process_results: !function utils.process_results
should_decontaminate: true
doc_to_decontamination_query: "{{story}} {{question.input_text|join('\n')}}"
generation_kwargs:
until:
- "\nQ:"
metric_list:
- metric: em
aggregation: mean
higher_is_better: true
- metric: f1
aggregation: mean
higher_is_better: true
from itertools import zip_longest
import transformers.data.metrics.squad_metrics as squad_metrics
def doc_to_text(doc):
# Given a passage p, the conversation history {q1, a1, . . . qi−1, ai−1}
# and a question qi, the task is to predict the answer ai
doc_text = doc["story"] + "\n\n"
for (q, a) in zip_longest(
doc["questions"]["input_text"], doc["answers"]["input_text"][:-1]
): # omit target answer ai
question = f"Q: {q}\n\n"
answer = f"A: {a}\n\n" if a is not None else "A:"
doc_text += question + answer
return doc_text
def doc_to_target(doc):
turn_id = len(doc["questions"]["input_text"])
# Returns unique answers and valid alternatives (Some questions in CoQA have multiple valid answers).
answers = []
answer_forturn = doc["answers"]["input_text"][turn_id - 1]
answers.append(answer_forturn)
additional_answers = doc.get("additional_answers")
if additional_answers:
for key in additional_answers:
additional_answer_for_turn = additional_answers[key]["input_text"][
turn_id - 1
]
if additional_answer_for_turn.lower() not in map(str.lower, answers):
answers.append(additional_answer_for_turn)
return answers
def em(gold_list, pred):
# tests for exact match and on the normalised answer (compute_exact)
em_sum = 0.0
if len(gold_list) > 1:
for i in range(len(gold_list)):
gold_answers = gold_list[0:i] + gold_list[i + 1 :]
# predictions compared against (n) golds and take maximum
em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_answers)
else:
em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_list)
return em_sum / max(1, len(gold_list))
def compute_scores(gold_list, pred):
# tests for exact match and on the normalised answer (compute_exact)
# test for overlap (compute_f1)
f1_sum = 0.0
em_sum = 0.0
if len(gold_list) > 1:
for i in range(len(gold_list)):
gold_answers = gold_list[0:i] + gold_list[i + 1 :]
# predictions compared against (n) golds and take maximum
em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_answers)
f1_sum += max(squad_metrics.compute_f1(a, pred) for a in gold_answers)
else:
em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_list)
f1_sum += max(squad_metrics.compute_f1(a, pred) for a in gold_list)
return {
"em": em_sum / max(1, len(gold_list)),
"f1": f1_sum / max(1, len(gold_list)),
}
def process_results(doc, results):
gold_list = doc_to_target(doc)
pred = results[0].strip().split("\n")[0]
scores = compute_scores(gold_list, pred)
return scores
# DROP
### Paper
Title: `DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs`
Abstract: https://aclanthology.org/attachments/N19-1246.Supplementary.pdf
DROP is a QA dataset which tests comprehensive understanding of paragraphs. In
this crowdsourced, adversarially-created, 96k question-answering benchmark, a
system must resolve multiple references in a question, map them onto a paragraph,
and perform discrete operations over them (such as addition, counting, or sorting).
Homepage: https://allenai.org/data/drop
Acknowledgement: This implementation is based on the official evaluation for `DROP`:
https://github.com/allenai/allennlp-reading-comprehension/blob/master/allennlp_rc/eval/drop_eval.py
### Citation
```
@misc{dua2019drop,
title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
year={2019},
eprint={1903.00161},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `drop`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: drop
dataset_path: EleutherAI/drop
output_type: greedy_until
training_split: train
validation_split: validation
process_docs: !function utils.process_docs
doc_to_text: "{{passage}} {{question}}"
doc_to_target: "{{ answer|join(',')}}"
target_delimiter: ""
process_results: !function utils.process_results
should_decontaminate: true
doc_to_decontamination_query: "{{passage}} {{question}}"
generation_kwargs:
until:
- "."
metric_list:
- metric: em
aggregation: mean
higher_is_better: true
- metric: f1
aggregation: mean
higher_is_better: true
import re
import string
import numpy as np
from scipy.optimize import linear_sum_assignment
_ARTICLES = re.compile(r"\b(a|an|the)\b", re.UNICODE)
def process_docs(dataset):
def _process(doc):
return {
"id": doc["query_id"],
"passage": doc["passage"],
"question": doc["question"],
"answers": get_answers(doc),
}
return dataset.map(_process)
def get_answers(doc):
def _flatten_validated_answers(validated_answers):
"""Flattens a dict of lists of validated answers.
{"number": ['1', '8'], ...}
-> [{"number": ['1'], ...}, {"number": ['8'], ...}]
"""
valid_answers = []
for i in range(len(validated_answers["number"])):
valid_answers.append(
{
"number": validated_answers["number"][i],
"date": validated_answers["date"][i],
"spans": validated_answers["spans"][i],
}
)
return valid_answers
answers = []
answers_set = set()
candidates = [doc["answer"]] + _flatten_validated_answers(doc["validated_answers"])
for candidate in candidates:
answer = parse_answer(candidate)
if answer in answers_set:
continue
answers_set.add(answer)
answers.append(answer)
return answers
def parse_answer(answer):
# NOTE: Everything is returned as a tuple for uniformity and hashability.
if answer["number"] != "":
return (str(answer["number"]),)
if answer["spans"] != []:
return tuple(answer["spans"])
return (
" ".join(
[answer["date"]["day"], answer["date"]["month"], answer["date"]["year"]]
).strip(),
)
def process_results(doc, results):
preds, golds = results, doc["answers"]
max_em = 0
max_f1 = 0
for gold_answer in golds:
exact_match, f1_score = get_metrics(preds, gold_answer)
if gold_answer[0].strip():
max_em = max(max_em, exact_match)
max_f1 = max(max_f1, f1_score)
return {"em": max_em, "f1": max_f1}
def get_metrics(predicted, gold):
"""
Takes a predicted answer and a gold answer (that are both either a string or a list of
strings), and returns exact match and the DROP F1 metric for the prediction. If you are
writing a script for evaluating objects in memory (say, the output of predictions during
validation, or while training), this is the function you want to call, after using
:func:`answer_json_to_strings` when reading the gold answer from the released data file.
"""
predicted_bags = _answer_to_bags(predicted)
gold_bags = _answer_to_bags(gold)
if set(predicted_bags[0]) == set(gold_bags[0]) and len(predicted_bags[0]) == len(
gold_bags[0]
):
exact_match = 1.0
else:
exact_match = 0.0
f1_per_bag = _align_bags(predicted_bags[1], gold_bags[1])
f1 = np.mean(f1_per_bag)
f1 = round(f1, 2)
return exact_match, f1
def _answer_to_bags(answer):
if isinstance(answer, (list, tuple)):
raw_spans = answer
else:
raw_spans = [answer]
normalized_spans = []
token_bags = []
for raw_span in raw_spans:
normalized_span = _normalize(raw_span)
normalized_spans.append(normalized_span)
token_bags.append(set(normalized_span.split()))
return normalized_spans, token_bags
def _align_bags(predicted, gold):
"""
Takes gold and predicted answer sets and first finds the optimal 1-1 alignment
between them and gets maximum metric values over all the answers.
"""
scores = np.zeros([len(gold), len(predicted)])
for gold_index, gold_item in enumerate(gold):
for pred_index, pred_item in enumerate(predicted):
if _match_numbers_if_present(gold_item, pred_item):
scores[gold_index, pred_index] = _compute_f1(pred_item, gold_item)
row_ind, col_ind = linear_sum_assignment(-scores)
max_scores = np.zeros([max(len(gold), len(predicted))])
for row, column in zip(row_ind, col_ind):
max_scores[row] = max(max_scores[row], scores[row, column])
return max_scores
def _compute_f1(predicted_bag, gold_bag):
intersection = len(gold_bag.intersection(predicted_bag))
if not predicted_bag:
precision = 1.0
else:
precision = intersection / float(len(predicted_bag))
if not gold_bag:
recall = 1.0
else:
recall = intersection / float(len(gold_bag))
f1 = (
(2 * precision * recall) / (precision + recall)
if not (precision == 0.0 and recall == 0.0)
else 0.0
)
return f1
def _match_numbers_if_present(gold_bag, predicted_bag):
gold_numbers = set()
predicted_numbers = set()
for word in gold_bag:
if _is_number(word):
gold_numbers.add(word)
for word in predicted_bag:
if _is_number(word):
predicted_numbers.add(word)
if (not gold_numbers) or gold_numbers.intersection(predicted_numbers):
return True
return False
def _is_number(text):
try:
float(text)
return True
except ValueError:
return False
def _remove_articles(text):
return _ARTICLES.sub(" ", text)
def _white_space_fix(text):
return " ".join(text.split())
def _remove_punc(text):
exclude = set(string.punctuation)
if not _is_number(text):
return "".join(ch for ch in text if ch not in exclude)
else:
return text
def _fix_number(text):
return str(float(text)) if _is_number(text) else text
def _tokenize(text):
return re.split(" |-", text)
def _normalize(answer):
tokens = [
_white_space_fix(_remove_articles(_fix_number(_remove_punc(token.lower()))))
for token in _tokenize(answer)
]
tokens = [token for token in tokens if token.strip()]
normalized = " ".join(tokens).strip()
return normalized
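As a quick, hedged sanity check of the scoring helpers above (not part of the task files; expected values computed from the functions as written), golds are tuples as produced by `parse_answer`:

```python
# illustrative usage of get_metrics / _normalize defined above
em, f1 = get_metrics("Washington", ("Washington",))
print(em, f1)   # 1.0 1.0

em, f1 = get_metrics("the Washington Redskins", ("Washington",))
print(em, f1)   # 0.0 0.67  (bag-of-words F1 after normalization, rounded to 2 places)
```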