Merge remote-tracking branch 'origin/big-refactor' into big-refactor_python

2e747c5b · baberabb · 71ab0f2c · a346e6a0 · 2e747c5b · 2e747c5b
Commit 2e747c5b authored Sep 03, 2023 by baberabb
20 changed files
--- a/docs/new_task_guide.md
+++ b/docs/new_task_guide.md
@@ -69,6 +69,8 @@ touch lm_eval/tasks/<dataset_name>/utils.py
 ```
 Now, in `utils.py` we'll write a function to process each split of our dataset:
+TODO: Change the example to one that's in the tasks/
 ```python
 def process_docs(dataset: datasets.Dataset):
    def _helper(doc):
@@ -86,40 +88,53 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the
 process_docs: !function utils.process_docs
 ```
+## Writing a Prompt Template
-### Writing a prompt with Jinja 2
 The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
-We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
+To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_choice` (optional when certain conditions are met).
+`doc_to_text` defines the input string a model will be given, while `doc_to_target` and `doc_to_choice` are used to generate the target text. `doc_to_target` can be either a string giving the target text or an integer giving the index of the correct label. When it is set as an index, `doc_to_choice` must also be set with the appropriate list of possible choice strings.
-To write a prompt, users are required to write two or three YAML fields in Jinja as strings:
+### Basic prompts
+If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example, in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of one feature.
 ```yaml
-doc_to_text:
+doc_to_text: startphrase
-doc_to_target:
+doc_to_target: label
-doc_to_choice:
 ```
-Suppose our dataset has a `"question"` field, and an `"answer"` field, which are both strings. We want the model to see, if given a `document` object that is a row of our dataset:
+Hard-coding is also possible as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11).
+```yaml
+doc_to_target: 3
 ```
-Question: {document[question]}
+`doc_to_choice` can be given a list of strings directly as the options (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)).
+```yaml
+doc_to_choice: ['No', 'Yes']
+```
+### Writing a prompt with Jinja 2
+We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
+Take `super_glue/boolq` as an example: as input, we'd like to use the features `passage` and `question` and string them together so that, for a sample line `doc`, the model sees something in the format of:
+```
+doc["passage"]
+Question: doc["question"]?
 Answer:
 ```
-We do this by writing
+We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61)
 ```yaml
-doc_to_text: "Question: {{question}}\nAnswer:"
+doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
 ```
-Such that {{question}} will be replaced by `doc["question"]` when rendering the prompt template.
+Here `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` by `doc["question"]` when rendering the prompt template.
 Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
 ```yaml
 doc_to_target: "{{answer}}"
-gold_alias: "{{answer}}"
 ```
-where `doc_to_target` is *the string that will be appended to inputs for each few-shot example*, and `gold_alias` is *what is passed to our metric function as reference or gold answer to score against*. For example, for GSM8k word problems, `doc_to_target` should be the reference text reasoning chain given in the dataset culminating in the answer, and `gold_alias` should be **only the numeric answer** to the word problem that is given at the end of the reasoning chain, and which the evaluated model's answer will be compared against.
-**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_target(doc) + " " + doc_to_text(doc)`. doc_to_text and doc_to_target should not contain trailing right or left whitespace, respectively.
-Users can also fill out the optional `template_aliases` YAML field, which is added ahead of both the `doc_to_text` and `doc_to_target` fields. This field should not contain any test, but only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while the main prompt fields remain easy to parse visually.
+**Important**: we now add `target_delimiter` between input and target, which defaults to `" "`, such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
 #### Multiple choice format
@@ -135,7 +150,13 @@ doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
 ```
 Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
+The label index can also be sourced from a feature directly. For example, in `super_glue/boolq`, the label index is defined in the feature `label`. We can set `doc_to_target` to simply `label`. The options or verbalizers can be written in the form of a list `["no", "yes"]` that will correspond to the label index.
+```yaml
+doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
+doc_to_target: label
+doc_to_choice: ["no", "yes"]
+```
 ### Using Python Functions for Prompts
@@ -168,6 +189,10 @@ For example, For Super Glue BoolQ, if we want to use the prompt template `GPT-3
 use_prompt: "promptsource:GPT-3 Style"
 ```
+If you would like to run evaluation on all prompt templates, you can use a wildcard in place of the template name:
+```
+use_prompt: "promptsource:*"
+```
 ### Setting metrics
@@ -183,11 +208,11 @@ metric_list:
  - metric: <name of the metric here>
    aggregation: <name of the aggregation fn here>
    higher_is_better: <true or false>
-  - metric: ...
+  - metric: !function script.function
    aggregation: ...
    higher_is_better: ...
 ```
-`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults, if using a natively supported metric.
+`aggregation` and `higher_is_better` can optionally be left out to fall back to the preset defaults when using a natively supported metric; otherwise they must be defined explicitly (for example, when using a custom metric implemented as a function).
 For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.

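Review note on the guide changes above: a minimal sketch of how the three prompt fields and `target_delimiter` combine into one input-output string. The helper `build_example` and the sample `doc` are hypothetical and only illustrate the documented behaviour; they are not the harness's own implementation.

```python
# Hypothetical helper: only the field names (doc_to_text, doc_to_target,
# doc_to_choice, target_delimiter) come from the guide; the rest is illustrative.
def build_example(doc, doc_to_text, doc_to_target, doc_to_choice=None,
                  target_delimiter=" "):
    text = doc_to_text(doc)
    target = doc_to_target(doc)
    # When the target is a label index, look up its verbalizer in doc_to_choice.
    if doc_to_choice is not None and isinstance(target, int):
        target = doc_to_choice[target]
    # Input first, then the delimiter, then the target.
    return text + target_delimiter + target


# Made-up BoolQ-style row for illustration.
doc = {"passage": "The sky is blue.", "question": "is the sky blue", "label": 1}
full = build_example(
    doc,
    doc_to_text=lambda d: f"{d['passage']}\nQuestion: {d['question']}?\nAnswer:",
    doc_to_target=lambda d: d["label"],
    doc_to_choice=["no", "yes"],
)
print(full)
```

Because the delimiter supplies the separating space, neither the rendered text nor the target should carry its own whitespace at the join.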
--- a/docs/advanced_task_guide.md
+++ b/docs/advanced_task_guide.md
-# Advanced Task Configuration
+# Task Configuration
 The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
@@ -33,7 +33,6 @@ Prompting / in-context formatting options:
 - **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
 - **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into
 - **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks.
- **gold_alias** (`str`, *optional*, defaults to None) — if provided, used to generate the reference answer that is scored against. Used in cases where `doc_to_target` should be the "target string" format appended to each example's input for a fewshot exemplar, so doc_to_target is used for fewshot examples, but the input to the metric function as `gold` is from `gold_alias`.
 - **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
 - **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.

--- a/lm_eval/api/filter.py
+++ b/lm_eval/api/filter.py
@@ -2,6 +2,7 @@ from dataclasses import dataclass
 from typing import List
 from lm_eval.api.instance import Instance
+from datasets import Dataset
 class Filter:
@@ -18,7 +19,7 @@ class Filter:
        Can define custom behavior here, if an individual instantiation of a Filter class should have state.
        """
-    def apply(self, resps):
+    def apply(self, resps, docs):
        """
        Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
        Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
@@ -40,14 +41,14 @@ class FilterEnsemble:
    name: str
    filters: List[Filter]
-    def apply(self, instances: List[Instance]):
+    def apply(self, instances: List[Instance], docs: List[Dataset]):
        resps = [
            inst.resps for inst in instances
        ]  # operate just on the model responses
        for f in self.filters:
            # apply filters in sequence
-            resps = f.apply(resps)
+            resps = f.apply(resps, docs)
        # add the end results after filtering to filtered_requests of their respective source instances.
        # has key `self.name`: each FilterEnsemble applied in a given run should use a different name.

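Review note on the new `apply(resps, docs)` signature: a sketch of a filter that actually reads `docs`. The `Filter` base class here is a stand-in so the snippet runs on its own, and `ChoiceIndexFilter` plus the `choices` field are hypothetical names chosen for illustration.

```python
from typing import List


class Filter:
    """Stand-in for lm_eval.api.filter.Filter (assumption, not the real class)."""

    def apply(self, resps, docs):
        return resps


class ChoiceIndexFilter(Filter):
    """Hypothetical filter: map each response to its index in the doc's `choices`."""

    def apply(self, resps: List[List[str]], docs: List[dict]) -> List[List[int]]:
        filtered = []
        # resps and docs are assumed to be aligned one-to-one and in the same order.
        for doc_resps, doc in zip(resps, docs):
            choices = doc["choices"]
            filtered.append(
                [
                    choices.index(r.strip()) if r.strip() in choices else -1
                    for r in doc_resps
                ]
            )
        return filtered


resps = [[" yes", "no"], ["maybe"]]
docs = [{"choices": ["no", "yes"]}, {"choices": ["no", "yes"]}]
out = ChoiceIndexFilter().apply(resps, docs)
print(out)
```

Filters that do not need the documents can simply ignore the second argument, which is what the updated built-in filters in this commit appear to do.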
--- a/lm_eval/api/task.py
+++ b/lm_eval/api/task.py
@@ -90,6 +90,12 @@ class TaskConfig(dict):
    def __post_init__(self):
+        if "." in self.dataset_path:
+            import inspect
+            from importlib import import_module
+            self.dataset_path = inspect.getfile(import_module(self.dataset_path))
        if self.generation_kwargs is not None:
            if self.output_type != "greedy_until":
                eval_logger.warning(
@@ -627,19 +633,19 @@ class ConfigurableTask(Task):
            )
        if self.has_test_docs():
-            docs = self.test_docs()
+            self.task_docs = self.test_docs()
        elif self.has_validation_docs():
-            docs = self.validation_docs()
+            self.task_docs = self.validation_docs()
        else:
            assert (
                False
            ), f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
        # Test One Doc
-        self.features = list(docs.features.keys())
+        self.features = list(self.task_docs.features.keys())
        self.multiple_input = 0
        self.multiple_target = 0
-        test_doc = docs[0]
+        test_doc = self.task_docs[0]
        test_text = self.doc_to_text(test_doc)
        test_target = self.doc_to_target(test_doc)
@@ -743,6 +749,15 @@ class ConfigurableTask(Task):
                )
            return super().fewshot_docs()
+    def apply_filters(self):
+        if hasattr(self, "_filters"):
+            for f in self._filters:
+                f.apply(self._instances, self.task_docs)
+        else:
+            eval_logger.warning("No filter defined, passing through instances")
+            return self._instances
    def should_decontaminate(self):
        return self._config.should_decontaminate
@@ -783,7 +798,7 @@ class ConfigurableTask(Task):
                return doc[doc_to_text]
            else:
                text_string = utils.apply_template(doc_to_text, doc)
-                if text_string.isdigit():
+                if text_string.isdigit() and self._config.doc_to_choice is not None:
                    return ast.literal_eval(text_string)
                else:
                    return text_string
@@ -818,7 +833,7 @@ class ConfigurableTask(Task):
                return doc[doc_to_target]
            else:
                target_string = utils.apply_template(doc_to_target, doc)
-                if target_string.isdigit():
+                if target_string.isdigit() and self._config.doc_to_choice is not None:
                    return ast.literal_eval(target_string)
                elif (
                    len(target_string) >= 2
@@ -1005,18 +1020,36 @@ class ConfigurableTask(Task):
                gold = self.doc_to_text(doc)
            else:
                gold = self.doc_to_target(doc)
-                if type(gold) is str:
-                    gold = choices.index(gold)
+            gold_index_error = False
+            if type(gold) is list:
+                gold = [i if i < len(choices) else -100 for i in gold]
+                if -100 in gold:
+                    gold_index_error = True
+            else:
+                if type(gold) is int:
+                    gold = gold if gold < len(choices) else -100
+                elif type(gold) is str:
+                    gold = choices.index(gold) if gold in choices else -100
+                if gold == -100:
+                    gold_index_error = True
+            if gold_index_error:
+                eval_logger.warning(
+                    f"Label index was not within range of available choices, "
+                    f"Sample:\n\n{doc}\n\n"
+                )
            if self.multiple_target:
                acc = 1.0 if pred in gold else 0.0
                acc_norm = 1.0 if pred_norm in gold else 0.0
-                exact_match = int(any([is_greedy[i] for i in gold]))
+                exact_match = int(any([is_greedy[i] if i != -100 else 0 for i in gold]))
            else:
                acc = 1.0 if pred == gold else 0.0
                acc_norm = 1.0 if pred_norm == gold else 0.0
                # TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
-                exact_match = int(is_greedy[gold])
+                exact_match = int(is_greedy[gold]) if gold != -100 else 0
            result_dict = {
                **({"acc": acc} if "acc" in use_metric else {}),

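Review note on the gold-label handling above: the same logic pulled out into two small functions so it can be exercised in isolation. The names `resolve_gold`, `exact_match`, and `INVALID` are hypothetical; the `-100` sentinel and the branch structure follow the diff.

```python
INVALID = -100  # sentinel used in the diff for an out-of-range or unknown label


def resolve_gold(gold, choices):
    """Map a gold label (index, string, or list of indices) onto `choices`."""
    if isinstance(gold, list):
        return [g if g < len(choices) else INVALID for g in gold]
    if isinstance(gold, int):
        return gold if gold < len(choices) else INVALID
    if isinstance(gold, str):
        return choices.index(gold) if gold in choices else INVALID
    return gold


def exact_match(gold, is_greedy):
    """Score 1 if any valid gold index was produced greedily, else 0."""
    if isinstance(gold, list):
        return int(any(is_greedy[g] for g in gold if g != INVALID))
    return int(is_greedy[gold]) if gold != INVALID else 0


choices = ["no", "yes"]
print(resolve_gold("yes", choices), resolve_gold(5, choices))
```

Invalid labels score 0 instead of raising, so a malformed sample lowers the metric and emits a warning but no longer aborts the run.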
--- a/lm_eval/filters/__init__.py
+++ b/lm_eval/filters/__init__.py
@@ -17,14 +17,16 @@ FILTER_REGISTRY = {
 def get_filter(filter_name):
-    return FILTER_REGISTRY[filter_name]
+    if filter_name in FILTER_REGISTRY:
+        return FILTER_REGISTRY[filter_name]
+    else:
+        return filter_name
 def build_filter_ensemble(filter_name, components):
    """
    Create a filtering pipeline.
    """
    filters = []
    for (function, kwargs) in components:
        if kwargs is None:

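Review note on the `get_filter` fallback: a sketch of the lookup with a toy registry, assuming the pass-through exists so that callables (for example ones supplied through `!function`) can be used as filters. The registry contents and `strip_filter` are hypothetical.

```python
# Toy registry for illustration; the real FILTER_REGISTRY maps names to Filter classes.
FILTER_REGISTRY = {"lowercase": str.lower}


def get_filter(filter_name):
    # Known names resolve through the registry; anything else is returned unchanged.
    if filter_name in FILTER_REGISTRY:
        return FILTER_REGISTRY[filter_name]
    return filter_name


def strip_filter(text):
    """Hypothetical user-supplied callable."""
    return text.strip()


print(get_filter("lowercase")("YES"))
print(get_filter(strip_filter)("  yes  "))
```

One consequence worth noting: a misspelled filter name is also returned unchanged as a string, so the mistake surfaces later, when the result is called, and not as a `KeyError` at lookup time.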
--- a/lm_eval/filters/decontamination.py
+++ b/lm_eval/filters/decontamination.py
@@ -17,7 +17,7 @@ class DecontaminationFilter(Filter):
        """
        self._decontam_results = None
-    def apply(self, reps):
+    def apply(self, reps, docs):
        """
        Return {"no_contamination", "only_contamination"} keys for the 2 different subsets
        """

--- a/lm_eval/filters/extraction.py
+++ b/lm_eval/filters/extraction.py
@@ -15,7 +15,7 @@ class RegexFilter(Filter):
        self.regex = re.compile(regex_pattern)
        self.fallback = fallback
-    def apply(self, resps):
+    def apply(self, resps, docs):
        # here, we assume we have a list, in which each element is
        # a list of model responses for some particular input/target pair.
        # so we process each of these (same input/target response sets)
@@ -44,7 +44,7 @@ class WhitespaceFilter(Filter):
    def __init__(self):
        pass
-    def apply(self, resps):
+    def apply(self, resps, docs):
        def filter_set(inst):
            filtered_resp = []

--- a/lm_eval/filters/selection.py
+++ b/lm_eval/filters/selection.py
@@ -9,7 +9,7 @@ class TakeFirstFilter(Filter):
        Can define custom behavior here, if an individual instantiation of a Filter class should have state.
        """
-    def apply(self, resps):
+    def apply(self, resps, docs):
        """
        Assuming each entry of `resps` is a list of model responses, we discard all but the first response.
        """
@@ -23,7 +23,7 @@ class TakeKFilter(Filter):
        super().__init__(*args, **kwargs)
-    def apply(self, resps):
+    def apply(self, resps, docs):
        # check we have at least k responses per doc, else we can't take the first k
        assert (
            len(resps[0]) >= self.k
@@ -37,7 +37,7 @@ class MajorityVoteFilter(Filter):
        Can define custom behavior here, if an individual instantiation of a Filter class should have state.
        """
-    def apply(self, resps):
+    def apply(self, resps, docs):
        """
        Each entry of `resps` is a list of model responses.
        We select the response that occurs most frequently in each entry of `resps`.

--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
+import os
 import torch
 import transformers
 from transformers.models.auto.modeling_auto import (
@@ -67,6 +69,7 @@ class HFLM(LM):
        revision: Optional[str] = "main",
        subfolder: Optional[str] = None,
        tokenizer: Optional[str] = None,
+        truncation: Optional[bool] = False,
        max_length: Optional[int] = None,
        device: Optional[str] = "cuda",
        dtype: Optional[Union[str, torch.dtype]] = "auto",
@@ -75,6 +78,7 @@ class HFLM(LM):
        low_cpu_mem_usage: Optional[bool] = True,
        trust_remote_code: Optional[bool] = False,
        use_fast_tokenizer: Optional[bool] = True,
+        cache_dir: Optional[Union[str, os.PathLike]] = None,
        # arguments used for splitting a model across GPUs naively.
        # only used if `parallelize=True`.
        parallelize: Optional[bool] = False,
@@ -240,6 +244,8 @@ class HFLM(LM):
            use_fast=use_fast_tokenizer,
        )
+        self.truncation = truncation
        self.vocab_size = self.tokenizer.vocab_size
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
@@ -419,7 +425,11 @@ class HFLM(LM):
        return encoding
    def tok_batch_encode(
-        self, strings: List[str], padding_side="left", left_truncate_len=None
+        self,
+        strings: List[str],
+        padding_side="left",
+        left_truncate_len=None,
+        truncation=False,
    ):
        # encode a batch of strings. converts to tensors and pads automatically, unlike tok_encode.
        old_padding_side = self.tokenizer.padding_side
@@ -432,6 +442,7 @@ class HFLM(LM):
        encoding = self.tokenizer(
            strings,
+            truncation=truncation,
            padding="longest",
            return_tensors="pt",
            add_special_tokens=add_special_tokens,
@@ -856,7 +867,9 @@ class HFLM(LM):
                # encode, pad, and truncate contexts for this batch
                context_enc, attn_masks = self.tok_batch_encode(
-                    contexts, left_truncate_len=max_ctx_len
+                    contexts,
+                    left_truncate_len=max_ctx_len,
+                    truncation=self.truncation,
                )
                context_enc = context_enc.to(self.device)
                attn_masks = attn_masks.to(self.device)

--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -5,8 +5,8 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [x] Glue
 - [x] SuperGlue
- [ ] CoQA (Lintang)
+- [x] CoQA
- [ ] DROP (Lintang)
+- [x] DROP
 - [x] ~~Lambada~~
 - [x] Lambada (Cloze variants)
 - [x] ~~Lambada (Multilingual)~~
@@ -29,7 +29,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [x] HeadQA
 - [x] MathQA
 - [x] WebQs
- [ ] WSC273 (Lintang)
+- [x] WSC273
 - [x] Winogrande
 - [x] ANLI
 - [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
@@ -38,7 +38,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [x] TruthfulQA (gen)
 - [ ] MuTual
 - [ ] Hendrycks Math (Hailey)
- [ ] Asdiv
+- [x] Asdiv
 - [ ] GSM8k
 - [x] Arithmetic
 - [ ] MMMLU (Hailey)

--- a/lm_eval/tasks/asdiv/default.yaml
+++ b/lm_eval/tasks/asdiv/default.yaml
+task: asdiv
+dataset_path: EleutherAI/asdiv
+output_type: loglikelihood
+validation_split: validation
+doc_to_text: "{{body}}\nQuestion:{{question}}\nAnswer:"
+doc_to_target: "{{answer.split(' (')[0]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{body}} {{question}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
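Review note on the `doc_to_target` template above: the Jinja expression `answer.split(' (')[0]` keeps everything before the first `" ("`. The sketch below shows the equivalent Python; the sample answer strings are made up and assume the dataset stores a value followed by a parenthesised unit.

```python
def strip_unit(answer: str) -> str:
    """Keep the text before the first ' (' (same expression as the template)."""
    return answer.split(" (")[0]


print(strip_unit("9 (apples)"))
print(strip_unit("12"))
```

Answers without a parenthesised suffix pass through unchanged, since `split` then returns a single-element list.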
--- a/lm_eval/tasks/coqa/README.md
+++ b/lm_eval/tasks/coqa/README.md
+# CoQA
+### Paper
+Title: `CoQA: A Conversational Question Answering Challenge`
+Abstract: https://arxiv.org/pdf/1808.07042.pdf
+CoQA is a large-scale dataset for building Conversational Question Answering
+systems. The goal of the CoQA challenge is to measure the ability of machines to
+understand a text passage and answer a series of interconnected questions that
+appear in a conversation.
+Homepage: https://stanfordnlp.github.io/coqa/
+### Citation
+```
+BibTeX-formatted citation goes here
+```
+### Groups and Tasks
+#### Groups
+* Not part of a group yet
+#### Tasks
+* `coqa`
+### Checklist
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/coqa/default.yaml
+++ b/lm_eval/tasks/coqa/default.yaml
+task: coqa
+dataset_path: EleutherAI/coqa
+output_type: greedy_until
+training_split: train
+validation_split: validation
+doc_to_text: !function utils.doc_to_text
+doc_to_target: !function utils.doc_to_target
+process_results: !function utils.process_results
+should_decontaminate: true
+doc_to_decontamination_query: "{{story}} {{question.input_text|join('\n')}}"
+generation_kwargs:
+  until:
+    - "\nQ:"
+metric_list:
+  - metric: em
+    aggregation: mean
+    higher_is_better: true
+  - metric: f1
+    aggregation: mean
+    higher_is_better: true
--- a/lm_eval/tasks/coqa/utils.py
+++ b/lm_eval/tasks/coqa/utils.py
+from itertools import zip_longest
+import transformers.data.metrics.squad_metrics as squad_metrics
+def doc_to_text(doc):
+    # Given a passage p, the conversation history {q1, a1, . . . qi−1, ai−1}
+    # and a question qi, the task is to predict the answer ai
+    doc_text = doc["story"] + "\n\n"
+    for (q, a) in zip_longest(
+        doc["questions"]["input_text"], doc["answers"]["input_text"][:-1]
+    ):  # omit target answer ai
+        question = f"Q: {q}\n\n"
+        answer = f"A: {a}\n\n" if a is not None else "A:"
+        doc_text += question + answer
+    return doc_text
+def doc_to_target(doc):
+    turn_id = len(doc["questions"]["input_text"])
+    # Returns unique answers and valid alternatives (Some questions in CoQA have multiple valid answers).
+    answers = []
+    answer_forturn = doc["answers"]["input_text"][turn_id - 1]
+    answers.append(answer_forturn)
+    additional_answers = doc.get("additional_answers")
+    if additional_answers:
+        for key in additional_answers:
+            additional_answer_for_turn = additional_answers[key]["input_text"][
+                turn_id - 1
+            ]
+            if additional_answer_for_turn.lower() not in map(str.lower, answers):
+                answers.append(additional_answer_for_turn)
+    return answers
+def em(gold_list, pred):
+    # tests for exact match and on the normalised answer (compute_exact)
+    em_sum = 0.0
+    if len(gold_list) > 1:
+        for i in range(len(gold_list)):
+            gold_answers = gold_list[0:i] + gold_list[i + 1 :]
+            # predictions compared against (n) golds and take maximum
+            em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_answers)
+    else:
+        em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_list)
+    return em_sum / max(1, len(gold_list))
+def compute_scores(gold_list, pred):
+    # tests for exact match and on the normalised answer (compute_exact)
+    # test for overlap (compute_f1)
+    f1_sum = 0.0
+    em_sum = 0.0
+    if len(gold_list) > 1:
+        for i in range(len(gold_list)):
+            gold_answers = gold_list[0:i] + gold_list[i + 1 :]
+            # predictions compared against (n) golds and take maximum
+            em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_answers)
+            f1_sum += max(squad_metrics.compute_f1(a, pred) for a in gold_answers)
+    else:
+        em_sum += max(squad_metrics.compute_exact(a, pred) for a in gold_list)
+        f1_sum += max(squad_metrics.compute_f1(a, pred) for a in gold_list)
+    return {
+        "em": em_sum / max(1, len(gold_list)),
+        "f1": f1_sum / max(1, len(gold_list)),
+    }
+def process_results(doc, results):
+    gold_list = doc_to_target(doc)
+    pred = results[0].strip().split("\n")[0]
+    scores = compute_scores(gold_list, pred)
+    return scores
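Review note on the CoQA scoring above: a standalone sketch of the leave-one-out exact-match averaging. To avoid the `transformers` dependency, `normalize` and `exact` are simplified stand-ins for `squad_metrics.compute_exact` and may differ from it in edge cases; `leave_one_out_em` is a hypothetical name.

```python
import re
import string


def normalize(text):
    """Simplified SQuAD-style normalisation: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact(gold, pred):
    return int(normalize(gold) == normalize(pred))


def leave_one_out_em(gold_list, pred):
    """With several golds, hold each one out in turn and average the best match."""
    if len(gold_list) > 1:
        total = 0.0
        for i in range(len(gold_list)):
            others = gold_list[:i] + gold_list[i + 1:]
            total += max(exact(g, pred) for g in others)
        return total / len(gold_list)
    return float(max(exact(g, pred) for g in gold_list))


print(leave_one_out_em(["white", "black"], "White."))
```

With two references, a prediction matching only one of them scores 0.5, because that reference is held out in one of the two rounds.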
--- a/lm_eval/tasks/drop/README.md
+++ b/lm_eval/tasks/drop/README.md
+# DROP
+### Paper
+Title: `DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs`
+Abstract: https://aclanthology.org/attachments/N19-1246.Supplementary.pdf
+DROP is a QA dataset which tests comprehensive understanding of paragraphs. In
+this crowdsourced, adversarially-created, 96k question-answering benchmark, a
+system must resolve multiple references in a question, map them onto a paragraph,
+and perform discrete operations over them (such as addition, counting, or sorting).
+Homepage: https://allenai.org/data/drop
+Acknowledgement: This implementation is based on the official evaluation for `DROP`:
+https://github.com/allenai/allennlp-reading-comprehension/blob/master/allennlp_rc/eval/drop_eval.py
+### Citation
+```
+@misc{dua2019drop,
+    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
+    author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
+    year={2019},
+    eprint={1903.00161},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+### Groups and Tasks
+#### Groups
+* Not part of a group yet.
+#### Tasks
+* `drop`
+### Checklist
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/drop/default.yaml
+++ b/lm_eval/tasks/drop/default.yaml
+task: drop
+dataset_path: EleutherAI/drop
+output_type: greedy_until
+training_split: train
+validation_split: validation
+process_docs: !function utils.process_docs
+doc_to_text: "{{passage}} {{question}}"
+doc_to_target: "{{ answer|join(',')}}"
+target_delimiter: ""
+process_results: !function utils.process_results
+should_decontaminate: true
+doc_to_decontamination_query: "{{passage}} {{question}}"
+generation_kwargs:
+  until:
+    - "."
+metric_list:
+  - metric: em
+    aggregation: mean
+    higher_is_better: true
+  - metric: f1
+    aggregation: mean
+    higher_is_better: true
--- a/lm_eval/tasks/drop/utils.py
+++ b/lm_eval/tasks/drop/utils.py
+import re
+import string
+import numpy as np
+from scipy.optimize import linear_sum_assignment
+_ARTICLES = re.compile(r"\b(a|an|the)\b", re.UNICODE)
+def process_docs(dataset):
+    def _process(doc):
+        return {
+            "id": doc["query_id"],
+            "passage": doc["passage"],
+            "question": doc["question"],
+            "answers": get_answers(doc),
+        }
+    return dataset.map(_process)
+def get_answers(doc):
+    def _flatten_validated_answers(validated_answers):
+        """Flattens a dict of lists of validated answers.
+        {"number": ['1', '8'], ...}
+        -> [{"number": ['1'], ...}, {"number": ['8'], ...}]
+        """
+        valid_answers = []
+        for i in range(len(validated_answers["number"])):
+            valid_answers.append(
+                {
+                    "number": validated_answers["number"][i],
+                    "date": validated_answers["date"][i],
+                    "spans": validated_answers["spans"][i],
+                }
+            )
+        return valid_answers
+    answers = []
+    answers_set = set()
+    candidates = [doc["answer"]] + _flatten_validated_answers(doc["validated_answers"])
+    for candidate in candidates:
+        answer = parse_answer(candidate)
+        if answer in answers_set:
+            continue
+        answers_set.add(answer)
+        answers.append(answer)
+    return answers
+def parse_answer(answer):
+    # NOTE: Everything is returned as a tuple for uniformity and hashability.
+    if answer["number"] != "":
+        return (str(answer["number"]),)
+    if answer["spans"] != []:
+        return tuple(answer["spans"])
+    return (
+        " ".join(
+            [answer["date"]["day"], answer["date"]["month"], answer["date"]["year"]]
+        ).strip(),
+    )
+def process_results(doc, results):
+    preds, golds = results, doc["answers"]
+    max_em = 0
+    max_f1 = 0
+    for gold_answer in golds:
+        exact_match, f1_score = get_metrics(preds, gold_answer)
+        if gold_answer[0].strip():
+            max_em = max(max_em, exact_match)
+            max_f1 = max(max_f1, f1_score)
+    return {"em": max_em, "f1": max_f1}
+def get_metrics(predicted, gold):
+    """
+    Takes a predicted answer and a gold answer (that are both either a string or a list of
+    strings), and returns exact match and the DROP F1 metric for the prediction.  If you are
+    writing a script for evaluating objects in memory (say, the output of predictions during
+    validation, or while training), this is the function you want to call, after using
+    :func:`answer_json_to_strings` when reading the gold answer from the released data file.
+    """
+    predicted_bags = _answer_to_bags(predicted)
+    gold_bags = _answer_to_bags(gold)
+    if set(predicted_bags[0]) == set(gold_bags[0]) and len(predicted_bags[0]) == len(
+        gold_bags[0]
+    ):
+        exact_match = 1.0
+    else:
+        exact_match = 0.0
+    f1_per_bag = _align_bags(predicted_bags[1], gold_bags[1])
+    f1 = np.mean(f1_per_bag)
+    f1 = round(f1, 2)
+    return exact_match, f1
+def _answer_to_bags(answer):
+    if isinstance(answer, (list, tuple)):
+        raw_spans = answer
+    else:
+        raw_spans = [answer]
+    normalized_spans = []
+    token_bags = []
+    for raw_span in raw_spans:
+        normalized_span = _normalize(raw_span)
+        normalized_spans.append(normalized_span)
+        token_bags.append(set(normalized_span.split()))
+    return normalized_spans, token_bags
+def _align_bags(predicted, gold):
+    """
+    Takes gold and predicted answer sets and first finds the optimal 1-1 alignment
+    between them and gets maximum metric values over all the answers.
+    """
+    scores = np.zeros([len(gold), len(predicted)])
+    for gold_index, gold_item in enumerate(gold):
+        for pred_index, pred_item in enumerate(predicted):
+            if _match_numbers_if_present(gold_item, pred_item):
+                scores[gold_index, pred_index] = _compute_f1(pred_item, gold_item)
+    row_ind, col_ind = linear_sum_assignment(-scores)
+    max_scores = np.zeros([max(len(gold), len(predicted))])
+    for row, column in zip(row_ind, col_ind):
+        max_scores[row] = max(max_scores[row], scores[row, column])
+    return max_scores
+def _compute_f1(predicted_bag, gold_bag):
+    intersection = len(gold_bag.intersection(predicted_bag))
+    if not predicted_bag:
+        precision = 1.0
+    else:
+        precision = intersection / float(len(predicted_bag))
+    if not gold_bag:
+        recall = 1.0
+    else:
+        recall = intersection / float(len(gold_bag))
+    f1 = (
+        (2 * precision * recall) / (precision + recall)
+        if not (precision == 0.0 and recall == 0.0)
+        else 0.0
+    )
+    return f1
+def _match_numbers_if_present(gold_bag, predicted_bag):
+    gold_numbers = set()
+    predicted_numbers = set()
+    for word in gold_bag:
+        if _is_number(word):
+            gold_numbers.add(word)
+    for word in predicted_bag:
+        if _is_number(word):
+            predicted_numbers.add(word)
+    if (not gold_numbers) or gold_numbers.intersection(predicted_numbers):
+        return True
+    return False
+def _is_number(text):
+    try:
+        float(text)
+        return True
+    except ValueError:
+        return False
+def _remove_articles(text):
+    return _ARTICLES.sub(" ", text)
+def _white_space_fix(text):
+    return " ".join(text.split())
+def _remove_punc(text):
+    exclude = set(string.punctuation)
+    if not _is_number(text):
+        return "".join(ch for ch in text if ch not in exclude)
+    else:
+        return text
+def _fix_number(text):
+    return str(float(text)) if _is_number(text) else text
+def _tokenize(text):
+    return re.split(" |-", text)
+def _normalize(answer):
+    tokens = [
+        _white_space_fix(_remove_articles(_fix_number(_remove_punc(token.lower()))))
+        for token in _tokenize(answer)
+    ]
+    tokens = [token for token in tokens if token.strip()]
+    normalized = " ".join(tokens).strip()
+    return normalized
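
Reviewer note: the per-bag F1 computed by `_compute_f1` above can be sketched in isolation. This is a minimal illustration rather than the harness's own code: the name `bag_f1` and the sample token sets are hypothetical, and the function only mirrors the precision/recall logic visible in this diff.

```python
def bag_f1(predicted_bag: set, gold_bag: set) -> float:
    """Token-level F1 between two bags (sets) of normalized tokens."""
    intersection = len(gold_bag & predicted_bag)
    # An empty bag counts as perfectly precise / fully recalled, as in the diff.
    precision = intersection / len(predicted_bag) if predicted_bag else 1.0
    recall = intersection / len(gold_bag) if gold_bag else 1.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample: the prediction contains one extra token.
pred = {"denver", "broncos"}
gold = {"broncos"}
score = bag_f1(pred, gold)  # precision 0.5, recall 1.0
```

In the diff, `_align_bags` only computes this score for pairs that pass `_match_numbers_if_present`, and `get_metrics` then averages the aligned scores.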
--- a/lm_eval/tasks/realtoxicityprompts/metric.py
+++ b/lm_eval/tasks/realtoxicityprompts/metric.py
 import os
-import sys
 import json
 import requests
 import numpy as np
 from lm_eval.logger import eval_logger
-try:
-    API_KEY = os.environ["PERSPECTIVE_API_KEY"]
-except KeyError:
-    API_KEY = None
-    eval_logger.warning(
-        "PERSPECTIVE_API_KEY is not set. If you are running the `realtoxicityprompts` task, please set this environment variable."
-    )
-url = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={API_KEY}"
 def toxicity_perspective_api(references, predictions, **kwargs):
    """Toxicity Perspective API is a metric which uses Perspective API to score the toxicity of a generated sentence.
@@ -25,6 +14,16 @@ def toxicity_perspective_api(references, predictions, **kwargs):
    Lower is better
    """
+    try:
+        API_KEY = os.environ["PERSPECTIVE_API_KEY"]
+    except KeyError:
+        eval_logger.warning(
+            "PERSPECTIVE_API_KEY is not set. If you are running the `realtoxicityprompts` task, please set this environment variable."
+        )
+        raise
+    url = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={API_KEY}"
    scores = []
    for pred in predictions:
        data = {
@@ -35,7 +34,7 @@ def toxicity_perspective_api(references, predictions, **kwargs):
        headers = {
            "content-type": "application/json",
        }
-        req_response = requests.post(url, data=data, headers=headers)
+        req_response = requests.post(url, json=data, headers=headers)
        if req_response.ok:
            response = json.loads(req_response.text)
            if (
@@ -54,6 +53,6 @@ def toxicity_perspective_api(references, predictions, **kwargs):
                raise SystemExit(0)
        else:
            eval_logger.error("Unhandled Exception")
-            raise SystemExit(0)
+            req_response.raise_for_status()
    return np.mean(scores)
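
Reviewer note: the change above moves the `PERSPECTIVE_API_KEY` lookup from import time into the metric function, so a missing key only matters when the metric is actually called. A minimal sketch of that pattern follows; the helper name `perspective_url` and the placeholder key are hypothetical and not part of the commit.

```python
import os


def perspective_url() -> str:
    """Build the request URL at call time rather than at import time."""
    # Raises KeyError if the variable is unset, so only callers that
    # actually need the key are affected.
    api_key = os.environ["PERSPECTIVE_API_KEY"]
    return (
        "https://commentanalyzer.googleapis.com/v1alpha1/"
        f"comments:analyze?key={api_key}"
    )


# Placeholder value purely for illustration.
os.environ["PERSPECTIVE_API_KEY"] = "dummy-key"
url = perspective_url()
```

With this layout, importing the module never touches the environment, which is presumably why the module-level warning could be dropped.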
--- a/lm_eval/tasks/super_glue/README.md
+++ b/lm_eval/tasks/super_glue/README.md
+# SuperGLUE
+### Paper
+Title: `SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems`
+Abstract: `https://w4ngatang.github.io/static/papers/superglue.pdf`
+SuperGLUE is a benchmark styled after GLUE with a new set of more difficult language
+understanding tasks.
+Homepage: https://super.gluebenchmark.com/
+### Citation
+```
+@inproceedings{NEURIPS2019_4496bf24,
+    author = {Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel},
+    booktitle = {Advances in Neural Information Processing Systems},
+    editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
+    pages = {},
+    publisher = {Curran Associates, Inc.},
+    title = {SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems},
+    url = {https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf},
+    volume = {32},
+    year = {2019}
+}
+```
+### Groups and Tasks
+#### Groups
+* `super-glue-lm-eval-v1`: SuperGLUE eval adapted from LM Eval V1
+* `super-glue-t5-prompt`: SuperGLUE prompts and evaluation that match the T5 paper (when run with accelerate, this group errors if `record` is included).
+#### Tasks
+Comparison of validation-split scores between T5x and LM-Eval (T5x models converted to HF):
+| T5V1.1 Base | SGLUE | BoolQ | CB        | Copa | MultiRC | ReCoRD | RTE | WiC | WSC |
+| ----------- | ------| ----- | --------- | ---- | ------- | ------ | --- | --- | --- |
+| T5x | 69.47 | 78.47(acc) | 83.93(f1) 87.5(acc) | 50(acc) | 73.81(f1) 33.26(em) | 70.09(em) 71.34(f1) | 78.7(acc) | 63.64(acc) | 75(acc) |
+| LM-Eval | 71.35 | 79.36(acc) | 83.63(f1) 87.5(acc) | 63(acc) | 73.45(f1) 33.26(em) | 69.85(em) 68.86(f1) | 78.34(acc) | 65.83(acc) | 75.96(acc) |
+* `super-glue-lm-eval-v1`
+    - `boolq`
+    - `cb`
+    - `copa`
+    - `multirc`
+    - `record`
+    - `rte`
+    - `wic`
+    - `wsc`
+* `super-glue-t5-prompt`
+    - `super_glue-boolq-t5-prompt`
+    - `super_glue-cb-t5-prompt`
+    - `super_glue-copa-t5-prompt`
+    - `super_glue-multirc-t5-prompt`
+    - `super_glue-record-t5-prompt`
+    - `super_glue-rte-t5-prompt`
+    - `super_glue-wic-t5-prompt`
+    - `super_glue-wsc-t5-prompt`
+### Checklist
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/super_glue/boolq/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/boolq/t5-prompt.yaml
+group:
+  - super-glue-t5-prompt
+task: super_glue-boolq-t5-prompt
+dataset_path: super_glue
+dataset_name: boolq
+training_split: train
+validation_split: validation
+output_type: greedy_until
+doc_to_text: "boolq passage: {{passage}} question: {{question}}"
+doc_to_target: label
+doc_to_choice: ['False', 'True']
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
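
Reviewer note: in this config `doc_to_target: label` is an integer column, so the target string comes from indexing into `doc_to_choice`, as described in the task guide. The sketch below illustrates that mapping under stated assumptions: `render_example` is a hypothetical helper, the sample document is made up, and plain string replacement stands in for the harness's Jinja rendering.

```python
DOC_TO_TEXT = "boolq passage: {{passage}} question: {{question}}"
DOC_TO_CHOICE = ["False", "True"]


def render_example(doc: dict) -> tuple:
    """Return (prompt, target) for one document."""
    # Stand-in for Jinja rendering: substitute each {{field}} placeholder.
    text = DOC_TO_TEXT
    for field in ("passage", "question"):
        text = text.replace("{{" + field + "}}", doc[field])
    # The integer label selects the target string from the choice list.
    target = DOC_TO_CHOICE[doc["label"]]
    return text, target


# Hypothetical document with the columns the template refers to.
sample = {
    "passage": "Cats are mammals.",
    "question": "are cats mammals",
    "label": 1,
}
text, target = render_example(sample)
```

Under this reading, a generated answer is compared against `"True"` or `"False"` by `exact_match`, with case and punctuation ignored as configured above.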