Merge branch 'master' of https://github.com/EleutherAI/lm-evaluation-harness into remove-dataset

1f8a8c1d · jon-tow · b4c0275d · b0acb337 · 1f8a8c1d · 1f8a8c1d
Commit 1f8a8c1d authored Jun 11, 2022 by jon-tow
20 changed files
--- a/.coveragerc
+++ b/.coveragerc
 [run]

 # tasks that aren't wired up.
-omit = 
+omit =
    lm_eval/tasks/quac.py
    lm_eval/tasks/storycloze.py
    lm_eval/tasks/cbt.py
@@ -25,4 +25,4 @@ exclude_lines =
    # Don't complain if tests don't hit defensive assertion code:
    raise AssertionError
    raise NotImplementedError
-    return NotImplemented
\ No newline at end of file
+    return NotImplemented
--- a/.flake8
+++ b/.flake8
+[flake8]
+ignore = E203, E266, E501, W503, F403, F401, C901
+max-line-length = 127
+max-complexity = 10
+select = B,C,E,F,W,T4,B9
--- a/.github/workflows/pull_request.yml
+++ b/.github/workflows/pull_request.yml
+name: Pull Request
+
+on: [pull_request]
+
+jobs:
+  pre-commit:
+    runs-on: ubuntu-20.04
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions/setup-python@v2
+        with:
+          python-version: 3.8
+      - uses: pre-commit/action@v2.0.3
--- a/.gitignore
+++ b/.gitignore
@@ -2,4 +2,4 @@ env
 *.pyc
 data/
 lm_cache
-.idea
\ No newline at end of file
+.idea
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
+# Ignore test linting to avoid conflicting changes to version stability.
+exclude: ^tests/testdata/
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.1.0
+    hooks:
+      - id: check-added-large-files
+      - id: check-ast
+      - id: check-byte-order-marker
+      - id: check-case-conflict
+      - id: check-json
+      - id: check-merge-conflict
+      - id: check-symlinks
+      - id: check-yaml
+      - id: destroyed-symlinks
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: no-commit-to-branch
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+      - id: fix-byte-order-marker
+        exclude: docs/CNAME
+      - id: fix-encoding-pragma
+        args: [--remove]
+      - id: mixed-line-ending
+        args: [--fix=lf]
+  - repo: https://gitlab.com/pycqa/flake8
+    rev: 3.7.9
+    hooks:
+      - id: flake8
+  - repo: https://github.com/psf/black
+    rev: 22.3.0
+    hooks:
+      - id: black
+        language_version: python3.8
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.1.0
+    hooks:
+      - id: codespell
+        exclude: >
+          (?x)^(
+              .*\.json|ignore.txt
+          )$
+        args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 ![](https://github.com/EleutherAI/lm-evaluation-harness/workflows/Build/badge.svg)
 [![codecov](https://codecov.io/gh/EleutherAI/lm-evaluation-harness/branch/master/graph/badge.svg?token=JSG3O2427J)](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)

-## Overview 
+## Overview

 This project provides a unified framework to test autoregressive language models (GPT-2, GPT-3, GPTNeo, etc) on a large number of different evaluation tasks.

@@ -26,7 +26,7 @@ To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you ca
 ```bash
 python main.py \
 	--model gpt2 \
-	--device cuda:0 \
+	--device 0 \
 	--tasks lambada,hellaswag
 ```
 (This uses gpt2-117M by default as per HF defaults, use --model_args to specify other gpt2 sizes)
@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model
 python main.py \
 	--model gpt2 \
 	--model_args pretrained=EleutherAI/gpt-neo-2.7B \
-	--device cuda:0 \
+	--device 0 \
 	--tasks lambada,hellaswag
 ```

@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model
 python main.py \
 	--model gpt2 \
 	--model_args pretrained=EleutherAI/gpt-neo-1.3B \
-	--device cuda:0 \
+	--device 0 \
 	--tasks lambada,hellaswag \
 	--num_fewshot 2
 ```
@@ -392,6 +392,21 @@ python write_out.py \

 This will write out one text file for each task.

+### Test Set Decontamination
+
+For more details see the [decontamination guide](./docs/decontamination.md).
+
+The directory provided with the "--decontamination_ngrams_path" argument should contain
+the ngram files and info.json. See the above guide for ngram generation for the pile, this could be adapted for other training sets.
+
+```bash
+python main.py \
+    --model gpt2 \
+    --device 0 \
+    --tasks sciq \
+    --decontamination_ngrams_path path/containing/training/set/ngrams
+```
+
 ### Code Structure

 There are two major components of the library:
@@ -405,9 +420,9 @@ Both LMs (`lm_eval.models`) and Tasks (`lm_eval.tasks`) are kept in a registry d

 The [GPT-3 Evaluations Project](https://github.com/EleutherAI/lm_evaluation_harness/projects/1) tracks our progress implementing new tasks. Right now, we are focused on getting all the datasets loaded so that we can dedupe against the training data. Implementing the actual evaluations is nice but not necessary at the current moment.

-### Task Versioning 
+### Task Versioning

-To help improve reproducibility, all tasks have a VERSION field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict. The purpose of the version is so that if the task definition changes (i.e to fix a bug), then we can know exactly which metrics were computed using the old buggy implementation to avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tests remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one. 
+To help improve reproducibility, all tasks have a VERSION field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict. The purpose of the version is so that if the task definition changes (i.e to fix a bug), then we can know exactly which metrics were computed using the old buggy implementation to avoid unfair comparisons. To enforce this, there are unit tests that make sure the behavior of all tests remains the same as when they were first implemented. Task versions start at 0, and each time a breaking change is made, the version is incremented by one.

 When reporting eval harness results, please also report the version of each task. This can be done either with a separate column in the table, or by reporting the task name with the version appended as such: taskname-v0.


--- a/docs/decontamination.md
+++ b/docs/decontamination.md
+# Decontamination
+
+## Usage
+
+Simply add a "--decontamination_ngrams_path" when running main.py. The provided directory should contain
+the ngram files and info.json produced in "Pile Ngram Generation" further down.
+
+```bash
+python main.py \
+    --model gpt2 \
+    --device 0 \
+    --tasks sciq \
+    --decontamination_ngrams_path path/containing/training/set/ngrams
+```
+
+## Background
+Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set (leakage/contamination).
+
+As a first step this is resolved through training set filtering, however often benchmarks don't exist or weren't considered prior to model training. In this case it is useful to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
+
+The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on dataset, while we just used 13 for simplicity.
+
+## Implementation
+
+Contamination detection can be found in "lm_eval/decontaminate.py" with supporting code in "lm_eval/decontamination/".
+
+decontaminate.py does the following:
+1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
+2. Scan through sorted files containing training set n-grams.
+3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
+
+"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.
+
+This is disabled by default for new tasks, to support decontamination on a task override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
+
+## Pile Ngram Generation
+The relevant scripts can be found in scripts/clean_training_data, which also import from
+"lm_eval/decontamination/"
+
+1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
+2. pip install -r requirements.txt
+3. Download The Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
+4. Place pile files in "pile" directory under "lm-evaluation-harness" (or create a symlink)
+5. Run generate_13_grams.
+
+```bash
+export PYTHONHASHSEED=0
+python -m scripts/clean_training_data/generate_13_grams \
+    -dir path/to/working/directory \
+    -n 13 \
+    -buckets 500
+```
+
+Took approximately 4 days for us. We had the time to wait, but this could be scaled out by doing partial pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of bucket hashing.
+
+6. Sort the generated 13-grams.
+```bash
+python -m scripts/clean_training_data/sort_13_gram_buckets \
+    -dir path/to/working/directory/output
+```
+
+Took approximately 5 days for us. You could speed this up by spreading the files around to different machines and running the sort script before gathering them together.
+
+7. Compress the sorted 13 grams files and place them together with info.json.
+
+This step only takes a few hours.
+
+```bash
+python -m scripts/clean_training_data/compress_and_package \
+    -dir path/to/working/directory \
+    -output path/to/final/directory \
+    -procs 8
+```
+
+Congratulations, the final directory can now be passed to lm-evaulation-harness with the "--decontamination_ngrams_path" argument.
--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -16,7 +16,7 @@ pip install -e ".[dev]"

 ## Creating Your Task File

-From the `lm-evaluation-harness` project root, copy over the `new_task.py` template to `lm_eval/datasets`. 
+From the `lm-evaluation-harness` project root, copy over the `new_task.py` template to `lm_eval/datasets`.

 ```sh
 cp templates/new_task.py lm_eval/tasks/<task-name>.py
@@ -52,7 +52,7 @@ For example, take the QuAC dataset. We have:
 QuAC: Question Answering in Context
 https://arxiv.org/abs/1808.07036

-Question Answering in Context (QuAC) is a dataset for modeling, understanding, and 
+Question Answering in Context (QuAC) is a dataset for modeling, understanding, and
 participating in information seeking dialog. Data instances consist of an interactive
 dialog between two crowd workers: (1) a student who poses a sequence of freeform
 questions to learn as much as possible about a hidden Wikipedia text, and (2)
@@ -72,7 +72,7 @@ Now let's walk through the actual implementation - from data handling to evaluat
 ### Downloading your Data

 All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md)
-. 
+.
 Now, that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:

 ```python
@@ -116,7 +116,7 @@ These should return a Python iterable (`list` or `generator`) of `dict`s that ca

 #### Processing Documents

-At this point, you can also process each individual document to, for example, strip whitespace or "detokenize" its fields. Put the processing logic into `_process_doc` and map the functions across training/validation/test docs inside of the respective functions. 
+At this point, you can also process each individual document to, for example, strip whitespace or "detokenize" its fields. Put the processing logic into `_process_doc` and map the functions across training/validation/test docs inside of the respective functions.
 🔠 If your task is **multiple-choice**, we require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
 See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/6caa0afd96a7a7efb2ec4c1f24ad1756e48f3aa7/lm_eval/tasks/sat.py#L60) for an example. 🔠

@@ -151,6 +151,13 @@ def doc_to_target(self, doc):

 Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.

+### Decontamination
+For background on decontamination please see [this](./decontamination.md).
+
+If you wish to support decontamination studies for your task simply override the "should_decontaminate" method and return true.
+
+You also need to override "doc_to_decontamination_query" and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request, and we leave this up to the implementor. For a multi-choice evaluation you could for example just return the question.
+
 ### Registering Your Task

 Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY`  dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
@@ -165,7 +172,7 @@ python -m scripts.write_out \
    --tasks <your-task> \
    --sets <train | val | test> \
    --num_fewshot K \
-    --num_examples N \ 
+    --num_examples N \
    --description_dict_path <path>
 ```

@@ -192,7 +199,11 @@ def construct_requests(self, doc, ctx):
    """
    return ...
 ```
-If your task requires generating text you'll need to return a `rf.greedy_until` request otherwise an `rf.loglikelihood` across all labels in a classification tasks will do.
+#### What's a `Request`? What's a `doc`?
+To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers and anything else you think will be needed in order to assess your model for a given task. Keep in mind that the fields of this can be basically whatever you want (you can sort this out in `training_docs` \ `validation_docs` \ `test_docs` if you need to customise things - see above), just remember to be consistent with them throughout the rest of the `Task` you write up.
+A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text you'll need to return a `rf.greedy_until` request otherwise an `rf.loglikelihood` across all labels in a classification tasks will do.
+The function `construct_requests` can return a list of `Request`s or an iterable; it's perfectly fine to `yield` them from something or other. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
+

 ```python
 def process_results(self, doc, results):
@@ -207,6 +218,8 @@ def process_results(self, doc, results):
    """
    return {}
 ```
+This is the next step in the chain after `construct_requests`. In between this function and the one above, the request is evaluated. The results of that request are returned in the `results` arg to this function. By processing results, what is meant is calculating the metric or metrics of interest for your dataset using the result and associated ground truth given to this function. It's possible to calculate and return multiple metrics in this function and the logic for it can be whatever you want - as long as you've made sure the ground truth was included in the `doc` object. The dict returned from this function should be of the format `{'metric_name': value}`. It is not necessary to have the same keys for every doc processed using `process_results`; this sort of thing can be handled in the next function, `aggregation`.
+

 ```python
 def aggregation(self):
@@ -217,8 +230,10 @@ def aggregation(self):
    """
    return {}
 ```
+In `process_results`, model outputs are converted into metrics. These metrics are per document metrics, however; the `aggregation` function is used to work out what to do with them to create a corpus-level metric. Imagine you have a bunch of documents, for each of which you have calculated an F1 score. What should that mean overall? Should they be summed, averaged, the min/max found? This function handles that problem.

-See `lm_eval/metrics.py` for a few "built-in" aggregate metrics you can easily import.
+The contents of the function itself are pretty straightforward; it should simply return a dict that maps from each metric label that could be returned by `process_results` to a function that can be used to aggregate that metric. That is to say, if the metrics that `process_results` could return are given by `{'a', 'b', 'c'}`, then all of these keys should be present in the dict returned by `aggregation`.
+__NOTE__: See `lm_eval/metrics.py` for a few "built-in" aggregate metrics you can easily import. The standard metrics available in this package are generally based on `sklearn` functions, so if you are in any doubt for how to set things up the documentation over there can be of assistance. If you need to write a custom metric for some reason, start by looking at the existing ones in `lm_eval/metrics.py` for an idea about what the function signature needs to be.

 ```python
 def higher_is_better(self):
@@ -229,6 +244,7 @@ def higher_is_better(self):
    """
    return {}
 ```
+Finally, this function returns a dict with the same keys as `aggregation` and as it says in the description, simply tells us whether higher scores are better.

 Some tasks that are good examples of various ways evaluation can be implemented can be found here: [LAMBADA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py), [TriviaQA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/triviaqa.py), [SQuAD](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/squad.py).

@@ -279,6 +295,11 @@ class TaskName(...):

 ## Submitting your Task

-Although we currently do not work behind a specific style guide, we'd appreciate if you tidy up your file/s with the `black` formatter (which should've been install through the `requirements.txt`). Keep things clean…ish 🙂.
+You can format your changes and perform flake8 standard checks by running the following commands:
+
+```sh
+pre-commit install
+pre-commit run --all-files
+```

 Now push your work and make a pull request! Thanks for the contribution 👍. If there are any questions, leave a message in the `#lm-thunderdome` channel on the EAI discord.
--- a/ignore.txt
+++ b/ignore.txt
+ROUGE
+rouge
+nin
--- a/lm_eval/base.py
+++ b/lm_eval/base.py
--- a/lm_eval/datasets/README.md
+++ b/lm_eval/datasets/README.md
 # datasets

-This directory contains custom EleutherAI datasets not available in the HuggingFace `datasets` hub.
+This directory contains custom HuggingFace [dataset loading scripts](https://huggingface.co/docs/datasets/dataset_script). They are provided to maintain backward compatibility with the ad-hoc data downloaders in earlier versions of the `lm-evaluation-harness` before HuggingFace [`datasets`](https://huggingface.co/docs/datasets/index) was adopted as the default downloading manager. For example, some instances in the HuggingFace `datasets` repository process features (e.g. whitespace stripping, lower-casing, etc.) in ways that the `lm-evaluation-harness` did not.

-In the rare case that you need to add a custom dataset to this collection, follow the 
-HuggingFace `datasets` guide found [here](https://huggingface.co/docs/datasets/dataset_script).
\ No newline at end of file
+__NOTE__: We are __not__ accepting any additional loading scripts into the main branch! If you'd like to use a custom dataset, fork the repo and follow HuggingFace's loading script guide found [here](https://huggingface.co/docs/datasets/dataset_script). You can then override your `Task`'s `DATASET_PATH` attribute to point to this script's local path.
+
+
+__WARNING__: A handful of loading scripts are included in this collection because they have not yet been pushed to the Huggingface Hub or a HuggingFace organization repo. We will remove such scripts once pushed.
--- a/lm_eval/datasets/arithmetic/arithmetic.py
+++ b/lm_eval/datasets/arithmetic/arithmetic.py
@@ -68,61 +68,111 @@ class Arithmetic(datasets.GeneratorBasedBuilder):
        ArithmeticConfig(
            name="arithmetic_2da",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_addition.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="2-digit addition",
        ),
        ArithmeticConfig(
            name="arithmetic_2ds",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_subtraction.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="2-digit subtraction",
        ),
        ArithmeticConfig(
            name="arithmetic_3da",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/three_digit_addition.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="3-digit addition",
        ),
        ArithmeticConfig(
            name="arithmetic_3ds",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/three_digit_subtraction.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="3-digit subtraction",
        ),
        ArithmeticConfig(
            name="arithmetic_4da",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/four_digit_addition.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="4-digit addition",
        ),
        ArithmeticConfig(
            name="arithmetic_4ds",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/four_digit_subtraction.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="4-digit subtraction",
        ),
        ArithmeticConfig(
            name="arithmetic_5da",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/five_digit_addition.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="5-digit addition",
        ),
        ArithmeticConfig(
            name="arithmetic_5ds",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/five_digit_subtraction.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="5-digit subtraction",
        ),
        ArithmeticConfig(
            name="arithmetic_2dm",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_multiplication.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="2-digit multiplication",
        ),
        ArithmeticConfig(
            name="arithmetic_1dc",
            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/single_digit_three_ops.jsonl",
-            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            features=datasets.Features(
+                {
+                    "context": datasets.Value("string"),
+                    "completion": datasets.Value("string"),
+                }
+            ),
            description="Single digit 3 operations",
        ),
    ]
@@ -155,9 +205,12 @@ class Arithmetic(datasets.GeneratorBasedBuilder):
        with open(filepath, encoding="utf-8") as f:
            for key, row in enumerate(f):
                data = json.loads(row)
-                context = data['context'].strip() \
-                    .replace('\n\n', '\n') \
-                    .replace('Q:', 'Question:') \
-                    .replace('A:', 'Answer:')
-                completion = data['completion']
-                yield key, {'context': context, 'completion': completion}
+                context = (
+                    data["context"]
+                    .strip()
+                    .replace("\n\n", "\n")
+                    .replace("Q:", "Question:")
+                    .replace("A:", "Answer:")
+                )
+                completion = data["completion"]
+                yield key, {"context": context, "completion": completion}
--- a/lm_eval/datasets/arithmetic/dataset_infos.json
+++ b/lm_eval/datasets/arithmetic/dataset_infos.json
--- a/lm_eval/datasets/asdiv/asdiv.py
+++ b/lm_eval/datasets/asdiv/asdiv.py
@@ -50,13 +50,16 @@ _URLS = "https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccf


 class ASDiv(datasets.GeneratorBasedBuilder):
-    """ ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers """
+    """ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers"""

    VERSION = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
-        datasets.BuilderConfig(name="asdiv", version=VERSION,
-                               description="A diverse corpus for evaluating and developing english math word problem solvers")
+        datasets.BuilderConfig(
+            name="asdiv",
+            version=VERSION,
+            description="A diverse corpus for evaluating and developing english math word problem solvers",
+        )
    ]

    def _info(self):
@@ -86,7 +89,9 @@ class ASDiv(datasets.GeneratorBasedBuilder):
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
-                    "filepath": os.path.join(data_dir, base_filepath, "dataset", "ASDiv.xml"),
+                    "filepath": os.path.join(
+                        data_dir, base_filepath, "dataset", "ASDiv.xml"
+                    ),
                    "split": datasets.Split.VALIDATION,
                },
            ),

--- a/lm_eval/datasets/asdiv/dataset_infos.json
+++ b/lm_eval/datasets/asdiv/dataset_infos.json
-{"asdiv": {"description": "ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language\npatterns and problem types) English math word problem (MWP) corpus for evaluating\nthe capability of various MWP solvers. Existing MWP corpora for studying AI progress\nremain limited either in language usage patterns or in problem types. We thus present\na new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem\ntypes taught in elementary school. Each MWP is annotated with its problem type and grade\nlevel (for indicating the level of difficulty).\n", "citation": "@misc{miao2021diverse,\n    title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},\n    author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},\n    year={2021},\n    eprint={2106.15772},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI}\n}\n", "homepage": "https://github.com/chaochun/nlu-asdiv-dataset", "license": "", "features": {"body": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "solution_type": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"dtype": "string", "id": null, "_type": "Value"}, "formula": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "as_div", "config_name": "asdiv", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"validation": {"name": "validation", "num_bytes": 501489, "num_examples": 2305, "dataset_name": "as_div"}}, "download_checksums": {"https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip": {"num_bytes": 440966, "checksum": "8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d"}}, "download_size": 440966, "post_processing_size": null, "dataset_size": 501489, "size_in_bytes": 942455}}
\ No newline at end of file
+{"asdiv": {"description": "ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language\npatterns and problem types) English math word problem (MWP) corpus for evaluating\nthe capability of various MWP solvers. Existing MWP corpora for studying AI progress\nremain limited either in language usage patterns or in problem types. We thus present\na new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem\ntypes taught in elementary school. Each MWP is annotated with its problem type and grade\nlevel (for indicating the level of difficulty).\n", "citation": "@misc{miao2021diverse,\n    title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},\n    author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},\n    year={2021},\n    eprint={2106.15772},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI}\n}\n", "homepage": "https://github.com/chaochun/nlu-asdiv-dataset", "license": "", "features": {"body": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "solution_type": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"dtype": "string", "id": null, "_type": "Value"}, "formula": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "as_div", "config_name": "asdiv", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"validation": {"name": "validation", "num_bytes": 501489, "num_examples": 2305, "dataset_name": "as_div"}}, "download_checksums": {"https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip": {"num_bytes": 440966, "checksum": "8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d"}}, "download_size": 440966, "post_processing_size": null, "dataset_size": 501489, "size_in_bytes": 942455}}
--- a/lm_eval/datasets/coqa/coqa.py
+++ b/lm_eval/datasets/coqa/coqa.py
@@ -61,7 +61,7 @@ _EMPTY_ADDITIONAL_ANSWER = {
            "span_end": -1,
            "span_text": "",
            "input_text": "",
-            "turn_id": -1
+            "turn_id": -1,
        }
    ],
    "1": [
@@ -70,7 +70,7 @@ _EMPTY_ADDITIONAL_ANSWER = {
            "span_end": -1,
            "span_text": "",
            "input_text": "",
-            "turn_id": -1
+            "turn_id": -1,
        }
    ],
    "2": [
@@ -79,7 +79,7 @@ _EMPTY_ADDITIONAL_ANSWER = {
            "span_end": -1,
            "span_text": "",
            "input_text": "",
-            "turn_id": -1
+            "turn_id": -1,
        }
    ],
 }
@@ -91,8 +91,9 @@ class Coqa(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
-        datasets.BuilderConfig(name="coqa", version=VERSION,
-                               description="The CoQA dataset."),
+        datasets.BuilderConfig(
+            name="coqa", version=VERSION, description="The CoQA dataset."
+        ),
    ]

    def _info(self):
@@ -101,41 +102,52 @@ class Coqa(datasets.GeneratorBasedBuilder):
                "id": datasets.Value("string"),
                "source": datasets.Value("string"),
                "story": datasets.Value("string"),
-                "questions": datasets.features.Sequence({
-                    "input_text": datasets.Value("string"),
-                    "turn_id": datasets.Value("int32"),
-                }),
-                "answers": datasets.features.Sequence({
-                    "span_start": datasets.Value("int32"),
-                    "span_end": datasets.Value("int32"),
-                    "span_text": datasets.Value("string"),
-                    "input_text": datasets.Value("string"),
-                    "turn_id": datasets.Value("int32"),
-                }),
-                "additional_answers": {
-                    "0": datasets.features.Sequence({
-                        "span_start": datasets.Value("int32"),
-                        "span_end": datasets.Value("int32"),
-                        "span_text": datasets.Value("string"),
-                        "input_text": datasets.Value("string"),
-                        "turn_id": datasets.Value("int32"),
-                    }),
-                    "1": datasets.features.Sequence({
-                        "span_start": datasets.Value("int32"),
-                        "span_end": datasets.Value("int32"),
-                        "span_text": datasets.Value("string"),
+                "questions": datasets.features.Sequence(
+                    {
                        "input_text": datasets.Value("string"),
                        "turn_id": datasets.Value("int32"),
-                    }),
-                    "2": datasets.features.Sequence({
+                    }
+                ),
+                "answers": datasets.features.Sequence(
+                    {
                        "span_start": datasets.Value("int32"),
                        "span_end": datasets.Value("int32"),
                        "span_text": datasets.Value("string"),
                        "input_text": datasets.Value("string"),
                        "turn_id": datasets.Value("int32"),
-                    }),
-                }
-            })
+                    }
+                ),
+                "additional_answers": {
+                    "0": datasets.features.Sequence(
+                        {
+                            "span_start": datasets.Value("int32"),
+                            "span_end": datasets.Value("int32"),
+                            "span_text": datasets.Value("string"),
+                            "input_text": datasets.Value("string"),
+                            "turn_id": datasets.Value("int32"),
+                        }
+                    ),
+                    "1": datasets.features.Sequence(
+                        {
+                            "span_start": datasets.Value("int32"),
+                            "span_end": datasets.Value("int32"),
+                            "span_text": datasets.Value("string"),
+                            "input_text": datasets.Value("string"),
+                            "turn_id": datasets.Value("int32"),
+                        }
+                    ),
+                    "2": datasets.features.Sequence(
+                        {
+                            "span_start": datasets.Value("int32"),
+                            "span_end": datasets.Value("int32"),
+                            "span_text": datasets.Value("string"),
+                            "input_text": datasets.Value("string"),
+                            "turn_id": datasets.Value("int32"),
+                        }
+                    ),
+                },
+            }
+        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
@@ -175,10 +187,7 @@ class Coqa(datasets.GeneratorBasedBuilder):
                source = row["source"]
                story = row["story"]
                questions = [
-                    {
-                        "input_text": q["input_text"],
-                        "turn_id": q["turn_id"]
-                    }
+                    {"input_text": q["input_text"], "turn_id": q["turn_id"]}
                    for q in row["questions"]
                ]
                answers = [
@@ -187,7 +196,7 @@ class Coqa(datasets.GeneratorBasedBuilder):
                        "span_end": a["span_end"],
                        "span_text": a["span_text"],
                        "input_text": a["input_text"],
-                        "turn_id": a["turn_id"]
+                        "turn_id": a["turn_id"],
                    }
                    for a in row["answers"]
                ]
@@ -201,7 +210,7 @@ class Coqa(datasets.GeneratorBasedBuilder):
                                "span_end": a0["span_end"],
                                "span_text": a0["span_text"],
                                "input_text": a0["input_text"],
-                                "turn_id": a0["turn_id"]
+                                "turn_id": a0["turn_id"],
                            }
                            for a0 in row["additional_answers"]["0"]
                        ],
@@ -211,7 +220,7 @@ class Coqa(datasets.GeneratorBasedBuilder):
                                "span_end": a1["span_end"],
                                "span_text": a1["span_text"],
                                "input_text": a1["input_text"],
-                                "turn_id": a1["turn_id"]
+                                "turn_id": a1["turn_id"],
                            }
                            for a1 in row["additional_answers"]["1"]
                        ],
@@ -221,7 +230,7 @@ class Coqa(datasets.GeneratorBasedBuilder):
                                "span_end": a2["span_end"],
                                "span_text": a2["span_text"],
                                "input_text": a2["input_text"],
-                                "turn_id": a2["turn_id"]
+                                "turn_id": a2["turn_id"],
                            }
                            for a2 in row["additional_answers"]["2"]
                        ],
@@ -232,5 +241,5 @@ class Coqa(datasets.GeneratorBasedBuilder):
                    "source": source,
                    "questions": questions,
                    "answers": answers,
-                    "additional_answers": additional_answers
+                    "additional_answers": additional_answers,
                }
--- a/lm_eval/datasets/coqa/dataset_infos.json
+++ b/lm_eval/datasets/coqa/dataset_infos.json
-{"coqa": {"description": "CoQA is a large-scale dataset for building Conversational Question Answering\nsystems. The goal of the CoQA challenge is to measure the ability of machines to\nunderstand a text passage and answer a series of interconnected questions that\nappear in a conversation.\n", "citation": "@misc{reddy2018coqa,\n    title={CoQA: A Conversational Question Answering Challenge},\n    author={Siva Reddy and Danqi Chen and Christopher D. Manning},\n    year={2018},\n    eprint={1808.07042},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n", "homepage": "https://stanfordnlp.github.io/coqa/", "license": "", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}, "story": {"dtype": "string", "id": null, "_type": "Value"}, "questions": {"feature": {"input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "answers": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "additional_answers": {"0": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "1": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "2": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "coqa", "config_name": "coqa", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 26250528, "num_examples": 7199, "dataset_name": "coqa"}, "validation": {"name": "validation", "num_bytes": 3765933, "num_examples": 500, "dataset_name": "coqa"}}, "download_checksums": {"https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json": {"num_bytes": 49001836, "checksum": "b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6"}, "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json": {"num_bytes": 9090845, "checksum": "dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a"}}, "download_size": 58092681, "post_processing_size": null, "dataset_size": 30016461, "size_in_bytes": 88109142}}
\ No newline at end of file
+{"coqa": {"description": "CoQA is a large-scale dataset for building Conversational Question Answering\nsystems. The goal of the CoQA challenge is to measure the ability of machines to\nunderstand a text passage and answer a series of interconnected questions that\nappear in a conversation.\n", "citation": "@misc{reddy2018coqa,\n    title={CoQA: A Conversational Question Answering Challenge},\n    author={Siva Reddy and Danqi Chen and Christopher D. Manning},\n    year={2018},\n    eprint={1808.07042},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n", "homepage": "https://stanfordnlp.github.io/coqa/", "license": "", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}, "story": {"dtype": "string", "id": null, "_type": "Value"}, "questions": {"feature": {"input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "answers": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "additional_answers": {"0": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "1": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "2": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "coqa", "config_name": "coqa", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 26250528, "num_examples": 7199, "dataset_name": "coqa"}, "validation": {"name": "validation", "num_bytes": 3765933, "num_examples": 500, "dataset_name": "coqa"}}, "download_checksums": {"https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json": {"num_bytes": 49001836, "checksum": "b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6"}, "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json": {"num_bytes": 9090845, "checksum": "dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a"}}, "download_size": 58092681, "post_processing_size": null, "dataset_size": 30016461, "size_in_bytes": 88109142}}
--- a/lm_eval/datasets/drop/dataset_infos.json
+++ b/lm_eval/datasets/drop/dataset_infos.json
-{"drop": {"description": "DROP is a QA dataset which tests comprehensive understanding of paragraphs. In \nthis crowdsourced, adversarially-created, 96k question-answering benchmark, a \nsystem must resolve multiple references in a question, map them onto a paragraph,\nand perform discrete operations over them (such as addition, counting, or sorting).\n", "citation": "@misc{dua2019drop,\n    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs}, \n    author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},\n    year={2019},\n    eprint={1903.00161},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n", "homepage": "https://allenai.org/data/drop", "license": "", "features": {"section_id": {"dtype": "string", "id": null, "_type": "Value"}, "passage": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "query_id": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "validated_answers": {"feature": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "drop", "config_name": "drop", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 108858121, "num_examples": 77409, "dataset_name": "drop"}, "validation": {"name": "validation", "num_bytes": 12560739, "num_examples": 9536, "dataset_name": "drop"}}, "download_checksums": {"https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip": {"num_bytes": 8308692, "checksum": "39d2278a29fd729de301b111a45f434c24834f40df8f4ff116d864589e3249d6"}}, "download_size": 8308692, "post_processing_size": null, "dataset_size": 121418860, "size_in_bytes": 129727552}}
\ No newline at end of file
+{"drop": {"description": "DROP is a QA dataset which tests comprehensive understanding of paragraphs. In \nthis crowdsourced, adversarially-created, 96k question-answering benchmark, a \nsystem must resolve multiple references in a question, map them onto a paragraph,\nand perform discrete operations over them (such as addition, counting, or sorting).\n", "citation": "@misc{dua2019drop,\n    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs}, \n    author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},\n    year={2019},\n    eprint={1903.00161},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n", "homepage": "https://allenai.org/data/drop", "license": "", "features": {"section_id": {"dtype": "string", "id": null, "_type": "Value"}, "passage": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "query_id": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "validated_answers": {"feature": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "drop", "config_name": "drop", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 108858121, "num_examples": 77409, "dataset_name": "drop"}, "validation": {"name": "validation", "num_bytes": 12560739, "num_examples": 9536, "dataset_name": "drop"}}, "download_checksums": {"https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip": {"num_bytes": 8308692, "checksum": "39d2278a29fd729de301b111a45f434c24834f40df8f4ff116d864589e3249d6"}}, "download_size": 8308692, "post_processing_size": null, "dataset_size": 121418860, "size_in_bytes": 129727552}}
--- a/lm_eval/datasets/drop/drop.py
+++ b/lm_eval/datasets/drop/drop.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-# Custom DROP dataet that, unlike HF, keeps all question-answer pairs
+# Custom DROP dataset that, unlike HF, keeps all question-answer pairs
 # even if there are multiple types of answers for the same question.
 """DROP dataset."""

@@ -25,7 +25,7 @@ import datasets

 _CITATION = """\
 @misc{dua2019drop,
-    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs}, 
+    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
    author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
    year={2019},
    eprint={1903.00161},
@@ -35,8 +35,8 @@ _CITATION = """\
 """

 _DESCRIPTION = """\
-DROP is a QA dataset which tests comprehensive understanding of paragraphs. In 
-this crowdsourced, adversarially-created, 96k question-answering benchmark, a 
+DROP is a QA dataset which tests comprehensive understanding of paragraphs. In
+this crowdsourced, adversarially-created, 96k question-answering benchmark, a
 system must resolve multiple references in a question, map them onto a paragraph,
 and perform discrete operations over them (such as addition, counting, or sorting).
 """
@@ -50,17 +50,19 @@ _URLS = {
    "drop": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip",
 }

-_EMPTY_VALIDATED_ANSWER = [{
-    "number": "",
-    "date": {
-        "day": "",
-        "month": "",
-        "year": "",
-    },
-    "spans": [],
-    "worker_id": "",
-    "hit_id": ""
-}]
+_EMPTY_VALIDATED_ANSWER = [
+    {
+        "number": "",
+        "date": {
+            "day": "",
+            "month": "",
+            "year": "",
+        },
+        "spans": [],
+        "worker_id": "",
+        "hit_id": "",
+    }
+]


 class Drop(datasets.GeneratorBasedBuilder):
@@ -69,39 +71,44 @@ class Drop(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
-        datasets.BuilderConfig(name="drop", version=VERSION,
-                               description="The DROP dataset."),
+        datasets.BuilderConfig(
+            name="drop", version=VERSION, description="The DROP dataset."
+        ),
    ]

    def _info(self):
-        features = datasets.Features({
-            "section_id": datasets.Value("string"),
-            "passage": datasets.Value("string"),
-            "question": datasets.Value("string"),
-            "query_id": datasets.Value("string"),
-            "answer": {
-                "number": datasets.Value("string"),
-                "date": {
-                    "day": datasets.Value("string"),
-                    "month": datasets.Value("string"),
-                    "year": datasets.Value("string"),
+        features = datasets.Features(
+            {
+                "section_id": datasets.Value("string"),
+                "passage": datasets.Value("string"),
+                "question": datasets.Value("string"),
+                "query_id": datasets.Value("string"),
+                "answer": {
+                    "number": datasets.Value("string"),
+                    "date": {
+                        "day": datasets.Value("string"),
+                        "month": datasets.Value("string"),
+                        "year": datasets.Value("string"),
+                    },
+                    "spans": datasets.features.Sequence(datasets.Value("string")),
+                    "worker_id": datasets.Value("string"),
+                    "hit_id": datasets.Value("string"),
                },
-                "spans": datasets.features.Sequence(datasets.Value("string")),
-                "worker_id": datasets.Value("string"),
-                "hit_id": datasets.Value("string"),
-            },
-            "validated_answers": datasets.features.Sequence({
-                "number": datasets.Value("string"),
-                "date": {
-                    "day": datasets.Value("string"),
-                    "month": datasets.Value("string"),
-                    "year": datasets.Value("string"),
-                },
-                "spans": datasets.features.Sequence(datasets.Value("string")),
-                "worker_id": datasets.Value("string"),
-                "hit_id": datasets.Value("string"),
-            }),
-        })
+                "validated_answers": datasets.features.Sequence(
+                    {
+                        "number": datasets.Value("string"),
+                        "date": {
+                            "day": datasets.Value("string"),
+                            "month": datasets.Value("string"),
+                            "year": datasets.Value("string"),
+                        },
+                        "spans": datasets.features.Sequence(datasets.Value("string")),
+                        "worker_id": datasets.Value("string"),
+                        "hit_id": datasets.Value("string"),
+                    }
+                ),
+            }
+        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
@@ -118,7 +125,9 @@ class Drop(datasets.GeneratorBasedBuilder):
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
-                    "filepath": os.path.join(data_dir, "drop_dataset", "drop_dataset_train.json"),
+                    "filepath": os.path.join(
+                        data_dir, "drop_dataset", "drop_dataset_train.json"
+                    ),
                    "split": "train",
                },
            ),
@@ -126,7 +135,9 @@ class Drop(datasets.GeneratorBasedBuilder):
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
-                    "filepath": os.path.join(data_dir, "drop_dataset", "drop_dataset_dev.json"),
+                    "filepath": os.path.join(
+                        data_dir, "drop_dataset", "drop_dataset_dev.json"
+                    ),
                    "split": "validation",
                },
            ),

--- a/lm_eval/datasets/headqa/dataset_infos.json
+++ b/lm_eval/datasets/headqa/dataset_infos.json
-{"es": {"description": "HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the\nSpanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio\nde Sanidad, Consumo y Bienestar Social.\nThe dataset contains questions about the following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.\n", "citation": "@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{'o}mez-Rodr{'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n", "homepage": "https://aghie.github.io/head-qa/", "license": "MIT License", "features": {"name": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}, "category": {"dtype": "string", "id": null, "_type": "Value"}, "qid": {"dtype": "int32", "id": null, "_type": "Value"}, "qtext": {"dtype": "string", "id": null, "_type": "Value"}, "ra": {"dtype": "int32", "id": null, "_type": "Value"}, "answers": [{"aid": {"dtype": "int32", "id": null, "_type": "Value"}, "atext": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "head_qa", "config_name": "es", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1196021, "num_examples": 2657, "dataset_name": "head_qa"}, "test": {"name": "test", "num_bytes": 1169819, "num_examples": 2742, "dataset_name": "head_qa"}, "validation": {"name": "validation", "num_bytes": 556924, "num_examples": 1366, "dataset_name": "head_qa"}}, "download_checksums": {"https://drive.google.com/uc?export=download&confirm=t&id=1a_95N5zQQoUCq8IBNVZgziHbeM-QxG2t": {"num_bytes": 79365502, "checksum": "6ec29a3f55153d167f0bdf05395558919ba0b1df9c63e79ffceda2a09884ad8b"}}, "download_size": 79365502, "post_processing_size": null, "dataset_size": 2922764, "size_in_bytes": 82288266}, "en": {"description": "HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the\nSpanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio\nde Sanidad, Consumo y Bienestar Social.\nThe dataset contains questions about the following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.\n", "citation": "@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{'o}mez-Rodr{'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n", "homepage": "https://aghie.github.io/head-qa/", "license": "MIT License", "features": {"name": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}, "category": {"dtype": "string", "id": null, "_type": "Value"}, "qid": {"dtype": "int32", "id": null, "_type": "Value"}, "qtext": {"dtype": "string", "id": null, "_type": "Value"}, "ra": {"dtype": "int32", "id": null, "_type": "Value"}, "answers": [{"aid": {"dtype": "int32", "id": null, "_type": "Value"}, "atext": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "head_qa", "config_name": "en", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1123151, "num_examples": 2657, "dataset_name": "head_qa"}, "test": {"name": "test", "num_bytes": 1097349, "num_examples": 2742, "dataset_name": "head_qa"}, "validation": {"name": "validation", "num_bytes": 523462, "num_examples": 1366, "dataset_name": "head_qa"}}, "download_checksums": {"https://drive.google.com/uc?export=download&confirm=t&id=1a_95N5zQQoUCq8IBNVZgziHbeM-QxG2t": {"num_bytes": 79365502, "checksum": "6ec29a3f55153d167f0bdf05395558919ba0b1df9c63e79ffceda2a09884ad8b"}}, "download_size": 79365502, "post_processing_size": null, "dataset_size": 2743962, "size_in_bytes": 82109464}}
\ No newline at end of file
+{"es": {"description": "HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the\nSpanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio\nde Sanidad, Consumo y Bienestar Social.\nThe dataset contains questions about the following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.\n", "citation": "@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{'o}mez-Rodr{'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n", "homepage": "https://aghie.github.io/head-qa/", "license": "MIT License", "features": {"name": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}, "category": {"dtype": "string", "id": null, "_type": "Value"}, "qid": {"dtype": "int32", "id": null, "_type": "Value"}, "qtext": {"dtype": "string", "id": null, "_type": "Value"}, "ra": {"dtype": "int32", "id": null, "_type": "Value"}, "answers": [{"aid": {"dtype": "int32", "id": null, "_type": "Value"}, "atext": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "head_qa", "config_name": "es", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1196021, "num_examples": 2657, "dataset_name": "head_qa"}, "test": {"name": "test", "num_bytes": 1169819, "num_examples": 2742, "dataset_name": "head_qa"}, "validation": {"name": "validation", "num_bytes": 556924, "num_examples": 1366, "dataset_name": "head_qa"}}, "download_checksums": {"https://drive.google.com/uc?export=download&confirm=t&id=1a_95N5zQQoUCq8IBNVZgziHbeM-QxG2t": {"num_bytes": 79365502, "checksum": "6ec29a3f55153d167f0bdf05395558919ba0b1df9c63e79ffceda2a09884ad8b"}}, "download_size": 79365502, "post_processing_size": null, "dataset_size": 2922764, "size_in_bytes": 82288266}, "en": {"description": "HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the\nSpanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio\nde Sanidad, Consumo y Bienestar Social.\nThe dataset contains questions about the following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.\n", "citation": "@inproceedings{vilares-gomez-rodriguez-2019-head,\n    title = \"{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning\",\n    author = \"Vilares, David  and\n      G{'o}mez-Rodr{'i}guez, Carlos\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P19-1092\",\n    doi = \"10.18653/v1/P19-1092\",\n    pages = \"960--966\",\n    abstract = \"We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.\",\n}\n", "homepage": "https://aghie.github.io/head-qa/", "license": "MIT License", "features": {"name": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}, "category": {"dtype": "string", "id": null, "_type": "Value"}, "qid": {"dtype": "int32", "id": null, "_type": "Value"}, "qtext": {"dtype": "string", "id": null, "_type": "Value"}, "ra": {"dtype": "int32", "id": null, "_type": "Value"}, "answers": [{"aid": {"dtype": "int32", "id": null, "_type": "Value"}, "atext": {"dtype": "string", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "head_qa", "config_name": "en", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1123151, "num_examples": 2657, "dataset_name": "head_qa"}, "test": {"name": "test", "num_bytes": 1097349, "num_examples": 2742, "dataset_name": "head_qa"}, "validation": {"name": "validation", "num_bytes": 523462, "num_examples": 1366, "dataset_name": "head_qa"}}, "download_checksums": {"https://drive.google.com/uc?export=download&confirm=t&id=1a_95N5zQQoUCq8IBNVZgziHbeM-QxG2t": {"num_bytes": 79365502, "checksum": "6ec29a3f55153d167f0bdf05395558919ba0b1df9c63e79ffceda2a09884ad8b"}}, "download_size": 79365502, "post_processing_size": null, "dataset_size": 2743962, "size_in_bytes": 82109464}}