Merge branch 'master' into task_doc

11f614b0 · Stella Biderman · GitHub · 0a6a9b7e · e00d682f · 11f614b0
Unverified Commit 11f614b0 authored Apr 30, 2022 by Stella Biderman Committed by GitHub Apr 30, 2022
20 changed files
--- a/.github/workflows/python-app.yml
+++ b/.github/workflows/python-app.yml
@@ -32,7 +32,7 @@ jobs:
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest pytest-cov
-        pip install -e .
+        pip install -e .[dev]
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |

--- a/CODEOWNERS
+++ b/CODEOWNERS
-* EleutherAI/pm-pile
+* @jon-tow @leogao2 @StellaAthena
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you ca
 ```bash
 python main.py \
 	--model gpt2 \
-	--device cuda:0 \
+	--device 0 \
 	--tasks lambada,hellaswag
 ```
 (This uses gpt2-117M by default as per HF defaults, use --model_args to specify other gpt2 sizes)
@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model
 python main.py \
 	--model gpt2 \
 	--model_args pretrained=EleutherAI/gpt-neo-2.7B \
-	--device cuda:0 \
+	--device 0 \
 	--tasks lambada,hellaswag
 ```
@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model
 python main.py \
 	--model gpt2 \
 	--model_args pretrained=EleutherAI/gpt-neo-1.3B \
-	--device cuda:0 \
+	--device 0 \
 	--tasks lambada,hellaswag \
 	--num_fewshot 2
 ```
@@ -392,6 +392,21 @@ python write_out.py \
 This will write out one text file for each task.
+### Test Set Decontamination
+For more details see the [decontamination guide](./docs/decontamination.md).
+The directory provided with the "--decontamination_ngrams_path" argument should contain
+the ngram files and info.json. The guide above describes ngram generation for the Pile; the same procedure could be adapted for other training sets.
+```bash
+python main.py \
+    --model gpt2 \
+    --device 0 \
+    --tasks sciq \
+    --decontamination_ngrams_path path/containing/training/set/ngrams
+```
 ### Code Structure
 There are two major components of the library:

--- a/docs/decontamination.md
+++ b/docs/decontamination.md
+# Decontamination
+## Usage
+Simply add the "--decontamination_ngrams_path" argument when running main.py. The provided directory should contain
+the ngram files and info.json produced in "Pile Ngram Generation" further down.
+```bash
+python main.py \
+    --model gpt2 \
+    --device 0 \
+    --tasks sciq \
+    --decontamination_ngrams_path path/containing/training/set/ngrams
+```
+## Background
+Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set (leakage/contamination).
+As a first step this is addressed through training set filtering. However, benchmarks often don't exist yet or weren't considered prior to model training. In that case it is useful to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
+The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on dataset, while we just used 13 for simplicity.
+## Implementation
+Contamination detection can be found in "lm_eval/decontaminate.py" with supporting code in "lm_eval/decontamination/". 
+decontaminate.py does the following:
+1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
+2. Scan through sorted files containing training set n-grams.
+3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
+"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix. 
+This is disabled by default for new tasks. To support decontamination on a task, override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
+## Pile Ngram Generation
+The relevant scripts can be found in scripts/clean_training_data, which also import from
+"lm_eval/decontamination/".
+1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
+2. pip install -r requirements.txt
+3. Download The Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
+4. Place pile files in "pile" directory under "lm-evaluation-harness" (or create a symlink)
+5. Run generate_13_grams.
+```bash
+export PYTHONHASHSEED=0
+python -m scripts.clean_training_data.generate_13_grams \
+    -dir path/to/working/directory \
+    -n 13 \
+    -buckets 500
+```
+This took approximately 4 days for us. We had the time to wait, but it could be scaled out by running partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of bucket hashing.
+6. Sort the generated 13-grams.
+```bash
+python -m scripts.clean_training_data.sort_13_gram_buckets \
+    -dir path/to/working/directory/output
+```
+This took approximately 5 days for us. You could speed it up by spreading the files across different machines and running the sort script on each before gathering the results together.
+7. Compress the sorted 13-gram files and place them together with info.json.
+This step only takes a few hours.
+```bash
+python -m scripts.clean_training_data.compress_and_package \
+    -dir path/to/working/directory \
+    -output path/to/final/directory \
+    -procs 8
+```
+Congratulations, the final directory can now be passed to lm-evaluation-harness with the "--decontamination_ngrams_path" argument.
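The three steps listed under "Implementation" in the new decontamination guide can be sketched roughly as follows. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the harness's own normalization may differ. `ngrams` and `find_contaminated` are hypothetical names, not functions from `lm_eval/decontaminate.py`.

```python
from collections import defaultdict


def ngrams(text, n=13):
    """Return the set of word-level n-grams in `text` (assumed normalization: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(eval_docs, training_ngrams, n=13):
    """
    eval_docs: dict mapping (task_name, doc_id) -> query text.
    training_ngrams: iterable of n-gram strings from the training set.
    Returns the set of (task_name, doc_id) keys that share at least one n-gram.
    """
    # Step 1: index every evaluation n-gram by the documents that contain it.
    index = defaultdict(set)
    for key, text in eval_docs.items():
        for gram in ngrams(text, n):
            index[gram].add(key)

    # Steps 2 and 3: scan the training n-grams and mark every matching document.
    contaminated = set()
    for gram in training_ngrams:
        contaminated |= index.get(gram, set())
    return contaminated


# Toy usage with n=3 instead of 13 to keep the example short.
eval_docs = {
    ("sciq", 0): "The quick brown fox jumps",
    ("sciq", 1): "An entirely different sentence here",
}
flagged = find_contaminated(eval_docs, ["quick brown fox"], n=3)
```

Building the index over the (small) evaluation set and streaming the (large) training n-grams past it keeps memory bounded by the benchmark size rather than by the training corpus.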
--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -11,19 +11,28 @@ If you haven't already, go ahead and fork the main repo, clone it, create a bran
 git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
 cd lm-evaluation-harness
 git checkout -b <task-name>
-pip install -r requirements.txt
+pip install -e ".[dev]"
 ```
 ## Creating Your Task File
-The first step in creating a task is to create a Python file in `lm_eval/tasks/`  with the task's name:
+From the `lm-evaluation-harness` project root, copy over the `new_task.py` template to `lm_eval/tasks`.
 ```sh
-cd lm_eval/tasks
+cp templates/new_task.py lm_eval/tasks/<task-name>.py
-touch <task-name>.py
 ```
-Then open the file and create a multiline docstring on the first line with the following contents:
+or if your task is **multiple-choice**, the `new_multiple_choice_task.py`:
+```sh
+cp templates/new_multiple_choice_task.py lm_eval/tasks/<task-name>.py
+```
+This will set you up with a few `TODO`s to fill in, which we'll now go over in detail.
+## Task Heading
+Open the file you've just created and add a multiline docstring on the first line with the following contents:
 ```python
 """
@@ -62,102 +71,92 @@ Now let's walk through the actual implementation - from data handling to evaluat
 ### Downloading your Data
-There are 2 standard approaches we follow for downloading data:
+All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
+Now that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:
-1. Firstly, you should always check to see if your task's dataset is already provided by HuggingFace (__HF__); check their `datasets` catalog [here](https://huggingface.co/datasets). Is it in there? If yes, continue reading here, else go to 2. In the case that it’s there, things are a bit easier.  You can inherit from the `HFTask` class as so:
+```python
+class TaskName(...):
-    ```python
+    DATASET_PATH = "..."
-    from . common import HFTask
+    DATASET_NAME = "..."
+```
-    class TaskName(HFTask):
-        DATASET_PATH = "..."
-        DATASET_NAME = "..."
-    ```
-	where `DATASET_PATH` is the name of the benchmark/task dataset as listed by HF and `DATASET_NAME` is the name of, what HF calls, a “data instance” of the benchmark. If your task is not a benchmark containing any data instances just set `DATASET_NAME = None`.
-2. Your task's dataset is not in HF's catalog, so you'll have to override a few abstract methods of the `Task` base class. First let's define our benchmark/task and inherit from `Task`.
-    ```python
-    from lm_eval.base import Task
-    from pathlib import Path
-    class TaskName(Task):
-        DATASET_PATH = Path("data/<task-name>")
-    ```
-    where `DATASET_PATH` is the local directory we'll download into.
-    Now we need to override the following methods:
-    ```python
+where `DATASET_PATH` is the name of the dataset as listed by HF in the `datasets` Hub and `DATASET_NAME` is the name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just set `DATASET_NAME = None`.
-    def download(self):
+(If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
-    ```
-    This should download the dataset into the relative path specified by `DATASET_PATH`. The preferred approach is to use EleutherAI's [best-download](https://github.com/EleutherAI/best-download) package which provides a `download_file` function that lets you validate complete data transmission through a checksum argument.  The overall logic should be something like: If the `DATASET_PATH` already exists then don’t download anything and exit the method, otherwise create the `DATASET_PATH` directory and actually download into it.  See this [task](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/logiqa.py#L9-L21) for an example.
-   Next up, we have to set some “flags”:
+Next up, we have to set some “flags”:
-    ```python
+```python
    def has_training_docs(self):
        return # True/False
    def has_validation_docs(self):
        return # True/False
    def has_test_docs(self):
        return # True/False
-    ```
+```
-   These methods return `True`/`False` whether or not your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.
-	Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
+These methods return `True` or `False` depending on whether your task dataset provides documents for each split type. __Note__: if the test set does not have publicly available answer labels, please do not put it down as having a test set; return `False` instead.
-    ```python
+Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
+```python
    def training_docs(self):
        return #...
    def validation_docs(self):
        return #...
    def test_docs(self):
        return #...
-    ```
+```
-	These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples. Note that this will get used in a few places down the track, including for both creating examples to be processed by a model and also for assessing the results. There's nothing stopping you getting these functions be generators, which can be handy if each source document in your dataset contains a number of questions/prompts to handle. The overarching idea is that what comes out of this function should definitively contain an input to be presented to the model, the associated ground truth and anything else you think is needed to judge the quality of the output the model provides. __NOTE__: If your task doesn't have a train/validation/test set, remember to raise a `NotImplementedError` for that specific split.
-### Formatting your Few-Shot Examples
+These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples.
-The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples.
+#### Processing Documents
-<br>
+At this point, you can also process each individual document to, for example, strip whitespace or "detokenize" its fields. Put the processing logic into `_process_doc` and map the functions across training/validation/test docs inside of the respective functions. 
+🔠 If your task is **multiple-choice**, we require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
+See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/6caa0afd96a7a7efb2ec4c1f24ad1756e48f3aa7/lm_eval/tasks/sat.py#L60) for an example. 🔠
-⚠️  **Multiple-Choice Formatting**
+### Formatting your Few-Shot Examples
-If your task is **multiple-choice**, just inherit from the `MultipleChoiceTask` class we provide.
+The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples.
-```python
+Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.
-from lm_eval.base import MultipleChoiceTask
-class TaskName(..., MultipleChoiceTask):
+```python
+def doc_to_text(self, doc):
+    return ""
 ```
-This will require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
+<br>
-See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/105fa9741ff660f6a62c2eef0d2facfde36dda41/lm_eval/tasks/sat.py#L56) for an example. When used in combination with `HFTask`, it may be useful to override [`_convert_standard`](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/common.py#L28), which will be applied to every document in the HF dataset. See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/headqa.py) for an example of this.
+️🔠 **Multiple-Choice Formatting**
-You can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>.
+If your task is multiple-choice, you can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>.
-⚠️  **End Multiple-Choice Formatting**
+️️🔠 **End Multiple-Choice Formatting**
 <br>
-In the case your task is _not_ multiple-choice, override the following methods for your task class:
+Format the target answer from the contents of `doc`. Note that the prepended `" "` is required to space out the `doc_to_text` and `doc_to_target` strings.
-Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.
 ```python
-def doc_to_text(self, doc):
+def doc_to_target(self, doc):
-    return ""
+    target = ""
+    return " " + target
 ```
-Put the target answer of the prompt here, in the form: `" " + <answer>`.
+Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
-```python
+### Decontamination
-def doc_to_target(self, doc):
+For background on decontamination please see [this](./decontamination.md).
-    return ""
-```
+If you wish to support decontamination studies for your task, simply override the "should_decontaminate" method and return `True`.
-Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
+You also need to override "doc_to_decontamination_query" and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request, and we leave this up to the implementor. For a multi-choice evaluation you could for example just return the question.
 ### Registering Your Task
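The formatting and decontamination hooks described in the task guide changes above can be sketched as follows. This is only an illustration: `ToyQATask` is a hypothetical class that deliberately does not inherit from `lm_eval.base.Task`, so it runs standalone, and the `question`/`answer` field names are assumptions borrowed from the guide's example document.

```python
class ToyQATask:
    """Hypothetical stand-in for a Task subclass, showing only the hooks discussed in the guide."""

    def doc_to_text(self, doc):
        # Query prompt WITHOUT the answer.
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # The leading space separates the prompt from the target when they are concatenated.
        return " " + doc["answer"]

    def should_decontaminate(self):
        # Opt in to decontamination for this task.
        return True

    def doc_to_decontamination_query(self, doc):
        # Compare only the question against the training set n-grams.
        return doc["question"]


task = ToyQATask()
doc = {"question": "What is the capital of France?", "answer": "Paris"}

# In the k-shot setting (k > 0), labeled examples are built by concatenating both strings.
labeled_example = task.doc_to_text(doc) + task.doc_to_target(doc)
```

Returning only the question from `doc_to_decontamination_query` follows the guide's suggestion for multiple-choice style evaluations; a task could equally return the full document.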

--- a/lm_eval/base.py
+++ b/lm_eval/base.py
@@ -6,6 +6,7 @@ import re
 import os
 import json
 import hashlib
+import datasets
 from sqlitedict import SqliteDict
 from tqdm import tqdm
 import torch
@@ -346,14 +347,77 @@ class Task(abc.ABC):
        {"question": ..., "answer": ...} or
        {"question": ..., question, answer)
    """
-    def __init__(self):
-        self.download()
+    # The name of the `Task` benchmark as denoted in the HuggingFace datasets Hub
+    # or a path to a custom `datasets` loading script.
+    DATASET_PATH: str = None
+    # The name of a subset within `DATASET_PATH`.
+    DATASET_NAME: str = None
+    def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
+        """
+        :param data_dir: str
+            Stores the path to a local folder containing the `Task`'s data files.
+            Use this to specify the path to manually downloaded data (usually when
+            the dataset is not publicly accessible).
+        :param cache_dir: str
+            The directory to read/write the `Task` dataset. This follows the
+            HuggingFace `datasets` API with the default cache directory located at:
+                `~/.cache/huggingface/datasets`
+            NOTE: You can change the cache location globally for a given process
+            by setting the shell environment variable, `HF_DATASETS_CACHE`,
+            to another directory:
+                `export HF_DATASETS_CACHE="/path/to/another/directory"`
+        :param download_mode: datasets.DownloadMode
+            How to treat pre-existing `Task` downloads and data.
+            - `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
+                Reuse download and reuse dataset.
+            - `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
+                Reuse download with fresh dataset.
+            - `datasets.DownloadMode.FORCE_REDOWNLOAD`
+                Fresh download and fresh dataset.
+        """
+        self.download(data_dir, cache_dir, download_mode)
        self._training_docs = None
        self._fewshot_docs = None
-    def download(self):
+    def download(self, data_dir=None, cache_dir=None, download_mode=None):
-        """Downloads the task dataset if necessary"""
+        """ Downloads the task dataset and stores it in `self.dataset`.
-        pass
+        Override this method to download the dataset from a custom API.
+        :param data_dir: str
+            Stores the path to a local folder containing the `Task`'s data files.
+            Use this to specify the path to manually downloaded data (usually when
+            the dataset is not publicly accessible).
+        :param cache_dir: str
+            The directory to read/write the `Task` dataset. This follows the
+            HuggingFace `datasets` API with the default cache directory located at:
+                `~/.cache/huggingface/datasets`
+            NOTE: You can change the cache location globally for a given process
+            by setting the shell environment variable, `HF_DATASETS_CACHE`,
+            to another directory:
+                `export HF_DATASETS_CACHE="/path/to/another/directory"`
+        :param download_mode: datasets.DownloadMode
+            How to treat pre-existing `Task` downloads and data.
+            - `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
+                Reuse download and reuse dataset.
+            - `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
+                Reuse download with fresh dataset.
+            - `datasets.DownloadMode.FORCE_REDOWNLOAD`
+                Fresh download and fresh dataset.
+        """
+        self.dataset = datasets.load_dataset(
+            path=self.DATASET_PATH,
+            name=self.DATASET_NAME,
+            data_dir=data_dir,
+            cache_dir=cache_dir,
+            download_mode=download_mode
+        )
+    def should_decontaminate(self):
+        """Whether this task supports decontamination against model training set."""
+        return False
    @abstractmethod
    def has_training_docs(self):
@@ -391,12 +455,27 @@ class Task(abc.ABC):
        """
        return []
+    def _process_doc(self, doc):
+        """
+        Override this to process (detokenize, strip, replace, etc.) individual
+        documents. This can be used in a map over documents of a data split.
+        E.g. `map(self._process_doc, self.dataset["validation"])`
+        :return: dict
+            The processed version of the specified `doc`.
+        """
+        return doc
    def fewshot_examples(self, k, rnd):
        if self._training_docs is None:
            self._training_docs = list(self.training_docs())
        return rnd.sample(self._training_docs, k)
+    def doc_to_decontamination_query(self, doc):
+        print("Override doc_to_decontamination_query with document specific decontamination query.")
+        assert(False)
    @abstractmethod
    def doc_to_text(self, doc):
        pass
@@ -514,7 +593,8 @@ class Task(abc.ABC):
        return description + labeled_examples + example
-class MultipleChoiceTask(Task, abc.ABC):
+class MultipleChoiceTask(Task):
    def doc_to_target(self, doc):
        return " " + doc['choices'][doc['gold']]
@@ -553,6 +633,10 @@ class MultipleChoiceTask(Task, abc.ABC):
 class PerplexityTask(Task, abc.ABC):
+    def should_decontaminate(self):
+        """Whether this task supports decontamination against model training set."""
+        return True
    def has_training_docs(self):
        return False
@@ -561,11 +645,11 @@ class PerplexityTask(Task, abc.ABC):
        return []
    def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
-        assert num_fewshot == 0
+        assert num_fewshot == 0, "The number of fewshot examples must be 0 for perplexity tasks."
-        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
+        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`."
        assert not provide_description, (
            "The `provide_description` arg will be removed in future versions. To prepend "
-            "a custom description to the context, supply the corresponding string via the  "
+            "a custom description to the context, supply the corresponding string via the "
            "`description` arg."
        )
        if provide_description is not None:
@@ -581,6 +665,9 @@ class PerplexityTask(Task, abc.ABC):
            "bits_per_byte": False,
        }
+    def doc_to_decontamination_query(self, doc):
+        return doc
    def doc_to_text(self, doc):
        return ""
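The `download` docstring in this diff says the method can be overridden to fetch data from a custom source, and `_process_doc` is meant to be mapped over a split. A minimal sketch of that pattern is shown below. `MiniTask` is a hypothetical stand-in that mirrors only the constructor/`download` shape of the new `Task` class so that the example runs without `lm_eval` or network access; the in-memory data is made up for illustration.

```python
class MiniTask:
    """Hypothetical stand-in mirroring the `__init__` -> `download` flow of `Task` in this diff."""

    DATASET_PATH = None
    DATASET_NAME = None

    def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
        # As in the diff, construction triggers the download step.
        self.download(data_dir, cache_dir, download_mode)

    def download(self, data_dir=None, cache_dir=None, download_mode=None):
        raise NotImplementedError("The real Task calls datasets.load_dataset here.")

    def _process_doc(self, doc):
        # Default: documents pass through unchanged.
        return doc


class InMemoryTask(MiniTask):
    def download(self, data_dir=None, cache_dir=None, download_mode=None):
        # Custom source instead of the HF Hub: a dict keyed by split name (made-up data).
        self.dataset = {"validation": [{"text": "  hello world \n"}, {"text": "foo  "}]}

    def has_validation_docs(self):
        return True

    def validation_docs(self):
        # Map the per-document processing over the split.
        return map(self._process_doc, self.dataset["validation"])

    def _process_doc(self, doc):
        return {"text": doc["text"].strip()}


docs = list(InMemoryTask().validation_docs())
```

Keeping the cleanup in `_process_doc` rather than in `validation_docs` means the same logic can be reused for the training and test splits.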

--- a/lm_eval/datasets/README.md
+++ b/lm_eval/datasets/README.md
+# datasets
+This directory contains custom EleutherAI datasets not available in the HuggingFace `datasets` hub.
+In the rare case that you need to add a custom dataset to this collection, follow the 
+HuggingFace `datasets` guide found [here](https://huggingface.co/docs/datasets/dataset_script).
\ No newline at end of file
--- a/lm_eval/datasets/__init__.py
+++ b/lm_eval/datasets/__init__.py
--- a/lm_eval/datasets/arithmetic/__init__.py
+++ b/lm_eval/datasets/arithmetic/__init__.py
--- a/lm_eval/datasets/arithmetic/arithmetic.py
+++ b/lm_eval/datasets/arithmetic/arithmetic.py
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""GPT-3 Arithmetic Test Dataset."""
+import json
+import datasets
+_CITATION = """\
+@inproceedings{NEURIPS2020_1457c0d6,
+    author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
+    booktitle = {Advances in Neural Information Processing Systems},
+    editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
+    pages = {1877--1901},
+    publisher = {Curran Associates, Inc.},
+    title = {Language Models are Few-Shot Learners},
+    url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
+    volume = {33},
+    year = {2020}
+}
+"""
+_DESCRIPTION = """\
+A small battery of 10 tests that involve asking language models a simple arithmetic
+problem in natural language.
+"""
+_HOMEPAGE = "https://github.com/openai/gpt-3/tree/master/data"
+# TODO: Add the licence for the dataset here if you can find it
+_LICENSE = ""
+class ArithmeticConfig(datasets.BuilderConfig):
+    """BuilderConfig for GPT3 Arithmetic Test Dataset."""
+    def __init__(self, url, features, **kwargs):
+        """BuilderConfig for GPT3 Arithmetic dataset.
+        Args:
+        url: *string*, the url to the specific subset of the GPT3 Arithmetic dataset.
+        features: *list[string]*, list of the features that will appear in the
+            feature dict.
+        """
+        # Version history:
+        super().__init__(version=datasets.Version("0.0.1"), **kwargs)
+        self.url = url
+        self.features = features
+class Arithmetic(datasets.GeneratorBasedBuilder):
+    """A small battery of 10 tests involving simple arithmetic problems."""
+    BUILDER_CONFIGS = [
+        ArithmeticConfig(
+            name="arithmetic_2da",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_addition.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="2-digit addition",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_2ds",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_subtraction.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="2-digit subtraction",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_3da",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/three_digit_addition.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="3-digit addition",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_3ds",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/three_digit_subtraction.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="3-digit subtraction",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_4da",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/four_digit_addition.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="4-digit addition",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_4ds",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/four_digit_subtraction.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="4-digit subtraction",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_5da",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/five_digit_addition.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="5-digit addition",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_5ds",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/five_digit_subtraction.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="5-digit subtraction",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_2dm",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_multiplication.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="2-digit multiplication",
+        ),
+        ArithmeticConfig(
+            name="arithmetic_1dc",
+            url="https://raw.githubusercontent.com/openai/gpt-3/master/data/single_digit_three_ops.jsonl",
+            features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
+            description="Single digit 3 operations",
+        ),
+    ]
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=f"{_DESCRIPTION}\n{self.config.description}",
+            features=self.config.features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+    def _split_generators(self, dl_manager):
+        urls = self.config.url
+        data_dir = dl_manager.download_and_extract(urls)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": data_dir,
+                    "split": datasets.Split.VALIDATION,
+                },
+            ),
+        ]
+    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+    def _generate_examples(self, filepath, split):
+        with open(filepath, encoding="utf-8") as f:
+            for key, row in enumerate(f):
+                data = json.loads(row)
+                context = data['context'].strip() \
+                    .replace('\n\n', '\n') \
+                    .replace('Q:', 'Question:') \
+                    .replace('A:', 'Answer:')
+                completion = data['completion']
+                yield key, {'context': context, 'completion': completion}
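Reviewer note: the clean-up chain in `_generate_examples` above can be checked in isolation. The following is a minimal sketch using only the standard library. The helper name `normalize_context` and the sample JSONL row are hypothetical and not part of this commit; the real file contents may differ.

```python
import json


def normalize_context(raw_context: str) -> str:
    """Apply the same strip/replace chain used in `_generate_examples`."""
    return (
        raw_context.strip()
        .replace("\n\n", "\n")       # collapse blank lines between Q and A
        .replace("Q:", "Question:")  # expand the short question marker
        .replace("A:", "Answer:")    # expand the short answer marker
    )


# Hypothetical JSONL row, shaped like the `context`/`completion` features.
sample_line = json.dumps({"context": "Q: What is 2 plus 3?\n\nA:", "completion": " 5"})

row = json.loads(sample_line)
example = {
    "context": normalize_context(row["context"]),
    "completion": row["completion"],  # left untouched, as in the loader
}
print(example)
```

The replacements are applied in order, so the `Q:` expansion runs before the `A:` expansion; since "Question:" contains no `A:` substring, the second replacement does not touch the text produced by the first.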
--- a/lm_eval/datasets/arithmetic/dataset_infos.json
+++ b/lm_eval/datasets/arithmetic/dataset_infos.json
--- a/lm_eval/datasets/asdiv/__init__.py
+++ b/lm_eval/datasets/asdiv/__init__.py
--- a/lm_eval/datasets/asdiv/asdiv.py
+++ b/lm_eval/datasets/asdiv/asdiv.py
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""ASDIV dataset."""
+import os
+import xml.etree.ElementTree as ET
+import datasets
+_CITATION = """\
+@misc{miao2021diverse,
+    title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},
+    author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},
+    year={2021},
+    eprint={2106.15772},
+    archivePrefix={arXiv},
+    primaryClass={cs.AI}
+}
+"""
+_DESCRIPTION = """\
+ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language
+patterns and problem types) English math word problem (MWP) corpus for evaluating
+the capability of various MWP solvers. Existing MWP corpora for studying AI progress
+remain limited either in language usage patterns or in problem types. We thus present
+a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem
+types taught in elementary school. Each MWP is annotated with its problem type and grade
+level (for indicating the level of difficulty).
+"""
+_HOMEPAGE = "https://github.com/chaochun/nlu-asdiv-dataset"
+# TODO: Add the licence for the dataset here if you can find it
+_LICENSE = ""
+_URLS = "https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip"
+class ASDiv(datasets.GeneratorBasedBuilder):
+    """ ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers """
+    VERSION = datasets.Version("0.0.1")
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name="asdiv", version=VERSION,
+                               description="A diverse corpus for evaluating and developing English math word problem solvers")
+    ]
+    def _info(self):
+        features = datasets.Features(
+            {
+                "body": datasets.Value("string"),
+                "question": datasets.Value("string"),
+                "solution_type": datasets.Value("string"),
+                "answer": datasets.Value("string"),
+                "formula": datasets.Value("string"),
+            }
+        )
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+    def _split_generators(self, dl_manager):
+        urls = _URLS
+        data_dir = dl_manager.download_and_extract(urls)
+        base_filepath = "nlu-asdiv-dataset-55790e5270bb91ccfa5053194b25732534696b50"
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, base_filepath, "dataset", "ASDiv.xml"),
+                    "split": datasets.Split.VALIDATION,
+                },
+            ),
+        ]
+    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+    def _generate_examples(self, filepath, split):
+        tree = ET.parse(filepath)
+        root = tree.getroot()
+        for key, problem in enumerate(root.iter("Problem")):
+            yield key, {
+                "body": problem.find("Body").text,
+                "question": problem.find("Question").text,
+                "solution_type": problem.find("Solution-Type").text,
+                "answer": problem.find("Answer").text,
+                "formula": problem.find("Formula").text,
+            }
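Reviewer note: the XML walk in `_generate_examples` above can be exercised without downloading the archive. This is a sketch only. The inline document, the `<Corpus>` root tag, the `FIELDS` mapping, and the `parse_problems` helper are all hypothetical; the child tag names are taken from the loader, and `ET.fromstring` stands in for the loader's `ET.parse(filepath)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the child tag names come from the loader above.
SAMPLE_XML = """\
<Corpus>
  <Problem>
    <Body>Seven red apples and two green apples are in the basket.</Body>
    <Question>How many apples are in the basket?</Question>
    <Solution-Type>Addition</Solution-Type>
    <Answer>9 (apples)</Answer>
    <Formula>7+2=9</Formula>
  </Problem>
</Corpus>
"""

# Feature name -> XML tag, mirroring the dict built in `_generate_examples`.
FIELDS = {
    "body": "Body",
    "question": "Question",
    "solution_type": "Solution-Type",
    "answer": "Answer",
    "formula": "Formula",
}


def parse_problems(xml_text):
    """Yield (key, example) pairs, one per <Problem> element."""
    root = ET.fromstring(xml_text)
    for key, problem in enumerate(root.iter("Problem")):
        yield key, {name: problem.find(tag).text for name, tag in FIELDS.items()}


examples = list(parse_problems(SAMPLE_XML))
print(examples[0])
```

Note that `find` returns `None` for a missing child, so a `<Problem>` lacking one of these tags would raise `AttributeError` on `.text`, in the sketch and in the loader alike.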
--- a/lm_eval/datasets/asdiv/dataset_infos.json
+++ b/lm_eval/datasets/asdiv/dataset_infos.json
+{"asdiv": {"description": "ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language\npatterns and problem types) English math word problem (MWP) corpus for evaluating\nthe capability of various MWP solvers. Existing MWP corpora for studying AI progress\nremain limited either in language usage patterns or in problem types. We thus present\na new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem\ntypes taught in elementary school. Each MWP is annotated with its problem type and grade\nlevel (for indicating the level of difficulty).\n", "citation": "@misc{miao2021diverse,\n    title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},\n    author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},\n    year={2021},\n    eprint={2106.15772},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI}\n}\n", "homepage": "https://github.com/chaochun/nlu-asdiv-dataset", "license": "", "features": {"body": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "solution_type": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"dtype": "string", "id": null, "_type": "Value"}, "formula": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "as_div", "config_name": "asdiv", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"validation": {"name": "validation", "num_bytes": 501489, "num_examples": 2305, "dataset_name": "as_div"}}, "download_checksums": {"https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip": {"num_bytes": 440966, "checksum": "8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d"}}, "download_size": 440966, "post_processing_size": null, "dataset_size": 501489, "size_in_bytes": 942455}}
\ No newline at end of file
--- a/lm_eval/datasets/coqa/__init__.py
+++ b/lm_eval/datasets/coqa/__init__.py
--- a/lm_eval/datasets/coqa/coqa.py
+++ b/lm_eval/datasets/coqa/coqa.py
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""CoQA dataset.
+This `CoQA` adds the "additional_answers" feature that's missing in the original
+datasets version:
+https://github.com/huggingface/datasets/blob/master/datasets/coqa/coqa.py
+"""
+import json
+import datasets
+_CITATION = """\
+@misc{reddy2018coqa,
+    title={CoQA: A Conversational Question Answering Challenge},
+    author={Siva Reddy and Danqi Chen and Christopher D. Manning},
+    year={2018},
+    eprint={1808.07042},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+"""
+_DESCRIPTION = """\
+CoQA is a large-scale dataset for building Conversational Question Answering
+systems. The goal of the CoQA challenge is to measure the ability of machines to
+understand a text passage and answer a series of interconnected questions that
+appear in a conversation.
+"""
+_HOMEPAGE = "https://stanfordnlp.github.io/coqa/"
+# TODO: Add the licence for the dataset here if you can find it
+_LICENSE = ""
+_URLS = {
+    "train": "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json",
+    "validation": "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json",
+}
+# `additional_answers` are not available in the train set so we fill them with
+# empty dicts of the same form.
+_EMPTY_ADDITIONAL_ANSWER = {
+    "0": [
+        {
+            "span_start": -1,
+            "span_end": -1,
+            "span_text": "",
+            "input_text": "",
+            "turn_id": -1
+        }
+    ],
+    "1": [
+        {
+            "span_start": -1,
+            "span_end": -1,
+            "span_text": "",
+            "input_text": "",
+            "turn_id": -1
+        }
+    ],
+    "2": [
+        {
+            "span_start": -1,
+            "span_end": -1,
+            "span_text": "",
+            "input_text": "",
+            "turn_id": -1
+        }
+    ],
+}
+class Coqa(datasets.GeneratorBasedBuilder):
+    """CoQA is a large-scale dataset for building Conversational Question Answering systems."""
+    VERSION = datasets.Version("0.0.1")
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name="coqa", version=VERSION,
+                               description="The CoQA dataset."),
+    ]
+    def _info(self):
+        features = datasets.Features(
+            {
+                "id": datasets.Value("string"),
+                "source": datasets.Value("string"),
+                "story": datasets.Value("string"),
+                "questions": datasets.features.Sequence({
+                    "input_text": datasets.Value("string"),
+                    "turn_id": datasets.Value("int32"),
+                }),
+                "answers": datasets.features.Sequence({
+                    "span_start": datasets.Value("int32"),
+                    "span_end": datasets.Value("int32"),
+                    "span_text": datasets.Value("string"),
+                    "input_text": datasets.Value("string"),
+                    "turn_id": datasets.Value("int32"),
+                }),
+                "additional_answers": {
+                    "0": datasets.features.Sequence({
+                        "span_start": datasets.Value("int32"),
+                        "span_end": datasets.Value("int32"),
+                        "span_text": datasets.Value("string"),
+                        "input_text": datasets.Value("string"),
+                        "turn_id": datasets.Value("int32"),
+                    }),
+                    "1": datasets.features.Sequence({
+                        "span_start": datasets.Value("int32"),
+                        "span_end": datasets.Value("int32"),
+                        "span_text": datasets.Value("string"),
+                        "input_text": datasets.Value("string"),
+                        "turn_id": datasets.Value("int32"),
+                    }),
+                    "2": datasets.features.Sequence({
+                        "span_start": datasets.Value("int32"),
+                        "span_end": datasets.Value("int32"),
+                        "span_text": datasets.Value("string"),
+                        "input_text": datasets.Value("string"),
+                        "turn_id": datasets.Value("int32"),
+                    }),
+                }
+            })
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+    def _split_generators(self, dl_manager):
+        urls = {"train": _URLS["train"], "validation": _URLS["validation"]}
+        data_dirs = dl_manager.download_and_extract(urls)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": data_dirs["train"],
+                    "split": datasets.Split.TRAIN,
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": data_dirs["validation"],
+                    "split": datasets.Split.VALIDATION,
+                },
+            ),
+        ]
+    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+    def _generate_examples(self, filepath, split):
+        with open(filepath, encoding="utf-8") as f:
+            data = json.load(f)
+            for row in data["data"]:
+                id = row["id"]
+                source = row["source"]
+                story = row["story"]
+                questions = [
+                    {
+                        "input_text": q["input_text"],
+                        "turn_id": q["turn_id"]
+                    }
+                    for q in row["questions"]
+                ]
+                answers = [
+                    {
+                        "span_start": a["span_start"],
+                        "span_end": a["span_end"],
+                        "span_text": a["span_text"],
+                        "input_text": a["input_text"],
+                        "turn_id": a["turn_id"]
+                    }
+                    for a in row["answers"]
+                ]
+                if split == datasets.Split.TRAIN:
+                    additional_answers = _EMPTY_ADDITIONAL_ANSWER
+                else:
+                    additional_answers = {
+                        "0": [
+                            {
+                                "span_start": a0["span_start"],
+                                "span_end": a0["span_end"],
+                                "span_text": a0["span_text"],
+                                "input_text": a0["input_text"],
+                                "turn_id": a0["turn_id"]
+                            }
+                            for a0 in row["additional_answers"]["0"]
+                        ],
+                        "1": [
+                            {
+                                "span_start": a1["span_start"],
+                                "span_end": a1["span_end"],
+                                "span_text": a1["span_text"],
+                                "input_text": a1["input_text"],
+                                "turn_id": a1["turn_id"]
+                            }
+                            for a1 in row["additional_answers"]["1"]
+                        ],
+                        "2": [
+                            {
+                                "span_start": a2["span_start"],
+                                "span_end": a2["span_end"],
+                                "span_text": a2["span_text"],
+                                "input_text": a2["input_text"],
+                                "turn_id": a2["turn_id"]
+                            }
+                            for a2 in row["additional_answers"]["2"]
+                        ],
+                    }
+                yield row["id"], {
+                    "id": id,
+                    "story": story,
+                    "source": source,
+                    "questions": questions,
+                    "answers": answers,
+                    "additional_answers": additional_answers
+                }
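Reviewer note: the train/validation branch for `additional_answers` above boils down to "pad on train, copy on validation". A minimal standard-library sketch follows. The names `PLACEHOLDER`, `ANSWER_KEYS`, and `build_additional_answers` are hypothetical, as is the sample row. Unlike the loader, which reuses one module-level placeholder dict, this sketch returns a fresh copy per row; that is a choice made for the sketch, not something this commit does.

```python
import copy

ANSWER_KEYS = ("span_start", "span_end", "span_text", "input_text", "turn_id")
ANNOTATORS = ("0", "1", "2")

# Same shape as one entry of `_EMPTY_ADDITIONAL_ANSWER` in the loader.
PLACEHOLDER = {
    "span_start": -1,
    "span_end": -1,
    "span_text": "",
    "input_text": "",
    "turn_id": -1,
}


def build_additional_answers(row, is_train):
    """Return the three annotator lists, padded with placeholders on train."""
    if is_train:
        # The train split has no extra annotations, so fill each slot
        # with a single placeholder answer of the same form.
        return {a: [copy.deepcopy(PLACEHOLDER)] for a in ANNOTATORS}
    return {
        a: [{k: ans[k] for k in ANSWER_KEYS} for ans in row["additional_answers"][a]]
        for a in ANNOTATORS
    }


# Hypothetical validation row with one answer per annotator.
answer = {"span_start": 0, "span_end": 5, "span_text": "white",
          "input_text": "white", "turn_id": 1}
val_row = {"additional_answers": {a: [dict(answer)] for a in ANNOTATORS}}

train_extra = build_additional_answers({}, is_train=True)
val_extra = build_additional_answers(val_row, is_train=False)
print(train_extra["0"], val_extra["2"])
```

Keeping the placeholder in the same shape as a real answer is what lets both splits share one `features` schema.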
--- a/lm_eval/datasets/coqa/dataset_infos.json
+++ b/lm_eval/datasets/coqa/dataset_infos.json
+{"coqa": {"description": "CoQA is a large-scale dataset for building Conversational Question Answering\nsystems. The goal of the CoQA challenge is to measure the ability of machines to\nunderstand a text passage and answer a series of interconnected questions that\nappear in a conversation.\n", "citation": "@misc{reddy2018coqa,\n    title={CoQA: A Conversational Question Answering Challenge},\n    author={Siva Reddy and Danqi Chen and Christopher D. Manning},\n    year={2018},\n    eprint={1808.07042},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n", "homepage": "https://stanfordnlp.github.io/coqa/", "license": "", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}, "story": {"dtype": "string", "id": null, "_type": "Value"}, "questions": {"feature": {"input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "answers": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "additional_answers": {"0": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "1": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", 
"id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "2": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "coqa", "config_name": "coqa", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 26250528, "num_examples": 7199, "dataset_name": "coqa"}, "validation": {"name": "validation", "num_bytes": 3765933, "num_examples": 500, "dataset_name": "coqa"}}, "download_checksums": {"https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json": {"num_bytes": 49001836, "checksum": "b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6"}, "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json": {"num_bytes": 9090845, "checksum": "dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a"}}, "download_size": 58092681, "post_processing_size": null, "dataset_size": 30016461, "size_in_bytes": 88109142}}
\ No newline at end of file
--- a/lm_eval/datasets/drop/__init__.py
+++ b/lm_eval/datasets/drop/__init__.py
--- a/lm_eval/datasets/drop/dataset_infos.json
+++ b/lm_eval/datasets/drop/dataset_infos.json
+{"drop": {"description": "DROP is a QA dataset which tests comprehensive understanding of paragraphs. In \nthis crowdsourced, adversarially-created, 96k question-answering benchmark, a \nsystem must resolve multiple references in a question, map them onto a paragraph,\nand perform discrete operations over them (such as addition, counting, or sorting).\n", "citation": "@misc{dua2019drop,\n    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs}, \n    author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},\n    year={2019},\n    eprint={1903.00161},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n", "homepage": "https://allenai.org/data/drop", "license": "", "features": {"section_id": {"dtype": "string", "id": null, "_type": "Value"}, "passage": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "query_id": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "validated_answers": {"feature": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, 
"_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "drop", "config_name": "drop", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 108858121, "num_examples": 77409, "dataset_name": "drop"}, "validation": {"name": "validation", "num_bytes": 12560739, "num_examples": 9536, "dataset_name": "drop"}}, "download_checksums": {"https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip": {"num_bytes": 8308692, "checksum": "39d2278a29fd729de301b111a45f434c24834f40df8f4ff116d864589e3249d6"}}, "download_size": 8308692, "post_processing_size": null, "dataset_size": 121418860, "size_in_bytes": 129727552}}
\ No newline at end of file
--- a/lm_eval/datasets/drop/drop.py
+++ b/lm_eval/datasets/drop/drop.py
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Custom DROP dataset that, unlike HF, keeps all question-answer pairs
+# even if there are multiple types of answers for the same question.
+"""DROP dataset."""
+import json
+import os
+import datasets
+_CITATION = """\
+@misc{dua2019drop,
+    title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs}, 
+    author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
+    year={2019},
+    eprint={1903.00161},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+"""
+_DESCRIPTION = """\
+DROP is a QA dataset which tests comprehensive understanding of paragraphs. In 
+this crowdsourced, adversarially-created, 96k question-answering benchmark, a 
+system must resolve multiple references in a question, map them onto a paragraph,
+and perform discrete operations over them (such as addition, counting, or sorting).
+"""
+_HOMEPAGE = "https://allenai.org/data/drop"
+# TODO: Add the licence for the dataset here if you can find it
+_LICENSE = ""
+_URLS = {
+    "drop": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip",
+}
+_EMPTY_VALIDATED_ANSWER = [{
+    "number": "",
+    "date": {
+        "day": "",
+        "month": "",
+        "year": "",
+    },
+    "spans": [],
+    "worker_id": "",
+    "hit_id": ""
+}]
+class Drop(datasets.GeneratorBasedBuilder):
+    """DROP is a QA dataset which tests comprehensive understanding of paragraphs."""
+    VERSION = datasets.Version("0.0.1")
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name="drop", version=VERSION,
+                               description="The DROP dataset."),
+    ]
+    def _info(self):
+        features = datasets.Features({
+            "section_id": datasets.Value("string"),
+            "passage": datasets.Value("string"),
+            "question": datasets.Value("string"),
+            "query_id": datasets.Value("string"),
+            "answer": {
+                "number": datasets.Value("string"),
+                "date": {
+                    "day": datasets.Value("string"),
+                    "month": datasets.Value("string"),
+                    "year": datasets.Value("string"),
+                },
+                "spans": datasets.features.Sequence(datasets.Value("string")),
+                "worker_id": datasets.Value("string"),
+                "hit_id": datasets.Value("string"),
+            },
+            "validated_answers": datasets.features.Sequence({
+                "number": datasets.Value("string"),
+                "date": {
+                    "day": datasets.Value("string"),
+                    "month": datasets.Value("string"),
+                    "year": datasets.Value("string"),
+                },
+                "spans": datasets.features.Sequence(datasets.Value("string")),
+                "worker_id": datasets.Value("string"),
+                "hit_id": datasets.Value("string"),
+            }),
+        })
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+    def _split_generators(self, dl_manager):
+        urls = _URLS[self.config.name]
+        data_dir = dl_manager.download_and_extract(urls)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "drop_dataset", "drop_dataset_train.json"),
+                    "split": "train",
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "drop_dataset", "drop_dataset_dev.json"),
+                    "split": "validation",
+                },
+            ),
+        ]
+    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+    def _generate_examples(self, filepath, split):
+        with open(filepath, encoding="utf-8") as f:
+            data = json.load(f)
+            key = 0
+            for section_id, example in data.items():
+                # Each example (passage) has multiple sub-question-answer pairs.
+                for qa in example["qa_pairs"]:
+                    # Build answer.
+                    answer = qa["answer"]
+                    answer = {
+                        "number": answer["number"],
+                        "date": {
+                            "day": answer["date"].get("day", ""),
+                            "month": answer["date"].get("month", ""),
+                            "year": answer["date"].get("year", ""),
+                        },
+                        "spans": answer["spans"],
+                        "worker_id": answer.get("worker_id", ""),
+                        "hit_id": answer.get("hit_id", ""),
+                    }
+                    validated_answers = []
+                    if "validated_answers" in qa:
+                        for validated_answer in qa["validated_answers"]:
+                            va = {
+                                "number": validated_answer.get("number", ""),
+                                "date": {
+                                    "day": validated_answer["date"].get("day", ""),
+                                    "month": validated_answer["date"].get("month", ""),
+                                    "year": validated_answer["date"].get("year", ""),
+                                },
+                                "spans": validated_answer.get("spans", []),
+                                "worker_id": validated_answer.get("worker_id", ""),
+                                "hit_id": validated_answer.get("hit_id", ""),
+                            }
+                            validated_answers.append(va)
+                    else:
+                        validated_answers = _EMPTY_VALIDATED_ANSWER
+                    yield key, {
+                        "section_id": section_id,
+                        "passage": example["passage"],
+                        "question": qa["question"],
+                        "query_id": qa["query_id"],
+                        "answer": answer,
+                        "validated_answers": validated_answers,
+                    }
+                    key += 1
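Reviewer note: the nested loop in `_generate_examples` above flattens passages into one record per question-answer pair, with a running integer key. The sketch below shows only that flattening step, using the standard library. The `flatten_drop` helper and the sample data (section ids, passages, questions) are hypothetical, and the answer fields are omitted for brevity.

```python
def flatten_drop(data):
    """Yield (key, record) pairs, one per QA pair across all passages."""
    key = 0
    for section_id, example in data.items():
        # Each passage carries several question-answer pairs.
        for qa in example["qa_pairs"]:
            yield key, {
                "section_id": section_id,
                "passage": example["passage"],
                "question": qa["question"],
                "query_id": qa["query_id"],
            }
            key += 1  # keys stay unique across passages


# Hypothetical input in the passage -> qa_pairs layout the loader reads.
sample = {
    "section_a": {
        "passage": "The home team scored 3 touchdowns and 2 field goals.",
        "qa_pairs": [
            {"question": "How many touchdowns were scored?", "query_id": "q1"},
            {"question": "How many field goals were scored?", "query_id": "q2"},
        ],
    },
    "section_b": {
        "passage": "The treaty was signed in 1648.",
        "qa_pairs": [
            {"question": "In what year was the treaty signed?", "query_id": "q3"},
        ],
    },
}

records = list(flatten_drop(sample))
print([key for key, _ in records])
```

A single counter is used instead of `enumerate` on the inner loop because the key has to keep increasing from one passage to the next.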