Commit 11f614b0 authored by Stella Biderman, committed by GitHub

Merge branch 'master' into task_doc

parents 0a6a9b7e e00d682f
...@@ -32,7 +32,7 @@ jobs: ...@@ -32,7 +32,7 @@ jobs:
run: | run: |
python -m pip install --upgrade pip python -m pip install --upgrade pip
pip install flake8 pytest pytest-cov pip install flake8 pytest pytest-cov
pip install -e . pip install -e .[dev]
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8 - name: Lint with flake8
run: | run: |
......
* EleutherAI/pm-pile * @jon-tow @leogao2 @StellaAthena
...@@ -26,7 +26,7 @@ To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you ca ...@@ -26,7 +26,7 @@ To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you ca
```bash ```bash
python main.py \ python main.py \
--model gpt2 \ --model gpt2 \
--device cuda:0 \ --device 0 \
--tasks lambada,hellaswag --tasks lambada,hellaswag
``` ```
(This uses gpt2-117M by default as per HF defaults, use --model_args to specify other gpt2 sizes) (This uses gpt2-117M by default as per HF defaults, use --model_args to specify other gpt2 sizes)
...@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model ...@@ -37,7 +37,7 @@ Additional arguments can be provided to the model constructor using the `--model
python main.py \ python main.py \
--model gpt2 \ --model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \ --model_args pretrained=EleutherAI/gpt-neo-2.7B \
--device cuda:0 \ --device 0 \
--tasks lambada,hellaswag --tasks lambada,hellaswag
``` ```
...@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model ...@@ -375,7 +375,7 @@ Additional arguments can be provided to the model constructor using the `--model
python main.py \ python main.py \
--model gpt2 \ --model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \ --model_args pretrained=EleutherAI/gpt-neo-1.3B \
--device cuda:0 \ --device 0 \
--tasks lambada,hellaswag \ --tasks lambada,hellaswag \
--num_fewshot 2 --num_fewshot 2
``` ```
...@@ -392,6 +392,21 @@ python write_out.py \ ...@@ -392,6 +392,21 @@ python write_out.py \
This will write out one text file for each task. This will write out one text file for each task.
### Test Set Decontamination
For more details see the [decontamination guide](./docs/decontamination.md).
The directory provided with the "--decontamination_ngrams_path" argument should contain
the ngram files and info.json. The guide above covers ngram generation for the Pile; the same procedure could be adapted for other training sets.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
### Code Structure ### Code Structure
There are two major components of the library: There are two major components of the library:
......
# Decontamination
## Usage
Simply add the "--decontamination_ngrams_path" argument when running main.py. The provided directory should contain
the ngram files and info.json produced in "Pile Ngram Generation" further down.
```bash
python main.py \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
```
## Background
Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set (leakage/contamination).
The first line of defense is filtering the training set; however, benchmarks often don't exist yet or weren't considered before model training. In that case it is useful to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on dataset, while we just used 13 for simplicity.
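The overlap criterion can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and no text normalization, which may differ from what the harness actually does; `word_ngrams` and `is_contaminated` are hypothetical names used only for illustration.

```python
def word_ngrams(text, n):
    """Return the set of all word-level n-grams in `text`."""
    tokens = text.split()  # assumption: whitespace tokenization
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_doc, training_docs, n=13):
    """True if any n-gram of `test_doc` also occurs in any training document."""
    test_grams = word_ngrams(test_doc, n)
    return any(test_grams & word_ngrams(doc, n) for doc in training_docs)
```

A document shorter than `n` tokens yields no n-grams, so under this sketch it can never be flagged.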
## Implementation
Contamination detection can be found in "lm_eval/decontaminate.py" with supporting code in "lm_eval/decontamination/".
decontaminate.py does the following:
1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
2. Scan through sorted files containing training set n-grams.
3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
"lm_eval/evaluator.py" can then produce a clean version of the benchmark by excluding the results of contaminated documents. For each metric, a clean version will be shown in the results with a "decontaminate" suffix.
This is disabled by default for new tasks. To support decontamination on a task, override the "should_decontaminate" and "doc_to_decontamination_query" methods. For more details see the [task guide](task_guide.md).
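The three steps listed for decontaminate.py, plus the clean-metric computation, might look roughly like the sketch below. It is not the code in "lm_eval/decontaminate.py": the function names, the `(task, doc_id)` keys, and the in-memory list of training n-grams are all assumptions made to keep the example small.

```python
from collections import defaultdict


def build_ngram_index(eval_docs, n):
    """Step 1: map each n-gram to the (task, doc_id) pairs that contain it."""
    index = defaultdict(set)
    for (task, doc_id), text in eval_docs.items():
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index[" ".join(tokens[i:i + n])].add((task, doc_id))
    return index


def find_contaminated(index, training_ngrams):
    """Steps 2 and 3: scan training n-grams and mark every evaluation doc hit."""
    contaminated = set()
    for gram in training_ngrams:
        contaminated |= index.get(gram, set())
    return contaminated


def clean_mean(scores, contaminated):
    """Average a per-document metric over uncontaminated documents only."""
    kept = [s for key, s in scores.items() if key not in contaminated]
    return sum(kept) / len(kept) if kept else float("nan")
```

In this sketch the training n-grams are any iterable of strings; the real pipeline streams them from the sorted bucket files instead of holding them in memory.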
## Pile Ngram Generation
The relevant scripts can be found in scripts/clean_training_data, which also import from
"lm_eval/decontamination/".
1. git clone https://github.com/EleutherAI/lm-evaluation-harness.git
2. pip install -r requirements.txt
3. Download The Pile from [The Eye](https://the-eye.eu/public/AI/pile/train/)
4. Place pile files in "pile" directory under "lm-evaluation-harness" (or create a symlink)
5. Run generate_13_grams.
```bash
export PYTHONHASHSEED=0
python -m scripts/clean_training_data/generate_13_grams \
-dir path/to/working/directory \
-n 13 \
-buckets 500
```
This took us approximately 4 days. We had the time to wait, but this could be scaled out by running partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed PYTHONHASHSEED to ensure reproducibility of bucket hashing.
6. Sort the generated 13-grams.
```bash
python -m scripts/clean_training_data/sort_13_gram_buckets \
-dir path/to/working/directory/output
```
This took us approximately 5 days. You could speed this up by distributing the files across different machines, running the sort script on each, and then gathering the results together.
7. Compress the sorted 13-gram files and place them together with info.json.
This step only takes a few hours.
```bash
python -m scripts/clean_training_data/compress_and_package \
-dir path/to/working/directory \
-output path/to/final/directory \
-procs 8
```
Congratulations, the final directory can now be passed to lm-evaluation-harness with the "--decontamination_ngrams_path" argument.
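Step 5 fixes PYTHONHASHSEED because Python's built-in `hash()` for strings is randomized per process unless that variable is set, so bucket assignments would otherwise differ between runs. As an illustration of the same idea, the sketch below uses `hashlib` instead of the built-in hash, which is stable without any environment variable. This is a swapped-in technique for illustration, not what the scripts do, and `bucket_for` is a hypothetical helper.

```python
import hashlib


def bucket_for(ngram, num_buckets=500):
    """Assign an n-gram to a bucket using a hash that is stable across runs."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a bucket id.
    return int.from_bytes(digest[:8], "big") % num_buckets
```

Because the same n-gram always lands in the same bucket, partial scans run on separate machines can be merged bucket by bucket.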
...@@ -11,19 +11,28 @@ If you haven't already, go ahead and fork the main repo, clone it, create a bran ...@@ -11,19 +11,28 @@ If you haven't already, go ahead and fork the main repo, clone it, create a bran
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness cd lm-evaluation-harness
git checkout -b <task-name> git checkout -b <task-name>
pip install -r requirements.txt pip install -e ".[dev]"
``` ```
## Creating Your Task File ## Creating Your Task File
The first step in creating a task is to create a Python file in `lm_eval/tasks/` with the task's name: From the `lm-evaluation-harness` project root, copy over the `new_task.py` template to `lm_eval/tasks`.
```sh ```sh
cd lm_eval/tasks cp templates/new_task.py lm_eval/tasks/<task-name>.py
touch <task-name>.py
``` ```
Then open the file and create a multiline docstring on the first line with the following contents: or if your task is **multiple-choice**, the `new_multiple_choice_task.py`:
```sh
cp templates/new_multiple_choice_task.py lm_eval/tasks/<task-name>.py
```
This will set you up with a few `TODO`s to fill in, which we'll now go over in detail.
## Task Heading
Open the file you've just created and add a multiline docstring on the first line with the following contents:
```python ```python
""" """
...@@ -62,102 +71,92 @@ Now let's walk through the actual implementation - from data handling to evaluat ...@@ -62,102 +71,92 @@ Now let's walk through the actual implementation - from data handling to evaluat
### Downloading your Data ### Downloading your Data
There are 2 standard approaches we follow for downloading data: All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
Now that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:
1. Firstly, you should always check to see if your task's dataset is already provided by HuggingFace (__HF__); check their `datasets` catalog [here](https://huggingface.co/datasets). Is it in there? If yes, continue reading here, else go to 2. In the case that it’s there, things are a bit easier. You can inherit from the `HFTask` class as so: ```python
class TaskName(...):
```python DATASET_PATH = "..."
from . common import HFTask DATASET_NAME = "..."
```
class TaskName(HFTask):
DATASET_PATH = "..."
DATASET_NAME = "..."
```
where `DATASET_PATH` is the name of the benchmark/task dataset as listed by HF and `DATASET_NAME` is the name of, what HF calls, a “data instance” of the benchmark. If your task is not a benchmark containing any data instances just set `DATASET_NAME = None`.
2. Your task's dataset is not in HF's catalog, so you'll have to override a few abstract methods of the `Task` base class. First let's define our benchmark/task and inherit from `Task`.
```python
from lm_eval.base import Task
from pathlib import Path
class TaskName(Task):
DATASET_PATH = Path("data/<task-name>")
```
where `DATASET_PATH` is the local directory we'll download into.
Now we need to override the following methods:
```python where `DATASET_PATH` is the name of the dataset as listed by HF in the `datasets` Hub and `DATASET_NAME` is the name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just set `DATASET_NAME = None`.
def download(self): (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
```
This should download the dataset into the relative path specified by `DATASET_PATH`. The preferred approach is to use EleutherAI's [best-download](https://github.com/EleutherAI/best-download) package which provides a `download_file` function that lets you validate complete data transmission through a checksum argument. The overall logic should be something like: If the `DATASET_PATH` already exists then don’t download anything and exit the method, otherwise create the `DATASET_PATH` directory and actually download into it. See this [task](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/logiqa.py#L9-L21) for an example.
Next up, we have to set some “flags”: Next up, we have to set some “flags”:
```python ```python
def has_training_docs(self): def has_training_docs(self):
return # True/False return # True/False
def has_validation_docs(self): def has_validation_docs(self):
return # True/False return # True/False
def has_test_docs(self): def has_test_docs(self):
return # True/False return # True/False
``` ```
These methods return `True`/`False` whether or not your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`: These methods return `True`/`False` whether or not your task dataset provides documents for each split type. __Note__: if the test set does not have publicly available answer labels, please do not put it down as having a test set - return False.
```python
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self): def training_docs(self):
return #... return #...
def validation_docs(self): def validation_docs(self):
return #... return #...
def test_docs(self): def test_docs(self):
return #... return #...
``` ```
These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples. Note that this will get used in a few places down the track, including for both creating examples to be processed by a model and also for assessing the results. There's nothing stopping you getting these functions be generators, which can be handy if each source document in your dataset contains a number of questions/prompts to handle. The overarching idea is that what comes out of this function should definitively contain an input to be presented to the model, the associated ground truth and anything else you think is needed to judge the quality of the output the model provides. __NOTE__: If your task doesn't have a train/validation/test set, remember to raise a `NotImplementedError` for that specific split.
### Formatting your Few-Shot Examples These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples.
The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples. #### Processing Documents
<br> At this point, you can also process each individual document to, for example, strip whitespace or "detokenize" its fields. Put the processing logic into `_process_doc` and map the functions across training/validation/test docs inside of the respective functions.
🔠 If your task is **multiple-choice**, we require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/6caa0afd96a7a7efb2ec4c1f24ad1756e48f3aa7/lm_eval/tasks/sat.py#L60) for an example. 🔠
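As a rough illustration of document processing for a multiple-choice task, the sketch below maps a `_process_doc` function over a validation split and produces the `choices` and `gold` fields. `ExampleTask` is a stand-in class rather than a real harness task, and the raw field names (`question`, `options`, `answer`) and the `query` key are assumptions about a hypothetical dataset.

```python
class ExampleTask:
    """Stand-in for a harness task; only the document handling is shown."""

    def __init__(self, dataset):
        # `dataset` mimics `self.dataset`: a dict of split name -> raw docs.
        self.dataset = dataset

    def _process_doc(self, doc):
        # Strip stray whitespace and expose the `choices`/`gold` fields.
        choices = [c.strip() for c in doc["options"]]
        return {
            "query": doc["question"].strip(),
            "choices": choices,
            "gold": choices.index(doc["answer"].strip()),
        }

    def validation_docs(self):
        # Map the processing function across the split, as described above.
        return map(self._process_doc, self.dataset["validation"])

    def doc_to_target(self, doc):
        # Mirrors the `MultipleChoiceTask` target: a space plus the gold choice.
        return " " + doc["choices"][doc["gold"]]
```

Here `gold` is derived by looking up the answer string among the cleaned choices; a dataset that already stores an integer label could use it directly.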
⚠️ **Multiple-Choice Formatting** ### Formatting your Few-Shot Examples
If your task is **multiple-choice**, just inherit from the `MultipleChoiceTask` class we provide. The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples.
```python Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.
from lm_eval.base import MultipleChoiceTask
class TaskName(..., MultipleChoiceTask): ```python
def doc_to_text(self, doc):
return ""
``` ```
This will require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion. <br>
See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/105fa9741ff660f6a62c2eef0d2facfde36dda41/lm_eval/tasks/sat.py#L56) for an example. When used in combination with `HFTask`, it may be useful to override [`_convert_standard`](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/common.py#L28), which will be applied to every document in the HF dataset. See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/headqa.py) for an example of this. ️🔠 **Multiple-Choice Formatting**
You can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>. If your task is multiple-choice, you can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>.
⚠️ **End Multiple-Choice Formatting** ️️🔠 **End Multiple-Choice Formatting**
<br> <br>
In the case your task is _not_ multiple-choice, override the following methods for your task class: Format the target answer from the contents of `doc`. Note that the prepended `" "` is required to space out the `doc_to_text` and `doc_to_target` strings.
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.
```python ```python
def doc_to_text(self, doc): def doc_to_target(self, doc):
return "" target = ""
return " " + target
``` ```
Put the target answer of the prompt here, in the form: `" " + <answer>`. Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
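To see why the leading space in the target matters, here is a minimal sketch of how labeled examples might be assembled in the k-shot setting. The prompt template, the `"\n\n"` separator between examples, and the `build_context` helper are assumptions for illustration and not necessarily what the harness does internally.

```python
def doc_to_text(doc):
    # Query prompt without the answer.
    return "Question: " + doc["question"] + "\nAnswer:"


def doc_to_target(doc):
    # The prepended space separates the prompt from the answer.
    return " " + doc["answer"]


def build_context(fewshot_docs, doc):
    """Join labeled examples, then append the unlabeled query."""
    labeled = "".join(
        doc_to_text(d) + doc_to_target(d) + "\n\n" for d in fewshot_docs
    )
    return labeled + doc_to_text(doc)
```

Without the space in `doc_to_target`, each labeled example would read `Answer:Paris`, so the few-shot examples would not match the format the model is expected to continue.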
```python ### Decontamination
def doc_to_target(self, doc): For background on decontamination please see [this](./decontamination.md).
return ""
``` If you wish to support decontamination studies for your task, simply override the "should_decontaminate" method and return `True`.
Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍. You also need to override "doc_to_decontamination_query" and return the data you wish to compare against the training set. This doesn't necessarily need to be the full document or request, and we leave this up to the implementor. For a multiple-choice evaluation you could, for example, just return the question.
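A minimal sketch of the two overrides for a multiple-choice task could look like the following. The class is a stand-in that does not inherit from the harness base classes, and the `question` field is an assumed document key.

```python
class ExampleMultipleChoiceTask:
    """Stand-in for a harness task; only the decontamination hooks are shown."""

    def should_decontaminate(self):
        # Opt in to decontamination for this task.
        return True

    def doc_to_decontamination_query(self, doc):
        # Compare only the question text against the training set n-grams.
        return doc["question"]
```

Returning only the question keeps the answer choices out of the comparison, which is one reasonable option among several; the choice is left to the implementor.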
### Registering Your Task ### Registering Your Task
......
...@@ -6,6 +6,7 @@ import re ...@@ -6,6 +6,7 @@ import re
import os import os
import json import json
import hashlib import hashlib
import datasets
from sqlitedict import SqliteDict from sqlitedict import SqliteDict
from tqdm import tqdm from tqdm import tqdm
import torch import torch
...@@ -346,14 +347,77 @@ class Task(abc.ABC): ...@@ -346,14 +347,77 @@ class Task(abc.ABC):
{"question": ..., "answer": ...} or {"question": ..., "answer": ...} or
{"question": ..., question, answer) {"question": ..., question, answer)
""" """
def __init__(self):
self.download() # The name of the `Task` benchmark as denoted in the HuggingFace datasets Hub
# or a path to a custom `datasets` loading script.
DATASET_PATH: str = None
# The name of a subset within `DATASET_PATH`.
DATASET_NAME: str = None
def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
"""
:param data_dir: str
Stores the path to a local folder containing the `Task`'s data files.
Use this to specify the path to manually downloaded data (usually when
the dataset is not publicly accessible).
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self.download(data_dir, cache_dir, download_mode)
self._training_docs = None self._training_docs = None
self._fewshot_docs = None self._fewshot_docs = None
def download(self): def download(self, data_dir=None, cache_dir=None, download_mode=None):
"""Downloads the task dataset if necessary""" """ Downloads and returns the task dataset.
pass Override this method to download the dataset from a custom API.
:param data_dir: str
Stores the path to a local folder containing the `Task`'s data files.
Use this to specify the path to manually downloaded data (usually when
the dataset is not publicly accessible).
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
cache_dir=cache_dir,
download_mode=download_mode
)
def should_decontaminate(self):
"""Whether this task supports decontamination against model training set."""
return False
@abstractmethod @abstractmethod
def has_training_docs(self): def has_training_docs(self):
...@@ -391,12 +455,27 @@ class Task(abc.ABC): ...@@ -391,12 +455,27 @@ class Task(abc.ABC):
""" """
return [] return []
def _process_doc(self, doc):
"""
Override this to process (detokenize, strip, replace, etc.) individual
documents. This can be used in a map over documents of a data split.
E.g. `map(self._process_doc, self.dataset["validation"])`
:return: dict
The processed version of the specified `doc`.
"""
return doc
def fewshot_examples(self, k, rnd): def fewshot_examples(self, k, rnd):
if self._training_docs is None: if self._training_docs is None:
self._training_docs = list(self.training_docs()) self._training_docs = list(self.training_docs())
return rnd.sample(self._training_docs, k) return rnd.sample(self._training_docs, k)
def doc_to_decontamination_query(self, doc):
print("Override doc_to_decontamination_query with document specific decontamination query.")
assert(False)
@abstractmethod @abstractmethod
def doc_to_text(self, doc): def doc_to_text(self, doc):
pass pass
...@@ -514,7 +593,8 @@ class Task(abc.ABC): ...@@ -514,7 +593,8 @@ class Task(abc.ABC):
return description + labeled_examples + example return description + labeled_examples + example
class MultipleChoiceTask(Task, abc.ABC): class MultipleChoiceTask(Task):
def doc_to_target(self, doc): def doc_to_target(self, doc):
return " " + doc['choices'][doc['gold']] return " " + doc['choices'][doc['gold']]
...@@ -553,6 +633,10 @@ class MultipleChoiceTask(Task, abc.ABC): ...@@ -553,6 +633,10 @@ class MultipleChoiceTask(Task, abc.ABC):
class PerplexityTask(Task, abc.ABC): class PerplexityTask(Task, abc.ABC):
def should_decontaminate(self):
"""Whether this task supports decontamination against model training set."""
return True
def has_training_docs(self): def has_training_docs(self):
return False return False
...@@ -561,11 +645,11 @@ class PerplexityTask(Task, abc.ABC): ...@@ -561,11 +645,11 @@ class PerplexityTask(Task, abc.ABC):
return [] return []
def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
assert num_fewshot == 0 assert num_fewshot == 0, "The number of fewshot examples must be 0 for perplexity tasks."
assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`."
assert not provide_description, ( assert not provide_description, (
"The `provide_description` arg will be removed in future versions. To prepend " "The `provide_description` arg will be removed in future versions. To prepend "
"a custom description to the context, supply the corresponding string via the " "a custom description to the context, supply the corresponding string via the "
"`description` arg." "`description` arg."
) )
if provide_description is not None: if provide_description is not None:
...@@ -581,6 +665,9 @@ class PerplexityTask(Task, abc.ABC): ...@@ -581,6 +665,9 @@ class PerplexityTask(Task, abc.ABC):
"bits_per_byte": False, "bits_per_byte": False,
} }
def doc_to_decontamination_query(self, doc):
return doc
def doc_to_text(self, doc): def doc_to_text(self, doc):
return "" return ""
......
# datasets
This directory contains custom EleutherAI datasets not available in the HuggingFace `datasets` hub.
In the rare case that you need to add a custom dataset to this collection, follow the
HuggingFace `datasets` guide found [here](https://huggingface.co/docs/datasets/dataset_script).
\ No newline at end of file
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""GPT-3 Arithmetic Test Dataset."""
import json
import datasets
_CITATION = """\
@inproceedings{NEURIPS2020_1457c0d6,
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {1877--1901},
publisher = {Curran Associates, Inc.},
title = {Language Models are Few-Shot Learners},
url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
volume = {33},
year = {2020}
}
"""
_DESCRIPTION = """\
A small battery of 10 tests that involve asking language models a simple arithmetic
problem in natural language.
"""
_HOMEPAGE = "https://github.com/openai/gpt-3/tree/master/data"
# TODO: Add the licence for the dataset here if you can find it
_LICENSE = ""
class ArithmeticConfig(datasets.BuilderConfig):
"""BuilderConfig for GPT3 Arithmetic Test Dataset."""
def __init__(self, url, features, **kwargs):
"""BuilderConfig for GPT3 Arithmetic dataset.
Args:
url: *string*, the url to the specific subset of the GPT3 Arithmetic dataset.
features: *list[string]*, list of the features that will appear in the
feature dict.
"""
# Version history:
super().__init__(version=datasets.Version("0.0.1"), **kwargs)
self.url = url
self.features = features
class Arithmetic(datasets.GeneratorBasedBuilder):
"""A small battery of 10 tests involving simple arithmetic problems."""
BUILDER_CONFIGS = [
ArithmeticConfig(
name="arithmetic_2da",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_addition.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="2-digit addition",
),
ArithmeticConfig(
name="arithmetic_2ds",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_subtraction.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="2-digit subtraction",
),
ArithmeticConfig(
name="arithmetic_3da",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/three_digit_addition.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="3-digit addition",
),
ArithmeticConfig(
name="arithmetic_3ds",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/three_digit_subtraction.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="3-digit subtraction",
),
ArithmeticConfig(
name="arithmetic_4da",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/four_digit_addition.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="4-digit addition",
),
ArithmeticConfig(
name="arithmetic_4ds",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/four_digit_subtraction.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="4-digit subtraction",
),
ArithmeticConfig(
name="arithmetic_5da",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/five_digit_addition.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="5-digit addition",
),
ArithmeticConfig(
name="arithmetic_5ds",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/five_digit_subtraction.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="5-digit subtraction",
),
ArithmeticConfig(
name="arithmetic_2dm",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/two_digit_multiplication.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="2-digit multiplication",
),
ArithmeticConfig(
name="arithmetic_1dc",
url="https://raw.githubusercontent.com/openai/gpt-3/master/data/single_digit_three_ops.jsonl",
features=datasets.Features({"context": datasets.Value("string"), "completion": datasets.Value("string")}),
description="Single digit 3 operations",
),
]
def _info(self):
return datasets.DatasetInfo(
description=f"{_DESCRIPTION}\n{self.config.description}",
features=self.config.features,
homepage=_HOMEPAGE,
license=_LICENSE,
citation=_CITATION,
)
def _split_generators(self, dl_manager):
urls = self.config.url
data_dir = dl_manager.download_and_extract(urls)
return [
datasets.SplitGenerator(
name=datasets.Split.VALIDATION,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": data_dir,
"split": datasets.Split.VALIDATION,
},
),
]
# method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
def _generate_examples(self, filepath, split):
with open(filepath, encoding="utf-8") as f:
for key, row in enumerate(f):
data = json.loads(row)
context = data['context'].strip() \
.replace('\n\n', '\n') \
.replace('Q:', 'Question:') \
.replace('A:', 'Answer:')
completion = data['completion']
yield key, {'context': context, 'completion': completion}
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ASDIV dataset."""
import os
import xml.etree.ElementTree as ET
import datasets
_CITATION = """\
@misc{miao2021diverse,
title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},
author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},
year={2021},
eprint={2106.15772},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
"""
_DESCRIPTION = """\
ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language
patterns and problem types) English math word problem (MWP) corpus for evaluating
the capability of various MWP solvers. Existing MWP corpora for studying AI progress
remain limited either in language usage patterns or in problem types. We thus present
a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem
types taught in elementary school. Each MWP is annotated with its problem type and grade
level (for indicating the level of difficulty).
"""
_HOMEPAGE = "https://github.com/chaochun/nlu-asdiv-dataset"
# TODO: Add the licence for the dataset here if you can find it
_LICENSE = ""
_URLS = "https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip"
class ASDiv(datasets.GeneratorBasedBuilder):
    """ ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers """

    VERSION = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="asdiv", version=VERSION,
                               description="A diverse corpus for evaluating and developing English math word problem solvers")
    ]

    def _info(self):
        features = datasets.Features(
            {
                "body": datasets.Value("string"),
                "question": datasets.Value("string"),
                "solution_type": datasets.Value("string"),
                "answer": datasets.Value("string"),
                "formula": datasets.Value("string"),
            }
        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        urls = _URLS
        data_dir = dl_manager.download_and_extract(urls)
        base_filepath = "nlu-asdiv-dataset-55790e5270bb91ccfa5053194b25732534696b50"
        return [
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(data_dir, base_filepath, "dataset", "ASDiv.xml"),
                    "split": datasets.Split.VALIDATION,
                },
            ),
        ]

    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
    def _generate_examples(self, filepath, split):
        tree = ET.parse(filepath)
        root = tree.getroot()
        for key, problem in enumerate(root.iter("Problem")):
            yield key, {
                "body": problem.find("Body").text,
                "question": problem.find("Question").text,
                "solution_type": problem.find("Solution-Type").text,
                "answer": problem.find("Answer").text,
                "formula": problem.find("Formula").text,
            }
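The XML traversal in the ASDiv `_generate_examples` can be exercised without downloading the archive. This is a rough sketch under stated assumptions: the inline `SAMPLE_XML` is a hypothetical fragment whose wrapper elements and attributes are guesses (only the `Problem`, `Body`, `Question`, `Solution-Type`, `Answer`, and `Formula` tag names come from the script), and `parse_problems` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real ASDiv.xml layout may differ outside these tags.
SAMPLE_XML = """\
<Corpus>
  <ProblemSet>
    <Problem ID="example-0001">
      <Body>Seven red apples and two green apples are in the basket.</Body>
      <Question>How many apples are in the basket?</Question>
      <Solution-Type>Addition</Solution-Type>
      <Answer>9 (apples)</Answer>
      <Formula>7+2=9</Formula>
    </Problem>
  </ProblemSet>
</Corpus>
"""


def parse_problems(xml_text):
    """Yield (key, example) pairs the same way the builder does."""
    root = ET.fromstring(xml_text)
    # `iter` walks the whole tree, so nesting depth of <Problem> does not matter.
    for key, problem in enumerate(root.iter("Problem")):
        yield key, {
            "body": problem.find("Body").text,
            "question": problem.find("Question").text,
            "solution_type": problem.find("Solution-Type").text,
            "answer": problem.find("Answer").text,
            "formula": problem.find("Formula").text,
        }


examples = list(parse_problems(SAMPLE_XML))
print(examples[0])
```

Note that `problem.find(...)` returns `None` when a child element is missing, in which case `.text` would raise `AttributeError`; the script relies on every problem carrying all five children.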
{"asdiv": {"description": "ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language\npatterns and problem types) English math word problem (MWP) corpus for evaluating\nthe capability of various MWP solvers. Existing MWP corpora for studying AI progress\nremain limited either in language usage patterns or in problem types. We thus present\na new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem\ntypes taught in elementary school. Each MWP is annotated with its problem type and grade\nlevel (for indicating the level of difficulty).\n", "citation": "@misc{miao2021diverse,\n title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},\n author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},\n year={2021},\n eprint={2106.15772},\n archivePrefix={arXiv},\n primaryClass={cs.AI}\n}\n", "homepage": "https://github.com/chaochun/nlu-asdiv-dataset", "license": "", "features": {"body": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "solution_type": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"dtype": "string", "id": null, "_type": "Value"}, "formula": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "as_div", "config_name": "asdiv", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"validation": {"name": "validation", "num_bytes": 501489, "num_examples": 2305, "dataset_name": "as_div"}}, "download_checksums": {"https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip": {"num_bytes": 440966, "checksum": "8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d"}}, "download_size": 440966, "post_processing_size": null, "dataset_size": 501489, "size_in_bytes": 942455}}
\ No newline at end of file
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""CoQA dataset.
This `CoQA` adds the "additional_answers" feature that's missing in the original
datasets version:
https://github.com/huggingface/datasets/blob/master/datasets/coqa/coqa.py
"""
import json
import datasets
_CITATION = """\
@misc{reddy2018coqa,
title={CoQA: A Conversational Question Answering Challenge},
author={Siva Reddy and Danqi Chen and Christopher D. Manning},
year={2018},
eprint={1808.07042},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
"""
_DESCRIPTION = """\
CoQA is a large-scale dataset for building Conversational Question Answering
systems. The goal of the CoQA challenge is to measure the ability of machines to
understand a text passage and answer a series of interconnected questions that
appear in a conversation.
"""
_HOMEPAGE = "https://stanfordnlp.github.io/coqa/"
# TODO: Add the licence for the dataset here if you can find it
_LICENSE = ""
_URLS = {
    "train": "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json",
    "validation": "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json",
}

# `additional_answers` are not available in the train set so we fill them with
# empty dicts of the same form.
_EMPTY_ADDITIONAL_ANSWER = {
    "0": [
        {
            "span_start": -1,
            "span_end": -1,
            "span_text": "",
            "input_text": "",
            "turn_id": -1
        }
    ],
    "1": [
        {
            "span_start": -1,
            "span_end": -1,
            "span_text": "",
            "input_text": "",
            "turn_id": -1
        }
    ],
    "2": [
        {
            "span_start": -1,
            "span_end": -1,
            "span_text": "",
            "input_text": "",
            "turn_id": -1
        }
    ],
}
class Coqa(datasets.GeneratorBasedBuilder):
    """CoQA is a large-scale dataset for building Conversational Question Answering systems."""

    VERSION = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="coqa", version=VERSION,
                               description="The CoQA dataset."),
    ]

    def _info(self):
        features = datasets.Features(
            {
                "id": datasets.Value("string"),
                "source": datasets.Value("string"),
                "story": datasets.Value("string"),
                "questions": datasets.features.Sequence({
                    "input_text": datasets.Value("string"),
                    "turn_id": datasets.Value("int32"),
                }),
                "answers": datasets.features.Sequence({
                    "span_start": datasets.Value("int32"),
                    "span_end": datasets.Value("int32"),
                    "span_text": datasets.Value("string"),
                    "input_text": datasets.Value("string"),
                    "turn_id": datasets.Value("int32"),
                }),
                "additional_answers": {
                    "0": datasets.features.Sequence({
                        "span_start": datasets.Value("int32"),
                        "span_end": datasets.Value("int32"),
                        "span_text": datasets.Value("string"),
                        "input_text": datasets.Value("string"),
                        "turn_id": datasets.Value("int32"),
                    }),
                    "1": datasets.features.Sequence({
                        "span_start": datasets.Value("int32"),
                        "span_end": datasets.Value("int32"),
                        "span_text": datasets.Value("string"),
                        "input_text": datasets.Value("string"),
                        "turn_id": datasets.Value("int32"),
                    }),
                    "2": datasets.features.Sequence({
                        "span_start": datasets.Value("int32"),
                        "span_end": datasets.Value("int32"),
                        "span_text": datasets.Value("string"),
                        "input_text": datasets.Value("string"),
                        "turn_id": datasets.Value("int32"),
                    }),
                }
            })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        urls = {"train": _URLS["train"], "validation": _URLS["validation"]}
        data_dirs = dl_manager.download_and_extract(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": data_dirs["train"],
                    "split": datasets.Split.TRAIN,
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": data_dirs["validation"],
                    "split": datasets.Split.VALIDATION,
                },
            ),
        ]

    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
    def _generate_examples(self, filepath, split):
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for row in data["data"]:
                id = row["id"]
                source = row["source"]
                story = row["story"]
                questions = [
                    {
                        "input_text": q["input_text"],
                        "turn_id": q["turn_id"]
                    }
                    for q in row["questions"]
                ]
                answers = [
                    {
                        "span_start": a["span_start"],
                        "span_end": a["span_end"],
                        "span_text": a["span_text"],
                        "input_text": a["input_text"],
                        "turn_id": a["turn_id"]
                    }
                    for a in row["answers"]
                ]
                if split == datasets.Split.TRAIN:
                    additional_answers = _EMPTY_ADDITIONAL_ANSWER
                else:
                    additional_answers = {
                        "0": [
                            {
                                "span_start": a0["span_start"],
                                "span_end": a0["span_end"],
                                "span_text": a0["span_text"],
                                "input_text": a0["input_text"],
                                "turn_id": a0["turn_id"]
                            }
                            for a0 in row["additional_answers"]["0"]
                        ],
                        "1": [
                            {
                                "span_start": a1["span_start"],
                                "span_end": a1["span_end"],
                                "span_text": a1["span_text"],
                                "input_text": a1["input_text"],
                                "turn_id": a1["turn_id"]
                            }
                            for a1 in row["additional_answers"]["1"]
                        ],
                        "2": [
                            {
                                "span_start": a2["span_start"],
                                "span_end": a2["span_end"],
                                "span_text": a2["span_text"],
                                "input_text": a2["input_text"],
                                "turn_id": a2["turn_id"]
                            }
                            for a2 in row["additional_answers"]["2"]
                        ],
                    }
                yield row["id"], {
                    "id": id,
                    "story": story,
                    "source": source,
                    "questions": questions,
                    "answers": answers,
                    "additional_answers": additional_answers
                }
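The train/validation handling of `additional_answers` in the CoQA builder can be condensed into a small helper. The sketch below is an illustrative alternative, not the script's code: `additional_answers_for`, `ANSWER_KEYS`, and `EMPTY_ANSWER` are hypothetical names, and the deep copy is an extra precaution (the script reuses one shared placeholder dict).

```python
import copy

# Fields kept for every answer span, matching the feature schema above.
ANSWER_KEYS = ("span_start", "span_end", "span_text", "input_text", "turn_id")

# Placeholder used when a split has no additional annotator answers.
EMPTY_ANSWER = {"span_start": -1, "span_end": -1, "span_text": "", "input_text": "", "turn_id": -1}


def additional_answers_for(row, is_train):
    """Return the three annotator lists, padded with placeholders for train rows."""
    if is_train:
        # Copy so callers cannot mutate the shared placeholder by accident.
        return {str(i): [copy.deepcopy(EMPTY_ANSWER)] for i in range(3)}
    return {
        annotator: [
            {k: answer[k] for k in ANSWER_KEYS}  # drop any unexpected extra fields
            for answer in row["additional_answers"][annotator]
        ]
        for annotator in ("0", "1", "2")
    }


# Hypothetical validation row with one answer per annotator.
dev_row = {
    "additional_answers": {
        str(i): [{"span_start": 0, "span_end": 3, "span_text": "cat",
                  "input_text": "cat", "turn_id": 1, "extra": "ignored"}]
        for i in range(3)
    }
}
train_answers = additional_answers_for({}, is_train=True)
dev_answers = additional_answers_for(dev_row, is_train=False)
print(train_answers["0"], dev_answers["1"])
```

Keeping the same three keys in both splits is what lets a single fixed feature schema describe train and validation examples.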
{"coqa": {"description": "CoQA is a large-scale dataset for building Conversational Question Answering\nsystems. The goal of the CoQA challenge is to measure the ability of machines to\nunderstand a text passage and answer a series of interconnected questions that\nappear in a conversation.\n", "citation": "@misc{reddy2018coqa,\n title={CoQA: A Conversational Question Answering Challenge},\n author={Siva Reddy and Danqi Chen and Christopher D. Manning},\n year={2018},\n eprint={1808.07042},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}\n", "homepage": "https://stanfordnlp.github.io/coqa/", "license": "", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "source": {"dtype": "string", "id": null, "_type": "Value"}, "story": {"dtype": "string", "id": null, "_type": "Value"}, "questions": {"feature": {"input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "answers": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "additional_answers": {"0": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "1": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, 
"_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}, "2": {"feature": {"span_start": {"dtype": "int32", "id": null, "_type": "Value"}, "span_end": {"dtype": "int32", "id": null, "_type": "Value"}, "span_text": {"dtype": "string", "id": null, "_type": "Value"}, "input_text": {"dtype": "string", "id": null, "_type": "Value"}, "turn_id": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "coqa", "config_name": "coqa", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 26250528, "num_examples": 7199, "dataset_name": "coqa"}, "validation": {"name": "validation", "num_bytes": 3765933, "num_examples": 500, "dataset_name": "coqa"}}, "download_checksums": {"https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json": {"num_bytes": 49001836, "checksum": "b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6"}, "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json": {"num_bytes": 9090845, "checksum": "dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a"}}, "download_size": 58092681, "post_processing_size": null, "dataset_size": 30016461, "size_in_bytes": 88109142}}
\ No newline at end of file
{"drop": {"description": "DROP is a QA dataset which tests comprehensive understanding of paragraphs. In \nthis crowdsourced, adversarially-created, 96k question-answering benchmark, a \nsystem must resolve multiple references in a question, map them onto a paragraph,\nand perform discrete operations over them (such as addition, counting, or sorting).\n", "citation": "@misc{dua2019drop,\n title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs}, \n author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},\n year={2019},\n eprint={1903.00161},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}\n", "homepage": "https://allenai.org/data/drop", "license": "", "features": {"section_id": {"dtype": "string", "id": null, "_type": "Value"}, "passage": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "query_id": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, "hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "validated_answers": {"feature": {"number": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"day": {"dtype": "string", "id": null, "_type": "Value"}, "month": {"dtype": "string", "id": null, "_type": "Value"}, "year": {"dtype": "string", "id": null, "_type": "Value"}}, "spans": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "worker_id": {"dtype": "string", "id": null, "_type": "Value"}, 
"hit_id": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "drop", "config_name": "drop", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 108858121, "num_examples": 77409, "dataset_name": "drop"}, "validation": {"name": "validation", "num_bytes": 12560739, "num_examples": 9536, "dataset_name": "drop"}}, "download_checksums": {"https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip": {"num_bytes": 8308692, "checksum": "39d2278a29fd729de301b111a45f434c24834f40df8f4ff116d864589e3249d6"}}, "download_size": 8308692, "post_processing_size": null, "dataset_size": 121418860, "size_in_bytes": 129727552}}
\ No newline at end of file
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Custom DROP dataset that, unlike HF, keeps all question-answer pairs
# even if there are multiple types of answers for the same question.
"""DROP dataset."""
import json
import os
import datasets
_CITATION = """\
@misc{dua2019drop,
title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
year={2019},
eprint={1903.00161},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
"""
_DESCRIPTION = """\
DROP is a QA dataset which tests comprehensive understanding of paragraphs. In
this crowdsourced, adversarially-created, 96k question-answering benchmark, a
system must resolve multiple references in a question, map them onto a paragraph,
and perform discrete operations over them (such as addition, counting, or sorting).
"""
_HOMEPAGE = "https://allenai.org/data/drop"
# TODO: Add the licence for the dataset here if you can find it
_LICENSE = ""
_URLS = {
    "drop": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip",
}

_EMPTY_VALIDATED_ANSWER = [{
    "number": "",
    "date": {
        "day": "",
        "month": "",
        "year": "",
    },
    "spans": [],
    "worker_id": "",
    "hit_id": ""
}]
class Drop(datasets.GeneratorBasedBuilder):
    """DROP is a QA dataset which tests comprehensive understanding of paragraphs."""

    VERSION = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="drop", version=VERSION,
                               description="The DROP dataset."),
    ]

    def _info(self):
        features = datasets.Features({
            "section_id": datasets.Value("string"),
            "passage": datasets.Value("string"),
            "question": datasets.Value("string"),
            "query_id": datasets.Value("string"),
            "answer": {
                "number": datasets.Value("string"),
                "date": {
                    "day": datasets.Value("string"),
                    "month": datasets.Value("string"),
                    "year": datasets.Value("string"),
                },
                "spans": datasets.features.Sequence(datasets.Value("string")),
                "worker_id": datasets.Value("string"),
                "hit_id": datasets.Value("string"),
            },
            "validated_answers": datasets.features.Sequence({
                "number": datasets.Value("string"),
                "date": {
                    "day": datasets.Value("string"),
                    "month": datasets.Value("string"),
                    "year": datasets.Value("string"),
                },
                "spans": datasets.features.Sequence(datasets.Value("string")),
                "worker_id": datasets.Value("string"),
                "hit_id": datasets.Value("string"),
            }),
        })
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        urls = _URLS[self.config.name]
        data_dir = dl_manager.download_and_extract(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(data_dir, "drop_dataset", "drop_dataset_train.json"),
                    "split": "train",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(data_dir, "drop_dataset", "drop_dataset_dev.json"),
                    "split": "validation",
                },
            ),
        ]
    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
    def _generate_examples(self, filepath, split):
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            key = 0
            for section_id, example in data.items():
                # Each example (passage) has multiple sub-question-answer pairs.
                for qa in example["qa_pairs"]:
                    # Build answer.
                    answer = qa["answer"]
                    answer = {
                        "number": answer["number"],
                        "date": {
                            "day": answer["date"].get("day", ""),
                            "month": answer["date"].get("month", ""),
                            "year": answer["date"].get("year", ""),
                        },
                        "spans": answer["spans"],
                        "worker_id": answer.get("worker_id", ""),
                        "hit_id": answer.get("hit_id", ""),
                    }
                    validated_answers = []
                    if "validated_answers" in qa:
                        for validated_answer in qa["validated_answers"]:
                            va = {
                                "number": validated_answer.get("number", ""),
                                "date": {
                                    "day": validated_answer["date"].get("day", ""),
                                    "month": validated_answer["date"].get("month", ""),
                                    "year": validated_answer["date"].get("year", ""),
                                },
                                # `spans` is a sequence of strings, so default to an empty list.
                                "spans": validated_answer.get("spans", []),
                                "worker_id": validated_answer.get("worker_id", ""),
                                "hit_id": validated_answer.get("hit_id", ""),
                            }
                            validated_answers.append(va)
                    else:
                        validated_answers = _EMPTY_VALIDATED_ANSWER
                    yield key, {
                        "section_id": section_id,
                        "passage": example["passage"],
                        "question": qa["question"],
                        "query_id": qa["query_id"],
                        "answer": answer,
                        "validated_answers": validated_answers,
                    }
                    key += 1
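The DROP generator flattens a passage-keyed dictionary into one example per question-answer pair, using a running integer key. The sketch below shows that flattening on a tiny in-memory dictionary. It is a simplified illustration: `flatten_drop` is a hypothetical helper, the sample data is invented, and only a few of the builder's fields are kept.

```python
def flatten_drop(data):
    """Yield (key, example) with one example per QA pair across all passages."""
    key = 0
    for section_id, example in data.items():
        for qa in example["qa_pairs"]:
            date = qa["answer"].get("date", {})
            yield key, {
                "section_id": section_id,
                "question": qa["question"],
                "query_id": qa["query_id"],
                "number": qa["answer"]["number"],
                # Missing date parts fall back to empty strings, as in the builder.
                "year": date.get("year", ""),
            }
            key += 1  # keys stay unique even when passages share question text


# Invented sample: two passages, three QA pairs in total.
sample = {
    "passage_a": {
        "passage": "The team scored 3 goals in 1999.",
        "qa_pairs": [
            {"question": "How many goals?", "query_id": "q1",
             "answer": {"number": "3", "date": {}, "spans": []}},
            {"question": "Which year?", "query_id": "q2",
             "answer": {"number": "", "date": {"year": "1999"}, "spans": []}},
        ],
    },
    "passage_b": {
        "passage": "Two ships sailed.",
        "qa_pairs": [
            {"question": "How many ships?", "query_id": "q3",
             "answer": {"number": "2", "date": {}, "spans": []}},
        ],
    },
}

flat = list(flatten_drop(sample))
print([key for key, _ in flat])
```

A single counter across passages avoids the duplicate-key errors that would arise if keys restarted at zero for each passage.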