Commit 767c58b9 authored by lintangsutawika

Merge branch 'big-refactor' into update_docs

parents 3bfbddc4 759da8d5
group:
  - multiple_choice
  - math_word_problems
task: mathqa
dataset_path: math_qa
...
# MC Taco
### Paper
Title: `"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding`
Abstract: https://arxiv.org/abs/1909.03065
MC-TACO is a dataset of 13k question-answer pairs that require temporal commonsense
comprehension. The dataset covers five temporal properties: (1) duration (how long
an event takes), (2) temporal ordering (typical order of events), (3) typical time
(when an event occurs), (4) frequency (how often an event occurs), and (5) stationarity
(whether a state is maintained for a very long time or indefinitely).
WARNING: Running this task with a `--limit` arg will give misleading results! The
corresponding dataset is structured such that each multiple-choice-question gathered
by the authors is split into question-option pairs, where each such pair gets
siloed into an individual document for plausibility testing. Because the harness
shuffles these documents, setting `--limit` will likely "cut off" certain candidate
answers. This is a problem because the task's metrics require an exhaustive evaluation
of a question's options. See section 4 of the paper for details.
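To make this concrete, here is a minimal sketch (with made-up data; only the field names follow the dataset) of how one gathered question is split into per-candidate documents, and why truncating the shuffled document list drops some of a question's options:

```
# Hypothetical example: one MC-TACO question with its gathered candidate answers.
question = "How long did the vacation last?"
candidates = [("two weeks", "yes"), ("ten seconds", "no"), ("a month", "yes")]

# Each (question, candidate) pair becomes its own document for plausibility testing.
docs = [{"question": question, "answer": ans, "label": lbl} for ans, lbl in candidates]

# A shuffled, truncated subset of `docs` may contain only some of this question's
# candidates, which invalidates metrics that need the full option set.
```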
Homepage: https://leaderboard.allenai.org/mctaco/submissions/public
### Citation
```
BibTeX-formatted citation goes here
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `mc_taco`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: mc_taco
dataset_path: mc_taco
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: "{{question}} {{sentence}}"
metric_list:
- metric: acc
- metric: f1
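The `doc_to_text` field is a Jinja2 template over the dataset's fields. A minimal standalone sketch of how it renders, using `jinja2.Template` directly on a hypothetical document (the values are invented; only the field names come from the dataset):

```
from jinja2 import Template

# Hypothetical MC-TACO-style document (field names follow the config above).
doc = {
    "sentence": "He went on a vacation to Hawaii.",
    "question": "How long did the vacation last?",
    "answer": "two weeks",
    "label": 1,  # maps to doc_to_choice[1] == "yes"
}

prompt = Template(
    "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
).render(**doc)
print(prompt)
# He went on a vacation to Hawaii.
# Question: How long did the vacation last?
# Answer: two weeks
# Plausible:
```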
# OpenBookQA
### Paper
Title: `Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering`
Abstract: https://arxiv.org/abs/1809.02789
OpenBookQA is a question-answering dataset modeled after open book exams for
assessing human understanding of a subject. It consists of 5,957 multiple-choice
elementary-level science questions (4,957 train, 500 dev, 500 test), which probe
the understanding of a small “book” of 1,326 core science facts and the application
of these facts to novel situations. For training, the dataset includes a mapping
from each question to the core science fact it was designed to probe. Answering
OpenBookQA questions requires additional broad common knowledge, not contained
in the book. The questions, by design, are answered incorrectly by both a
retrieval-based algorithm and a word co-occurrence algorithm.
Homepage: https://allenai.org/data/open-book-qa
### Citation
```
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `openbookqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
  - multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
...
# PAWS-X
### Paper
Title: `PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification`
Abstract: https://arxiv.org/abs/1908.11828
The dataset consists of 23,659 human-translated PAWS evaluation pairs and
296,406 machine-translated training pairs in 6 typologically distinct languages.
Examples are adapted from PAWS-Wiki.
Prompt format (same as in mGPT):
"<s>" + sentence1 + ", right? " + mask + ", " + sentence2 + "</s>",
where mask is the string that matches the label: Yes or No.
Example:
<s> The Tabaci River is a tributary of the River Leurda in Romania, right? No, The Leurda River is a tributary of the River Tabaci in Romania.</s>
Language-specific prompts are translated word-by-word with Google Translate
and may differ from the ones used by mGPT and XGLM (they do not provide their prompts).
Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
### Citation
```
@inproceedings{yang-etal-2019-paws,
title = "{PAWS}-{X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification",
author = "Yang, Yinfei and
Zhang, Yuan and
Tar, Chris and
Baldridge, Jason",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D19-1382",
doi = "10.18653/v1/D19-1382",
pages = "3687--3692",
}
```
### Groups and Tasks
#### Groups
* `pawsx`
#### Tasks
* `paws_de`: German
* `paws_en`: English
* `paws_es`: Spanish
* `paws_fr`: French
* `paws_ja`: Japanese
* `paws_ko`: Korean
* `paws_zh`: Chinese
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# Generated by utils.py
dataset_name: de
doc_to_choice: '{{[sentence1+", richtig? Ja, "+sentence2, sentence1+", richtig? Nein,
"+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_de
# Generated by utils.py
dataset_name: en
doc_to_choice: '{{[sentence1+", right? Yes, "+sentence2, sentence1+", right? No, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_en
# Generated by utils.py
dataset_name: es
doc_to_choice: '{{[sentence1+", verdad? Sí, "+sentence2, sentence1+", verdad? No,
"+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_es
# Generated by utils.py
dataset_name: fr
doc_to_choice: '{{[sentence1+", n''est-ce pas? Oui, "+sentence2, sentence1+", n''est-ce
pas? No, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_fr
# Generated by utils.py
dataset_name: ja
doc_to_choice: '{{[sentence1+", ですね? はい, "+sentence2, sentence1+", ですね? いいえ, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_ja
# Generated by utils.py
dataset_name: ko
doc_to_choice: '{{[sentence1+", 맞죠? 예, "+sentence2, sentence1+", 맞죠? 아니요, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_ko
# Generated by utils.py
dataset_name: zh
doc_to_choice: '{{[sentence1+", 对吧? 是, "+sentence2, sentence1+", 对吧? 不是, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_zh
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group: pawsx
task: null
dataset_path: paws-x
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: null
doc_to_target: label
doc_to_choice: null
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
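For reference, a minimal sketch of how one example of a `multiple_choice` task generated from this template is scored. `loglikelihood` is a stand-in callable, not the harness's actual model API; the field names follow the PAWS-X schema:

```
def predict_paws_en(doc, loglikelihood):
    # doc_to_text is empty, so each doc_to_choice entry is scored as a full string.
    choices = [
        doc["sentence1"] + ", right? Yes, " + doc["sentence2"],
        doc["sentence1"] + ", right? No, " + doc["sentence2"],
    ]
    scores = [loglikelihood(c) for c in choices]
    prediction = scores.index(max(scores))
    # acc counts a hit when the argmax index matches the integer label.
    return prediction == doc["label"]
```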
import argparse
from typing import Dict, List

import yaml

# Different languages that are part of xnli.
# These correspond to dataset names (Subsets) on HuggingFace.
# A yaml file is generated by this script for each language.
LANGUAGES = {
    "de": {  # German
        "QUESTION_WORD": "richtig",
        "YES": "Ja",
        "NO": "Nein",
    },
    "en": {  # English
        "QUESTION_WORD": "right",
        "YES": "Yes",
        "NO": "No",
    },
    "es": {  # Spanish
        "QUESTION_WORD": "verdad",
        "YES": "Sí",
        "NO": "No",
    },
    "fr": {  # French
        "QUESTION_WORD": "n'est-ce pas",
        "YES": "Oui",
        "NO": "No",
    },
    "ja": {  # Japanese
        "QUESTION_WORD": "ですね",
        "YES": "はい",
        "NO": "いいえ",
    },
    "ko": {  # Korean
        "QUESTION_WORD": "맞죠",
        "YES": "예",
        "NO": "아니요",
    },
    "zh": {  # Chinese
        "QUESTION_WORD": "对吧",
        "YES": "是",
        "NO": "不是",
    },
}


def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
    """
    Generate a yaml file for each language.

    :param output_dir: The directory to output the files to.
    :param overwrite: Whether to overwrite files if they already exist.
    """
    err = []
    for lang in LANGUAGES.keys():
        file_name = f"paws_{lang}.yaml"
        try:
            QUESTION_WORD = LANGUAGES[lang]["QUESTION_WORD"]
            YES = LANGUAGES[lang]["YES"]
            NO = LANGUAGES[lang]["NO"]
            with open(
                f"{output_dir}/{file_name}", "w" if overwrite else "x", encoding="utf8"
            ) as f:
                f.write("# Generated by utils.py\n")
                yaml.dump(
                    {
                        "include": "pawsx_template_yaml",
                        "dataset_name": lang,
                        "task": f"paws_{lang}",
                        "doc_to_text": "",
                        "doc_to_choice": f"{{{{["
                        f"""sentence1+\", {QUESTION_WORD}? {YES}, \"+sentence2,"""
                        f""" sentence1+\", {QUESTION_WORD}? {NO}, \"+sentence2"""
                        f"]}}}}",
                    },
                    f,
                    allow_unicode=True,
                )
        except FileExistsError:
            err.append(file_name)

    if len(err) > 0:
        raise FileExistsError(
            "Files were not created because they already exist (use --overwrite flag):"
            f" {', '.join(err)}"
        )


def main() -> None:
    """Parse CLI args and generate language-specific yaml files."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--overwrite",
        default=False,
        action="store_true",
        help="Overwrite files if they already exist",
    )
    parser.add_argument(
        "--output-dir", default=".", help="Directory to write yaml files to"
    )
    args = parser.parse_args()

    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)


if __name__ == "__main__":
    main()
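The per-language configs above can be regenerated with this script, e.g. `python utils.py --output-dir . --overwrite`, using the flags defined in `main()`; without `--overwrite`, any yaml file that already exists is left untouched and reported in the raised error.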
# The Pile
### Paper
Title: `The Pile: An 800GB Dataset of Diverse Text for Language Modeling`
Abstract: https://arxiv.org/abs/2101.00027
The Pile is an 825 GiB diverse, open source language modelling data set that consists
of 22 smaller, high-quality datasets combined together. To score well on Pile
...
Homepage: https://pile.eleuther.ai/
...
  year={2020}
}
```
### Groups and Tasks
#### Groups
* `pile`
#### Tasks
* `pile_arxiv`
* `pile_bookcorpus2`
* `pile_books3`
* `pile_dm-mathematics`
* `pile_enron`
* `pile_europarl`
* `pile_freelaw`
* `pile_github`
* `pile_gutenberg`
* `pile_hackernews`
* `pile_nih-exporter`
* `pile_opensubtitles`
* `pile_openwebtext2`
* `pile_philpapers`
* `pile_pile-cc`
* `pile_pubmed-abstracts`
* `pile_pubmed-central`
* `pile_stackexchange`
* `pile_ubuntu-irc`
* `pile_uspto`
* `pile_wikipedia`
* `pile_youtubesubtitles`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
  - pile
  - perplexity
  - loglikelihood_rolling
task: pile_arxiv
dataset_path: EleutherAI/pile
dataset_name: pile_arxiv
...
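Pile tasks are perplexity-style evaluations: each document is scored in full with a rolling log-likelihood rather than prompted. A minimal, formula-level sketch of the metrics typically reported for such tasks (aggregation details in the harness may differ):

```
import math

def perplexity_metrics(total_loglikelihood, num_words, num_bytes):
    # total_loglikelihood: summed natural-log likelihood over all evaluated documents.
    word_perplexity = math.exp(-total_loglikelihood / num_words)
    byte_perplexity = math.exp(-total_loglikelihood / num_bytes)
    bits_per_byte = -total_loglikelihood / (num_bytes * math.log(2))
    return word_perplexity, byte_perplexity, bits_per_byte
```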
# PIQA
### Paper
Title: `PIQA: Reasoning about Physical Commonsense in Natural Language`
Abstract: https://arxiv.org/abs/1911.11641
Physical Interaction: Question Answering (PIQA) is a physical commonsense
reasoning task and a corresponding benchmark dataset. PIQA was designed to investigate
the physical knowledge of existing models. To what extent are current approaches
actually learning about the world?
Homepage: https://yonatanbisk.com/piqa/
### Citation
```
@inproceedings{Bisk2020,
author = {Yonatan Bisk and Rowan Zellers and
Ronan Le Bras and Jianfeng Gao
and Yejin Choi},
title = {PIQA: Reasoning about Physical Commonsense in
Natural Language},
booktitle = {Thirty-Fourth AAAI Conference on
Artificial Intelligence},
year = {2020},
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `piqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
  - multiple_choice
task: piqa
dataset_path: piqa
dataset_name: null
...
# PROST
### Paper
Title: `PROST: Physical Reasoning about Objects Through Space and Time`
Abstract: https://arxiv.org/abs/2106.03634
PROST, Physical Reasoning about Objects Through Space and Time, is a dataset
consisting of 18,736 multiple-choice questions made from 14 manually curated
templates, covering 10 physical reasoning concepts. All questions are designed
to probe both causal and masked language models in a zero-shot setting.
NOTE: PROST is limited to the zero-shot setting to adhere to authors' intentions
as discussed in section 7 of the paper: "We hope that the community will use
this dataset in the intended way: in a zero-shot setting to probe models which
have been trained on data not specifically collected to succeed on PROST."
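A hedged example of evaluating PROST in the intended zero-shot setting, assuming the harness's Python entry point `lm_eval.simple_evaluate` (the model name below is only a placeholder):

```
import lm_eval

# Zero-shot evaluation of PROST; num_fewshot=0 keeps the authors' intended setting.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["prost"],
    num_fewshot=0,
)
print(results["results"]["prost"])
```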
Homepage: https://github.com/nala-cub/prost
### Citation
```
@inproceedings{aroca-ouellette-etal-2021-prost,
title = "{PROST}: {P}hysical Reasoning about Objects through Space and Time",
author = "Aroca-Ouellette, St{\'e}phane and
Paik, Cory and
Roncone, Alessandro and
Kann, Katharina",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.404",
pages = "4597--4608",
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `prost`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?