Commit 759da8d5 authored by Lintang Sutawika, committed by GitHub

Merge pull request #757 from EleutherAI/add-readme

[Refactor] Add README.md
parents 73912efb c05a5ad4
# ANLI
### Paper
Title: `Adversarial NLI: A New Benchmark for Natural Language Understanding`
Abstract: `https://arxiv.org/pdf/1910.14599.pdf`
Paper Link: https://arxiv.org/abs/1910.14599
Adversarial NLI (ANLI) is a dataset collected via an iterative, adversarial
human-and-model-in-the-loop procedure. It consists of three rounds that progressively
increase in difficulty and complexity, and each question-answer includes annotator-
provided explanations.
Homepage: https://github.com/facebookresearch/anli
### Citation
......@@ -31,13 +30,18 @@ Homepage: `https://github.com/facebookresearch/anli`
}
```
### Groups and Tasks
#### Groups
* `anli`: Evaluates `anli_r1`, `anli_r2`, and `anli_r3`
#### Tasks
* `anli_r1`: The data collected adversarially in the first round.
* `anli_r2`: The data collected adversarially in the second round, after training on the previous round's data.
* `anli_r3`: The data collected adversarially in the third round, after training on the data from the previous rounds.
### Checklist
For adding novel benchmarks/datasets to the library:
......
group:
- multiple_choice
- natural_language_inference
- nli
- adverserial
- anli
task: anli_r1
dataset_path: anli
dataset_name: null
......
group:
- multiple_choice
- natural_language_inference
- nli
- adverserial
include: anli_r1.yaml
task: anli_r2
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r2
validation_split: dev_r2
test_split: test_r2
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
- "True"
- "Neither"
- "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group:
- multiple_choice
- natural_language_inference
- nli
- adverserial
include: anli_r1.yaml
task: anli_r3
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r3
validation_split: dev_r3
test_split: test_r3
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
- "True"
- "Neither"
- "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
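To make the prompt format and label mapping in the ANLI configs above concrete, here is a minimal plain-Python sketch. It mimics the `doc_to_text` and `doc_to_target` templates with ordinary string formatting rather than the harness's own template rendering, and it treats predictions as already-decoded strings, which is a simplification. The example documents are invented for illustration.

```python
# Label order copied from doc_to_choice:
# index 0 = "True" (entailment), 1 = "Neither" (neutral), 2 = "False" (contradiction).
CHOICES = ["True", "Neither", "False"]


def doc_to_text(doc: dict) -> str:
    """Plain-Python equivalent of the doc_to_text template."""
    return f"{doc['premise']}\nQuestion: {doc['hypothesis']} True, False, or Neither?\nAnswer:"


def doc_to_target(doc: dict) -> str:
    """Plain-Python equivalent of the doc_to_target template."""
    return CHOICES[doc["label"]]


def accuracy(predictions: list[str], docs: list[dict]) -> float:
    """Mean of per-example correctness, mirroring `metric: acc` with `aggregation: mean`."""
    hits = [float(pred == doc_to_target(doc)) for pred, doc in zip(predictions, docs)]
    return sum(hits) / len(hits)


# Hypothetical documents, invented for illustration only.
docs = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A man is making music.", "label": 0},
    {"premise": "A man is playing a guitar.", "hypothesis": "The man is asleep.", "label": 2},
]
prompt = doc_to_text(docs[0])
score = accuracy(["True", "Neither"], docs)  # first prediction right, second wrong
```

Keeping the choice list in one place means the target string and the candidate answers cannot drift apart, which is presumably why the configs repeat the same ordering in `doc_to_target` and `doc_to_choice`.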
# ARC
### Paper
Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Abstract: https://arxiv.org/abs/1803.05457
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
......@@ -13,7 +16,9 @@ a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questi
Homepage: https://allenai.org/data/arc
### Citation
```
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
......@@ -23,3 +28,27 @@ Homepage: https://allenai.org/data/arc
volume={abs/1803.05457}
}
```
### Groups and Tasks
#### Groups
* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`
#### Tasks
* `arc_easy`
* `arc_challenge`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: arc_easy.yaml
group:
- ai2_arc
- multiple_choice
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
group:
- ai2_arc
- multiple_choice
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
......
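The `include: arc_easy.yaml` line in the `arc_challenge` config above suggests that one task file can inherit from another and override a few keys. The sketch below illustrates that idea with in-memory dictionaries. The merge rule (keys in the including file win) is an assumption about the intended semantics, and `resolve_include` and `registry` are hypothetical names, not part of the library.

```python
def resolve_include(config: dict, registry: dict) -> dict:
    """Merge a config over the file named by its `include` key (assumed semantics)."""
    config = dict(config)  # copy so the caller's dict is not modified
    base_name = config.pop("include", None)
    if base_name is None:
        return config
    base = resolve_include(registry[base_name], registry)
    return {**base, **config}  # keys in the including file win


# In-memory stand-ins for the YAML files; only keys shown in the diff are used.
registry = {
    "arc_easy.yaml": {
        "group": ["ai2_arc", "multiple_choice"],
        "task": "arc_easy",
        "dataset_path": "ai2_arc",
        "dataset_name": "ARC-Easy",
    },
}
arc_challenge = resolve_include(
    {
        "include": "arc_easy.yaml",
        "group": ["ai2_arc", "multiple_choice"],
        "task": "arc_challenge",
        "dataset_path": "ai2_arc",
        "dataset_name": "ARC-Challenge",
    },
    registry,
)
```

Under this reading, any key the elided part of `arc_easy.yaml` defines would carry over to `arc_challenge` unchanged unless the including file sets it again.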
# Arithmetic
### Paper
Title: `Language Models are Few-Shot Learners`
Abstract: https://arxiv.org/abs/2005.14165
A small battery of 10 tests that involve asking language models a simple arithmetic
problem in natural language.
Homepage: https://github.com/openai/gpt-3/tree/master/data
### Citation
```
@inproceedings{NEURIPS2020_1457c0d6,
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {1877--1901},
publisher = {Curran Associates, Inc.},
title = {Language Models are Few-Shot Learners},
url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
volume = {33},
year = {2020}
}
```
### Groups and Tasks
#### Groups
* `arithmetic`: Evaluates all ten tasks, `arithmetic_1dc` through `arithmetic_5ds`
#### Tasks
* `arithmetic_1dc`
* `arithmetic_2da`
* `arithmetic_2dm`
* `arithmetic_2ds`
* `arithmetic_3da`
* `arithmetic_3ds`
* `arithmetic_4da`
* `arithmetic_4ds`
* `arithmetic_5da`
* `arithmetic_5ds`
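The task names above are terse, so a small sketch for decoding them may help. It assumes the suffixes follow the naming used in the GPT-3 paper (digit count, then `a` for addition, `s` for subtraction, `m` for multiplication, `c` for composite); the README itself does not spell this out, and `describe` is a hypothetical helper.

```python
# Suffix meanings are an assumption based on the GPT-3 paper's task naming.
OPERATIONS = {"a": "addition", "s": "subtraction", "m": "multiplication", "c": "composite"}

TASKS = [
    "arithmetic_1dc", "arithmetic_2da", "arithmetic_2dm", "arithmetic_2ds",
    "arithmetic_3da", "arithmetic_3ds", "arithmetic_4da", "arithmetic_4ds",
    "arithmetic_5da", "arithmetic_5ds",
]


def describe(task: str) -> str:
    """Turn e.g. 'arithmetic_2da' into '2-digit addition'."""
    suffix = task.removeprefix("arithmetic_")  # e.g. "2da"
    digits, operation = suffix[0], suffix[-1]
    return f"{digits}-digit {OPERATIONS[operation]}"


descriptions = {task: describe(task) for task in TASKS}
```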
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# bAbI
### Paper
Title: Towards ai-complete question answering: A set of prerequisite toy tasks
Abstract: https://arxiv.org/abs/1502.05698
One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
Homepage: https://github.com/facebookarchive/bAbI-tasks
### Citation
```
@article{weston2015towards,
title={Towards ai-complete question answering: A set of prerequisite toy tasks},
author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1502.05698},
year={2015}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `babi`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- greedy_until
task: babi
dataset_path: Muennighoff/babi
dataset_name: null
......
......@@ -52,9 +52,15 @@ Homepage: https://github.com/nyu-mll/crows-pairs, https://gitlab.inria.fr/french
}
```
### Groups and Tasks
#### Groups
- `crows_pairs_english`: The entire English subset of the CrowS-Pairs dataset.
- `crows_pairs_french`: The entire French subset of the CrowS-Pairs dataset.
#### Tasks
The following tasks evaluate sub-areas of bias in the English CrowS-Pairs dataset:
- `crows_pairs_english_age`
......@@ -68,8 +74,6 @@ The following tasks evaluate sub-areas of bias in the English CrowS-Pairs datase
- `crows_pairs_english_sexual_orientation`
- `crows_pairs_english_socioeconomic`
The following tasks evaluate sub-areas of bias in the French CrowS-Pairs dataset:
- `crows_pairs_french_age`
- `crows_pairs_french_autre`
......
......@@ -40,11 +40,22 @@ Homepage: https://gluebenchmark.com/
}
```
### Groups and Tasks
#### Groups
* `glue`: Runs all GLUE subtasks.
#### Tasks
* `cola`
* `mnli`
* `mrpc`
* `qnli`
* `qqp`
* `rte`
* `sst`
* `wnli`
### Checklist
......
......@@ -31,6 +31,19 @@ Homepage: https://github.com/openai/grade-school-math
}
```
### Groups and Tasks
#### Groups
- `math_word_problems`
- `chain_of_thought`
- `self_consistency`
#### Tasks
- `gsm8k_yaml`
- `gsm8k_cot`: GSM8K with Chain-of-Thought
- `gsm8k_cot_self_consistency`: GSM8K with Chain-of-Thought and Self-Consistency
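For readers unfamiliar with the last variant, self-consistency is commonly described as sampling several chain-of-thought completions and keeping the most frequent final answer. The sketch below shows only that voting step. It is not taken from the task's configuration; `self_consistent_answer` is a hypothetical helper and the sample answers are invented.

```python
from collections import Counter


def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Return the most frequent final answer among sampled reasoning chains."""
    counts = Counter(answer.strip() for answer in sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical final answers extracted from five sampled chains of thought.
samples = ["18", "18", "20", "18 ", "17"]
voted = self_consistent_answer(samples)
```

Normalizing whitespace before counting matters here, since otherwise `"18"` and `"18 "` would be tallied as different answers.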
### Checklist
......
group:
- greedy_until
- math_word_problems
task: gsm8k_yaml
dataset_path: gsm8k
......
......@@ -32,7 +32,13 @@ Homepage: https://aghie.github.io/head-qa/
}
```
### Groups and Tasks
#### Groups
- `headqa`: Evaluates `headqa_en` and `headqa_es`
#### Tasks
* `headqa_en` - English variant of HEAD-QA
* `headqa_es` - Spanish variant of HEAD-QA
......
group:
- multiple_choice
- headqa
task: headqa_en
dataset_path: EleutherAI/headqa
dataset_name: en
......
# HellaSwag
### Paper
Title: `HellaSwag: Can a Machine Really Finish Your Sentence?`
Abstract: https://arxiv.org/abs/1905.07830
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?
In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
Homepage: `https://rowanzellers.com/hellaswag/`
......@@ -21,6 +24,17 @@ Homepage: `https://rowanzellers.com/hellaswag/`
}
```
### Groups and Tasks
#### Groups
- Not part of a group yet
#### Tasks
- `hellaswag`
### Checklist
For adding novel benchmarks/datasets to the library:
......
......@@ -25,13 +25,20 @@ Homepage: https://github.com/hendrycks/ethics
}
```
### Groups and Tasks
#### Groups
- `hendrycks_ethics`
#### Tasks
* `ethics_cm`
* `ethics_deontology`
* `ethics_justice`
* `ethics_utilitarianism`
* (MISSING) `ethics_utilitarianism_original`
* `ethics_virtue`
### Checklist
......
# LAMBADA
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
......@@ -14,6 +15,18 @@ in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Groups and Tasks
#### Groups
- `lambada`
#### Tasks
- `lambada_openai`
- `lambada_standard`
### Citation
@misc{
......