Unverified Commit 1cc2a764 authored by Lintang Sutawika, committed by GitHub

Merge branch 'big-refactor' into fix-metrics

parents 4b87456d 3d732e68
group:
- lambada
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
......
# LAMBADA Cloze
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/abs/1606.06031
Cloze-style LAMBADA dataset.
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
```
@misc{paperno2016lambada,
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
```
### Groups and Tasks
#### Groups
* `lambada_cloze`
#### Tasks
* `lambada_openai_cloze_yaml`
* `lambada_standard_cloze_yaml`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
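The two cloze config fragments above are truncated in this diff. Purely as an illustration of the schema used throughout this PR (and not a reconstruction of the elided lines), a cloze-style task could be expressed with the same keys that appear in the full `mc_taco` config later in this commit. The dataset field name `text`, the `output_type` value, and the prompt template below are assumptions for the sketch, not anything shown here.

```yaml
# Hypothetical sketch only -- not a reconstruction of the truncated configs above.
# Config keys follow the mc_taco YAML later in this diff; the dataset field name
# `text`, the output_type value, and the prompt template are assumptions.
task: lambada_openai_cloze_example   # hypothetical task name
dataset_path: EleutherAI/lambada_openai
dataset_name: default
output_type: loglikelihood
doc_to_text: "{{text.split(' ')[:-1] | join(' ')}} ____. ->"
doc_to_target: "{{' ' + text.split(' ')[-1]}}"
metric_list:
  - metric: acc
```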
......@@ -25,7 +25,13 @@ Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
month={Aug}
}
### Groups and Tasks
#### Groups
* `lambada_multilingual`: Evaluates all `lambada_openai_mt_X` tasks
#### Tasks
* `lambada_openai_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's LAMBADA variant.
......
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
......
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
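The per-language configs above reuse a single base file: each non-English YAML pulls in `lambada_mt_en.yaml` via `include:` and then overrides only the task name and the dataset config. A further translation would presumably follow the same pattern; the sketch below is hypothetical (there is no `nl` split in this PR) and only illustrates the override mechanism.

```yaml
# Hypothetical sketch of the include-and-override pattern used by the configs above.
# "nl" is an invented example and is NOT a split added by this PR.
include: lambada_mt_en.yaml   # inherit the prompt, metrics, and dataset_path from the English base
group:
  - lambada_multilingual
  - loglikelihood
  - perplexity
task: lambada_openai_mt_nl    # hypothetical task name
dataset_name: nl              # hypothetical dataset config name
```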
# LogiQA
### Paper
Title: `LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning`
Abstract: https://arxiv.org/abs/2007.08124
LogiQA is a dataset for testing logical reasoning in machine reading comprehension,
built from expert-written questions originally designed to test human logical reasoning.
It consists of 8,678 QA instances, covering multiple types of deductive reasoning.
Results show that state-of-the-art neural models perform far worse than the human
ceiling. The dataset can also serve as a benchmark for re-investigating logical AI
under the deep learning NLP setting.
Homepage: https://github.com/lgw863/LogiQA-dataset
### Citation
```
@misc{liu2020logiqa,
title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
year={2020},
eprint={2007.08124},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: logiqa
dataset_path: EleutherAI/logiqa
dataset_name: logiqa
......
......@@ -25,15 +25,19 @@ Homepage: https://github.com/csitfun/LogiQA2.0
doi={10.1109/TASLP.2023.3293046}}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa2_zh`: The original dataset in Chinese.
* `logiqa2_NLI`: The NLI version of the dataset converted from the MRC version.
* `logieval`: Prompt based; https://github.com/csitfun/LogiEval
NOTE! The subtasks have not been verified yet.
### Checklist
......
group:
- greedy_until
task: logieval
dataset_path: baber/logiqa2
dataset_name: logieval
......
group:
- multiple_choice
task: logiqa2
dataset_path: baber/logiqa2
dataset_name: logiqa2
......
......@@ -25,7 +25,13 @@ Homepage: https://math-qa.github.io/math-QA/
}
```
### Groups and Tasks
#### Groups
* `math_word_problems`
#### Tasks
* `mathqa`: The MathQA dataset, posed as a multiple-choice task in which the answer choices are not provided in the context.
......
group:
- multiple_choice
- math_word_problems
task: mathqa
dataset_path: math_qa
......
# MC Taco
### Paper
Title: `"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding`
Abstract: https://arxiv.org/abs/1909.03065
MC-TACO is a dataset of 13k question-answer pairs that require temporal commonsense
comprehension. The dataset covers five temporal properties: (1) duration (how long
an event takes), (2) temporal ordering (typical order of events), (3) typical time
(when an event occurs), (4) frequency (how often an event occurs), and (5) stationarity
(whether a state is maintained for a very long time or indefinitely).
WARNING: Running this task with a `--limit` arg will give misleading results! The
corresponding dataset is structured such that each multiple-choice question gathered
by the authors is split into question-option pairs, where each such pair gets
siloed into an individual document for plausibility testing. Because the harness
shuffles these documents, setting `--limit` will likely "cut off" certain candidate
answers. This is a problem because the task's metrics require an exhaustive evaluation
of a question's options. See section 4 of the paper for details.
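Concretely, the structure that makes `--limit` unsafe looks roughly like the following (values invented for illustration; the field names match the `mc_taco` config further down in this diff). A single gathered question is expanded into one document per candidate answer, each labeled for plausibility, so truncating the shuffled document list can drop some of a question's candidates.

```yaml
# Illustrative only: invented example rows showing how one gathered question
# expands into several plausibility documents. Field names follow the mc_taco
# config further down in this diff (doc_to_choice maps label 0 -> "no", 1 -> "yes").
- sentence: "They went on a vacation to Hawaii."
  question: "How long did the vacation last?"
  answer: "two weeks"
  label: 1   # plausible
- sentence: "They went on a vacation to Hawaii."
  question: "How long did the vacation last?"
  answer: "thirty seconds"
  label: 0   # implausible
```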
Homepage: https://leaderboard.allenai.org/mctaco/submissions/public
### Citation
```
@inproceedings{zhou2019going,
    title={"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding},
    author={Ben Zhou and Daniel Khashabi and Qiang Ning and Dan Roth},
    booktitle={EMNLP},
    year={2019}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `mc_taco`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: mc_taco
dataset_path: mc_taco
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: "{{question}} {{sentence}}"
metric_list:
- metric: acc
- metric: f1
# OpenBookQA
### Paper
Title: `Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering`
Abstract: https://arxiv.org/abs/1809.02789
OpenBookQA is a question-answering dataset modeled after open book exams for
assessing human understanding of a subject. It consists of 5,957 multiple-choice
elementary-level science questions (4,957 train, 500 dev, 500 test), which probe
the understanding of a small “book” of 1,326 core science facts and the application
of these facts to novel situations. For training, the dataset includes a mapping
from each question to the core science fact it was designed to probe. Answering
OpenBookQA questions requires additional broad common knowledge, not contained
in the book. The questions, by design, are answered incorrectly by both a retrieval-
based algorithm and a word co-occurrence algorithm.
Homepage: https://allenai.org/data/open-book-qa
### Citation
```
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `openbookqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
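This commit does not show the `openbookqa` YAML itself. Purely as an illustration of the schema used by the other configs in this diff (not the actual file), an OpenBookQA-style multiple-choice task could be declared roughly as follows; the dataset field names (`question_stem`, `choices`, `answerKey`) and the target expression are assumptions about the Hugging Face `openbookqa` dataset, not anything confirmed by this PR.

```yaml
# Hypothetical sketch only -- not the actual openbookqa config (which is not shown
# in this diff). Schema keys mirror the mc_taco YAML above; the dataset field names
# (question_stem, choices, answerKey) are assumptions about the dataset layout.
task: openbookqa_example
dataset_path: openbookqa
dataset_name: main
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{question_stem}}"
doc_to_choice: "{{choices.text}}"
doc_to_target: "{{choices.label.index(answerKey)}}"   # assumed mapping from answer letter to choice index
metric_list:
  - metric: acc
```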