Unverified Commit 759da8d5 authored by Lintang Sutawika, committed by GitHub

Merge pull request #757 from EleutherAI/add-readme

[Refactor] Add README.md
parents 73912efb c05a5ad4
group:
- lambada
- loglikelihood
- perplexity
task: lambada_openai
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
......
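These YAML files register the task names that get passed to the evaluator. As a rough sketch of how one of them might be run through the harness's Python API (the `lm_eval.evaluator.simple_evaluate` call below is an assumption about the interface; exact argument names can differ between harness versions):

```
# Rough sketch: running the lambada_openai task defined above.
# simple_evaluate's exact signature may differ across harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                       # HuggingFace causal-LM backend
    model_args="pretrained=gpt2",     # any HF model id works here
    tasks=["lambada_openai"],         # task name from the YAML above
)
print(results["results"]["lambada_openai"])  # perplexity / accuracy metrics
```

Group names such as `lambada` can typically be passed in place of individual task names to run every task registered under that group.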
# LAMBADA Cloze
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/abs/1606.06031
Cloze-style LAMBADA dataset.
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
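As a loose illustration of the property described above, one can compare how likely a causal LM finds the target word given the full passage versus only the final sentence. This is not the harness implementation; the model choice and the toy passage below are placeholders.

```
# Illustrative sketch only: score the target word under two contexts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_logprob(context: str, target: str) -> float:
    """Summed log-probability of the target word's tokens given the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, tok(" " + target, return_tensors="pt").input_ids], dim=-1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return sum(
        logprobs[p - 1, ids[0, p]].item()
        for p in range(ctx_ids.shape[-1], ids.shape[-1])
    )

full_context = ("Her dog Max had been missing for a week. When she opened the "
                "door and saw him on the step, she could only whisper one name:")
short_context = ("When she opened the door and saw him on the step, she could "
                 "only whisper one name:")
print(target_logprob(full_context, "Max"))   # should be much higher ...
print(target_logprob(short_context, "Max"))  # ... than with only the last sentence
```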
### Citation
```
@misc{paperno2016lambada,
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
```
### Groups and Tasks
#### Groups
* `lambada_cloze`
#### Tasks
* `lambada_openai_cloze_yaml`
* `lambada_standard_cloze_yaml`
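The cloze variants differ from the standard LAMBADA tasks only in how the prompt is laid out. The exact template lives in each task's YAML config; the snippet below is only an illustration of the idea, and the `____. ->` marker is an assumption rather than necessarily the template these configs use.

```
# Illustration only: standard vs. cloze-style LAMBADA prompting.
passage = "She took a deep breath and finally said his name"  # hypothetical
context, target = passage.rsplit(" ", 1)

standard_prompt = context               # model must continue with `target`
cloze_prompt = context + " ____. ->"    # model fills in the blank after the cue
```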
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
......@@ -25,7 +25,13 @@ Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
month={Aug}
}
### Groups and Tasks
#### Groups
* `lambada_multilingual`: Evaluates all `lambada_mt_X` tasks
#### Tasks
* `lambada_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's Lambada variant.
......
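# The non-English configs below reuse the English config via `include` and
# override only the task name and `dataset_name` (the translated subset).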
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
......
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
# LogiQA
### Paper
Title: `LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning`
Abstract: https://arxiv.org/abs/2007.08124
LogiQA is a dataset for machine reading comprehension with logical reasoning, sourced
from expert-written questions for testing human logical reasoning. It consists of
8,678 QA instances covering multiple types of deductive reasoning. Results show that
state-of-the-art neural models perform far worse than the human ceiling. The dataset
can also serve as a benchmark for re-investigating logical AI under the deep learning
NLP setting.
Homepage: https://github.com/lgw863/LogiQA-dataset
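LogiQA is scored here as a multiple-choice task: roughly, the log-likelihood of each answer option is computed as a continuation of the question prompt, and the highest-scoring option is taken as the prediction. The sketch below illustrates that idea with HuggingFace transformers; the model, the prompt format, and the example item are placeholders, not the task's actual template.

```
# Illustrative sketch of multiple-choice scoring by answer loglikelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[-1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return sum(
        logprobs[p - 1, ids[0, p]].item() for p in range(n_prompt, ids.shape[-1])
    )

prompt = "Passage: ...\nQuestion: ...\nAnswer:"   # placeholder item
options = [" A", " B", " C", " D"]                # placeholder answer strings
scores = [continuation_logprob(prompt, o) for o in options]
print(options[scores.index(max(scores))])         # predicted choice
```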
### Citation
```
@misc{liu2020logiqa,
title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
year={2020},
eprint={2007.08124},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: logiqa
dataset_path: EleutherAI/logiqa
dataset_name: logiqa
......
......@@ -25,15 +25,19 @@ Homepage: https://github.com/csitfun/LogiQA2.0
doi={10.1109/TASLP.2023.3293046}}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa2_zh`: The original dataset in Chinese.
* `logiqa2_NLI`: The NLI version of the dataset, converted from the MRC version.
* `logieval`: Prompt-based; https://github.com/csitfun/LogiEval
NOTE! The subtasks have not been verified yet.
### Checklist
......
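# logieval is tagged greedy_until, i.e. it is evaluated by free-form generation of
# an answer string, whereas logiqa2 below is tagged multiple_choice and scored by
# comparing answer-option loglikelihoods.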
group:
- greedy_until
task: logieval
dataset_path: baber/logiqa2
dataset_name: logieval
......
group:
- multiple_choice
task: logiqa2
dataset_path: baber/logiqa2
dataset_name: logiqa2
......
......@@ -25,7 +25,13 @@ Homepage: https://math-qa.github.io/math-QA/
}
```
### Groups and Tasks
#### Groups
* `math_word_problems`
#### Tasks
* `mathqa`: The MathQA dataset, as a multiple choice dataset where the answer choices are not in context.
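In other words, the prompt shows the problem alone and each option is scored separately as a possible continuation, rather than the options being enumerated inside the prompt. A hedged illustration follows; the example values and formatting are made up and are not the task's actual template.

```
# Illustration only: prompting with vs. without answer choices in context.
question = "What is 20% of 50?"          # hypothetical problem text
choices = ["5", "10", "15", "20"]

# mathqa-style: choices are NOT shown; each one is scored as a continuation.
prompt_without_choices = f"Question: {question}\nAnswer:"

# contrast: a variant that lists the options inside the prompt.
prompt_with_choices = (
    f"Question: {question}\n"
    + "\n".join(f"{label}. {c}" for label, c in zip("abcd", choices))
    + "\nAnswer:"
)
```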
......
group:
- multiple_choice
- math_word_problems
task: mathqa
dataset_path: math_qa
......
# OpenBookQA
### Paper
Title: `Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering`
Abstract: https://arxiv.org/abs/1809.02789
OpenBookQA is a question-answering dataset modeled after open book exams for
assessing human understanding of a subject. It consists of 5,957 multiple-choice
elementary-level science questions (4,957 train, 500 dev, 500 test), which probe
the understanding of a small “book” of 1,326 core science facts and the application
of these facts to novel situations. For training, the dataset includes a mapping
from each question to the core science fact it was designed to probe. Answering
OpenBookQA questions requires additional broad common knowledge, not contained
in the book. The questions, by design, are answered incorrectly by both a retrieval-
based algorithm and a word co-occurrence algorithm.
Homepage: https://allenai.org/data/open-book-qa
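The task pulls its data from the HuggingFace hub (see `dataset_path: openbookqa`, `dataset_name: main` in the config below). A quick look at one record, assuming the standard field layout of that dataset:

```
# Peek at one OpenBookQA item; field names follow the HF "openbookqa"/"main"
# dataset and are an assumption if that layout ever changes.
from datasets import load_dataset

ds = load_dataset("openbookqa", "main", split="validation")
ex = ds[0]
print(ex["question_stem"])        # the question text
print(ex["choices"]["label"])     # e.g. ["A", "B", "C", "D"]
print(ex["choices"]["text"])      # the four answer options
print(ex["answerKey"])            # gold label
```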
### Citation
```
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `openbookqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
......