Commit f71d56eb authored by lintangsutawika's avatar lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

parents 33f2f9bf 2f870265
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
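The three configs above differ only in `task` and `dataset_name`; everything else is inherited from the shared base via `include: lambada_mt_en.yaml`. A minimal Python sketch of how such per-language variants could be emitted programmatically (this generator script is illustrative and not part of this commit):
```python
# Illustrative only: regenerate the per-language LAMBADA translation configs
# shown above. Not a script from this repository.
import yaml

LANGUAGES = ["es", "fr", "it"]  # language codes taken from the configs above

for lang in LANGUAGES:
    cfg = {
        "include": "lambada_mt_en.yaml",  # shared base config
        "group": ["lambada_multilingual", "loglikelihood", "perplexity"],
        "task": f"lambada_openai_mt_{lang}",
        "dataset_name": lang,  # per-language dataset config on the Hugging Face Hub
    }
    with open(f"lambada_mt_{lang}.yaml", "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False, allow_unicode=True)
```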
# LogiQA
### Paper
Title: `LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning`
Abstract: https://arxiv.org/abs/2007.08124
LogiQA is a dataset for testing human logical reasoning. It consists of 8,678 QA
instances, covering multiple types of deductive reasoning. Results show that
state-of-the-art neural models perform far worse than the human ceiling. The dataset
can also serve as a benchmark for reinvestigating logical AI under the deep learning
NLP setting.
Homepage: https://github.com/lgw863/LogiQA-dataset
### Citation
```
@misc{liu2020logiqa,
title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
year={2020},
eprint={2007.08124},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa`
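A minimal sketch of selecting this task from Python, assuming the `simple_evaluate` entry point exposed by the harness on this branch and its `hf` model backend (the checkpoint name is a placeholder):
```python
# Hedged sketch, not a documented recipe: assumes lm_eval.simple_evaluate is
# importable on this branch and accepts these arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["logiqa"],
    num_fewshot=0,
)
print(results["results"]["logiqa"])
```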
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: logiqa
dataset_path: EleutherAI/logiqa
dataset_name: logiqa
......
......@@ -25,15 +25,19 @@ Homepage: https://github.com/csitfun/LogiQA2.0
doi={10.1109/TASLP.2023.3293046}}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa2_zh`: The original dataset in Chinese.
* `logiqa2_NLI`: The NLI version of the dataset converted from the MRC version.
* `logieval`: Prompt based; https://github.com/csitfun/LogiEval
NOTE! The subtasks have not been verified yet.
### Checklist
......
group:
- greedy_until
task: logieval
dataset_path: baber/logiqa2
dataset_name: logieval
......
group:
- multiple_choice
task: logiqa2
dataset_path: baber/logiqa2
dataset_name: logiqa2
......
......@@ -25,7 +25,13 @@ Homepage: https://math-qa.github.io/math-QA/
}
```
### Groups and Tasks
#### Groups
* `math_word_problems`
#### Tasks
* `mathqa`: The MathQA dataset, as a multiple choice dataset where the answer choices are not in context.
......
group:
- multiple_choice
- math_word_problems
task: mathqa
dataset_path: math_qa
......
# MC-TACO
### Paper
Title: `"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding`
Abstract: https://arxiv.org/abs/1909.03065
MC-TACO is a dataset of 13k question-answer pairs that require temporal commonsense
comprehension. The dataset covers five temporal properties: (1) duration (how long
an event takes), (2) temporal ordering (typical order of events), (3) typical time
(when an event occurs), (4) frequency (how often an event occurs), and (5) stationarity
(whether a state is maintained for a very long time or indefinitely).
WARNING: Running this task with a `--limit` arg will give misleading results! The
corresponding dataset is structured such that each multiple-choice question gathered
by the authors is split into question-option pairs, where each such pair gets
siloed into an individual document for plausibility testing. Because the harness
shuffles these documents, setting `--limit` will likely "cut off" certain candidate
answers. This is a problem because the task's metrics require an exhaustive evaluation
of a question's options. See section 4 of the paper for details.
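Because of that layout, the task's question-level metric can only be computed once every option belonging to a question has been scored. A small illustrative sketch of the grouping involved (not the harness's metric code; field names are assumptions):
```python
# Illustrative only: shows why every question-option pair must be present
# before a question-level "all options judged correctly" score makes sense.
from collections import defaultdict

def question_exact_match(records):
    """records: an iterable of dicts with 'question', 'prediction' and 'label'
    keys (field names are assumptions for this sketch)."""
    by_question = defaultdict(list)
    for r in records:
        by_question[r["question"]].append(r["prediction"] == r["label"])
    # A question counts as correct only if *all* of its candidate answers were
    # judged correctly; truncating the candidate list with --limit breaks this.
    return sum(all(v) for v in by_question.values()) / len(by_question)
```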
Homepage: https://leaderboard.allenai.org/mctaco/submissions/public
### Citation
```
@misc{zhou2019going,
    title={"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding},
    author={Ben Zhou and Daniel Khashabi and Qiang Ning and Dan Roth},
    year={2019},
    eprint={1909.03065},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `mc_taco`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: mc_taco
dataset_path: mc_taco
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: "{{question}} {{sentence}}"
metric_list:
- metric: acc
- metric: f1
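The `doc_to_text` field above is a Jinja-style template; the sketch below renders it directly with `jinja2` for one made-up record to show the prompt the config produces (the record contents are invented, only the template string comes from the YAML):
```python
# Hedged illustration of the prompt produced by the mc_taco config above.
# The example record is invented; only the template string is from the YAML.
from jinja2 import Template

doc = {
    "sentence": "She went on a vacation to Hawaii.",
    "question": "How long did the vacation last?",
    "answer": "two weeks",
}
prompt = Template(
    "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
).render(**doc)
print(prompt)
# The model then scores the continuations "no" and "yes" (doc_to_choice), and
# the label field selects which continuation is the correct one.
```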
# OpenBookQA
### Paper
Title: `Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering`
Abstract: https://arxiv.org/abs/1809.02789
OpenBookQA is a question-answering dataset modeled after open book exams for
assessing human understanding of a subject. It consists of 5,957 multiple-choice
elementary-level science questions (4,957 train, 500 dev, 500 test), which probe
the understanding of a small “book” of 1,326 core science facts and the application
of these facts to novel situations. For training, the dataset includes a mapping
from each question to the core science fact it was designed to probe. Answering
OpenBookQA questions requires additional broad common knowledge, not contained
in the book. The questions, by design, are answered incorrectly by both a retrieval-
based algorithm and a word co-occurrence algorithm.
Homepage: https://allenai.org/data/open-book-qa
### Citation
```
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `openbookqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
......
# PAWS-X
### Paper
Title: `PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification`
Abstract: https://arxiv.org/abs/1908.11828
The dataset consists of 23,659 human-translated PAWS evaluation pairs and
296,406 machine-translated training pairs in six typologically distinct languages.
Examples are adapted from PAWS-Wiki.
Prompt format (same as in mGPT):
"<s>" + sentence1 + ", right? " + mask + ", " + sentence2 + "</s>",
where `mask` is the string that matches the label: Yes or No.
Example:
<s> The Tabaci River is a tributary of the River Leurda in Romania, right? No, The Leurda River is a tributary of the River Tabaci in Romania.</s>
Language-specific prompts are translated word-by-word with Google Translate and may
differ from the ones used by mGPT and XGLM (they do not provide their prompts).
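A small sketch of assembling the two label-dependent candidate strings for the English prompt described above (the sentence pair is taken from the example; note the YAML configs below build the candidates without the `<s>`/`</s>` markers):
```python
# Illustrative only: builds the two candidate strings for the English PAWS-X
# prompt format described above.
def paws_en_choices(sentence1: str, sentence2: str):
    # One candidate per label; the model's preferred candidate decides the label.
    return [
        sentence1 + ", right? Yes, " + sentence2,  # paraphrase
        sentence1 + ", right? No, " + sentence2,   # not a paraphrase
    ]

choices = paws_en_choices(
    "The Tabaci River is a tributary of the River Leurda in Romania",
    "The Leurda River is a tributary of the River Tabaci in Romania",
)
for c in choices:
    print(c)
```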
Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
### Citation
```
@inproceedings{yang-etal-2019-paws,
title = "{PAWS}-{X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification",
author = "Yang, Yinfei and
Zhang, Yuan and
Tar, Chris and
Baldridge, Jason",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D19-1382",
doi = "10.18653/v1/D19-1382",
pages = "3687--3692",
}
```
### Groups and Tasks
#### Groups
* `pawsx`
#### Tasks
* `paws_de`: German
* `paws_en`: English
* `paws_es`: Spanish
* `paws_fr`: French
* `paws_ja`: Japanese
* `paws_ko`: Korean
* `paws_zh`: Chinese
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# Generated by utils.py
dataset_name: de
doc_to_choice: '{{[sentence1+", richtig? Ja, "+sentence2, sentence1+", richtig? Nein,
"+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_de
# Generated by utils.py
dataset_name: en
doc_to_choice: '{{[sentence1+", right? Yes, "+sentence2, sentence1+", right? No, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_en
# Generated by utils.py
dataset_name: es
doc_to_choice: '{{[sentence1+", verdad? Sí, "+sentence2, sentence1+", verdad? No,
"+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_es
# Generated by utils.py
dataset_name: fr
doc_to_choice: '{{[sentence1+", n''est-ce pas? Oui, "+sentence2, sentence1+", n''est-ce
pas? No, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_fr
# Generated by utils.py
dataset_name: ja
doc_to_choice: '{{[sentence1+", ですね? はい, "+sentence2, sentence1+", ですね? いいえ, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_ja
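Each of these files is stamped "Generated by utils.py"; the sketch below shows what such a generator might look like, using only the languages and prompt strings visible in this diff (the actual utils.py is not shown here and may differ):
```python
# Hedged sketch of a per-language config generator in the spirit of the
# "Generated by utils.py" headers above. Not the repository's actual utils.py.
import yaml

# (language code, "right?" phrase, yes word, no word) -- taken from the configs above
PROMPTS = [
    ("de", ", richtig? ", "Ja", "Nein"),
    ("en", ", right? ", "Yes", "No"),
    ("es", ", verdad? ", "Sí", "No"),
    ("fr", ", n'est-ce pas? ", "Oui", "No"),
    ("ja", ", ですね? ", "はい", "いいえ"),
]

for lang, right, yes, no in PROMPTS:
    cfg = {
        "include": "pawsx_template_yaml",
        "dataset_name": lang,
        "task": f"paws_{lang}",
        "doc_to_text": "",
        "doc_to_choice": (
            "{{["
            f'sentence1+"{right}{yes}, "+sentence2, '
            f'sentence1+"{right}{no}, "+sentence2'
            "]}}"
        ),
    }
    with open(f"paws_{lang}.yaml", "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False, allow_unicode=True)
```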