Commit f71d56eb authored by lintangsutawika's avatar lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

parents 33f2f9bf 2f870265
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
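The three configs above differ only in `task` and `dataset_name`; everything else is inherited from the shared base via `include: lambada_mt_en.yaml`. A minimal Python sketch of how such per-language variants could be emitted programmatically (this generator script is illustrative and not part of this commit):
```python
# Illustrative only: regenerate the per-language LAMBADA translation configs
# shown above. Not a script from this repository.
import yaml

LANGUAGES = ["es", "fr", "it"]  # language codes taken from the configs above

for lang in LANGUAGES:
    cfg = {
        "include": "lambada_mt_en.yaml",  # shared base config
        "group": ["lambada_multilingual", "loglikelihood", "perplexity"],
        "task": f"lambada_openai_mt_{lang}",
        "dataset_name": lang,  # per-language dataset config on the Hugging Face Hub
    }
    with open(f"lambada_mt_{lang}.yaml", "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False, allow_unicode=True)
```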
# LogiQA
### Paper
Title: `LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning`
Abstract: https://arxiv.org/abs/2007.08124
LogiQA is a dataset for testing human logical reasoning. It consists of 8,678 QA
instances, covering multiple types of deductive reasoning. Results show that
state-of-the-art neural models perform far worse than the human ceiling. The dataset
can also serve as a benchmark for reinvestigating logical AI under the deep learning
NLP setting.
Homepage: https://github.com/lgw863/LogiQA-dataset
### Citation
```
@misc{liu2020logiqa,
title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
year={2020},
eprint={2007.08124},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa`
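A minimal sketch of selecting this task from Python, assuming the `simple_evaluate` entry point exposed by the harness on this branch and its `hf` model backend (the checkpoint name is a placeholder):
```python
# Hedged sketch, not a documented recipe: assumes lm_eval.simple_evaluate is
# importable on this branch and accepts these arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["logiqa"],
    num_fewshot=0,
)
print(results["results"]["logiqa"])
```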
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: logiqa
dataset_path: EleutherAI/logiqa
dataset_name: logiqa
......
......@@ -25,15 +25,19 @@ Homepage: https://github.com/csitfun/LogiQA2.0
doi={10.1109/TASLP.2023.3293046}}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa2_zh`: The original dataset in Chinese.
* `logiqa2_NLI`: The NLI version of the dataset converted from the MRC version.
* `logieval`: Prompt based; https://github.com/csitfun/LogiEval
NOTE! The subtasks have not been verified yet.
### Checklist
......
group:
- greedy_until
task: logieval
dataset_path: baber/logiqa2
dataset_name: logieval
......
group:
- multiple_choice
task: logiqa2
dataset_path: baber/logiqa2
dataset_name: logiqa2
......
......@@ -25,7 +25,13 @@ Homepage: https://math-qa.github.io/math-QA/
}
```
### Groups and Tasks
#### Groups
* `math_word_problems`
#### Tasks
* `mathqa`: The MathQA dataset, as a multiple choice dataset where the answer choices are not in context.
......
group:
- multiple_choice
- math_word_problems
task: mathqa
dataset_path: math_qa
......
# MC-TACO
### Paper
Title: `"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding`
Abstract: https://arxiv.org/abs/1909.03065
MC-TACO is a dataset of 13k question-answer pairs that require temporal commonsense
comprehension. The dataset covers five temporal properties: (1) duration (how long
an event takes), (2) temporal ordering (typical order of events), (3) typical time
(when an event occurs), (4) frequency (how often an event occurs), and (5) stationarity
(whether a state is maintained for a very long time or indefinitely).
WARNING: Running this task with a `--limit` arg will give misleading results! The
corresponding dataset is structured such that each multiple-choice question gathered
by the authors is split into question-option pairs, where each such pair gets
siloed into an individual document for plausibility testing. Because the harness
shuffles these documents, setting `--limit` will likely "cut off" certain candidate
answers. This is a problem because the task's metrics require an exhaustive evaluation
of a question's options. See section 4 of the paper for details.
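Because of that layout, the task's question-level metric can only be computed once every option belonging to a question has been scored. A small illustrative sketch of the grouping involved (not the harness's metric code; field names are assumptions):
```python
# Illustrative only: shows why every question-option pair must be present
# before a question-level "all options judged correctly" score makes sense.
from collections import defaultdict

def question_exact_match(records):
    """records: an iterable of dicts with 'question', 'prediction' and 'label'
    keys (field names are assumptions for this sketch)."""
    by_question = defaultdict(list)
    for r in records:
        by_question[r["question"]].append(r["prediction"] == r["label"])
    # A question counts as correct only if *all* of its candidate answers were
    # judged correctly; truncating the candidate list with --limit breaks this.
    return sum(all(v) for v in by_question.values()) / len(by_question)
```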
Homepage: https://leaderboard.allenai.org/mctaco/submissions/public
### Citation
```
@misc{zhou2019going,
    title={"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding},
    author={Ben Zhou and Daniel Khashabi and Qiang Ning and Dan Roth},
    year={2019},
    eprint={1909.03065},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `mc_taco`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: mc_taco
dataset_path: mc_taco
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: "{{question}} {{sentence}}"
metric_list:
- metric: acc
- metric: f1
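The `doc_to_text` field above is a Jinja-style template; the sketch below renders it directly with `jinja2` for one made-up record to show the prompt the config produces (the record contents are invented, only the template string comes from the YAML):
```python
# Hedged illustration of the prompt produced by the mc_taco config above.
# The example record is invented; only the template string is from the YAML.
from jinja2 import Template

doc = {
    "sentence": "She went on a vacation to Hawaii.",
    "question": "How long did the vacation last?",
    "answer": "two weeks",
}
prompt = Template(
    "{{sentence}}\nQuestion: {{question}}\nAnswer: {{answer}}\nPlausible:"
).render(**doc)
print(prompt)
# The model then scores the continuations "no" and "yes" (doc_to_choice), and
# the label field selects which continuation is the correct one.
```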
# OpenBookQA
### Paper
Title: `Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering`
Abstract: https://arxiv.org/abs/1809.02789
OpenBookQA is a question-answering dataset modeled after open book exams for
assessing human understanding of a subject. It consists of 5,957 multiple-choice
elementary-level science questions (4,957 train, 500 dev, 500 test), which probe
the understanding of a small “book” of 1,326 core science facts and the application
of these facts to novel situations. For training, the dataset includes a mapping
from each question to the core science fact it was designed to probe. Answering
OpenBookQA questions requires additional broad common knowledge, not contained
in the book. The questions, by design, are answered incorrectly by both a retrieval-
based algorithm and a word co-occurrence algorithm.
Homepage: https://allenai.org/data/open-book-qa
### Citation
```
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `openbookqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
......
# PAWS-X
### Paper
Title: `PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification`
Abstract: https://arxiv.org/abs/1908.11828
The dataset consists of 23,659 human-translated PAWS evaluation pairs and
296,406 machine-translated training pairs in six typologically distinct languages.
Examples are adapted from PAWS-Wiki.
Prompt format (same as in mGPT):
"<s>" + sentence1 + ", right? " + mask + ", " + sentence2 + "</s>",
where `mask` is the string that matches the label: Yes or No.
Example:
<s> The Tabaci River is a tributary of the River Leurda in Romania, right? No, The Leurda River is a tributary of the River Tabaci in Romania.</s>
Language-specific prompts are translated word-by-word with Google Translate and may
differ from the ones used by mGPT and XGLM (they do not provide their prompts).
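A small sketch of assembling the two label-dependent candidate strings for the English prompt described above (the sentence pair is taken from the example; note the YAML configs below build the candidates without the `<s>`/`</s>` markers):
```python
# Illustrative only: builds the two candidate strings for the English PAWS-X
# prompt format described above.
def paws_en_choices(sentence1: str, sentence2: str):
    # One candidate per label; the model's preferred candidate decides the label.
    return [
        sentence1 + ", right? Yes, " + sentence2,  # paraphrase
        sentence1 + ", right? No, " + sentence2,   # not a paraphrase
    ]

choices = paws_en_choices(
    "The Tabaci River is a tributary of the River Leurda in Romania",
    "The Leurda River is a tributary of the River Tabaci in Romania",
)
for c in choices:
    print(c)
```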
Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
### Citation
```
@inproceedings{yang-etal-2019-paws,
title = "{PAWS}-{X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification",
author = "Yang, Yinfei and
Zhang, Yuan and
Tar, Chris and
Baldridge, Jason",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D19-1382",
doi = "10.18653/v1/D19-1382",
pages = "3687--3692",
}
```
### Groups and Tasks
#### Groups
* `pawsx`
#### Tasks
* `paws_de`: German
* `paws_en`: English
* `paws_es`: Spanish
* `paws_fr`: French
* `paws_ja`: Japanese
* `paws_ko`: Korean
* `paws_zh`: Chinese
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# Generated by utils.py
dataset_name: de
doc_to_choice: '{{[sentence1+", richtig? Ja, "+sentence2, sentence1+", richtig? Nein,
"+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_de
# Generated by utils.py
dataset_name: en
doc_to_choice: '{{[sentence1+", right? Yes, "+sentence2, sentence1+", right? No, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_en
# Generated by utils.py
dataset_name: es
doc_to_choice: '{{[sentence1+", verdad? Sí, "+sentence2, sentence1+", verdad? No,
"+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_es
# Generated by utils.py
dataset_name: fr
doc_to_choice: '{{[sentence1+", n''est-ce pas? Oui, "+sentence2, sentence1+", n''est-ce
pas? No, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_fr
# Generated by utils.py
dataset_name: ja
doc_to_choice: '{{[sentence1+", ですね? はい, "+sentence2, sentence1+", ですね? いいえ, "+sentence2]}}'
doc_to_text: ''
include: pawsx_template_yaml
task: paws_ja
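Each of these files is stamped "Generated by utils.py"; the sketch below shows what such a generator might look like, using only the languages and prompt strings visible in this diff (the actual utils.py is not shown here and may differ):
```python
# Hedged sketch of a per-language config generator in the spirit of the
# "Generated by utils.py" headers above. Not the repository's actual utils.py.
import yaml

# (language code, "right?" phrase, yes word, no word) -- taken from the configs above
PROMPTS = [
    ("de", ", richtig? ", "Ja", "Nein"),
    ("en", ", right? ", "Yes", "No"),
    ("es", ", verdad? ", "Sí", "No"),
    ("fr", ", n'est-ce pas? ", "Oui", "No"),
    ("ja", ", ですね? ", "はい", "いいえ"),
]

for lang, right, yes, no in PROMPTS:
    cfg = {
        "include": "pawsx_template_yaml",
        "dataset_name": lang,
        "task": f"paws_{lang}",
        "doc_to_text": "",
        "doc_to_choice": (
            "{{["
            f'sentence1+"{right}{yes}, "+sentence2, '
            f'sentence1+"{right}{no}, "+sentence2'
            "]}}"
        ),
    }
    with open(f"paws_{lang}.yaml", "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False, allow_unicode=True)
```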