Unverified Commit 759da8d5 authored by Lintang Sutawika, committed by GitHub

Merge pull request #757 from EleutherAI/add-readme

[Refactor] Add README.md
parents 73912efb c05a5ad4
group:
- lambada
- loglikelihood
- perplexity
task: lambada_openai
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
......
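These YAML files register the task names that get passed to the evaluator. As a rough sketch of how one of them might be run through the harness's Python API (the `lm_eval.evaluator.simple_evaluate` call below is an assumption about the interface; exact argument names can differ between harness versions):

```
# Rough sketch: running the lambada_openai task defined above.
# simple_evaluate's exact signature may differ across harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                       # HuggingFace causal-LM backend
    model_args="pretrained=gpt2",     # any HF model id works here
    tasks=["lambada_openai"],         # task name from the YAML above
)
print(results["results"]["lambada_openai"])  # perplexity / accuracy metrics
```

Group names such as `lambada` can typically be passed in place of individual task names to run every task registered under that group.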
# LAMBADA Cloze
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/abs/1606.06031
Cloze-style LAMBADA dataset.
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
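As a loose illustration of the property described above, one can compare how likely a causal LM finds the target word given the full passage versus only the final sentence. This is not the harness implementation; the model choice and the toy passage below are placeholders.

```
# Illustrative sketch only: score the target word under two contexts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_logprob(context: str, target: str) -> float:
    """Summed log-probability of the target word's tokens given the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, tok(" " + target, return_tensors="pt").input_ids], dim=-1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return sum(
        logprobs[p - 1, ids[0, p]].item()
        for p in range(ctx_ids.shape[-1], ids.shape[-1])
    )

full_context = ("Her dog Max had been missing for a week. When she opened the "
                "door and saw him on the step, she could only whisper one name:")
short_context = ("When she opened the door and saw him on the step, she could "
                 "only whisper one name:")
print(target_logprob(full_context, "Max"))   # should be much higher ...
print(target_logprob(short_context, "Max"))  # ... than with only the last sentence
```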
### Citation
```
@misc{paperno2016lambada,
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
```
### Groups and Tasks
#### Groups
* `lambada_cloze`
#### Tasks
* `lambada_openai_cloze_yaml`
* `lambada_standard_cloze_yaml`
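The cloze variants differ from the standard LAMBADA tasks only in how the prompt is laid out. The exact template lives in each task's YAML config; the snippet below is only an illustration of the idea, and the `____. ->` marker is an assumption rather than necessarily the template these configs use.

```
# Illustration only: standard vs. cloze-style LAMBADA prompting.
passage = "She took a deep breath and finally said his name"  # hypothetical
context, target = passage.rsplit(" ", 1)

standard_prompt = context               # model must continue with `target`
cloze_prompt = context + " ____. ->"    # model fills in the blank after the cue
```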
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
......@@ -25,7 +25,13 @@ Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
month={Aug}
}
### Groups and Tasks
#### Groups
* `lambada_multilingual`: Evaluates all `lambada_mt_X` tasks
#### Tasks
* `lambada_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's Lambada variant.
......
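# The non-English configs below reuse the English config via `include` and
# override only the task name and `dataset_name` (the translated subset).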
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
......
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
# LogiQA
### Paper
Title: `LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning`
Abstract: https://arxiv.org/abs/2007.08124
LogiQA is a dataset for machine reading comprehension with logical reasoning, sourced
from expert-written questions for testing human logical reasoning. It consists of
8,678 QA instances covering multiple types of deductive reasoning. Results show that
state-of-the-art neural models perform far worse than the human ceiling. The dataset
can also serve as a benchmark for re-investigating logical AI under the deep learning
NLP setting.
Homepage: https://github.com/lgw863/LogiQA-dataset
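LogiQA is scored here as a multiple-choice task: roughly, the log-likelihood of each answer option is computed as a continuation of the question prompt, and the highest-scoring option is taken as the prediction. The sketch below illustrates that idea with HuggingFace transformers; the model, the prompt format, and the example item are placeholders, not the task's actual template.

```
# Illustrative sketch of multiple-choice scoring by answer loglikelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[-1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    return sum(
        logprobs[p - 1, ids[0, p]].item() for p in range(n_prompt, ids.shape[-1])
    )

prompt = "Passage: ...\nQuestion: ...\nAnswer:"   # placeholder item
options = [" A", " B", " C", " D"]                # placeholder answer strings
scores = [continuation_logprob(prompt, o) for o in options]
print(options[scores.index(max(scores))])         # predicted choice
```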
### Citation
```
@misc{liu2020logiqa,
title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
year={2020},
eprint={2007.08124},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: logiqa
dataset_path: EleutherAI/logiqa
dataset_name: logiqa
......
......@@ -25,15 +25,19 @@ Homepage: https://github.com/csitfun/LogiQA2.0
doi={10.1109/TASLP.2023.3293046}}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `logiqa2_zh`: The original dataset in Chinese.
* `logiqa2_NLI`: The NLI version of the dataset, converted from the MRC version.
* `logieval`: Prompt-based; https://github.com/csitfun/LogiEval
NOTE! The subtasks have not been verified yet.
### Checklist
......
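# logieval is tagged greedy_until, i.e. it is evaluated by free-form generation of
# an answer string, whereas logiqa2 below is tagged multiple_choice and scored by
# comparing answer-option loglikelihoods.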
group:
- greedy_until
task: logieval
dataset_path: baber/logiqa2
dataset_name: logieval
......
group:
- multiple_choice
task: logiqa2
dataset_path: baber/logiqa2
dataset_name: logiqa2
......
......@@ -25,7 +25,13 @@ Homepage: https://math-qa.github.io/math-QA/
}
```
### Groups and Tasks
#### Groups
* `math_word_problems`
#### Tasks
* `mathqa`: The MathQA dataset, as a multiple choice dataset where the answer choices are not in context.
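In other words, the prompt shows the problem alone and each option is scored separately as a possible continuation, rather than the options being enumerated inside the prompt. A hedged illustration follows; the example values and formatting are made up and are not the task's actual template.

```
# Illustration only: prompting with vs. without answer choices in context.
question = "What is 20% of 50?"          # hypothetical problem text
choices = ["5", "10", "15", "20"]

# mathqa-style: choices are NOT shown; each one is scored as a continuation.
prompt_without_choices = f"Question: {question}\nAnswer:"

# contrast: a variant that lists the options inside the prompt.
prompt_with_choices = (
    f"Question: {question}\n"
    + "\n".join(f"{label}. {c}" for label, c in zip("abcd", choices))
    + "\nAnswer:"
)
```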
......
group:
- multiple_choice
- math_word_problems
task: mathqa
dataset_path: math_qa
......
# OpenBookQA
### Paper
Title: `Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering`
Abstract: https://arxiv.org/abs/1809.02789
OpenBookQA is a question-answering dataset modeled after open book exams for
assessing human understanding of a subject. It consists of 5,957 multiple-choice
elementary-level science questions (4,957 train, 500 dev, 500 test), which probe
the understanding of a small “book” of 1,326 core science facts and the application
of these facts to novel situations. For training, the dataset includes a mapping
from each question to the core science fact it was designed to probe. Answering
OpenBookQA questions requires additional broad common knowledge, not contained
in the book. The questions, by design, are answered incorrectly by both a retrieval-
based algorithm and a word co-occurrence algorithm.
Homepage: https://allenai.org/data/open-book-qa
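The task pulls its data from the HuggingFace hub (see `dataset_path: openbookqa`, `dataset_name: main` in the config below). A quick look at one record, assuming the standard field layout of that dataset:

```
# Peek at one OpenBookQA item; field names follow the HF "openbookqa"/"main"
# dataset and are an assumption if that layout ever changes.
from datasets import load_dataset

ds = load_dataset("openbookqa", "main", split="validation")
ex = ds[0]
print(ex["question_stem"])        # the question text
print(ex["choices"]["label"])     # e.g. ["A", "B", "C", "D"]
print(ex["choices"]["text"])      # the four answer options
print(ex["answerKey"])            # gold label
```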
### Citation
```
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `openbookqa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
......