Merge pull request #757 from EleutherAI/add-readme

[Refactor] Add README.md

Commit 759da8d5, authored Aug 16, 2023 by Lintang Sutawika; committed by GitHub Aug 16, 2023 (unverified)
17 changed files
--- a/lm_eval/tasks/truthfulqa/README.md
+++ b/lm_eval/tasks/truthfulqa/README.md
@@ -27,8 +27,27 @@ Homepage: `https://github.com/sylinrl/TruthfulQA`
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks

 * `truthfulqa_mc1`: `Multiple-choice, single answer`
-* `truthfulqa_mc2`: `Multiple-choice, multiple answers`
-* `truthfulqa_gen`: `Answer generation`
+* `truthfulqa_mc2`: `Multiple-choice, multiple answers`
+* `truthfulqa_gen`: `Answer generation`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/unscramble/README.md
+++ b/lm_eval/tasks/unscramble/README.md
@@ -28,7 +28,13 @@ Homepage: https://github.com/openai/gpt-3/tree/master/data
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* `unscramble`
+
+#### Tasks

 * `anagrams1` - Anagrams of all but the first and last letter.
 * `anagrams2` - Anagrams of all but the first and last 2 letters.

--- a/lm_eval/tasks/unscramble/anagrams1.yaml
+++ b/lm_eval/tasks/unscramble/anagrams1.yaml
 group:
-  - greedy_until
+  - unscramble
 task: anagrams1
 dataset_path: EleutherAI/unscramble
 dataset_name: mid_word_1_anagrams

--- a/lm_eval/tasks/unscramble/anagrams2.yaml
+++ b/lm_eval/tasks/unscramble/anagrams2.yaml
 group:
-  - greedy_until
+  - unscramble
 task: anagrams2
 dataset_path: EleutherAI/unscramble
 dataset_name: mid_word_2_anagrams

--- a/lm_eval/tasks/unscramble/cycle_letters.yaml
+++ b/lm_eval/tasks/unscramble/cycle_letters.yaml
 group:
-  - greedy_until
+  - unscramble
 task: cycle_letters
 dataset_path: EleutherAI/unscramble
 dataset_name: cycle_letters_in_word

--- a/lm_eval/tasks/unscramble/random_insertion.yaml
+++ b/lm_eval/tasks/unscramble/random_insertion.yaml
 group:
-  - greedy_until
+  - unscramble
 task: random_insertion
 dataset_path: EleutherAI/unscramble
 dataset_name: random_insertion_in_word

--- a/lm_eval/tasks/unscramble/reversed_words.yaml
+++ b/lm_eval/tasks/unscramble/reversed_words.yaml
 group:
-  - greedy_until
+  - unscramble
 task: reversed_words
 dataset_path: EleutherAI/unscramble
 dataset_name: reversed_words
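
The five unscramble configs now declare one shared `unscramble` group in place of the generic `greedy_until` tag. As a rough illustration of what a shared group name makes possible, the sketch below indexes task names by group. It uses plain dictionaries that mirror the `group` and `task` fields visible in this diff; `index_by_group` is a hypothetical helper for illustration, not a function from the harness.

```python
from collections import defaultdict

# Plain-dict stand-ins for YAML configs touched in this diff (assumed shape:
# an optional "group" list plus a "task" name).
configs = [
    {"group": ["unscramble"], "task": "anagrams1"},
    {"group": ["unscramble"], "task": "anagrams2"},
    {"group": ["unscramble"], "task": "cycle_letters"},
    {"group": ["unscramble"], "task": "random_insertion"},
    {"group": ["unscramble"], "task": "reversed_words"},
    {"group": ["freebase"], "task": "webqs"},
    {"task": "wikitext"},  # no group key after this change
]


def index_by_group(configs):
    """Map each group name to the sorted task names that declare it."""
    index = defaultdict(list)
    for cfg in configs:
        # A config without a "group" key simply belongs to no group.
        for group in cfg.get("group", []):
            index[group].append(cfg["task"])
    return {group: sorted(tasks) for group, tasks in index.items()}


groups = index_by_group(configs)
print(groups["unscramble"])
```

With the old `greedy_until` tag, the same lookup would have returned every task that shared that output type rather than only the tasks built on this dataset.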

--- a/lm_eval/tasks/webqs/README.md
+++ b/lm_eval/tasks/webqs/README.md
-# Task-name
+# WEBQs

 ### Paper

@@ -33,9 +33,14 @@ Homepage: `https://worksheets.codalab.org/worksheets/0xba659fe363cb46e7a505c5b6a
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* `freebase`
+
+#### Tasks

-List or describe tasks defined in this folder, and their names here:
 * `webqs`: `Questions with multiple accepted answers.`

 ### Checklist

--- a/lm_eval/tasks/webqs/webqs.yaml
+++ b/lm_eval/tasks/webqs/webqs.yaml
 group:
  - freebase
-  - question_answer
 task: webqs
 dataset_path: web_questions
 dataset_name: null

--- a/lm_eval/tasks/wikitext/README.md
+++ b/lm_eval/tasks/wikitext/README.md
@@ -26,7 +26,13 @@ Homepage: https://www.salesforce.com/products/einstein/ai-research/the-wikitext-
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks

 * `wikitext`: measure perplexity on the Wikitext dataset, via rolling loglikelihoods.
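
For readers unfamiliar with the metric, here is a minimal sketch of how a word-level perplexity can be derived from rolling loglikelihoods. It assumes each document contributes one summed natural-log likelihood and one word count; the function name and input shape are illustrative and are not taken from the harness.

```python
import math


def word_perplexity(loglikelihoods, word_counts):
    """Perplexity = exp(-(total log-likelihood) / (total word count)).

    loglikelihoods: per-document summed natural-log likelihoods (assumed input).
    word_counts: per-document word counts, in the same order.
    """
    total_ll = sum(loglikelihoods)
    total_words = sum(word_counts)
    if total_words == 0:
        raise ValueError("word_counts must sum to a positive number")
    return math.exp(-total_ll / total_words)


# A model that assigns probability 0.25 to each of 4 words.
example = word_perplexity([4 * math.log(0.25)], [4])
print(round(example, 6))
```

Normalising by word count rather than token count keeps the figure comparable across tokenizers, which is one common reason to report perplexity this way.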


--- a/lm_eval/tasks/wikitext/wikitext.yaml
+++ b/lm_eval/tasks/wikitext/wikitext.yaml
-group:
-  - perplexity
-  - loglikelihood_rolling
 task: wikitext
 dataset_path: EleutherAI/wikitext_document_level
 dataset_name: wikitext-2-raw-v1

--- a/lm_eval/tasks/winogrande/README.md
+++ b/lm_eval/tasks/winogrande/README.md
+# WinoGrande
+
+### Paper
+
+Title: `WinoGrande: An Adversarial Winograd Schema Challenge at Scale`
+
+Abstract: https://arxiv.org/abs/1907.10641
+
+WinoGrande is a collection of 44k problems, inspired by Winograd Schema Challenge
+(Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and
+robustness against the dataset-specific bias. Formulated as a fill-in-a-blank
+task with binary options, the goal is to choose the right option for a given
+sentence which requires commonsense reasoning.
+
+NOTE: This evaluation of Winogrande uses partial evaluation as described by
+Trinh & Le in Simple Method for Commonsense Reasoning (2018).
+See: https://arxiv.org/abs/1806.02847
+
+Homepage: https://leaderboard.allenai.org/winogrande/submissions/public
+
+
+### Citation
+
+```
+@article{sakaguchi2019winogrande,
+    title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
+    author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
+    journal={arXiv preprint arXiv:1907.10641},
+    year={2019}
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks
+
+* `winogrande`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xcopa/README.md
+++ b/lm_eval/tasks/xcopa/README.md
-## XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
-https://ducdauge.github.io/files/xcopa.pdf
+# XCOPA
+
+### Paper
+
+Title: `XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning`
+
+Abstract: https://ducdauge.github.io/files/xcopa.pdf

 The Cross-lingual Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability of machine learning models to transfer commonsense reasoning across languages.
 The dataset is the translation and reannotation of the English COPA (Roemmele et al. 2011) and covers 11 languages from 11 families and several areas around the globe.
@@ -8,6 +13,8 @@ All the details about the creation of XCOPA and the implementation of the baseli

 Homepage: https://github.com/cambridgeltl/xcopa

+### Citation
+
 ```
 @inproceedings{ponti2020xcopa,
  title={{XCOPA: A} Multilingual Dataset for Causal Commonsense Reasoning},
@@ -17,3 +24,37 @@ Homepage: https://github.com/cambridgeltl/xcopa
  url={https://ducdauge.github.io/files/xcopa.pdf}
 }
 ```
+
+### Groups and Tasks
+
+#### Groups
+
+* `xcopa`
+
+#### Tasks
+
+* `xcopa_et`: Estonian
+* `xcopa_ht`: Haitian Creole
+* `xcopa_id`: Indonesian
+* `xcopa_it`: Italian
+* `xcopa_qu`: Cusco-Collao Quechua
+* `xcopa_sw`: Kiswahili
+* `xcopa_ta`: Tamil
+* `xcopa_th`: Thai
+* `xcopa_tr`: Turkish
+* `xcopa_vi`: Vietnamese
+* `xcopa_zh`: Mandarin Chinese
+
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xstorycloze/README.md
+++ b/lm_eval/tasks/xstorycloze/README.md
+# XStoryCloze
+
+### Paper
+
+Title: `Few-shot Learning with Multilingual Language Models`
+
+Abstract: https://arxiv.org/abs/2112.10668
+
+XStoryCloze consists of the professionally translated version of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) to 10 non-English languages. This dataset is released by Meta AI.
+
+Homepage: https://github.com/facebookresearch/fairseq/pull/4820
+
+
+### Citation
+
+```
+@article{DBLP:journals/corr/abs-2112-10668,
+  author    = {Xi Victoria Lin and
+               Todor Mihaylov and
+               Mikel Artetxe and
+               Tianlu Wang and
+               Shuohui Chen and
+               Daniel Simig and
+               Myle Ott and
+               Naman Goyal and
+               Shruti Bhosale and
+               Jingfei Du and
+               Ramakanth Pasunuru and
+               Sam Shleifer and
+               Punit Singh Koura and
+               Vishrav Chaudhary and
+               Brian O'Horo and
+               Jeff Wang and
+               Luke Zettlemoyer and
+               Zornitsa Kozareva and
+               Mona T. Diab and
+               Veselin Stoyanov and
+               Xian Li},
+  title     = {Few-shot Learning with Multilingual Language Models},
+  journal   = {CoRR},
+  volume    = {abs/2112.10668},
+  year      = {2021},
+  url       = {https://arxiv.org/abs/2112.10668},
+  eprinttype = {arXiv},
+  eprint    = {2112.10668},
+  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+* `xstorycloze`
+
+#### Tasks
+
+* `xstorycloze_ar`: Arabic
+* `xstorycloze_en`: English
+* `xstorycloze_es`: Spanish
+* `xstorycloze_eu`: Basque
+* `xstorycloze_hi`: Hindi
+* `xstorycloze_id`: Indonesian
+* `xstorycloze_my`: Burmese
+* `xstorycloze_ru`: Russian
+* `xstorycloze_sw`: Swahili
+* `xstorycloze_te`: Telugu
+* `xstorycloze_zh`: Chinese
+
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xwinograd/README.md
+++ b/lm_eval/tasks/xwinograd/README.md
@@ -31,7 +31,13 @@ Homepage: `https://huggingface.co/datasets/Muennighoff/xwinograd`
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* `xwinograd`
+
+#### Tasks

 List or describe tasks defined in this folder, and their names here:
 * `xwinograd_en`: Winograd schema challenges in English.

--- a/lm_eval/tasks/xwinograd/xwinograd_common_yaml
+++ b/lm_eval/tasks/xwinograd/xwinograd_common_yaml
@@ -2,9 +2,7 @@
 # It doesn't have a yaml file extension as it is not meant to be imported directly
 # by the harness.
 group:
-  - winograd
-  - commonsense
-  - multilingual
+  - xwinograd
 dataset_path: Muennighoff/xwinograd
 dataset_name: null  # Overridden by language-specific config.
 output_type: multiple_choice
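
The comment on `dataset_name` notes that the value is overridden by each language-specific config. A minimal sketch of that layering with plain dictionaries follows. The `resolve` helper and the `"en"` dataset name are assumptions made for illustration; the harness's actual include mechanism is not shown in this diff.

```python
# Shared fields, mirroring xwinograd_common_yaml after this change.
common = {
    "group": ["xwinograd"],
    "dataset_path": "Muennighoff/xwinograd",
    "dataset_name": None,  # placeholder, expected to be overridden
    "output_type": "multiple_choice",
}


def resolve(common, override):
    """Return a new config in which language-specific keys win over shared ones."""
    merged = dict(common)    # copy so the shared config is never mutated
    merged.update(override)  # language-specific values take precedence
    return merged


# Hypothetical language-specific overrides for the English task.
xwinograd_en = resolve(common, {"task": "xwinograd_en", "dataset_name": "en"})
print(xwinograd_en["group"], xwinograd_en["dataset_name"])
```

Because the group lives in the shared file, every language variant inherits `xwinograd` without repeating it.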

--- a/templates/new_yaml_task/README.md
+++ b/templates/new_yaml_task/README.md
@@ -2,7 +2,8 @@

 ### Paper

-Title: `paper title goes here`
+Title: `paper titles goes here`
+
 Abstract: `link to paper PDF or arXiv abstract goes here`

 `Short description of paper / benchmark goes here:`
@@ -16,11 +17,16 @@ Homepage: `homepage to the benchmark's website goes here, if applicable`
 BibTeX-formatted citation goes here
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* `group_name`: `Short description`
+
+#### Tasks

-List or describe tasks defined in this folder, and their names here:
 * `task_name`: `1-sentence description of what this particular task does`
-* `task_name2`: .....
+* `task_name2`: ...

 ### Checklist