Merge pull request #759 from EleutherAI/xstorycloze

[Refactor] XStoryCloze

Commit e85ca1a9, authored Aug 15, 2023 by Lintang Sutawika; committed by GitHub Aug 15, 2023
16 changed files
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -45,14 +45,14 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [ ] Translation (WMT) suite (Hailey)
 - [x] Unscramble
 - [x] ~~Pile (perplexity)~~
-- [ ] BLiMP (Lintang)
+- [x] BLiMP
 - [x] ToxiGen
-- [ ] StoryCloze (Lintang)
+- [x] StoryCloze
 - [ ] NaturalQs (Hailey)
 - [x] CrowS-Pairs
 - [x] XCopa
 - [ ] BIG-Bench (Hailey)
-- [ ] XStoryCloze (Lintang)
+- [x] XStoryCloze
 - [x] XWinograd
 - [ ] PAWS-X (Lintang)
 - [x] XNLI

--- a/lm_eval/tasks/storycloze/README.md
+++ b/lm_eval/tasks/storycloze/README.md
+# StoryCloze
+
+### Paper
+
+Title: `Few-shot Learning with Multilingual Language Models`
+Abstract: https://arxiv.org/abs/2112.10668
+
+XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 non-English languages. The dataset was released by Meta AI.
+
+Homepage: https://github.com/facebookresearch/fairseq/pull/4820
+
+
+### Citation
+
+```
+@article{DBLP:journals/corr/abs-2112-10668,
+  author    = {Xi Victoria Lin and
+               Todor Mihaylov and
+               Mikel Artetxe and
+               Tianlu Wang and
+               Shuohui Chen and
+               Daniel Simig and
+               Myle Ott and
+               Naman Goyal and
+               Shruti Bhosale and
+               Jingfei Du and
+               Ramakanth Pasunuru and
+               Sam Shleifer and
+               Punit Singh Koura and
+               Vishrav Chaudhary and
+               Brian O'Horo and
+               Jeff Wang and
+               Luke Zettlemoyer and
+               Zornitsa Kozareva and
+               Mona T. Diab and
+               Veselin Stoyanov and
+               Xian Li},
+  title     = {Few-shot Learning with Multilingual Language Models},
+  journal   = {CoRR},
+  volume    = {abs/2112.10668},
+  year      = {2021},
+  url       = {https://arxiv.org/abs/2112.10668},
+  eprinttype = {arXiv},
+  eprint    = {2112.10668},
+  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
+
+### Subtasks
+
+List or describe tasks defined in this folder, and their names here:
+* `task_name`: `1-sentence description of what this particular task does`
+* `task_name2`: .....
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/storycloze/storycloze_2016.yaml
+++ b/lm_eval/tasks/storycloze/storycloze_2016.yaml
+group: storycloze
+task: storycloze_2016
+dataset_path: story_cloze
+dataset_name: 2016
+output_type: multiple_choice
+validation_split: validation
+test_split: test
+doc_to_text: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}"
+doc_to_target: "{{answer_right_ending-1}}"
+doc_to_choice: "{{[sentence_quiz1, sentence_quiz2]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
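
Review note: the three templates in `storycloze_2016.yaml` are easier to check against a concrete record. The sketch below is a plain-Python equivalent of what the Jinja expressions evaluate to, not the harness's own rendering code. The field names are taken from the config; the example document and the helper function names are made up for illustration.

```python
# Plain-Python equivalent of the Jinja templates in storycloze_2016.yaml.
# Field names come from the config; the example document is invented.

def doc_to_text(doc):
    # Mirrors {{[input_sentence_1, ..., input_sentence_4]|join(' ')}}
    return " ".join(doc[f"input_sentence_{i}"] for i in range(1, 5))


def doc_to_target(doc):
    # Mirrors {{answer_right_ending-1}}: the dataset label is 1-based,
    # while the index into the choice list is 0-based.
    return doc["answer_right_ending"] - 1


def doc_to_choice(doc):
    # Mirrors {{[sentence_quiz1, sentence_quiz2]}}
    return [doc["sentence_quiz1"], doc["sentence_quiz2"]]


example_doc = {
    "input_sentence_1": "Anna planted a seed.",
    "input_sentence_2": "She watered it every day.",
    "input_sentence_3": "A sprout appeared after a week.",
    "input_sentence_4": "By summer it had grown tall.",
    "sentence_quiz1": "Anna threw the plant away.",
    "sentence_quiz2": "Anna was proud of her plant.",
    "answer_right_ending": 2,
}

context = doc_to_text(example_doc)
choices = doc_to_choice(example_doc)
gold_ending = choices[doc_to_target(example_doc)]

print(context)
print(gold_ending)
```

The `-1` in `doc_to_target` is the part worth double-checking: without it, a label of `2` would index past the end of the two-element choice list.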
--- a/lm_eval/tasks/storycloze/storycloze_2018.yaml
+++ b/lm_eval/tasks/storycloze/storycloze_2018.yaml
+group: storycloze
+task: storycloze_2018
+dataset_path: story_cloze
+dataset_name: 2018
+output_type: multiple_choice
+validation_split: validation
+test_split: test
+doc_to_text: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}"
+doc_to_target: "{{answer_right_ending-1}}"
+doc_to_choice: "{{[sentence_quiz1, sentence_quiz2]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
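
Review note: the `metric_list` entry (`acc`, aggregated by `mean`) can be read as "fraction of documents where the highest-scoring ending is the correct one". The sketch below illustrates that reading under the assumption that each choice receives a log-likelihood score and the largest one wins; the diff does not show the harness's scoring code, and the scores and function names here are invented.

```python
# Sketch of accuracy with mean aggregation for a two-choice task.
# The log-likelihood values below are made up for illustration.

def predict(loglikelihoods):
    # Index of the choice with the highest score.
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def accuracy(scored_docs):
    # One 0/1 hit per document, then the mean over all documents.
    hits = [1.0 if predict(lls) == target else 0.0 for lls, target in scored_docs]
    return sum(hits) / len(hits)


# (per-choice scores, 0-based gold index as produced by doc_to_target)
scored_docs = [
    ([-4.2, -6.1], 0),
    ([-7.5, -3.3], 1),
    ([-2.0, -2.9], 1),
    ([-5.0, -8.8], 0),
]

acc = accuracy(scored_docs)
print(acc)  # 0.75: three of the four predictions match the gold index
```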
--- a/lm_eval/tasks/xstorycloze/README.md
+++ b/lm_eval/tasks/xstorycloze/README.md
+# XStoryCloze
+
+### Paper
+
+Title: `Few-shot Learning with Multilingual Language Models`
+Abstract: https://arxiv.org/abs/2112.10668
+
+XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 non-English languages. The dataset was released by Meta AI.
+
+Homepage: https://github.com/facebookresearch/fairseq/pull/4820
+
+
+### Citation
+
+```
+@article{DBLP:journals/corr/abs-2112-10668,
+  author    = {Xi Victoria Lin and
+               Todor Mihaylov and
+               Mikel Artetxe and
+               Tianlu Wang and
+               Shuohui Chen and
+               Daniel Simig and
+               Myle Ott and
+               Naman Goyal and
+               Shruti Bhosale and
+               Jingfei Du and
+               Ramakanth Pasunuru and
+               Sam Shleifer and
+               Punit Singh Koura and
+               Vishrav Chaudhary and
+               Brian O'Horo and
+               Jeff Wang and
+               Luke Zettlemoyer and
+               Zornitsa Kozareva and
+               Mona T. Diab and
+               Veselin Stoyanov and
+               Xian Li},
+  title     = {Few-shot Learning with Multilingual Language Models},
+  journal   = {CoRR},
+  volume    = {abs/2112.10668},
+  year      = {2021},
+  url       = {https://arxiv.org/abs/2112.10668},
+  eprinttype = {arXiv},
+  eprint    = {2112.10668},
+  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
+
+### Subtasks
+
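+The tasks defined in this folder, one per language:
+* `xstorycloze_ar`: XStoryCloze in Arabic.
+* `xstorycloze_en`: XStoryCloze in English.
+* `xstorycloze_es`: XStoryCloze in Spanish.
+* `xstorycloze_eu`: XStoryCloze in Basque.
+* `xstorycloze_hi`: XStoryCloze in Hindi.
+* `xstorycloze_id`: XStoryCloze in Indonesian.
+* `xstorycloze_my`: XStoryCloze in Burmese.
+* `xstorycloze_ru`: XStoryCloze in Russian.
+* `xstorycloze_sw`: XStoryCloze in Swahili.
+* `xstorycloze_te`: XStoryCloze in Telugu.
+* `xstorycloze_zh`: XStoryCloze in Chinese.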
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xstorycloze/default_ar.yaml
+++ b/lm_eval/tasks/xstorycloze/default_ar.yaml
+group: xstorycloze
+task: xstorycloze_ar
+dataset_path: juletxara/xstory_cloze
+dataset_name: ar
+output_type: multiple_choice
+training_split: train
+validation_split: eval
+doc_to_text: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}"
+doc_to_target: "{{answer_right_ending-1}}"
+doc_to_choice: "{{[sentence_quiz1, sentence_quiz2]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4]|join(' ')}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
--- a/lm_eval/tasks/xstorycloze/default_en.yaml
+++ b/lm_eval/tasks/xstorycloze/default_en.yaml
+include: default_ar.yaml
+task: xstorycloze_en
+dataset_name: en
--- a/lm_eval/tasks/xstorycloze/default_es.yaml
+++ b/lm_eval/tasks/xstorycloze/default_es.yaml
+include: default_ar.yaml
+task: xstorycloze_es
+dataset_name: es
--- a/lm_eval/tasks/xstorycloze/default_eu.yaml
+++ b/lm_eval/tasks/xstorycloze/default_eu.yaml
+include: default_ar.yaml
+task: xstorycloze_eu
+dataset_name: eu
--- a/lm_eval/tasks/xstorycloze/default_hi.yaml
+++ b/lm_eval/tasks/xstorycloze/default_hi.yaml
+include: default_ar.yaml
+task: xstorycloze_hi
+dataset_name: hi
--- a/lm_eval/tasks/xstorycloze/default_id.yaml
+++ b/lm_eval/tasks/xstorycloze/default_id.yaml
+include: default_ar.yaml
+task: xstorycloze_id
+dataset_name: id
--- a/lm_eval/tasks/xstorycloze/default_my.yaml
+++ b/lm_eval/tasks/xstorycloze/default_my.yaml
+include: default_ar.yaml
+task: xstorycloze_my
+dataset_name: my
--- a/lm_eval/tasks/xstorycloze/default_ru.yaml
+++ b/lm_eval/tasks/xstorycloze/default_ru.yaml
+include: default_ar.yaml
+task: xstorycloze_ru
+dataset_name: ru
--- a/lm_eval/tasks/xstorycloze/default_sw.yaml
+++ b/lm_eval/tasks/xstorycloze/default_sw.yaml
+include: default_ar.yaml
+task: xstorycloze_sw
+dataset_name: sw
--- a/lm_eval/tasks/xstorycloze/default_te.yaml
+++ b/lm_eval/tasks/xstorycloze/default_te.yaml
+include: default_ar.yaml
+task: xstorycloze_te
+dataset_name: te
--- a/lm_eval/tasks/xstorycloze/default_zh.yaml
+++ b/lm_eval/tasks/xstorycloze/default_zh.yaml
+include: default_ar.yaml
+task: xstorycloze_zh
+dataset_name: zh
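
Review note: each per-language file only sets `include`, `task`, and `dataset_name`, so everything else is expected to come from `default_ar.yaml`. The sketch below models that as a dictionary merge in which the including file's keys win. That merge order is an assumption about how `include` is resolved, not something shown in this diff, and `resolve` is a hypothetical helper; the base values are abbreviated from `default_ar.yaml`.

```python
# Sketch of include-style config inheritance using plain dicts.
# Assumption: keys in the including file override keys from the base file.

BASE_AR = {
    "group": "xstorycloze",
    "task": "xstorycloze_ar",
    "dataset_path": "juletxara/xstory_cloze",
    "dataset_name": "ar",
    "output_type": "multiple_choice",
    "training_split": "train",
    "validation_split": "eval",
}


def resolve(base_config, overrides):
    # Copy the base, then apply every local key except the include directive.
    merged = dict(base_config)
    merged.update({k: v for k, v in overrides.items() if k != "include"})
    return merged


default_en = {
    "include": "default_ar.yaml",
    "task": "xstorycloze_en",
    "dataset_name": "en",
}

config_en = resolve(BASE_AR, default_en)
print(config_en["task"], config_en["dataset_name"], config_en["dataset_path"])
```

Under this reading, a typo in a per-language `task` or `dataset_name` would silently fall back to nothing useful rather than to the Arabic values, since both keys are always overridden; the inherited keys are the ones to verify in `default_ar.yaml` itself.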