Commit 9735ac34 authored by haileyschoelkopf

add lambada_mt tasks

parent 7f557daa
# LAMBADA
### Paper
The LAMBADA dataset: Word prediction requiring a broad discourse context
https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
```bibtex
@misc{paperno2016lambada,
  author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
  title={The LAMBADA dataset},
  DOI={10.5281/zenodo.2630551},
  publisher={Zenodo},
  year={2016},
  month={Aug}
}
```
### Subtasks
* `lambada_openai_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's LAMBADA variant (a usage sketch follows below).
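
As a usage sketch (not part of this commit): the subtasks can be selected by the `task` names defined in the configs below. The registered model type (`"hf"` here) and argument names vary across harness versions, so treat this as illustrative only.

```python
from lm_eval import evaluator

# Illustrative only: the registered model name ("hf") and the exact
# simple_evaluate arguments may differ between harness versions.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # any causal LM checkpoint
    tasks=["lambada_openai_mt_en", "lambada_openai_mt_de"],
)
print(results["results"])  # per-task perplexity and accuracy
```
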
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
(This task is novel to the Evaluation Harness, and has been checked against v0.3.0 of the harness.)
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

include: lambada_mt_en.yaml
group:
  - lambada_multilingual
  - loglikelihood
  - perplexity
task: lambada_openai_mt_de
dataset_name: de

group:
  - lambada_multilingual
  - loglikelihood
  - perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
output_type: loglikelihood
test_split: test
template_aliases: ""
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  - metric: perplexity
    aggregation: perplexity
    higher_is_better: false
  - metric: acc
    aggregation: mean
    higher_is_better: true
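
The two Jinja templates above split each passage on single spaces: the prompt (`doc_to_text`) is everything except the final word, and the target (`doc_to_target`) is the final word with a leading space. A rough Python equivalent, for illustration only (the harness renders the Jinja templates, not this function):

```python
def split_lambada_doc(text: str):
    """Mirror the doc_to_text / doc_to_target templates above (illustrative)."""
    words = text.split(" ")
    context = " ".join(words[:-1])  # doc_to_text: all but the last word
    target = " " + words[-1]        # doc_to_target: leading space + last word
    return context, target


# Made-up example passage, not taken from the dataset:
ctx, tgt = split_lambada_doc("she finally found the missing keys")
assert ctx == "she finally found the missing"
assert tgt == " keys"
```
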
include: lambada_mt_en.yaml
group:
  - lambada_multilingual
  - loglikelihood
  - perplexity
task: lambada_openai_mt_es
dataset_name: es

include: lambada_mt_en.yaml
group:
  - lambada_multilingual
  - loglikelihood
  - perplexity
task: lambada_openai_mt_fr
dataset_name: fr

include: lambada_mt_en.yaml
group:
  - lambada_multilingual
  - loglikelihood
  - perplexity
task: lambada_openai_mt_it
dataset_name: it
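
Each per-language config above only overrides `task` and `dataset_name`, inheriting everything else from `lambada_mt_en.yaml` via `include`. A minimal sketch of how such an include-and-override merge behaves, assuming keys in the including file take precedence (this is not the harness's own loader, which may resolve includes differently):

```python
import os
import yaml  # pyyaml


def load_task_config(path):
    """Illustrative include-and-override merge; not the harness's actual loader."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    include = cfg.pop("include", None)
    if include:
        # Assume the included file lives next to the including one.
        base = load_task_config(os.path.join(os.path.dirname(path), include))
        base.update(cfg)  # keys in the including file (task, dataset_name) win
        cfg = base
    return cfg

# e.g. a per-language config resolves to the English base config with
# `task` and `dataset_name` replaced.
```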