Commit a27e8ed1 authored by lintangsutawika's avatar lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into squadv2

parents fc329d31 4cda3a1c
group:
- perplexity
- loglikelihood_rolling
task: wikitext
dataset_path: EleutherAI/wikitext_document_level
dataset_name: wikitext-2-raw-v1
......
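For context, a `loglikelihood_rolling` task scores each full document with the model and reports perplexity. A minimal sketch (helper name and inputs are hypothetical, not harness code) of how summed per-document log-likelihoods become word-level perplexity:

```python
import math


def word_perplexity(doc_loglikelihoods, docs):
    """Word-level perplexity from per-document summed log-likelihoods."""
    total_ll = sum(doc_loglikelihoods)
    # Normalize by the total word count across all documents.
    n_words = sum(len(d.split()) for d in docs)
    return math.exp(-total_ll / n_words)


print(word_perplexity([-10.0, -14.0], ["one two three", "four five six"]))
# exp(24 / 6) = exp(4.0) ≈ 54.598
```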
# WinoGrande
### Paper
Title: `WinoGrande: An Adversarial Winograd Schema Challenge at Scale`
Abstract: https://arxiv.org/abs/1907.10641
WinoGrande is a collection of 44k problems, inspired by the Winograd Schema Challenge
(Levesque, Davis, and Morgenstern 2011) but adjusted to improve both scale and
robustness against dataset-specific bias. Each problem is formulated as a
fill-in-the-blank task with two options, and choosing the right option for a
given sentence requires commonsense reasoning.
NOTE: This evaluation of WinoGrande uses partial evaluation as described by
Trinh & Le in `A Simple Method for Commonsense Reasoning` (2018).
See: https://arxiv.org/abs/1806.02847
Homepage: https://leaderboard.allenai.org/winogrande/submissions/public
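The partial-evaluation trick can be sketched as follows; `logprob_fn`, `partial_score`, and the toy scorer are hypothetical stand-ins, not the harness's implementation. Each option is substituted into the blank, and only the shared continuation after the blank is scored:

```python
def partial_score(logprob_fn, prefix_with_option, continuation):
    """Log-likelihood of `continuation` conditioned on the filled-in prefix."""
    return sum(logprob_fn(prefix_with_option, continuation))


def pick_option(logprob_fn, sentence, options):
    # Split at the blank; everything after it is shared by both options.
    prefix, _, continuation = sentence.partition("_")
    scores = [
        partial_score(logprob_fn, prefix + opt, continuation) for opt in options
    ]
    return max(range(len(options)), key=scores.__getitem__)


# Toy scorer (made up): pretends continuations after "trophy" are likelier.
def toy_logprob_fn(prefix, continuation):
    return [0.0] if "trophy" in prefix else [-1.0]


print(pick_option(toy_logprob_fn,
                  "The _ didn't fit because it was too big.",
                  ["trophy", "suitcase"]))  # -> 0
```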
### Citation
```
@article{sakaguchi2019winogrande,
  title   = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
  author  = {Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
  journal = {arXiv preprint arXiv:1907.10641},
  year    = {2019}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `winogrande`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# WMT16
### Paper
Title: `Findings of the 2016 Conference on Machine Translation`
Abstract: http://www.aclweb.org/anthology/W/W16/W16-2301
Homepage: https://huggingface.co/datasets/wmt16
### Citation
```
@InProceedings{bojar-EtAl:2016:WMT1,
  author    = {Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Neveol, Aurelie and Neves, Mariana and Popel, Martin and Post, Matt and Rubino, Raphael and Scarton, Carolina and Specia, Lucia and Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},
  title     = {Findings of the 2016 Conference on Machine Translation},
  booktitle = {Proceedings of the First Conference on Machine Translation},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {131--198},
  url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}
```
### Groups and Tasks
#### Groups
* `wmt-t5-prompt`: Group for all WMT tasks that use the prompt templates from T5 (`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`)
#### Tasks
With specific prompt styles:
* `wmt-ro-en-t5-prompt`: WMT16 ro-en evaluated with the T5 prompt template for English-to-Romanian translation
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import evaluate


def bleu(predictions, references):
    # Per-instance "metric" is a pass-through: BLEU is a corpus-level score,
    # so we only collect the (prediction, reference) pair for aggregation.
    return (predictions[0], references[0])


def agg_bleu(items):
    # Corpus-level aggregation: compute BLEU over all collected pairs at once.
    bleu_fn = evaluate.load("bleu")
    predictions, references = zip(*items)
    return bleu_fn.compute(predictions=predictions, references=references)["bleu"]
group:
  - wmt-t5-prompt
task: wmt-ro-en-t5-prompt
dataset_path: wmt16
dataset_name: ro-en
training_split: train
validation_split: validation
output_type: greedy_until
doc_to_text: "translate English to Romanian: {{translation.en}}"
doc_to_target: "{{translation.ro}}"
metric_list:
  - metric: wer
    aggregation: mean
    higher_is_better: false
  - metric: !function metrics.bleu
    aggregation: !function metrics.agg_bleu
    higher_is_better: true
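The config above mixes a built-in metric (`wer`) with `!function` hooks. As a reminder of what word error rate measures, here is a minimal self-contained sketch (a hypothetical helper, not the harness's or `evaluate`'s implementation): word-level Levenshtein distance divided by reference length.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("a b c", "a x c"))  # one substitution out of 3 words -> 0.333...
```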
# XCOPA
### Paper
Title: `XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning`
Abstract: https://ducdauge.github.io/files/xcopa.pdf
The Cross-lingual Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability of machine learning models to transfer commonsense reasoning across languages.
The dataset is the translation and reannotation of the English COPA (Roemmele et al. 2011) and covers 11 languages from 11 families and several areas around the globe.
......@@ -8,6 +13,8 @@ All the details about the creation of XCOPA and the implementation of the baseli
Homepage: https://github.com/cambridgeltl/xcopa
### Citation
```
@inproceedings{ponti2020xcopa,
title={{XCOPA: A} Multilingual Dataset for Causal Commonsense Reasoning},
......@@ -17,3 +24,37 @@ Homepage: https://github.com/cambridgeltl/xcopa
url={https://ducdauge.github.io/files/xcopa.pdf}
}
```
### Groups and Tasks
#### Groups
* `xcopa`
#### Tasks
* `xcopa_et`: Estonian
* `xcopa_ht`: Haitian Creole
* `xcopa_id`: Indonesian
* `xcopa_it`: Italian
* `xcopa_qu`: Cusco-Collao Quechua
* `xcopa_sw`: Kiswahili
* `xcopa_ta`: Tamil
* `xcopa_th`: Thai
* `xcopa_tr`: Turkish
* `xcopa_vi`: Vietnamese
* `xcopa_zh`: Mandarin Chinese
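XCOPA examples pair a premise with two alternatives and a question type; a common prompt format joins the premise to each choice with a connective chosen by the question type ("cause" → "because", "effect" → "so"). A hypothetical sketch (the helper is illustrative; field names follow the HF dataset):

```python
# Connective per question type (illustrative formatting, not harness code).
CONNECTOR = {"cause": "because", "effect": "so"}


def doc_to_text(doc: dict) -> str:
    # Drop the premise's trailing period and append the connective;
    # each choice would then be scored as a continuation of this prompt.
    return doc["premise"].rstrip(".") + " " + CONNECTOR[doc["question"]]


doc = {
    "premise": "The man broke his toe.",
    "question": "cause",
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
}
print(doc_to_text(doc))  # -> "The man broke his toe because"
```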
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# XStoryCloze
### Paper
Title: `Few-shot Learning with Multilingual Language Models`
Abstract: https://arxiv.org/abs/2112.10668
XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 non-English languages. This dataset is released by Meta AI.
Homepage: https://github.com/facebookresearch/fairseq/pull/4820
### Citation
```
@article{DBLP:journals/corr/abs-2112-10668,
  author     = {Xi Victoria Lin and Todor Mihaylov and Mikel Artetxe and Tianlu Wang and Shuohui Chen and Daniel Simig and Myle Ott and Naman Goyal and Shruti Bhosale and Jingfei Du and Ramakanth Pasunuru and Sam Shleifer and Punit Singh Koura and Vishrav Chaudhary and Brian O'Horo and Jeff Wang and Luke Zettlemoyer and Zornitsa Kozareva and Mona T. Diab and Veselin Stoyanov and Xian Li},
  title      = {Few-shot Learning with Multilingual Language Models},
  journal    = {CoRR},
  volume     = {abs/2112.10668},
  year       = {2021},
  url        = {https://arxiv.org/abs/2112.10668},
  eprinttype = {arXiv},
  eprint     = {2112.10668},
  timestamp  = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
### Groups and Tasks
#### Groups
* `xstorycloze`
#### Tasks
* `xstorycloze_ar`: Arabic
* `xstorycloze_en`: English
* `xstorycloze_es`: Spanish
* `xstorycloze_eu`: Basque
* `xstorycloze_hi`: Hindi
* `xstorycloze_id`: Indonesian
* `xstorycloze_my`: Burmese
* `xstorycloze_ru`: Russian
* `xstorycloze_sw`: Swahili
* `xstorycloze_te`: Telugu
* `xstorycloze_zh`: Chinese
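Like other multiple-choice tasks in the harness, XStoryCloze is typically scored by comparing the log-likelihood the model assigns to each candidate ending. A toy sketch (illustrative helper with made-up numbers, not harness code):

```python
def accuracy(loglikelihoods, golds):
    """Accuracy given per-example lists of per-choice log-likelihoods."""
    # Predict the choice with the highest log-likelihood for each example.
    preds = [max(range(len(lls)), key=lls.__getitem__) for lls in loglikelihoods]
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


# Two examples, two candidate endings each (values are made up).
print(accuracy([[-12.3, -9.1], [-7.5, -8.8]], [1, 0]))  # -> 1.0
```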
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
......@@ -31,7 +31,13 @@ Homepage: `https://huggingface.co/datasets/Muennighoff/xwinograd`
}
```
### Groups and Tasks
#### Groups
* `xwinograd`
#### Tasks
* `xwinograd_en`: Winograd schema challenges in English.
......
......@@ -2,9 +2,7 @@
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group:
- xwinograd
dataset_path: Muennighoff/xwinograd
dataset_name: null # Overridden by language-specific config.
output_type: multiple_choice
......
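A language-specific config can then reuse these shared settings and override only what differs per language. A hypothetical example (the file name and `include` key are assumed to follow the harness's conventions, not taken from this diff):

```yaml
# Hypothetical xwinograd_en.yaml: pull in the shared settings above and
# override the language-specific fields.
include: xwinograd_common.yaml
task: xwinograd_en
dataset_name: en
```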
......@@ -2,7 +2,8 @@
### Paper
Title: `paper title goes here`
Abstract: `link to paper PDF or arXiv abstract goes here`
`Short description of paper / benchmark goes here:`
......@@ -16,11 +17,16 @@ Homepage: `homepage to the benchmark's website goes here, if applicable`
BibTeX-formatted citation goes here
```
### Groups and Tasks
#### Groups
* `group_name`: `Short description`
#### Tasks
List or describe tasks defined in this folder, and their names here:
* `task_name`: `1-sentence description of what this particular task does`
* `task_name2`: ...
### Checklist
......
from __future__ import annotations
import pytest
import numpy as np
from lm_eval.models.huggingface import HFLM
from lm_eval.api.instance import Instance
import lm_eval.tasks as tasks
class Test_HFLM:
    # Build a small batch of requests for each request type once, at class
    # construction time, so every test reuses the same instances.
    multiple_choice_task = tasks.TASK_REGISTRY.get("arc_easy")()  # type: ignore
    multiple_choice_task.build_all_requests(limit=10, rank=0, world_size=1)
    MULTIPLE_CH: list[Instance] = multiple_choice_task.instances

    greedy_until_task = tasks.TASK_REGISTRY.get("gsm8k_yaml")()  # type: ignore
    greedy_until_task.build_all_requests(limit=10, rank=0, world_size=1)
    greedy_until_task._config.generation_kwargs["max_gen_toks"] = 10
    GREEDY_UNTIL: list[Instance] = greedy_until_task.instances

    rolling_task = tasks.TASK_REGISTRY.get("wikitext")()  # type: ignore
    rolling_task.build_all_requests(limit=10, rank=0, world_size=1)
    ROLLING: list[Instance] = rolling_task.instances

    # Reference values produced by EleutherAI/pythia-70m on CPU in float32.
    MULTIPLE_CH_RES = [
        -41.902435302734375,
        -42.939308166503906,
        -33.914180755615234,
        -37.07139205932617,
        -22.95258331298828,
        -20.342208862304688,
        -14.818366050720215,
        -27.942853927612305,
        -15.80704116821289,
        -15.936427116394043,
        -13.052018165588379,
        -18.04828453063965,
        -13.345029830932617,
        -13.366025924682617,
        -12.127134323120117,
        -11.872495651245117,
        -47.10598373413086,
        -47.76410675048828,
        -36.4406852722168,
        -50.0289421081543,
        -16.72093963623047,
        -18.535587310791016,
        -26.46993637084961,
        -20.355995178222656,
        -17.757919311523438,
        -21.80595588684082,
        -33.1990852355957,
        -39.28636932373047,
        -14.759679794311523,
        -16.753942489624023,
        -11.486852645874023,
        -15.42177677154541,
        -13.15798282623291,
        -15.887393951416016,
        -15.28614616394043,
        -12.339089393615723,
        -44.59441375732422,
        -55.40888214111328,
        -52.70050811767578,
        -56.25089645385742,
    ]
    GREEDY_UNTIL_RES = [
        " The average of $2.50 each is $",
        " A robe takes 2 bolts of blue fiber and half",
        " $50,000 in repairs.",
        " He runs 1 sprint 3 times a week.",
        " They feed each of her chickens three cups of mixed",
        " The price of the glasses is $5, but",
        " The total percentage of students who said they like to",
        " Carla is downloading a 200 GB file. Normally",
        " John drives for 3 hours at a speed of 60",
        " Eliza sells 4 tickets to 5 friends so she",
    ]
    ROLLING_RES = [
        -3603.6328125,
        -19779.23974609375,
        -8834.16455078125,
        -27967.591796875,
        -7636.794982910156,
        -9491.93505859375,
        -41043.4248046875,
        -8397.689819335938,
        -45969.47155761719,
        -7158.90625,
    ]
    LM = HFLM(pretrained="EleutherAI/pythia-70m", device="cpu", dtype="float32")

    def test_loglikelihood(self) -> None:
        res = self.LM.loglikelihood(self.MULTIPLE_CH)
        _RES, _res = self.MULTIPLE_CH_RES, [r[0] for r in res]
        # change atol in case of consistent failure
        assert np.allclose(_res, _RES, atol=1e-4)
        # check argmax indices for multiple choice (4 options per question)
        argmax_RES, argmax_res = np.argmax(
            np.array(_RES).reshape(-1, 4), axis=1
        ), np.argmax(np.array(_res).reshape(-1, 4), axis=1)
        assert (argmax_RES == argmax_res).all()

    def test_greedy_until(self) -> None:
        res = self.LM.greedy_until(self.GREEDY_UNTIL)
        assert res == self.GREEDY_UNTIL_RES

    def test_loglikelihood_rolling(self) -> None:
        res = self.LM.loglikelihood_rolling(self.ROLLING)
        assert np.allclose(res, self.ROLLING_RES, atol=1e-2)

    def test_tok_encode(self) -> None:
        res = self.LM.tok_encode("foo bar")
        assert res == [12110, 2534]

    def test_tok_decode(self) -> None:
        res = self.LM.tok_decode([12110, 2534])
        assert res == "foo bar"

    def test_batch_encode(self) -> None:
        res = self.LM.tok_batch_encode(["foo bar", "bar foo"])[0].tolist()
        assert res == [[12110, 2534], [2009, 17374]]

    def test_model_generate(self) -> None:
        context = self.LM.tok_batch_encode(["foo bar"])[0]
        res = self.LM._model_generate(context, max_length=10, stop=["\n\n"])
        res = self.LM.tok_decode(res[0])
        assert res == "foo bar\n<bazhang>!info bar"