Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

f71d56eb · lintangsutawika · 33f2f9bf · 2f870265 · f71d56eb · f71d56eb
Commit f71d56eb authored Aug 21, 2023 by lintangsutawika
20 changed files
--- a/lm_eval/tasks/unscramble/random_insertion.yaml
+++ b/lm_eval/tasks/unscramble/random_insertion.yaml
 group:
-  - greedy_until
+  - unscramble
 task: random_insertion
 dataset_path: EleutherAI/unscramble
 dataset_name: random_insertion_in_word

--- a/lm_eval/tasks/unscramble/reversed_words.yaml
+++ b/lm_eval/tasks/unscramble/reversed_words.yaml
 group:
-  - greedy_until
+  - unscramble
 task: reversed_words
 dataset_path: EleutherAI/unscramble
 dataset_name: reversed_words

--- a/lm_eval/tasks/webqs/README.md
+++ b/lm_eval/tasks/webqs/README.md
-# Task-name
+# WEBQs

 ### Paper

@@ -33,9 +33,14 @@ Homepage: `https://worksheets.codalab.org/worksheets/0xba659fe363cb46e7a505c5b6a
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* `freebase`
+
+#### Tasks

-List or describe tasks defined in this folder, and their names here:
 * `webqs`: `Questions with multiple accepted answers.`

 ### Checklist

--- a/lm_eval/tasks/webqs/webqs.yaml
+++ b/lm_eval/tasks/webqs/webqs.yaml
 group:
  - freebase
-  - question_answer
 task: webqs
 dataset_path: web_questions
 dataset_name: null

--- a/lm_eval/tasks/wikitext/README.md
+++ b/lm_eval/tasks/wikitext/README.md
@@ -26,7 +26,13 @@ Homepage: https://www.salesforce.com/products/einstein/ai-research/the-wikitext-
 }
 ```

-### Subtasks
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks

 * `wikitext`: measure perplexity on the Wikitext dataset, via rolling loglikelihoods.


--- a/lm_eval/tasks/wikitext/wikitext.yaml
+++ b/lm_eval/tasks/wikitext/wikitext.yaml
-group:
-  - perplexity
-  - loglikelihood_rolling
 task: wikitext
 dataset_path: EleutherAI/wikitext_document_level
 dataset_name: wikitext-2-raw-v1

--- a/lm_eval/tasks/winogrande/README.md
+++ b/lm_eval/tasks/winogrande/README.md
+# WinoGrande
+
+### Paper
+
+Title: `WinoGrande: An Adversarial Winograd Schema Challenge at Scale`
+
+Abstract: https://arxiv.org/abs/1907.10641
+
+WinoGrande is a collection of 44k problems, inspired by Winograd Schema Challenge
+(Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and
+robustness against the dataset-specific bias. Formulated as a fill-in-a-blank
+task with binary options, the goal is to choose the right option for a given
+sentence which requires commonsense reasoning.
+
+NOTE: This evaluation of Winogrande uses partial evaluation as described by
+Trinh & Le in Simple Method for Commonsense Reasoning (2018).
+See: https://arxiv.org/abs/1806.02847
+
+Homepage: https://leaderboard.allenai.org/winogrande/submissions/public
+
+
+### Citation
+
+```
+@article{sakaguchi2019winogrande,
+    title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
+    author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
+    journal={arXiv preprint arXiv:1907.10641},
+    year={2019}
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks
+
+* `winogrande`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xcopa/README.md
+++ b/lm_eval/tasks/xcopa/README.md
-## XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
-https://ducdauge.github.io/files/xcopa.pdf
+# XCOPA
+
+### Paper
+
+Title: `XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning`
+
+Abstract: https://ducdauge.github.io/files/xcopa.pdf

 The Cross-lingual Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability of machine learning models to transfer commonsense reasoning across languages.
 The dataset is the translation and reannotation of the English COPA (Roemmele et al. 2011) and covers 11 languages from 11 families and several areas around the globe.
@@ -8,6 +13,8 @@ All the details about the creation of XCOPA and the implementation of the baseli

 Homepage: https://github.com/cambridgeltl/xcopa

+### Citation
+
 ```
 @inproceedings{ponti2020xcopa,
  title={{XCOPA: A} Multilingual Dataset for Causal Commonsense Reasoning},
@@ -17,3 +24,37 @@ Homepage: https://github.com/cambridgeltl/xcopa
  url={https://ducdauge.github.io/files/xcopa.pdf}
 }
 ```
+
+### Groups and Tasks
+
+#### Groups
+
+* `xcopa`
+
+#### Tasks
+
+* `xcopa_et`: Estonian
+* `xcopa_ht`: Haitian Creole
+* `xcopa_id`: Indonesian
+* `xcopa_it`: Italian
+* `xcopa_qu`: Cusco-Collao Quechua
+* `xcopa_sw`: Kiswahili
+* `xcopa_ta`: Tamil
+* `xcopa_th`: Thai
+* `xcopa_tr`: Turkish
+* `xcopa_vi`: Vietnamese
+* `xcopa_zh`: Mandarin Chinese
+
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xnli/README.md
+++ b/lm_eval/tasks/xnli/README.md
+# XNLI
+
+### Paper
+
+Title: `XNLI: Evaluating Cross-lingual Sentence Representations`
+
+Abstract: https://arxiv.org/abs/1809.05053
+
+Based on the implementation of @yongzx (see https://github.com/EleutherAI/lm-evaluation-harness/pull/258)
+
+Prompt format (same as XGLM and mGPT):
+
+sentence1 + ", right? " + mask = (Yes|Also|No) + ", " + sentence2
+
+Predicition is the full sequence with the highest likelihood.
+
+Language specific prompts are translated word-by-word with Google Translate
+and may differ from the ones used by mGPT and XGLM (they do not provide their prompts).
+
+Homepage: https://github.com/facebookresearch/XNLI
+
+
+### Citation
+
+"""
+@InProceedings{conneau2018xnli,
+  author = "Conneau, Alexis
+        and Rinott, Ruty
+        and Lample, Guillaume
+        and Williams, Adina
+        and Bowman, Samuel R.
+        and Schwenk, Holger
+        and Stoyanov, Veselin",
+  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
+               in Natural Language Processing",
+  year = "2018",
+  publisher = "Association for Computational Linguistics",
+  location = "Brussels, Belgium",
+}
+"""
+
+### Groups and Tasks
+
+#### Groups
+
+* `xnli`
+
+#### Tasks
+
+* `xnli_ar`: Arabic
+* `xnli_bg`: Bulgarian
+* `xnli_de`: German
+* `xnli_el`: Greek
+* `xnli_en`: English
+* `xnli_es`: Spanish
+* `xnli_fr`: French
+* `xnli_hi`: Hindi
+* `xnli_ru`: Russian
+* `xnli_sw`: Swahili
+* `xnli_th`: Thai
+* `xnli_tr`: Turkish
+* `xnli_ur`: Urdu
+* `xnli_vi`: Vietnamese
+* `xnli_zh`: Chinese
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xnli/utils.py
+++ b/lm_eval/tasks/xnli/utils.py
+import argparse
+from typing import Dict, List
+
+import yaml
+
+
+# Different languages that are part of xnli.
+# These correspond to dataset names (Subsets) on HuggingFace.
+# A yaml file is generated by this script for each language.
+
+LANGUAGES = {
+    "ar": {  # Arabic
+        "QUESTION_WORD": "صحيح",
+        "ENTAILMENT_LABEL": "نعم",
+        "NEUTRAL_LABEL": "لذا",
+        "CONTRADICTION_LABEL": "رقم",
+    },
+    "bg": {  # Bulgarian
+        "QUESTION_WORD": "правилно",
+        "ENTAILMENT_LABEL": "да",
+        "NEUTRAL_LABEL": "така",
+        "CONTRADICTION_LABEL": "не",
+    },
+    "de": {  # German
+        "QUESTION_WORD": "richtig",
+        "ENTAILMENT_LABEL": "Ja",
+        "NEUTRAL_LABEL": "Auch",
+        "CONTRADICTION_LABEL": "Nein",
+    },
+    "el": {  # Greek
+        "QUESTION_WORD": "σωστός",
+        "ENTAILMENT_LABEL": "Ναί",
+        "NEUTRAL_LABEL": "Έτσι",
+        "CONTRADICTION_LABEL": "όχι",
+    },
+    "en": {  # English
+        "QUESTION_WORD": "right",
+        "ENTAILMENT_LABEL": "Yes",
+        "NEUTRAL_LABEL": "Also",
+        "CONTRADICTION_LABEL": "No",
+    },
+    "es": {  # Spanish
+        "QUESTION_WORD": "correcto",
+        "ENTAILMENT_LABEL": "Sí",
+        "NEUTRAL_LABEL": "Asi que",
+        "CONTRADICTION_LABEL": "No",
+    },
+    "fr": {  # French
+        "QUESTION_WORD": "correct",
+        "ENTAILMENT_LABEL": "Oui",
+        "NEUTRAL_LABEL": "Aussi",
+        "CONTRADICTION_LABEL": "Non",
+    },
+    "hi": {  # Hindi
+        "QUESTION_WORD": "सही",
+        "ENTAILMENT_LABEL": "हाँ",
+        "NEUTRAL_LABEL": "इसलिए",
+        "CONTRADICTION_LABEL": "नहीं",
+    },
+    "ru": {  # Russian
+        "QUESTION_WORD": "правильно",
+        "ENTAILMENT_LABEL": "Да",
+        "NEUTRAL_LABEL": "Так",
+        "CONTRADICTION_LABEL": "Нет",
+    },
+    "sw": {  # Swahili
+        "QUESTION_WORD": "sahihi",
+        "ENTAILMENT_LABEL": "Ndiyo",
+        "NEUTRAL_LABEL": "Hivyo",
+        "CONTRADICTION_LABEL": "Hapana",
+    },
+    "th": {  # Thai
+        "QUESTION_WORD": "ถูกต้อง",
+        "ENTAILMENT_LABEL": "ใช่",
+        "NEUTRAL_LABEL": "ดังนั้น",
+        "CONTRADICTION_LABEL": "ไม่",
+    },
+    "tr": {  # Turkish
+        "QUESTION_WORD": "doğru",
+        "ENTAILMENT_LABEL": "Evet",
+        "NEUTRAL_LABEL": "Böylece",
+        "CONTRADICTION_LABEL": "Hayır",
+    },
+    "ur": {  # Urdu
+        "QUESTION_WORD": "صحیح",
+        "ENTAILMENT_LABEL": "جی ہاں",
+        "NEUTRAL_LABEL": "اس لئے",
+        "CONTRADICTION_LABEL": "نہیں",
+    },
+    "vi": {  # Vietnamese
+        "QUESTION_WORD": "đúng",
+        "ENTAILMENT_LABEL": "Vâng",
+        "NEUTRAL_LABEL": "Vì vậy",
+        "CONTRADICTION_LABEL": "Không",
+    },
+    "zh": {  # Chinese
+        "QUESTION_WORD": "正确",
+        "ENTAILMENT_LABEL": "是的",
+        "NEUTRAL_LABEL": "所以",
+        "CONTRADICTION_LABEL": "不是的",
+    },
+}
+
+
+def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
+    """
+    Generate a yaml file for each language.
+
+    :param output_dir: The directory to output the files to.
+    :param overwrite: Whether to overwrite files if they already exist.
+    """
+    err = []
+    for lang in LANGUAGES.keys():
+        file_name = f"xnli_{lang}.yaml"
+        try:
+            QUESTION_WORD = LANGUAGES[lang]["QUESTION_WORD"]
+            ENTAILMENT_LABEL = LANGUAGES[lang]["ENTAILMENT_LABEL"]
+            NEUTRAL_LABEL = LANGUAGES[lang]["NEUTRAL_LABEL"]
+            CONTRADICTION_LABEL = LANGUAGES[lang]["CONTRADICTION_LABEL"]
+            with open(
+                f"{output_dir}/{file_name}", "w" if overwrite else "x", encoding="utf8"
+            ) as f:
+                f.write("# Generated by utils.py\n")
+                yaml.dump(
+                    {
+                        "include": "xnli_common_yaml",
+                        "dataset_name": lang,
+                        "task": f"xnli_{lang}",
+                        "doc_to_text": "",
+                        "doc_to_choice": f"{{{{["
+                        f"""premise+\", {QUESTION_WORD}? {ENTAILMENT_LABEL}, \"+hypothesis,"""
+                        f"""premise+\", {QUESTION_WORD}? {NEUTRAL_LABEL}, \"+hypothesis,"""
+                        f"""premise+\", {QUESTION_WORD}? {CONTRADICTION_LABEL}, \"+hypothesis"""
+                        f"]}}}}",
+                    },
+                    f,
+                    allow_unicode=True,
+                )
+        except FileExistsError:
+            err.append(file_name)
+
+    if len(err) > 0:
+        raise FileExistsError(
+            "Files were not created because they already exist (use --overwrite flag):"
+            f" {', '.join(err)}"
+        )
+
+
+def main() -> None:
+    """Parse CLI args and generate language-specific yaml files."""
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--overwrite",
+        default=False,
+        action="store_true",
+        help="Overwrite files if they already exist",
+    )
+    parser.add_argument(
+        "--output-dir", default=".", help="Directory to write yaml files to"
+    )
+    args = parser.parse_args()
+
+    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)
+
+
+if __name__ == "__main__":
+    main()
--- a/lm_eval/tasks/xnli/xnli_ar.yaml
+++ b/lm_eval/tasks/xnli/xnli_ar.yaml
+# Generated by utils.py
+dataset_name: ar
+doc_to_choice: '{{[premise+", صحيح? نعم, "+hypothesis,premise+", صحيح? لذا, "+hypothesis,premise+",
+  صحيح? رقم, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_ar
--- a/lm_eval/tasks/xnli/xnli_bg.yaml
+++ b/lm_eval/tasks/xnli/xnli_bg.yaml
+# Generated by utils.py
+dataset_name: bg
+doc_to_choice: '{{[premise+", правилно? да, "+hypothesis,premise+", правилно? така,
+  "+hypothesis,premise+", правилно? не, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_bg
--- a/lm_eval/tasks/xnli/xnli_common_yaml
+++ b/lm_eval/tasks/xnli/xnli_common_yaml
+# This file will be included in the generated language-specific task configs.
+# It doesn't have a yaml file extension as it is not meant to be imported directly
+# by the harness.
+group: xnli
+task: null
+dataset_path: xnli
+dataset_name: null
+output_type: multiple_choice
+training_split: train
+validation_split: validation
+doc_to_text: null
+doc_to_target: label
+doc_to_choice: null
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
--- a/lm_eval/tasks/xnli/xnli_de.yaml
+++ b/lm_eval/tasks/xnli/xnli_de.yaml
+# Generated by utils.py
+dataset_name: de
+doc_to_choice: '{{[premise+", richtig? Ja, "+hypothesis,premise+", richtig? Auch,
+  "+hypothesis,premise+", richtig? Nein, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_de
--- a/lm_eval/tasks/xnli/xnli_el.yaml
+++ b/lm_eval/tasks/xnli/xnli_el.yaml
+# Generated by utils.py
+dataset_name: el
+doc_to_choice: '{{[premise+", σωστός? Ναί, "+hypothesis,premise+", σωστός? Έτσι, "+hypothesis,premise+",
+  σωστός? όχι, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_el
--- a/lm_eval/tasks/xnli/xnli_en.yaml
+++ b/lm_eval/tasks/xnli/xnli_en.yaml
+# Generated by utils.py
+dataset_name: en
+doc_to_choice: '{{[premise+", right? Yes, "+hypothesis,premise+", right? Also, "+hypothesis,premise+",
+  right? No, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_en
--- a/lm_eval/tasks/xnli/xnli_es.yaml
+++ b/lm_eval/tasks/xnli/xnli_es.yaml
+# Generated by utils.py
+dataset_name: es
+doc_to_choice: '{{[premise+", correcto? Sí, "+hypothesis,premise+", correcto? Asi
+  que, "+hypothesis,premise+", correcto? No, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_es
--- a/lm_eval/tasks/xnli/xnli_fr.yaml
+++ b/lm_eval/tasks/xnli/xnli_fr.yaml
+# Generated by utils.py
+dataset_name: fr
+doc_to_choice: '{{[premise+", correct? Oui, "+hypothesis,premise+", correct? Aussi,
+  "+hypothesis,premise+", correct? Non, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_fr
--- a/lm_eval/tasks/xnli/xnli_hi.yaml
+++ b/lm_eval/tasks/xnli/xnli_hi.yaml
+# Generated by utils.py
+dataset_name: hi
+doc_to_choice: '{{[premise+", सही? हाँ, "+hypothesis,premise+", सही? इसलिए, "+hypothesis,premise+",
+  सही? नहीं, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_hi
--- a/lm_eval/tasks/xnli/xnli_ru.yaml
+++ b/lm_eval/tasks/xnli/xnli_ru.yaml
+# Generated by utils.py
+dataset_name: ru
+doc_to_choice: '{{[premise+", правильно? Да, "+hypothesis,premise+", правильно? Так,
+  "+hypothesis,premise+", правильно? Нет, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_ru