Commit f71d56eb authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

parents 33f2f9bf 2f870265
# ARC
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
https://arxiv.org/pdf/1803.05457.pdf
### Paper
Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Abstract: https://arxiv.org/abs/1803.05457
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
......@@ -13,7 +16,9 @@ a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questi
Homepage: https://allenai.org/data/arc
### Citation
```
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
......@@ -23,3 +28,27 @@ Homepage: https://allenai.org/data/arc
volume={abs/1803.05457}
}
```
### Groups and Tasks
#### Groups
* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`
#### Tasks
* `arc_easy`
* `arc_challenge`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: arc_easy.yaml
group:
- ai2_arc
- multiple_choice
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
group:
- ai2_arc
- multiple_choice
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
......
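As the two configs above show, `arc_challenge` simply does `include: arc_easy.yaml` and overrides the fields that differ. Below is a minimal sketch of that include-then-override behaviour using only PyYAML; the harness's real loader additionally resolves functions, groups, and other fields.

```python
import os
import yaml


def load_task_config(path):
    """Load a task YAML, merging in the file named by `include`, if any."""
    with open(path) as f:
        config = yaml.safe_load(f)
    if "include" in config:
        # The included file supplies defaults; the including file wins on conflicts.
        base_path = os.path.join(os.path.dirname(path), config.pop("include"))
        config = {**load_task_config(base_path), **config}
    return config


# e.g. load_task_config("arc_challenge.yaml") starts from arc_easy.yaml and
# then overrides task, dataset_name, etc.
```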
# Arithmetic
### Paper
Title: `Language Models are Few-Shot Learners`
Abstract: https://arxiv.org/abs/2005.14165
A small battery of 10 tests that involve asking language models a simple arithmetic
problem in natural language.
Homepage: https://github.com/openai/gpt-3/tree/master/data
### Citation
```
@inproceedings{NEURIPS2020_1457c0d6,
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {1877--1901},
publisher = {Curran Associates, Inc.},
title = {Language Models are Few-Shot Learners},
url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
volume = {33},
year = {2020}
}
```
### Groups and Tasks
#### Groups
* `arithmetic`: Evaluates all ten `arithmetic_*` tasks listed below, from `arithmetic_1dc` to `arithmetic_5ds`
#### Tasks
* `arithmetic_1dc`
* `arithmetic_2da`
* `arithmetic_2dm`
* `arithmetic_2ds`
* `arithmetic_3da`
* `arithmetic_3ds`
* `arithmetic_4da`
* `arithmetic_4ds`
* `arithmetic_5da`
* `arithmetic_5ds`
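Each task name encodes the operation and operand width (e.g. `2da` for two-digit addition, `5ds` for five-digit subtraction, `1dc` for one-digit composite expressions). The sketch below builds a hypothetical two-digit-addition item and scores it; the exact prompt wording and metric used by the harness may differ.

```python
import random


def make_two_digit_addition(rng):
    """Build one illustrative two-digit-addition question/answer pair."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    prompt = f"Question: What is {a} plus {b}?\nAnswer:"
    return prompt, str(a + b)


def exact_match(prediction, target):
    # One simple scoring rule: strict string match after stripping whitespace
    # (the harness's own accuracy metric may be computed differently).
    return float(prediction.strip() == target.strip())


rng = random.Random(0)
prompt, gold = make_two_digit_addition(rng)
print(prompt)
print("gold:", gold, "score if answered correctly:", exact_match(f" {gold} ", gold))
```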
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# bAbI
### Paper
Title: Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Abstract: https://arxiv.org/abs/1502.05698
One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
Homepage: https://github.com/facebookarchive/bAbI-tasks
### Citation
```
@article{weston2015towards,
title={Towards ai-complete question answering: A set of prerequisite toy tasks},
author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1502.05698},
year={2015}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `babi`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- greedy_until
task: babi
dataset_path: Muennighoff/babi
dataset_name: null
......
group: t0_eval
task:
# # Coreference Resolution
# - dataset_path: super_glue
# dataset_name: wsc.fixed
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Coreference Resolution
# - dataset_path: winogrande
# dataset_name: winogrande_xl
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# Natural Language Inference
- dataset_path: super_glue
dataset_name: cb
use_prompt: promptsource:*
training_split: train
validation_split: validation
output_type: greedy_until
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Natural Language Inference
# - dataset_path: super_glue
# dataset_name: rte
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Natural Language Inference
# # - dataset_path: anli
# # use_prompt: promptsource:*
# # training_split: train_r1
# # validation_split: dev_r1
# # Sentence Completion
# - dataset_path: super_glue
# dataset_name: copa
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Sentence Completion
# - dataset_path: hellaswag
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Word Sense Disambiguation
# - dataset_path: super_glue
# dataset_name: wic
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
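`use_prompt: promptsource:*` asks the harness to evaluate every Promptsource template written for the dataset (here SuperGLUE CB). The sketch below shows roughly what those templates look like when rendered, assuming the `promptsource` and `datasets` packages are installed; the harness performs this expansion internally.

```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# All community-written prompts for SuperGLUE CB.
templates = DatasetTemplates("super_glue", "cb")
example = load_dataset("super_glue", "cb", split="validation")[0]

for name in templates.all_template_names:
    rendered = templates[name].apply(example)
    # apply() returns the rendered input and, when defined, the gold target.
    prompt = rendered[0]
    target = rendered[1] if len(rendered) > 1 else None
    print(f"--- {name} ---\n{prompt}\n=> {target}\n")
```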
group: blimp
dataset_path: blimp
output_type: multiple_choice
-validation_split: validation
+validation_split: train
doc_to_text: ""
doc_to_target: 0
doc_to_choice: "{{[sentence_good, sentence_bad]}}"
......
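Since `doc_to_choice` pairs each grammatical sentence with its minimally different ungrammatical counterpart and `doc_to_target` is always index 0, a BLiMP item counts as correct when the model assigns higher log-likelihood to `sentence_good` than to `sentence_bad`. A rough sketch of that comparison with a Hugging Face causal LM (the model name is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice of model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def sentence_logprob(sentence):
    """Sum of token log-probabilities of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logprobs = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()


def blimp_correct(sentence_good, sentence_bad):
    # Correct iff the grammatical sentence is the more probable one.
    return sentence_logprob(sentence_good) > sentence_logprob(sentence_bad)


print(blimp_correct("The cats sleep.", "The cats sleeps."))
```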
......@@ -52,9 +52,15 @@ Homepage: https://github.com/nyu-mll/crows-pairs, https://gitlab.inria.fr/french
}
```
### Subtasks
### Groups and Tasks
#### Groups
- `crows_pairs_english`: The entire English subset of the CrowS-Pairs dataset.
- `crows_pairs_french`: The entire French subset of the CrowS-Pairs dataset.
#### Tasks
The following tasks evaluate sub-areas of bias in the English CrowS-Pairs dataset:
- `crows_pairs_english_age`
......@@ -68,8 +74,6 @@ The following tasks evaluate sub-areas of bias in the English CrowS-Pairs datase
- `crows_pairs_english_sexual_orientation`
- `crows_pairs_english_socioeconomic`
- `crows_pairs_french`: The entire French subset of the CrowS-Pairs dataset.
The following tasks evaluate sub-areas of bias in the French CrowS-Pairs dataset:
- `crows_pairs_french_age`
- `crows_pairs_french_autre`
......
......@@ -16,6 +16,6 @@ metric_list:
   - metric: likelihood_diff
     aggregation: mean
     higher_is_better: false
-  - metric: acc
+  - metric: pct_stereotype
     aggregation: mean
-    higher_is_better: true
+    higher_is_better: false
......@@ -13,7 +13,7 @@ def process_results(doc, results):
     # then treat this as predicting stereotyped sentence
     acc = 1.0 if likelihood1 > likelihood2 else 0.0
-    return {"likelihood_diff": diff, "acc": acc}
+    return {"likelihood_diff": diff, "pct_stereotype": acc}
 def doc_to_choice(doc):
......
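For context, here is a self-contained sketch of what `process_results` computes after the rename; the exact shape of `results` is an assumption reconstructed from the fragment above. `likelihood_diff` is the gap between the two sentence log-likelihoods, and `pct_stereotype` records whether the model preferred the more stereotypical sentence, so the averaged value drifting far from 0.5 indicates bias.

```python
def process_results(doc, results):
    # Assumed layout: log-likelihoods of the more- and less-stereotypical
    # sentence, in that order.
    likelihood1, likelihood2 = results

    # Absolute gap between the two log-likelihoods.
    diff = abs(likelihood1 - likelihood2)

    # If the model assigns higher likelihood to the stereotypical sentence,
    # treat this as predicting the stereotyped sentence.
    acc = 1.0 if likelihood1 > likelihood2 else 0.0

    return {"likelihood_diff": diff, "pct_stereotype": acc}


print(process_results({}, (-31.2, -33.8)))  # illustrative log-likelihood values
```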
# GLUE
### Paper
Title: `GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding`
Abstract: https://openreview.net/pdf?id=rJ4km2R5t7
The General Language Understanding Evaluation (GLUE) benchmark is a collection of
resources for training, evaluating, and analyzing natural language understanding
systems. GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built
on established existing datasets and selected to cover a diverse range of dataset
sizes, text genres, and degrees of difficulty, and
- A diagnostic dataset designed to evaluate and analyze model performance with
respect to a wide range of linguistic phenomena found in natural language.
Homepage: https://gluebenchmark.com/
### Citation
```
@inproceedings{wang-etal-2018-glue,
title = "{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding",
author = "Wang, Alex and
Singh, Amanpreet and
Michael, Julian and
Hill, Felix and
Levy, Omer and
Bowman, Samuel",
booktitle = "Proceedings of the 2018 {EMNLP} Workshop {B}lackbox{NLP}: Analyzing and Interpreting Neural Networks for {NLP}",
month = nov,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W18-5446",
doi = "10.18653/v1/W18-5446",
pages = "353--355",
abstract = "Human ability to understand language is \textit{general, flexible, and robust}. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, gluebenchmark.com): a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.",
}
```
### Groups and Tasks
#### Groups
* `glue`: Run all GLUE subtasks.
#### Tasks
* `cola`
* `mnli`
* `mrpc`
* `qnli`
* `qqp`
* `rte`
* `sst`
* `wnli`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: glue
task: cola
dataset_path: glue
dataset_name: cola
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: sentence
metric_list:
- metric: mcc
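CoLA is scored with the Matthews correlation coefficient rather than plain accuracy because its acceptable/unacceptable labels are imbalanced. A quick illustration of the metric with scikit-learn (labels and predictions are made up):

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical gold labels and model predictions (1 = acceptable, 0 = not).
golds = [1, 1, 0, 1, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 0, 1, 1, 0]

# MCC ranges from -1 to 1, with 0 roughly corresponding to chance.
print(matthews_corrcoef(golds, preds))
```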
group: glue
task: mnli
dataset_path: glue
dataset_name: mnli
output_type: multiple_choice
training_split: train
validation_split: validation_matched
doc_to_text: !function utils.doc_to_text
doc_to_target: label
doc_to_choice: ["True", "Neither", "False"]
metric_list:
- metric: acc
include: default.yaml
task: mnli_mismatch
validation_split: validation_mismatched
test_split: test_mismatched
def doc_to_text(doc):
return "{}\nQuestion: {} True, False or Neither?\nAnswer:".format(
doc["premise"],
doc["hypothesis"].strip()
+ ("" if doc["hypothesis"].strip().endswith(".") else "."),
)
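For illustration, applying this helper to a hypothetical MNLI-style record yields the prompt shown in the comment (note the full stop appended to the hypothesis):

```python
doc = {
    "premise": "A soccer game with multiple males playing.",
    "hypothesis": "Some men are playing a sport",
}
print(doc_to_text(doc))
# A soccer game with multiple males playing.
# Question: Some men are playing a sport. True, False or Neither?
# Answer:
```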
group: glue
task: mrpc
dataset_path: glue
dataset_name: mrpc
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "Sentence 1: {{sentence1}}\nSentence 2: {{sentence2}}\nQuestion: Do both sentences mean the same thing?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
metric_list:
- metric: acc
- metric: f1
group:
- glue-promptsource
group: glue
task: qnli
dataset_path: glue
dataset_name: qnli
output_type: multiple_choice
training_split: train
validation_split: validation
use_prompt: "promptsource:have all you need"
doc_to_text: "{{question}}\n{{sentence}}\nQuestion: Does this response answer the question?\nAnswer:"
doc_to_target: label
doc_to_choice: ["yes", "no"]
metric_list:
- metric: acc
group: glue
task: qqp
dataset_path: glue
dataset_name: qqp
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "\nSentence 1: {{question1}}\nSentence 2: {{question2}}\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
metric_list:
- metric: acc
- metric: f1
group: glue
task: rte
dataset_path: glue
dataset_name: rte
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence1}}\nQuestion: {{sentence2}} True or False?\nAnswer:"
doc_to_target: label
doc_to_choice: ["True", "False"]
metric_list:
- metric: acc