Merge branch 'big-refactor' into update_docs

Commit 767c58b9 authored Aug 16, 2023 by lintangsutawika
20 changed files
--- a/lm_eval/tasks/anli/anli_r3.yaml
+++ b/lm_eval/tasks/anli/anli_r3.yaml
-group:
-  - multiple_choice
-  - natural_language_inference
-  - nli
-  - adverserial
+include: anli_r1.yaml
 task: anli_r3
-dataset_path: anli
-dataset_name: null
-output_type: multiple_choice
 training_split: train_r3
 validation_split: dev_r3
 test_split: test_r3
-doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
-# True = entailment
-# False = contradiction
-# Neither = neutral
-doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
-doc_to_choice:
-  - "True"
-  - "Neither"
-  - "False"
-should_decontaminate: true
-doc_to_decontamination_query: premise
-metric_list:
-  - metric: acc
-    aggregation: mean
-    higher_is_better: true
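
The hunk above replaces the duplicated ANLI settings with `include: anli_r1.yaml`, leaving only the round-specific keys in `anli_r3.yaml`. Below is a minimal sketch of how such an include could be resolved, assuming child keys simply override inherited ones. The `resolve_include` helper, the `registry` dict, and the contents shown for `anli_r1.yaml` are hypothetical and for illustration only; they are not the harness's actual loader, and `anli_r1.yaml` itself is not part of this diff.

```python
# Hypothetical sketch: shared settings live in a base config, and each
# round-specific config only overrides the keys that differ.
def resolve_include(config, registry):
    """Merge `config` over the config named by its `include` key, recursively."""
    config = dict(config)  # copy so the caller's dict is not mutated
    parent_name = config.pop("include", None)
    if parent_name is None:
        return config
    parent = resolve_include(registry[parent_name], registry)
    return {**parent, **config}  # child keys win over inherited ones


# Assumed file contents, keyed by filename (stand-ins for parsed YAML).
registry = {
    "anli_r1.yaml": {
        "task": "anli_r1",
        "dataset_path": "anli",
        "output_type": "multiple_choice",
        "training_split": "train_r1",
        "validation_split": "dev_r1",
        "test_split": "test_r1",
    },
    "anli_r3.yaml": {
        "include": "anli_r1.yaml",
        "task": "anli_r3",
        "training_split": "train_r3",
        "validation_split": "dev_r3",
        "test_split": "test_r3",
    },
}

anli_r3 = resolve_include(registry["anli_r3.yaml"], registry)
```

Under this assumption, `anli_r3` inherits `dataset_path` and `output_type` from the base file while keeping its own task name and splits.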
--- a/lm_eval/tasks/arc/README.md
+++ b/lm_eval/tasks/arc/README.md
 # ARC

-Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
-https://arxiv.org/pdf/1803.05457.pdf
+### Paper
+
+Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
+
+Abstract: https://arxiv.org/abs/1803.05457

 The ARC dataset consists of 7,787 science exam questions drawn from a variety
 of sources, including science questions provided under license by a research
@@ -13,7 +16,9 @@ a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questi

 Homepage: https://allenai.org/data/arc

+
 ### Citation
+
 ```
 @article{Clark2018ThinkYH,
  title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
@@ -23,3 +28,27 @@ Homepage: https://allenai.org/data/arc
  volume={abs/1803.05457}
 }
 ```
+
+### Groups and Tasks
+
+#### Groups
+
+* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`
+
+#### Tasks
+
+* `arc_easy`
+* `arc_challenge`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/arc/arc_challenge.yaml
+++ b/lm_eval/tasks/arc/arc_challenge.yaml
 include: arc_easy.yaml
-group:
-  - ai2_arc
-  - multiple_choice
 task: arc_challenge
-dataset_path: ai2_arc
 dataset_name: ARC-Challenge
--- a/lm_eval/tasks/arc/arc_easy.yaml
+++ b/lm_eval/tasks/arc/arc_easy.yaml
 group:
  - ai2_arc
-  - multiple_choice
 task: arc_easy
 dataset_path: ai2_arc
 dataset_name: ARC-Easy

--- a/lm_eval/tasks/arithmetic/README.md
+++ b/lm_eval/tasks/arithmetic/README.md
+# Arithmetic
+
+### Paper
+
+Title: `Language Models are Few-Shot Learners`
+Abstract: https://arxiv.org/abs/2005.14165
+
+A small battery of 10 tests that involve asking language models a simple arithmetic
+problem in natural language.
+
+Homepage: https://github.com/openai/gpt-3/tree/master/data
+
+
+### Citation
+
+```
+@inproceedings{NEURIPS2020_1457c0d6,
+    author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
+    booktitle = {Advances in Neural Information Processing Systems},
+    editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
+    pages = {1877--1901},
+    publisher = {Curran Associates, Inc.},
+    title = {Language Models are Few-Shot Learners},
+    url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
+    volume = {33},
+    year = {2020}
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+* `arithmetic`: Evaluates `1dc` to `5ds`
+
+#### Tasks
+
+* `arithmetic_1dc`
+* `arithmetic_2da`
+* `arithmetic_2dm`
+* `arithmetic_2ds`
+* `arithmetic_3da`
+* `arithmetic_3ds`
+* `arithmetic_4da`
+* `arithmetic_4ds`
+* `arithmetic_5da`
+* `arithmetic_5ds`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/babi/README.md
+++ b/lm_eval/tasks/babi/README.md
+# bAbI
+
+### Paper
+
+Title: Towards ai-complete question answering: A set of prerequisite toy tasks
+Abstract: https://arxiv.org/abs/1502.05698
+
+One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
+
+Homepage: https://github.com/facebookarchive/bAbI-tasks
+
+
+### Citation
+
+```
+@article{weston2015towards,
+  title={Towards ai-complete question answering: A set of prerequisite toy tasks},
+  author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
+  journal={arXiv preprint arXiv:1502.05698},
+  year={2015}
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet
+
+#### Tasks
+
+* `babi`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/babi/babi.yaml
+++ b/lm_eval/tasks/babi/babi.yaml
+task: babi
+dataset_path: Muennighoff/babi
+dataset_name: null
+output_type: greedy_until
+training_split: train
+validation_split: valid
+test_split: test
+doc_to_text: "Passage: {{passage}}Question: {{question}}\nAnswer:"
+doc_to_target: " {{answer}}"
+target_delimiter: ""
+generation_kwargs:
+  until:
+    - "\n"
+    - "Passage:"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
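
The `babi.yaml` config above combines a prompt template, a target with a leading space, an empty `target_delimiter`, and two `until` stop sequences. The sketch below illustrates how these pieces might interact, assuming generation is cut at the earliest stop sequence and scored by strict string equality. The helper names and the sample document are hypothetical, and the harness's real `exact_match` may normalize text differently.

```python
# Hypothetical sketch of the prompt / stop-sequence / scoring flow.
def render_prompt(doc):
    # Mirrors doc_to_text: the passage is followed directly by "Question:".
    return f"Passage: {doc['passage']}Question: {doc['question']}\nAnswer:"


def truncate_at_stop(text, stop_sequences):
    # Cut the generation at the earliest occurrence of any stop sequence.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def exact_match(prediction, target):
    # Strict equality; an assumption, not the library's metric definition.
    return 1.0 if prediction == target else 0.0


# Made-up example document, not taken from the dataset.
doc = {
    "passage": "Mary moved to the bathroom.\n",
    "question": "Where is Mary?",
    "answer": "bathroom",
}

prompt = render_prompt(doc)
raw_generation = " bathroom\nPassage: John went to the hallway."
prediction = truncate_at_stop(raw_generation, ["\n", "Passage:"])
target = " " + doc["answer"]  # doc_to_target is " {{answer}}"
score = exact_match(prediction, target)
```

Because `target_delimiter` is empty, the leading space in `doc_to_target` is what separates `Answer:` from the answer text in this sketch.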
--- a/lm_eval/tasks/benchmarks/t0_eval.yaml
+++ b/lm_eval/tasks/benchmarks/t0_eval.yaml
-group: t0_eval
-task:
-  # # Coreference Resolution
-  # - dataset_path: super_glue
-  #   dataset_name: wsc.fixed
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Coreference Resolution
-  # - dataset_path: winogrande
-  #   dataset_name: winogrande_xl
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # Natural Language Inference
-  - dataset_path: super_glue
-    dataset_name: cb
-    use_prompt: promptsource:*
-    training_split: train
-    validation_split: validation
-    output_type: greedy_until
-    metric_list:
-      - metric: exact_match
-        aggregation: mean
-        higher_is_better: true
-        ignore_case: true
-        ignore_punctuation: true
-  # Natural Language Inference
-  # - dataset_path: super_glue
-  #   dataset_name: rte
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Natural Language Inference
-  # # - dataset_path: anli
-  # #   use_prompt: promptsource:*
-  # #   training_split: train_r1
-  # #   validation_split: dev_r1
-  # # Sentence Completion
-  # - dataset_path: super_glue
-  #   dataset_name: copa
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Natural Language Inference
-  # - dataset_path: hellaswag
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Word Sense Disambiguation
-  # - dataset_path: super_glue
-  #   dataset_name: wic
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
--- a/lm_eval/tasks/blimp/README.md
+++ b/lm_eval/tasks/blimp/README.md
+# BLiMP
+
+### Paper
+
+Title: `BLiMP: The Benchmark of Linguistic Minimal Pairs for English`
+Abstract: `https://arxiv.org/abs/1912.00582`
+
+BLiMP is a challenge set for evaluating what language models (LMs) know about
+major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
+containing 1000 minimal pairs isolating specific contrasts in syntax, morphology,
+or semantics. The data is automatically generated according to expert-crafted
+grammars.
+
+Homepage: https://github.com/alexwarstadt/blimp
+
+
+### Citation
+
+```
+@article{warstadt2019blimp,
+    author = {Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R.},
+    title = {BLiMP: The Benchmark of Linguistic Minimal Pairs for English},
+    journal = {Transactions of the Association for Computational Linguistics},
+    volume = {8},
+    number = {},
+    pages = {377-392},
+    year = {2020},
+    doi = {10.1162/tacl\_a\_00321},
+    URL = {https://doi.org/10.1162/tacl_a_00321},
+    eprint = {https://doi.org/10.1162/tacl_a_00321},
+    abstract = { We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs—that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4\%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands. }
+}
+```
+
+### Subtasks
+
+List or describe tasks defined in this folder, and their names here:
+* `task_name`: `1-sentence description of what this particular task does`
+* `task_name2`: .....
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/blimp/adjunct_island.yaml
+++ b/lm_eval/tasks/blimp/adjunct_island.yaml
+# Generated by utils.py
+dataset_name: adjunct_island
+include: template_yaml
+task: blimp_adjunct_island
--- a/lm_eval/tasks/blimp/anaphor_gender_agreement.yaml
+++ b/lm_eval/tasks/blimp/anaphor_gender_agreement.yaml
+# Generated by utils.py
+dataset_name: anaphor_gender_agreement
+include: template_yaml
+task: blimp_anaphor_gender_agreement
--- a/lm_eval/tasks/blimp/anaphor_number_agreement.yaml
+++ b/lm_eval/tasks/blimp/anaphor_number_agreement.yaml
+# Generated by utils.py
+dataset_name: anaphor_number_agreement
+include: template_yaml
+task: blimp_anaphor_number_agreement
--- a/lm_eval/tasks/blimp/animate_subject_passive.yaml
+++ b/lm_eval/tasks/blimp/animate_subject_passive.yaml
+# Generated by utils.py
+dataset_name: animate_subject_passive
+include: template_yaml
+task: blimp_animate_subject_passive
--- a/lm_eval/tasks/blimp/animate_subject_trans.yaml
+++ b/lm_eval/tasks/blimp/animate_subject_trans.yaml
+# Generated by utils.py
+dataset_name: animate_subject_trans
+include: template_yaml
+task: blimp_animate_subject_trans
--- a/lm_eval/tasks/blimp/causative.yaml
+++ b/lm_eval/tasks/blimp/causative.yaml
+# Generated by utils.py
+dataset_name: causative
+include: template_yaml
+task: blimp_causative
--- a/lm_eval/tasks/blimp/complex_NP_island.yaml
+++ b/lm_eval/tasks/blimp/complex_NP_island.yaml
+# Generated by utils.py
+dataset_name: complex_NP_island
+include: template_yaml
+task: blimp_complex_NP_island
--- a/lm_eval/tasks/blimp/coordinate_structure_constraint_complex_left_branch.yaml
+++ b/lm_eval/tasks/blimp/coordinate_structure_constraint_complex_left_branch.yaml
+# Generated by utils.py
+dataset_name: coordinate_structure_constraint_complex_left_branch
+include: template_yaml
+task: blimp_coordinate_structure_constraint_complex_left_branch
--- a/lm_eval/tasks/blimp/coordinate_structure_constraint_object_extraction.yaml
+++ b/lm_eval/tasks/blimp/coordinate_structure_constraint_object_extraction.yaml
+# Generated by utils.py
+dataset_name: coordinate_structure_constraint_object_extraction
+include: template_yaml
+task: blimp_coordinate_structure_constraint_object_extraction
--- a/lm_eval/tasks/blimp/determiner_noun_agreement_1.yaml
+++ b/lm_eval/tasks/blimp/determiner_noun_agreement_1.yaml
+# Generated by utils.py
+dataset_name: determiner_noun_agreement_1
+include: template_yaml
+task: blimp_determiner_noun_agreement_1
--- a/lm_eval/tasks/blimp/determiner_noun_agreement_2.yaml
+++ b/lm_eval/tasks/blimp/determiner_noun_agreement_2.yaml
+# Generated by utils.py
+dataset_name: determiner_noun_agreement_2
+include: template_yaml
+task: blimp_determiner_noun_agreement_2
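
Each BLiMP file above follows the same four-line pattern and is marked `# Generated by utils.py`. The generator script itself is not shown in this diff, so the following is only a sketch of how such files could be produced from a list of sub-dataset names. The function names `render_task_yaml` and `render_all` are hypothetical.

```python
# Hypothetical sketch: build the per-subtask YAML text seen in the hunks above.
def render_task_yaml(dataset_name):
    lines = [
        "# Generated by utils.py",
        f"dataset_name: {dataset_name}",
        "include: template_yaml",
        f"task: blimp_{dataset_name}",
    ]
    return "\n".join(lines) + "\n"


def render_all(dataset_names):
    # Map each output filename to its rendered contents.
    return {f"{name}.yaml": render_task_yaml(name) for name in dataset_names}


files = render_all(["adjunct_island", "causative"])
```

Writing each entry of `files` to disk would reproduce files of the same shape as `adjunct_island.yaml` and `causative.yaml` in this commit.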