Commit f71d56eb authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

parents 33f2f9bf 2f870265
group: glue
task: sst
dataset_path: glue
dataset_name: sst2
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Is this sentence positive or negative?\nAnswer:"
doc_to_target: label
doc_to_choice: ["negative", "positive"]
metric_list:
- metric: acc
group: glue
task: wnli
dataset_path: glue
dataset_name: wnli
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence1}}\nQuestion: {{sentence2}} True or False?\nAnswer:"
doc_to_target: label
doc_to_choice: ["False", "True"]
metric_list:
- metric: acc
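The `doc_to_text` fields above are Jinja2 templates filled in from each dataset row. As a rough illustration (not part of the harness itself), the snippet below renders the WNLI prompt for a made-up example document; the field names `sentence1`, `sentence2`, and `label` are the GLUE WNLI columns, and the harness applies its own Jinja environment rather than a bare `Template`:
```python
# Illustrative only: render the doc_to_text template from the WNLI config above.
# The example document is invented; real rows come from the HF "glue"/"wnli" dataset.
from jinja2 import Template

doc = {
    "sentence1": "The trophy doesn't fit into the brown suitcase because it is too large.",
    "sentence2": "The trophy is too large.",
    "label": 1,
}

template = Template("{{sentence1}}\nQuestion: {{sentence2}} True or False?\nAnswer:")
print(template.render(**doc))
# The target string is doc_to_choice[label], i.e. "True" for label == 1.
```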
......@@ -31,6 +31,19 @@ Homepage: https://github.com/openai/grade-school-math
}
```
### Groups and Tasks
#### Groups
- `math_word_problems`
- `chain_of_thought`
- `self_consistency`
#### Tasks
- `gsm8k_yaml`
- `gsm8k_cot`: GSM8K with Chain-of-Thought
- `gsm8k_cot_self_consistency`: GSM8K with Chain-of-Thought and Self-Consistency
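GSM8K gold answers end with a final line of the form `#### <number>`, which is the value the answer-matching metrics ultimately compare against. A minimal sketch of inspecting that format (illustrative only, not the harness's own answer-extraction code):
```python
# Illustrative sketch: peek at GSM8K's question/answer format.
# Gold answers end with a "#### <number>" line.
from datasets import load_dataset

ds = load_dataset("gsm8k", "main", split="train")
example = ds[0]
print(example["question"])
print(example["answer"])

# The final numeric answer follows the "####" delimiter.
final_answer = example["answer"].split("####")[-1].strip()
print("gold:", final_answer)
```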
### Checklist
......
group:
- greedy_until
- math_word_problems
task: gsm8k_yaml
dataset_path: gsm8k
......
......@@ -32,7 +32,13 @@ Homepage: https://aghie.github.io/head-qa/
}
```
### Groups and Tasks
#### Groups
- `headqa`: Evaluates `headqa_en` and `headqa_es`
#### Tasks
* `headqa_en` - English variant of HEAD-QA
* `headqa_es` - Spanish variant of HEAD-QA
......
group:
- multiple_choice
- headqa
task: headqa_en
dataset_path: EleutherAI/headqa
dataset_name: en
......
# HellaSwag
### Paper
Title: `HellaSwag: Can a Machine Really Finish Your Sentence?`
Abstract: https://arxiv.org/abs/1905.07830
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?
In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
Homepage: `https://rowanzellers.com/hellaswag/`
......@@ -21,6 +24,17 @@ Homepage: `https://rowanzellers.com/hellaswag/`
}
```
### Groups and Tasks
#### Groups
- Not part of a group yet
#### Tasks
- `hellaswag`
### Checklist
For adding novel benchmarks/datasets to the library:
......
......@@ -7,9 +7,10 @@ output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "{% set text = activity_label ~ ': ' ~ ctx_a ~ ' ' ~ ctx_b.capitalize() %}{{text|trim|replace(' [title]', '. ')|regex_replace('\\[.*?\\]', '')|replace(' ', ' ')}}"
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "{{endings|map('trim')|map('replace', ' [title]', '. ')|map('regex_replace', '\\[.*?\\]', '')|map('replace', ' ', ' ')|list}}"
doc_to_choice: "{{choices}}"
metric_list:
- metric: acc
aggregation: mean
......
import datasets
import re


def preprocess(text):
    text = text.strip()
    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    # Collapse double spaces left behind by the substitutions above.
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
        out_doc = {
            "query": preprocess(doc["activity_label"] + ": " + ctx),
            "choices": [preprocess(ending) for ending in doc["endings"]],
            "gold": int(doc["label"]),
        }
        return out_doc

    return dataset.map(_process_doc)
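A small driver (illustrative only, not part of the harness) showing what `process_docs` produces on the HellaSwag validation split; the `query`, `choices`, and `gold` fields it adds are the ones the updated YAML above refers to:
```python
# Illustrative usage: assumes the preprocess/process_docs definitions above.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
processed = process_docs(hellaswag)

doc = processed[0]
print(doc["query"])    # cleaned "activity_label: ctx" prompt
print(doc["choices"])  # four cleaned candidate endings
print(doc["gold"])     # index of the correct ending
```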
......@@ -25,13 +25,20 @@ Homepage: https://github.com/hendrycks/ethics
}
```
### Groups and Tasks
#### Groups
- `hendrycks_ethics`
#### Tasks
* `ethics_cm`
* `ethics_deontology`
* `ethics_justice`
* `ethics_utilitarianism`
* (MISSING) `ethics_utilitarianism_original`
* `ethics_virtue`
### Checklist
......
# LAMBADA
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
......@@ -14,6 +15,18 @@ in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Groups and Tasks
#### Groups
- `lambada`
#### Tasks
- `lambada_openai`
- `lambada_standard`
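As described above, LAMBADA asks a model to predict the final word of a passage. A rough sketch of that setup (illustrative only; the actual prompt/target construction lives in the task YAMLs), using the `text` field of `EleutherAI/lambada_openai`:
```python
# Illustrative sketch of the LAMBADA word-prediction setup:
# the context is everything but the last word, and the target is the last word.
from datasets import load_dataset

lambada = load_dataset("EleutherAI/lambada_openai", "default", split="test")
passage = lambada[0]["text"]

context, target = passage.rsplit(" ", 1)
print("context:", context)
print("target :", target)
```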
### Citation
@misc{
......
group:
- lambada
- loglikelihood
- perplexity
task: lambada_openai
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
......
# LAMBADA Cloze
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/abs/1606.06031
Cloze-style LAMBADA dataset.
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
```
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
```
### Groups and Tasks
#### Groups
* `lambada_cloze`
#### Tasks
* `lambada_openai_cloze_yaml`
* `lambada_standard_cloze_yaml`
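The cloze variants differ from the base LAMBADA tasks mainly in how the prompt is phrased: the missing final word is marked with an explicit blank rather than the passage simply being truncated. A rough sketch of the idea (illustrative only; the exact wording is defined in the task YAMLs):
```python
# Illustrative sketch of a cloze-style LAMBADA prompt: the last word is replaced
# with an explicit blank marker instead of simply being cut off.
passage = "He shook his head, took a step back and tried to smile"  # made-up passage
context, target = passage.rsplit(" ", 1)

cloze_prompt = context + " ____."  # hypothetical blank marker, not necessarily the exact template
print(cloze_prompt)
print("expected completion:", target)
```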
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
......@@ -25,7 +25,13 @@ Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
month={Aug}
}
### Groups and Tasks
#### Groups
* `lambada_multilingual`: Evaluates all `lambada_mt_X` tasks
#### Tasks
* `lambada_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's Lambada variant.
......
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
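The per-language configs above share their settings through `include`: the German variant pulls everything from `lambada_mt_en.yaml` and only overrides what differs. A hypothetical config for another listed language (e.g. French) would follow the same pattern:
```yaml
# Hypothetical sketch (not a file in this commit): a French variant following
# the same include-and-override pattern as lambada_openai_mt_de above.
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
```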
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
......