Commit f71d56eb authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

parents 33f2f9bf 2f870265
group: glue
task: sst
dataset_path: glue
dataset_name: sst2
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Is this sentence positive or negative?\nAnswer:"
doc_to_target: label
doc_to_choice: ["negative", "positive"]
metric_list:
- metric: acc
group: glue
task: wnli
dataset_path: glue
dataset_name: wnli
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence1}}\nQuestion: {{sentence2}} True or False?\nAnswer:"
doc_to_target: label
doc_to_choice: ["False", "True"]
metric_list:
- metric: acc
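The `doc_to_text` fields above are Jinja2 templates filled in from each dataset row. As a rough illustration (not part of the harness itself), the snippet below renders the WNLI prompt for a made-up example document; the field names `sentence1`, `sentence2`, and `label` are the GLUE WNLI columns, and the harness applies its own Jinja environment rather than a bare `Template`:
```python
# Illustrative only: render the doc_to_text template from the WNLI config above.
# The example document is invented; real rows come from the HF "glue"/"wnli" dataset.
from jinja2 import Template

doc = {
    "sentence1": "The trophy doesn't fit into the brown suitcase because it is too large.",
    "sentence2": "The trophy is too large.",
    "label": 1,
}

template = Template("{{sentence1}}\nQuestion: {{sentence2}} True or False?\nAnswer:")
print(template.render(**doc))
# The target string is doc_to_choice[label], i.e. "True" for label == 1.
```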
......@@ -31,6 +31,19 @@ Homepage: https://github.com/openai/grade-school-math
}
```
### Groups and Tasks
#### Groups
- `math_word_problems`
- `chain_of_thought`
- `self_consistency`
#### Tasks
- `gsm8k_yaml`
- `gsm8k_cot`: GSM8K with Chain-of-Thought
- `gsm8k_cot_self_consistency`: GSM8K with Chain-of-Thought and Self-Consistency
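GSM8K gold answers end with a final line of the form `#### <number>`, which is the value the answer-matching metrics ultimately compare against. A minimal sketch of inspecting that format (illustrative only, not the harness's own answer-extraction code):
```python
# Illustrative sketch: peek at GSM8K's question/answer format.
# Gold answers end with a "#### <number>" line.
from datasets import load_dataset

ds = load_dataset("gsm8k", "main", split="train")
example = ds[0]
print(example["question"])
print(example["answer"])

# The final numeric answer follows the "####" delimiter.
final_answer = example["answer"].split("####")[-1].strip()
print("gold:", final_answer)
```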
### Checklist
......
group:
- greedy_until
- math_word_problems
task: gsm8k_yaml
dataset_path: gsm8k
......
......@@ -32,7 +32,13 @@ Homepage: https://aghie.github.io/head-qa/
}
```
### Groups and Tasks
#### Groups
- `headqa`: Evaluates `headqa_en` and `headqa_es`
#### Tasks
* `headqa_en` - English variant of HEAD-QA
* `headqa_es` - Spanish variant of HEAD-QA
......
group:
- multiple_choice
- headqa
task: headqa_en
dataset_path: EleutherAI/headqa
dataset_name: en
......
# HellaSwag
### Paper
Title: `HellaSwag: Can a Machine Really Finish Your Sentence?`
Abstract: https://arxiv.org/abs/1905.07830
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?
In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
Homepage: `https://rowanzellers.com/hellaswag/`
......@@ -21,6 +24,17 @@ Homepage: `https://rowanzellers.com/hellaswag/`
}
```
### Groups and Tasks
#### Groups
- Not part of a group yet
#### Tasks
- `hellaswag`
### Checklist
For adding novel benchmarks/datasets to the library:
......
......@@ -7,9 +7,10 @@ output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "{% set text = activity_label ~ ': ' ~ ctx_a ~ ' ' ~ ctx_b.capitalize() %}{{text|trim|replace(' [title]', '. ')|regex_replace('\\[.*?\\]', '')|replace(' ', ' ')}}"
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "{{endings|map('trim')|map('replace', ' [title]', '. ')|map('regex_replace', '\\[.*?\\]', '')|map('replace', ' ', ' ')|list}}"
doc_to_choice: "{{choices}}"
metric_list:
- metric: acc
aggregation: mean
......
import datasets
import re


def preprocess(text):
    text = text.strip()
    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    # Collapse double spaces left behind by the substitutions above.
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
        out_doc = {
            "query": preprocess(doc["activity_label"] + ": " + ctx),
            "choices": [preprocess(ending) for ending in doc["endings"]],
            "gold": int(doc["label"]),
        }
        return out_doc

    return dataset.map(_process_doc)
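A small driver (illustrative only, not part of the harness) showing what `process_docs` produces on the HellaSwag validation split; the `query`, `choices`, and `gold` fields it adds are the ones the updated YAML above refers to:
```python
# Illustrative usage: assumes the preprocess/process_docs definitions above.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
processed = process_docs(hellaswag)

doc = processed[0]
print(doc["query"])    # cleaned "activity_label: ctx" prompt
print(doc["choices"])  # four cleaned candidate endings
print(doc["gold"])     # index of the correct ending
```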
......@@ -25,13 +25,20 @@ Homepage: https://github.com/hendrycks/ethics
}
```
### Groups and Tasks
#### Groups
- `hendrycks_ethics`
#### Tasks
* `ethics_cm`
* `ethics_deontology`
* `ethics_justice`
* `ethics_utilitarianism`
* (MISSING) `ethics_utilitarianism_original`
* `ethics_virtue`
### Checklist
......
# LAMBADA
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
......@@ -14,6 +15,18 @@ in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Groups and Tasks
#### Groups
- `lambada`
#### Tasks
- `lambada_openai`
- `lambada_standard`
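As described above, LAMBADA asks a model to predict the final word of a passage. A rough sketch of that setup (illustrative only; the actual prompt/target construction lives in the task YAMLs), using the `text` field of `EleutherAI/lambada_openai`:
```python
# Illustrative sketch of the LAMBADA word-prediction setup:
# the context is everything but the last word, and the target is the last word.
from datasets import load_dataset

lambada = load_dataset("EleutherAI/lambada_openai", "default", split="test")
passage = lambada[0]["text"]

context, target = passage.rsplit(" ", 1)
print("context:", context)
print("target :", target)
```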
### Citation
@misc{
......
group:
- lambada
- loglikelihood
- perplexity
task: lambada_openai
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
......
# LAMBADA Cloze
### Paper
Title: `The LAMBADA dataset: Word prediction requiring a broad discourse context`
Abstract: https://arxiv.org/abs/1606.06031
Cloze-style LAMBADA dataset.
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
```
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
```
### Groups and Tasks
#### Groups
* `lambada_cloze`
#### Tasks
* `lambada_openai_cloze_yaml`
* `lambada_standard_cloze_yaml`
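The cloze variants differ from the base LAMBADA tasks mainly in how the prompt is phrased: the missing final word is marked with an explicit blank rather than the passage simply being truncated. A rough sketch of the idea (illustrative only; the exact wording is defined in the task YAMLs):
```python
# Illustrative sketch of a cloze-style LAMBADA prompt: the last word is replaced
# with an explicit blank marker instead of simply being cut off.
passage = "He shook his head, took a step back and tried to smile"  # made-up passage
context, target = passage.rsplit(" ", 1)

cloze_prompt = context + " ____."  # hypothetical blank marker, not necessarily the exact template
print(cloze_prompt)
print("expected completion:", target)
```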
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
......@@ -25,7 +25,13 @@ Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
month={Aug}
}
### Groups and Tasks
#### Groups
* `lambada_multilingual`: Evaluates all `lambada_mt_X` tasks
#### Tasks
* `lambada_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's Lambada variant.
......
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
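The per-language configs above share their settings through `include`: the German variant pulls everything from `lambada_mt_en.yaml` and only overrides what differs. A hypothetical config for another listed language (e.g. French) would follow the same pattern:
```yaml
# Hypothetical sketch (not a file in this commit): a French variant following
# the same include-and-override pattern as lambada_openai_mt_de above.
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
```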
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
......