# HellaSwag

### Paper

Title: `HellaSwag: Can a Machine Really Finish Your Sentence?`

Abstract: https://arxiv.org/abs/1905.07830

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?

In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.

Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Homepage: `https://rowanzellers.com/hellaswag/`


### Citation

```
@inproceedings{zellers2019hellaswag,
    title={HellaSwag: Can a Machine Really Finish Your Sentence?},
    author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year={2019}
}
```

### Subtasks

- `hellaswag`


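A minimal sketch of evaluating this task with the lm-evaluation-harness Python API is shown below. The `simple_evaluate` entry point and its argument names reflect recent harness versions, and the `gpt2` model is purely illustrative; treat the exact interface as an assumption and check the harness documentation for your installed version.

```python
# Minimal sketch (assumed API): run HellaSwag through lm-evaluation-harness.
# Entry point and argument names may differ between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                        # Hugging Face backend (assumed backend name)
    model_args="pretrained=gpt2",      # illustrative model; substitute your own
    tasks=["hellaswag"],
    num_fewshot=0,
)

# Per-task metrics (e.g. accuracy and length-normalized accuracy)
print(results["results"]["hellaswag"])
```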
### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?