Icelandic WinoGrande is a manually translated and localized version of the English-language WinoGrande dataset, designed to be 'a new and challenging benchmark for commonsense reasoning and natural language understanding' in Icelandic [(Snæbjarnarson et al., 2022)](https://aclanthology.org/2022.lrec-1.464/).
**Implementation Note:** The original dataset was designed for evaluation with a BERT-style model. Following the evaluation method used for the original (English-language) WinoGrande in the Harness (see information [here](../winogrande/README.md)), this implementation uses the partial scoring described by [Trinh & Le (2018)](https://arxiv.org/abs/1806.02847) so that autoregressive models can be evaluated.
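The snippet below is a minimal sketch of that partial-scoring comparison, assuming a HuggingFace causal LM. It is not the Harness implementation: the model name, the helper function, and the (English, for readability) example item are illustrative placeholders, not taken from this task's configuration or data.

```python
# Sketch of partial scoring (Trinh & Le, 2018): each answer option is
# substituted into the blank, and only the text *after* the blank is
# scored, conditioned on the filled-in prefix. "gpt2" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def continuation_logprob(prefix: str, continuation: str) -> float:
    """Log-probability of `continuation` given `prefix`."""
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logprobs = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Sum only over the continuation's tokens (the "partial" score).
    return token_logprobs[0, prefix_len - 1 :].sum().item()


# Hypothetical English item for readability; the real items are Icelandic.
before, after = "The trophy does not fit in the suitcase because _ is too big.".split("_")
options = ["the trophy", "the suitcase"]
scores = {o: continuation_logprob(before + o, after) for o in options}
prediction = max(scores, key=scores.get)  # option whose continuation scores highest
```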
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `icelandic_winogrande`
### Citation
```
@inproceedings{snaebjarnarson-etal-2022-warm,
title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
author = "Sn{\ae}bjarnarson, V{\'e}steinn and
S{\'i}monarson, Haukur Barri and
Ragnarsson, P{\'e}tur Orri and
Ing{\'o}lfsd{\'o}ttir, Svanhv{\'i}t Lilja and
J{\'o}nsson, Haukur and
Thorsteinsson, Vilhjalmur and
Einarsson, Hafsteinn",
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.464/",
}
```

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
# Targeted Syntactic Evaluation of Language Models (LM-SynEval)
## Paper
**Title:** Targeted Syntactic Evaluation of Language Models
**Authors:** Rebecca Marvin and Tal Linzen
**Link:** https://doi.org/10.18653/v1/D18-1151
**Abstract:**
> We present a data set for evaluating the grammaticality of the predictions of a language model. We automatically construct a large number of minimally different pairs of English sentences, each consisting of a grammatical and an ungrammatical sentence. The sentence pairs represent different variations of structure-sensitive phenomena: subject-verb agreement, reflexive anaphora and negative polarity items. We expect a language model to assign a higher probability to the grammatical sentence than the ungrammatical one. In an experiment using this data set, an LSTM language model performed poorly on many of the constructions. Multi-task training with a syntactic objective (CCG supertagging) improved the LSTM's accuracy, but a large gap remained between its performance and the accuracy of human participants recruited online. This suggests that there is considerable room for improvement over LSTMs in capturing syntax in a language model.
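As a rough illustration of the accuracy criterion described in the abstract, the sketch below scores both members of a minimal pair by summed log-probability and counts the pair as correct when the grammatical sentence scores higher. It assumes a HuggingFace causal LM; "gpt2" and the example pair are placeholders, not drawn from the dataset.

```python
# Minimal-pair evaluation sketch: the model is "correct" on a pair when it
# assigns a higher total log-probability to the grammatical sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to a full sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()


pairs = [
    ("The author laughs.", "The author laugh."),  # illustrative agreement pair
]
correct = sum(sentence_logprob(good) > sentence_logprob(bad) for good, bad in pairs)
accuracy = correct / len(pairs)
```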
## Citation

```
@inproceedings{marvin-linzen-2018-targeted,
title = "Targeted Syntactic Evaluation of Language Models",
author = "Marvin, Rebecca and
Linzen, Tal",
editor = "Riloff, Ellen and
Chiang, David and
Hockenmaier, Julia and
Tsujii, Jun{'}ichi",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D18-1151/",
doi = "10.18653/v1/D18-1151",
pages = "1192--1202"
}
```
## Groups, Tags, and Tasks
The tasks are structured hierarchically as listed below. For more detailed explanations, see the original paper (linked above) and its accompanying repository. In this implementation, group means are unweighted (a short sketch of this averaging follows the task list).
* `lm_syneval`: Targeted Syntactic Evaluation of Language Models
    * `lm_syneval__agreement`: Subject-verb agreement
        * Example: 'The author knows many different foreign languages and likes to watch television shows.' (correct) vs. 'The author knows many different foreign languages and like to watch television shows.' (incorrect)
        * Example: 'The authors know many different foreign languages and like to watch television shows.' (correct) vs. 'The authors know many different foreign languages and likes to watch television shows.' (incorrect)
        * `lm_syneval__agreement__obj_rel_within_anim`: Agreement in an object relative clause with animate external subject
            * Example: 'The authors that the guards like hurt themselves.' (correct) vs. 'The authors that the guards like hurt himself.' (incorrect)
    * `lm_syneval__npi`: Negative polarity items
        * `lm_syneval__npi__simple_npi_anim`: Simple NPI with animate subject
            * `lm_syneval__npi__simple_npi_anim__past`: Past tense
                * Example: 'No authors have ever been popular.' (correct) vs. 'The authors have ever been popular.' (incorrect)
            * `lm_syneval__npi__simple_npi_anim__future`: Future tense
                * Example: 'No authors will ever be popular.' (correct) vs. 'The authors will ever be popular.' (incorrect)
        * `lm_syneval__npi__simple_npi_inanim`: Simple NPI with inanimate subject
            * `lm_syneval__npi__simple_npi_inanim__past`: Past tense
                * Example: 'No movies have ever been seen.' (correct) vs. 'The movies have ever been seen.' (incorrect)
            * `lm_syneval__npi__simple_npi_inanim__future`: Future tense
                * Example: 'No movies will ever be seen.' (correct) vs. 'The movies will ever be seen.' (incorrect)
        * `lm_syneval__npi__npi_across_anim`: NPI across a relative clause with animate subject
            * `lm_syneval__npi__npi_across_anim__past`: Past tense
                * Example: 'No authors that the guards like have ever been popular.' (correct) vs. 'The authors that no guards like have ever been popular.' (incorrect)
            * `lm_syneval__npi__npi_across_anim__future`: Future tense
                * Example: 'No authors that the guards like will ever be popular.' (correct) vs. 'The authors that no guards like will ever be popular.' (incorrect)
        * `lm_syneval__npi__npi_across_inanim`: NPI across a relative clause with inanimate subject
            * `lm_syneval__npi__npi_across_inanim__past`: Past tense
                * Example: 'No movies that the guards like have ever been seen.' (correct) vs. 'The movies that no guards like have ever been seen.' (incorrect)
            * `lm_syneval__npi__npi_across_inanim__future`: Future tense
                * Example: 'No movies that the guards like will ever be seen.' (correct) vs. 'The movies that no guards like will ever be seen.' (incorrect)
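As noted above, group scores here are unweighted means over subtasks. The sketch below illustrates what that means; the accuracy values are made up purely for illustration.

```python
# Unweighted group mean: each subtask contributes equally, regardless of
# how many examples it contains. The accuracies below are illustrative only.
subtask_accuracy = {
    "lm_syneval__npi__simple_npi_anim__past": 0.91,
    "lm_syneval__npi__simple_npi_anim__future": 0.88,
    "lm_syneval__npi__npi_across_anim__past": 0.62,
    "lm_syneval__npi__npi_across_anim__future": 0.59,
}
group_mean = sum(subtask_accuracy.values()) / len(subtask_accuracy)
# A weighted mean would instead pool examples (or weight by example count);
# that is not what this implementation reports.
```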
## Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* The original paper evaluates traditional RNN models, which require a very different pipeline to analyze.