| [histoires_morales](histoires_morales/README.md) | A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | French (Some MT) |
| [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
| [humaneval](humaneval/README.md) | Code generation task that measures functional correctness of programs synthesized from docstrings. | Python |
| [icelandic_winogrande](icelandic_winogrande/README.md) | Manually translated and localized version of the [WinoGrande](winogrande/README.md) commonsense reasoning benchmark for Icelandic. | Icelandic |
| [ifeval](ifeval/README.md) | Instruction-following evaluation built around automatically verifiable instructions (e.g., length, formatting, and keyword constraints). | English |
| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
| [japanese_leaderboard](japanese_leaderboard/README.md) | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese |
Icelandic WinoGrande is a manually translated and localized version of the English-language WinoGrande dataset, designed to be 'a new and challenging benchmark for commonsense reasoning and natural language understanding' in Icelandic [(Snæbjarnarson et al., 2022)](https://aclanthology.org/2022.lrec-1.464/).
**Implementation Note:** The original dataset was designed for evaluation with a BERT-style model. Following the evaluation method used for the original (English-language) WinoGrande in the Harness (see [here](../winogrande/README.md)), this implementation uses partial scoring as described by [Trinh & Le (2018)](https://arxiv.org/abs/1806.02847), which allows evaluation of autoregressive models.
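To make the scoring concrete, the following is a minimal sketch of partial scoring for a WinoGrande-style item with an autoregressive model. The model name and the (English) example sentence are placeholders, and the Harness itself handles tokenization, batching, and normalization more carefully than this sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM works for illustration.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def continuation_logprob(prefix: str, continuation: str) -> float:
    """Sum of log-probabilities of the continuation tokens given the prefix."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    offset = prefix_ids.shape[1]
    total = 0.0
    for i in range(cont_ids.shape[1]):
        # Logits at position (offset + i - 1) predict the token at (offset + i).
        total += log_probs[0, offset + i - 1, cont_ids[0, i]].item()
    return total


# Hypothetical example: each option is substituted for the blank ("_"), and only
# the text *after* the blank is scored, conditioned on the filled-in prefix.
sentence = "The trophy does not fit in the suitcase because _ is too small."
before, after = sentence.split("_")
options = ["the trophy", "the suitcase"]
scores = [continuation_logprob(before + option, after) for option in options]
print(options[scores.index(max(scores))])  # option whose filled-in prefix makes the continuation most likely
```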
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `icelandic_winogrande`
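For reference, the task can be run through the Harness in the usual way. The snippet below is a hedged sketch using the Python entry point `lm_eval.simple_evaluate`; the model is only a placeholder, and the equivalent CLI invocation (`lm_eval --tasks icelandic_winogrande ...`) works as well.

```python
import lm_eval

# Placeholder model; substitute any model with reasonable Icelandic coverage.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["icelandic_winogrande"],
    num_fewshot=0,
)
print(results["results"]["icelandic_winogrande"])
```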
### Citation
```
@inproceedings{snaebjarnarson-etal-2022-warm,
title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
author = "Sn{\ae}bjarnarson, V{\'e}steinn and
S{\'i}monarson, Haukur Barri and
Ragnarsson, P{\'e}tur Orri and
Ing{\'o}lfsd{\'o}ttir, Svanhv{\'i}t Lilja and
J{\'o}nsson, Haukur and
Thorsteinsson, Vilhjalmur and
Einarsson, Hafsteinn",
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.464/",
}
```

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?