# MultiBLiMP: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
## Task Description
MultiBLiMP is a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 6 linguistic phenomena, with more than 125,000 minimal pairs in total.
* Hugging Face Dataset Repo: https://huggingface.co/datasets/jumelet/multiblimp
## Implementation
* `multiblimp_{lang}` runs MultiBLiMP for a given language, where `{lang}` must be replaced by the language's ISO 639-3 code (e.g., `eng` for English, `abk` for Abkhazian, `wbp` for Warlpiri, etc.); see the usage sketch after this list.
* The `multiblimp` tag runs MultiBLiMP for all languages.
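For example, a minimal sketch of running one of these tasks through the harness's Python API (the model checkpoint is only a placeholder; the CLI invocation `lm_eval --tasks multiblimp_eng` works analogously):

```python
# Minimal sketch: evaluating a Hugging Face causal LM on the English
# MultiBLiMP task via the Evaluation Harness's Python API.
# "EleutherAI/pythia-160m" is a placeholder checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["multiblimp_eng"],  # or ["multiblimp"] to run all languages
)
print(results["results"]["multiblimp_eng"])  # reports both acc and acc_norm
```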
Note: The original implementation is provided [here](https://github.com/jumelet/multiblimp), and the [dataset repository](https://huggingface.co/datasets/jumelet/multiblimp) also links to a more flexible version of the implementation [here](https://github.com/catherinearnett/multiblimp). This implementation follows them as closely as possible. However, the original implementations normalize each sentence's log-probability by its number of tokens, which is not supported by the Language Model Evaluation Harness (see [[1](https://blog.eleuther.ai/multiple-choice-normalization/)], [[2](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md)], [[3](https://github.com/EleutherAI/lm-evaluation-harness/issues/1396)]). For this reason, the implementation provided here reports two metrics: `acc` (accuracy based on comparing the unnormalized log-probabilities of the grammatical and ungrammatical version of each sentence) and `acc_norm` (the same as `acc`, but with each sentence's log-probability normalized by its length in bytes).
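To make the difference between the two metrics concrete, here is an illustrative sketch (not the harness's internal code) of how a single minimal pair is scored under each, given the total log-probability of the grammatical and ungrammatical sentence:

```python
# Illustrative only: how acc and acc_norm differ for one minimal pair.
def score_pair(logp_good: float, logp_bad: float, good: str, bad: str):
    # acc: compare raw (unnormalized) sentence log-probabilities.
    acc = logp_good > logp_bad
    # acc_norm: divide each log-probability by the sentence's byte length
    # before comparing, so longer sentences are not penalized.
    acc_norm = (logp_good / len(good.encode("utf-8")) >
                logp_bad / len(bad.encode("utf-8")))
    return acc, acc_norm
```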
## Dataset Details
The table in the [Hugging Face Dataset Repo](https://huggingface.co/datasets/jumelet/multiblimp) lists the languages covered in MultiBLiMP and the number of items for each language.
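The per-language items can also be inspected directly from the Hugging Face Hub. The sketch below assumes the dataset exposes per-language subsets named by ISO 639-3 code and a default split; check the dataset card for the exact configuration and split names:

```python
# Sketch of loading one language's minimal pairs from the Hub.
# The configuration name ("eng") and split ("train") are assumptions;
# consult the dataset card for the actual layout.
from datasets import load_dataset

ds = load_dataset("jumelet/multiblimp", "eng", split="train")
print(len(ds))  # number of minimal pairs for this language
print(ds[0])    # one item: a grammatical/ungrammatical sentence pair
```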
## Citation

```
@misc{jumelet2025multiblimp10massivelymultilingual,
      title={MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs},
      author={Jaap Jumelet and Leonie Weissweiler and Arianna Bisazza},
      year={2025},
      eprint={2504.02768},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.02768},
}
```
## New Task Checklist
- [x] Is the task an existing benchmark in the literature?
- [x] Have you referenced the original paper that introduced the task?
- [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?