| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| [truthfulqa-multi](truthfulqa-multi/README.md) | A multilingual version of TruthfulQA, a QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English, Spanish, Catalan, Basque, Galician |
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA benchmark modeled after MMLU, written in Turkish and based on Turkish high-school exams. | Turkish |
| [turblimp_core](turblimp/README.md) | A benchmark evaluating language models' grammatical abilities in Turkish by comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Open-domain question answering over natural questions collected from web search queries (WebQuestions), with answers drawn from Freebase. | English |
# TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
### Paper
Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Abstract:
> TurBLiMP is the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models. The dataset covers 16 core grammatical phenomena in Turkish, with 1,000 minimal pairs per phenomenon.
### Citation

```
@misc{basar2025turblimp,
      title={TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs},
      author={Ezgi Ba{\c{s}}ar and Francesca Padovani and Jaap Jumelet and Arianna Bisazza},
      year={2025},
      eprint={2506.13487},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.13487}
}
```
### Groups, Tags, and Tasks
#### Groups
* `turblimp_core`: Runs all 16 'core' grammatical subtasks of TurBLiMP (the original release also includes additional experimental paradigms that have no single correct answer; these are not included here). See the usage sketch below.
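As a usage sketch, the whole group can be run through the harness's Python API via `lm_eval.simple_evaluate`; the checkpoint name below is only a placeholder:

```
import lm_eval

# Evaluate a Hugging Face model on all 16 core TurBLiMP subtasks.
# "gpt2" stands in for any pretrained checkpoint of interest.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["turblimp_core"],
)

# Per-subtask `acc` and `acc_norm` scores are keyed by task name.
print(results["results"])
```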
**Implementation Note:** The [original implementation](https://github.com/ezgibasar/TurBLiMP) normalizes sentence log-probabilities by length in tokens, which is not supported by the Language Model Evaluation Harness (see [[1](https://blog.eleuther.ai/multiple-choice-normalization/)], [[2](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md)], [[3](https://github.com/EleutherAI/lm-evaluation-harness/issues/1396)]). The implementation provided here therefore reports two metrics: `acc` (accuracy based on comparing the unnormalized log-probabilities of the correct and incorrect versions of each sentence) and `acc_norm` (the same comparison, but with each sentence's log-probability normalized by its length in bytes). The sketch below illustrates the difference.
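A minimal sketch of how the two metrics differ, assuming per-sentence log-probabilities are already available (e.g. from the harness's loglikelihood requests); the helper and its argument names are illustrative, not the harness's internal code:

```
def compare_pair(ll_good: float, ll_bad: float,
                 good: str, bad: str) -> tuple[bool, bool]:
    """Score one minimal pair; returns (acc, acc_norm) as booleans."""
    # acc: compare raw, unnormalized sentence log-probabilities.
    acc = ll_good > ll_bad
    # acc_norm: divide each log-probability by the sentence's length
    # in UTF-8 bytes before comparing (byte-length normalization).
    acc_norm = (ll_good / len(good.encode("utf-8"))
                > ll_bad / len(bad.encode("utf-8")))
    return acc, acc_norm
```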
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?