Commit d355eac0 authored by James A. Michaelov, committed by GitHub

Add TurBLiMP (#3219)

* add turblimp

* update general task readme

* add normalized accuracy
parent b0040ba0
@@ -157,6 +157,7 @@
| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| [truthfulqa-multi](truthfulqa-multi/README.md) | Is a multilingual version of TruthfulQA, a QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English, Spanish, Catalan, Basque, Galician |
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| [turblimp_core](turblimp/README.md) | A benchmark evaluating language models' grammatical capabilities in Turkish based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
......
# TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
## Paper
Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Abstract:
> TurBLiMP is the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models. The dataset covers 16 core grammatical phenomena in Turkish, with 1,000 minimal pairs per phenomenon.
Homepage: https://github.com/ezgibasar/TurBLiMP
### Citation
```bibtex
@misc{basar2025turblimpturkishbenchmarklinguistic,
title={TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs},
author={Ezgi Ba{\c{s}}ar and Francesca Padovani and Jaap Jumelet and Arianna Bisazza},
year={2025},
eprint={2506.13487},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.13487}
}
```
### Groups, Tags, and Tasks
#### Groups
* `turblimp_core`: Runs all 16 grammatical 'core' subtasks of TurBLiMP. The original release also includes additional experimental paradigms that have no correct answer; these are not included here.
#### Tasks
* `turblimp_anaphor_agreement`: Reflexive pronoun agreement violations
* `turblimp_argument_structure_transitive`: Case marking errors with transitive verbs
* `turblimp_argument_structure_ditransitive`: Case marking errors with ditransitive verbs
* `turblimp_binding`: Principle B violations in binding theory
* `turblimp_determiners`: Obligatory use of the indefinite article
* `turblimp_ellipsis`: Backward gapping with non-parallel word orders
* `turblimp_irregular_forms`: Incorrect aorist allomorph usage
* `turblimp_island_effects`: Wh-adjunct extraction from complex NPs
* `turblimp_nominalization`: Incorrect nominalization suffix selection
* `turblimp_npi_licensing`: Negative polarity items in non-negative contexts
* `turblimp_passives`: Unlicensed use of by-phrases in impersonal passives
* `turblimp_quantifiers`: Quantifier usage with bare nouns
* `turblimp_relative_clauses`: Incorrect case marking in relative clauses
* `turblimp_scrambling`: Illicit postverbal scrambling from embedded clauses
* `turblimp_subject_agreement`: Person/number agreement violations
* `turblimp_suspended_affixation`: Improper tense suffix suspension
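
Each task name above is the `turblimp_` prefix followed by the phenomenon name. As a rough sketch, assuming each task is configured by a three-line YAML file that names the dataset subset and includes a shared `_template_yaml`, the per-task configs could be generated as follows. The `turblimp_<phenomenon>.yaml` file names and the `render_task_config` helper are hypothetical and used only for illustration.

```python
# The 16 core phenomena, in the order listed above.
PHENOMENA = [
    "anaphor_agreement",
    "argument_structure_transitive",
    "argument_structure_ditransitive",
    "binding",
    "determiners",
    "ellipsis",
    "irregular_forms",
    "island_effects",
    "nominalization",
    "npi_licensing",
    "passives",
    "quantifiers",
    "relative_clauses",
    "scrambling",
    "subject_agreement",
    "suspended_affixation",
]


def render_task_config(phenomenon: str) -> str:
    """Render the YAML text for one task (hypothetical helper)."""
    return (
        f"dataset_name: {phenomenon}\n"
        "include: _template_yaml\n"
        f"task: turblimp_{phenomenon}\n"
    )


# Map each assumed file name to its rendered config text.
configs = {f"turblimp_{p}.yaml": render_task_config(p) for p in PHENOMENA}
```

Keeping the shared settings in one template means a change to the metrics or the dataset path only has to be made in one place.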
**Implementation Note:** The [original implementation](https://github.com/ezgibasar/TurBLiMP) normalizes sentence log-probabilities by the number of tokens, which the Language Model Evaluation Harness does not support (see [[1](https://blog.eleuther.ai/multiple-choice-normalization/)], [[2](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md)], [[3](https://github.com/EleutherAI/lm-evaluation-harness/issues/1396)]). For this reason, the implementation provided here reports two metrics: `acc` (accuracy based on comparing the unnormalized log-probabilities of the correct and incorrect versions of each sentence) and `acc_norm` (the same as `acc`, but with each sentence's log-probability normalized by its number of bytes).
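
To make the difference between the two metrics concrete, here is a minimal sketch of how one minimal pair might be scored. It follows the description in the note (normalization by UTF-8 byte count) and is not the harness's actual code; `score_pair` and the example inputs are hypothetical.

```python
def score_pair(logprob_good, logprob_bad, sentence_good, sentence_bad):
    """Return (acc, acc_norm) for one minimal pair.

    The grammatical sentence is the target. Ties go to the grammatical
    sentence, as an argmax over [good, bad] would resolve them.
    """
    # acc: compare the raw sentence log-probabilities.
    acc = 1.0 if logprob_good >= logprob_bad else 0.0

    # acc_norm: divide each log-probability by the sentence's UTF-8 byte
    # length before comparing (an assumption based on the note above).
    norm_good = logprob_good / len(sentence_good.encode("utf-8"))
    norm_bad = logprob_bad / len(sentence_bad.encode("utf-8"))
    acc_norm = 1.0 if norm_good >= norm_bad else 0.0
    return acc, acc_norm


# Placeholder strings and made-up log-probabilities: the "good" sentence is
# twice as long, so it loses on raw log-probability but wins per byte.
example = score_pair(-30.0, -20.0, "a" * 20, "b" * 10)
```

In this example `acc` is 0.0 and `acc_norm` is 1.0, which shows how the two metrics can disagree when the sentences in a pair differ in length.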
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
### Changelog
dataset_path: juletxara/turblimp
output_type: multiple_choice
test_split: train
doc_to_text: ""
target_delimiter: ""
doc_to_target: 0
doc_to_choice: "{{[sentence_good,sentence_bad]}}"
num_fewshot: 0
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0

dataset_name: anaphor_agreement
include: _template_yaml
task: turblimp_anaphor_agreement

dataset_name: argument_structure_ditransitive
include: _template_yaml
task: turblimp_argument_structure_ditransitive

dataset_name: argument_structure_transitive
include: _template_yaml
task: turblimp_argument_structure_transitive

dataset_name: binding
include: _template_yaml
task: turblimp_binding

dataset_name: determiners
include: _template_yaml
task: turblimp_determiners

dataset_name: ellipsis
include: _template_yaml
task: turblimp_ellipsis

dataset_name: irregular_forms
include: _template_yaml
task: turblimp_irregular_forms

dataset_name: island_effects
include: _template_yaml
task: turblimp_island_effects

dataset_name: nominalization
include: _template_yaml
task: turblimp_nominalization

dataset_name: npi_licensing
include: _template_yaml
task: turblimp_npi_licensing

dataset_name: passives
include: _template_yaml
task: turblimp_passives

dataset_name: quantifiers
include: _template_yaml
task: turblimp_quantifiers

dataset_name: relative_clauses
include: _template_yaml
task: turblimp_relative_clauses

dataset_name: scrambling
include: _template_yaml
task: turblimp_scrambling

dataset_name: subject_agreement
include: _template_yaml
task: turblimp_subject_agreement

dataset_name: suspended_affixation
include: _template_yaml
task: turblimp_suspended_affixation

group: turblimp_core
task:
  - turblimp_anaphor_agreement
  - turblimp_argument_structure_ditransitive
  - turblimp_argument_structure_transitive
  - turblimp_binding
  - turblimp_determiners
  - turblimp_ellipsis
  - turblimp_irregular_forms
  - turblimp_island_effects
  - turblimp_nominalization
  - turblimp_npi_licensing
  - turblimp_passives
  - turblimp_quantifiers
  - turblimp_relative_clauses
  - turblimp_scrambling
  - turblimp_subject_agreement
  - turblimp_suspended_affixation
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: false
  - metric: acc_norm
    aggregation: mean
    weight_by_size: false
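
The group config above asks for a plain mean over subtasks (`weight_by_size: false`). A minimal sketch of what that choice means, compared with a size-weighted mean, is given below. The `aggregate` helper, the subtask names, and all scores and sizes are made up for illustration and are not taken from the harness or from any evaluation run.

```python
from statistics import mean


def aggregate(subtask_scores, sizes=None):
    """Combine per-subtask scores into one group score (hypothetical helper).

    With sizes=None every subtask counts equally, mirroring
    weight_by_size: false. Otherwise each score is weighted by the
    number of examples in its subtask.
    """
    if sizes is None:
        return mean(subtask_scores.values())
    total = sum(sizes[name] for name in subtask_scores)
    weighted = sum(score * sizes[name] for name, score in subtask_scores.items())
    return weighted / total


# Made-up scores and sizes, chosen so the two schemes visibly differ.
scores = {"task_a": 1.0, "task_b": 0.5}
sizes = {"task_a": 1000, "task_b": 3000}

unweighted = aggregate(scores)        # (1.0 + 0.5) / 2
weighted = aggregate(scores, sizes)   # (1000 * 1.0 + 3000 * 0.5) / 4000
```

If every phenomenon really does contribute the same number of pairs, as the abstract's figure of 1,000 pairs per phenomenon suggests, the two schemes would give the same group score.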