Add ZhoBLiMP benchmark (#3218)

* add zhoblimp files
* correct group name
* fix group
* add normalized accuracy

Unverified Commit 1bd96448 authored Aug 21, 2025 by James A. Michaelov Committed by GitHub Aug 21, 2025
20 changed files
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -171,6 +171,7 @@
 | [xquad](xquad/README.md)                                                 | Cross-lingual Question Answering Dataset in multiple languages.                                                                                                                                                                                                                                                                        | Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese                         |
 | [xstorycloze](xstorycloze/README.md)                                     | Cross-lingual narrative understanding tasks to predict story endings in multiple languages.                                                                                                                                                                                                                                            | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese                             |
 | [xwinograd](xwinograd/README.md)                                         | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages.                                                                                                                                                                                                                                                  | English, French, Japanese, Portuguese, Russian, Chinese                                                                       |
+| [zhoblimp](zhoblimp/README.md)                                         | A benchmark evaluating language models' grammatical capabilities in Chinese based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences.                                                                                                                                                                                                                                                  | Chinese                                                                       |
 ## Multimodal Tasks
 | Task Family                  | Description                                                                                             | Modality    |

--- a/lm_eval/tasks/zhoblimp/BA_BEI_subj_drop.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_BEI_subj_drop.yaml
+dataset_name: BA_BEI_subj_drop
+include: _template_yaml
+task: zhoblimp_BA_BEI_subj_drop
--- a/lm_eval/tasks/zhoblimp/BA_deletion.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_deletion.yaml
+dataset_name: BA_deletion
+include: _template_yaml
+task: zhoblimp_BA_deletion
--- a/lm_eval/tasks/zhoblimp/BA_duplicate_argument.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_duplicate_argument.yaml
+dataset_name: BA_duplicate_argument
+include: _template_yaml
+task: zhoblimp_BA_duplicate_argument
--- a/lm_eval/tasks/zhoblimp/BA_inversion.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_inversion.yaml
+dataset_name: BA_inversion
+include: _template_yaml
+task: zhoblimp_BA_inversion
--- a/lm_eval/tasks/zhoblimp/BA_meiba.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_meiba.yaml
+dataset_name: BA_meiba
+include: _template_yaml
+task: zhoblimp_BA_meiba
--- a/lm_eval/tasks/zhoblimp/BA_negation.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_negation.yaml
+dataset_name: BA_negation
+include: _template_yaml
+task: zhoblimp_BA_negation
--- a/lm_eval/tasks/zhoblimp/BA_no_progressive.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_no_progressive.yaml
+dataset_name: BA_no_progressive
+include: _template_yaml
+task: zhoblimp_BA_no_progressive
--- a/lm_eval/tasks/zhoblimp/BA_no_stative_verb.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_no_stative_verb.yaml
+dataset_name: BA_no_stative_verb
+include: _template_yaml
+task: zhoblimp_BA_no_stative_verb
--- a/lm_eval/tasks/zhoblimp/BA_suo_adverbial_a.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_suo_adverbial_a.yaml
+dataset_name: BA_suo_adverbial_a
+include: _template_yaml
+task: zhoblimp_BA_suo_adverbial_a
--- a/lm_eval/tasks/zhoblimp/BA_suo_adverbial_b.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_suo_adverbial_b.yaml
+dataset_name: BA_suo_adverbial_b
+include: _template_yaml
+task: zhoblimp_BA_suo_adverbial_b
--- a/lm_eval/tasks/zhoblimp/BA_verb_le_a.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_verb_le_a.yaml
+dataset_name: BA_verb_le_a
+include: _template_yaml
+task: zhoblimp_BA_verb_le_a
--- a/lm_eval/tasks/zhoblimp/BA_verb_le_b.yaml
+++ b/lm_eval/tasks/zhoblimp/BA_verb_le_b.yaml
+dataset_name: BA_verb_le_b
+include: _template_yaml
+task: zhoblimp_BA_verb_le_b
--- a/lm_eval/tasks/zhoblimp/BEI_construction_a.yaml
+++ b/lm_eval/tasks/zhoblimp/BEI_construction_a.yaml
+dataset_name: BEI_construction_a
+include: _template_yaml
+task: zhoblimp_BEI_construction_a
--- a/lm_eval/tasks/zhoblimp/BEI_construction_b.yaml
+++ b/lm_eval/tasks/zhoblimp/BEI_construction_b.yaml
+dataset_name: BEI_construction_b
+include: _template_yaml
+task: zhoblimp_BEI_construction_b
--- a/lm_eval/tasks/zhoblimp/BEI_deletion.yaml
+++ b/lm_eval/tasks/zhoblimp/BEI_deletion.yaml
+dataset_name: BEI_deletion
+include: _template_yaml
+task: zhoblimp_BEI_deletion
--- a/lm_eval/tasks/zhoblimp/BEI_preposition.yaml
+++ b/lm_eval/tasks/zhoblimp/BEI_preposition.yaml
+dataset_name: BEI_preposition
+include: _template_yaml
+task: zhoblimp_BEI_preposition
--- a/lm_eval/tasks/zhoblimp/PN_numP_a.yaml
+++ b/lm_eval/tasks/zhoblimp/PN_numP_a.yaml
+dataset_name: PN_numP_a
+include: _template_yaml
+task: zhoblimp_PN_numP_a
--- a/lm_eval/tasks/zhoblimp/PN_numP_b.yaml
+++ b/lm_eval/tasks/zhoblimp/PN_numP_b.yaml
+dataset_name: PN_numP_b
+include: _template_yaml
+task: zhoblimp_PN_numP_b
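
Reviewer note: every per-paradigm file above follows the same three-line pattern (`dataset_name`, `include: _template_yaml`, `task: zhoblimp_<name>`). The sketch below shows how that pattern could be rendered from a list of paradigm names. The helper `render_subtask_yaml` is a hypothetical name introduced for illustration; nothing in this commit indicates that the files were produced by a script.

```python
def render_subtask_yaml(paradigm: str) -> str:
    """Return the YAML body for one subtask, mirroring the files in this diff.

    Hypothetical helper: it only reproduces the visible three-line pattern.
    """
    return (
        f"dataset_name: {paradigm}\n"
        "include: _template_yaml\n"
        f"task: zhoblimp_{paradigm}\n"
    )


# A few paradigm names taken from the file list above.
paradigms = ["BA_deletion", "BEI_preposition", "PN_numP_a"]

for name in paradigms:
    # Print instead of writing to disk so the sketch has no side effects.
    print(f"# {name}.yaml")
    print(render_subtask_yaml(name))
```

Keeping the shared settings in `_template_yaml` means each subtask file only has to state what differs: the dataset configuration and the task name.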
--- a/lm_eval/tasks/zhoblimp/README.md
+++ b/lm_eval/tasks/zhoblimp/README.md
+# ZhoBLiMP: A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese
+## Paper
+Title: `A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese`
+Paper: https://arxiv.org/pdf/2411.06096
+> Whether and how language models (LMs) acquire the syntax of natural languages has been widely evaluated under the minimal pair paradigm. However, a lack of wide-coverage benchmarks in languages other than English has constrained systematic investigations into the issue. Addressing it, we first introduce ZhoBLiMP, the most comprehensive benchmark of linguistic minimal pairs for Chinese to date, with 118 paradigms, covering 15 linguistic phenomena.
+Homepage: https://github.com/sjtu-compling/ZhoBLiMP
+### Citation
+```
+@article{liu2024zhoblimp,
+  title={{ZhoBLiMP}: A Systematic Assessment of Language Models with Linguistic Minimal Pairs in {C}hinese},
+  author={Liu, Yikang and Shen, Yeting and Zhu, Hongao and Xu, Lilong and Qian, Zhiheng and Song, Siyuan and Zhang, Kejia and Tang, Jialong and Zhang, Pei and Yang, Baosong and others},
+  journal={arXiv preprint arXiv:2411.06096},
+  year={2024}
+}
+```
+### Groups, Tags, and Tasks
+* `zhoblimp`: Runs all ZhoBLiMP subtasks and calculates mean performance.
+#### Implementation notes
+* **Length normalization:** The [original implementation](https://github.com/sjtu-compling/ZhoBLiMP) applies length normalization using a custom function that the Language Model Evaluation Harness does not support. For this reason, this implementation reports both unnormalized accuracy (`acc`) and byte-length-normalized accuracy (`acc_norm`).
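
Reviewer note: the difference between `acc` and `acc_norm` in the implementation note above can be illustrated with a minimal sketch. It assumes that the unnormalized metric picks the sentence with the highest total log-likelihood, while the normalized one divides each log-likelihood by the sentence's UTF-8 byte length first. The function names, example sentences, and scores are made up for illustration; they are neither the harness's code nor items from the dataset.

```python
def pick_acc(loglikelihoods):
    """Index of the sentence with the highest raw log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_acc_norm(loglikelihoods, sentences):
    """Index of the sentence with the highest log-likelihood per UTF-8 byte."""
    def per_byte(i):
        return loglikelihoods[i] / len(sentences[i].encode("utf-8"))
    return max(range(len(sentences)), key=per_byte)


# Illustrative minimal pair: index 0 is the intended grammatical sentence.
sentences = ["他把书看完了。", "他把书看。"]   # 21 and 15 UTF-8 bytes
loglikelihoods = [-21.0, -18.0]                # hypothetical model scores

raw_choice = pick_acc(loglikelihoods)                  # 1: shorter sentence wins
norm_choice = pick_acc_norm(loglikelihoods, sentences)  # 0: -1.0 vs -1.2 per byte

print("acc correct:", raw_choice == 0)
print("acc_norm correct:", norm_choice == 0)
```

In this made-up case the longer grammatical sentence loses on raw log-likelihood but wins once the score is divided by byte length, which is why the two metrics can disagree on pairs of unequal length.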
+### Checklist
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+### Changelog