Commit 1bd96448 authored by James A. Michaelov, committed by GitHub

Add ZhoBLiMP benchmark (#3218)

* add zhoblimp files

* correct group name

* fix group

* add normalized accuracy
parent 51d8a192
@@ -171,6 +171,7 @@
| [xquad](xquad/README.md) | Cross-lingual Question Answering Dataset in multiple languages. | Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese |
| [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
| [zhoblimp](zhoblimp/README.md) | A benchmark evaluating language models' grammatical capabilities in Chinese based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Chinese |
## Multimodal Tasks
| Task Family | Description | Modality |
......
dataset_name: BA_BEI_subj_drop
include: _template_yaml
task: zhoblimp_BA_BEI_subj_drop
dataset_name: BA_deletion
include: _template_yaml
task: zhoblimp_BA_deletion
dataset_name: BA_duplicate_argument
include: _template_yaml
task: zhoblimp_BA_duplicate_argument
dataset_name: BA_inversion
include: _template_yaml
task: zhoblimp_BA_inversion
dataset_name: BA_meiba
include: _template_yaml
task: zhoblimp_BA_meiba
dataset_name: BA_negation
include: _template_yaml
task: zhoblimp_BA_negation
dataset_name: BA_no_progressive
include: _template_yaml
task: zhoblimp_BA_no_progressive
dataset_name: BA_no_stative_verb
include: _template_yaml
task: zhoblimp_BA_no_stative_verb
dataset_name: BA_suo_adverbial_a
include: _template_yaml
task: zhoblimp_BA_suo_adverbial_a
dataset_name: BA_suo_adverbial_b
include: _template_yaml
task: zhoblimp_BA_suo_adverbial_b
dataset_name: BA_verb_le_a
include: _template_yaml
task: zhoblimp_BA_verb_le_a
dataset_name: BA_verb_le_b
include: _template_yaml
task: zhoblimp_BA_verb_le_b
dataset_name: BEI_construction_a
include: _template_yaml
task: zhoblimp_BEI_construction_a
dataset_name: BEI_construction_b
include: _template_yaml
task: zhoblimp_BEI_construction_b
dataset_name: BEI_deletion
include: _template_yaml
task: zhoblimp_BEI_deletion
dataset_name: BEI_preposition
include: _template_yaml
task: zhoblimp_BEI_preposition
dataset_name: PN_numP_a
include: _template_yaml
task: zhoblimp_PN_numP_a
dataset_name: PN_numP_b
include: _template_yaml
task: zhoblimp_PN_numP_b
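The per-paradigm configs above all follow the same three-line pattern, so they lend themselves to being generated rather than written by hand. The following is a minimal sketch of such a generator, not the script used in this commit: the function names, the output file naming (`<paradigm>.yaml`), and the short `PARADIGMS` list (a subset copied from the configs above) are assumptions made for illustration.

```python
from pathlib import Path

# A few paradigm names copied from the configs above; the full benchmark
# has many more. This short list is only for illustration.
PARADIGMS = ["BA_BEI_subj_drop", "BA_deletion", "BEI_construction_a", "PN_numP_a"]


def render_config(paradigm: str) -> str:
    """Return the three-line YAML body for one paradigm."""
    return (
        f"dataset_name: {paradigm}\n"
        "include: _template_yaml\n"
        f"task: zhoblimp_{paradigm}\n"
    )


def write_configs(paradigms, out_dir):
    """Write one YAML file per paradigm and return the written paths.

    The file naming scheme (<paradigm>.yaml) is a hypothetical choice.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name in paradigms:
        path = out / f"{name}.yaml"
        path.write_text(render_config(name), encoding="utf-8")
        written.append(path)
    return written
```

Calling `write_configs(PARADIGMS, "zhoblimp")` would create one file per listed paradigm, each including the shared `_template_yaml`.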
# ZhoBLiMP: A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese
## Paper
Title: `A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese`
Paper: https://arxiv.org/pdf/2411.06096
> Whether and how language models (LMs) acquire the syntax of natural languages has been widely evaluated under the minimal pair paradigm. However, a lack of wide-coverage benchmarks in languages other than English has constrained systematic investigations into the issue. Addressing it, we first introduce ZhoBLiMP, the most comprehensive benchmark of linguistic minimal pairs for Chinese to date, with 118 paradigms, covering 15 linguistic phenomena.
Homepage: https://github.com/sjtu-compling/ZhoBLiMP
### Citation
```
@article{liu2024zhoblimp,
title={Zhoblimp: a systematic assessment of language models with linguistic minimal pairs in chinese},
author={Liu, Yikang and Shen, Yeting and Zhu, Hongao and Xu, Lilong and Qian, Zhiheng and Song, Siyuan and Zhang, Kejia and Tang, Jialong and Zhang, Pei and Yang, Baosong and others},
journal={arXiv preprint arXiv:2411.06096},
year={2024}
}
```
### Groups, Tags, and Tasks
* `zhoblimp`: Runs all ZhoBLiMP subtasks and calculates mean performance.
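
As a rough illustration of what "mean performance" over subtasks can look like, the sketch below computes both an unweighted (macro) mean and a size-weighted mean of per-subtask accuracies. Which of the two the `zhoblimp` group actually uses depends on its group config, which is not shown here; the scores and subtask sizes are made-up numbers, and the function names are hypothetical.

```python
def macro_mean(scores):
    """Unweighted mean: every subtask counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_mean(scores, sizes):
    """Mean weighted by the number of examples in each subtask."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Hypothetical per-subtask accuracies and example counts (not real results).
scores = {
    "zhoblimp_BA_deletion": 0.75,
    "zhoblimp_BEI_deletion": 0.5,
    "zhoblimp_PN_numP_a": 1.0,
}
sizes = {
    "zhoblimp_BA_deletion": 100,
    "zhoblimp_BEI_deletion": 300,
    "zhoblimp_PN_numP_a": 100,
}

print(macro_mean(scores))
print(weighted_mean(scores, sizes))
```

The two means diverge whenever subtasks differ in size, so it is worth checking the group config before comparing aggregate numbers across implementations.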
#### Implementation notes
* **Length normalization:** The [original implementation](https://github.com/sjtu-compling/ZhoBLiMP) normalizes for sentence length using a custom function that the Language Model Evaluation Harness does not support. The implementation provided here therefore reports both unnormalized accuracy (`acc`) and byte-length-normalized accuracy (`acc_norm`).
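
To make the difference between the two metrics concrete, here is a minimal sketch that scores minimal pairs both ways. It takes the description above at face value (dividing each sentence's log-likelihood by its UTF-8 byte length); the harness's exact length computation may differ. The sentences and log-likelihood values are placeholders invented for illustration, not dataset items or model outputs, and all function names are hypothetical.

```python
def byte_len(text: str) -> int:
    """Length of the sentence in UTF-8 bytes."""
    return len(text.encode("utf-8"))


def is_correct(good_ll: float, bad_ll: float) -> bool:
    """Unnormalized: the grammatical sentence must get the higher log-likelihood."""
    return good_ll > bad_ll


def is_correct_norm(good: str, good_ll: float, bad: str, bad_ll: float) -> bool:
    """Normalized: compare log-likelihood per byte instead of the raw total."""
    return good_ll / byte_len(good) > bad_ll / byte_len(bad)


def score(pairs):
    """Return acc and acc_norm over (good, good_ll, bad, bad_ll) tuples."""
    n = len(pairs)
    acc = sum(is_correct(g_ll, b_ll) for _, g_ll, _, b_ll in pairs) / n
    acc_norm = sum(is_correct_norm(g, g_ll, b, b_ll) for g, g_ll, b, b_ll in pairs) / n
    return {"acc": acc, "acc_norm": acc_norm}


# Placeholder pairs with made-up log-likelihoods.
PAIRS = [
    # The longer "good" sentence has a lower total but a higher per-byte score.
    ("他把书看了", -20.0, "他把书看", -18.0),
    # Equal lengths: both metrics agree.
    ("我喜欢猫", -10.0, "我猫喜欢", -14.0),
]

print(score(PAIRS))
```

In the first pair the grammatical sentence is longer, so its total log-likelihood is lower even though its per-byte score is higher. This is the kind of case where `acc` and `acc_norm` disagree.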
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
### Changelog