Unverified Commit 0b45cc71 authored by Weihao XUAN, committed by GitHub

Update MMLU-ProX task (#3174)

* update MMLU_ProX

* update MMLU_ProX

* cleanup code by pre-commit
parent 05b37f20
@@ -113,7 +113,7 @@ provided to the individual README.md files for each subfolder.
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu-pro-plus](mmlu-pro-plus/README.md) | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
@@ -4,21 +4,29 @@
Title: `MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation`
Abstract: `Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities.
This dual limitation makes it challenging to comprehensively assess LLMs' performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark.
Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance.
Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs.
The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%.
Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
We plan to continuously expand MMLU-ProX by incorporating additional languages to further enhance its coverage and utility for the global AI research community.`
Homepage: https://mmluprox.github.io
Huggingface:
- https://huggingface.co/datasets/li-lab/MMLU-ProX
- https://huggingface.co/datasets/li-lab/MMLU-ProX-Lite
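
Both datasets can be pulled directly from the Hub. A minimal loading sketch, assuming the Hugging Face `datasets` library is installed; the per-language config name (`af`) and the `test` split are taken from the task configs further below:

```python
# Minimal loading sketch; assumes the `datasets` library is installed.
# The "af" config name and the "test" split come from the task configs below.
from datasets import load_dataset

prox_af = load_dataset("li-lab/MMLU-ProX", "af", split="test")
lite_af = load_dataset("li-lab/MMLU-ProX-Lite", "af", split="test")
print(len(prox_af), len(lite_af))
```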
### Citation
```bibtex
@article{xuan2025mmlu,
title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Aosong Feng and Dairui Liu and Yun Xing and Junjue Wang and Fan Gao and Jinghui Lu and Yuang Jiang and Huitao Li and Xin Li and Kunyu Yu and Ruihai Dong and Shangding Gu and Yuekang Li and Xiaofei Xie and Felix Juefei-Xu and Foutse Khomh and Osamu Yoshie and Qingyu Chen and Douglas Teodoro and Nan Liu and Randy Goebel and Lei Ma and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
journal={arXiv preprint arXiv:2503.10497},
year={2025}
}
```
@@ -26,22 +34,39 @@ Homepage: https://mmluprox.github.io/
#### Groups
* `mmlu_prox_{lang}`: 'All 14 subjects of the mmlu_prox dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
* `mmlu_prox_lite_{lang}`: 'All 14 subjects of the mmlu_prox_lite dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
Available options for `{lang}` (a usage sketch follows this list):
- af
- ar
- bn
- cs
- de
- en
- es
- fr
- hi
- hu
- id
- it
- ja
- ko
- mr
- ne
- pt
- ru
- sr
- sw
- te
- th
- uk
- ur
- vi
- wo
- yo
- zh
- zu
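
A minimal sketch of running one of these groups through the harness's Python API, assuming lm-evaluation-harness (>= 0.4) is installed; the model name is a placeholder, and any `{lang}` code from the list above can be substituted into the group name:

```python
# Minimal sketch; assumes lm-evaluation-harness (>= 0.4) is installed.
# The pretrained model below is a placeholder; swap in any HF model and any
# {lang} code from the list above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct",
    tasks=["mmlu_prox_lite_af"],  # or "mmlu_prox_af" for the full set
    batch_size=8,
)
print(results["results"])
```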
#### Tasks
@@ -61,6 +86,23 @@ The following tasks evaluate subjects in the mmlu_prox dataset
- `mmlu_prox_{lang}_physics`
- `mmlu_prox_{lang}_psychology`
The following tasks evaluate subjects in the mmlu_prox_lite dataset (a task-name expansion sketch follows the list)
- `mmlu_prox_lite_{lang}_biology`
- `mmlu_prox_lite_{lang}_business`
- `mmlu_prox_lite_{lang}_chemistry`
- `mmlu_prox_lite_{lang}_computer_science`
- `mmlu_prox_lite_{lang}_economics`
- `mmlu_prox_lite_{lang}_engineering`
- `mmlu_prox_lite_{lang}_health`
- `mmlu_prox_lite_{lang}_history`
- `mmlu_prox_lite_{lang}_law`
- `mmlu_prox_lite_{lang}_math`
- `mmlu_prox_lite_{lang}_other`
- `mmlu_prox_lite_{lang}_philosophy`
- `mmlu_prox_lite_{lang}_physics`
- `mmlu_prox_lite_{lang}_psychology`
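
Since every language shares the same 14 subjects, the per-subject task names can be expanded mechanically. A small sketch, with the subject list taken from this README:

```python
# Task names follow "{prefix}_{lang}_{subject}"; this expands the pattern for a
# few of the language codes listed above.
SUBJECTS = [
    "biology", "business", "chemistry", "computer_science", "economics",
    "engineering", "health", "history", "law", "math", "other",
    "philosophy", "physics", "psychology",
]

def task_names(prefix: str, langs: list[str]) -> list[str]:
    """Build names such as 'mmlu_prox_lite_af_biology'."""
    return [f"{prefix}_{lang}_{subject}" for lang in langs for subject in SUBJECTS]

print(task_names("mmlu_prox_lite", ["af", "sw", "zu"]))
```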
### Checklist
For adding novel benchmarks/datasets to the library:
dataset_path: li-lab/MMLU-ProX-Lite
dataset_name: af
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'Die antwoord is \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "Vraag:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0
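
The block above is the shared Afrikaans template for the Lite split; the block that follows (`dataset_path: li-lab/MMLU-ProX`) is its full-set counterpart, `_af_template_yaml`, which the per-subject `mmlu_prox_af_*` configs further below include. Both rely on the same `custom-extract` regex filter to score generations. A minimal sketch of that extraction step, with an invented generation string:

```python
# Sketch of the "custom-extract" filter: pull the chosen option letter out of a
# generated Afrikaans answer. The example generation below is invented.
import re

ANSWER_RE = re.compile(r"Die antwoord is \(?([ABCDEFGHIJ])\)?")

generation = "Fotosintese vind in die chloroplaste plaas. Die antwoord is (C)."
match = ANSWER_RE.search(generation)
prediction = match.group(1) if match else ""
print(prediction)  # -> "C"; exact_match then compares this against `answer`
```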
dataset_path: li-lab/MMLU-ProX
dataset_name: af
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'Die antwoord is \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "Vraag:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0
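
Both templates delegate prompt construction to `utils.doc_to_text` and `utils.fewshot_to_text`, which are not shown in this diff. Purely as an illustration of the expected shape; the field names `question` and `options` are assumptions, not this folder's confirmed schema:

```python
# Hypothetical prompt-formatting sketch; the real logic lives in this folder's
# utils.doc_to_text / utils.fewshot_to_text. Field names ("question", "options")
# are assumptions about the dataset schema, not taken from this diff.
LETTERS = "ABCDEFGHIJ"

def doc_to_text_sketch(doc: dict) -> str:
    """Render a question plus its up-to-ten options in MMLU-Pro style."""
    lines = [doc["question"]]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(doc["options"])]
    return "\n".join(lines)
```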
group: mmlu_prox_af
task:
- mmlu_prox_af_biology
- mmlu_prox_af_business
- mmlu_prox_af_chemistry
- mmlu_prox_af_computer_science
- mmlu_prox_af_economics
- mmlu_prox_af_engineering
- mmlu_prox_af_health
- mmlu_prox_af_history
- mmlu_prox_af_law
- mmlu_prox_af_math
- mmlu_prox_af_other
- mmlu_prox_af_philosophy
- mmlu_prox_af_physics
- mmlu_prox_af_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 0.0
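
With `weight_by_size: true`, the group score is the mean of per-subject `exact_match` values weighted by each subject's example count. A toy sketch with invented numbers:

```python
# Toy illustration of `weight_by_size: true`: the group metric is the mean of
# per-subject exact_match scores weighted by subject size. Numbers are invented.
scores = {"biology": 0.62, "math": 0.41, "law": 0.35}
sizes = {"biology": 717, "math": 1351, "law": 1101}

weighted = sum(scores[s] * sizes[s] for s in scores) / sum(sizes.values())
print(round(weighted, 4))
```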
group: mmlu_prox_lite_af
task:
- mmlu_prox_lite_af_biology
- mmlu_prox_lite_af_business
- mmlu_prox_lite_af_chemistry
- mmlu_prox_lite_af_computer_science
- mmlu_prox_lite_af_economics
- mmlu_prox_lite_af_engineering
- mmlu_prox_lite_af_health
- mmlu_prox_lite_af_history
- mmlu_prox_lite_af_law
- mmlu_prox_lite_af_math
- mmlu_prox_lite_af_other
- mmlu_prox_lite_af_philosophy
- mmlu_prox_lite_af_physics
- mmlu_prox_lite_af_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 0.0
description: 'Hier is ''n multikeusevraag oor Biologie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_biology
task_alias: biology
process_docs: !function utils.process_biology
description: 'Hier is ''n multikeusevraag oor Besigheid (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_business
task_alias: business
process_docs: !function utils.process_business
description: 'Hier is ''n multikeusevraag oor Chemie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_chemistry
task_alias: chemistry
process_docs: !function utils.process_chemistry
description: 'Hier is ''n multikeusevraag oor Rekenaarwetenskap (met antwoorde). Dink
asseblief stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X
die letter van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_computer_science
task_alias: computer_science
process_docs: !function utils.process_computer_science
description: 'Hier is ''n multikeusevraag oor Ekonomie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_economics
task_alias: economics
process_docs: !function utils.process_economics
description: 'Hier is ''n multikeusevraag oor Ingenieurswese (met antwoorde). Dink
asseblief stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X
die letter van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_engineering
task_alias: engineering
process_docs: !function utils.process_engineering
description: 'Hier is ''n multikeusevraag oor Gesondheid (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_health
task_alias: health
process_docs: !function utils.process_health
description: 'Hier is ''n multikeusevraag oor Geskiedenis (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_history
task_alias: history
process_docs: !function utils.process_history
description: 'Hier is ''n multikeusevraag oor Regte (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_law
task_alias: law
process_docs: !function utils.process_law
description: 'Hier is ''n multikeusevraag oor Wiskunde (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_math
task_alias: math
process_docs: !function utils.process_math
description: 'Hier is ''n multikeusevraag oor Ander (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_other
task_alias: other
process_docs: !function utils.process_other
description: 'Hier is ''n multikeusevraag oor Filosofie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_philosophy
task_alias: philosophy
process_docs: !function utils.process_philosophy
description: 'Hier is ''n multikeusevraag oor Fisika (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_physics
task_alias: physics
process_docs: !function utils.process_physics
description: 'Hier is ''n multikeusevraag oor Sielkunde (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_psychology
task_alias: psychology
process_docs: !function utils.process_psychology
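
The per-subject `process_docs` hooks (`utils.process_biology`, `utils.process_business`, ...) are not shown in this diff; since each language config exposes all 14 subjects in a single split, they presumably filter rows down to one subject. A hypothetical sketch, with the `category` field name assumed:

```python
# Hypothetical sketch of the per-subject process_docs hooks (process_biology,
# process_business, ...): each is assumed to keep only the rows of its subject.
# The "category" field name is an assumption, not confirmed by this diff.
import datasets

def make_subject_filter(subject: str):
    def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
        return dataset.filter(lambda doc: doc["category"] == subject)
    return process_docs

process_biology_sketch = make_subject_filter("biology")
```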