Commit abd17276 authored by Baber

Merge branch 'smolrefact' into tasklist

# Conflicts:
#	lm_eval/__main__.py
#	lm_eval/api/group.py
#	lm_eval/api/task.py
#	lm_eval/evaluator_utils.py
#	lm_eval/tasks/__init__.py
#	lm_eval/utils.py
#	pyproject.toml
parents 00afd536 70314843
"dataset_name": "us_foreign_policy" "dataset_name": "us_foreign_policy"
"description": "The following are questions (with answers) about us\ "description": "The following are questions (with answers) about us\
\ foreign policy.\n\n" \ foreign policy.\n\n"
"tag": "mmlu_continuation_social_sciences" "tag": "mmlu_social_sciences_continuation"
"include": "_continuation_template_yaml" "include": "_continuation_template_yaml"
"task": "mmlu_continuation_us_foreign_policy" "task": "mmlu_us_foreign_policy_continuation"
"dataset_name": "virology" "dataset_name": "virology"
"description": "The following are questions (with answers) about virology.\n\ "description": "The following are questions (with answers) about virology.\n\
\n" \n"
"tag": "mmlu_continuation_other" "tag": "mmlu_other_continuation"
"include": "_continuation_template_yaml" "include": "_continuation_template_yaml"
"task": "mmlu_continuation_virology" "task": "mmlu_virology_continuation"
"dataset_name": "world_religions" "dataset_name": "world_religions"
"description": "The following are questions (with answers) about world\ "description": "The following are questions (with answers) about world\
\ religions.\n\n" \ religions.\n\n"
"tag": "mmlu_continuation_humanities" "tag": "mmlu_humanities_continuation"
"include": "_continuation_template_yaml" "include": "_continuation_template_yaml"
"task": "mmlu_continuation_world_religions" "task": "mmlu_world_religions_continuation"
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split dataset_path: cais/mmlu
validation_split: validation validation_split: validation
test_split: test test_split: test
fewshot_config: fewshot_config:
......
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split dataset_path: cais/mmlu
validation_split: validation validation_split: validation
fewshot_split: dev fewshot_split: dev
output_type: generate_until output_type: generate_until
......
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split dataset_path: cais/mmlu
test_split: test test_split: test
fewshot_split: dev fewshot_split: dev
fewshot_config: fewshot_config:
......
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split dataset_path: cais/mmlu
test_split: test test_split: test
fewshot_split: dev fewshot_split: dev
fewshot_config: fewshot_config:
......
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split dataset_path: cais/mmlu
test_split: test test_split: test
fewshot_split: dev fewshot_split: dev
fewshot_config: fewshot_config:
......
...@@ -4,21 +4,29 @@ ...@@ -4,21 +4,29 @@
Title: `MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation` Title: `MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation`
Abstract: `Traditional benchmarks like MMLU and MMLU-Pro focus primarily on single-language evaluation, limiting their ability to assess language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark that builds upon MMLU-Pro by covering multiple typologically diverse languages with approximately 11,829 questions per language.` Abstract: `Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities.
This dual limitation makes it challenging to comprehensively assess LLMs' performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark.
Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance.
Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs.
The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%.
Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
We plan to continuously expand MMLU-ProX by incorporating additional languages to further enhance its coverage and utility for the global AI research community.`
Homepage: https://mmluprox.github.io/ Homepage: https://mmluprox.github.io
Huggingface:
- https://huggingface.co/datasets/li-lab/MMLU-ProX
- https://huggingface.co/datasets/li-lab/MMLU-ProX-Lite
### Citation ### Citation
```bibtex ```bibtex
@misc{mmluprox, @article{xuan2025mmlu,
title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation}, title={Mmlu-prox: A multilingual benchmark for advanced large language model evaluation},
author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Yun Xing and Junjue Wang and Huitao Li and Xin Li and Kunyu Yu and Nan Liu and Qingyu Chen and Douglas Teodoro and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li}, author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Aosong Feng and Dairui Liu and Yun Xing and Junjue Wang and Fan Gao and Jinghui Lu and Yuang Jiang and Huitao Li and Xin Li and Kunyu Yu and Ruihai Dong and Shangding Gu and Yuekang Li and Xiaofei Xie and Felix Juefei-Xu and Foutse Khomh and Osamu Yoshie and Qingyu Chen and Douglas Teodoro and Nan Liu and Randy Goebel and Lei Ma and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
year={2025}, journal={arXiv preprint arXiv:2503.10497},
eprint={2503.10497}, year={2025}
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.10497},
} }
``` ```
...@@ -26,22 +34,39 @@ Homepage: https://mmluprox.github.io/ ...@@ -26,22 +34,39 @@ Homepage: https://mmluprox.github.io/
#### Groups #### Groups
* `mmlu_pro_{lang}`: 'All 14 subjects of the mmlu_pro_prox dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation' * `mmlu_pro_{lang}`: 'All 14 subjects of the mmlu_prox dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
* `mmlu_prox_lite_{lang}`: 'All 14 subjects of the mmlu_prox_lite dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
Available lang: Available options for `{lang}`:
- af
- ar - ar
- bn - bn
- cs
- de - de
- en - en
- es - es
- fr - fr
- hi - hi
- hu
- id
- it
- ja - ja
- ko - ko
- mr
- ne
- pt - pt
- ru
- sr
- sw - sw
- te
- th - th
- uk
- ur
- vi
- wo
- yo
- zh - zh
- zu
#### Tasks #### Tasks
...@@ -61,6 +86,23 @@ The following tasks evaluate subjects in the mmlu_prox dataset ...@@ -61,6 +86,23 @@ The following tasks evaluate subjects in the mmlu_prox dataset
- `mmlu_prox_{lang}_physics` - `mmlu_prox_{lang}_physics`
- `mmlu_prox_{lang}_psychology` - `mmlu_prox_{lang}_psychology`
The following tasks evaluate subjects in the mmlu_prox_lite dataset
- `mmlu_prox_lite_{lang}_biology`
- `mmlu_prox_lite_{lang}_business`
- `mmlu_prox_lite_{lang}_chemistry`
- `mmlu_prox_lite_{lang}_computer_science`
- `mmlu_prox_lite_{lang}_economics`
- `mmlu_prox_lite_{lang}_engineering`
- `mmlu_prox_lite_{lang}_health`
- `mmlu_prox_lite_{lang}_history`
- `mmlu_prox_lite_{lang}_law`
- `mmlu_prox_lite_{lang}_math`
- `mmlu_prox_lite_{lang}_other`
- `mmlu_prox_lite_{lang}_philosophy`
- `mmlu_prox_lite_{lang}_physics`
- `mmlu_prox_lite_{lang}_psychology`
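
The task names above follow a fixed `{prefix}_{lang}_{subject}` pattern, so the full list for a language can be generated rather than typed out. The sketch below is illustrative only: `SUBJECTS` and `task_names` are hypothetical helpers, not part of the harness, and the subject list is copied from the README entries above.

```python
# Hypothetical helper: enumerate MMLU-ProX task names for one language.
# The 14 subjects are taken from the README task list.
SUBJECTS = [
    "biology", "business", "chemistry", "computer_science", "economics",
    "engineering", "health", "history", "law", "math", "other",
    "philosophy", "physics", "psychology",
]


def task_names(lang: str, lite: bool = False) -> list[str]:
    """Return the per-subject task names for `lang` (full or lite variant)."""
    prefix = "mmlu_prox_lite" if lite else "mmlu_prox"
    return [f"{prefix}_{lang}_{subject}" for subject in SUBJECTS]


if __name__ == "__main__":
    for name in task_names("af", lite=True):
        print(name)
```

The generated names should match the `task:` entries listed in the per-language group files.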
### Checklist ### Checklist
For adding novel benchmarks/datasets to the library: For adding novel benchmarks/datasets to the library:
......
dataset_path: li-lab/MMLU-ProX-Lite
dataset_name: af
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'Die antwoord is \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "Vraag:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0
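
To make the `custom-extract` filter above concrete, here is a minimal sketch of what the `regex` step followed by `take_first` is expected to do to a model response. It uses only Python's `re` module and is not the harness's own filter code; `extract_answer`, `apply_filters`, and the `[invalid]` fallback string are assumptions made for illustration.

```python
import re

# Same pattern as `regex_pattern` in the config: the parentheses around
# the answer letter are optional, and the letter is the captured group.
ANSWER_PATTERN = re.compile(r"Die antwoord is \(?([ABCDEFGHIJ])\)?")


def extract_answer(response: str, fallback: str = "[invalid]") -> str:
    """Return the first answer letter found in `response`, else `fallback`."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[0] if matches else fallback


def apply_filters(responses: list[str]) -> str:
    """Assumed pipeline: extract from every response, then keep the first."""
    extracted = [extract_answer(r) for r in responses]
    return extracted[0]


if __name__ == "__main__":
    print(apply_filters(["Stap 1: ... Die antwoord is (C)."]))
```

The extracted letter is then what the `exact_match` metric compares against the `answer` field.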
dataset_path: li-lab/MMLU-ProX
dataset_name: af
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'Die antwoord is \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "Vraag:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0
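
The `until` list under `generation_kwargs` names the strings at which generation should stop. As a rough sketch of the intended effect (not the harness's implementation, which may stop inside the model backend instead), the hypothetical helper `truncate_at_stop` below cuts a response at the earliest stop sequence it contains.

```python
# Stop sequences copied from the `until` list in the config above.
UNTIL = ["</s>", "Q:", "Vraag:", "<|im_end|>"]


def truncate_at_stop(text: str, until: list[str]) -> str:
    """Cut `text` just before the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    raw = "Die antwoord is (A).\n\nVraag: Wat is die volgende vraag?"
    print(repr(truncate_at_stop(raw, UNTIL)))
```

Including the localized `Vraag:` marker next to `Q:` keeps the model from running on into a new few-shot style question in Afrikaans.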
group: mmlu_prox_af
task:
- mmlu_prox_af_biology
- mmlu_prox_af_business
- mmlu_prox_af_chemistry
- mmlu_prox_af_computer_science
- mmlu_prox_af_economics
- mmlu_prox_af_engineering
- mmlu_prox_af_health
- mmlu_prox_af_history
- mmlu_prox_af_law
- mmlu_prox_af_math
- mmlu_prox_af_other
- mmlu_prox_af_philosophy
- mmlu_prox_af_physics
- mmlu_prox_af_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 0.0
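
With `weight_by_size: true`, the group score is expected to be a mean of the subtask `exact_match` scores weighted by how many documents each subtask has, so large subjects count for more than small ones. A minimal sketch of that arithmetic follows; `size_weighted_mean` is a hypothetical name and the numbers are made up for illustration, not real results.

```python
def size_weighted_mean(scores: list[float], sizes: list[int]) -> float:
    """Mean of per-subtask scores, each weighted by its document count."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one subtask must contain documents")
    return sum(score * n for score, n in zip(scores, sizes)) / total


if __name__ == "__main__":
    # Illustrative values only: two subtasks with 100 and 300 documents.
    print(size_weighted_mean([0.5, 1.0], [100, 300]))
```

An unweighted mean of the same two scores would be 0.75; weighting by size pulls the result toward the larger subtask.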
group: mmlu_prox_lite_af
task:
- mmlu_prox_lite_af_biology
- mmlu_prox_lite_af_business
- mmlu_prox_lite_af_chemistry
- mmlu_prox_lite_af_computer_science
- mmlu_prox_lite_af_economics
- mmlu_prox_lite_af_engineering
- mmlu_prox_lite_af_health
- mmlu_prox_lite_af_history
- mmlu_prox_lite_af_law
- mmlu_prox_lite_af_math
- mmlu_prox_lite_af_other
- mmlu_prox_lite_af_philosophy
- mmlu_prox_lite_af_physics
- mmlu_prox_lite_af_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 0.0
description: 'Hier is ''n multikeusevraag oor Biologie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_biology
task_alias: biology
process_docs: !function utils.process_biology
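
Each per-subject file is small because it pulls its shared settings from `_af_template_yaml` through `include`. The sketch below models that as a shallow dictionary merge in which the subject file's keys override the template's. This is an assumption about the merge semantics, stated for illustration; `resolve_include` is a hypothetical helper and the dictionaries hold only a subset of the keys shown in the diff.

```python
def resolve_include(template: dict, task_config: dict) -> dict:
    """Shallow merge: start from the template, let the task file override."""
    merged = dict(template)
    # Assumption: the `include` key itself is not kept in the final config.
    merged.update({k: v for k, v in task_config.items() if k != "include"})
    return merged


# Subset of the shared template values shown in the diff.
template = {
    "dataset_path": "li-lab/MMLU-ProX",
    "dataset_name": "af",
    "output_type": "generate_until",
    "num_fewshot": 5,
}

# Subset of the biology task file shown in the diff.
biology = {
    "include": "_af_template_yaml",
    "task": "mmlu_prox_af_biology",
    "task_alias": "biology",
}

merged = resolve_include(template, biology)

if __name__ == "__main__":
    for key, value in merged.items():
        print(f"{key}: {value}")
```

Under this model, adding a new subject only requires the subject-specific keys (`description`, `task`, `task_alias`, `process_docs`), with everything else inherited.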
description: 'Hier is ''n multikeusevraag oor Besigheid (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_business
task_alias: business
process_docs: !function utils.process_business
description: 'Hier is ''n multikeusevraag oor Chemie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_chemistry
task_alias: chemistry
process_docs: !function utils.process_chemistry
description: 'Hier is ''n multikeusevraag oor Rekenaarwetenskap (met antwoorde). Dink
asseblief stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X
die letter van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_computer_science
task_alias: computer_science
process_docs: !function utils.process_computer_science
description: 'Hier is ''n multikeusevraag oor Ekonomie (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_economics
task_alias: economics
process_docs: !function utils.process_economics
description: 'Hier is ''n multikeusevraag oor Ingenieurswese (met antwoorde). Dink
asseblief stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X
die letter van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_engineering
task_alias: engineering
process_docs: !function utils.process_engineering
description: 'Hier is ''n multikeusevraag oor Gesondheid (met antwoorde). Dink asseblief
stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
van die korrekte opsie is.
'
include: _af_template_yaml
task: mmlu_prox_af_health
task_alias: health
process_docs: !function utils.process_health