Merge branch 'smolrefact' into tasklist

# Conflicts:
#   lm_eval/__main__.py
#   lm_eval/api/group.py
#   lm_eval/api/task.py
#   lm_eval/evaluator_utils.py
#   lm_eval/tasks/__init__.py
#   lm_eval/utils.py
#   pyproject.toml

Commit abd17276 authored Sep 26, 2025 by Baber
20 changed files
--- a/lm_eval/tasks/mmlu/continuation/mmlu_us_foreign_policy.yaml
+++ b/lm_eval/tasks/mmlu/continuation/mmlu_us_foreign_policy.yaml
 "dataset_name": "us_foreign_policy"
 "description": "The following are questions (with answers) about us\
  \ foreign policy.\n\n"
-"tag": "mmlu_continuation_social_sciences"
+"tag": "mmlu_social_sciences_continuation"
 "include": "_continuation_template_yaml"
-"task": "mmlu_continuation_us_foreign_policy"
+"task": "mmlu_us_foreign_policy_continuation"
--- a/lm_eval/tasks/mmlu/continuation/mmlu_virology.yaml
+++ b/lm_eval/tasks/mmlu/continuation/mmlu_virology.yaml
 "dataset_name": "virology"
 "description": "The following are questions (with answers) about virology.\n\
  \n"
-"tag": "mmlu_continuation_other"
+"tag": "mmlu_other_continuation"
 "include": "_continuation_template_yaml"
-"task": "mmlu_continuation_virology"
+"task": "mmlu_virology_continuation"
--- a/lm_eval/tasks/mmlu/continuation/mmlu_world_religions.yaml
+++ b/lm_eval/tasks/mmlu/continuation/mmlu_world_religions.yaml
 "dataset_name": "world_religions"
 "description": "The following are questions (with answers) about world\
  \ religions.\n\n"
-"tag": "mmlu_continuation_humanities"
+"tag": "mmlu_humanities_continuation"
 "include": "_continuation_template_yaml"
-"task": "mmlu_continuation_world_religions"
+"task": "mmlu_world_religions_continuation"
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu_flan_cot_fewshot_template_yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu_flan_cot_fewshot_template_yaml
-dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
+dataset_path: cais/mmlu
 validation_split: validation
 test_split: test
 fewshot_config:

--- a/lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml
-dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
+dataset_path: cais/mmlu
 validation_split: validation
 fewshot_split: dev
 output_type: generate_until

--- a/lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml
+++ b/lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml
-dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
+dataset_path: cais/mmlu
 test_split: test
 fewshot_split: dev
 fewshot_config:

--- a/lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu_flan_loglikelihood_template_yaml
+++ b/lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu_flan_loglikelihood_template_yaml
-dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
+dataset_path: cais/mmlu
 test_split: test
 fewshot_split: dev
 fewshot_config:

--- a/lm_eval/tasks/mmlu/generative/_default_template_yaml
+++ b/lm_eval/tasks/mmlu/generative/_default_template_yaml
-dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
+dataset_path: cais/mmlu
 test_split: test
 fewshot_split: dev
 fewshot_config:

--- a/lm_eval/tasks/mmlu_prox/README.md
+++ b/lm_eval/tasks/mmlu_prox/README.md
@@ -4,21 +4,29 @@
 Title: `MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation`
-Abstract: `Traditional benchmarks like MMLU and MMLU-Pro focus primarily on single-language evaluation, limiting their ability to assess language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark that builds upon MMLU-Pro by covering multiple typologically diverse languages with approximately 11,829 questions per language.`
+Abstract: `Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities.
+This dual limitation makes it challenging to comprehensively assess LLMs' performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark.
+Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
+To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance.
+Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs.
+The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%.
+Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
+We plan to continuously expand MMLU-ProX by incorporating additional languages to further enhance its coverage and utility for the global AI research community.`
-Homepage: https://mmluprox.github.io/
+Homepage: https://mmluprox.github.io
+Huggingface:
+- https://huggingface.co/datasets/li-lab/MMLU-ProX
+- https://huggingface.co/datasets/li-lab/MMLU-ProX-Lite
 ### Citation
 ```bibtex
-@misc{mmluprox,
+@article{xuan2025mmlu,
-      title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
+  title={Mmlu-prox: A multilingual benchmark for advanced large language model evaluation},
-      author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Yun Xing and Junjue Wang and Huitao Li and Xin Li and Kunyu Yu and Nan Liu and Qingyu Chen and Douglas Teodoro and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
+  author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Aosong Feng and Dairui Liu and Yun Xing and Junjue Wang and Fan Gao and Jinghui Lu and Yuang Jiang and Huitao Li and Xin Li and Kunyu Yu and Ruihai Dong and Shangding Gu and Yuekang Li and Xiaofei Xie and Felix Juefei-Xu and Foutse Khomh and Osamu Yoshie and Qingyu Chen and Douglas Teodoro and Nan Liu and Randy Goebel and Lei Ma and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
-      year={2025},
+  journal={arXiv preprint arXiv:2503.10497},
-      eprint={2503.10497},
+  year={2025}
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2503.10497},
 }
 ```
@@ -26,22 +34,39 @@ Homepage: https://mmluprox.github.io/
 #### Groups
-* `mmlu_pro_{lang}`: 'All 14 subjects of the mmlu_pro_prox dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
+* `mmlu_pro_{lang}`: 'All 14 subjects of the mmlu_prox dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
+* `mmlu_prox_lite_{lang}`: 'All 14 subjects of the mmlu_prox_lite dataset in {lang}, evaluated following the methodology in mmlu_pro's original implementation'
-Available lang:
+Available options for `{lang}`:
+- af
 - ar
 - bn
+- cs
 - de
 - en
 - es
 - fr
 - hi
+- hu
+- id
+- it
 - ja
 - ko
+- mr
+- ne
 - pt
+- ru
+- sr
 - sw
+- te
 - th
+- uk
+- ur
+- vi
+- wo
+- yo
 - zh
+- zu
 #### Tasks
@@ -61,6 +86,23 @@ The following tasks evaluate subjects in the mmlu_prox dataset
 - `mmlu_prox_{lang}_physics`
 - `mmlu_prox_{lang}_psychology`
+The following tasks evaluate subjects in the mmlu_prox_lite dataset
+- `mmlu_prox_lite_{lang}_biology`
+- `mmlu_prox_lite_{lang}_business`
+- `mmlu_prox_lite_{lang}_chemistry`
+- `mmlu_prox_lite_{lang}_computer_science`
+- `mmlu_prox_lite_{lang}_economics`
+- `mmlu_prox_lite_{lang}_engineering`
+- `mmlu_prox_lite_{lang}_health`
+- `mmlu_prox_lite_{lang}_history`
+- `mmlu_prox_lite_{lang}_law`
+- `mmlu_prox_lite_{lang}_math`
+- `mmlu_prox_lite_{lang}_other`
+- `mmlu_prox_lite_{lang}_philosophy`
+- `mmlu_prox_lite_{lang}_physics`
+- `mmlu_prox_lite_{lang}_psychology`
 ### Checklist
 For adding novel benchmarks/datasets to the library:

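Review note: the README hunk above names tasks by the pattern `mmlu_prox_{lang}_{subject}` and `mmlu_prox_lite_{lang}_{subject}`. The sketch below enumerates those names from the 29 language codes and 14 subjects listed in the diff. The helper `task_names` is hypothetical and is not part of `lm_eval`; it only illustrates the naming scheme.

```python
# Language codes and subjects copied from the README hunk in this diff.
LANGS = [
    "af", "ar", "bn", "cs", "de", "en", "es", "fr", "hi", "hu",
    "id", "it", "ja", "ko", "mr", "ne", "pt", "ru", "sr", "sw",
    "te", "th", "uk", "ur", "vi", "wo", "yo", "zh", "zu",
]
SUBJECTS = [
    "biology", "business", "chemistry", "computer_science", "economics",
    "engineering", "health", "history", "law", "math",
    "other", "philosophy", "physics", "psychology",
]


def task_names(lang: str, lite: bool = False) -> list[str]:
    """Build the per-subject task names for one language (hypothetical helper)."""
    if lang not in LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    prefix = "mmlu_prox_lite" if lite else "mmlu_prox"
    return [f"{prefix}_{lang}_{subject}" for subject in SUBJECTS]


if __name__ == "__main__":
    for name in task_names("af", lite=True):
        print(name)
```

Each language therefore contributes 14 tasks per dataset variant, which matches the task lists in the `_mmlu_prox_af.yaml` and `_mmlu_prox_lite_af.yaml` group files.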
--- a/lm_eval/tasks/mmlu_prox/af/_af_lite_template_yaml
+++ b/lm_eval/tasks/mmlu_prox/af/_af_lite_template_yaml
+dataset_path: li-lab/MMLU-ProX-Lite
+dataset_name: af
+test_split: test
+fewshot_split: validation
+fewshot_config:
+  sampler: first_n
+  doc_to_text: !function utils.fewshot_to_text
+  doc_to_target: ""
+output_type: generate_until
+doc_to_text: !function utils.doc_to_text
+doc_to_target: answer
+filter_list:
+  - name: "custom-extract"
+    filter:
+      - function: "regex"
+        regex_pattern: 'Die antwoord is \(?([ABCDEFGHIJ])\)?'
+      - function: "take_first"
+generation_kwargs:
+  until:
+    - "</s>"
+    - "Q:"
+    - "Vraag:"
+    - "<|im_end|>"
+  do_sample: false
+  temperature: 0.0
+  max_gen_toks: 2048
+num_fewshot: 5
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/mmlu_prox/af/_af_template_yaml
+++ b/lm_eval/tasks/mmlu_prox/af/_af_template_yaml
+dataset_path: li-lab/MMLU-ProX
+dataset_name: af
+test_split: test
+fewshot_split: validation
+fewshot_config:
+  sampler: first_n
+  doc_to_text: !function utils.fewshot_to_text
+  doc_to_target: ""
+output_type: generate_until
+doc_to_text: !function utils.doc_to_text
+doc_to_target: answer
+filter_list:
+  - name: "custom-extract"
+    filter:
+      - function: "regex"
+        regex_pattern: 'Die antwoord is \(?([ABCDEFGHIJ])\)?'
+      - function: "take_first"
+generation_kwargs:
+  until:
+    - "</s>"
+    - "Q:"
+    - "Vraag:"
+    - "<|im_end|>"
+  do_sample: false
+  temperature: 0.0
+  max_gen_toks: 2048
+num_fewshot: 5
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+metadata:
+  version: 0.0
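Review note: both Afrikaans templates use a `custom-extract` filter that applies the regex `Die antwoord is \(?([ABCDEFGHIJ])\)?` and then `take_first`. The sketch below shows what that pattern captures using only Python's `re` module. It is not the harness's filter implementation: the function name `extract_answer` and the `"[invalid]"` fallback value are assumptions made for illustration.

```python
import re

# Same pattern as `regex_pattern` in the templates above.
ANSWER_PATTERN = re.compile(r"Die antwoord is \(?([ABCDEFGHIJ])\)?")


def extract_answer(response: str, fallback: str = "[invalid]") -> str:
    """Return the first captured option letter, or a fallback if none is found.

    Mirrors the idea of "regex" followed by "take_first"; the fallback
    string is an assumption, not taken from this diff.
    """
    matches = ANSWER_PATTERN.findall(response)
    return matches[0] if matches else fallback


if __name__ == "__main__":
    print(extract_answer("Kom ons dink stap vir stap. Die antwoord is (C)."))
    print(extract_answer("Die antwoord is B"))
    print(extract_answer("Ek weet nie."))
```

The parentheses around the letter are optional in the pattern, so both `(C)` and `C` are accepted. The match is case-sensitive on the phrase itself, so a response that writes `die antwoord is` in lowercase would not be extracted by this sketch.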
--- a/lm_eval/tasks/mmlu_prox/af/_mmlu_prox_af.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/_mmlu_prox_af.yaml
+group: mmlu_prox_af
+task:
+- mmlu_prox_af_biology
+- mmlu_prox_af_business
+- mmlu_prox_af_chemistry
+- mmlu_prox_af_computer_science
+- mmlu_prox_af_economics
+- mmlu_prox_af_engineering
+- mmlu_prox_af_health
+- mmlu_prox_af_history
+- mmlu_prox_af_law
+- mmlu_prox_af_math
+- mmlu_prox_af_other
+- mmlu_prox_af_philosophy
+- mmlu_prox_af_physics
+- mmlu_prox_af_psychology
+aggregate_metric_list:
+- aggregation: mean
+  metric: exact_match
+  weight_by_size: true
+  filter_list: custom-extract
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/mmlu_prox/af/_mmlu_prox_lite_af.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/_mmlu_prox_lite_af.yaml
+group: mmlu_prox_lite_af
+task:
+- mmlu_prox_lite_af_biology
+- mmlu_prox_lite_af_business
+- mmlu_prox_lite_af_chemistry
+- mmlu_prox_lite_af_computer_science
+- mmlu_prox_lite_af_economics
+- mmlu_prox_lite_af_engineering
+- mmlu_prox_lite_af_health
+- mmlu_prox_lite_af_history
+- mmlu_prox_lite_af_law
+- mmlu_prox_lite_af_math
+- mmlu_prox_lite_af_other
+- mmlu_prox_lite_af_philosophy
+- mmlu_prox_lite_af_physics
+- mmlu_prox_lite_af_psychology
+aggregate_metric_list:
+- aggregation: mean
+  metric: exact_match
+  weight_by_size: true
+  filter_list: custom-extract
+metadata:
+  version: 0.0
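Review note: both group configs aggregate `exact_match` with `aggregation: mean` and `weight_by_size: true`. Assuming this means each subtask's score is weighted by its number of documents, the group score can be sketched as below. The function `size_weighted_mean` and the example numbers are hypothetical and are not taken from the harness or from real results.

```python
def size_weighted_mean(scores_and_sizes: list[tuple[float, int]]) -> float:
    """Average per-task scores, weighting each by its document count."""
    total = sum(size for _, size in scores_and_sizes)
    if total == 0:
        raise ValueError("cannot aggregate over zero documents")
    return sum(score * size for score, size in scores_and_sizes) / total


if __name__ == "__main__":
    # Made-up (score, size) pairs for two subtasks.
    subtasks = [(0.5, 100), (1.0, 300)]
    print(size_weighted_mean(subtasks))
    # An unweighted mean of the same scores, for comparison.
    print(sum(score for score, _ in subtasks) / len(subtasks))
```

Under this reading, larger subjects pull the group score toward their own accuracy, so the result equals the accuracy over all documents pooled together rather than the plain average of the 14 subject scores.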
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_biology.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_biology.yaml
+description: 'Hier is ''n multikeusevraag oor Biologie (met antwoorde). Dink asseblief
+  stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
+  van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_biology
+task_alias: biology
+process_docs: !function utils.process_biology
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_business.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_business.yaml
+description: 'Hier is ''n multikeusevraag oor Besigheid (met antwoorde). Dink asseblief
+  stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
+  van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_business
+task_alias: business
+process_docs: !function utils.process_business
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_chemistry.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_chemistry.yaml
+description: 'Hier is ''n multikeusevraag oor Chemie (met antwoorde). Dink asseblief
+  stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
+  van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_chemistry
+task_alias: chemistry
+process_docs: !function utils.process_chemistry
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_computer_science.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_computer_science.yaml
+description: 'Hier is ''n multikeusevraag oor Rekenaarwetenskap (met antwoorde). Dink
+  asseblief stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X
+  die letter van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_computer_science
+task_alias: computer_science
+process_docs: !function utils.process_computer_science
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_economics.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_economics.yaml
+description: 'Hier is ''n multikeusevraag oor Ekonomie (met antwoorde). Dink asseblief
+  stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
+  van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_economics
+task_alias: economics
+process_docs: !function utils.process_economics
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_engineering.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_engineering.yaml
+description: 'Hier is ''n multikeusevraag oor Ingenieurswese (met antwoorde). Dink
+  asseblief stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X
+  die letter van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_engineering
+task_alias: engineering
+process_docs: !function utils.process_engineering
--- a/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_health.yaml
+++ b/lm_eval/tasks/mmlu_prox/af/mmlu_prox_af_health.yaml
+description: 'Hier is ''n multikeusevraag oor Gesondheid (met antwoorde). Dink asseblief
+  stap vir stap en eindig jou antwoord met "Die antwoord is (X)", waar X die letter
+  van die korrekte opsie is.
+  '
+include: _af_template_yaml
+task: mmlu_prox_af_health
+task_alias: health
+process_docs: !function utils.process_health