Unverified Commit 8aeff141 authored by heli-qi, committed by GitHub

Add MMLU-ProX task (#2811)

* update mmlu_prox configs

* update tasks/README

* correct hyphen to underscore in tasks/README

* update pre-commit codes
parent 8028a42f
@@ -96,6 +96,7 @@
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu-pro-plus](mmlu-pro-plus/README.md) | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Swahili, Thai, Arabic, Hindi, Bengali |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
# MMLU-ProX
### Paper
Title: `MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation`
Abstract: `Traditional benchmarks like MMLU and MMLU-Pro focus primarily on single-language evaluation, limiting their ability to assess language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark that builds upon MMLU-Pro by covering multiple typologically diverse languages with approximately 11,829 questions per language.`
Homepage: https://mmluprox.github.io/
### Citation
```bibtex
@misc{mmluprox,
title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Yun Xing and Junjue Wang and Huitao Li and Xin Li and Kunyu Yu and Nan Liu and Qingyu Chen and Douglas Teodoro and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
year={2025},
eprint={2503.10497},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.10497},
}
```
### Groups and Tasks
#### Groups
* `mmlu_prox_{lang}`: All 14 subjects of the MMLU-ProX dataset in `{lang}`, evaluated following the methodology of the original mmlu_pro implementation.
Available languages:
- ar
- bn
- de
- en
- es
- fr
- hi
- ja
- ko
- pt
- sw
- th
- zh
#### Tasks
The following tasks evaluate individual subjects of the mmlu_prox dataset:
- `mmlu_prox_{lang}_biology`
- `mmlu_prox_{lang}_business`
- `mmlu_prox_{lang}_chemistry`
- `mmlu_prox_{lang}_computer_science`
- `mmlu_prox_{lang}_economics`
- `mmlu_prox_{lang}_engineering`
- `mmlu_prox_{lang}_health`
- `mmlu_prox_{lang}_history`
- `mmlu_prox_{lang}_law`
- `mmlu_prox_{lang}_math`
- `mmlu_prox_{lang}_other`
- `mmlu_prox_{lang}_philosophy`
- `mmlu_prox_{lang}_physics`
- `mmlu_prox_{lang}_psychology`
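Every task name is fully determined by a language code and a subject; a quick sketch (plain Python, using the two lists above) to enumerate all per-language tasks:

```python
# Language codes and subjects exactly as listed above.
langs = ["ar", "bn", "de", "en", "es", "fr", "hi", "ja", "ko", "pt", "sw", "th", "zh"]
subjects = [
    "biology", "business", "chemistry", "computer_science", "economics",
    "engineering", "health", "history", "law", "math", "other",
    "philosophy", "physics", "psychology",
]

# Each individual task follows the mmlu_prox_{lang}_{subject} naming pattern.
tasks = [f"mmlu_prox_{lang}_{subject}" for lang in langs for subject in subjects]
print(len(tasks))  # 13 languages x 14 subjects = 182 tasks
```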
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: li-lab/MMLU-ProX
dataset_name: ar
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'الإجابة هي \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "سؤال:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0
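The `custom-extract` filter above pulls the chosen letter out of the model's free-form answer. A minimal standalone sketch of what that regex matches (plain `re`, not the harness's filter machinery; the sample response is illustrative):

```python
import re

# Same pattern as regex_pattern above: "the answer is (X)" in Arabic,
# with optional parentheses around the choice letter A-J.
pattern = re.compile(r"الإجابة هي \(?([ABCDEFGHIJ])\)?")

response = "لنحل المسألة خطوة بخطوة. الإجابة هي (B)."
match = pattern.search(response)
print(match.group(1))  # -> B
```

The `take_first` step that follows in the config keeps only the first such match per response.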
group: mmlu_prox_ar
task:
- mmlu_prox_ar_biology
- mmlu_prox_ar_business
- mmlu_prox_ar_chemistry
- mmlu_prox_ar_computer_science
- mmlu_prox_ar_economics
- mmlu_prox_ar_engineering
- mmlu_prox_ar_health
- mmlu_prox_ar_history
- mmlu_prox_ar_law
- mmlu_prox_ar_math
- mmlu_prox_ar_other
- mmlu_prox_ar_philosophy
- mmlu_prox_ar_physics
- mmlu_prox_ar_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 0.0
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول علم الأحياء. فكر خطوة
بخطوة ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_biology
task_alias: biology
process_docs: !function utils.process_biology
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الأعمال. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_business
task_alias: business
process_docs: !function utils.process_business
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الكيمياء. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_chemistry
task_alias: chemistry
process_docs: !function utils.process_chemistry
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول علوم الكمبيوتر. فكر خطوة
بخطوة ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_computer_science
task_alias: computer_science
process_docs: !function utils.process_computer_science
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الاقتصاد. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_economics
task_alias: economics
process_docs: !function utils.process_economics
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الهندسة. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_engineering
task_alias: engineering
process_docs: !function utils.process_engineering
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الصحة. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_health
task_alias: health
process_docs: !function utils.process_health
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول التاريخ. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_history
task_alias: history
process_docs: !function utils.process_history
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول القانون. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_law
task_alias: law
process_docs: !function utils.process_law
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الرياضيات. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_math
task_alias: math
process_docs: !function utils.process_math
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول أخرى. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_other
task_alias: other
process_docs: !function utils.process_other
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الفلسفة. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_philosophy
task_alias: philosophy
process_docs: !function utils.process_philosophy
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الفيزياء. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_physics
task_alias: physics
process_docs: !function utils.process_physics
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول علم النفس. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_psychology
task_alias: psychology
process_docs: !function utils.process_psychology
from functools import partial
from os.path import basename, dirname

from lm_eval.tasks.mmlu_prox.lang_libs import LANG_LIBS

# The language code is inferred from the directory containing this file (e.g. "ar").
lang_abbr = basename(dirname(__file__))
lang_dict = LANG_LIBS[lang_abbr]

# Option letters; MMLU-ProX questions use at most 10 options (A-J).
choices = list("ABCDEFGHIJKLMNOP")
max_opt_num = 10
def format_cot_example(example, including_answer=True):
    """Format one question (and optionally its chain-of-thought answer) as a prompt."""
    prompt = f"{lang_dict[0]}\n"  # localized question header
    question = example["question"]
    prompt += question + "\n"
    prompt += f"{lang_dict[1]}\n"  # localized options header
    for i in range(max_opt_num):
        opt = example[f"option_{i}"]
        if opt is not None:
            prompt += "{}. {}\n".format(choices[i], opt)
    if including_answer:
        # Swap the answer lead-in in the CoT text for its localized counterpart.
        cot_content = example["cot_content"].replace(lang_dict[4], lang_dict[2])
        prompt += cot_content + "\n\n"
    else:
        prompt += lang_dict[2]  # localized answer lead-in for the model to complete
    return prompt


doc_to_text = partial(format_cot_example, including_answer=False)
fewshot_to_text = partial(format_cot_example, including_answer=True)


def process_docs(dataset, subject):
    """Keep only the documents belonging to the given subject category."""
    return dataset.filter(lambda x: x["category"] == subject)
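To illustrate what `format_cot_example` produces, here is a self-contained sketch using a toy English `lang_dict` (the real localized strings live in `lm_eval.tasks.mmlu_prox.lang_libs.LANG_LIBS`; the indices and field names below mirror the function above, but the string values and the example record are illustrative):

```python
# Toy stand-ins; the real values are language-specific.
toy_lang_dict = ["Question:", "Options:", "The answer is", None, "A:"]
toy_choices = list("ABCDEFGHIJ")

def toy_format_cot_example(example, including_answer=True):
    # Same structure as format_cot_example above, with toy localization strings.
    prompt = f"{toy_lang_dict[0]}\n" + example["question"] + "\n" + f"{toy_lang_dict[1]}\n"
    for i, letter in enumerate(toy_choices):
        opt = example.get(f"option_{i}")
        if opt is not None:
            prompt += f"{letter}. {opt}\n"
    if including_answer:
        prompt += example["cot_content"].replace(toy_lang_dict[4], toy_lang_dict[2]) + "\n\n"
    else:
        prompt += toy_lang_dict[2]
    return prompt

example = {
    "question": "What is 2 + 2?",
    "option_0": "3",
    "option_1": "4",
    "cot_content": "A: Adding 2 and 2 gives 4. The answer is (B).",
}
print(toy_format_cot_example(example, including_answer=False))
```

With `including_answer=False` the prompt ends with the bare answer lead-in, so the model is cued to emit exactly the phrase the `custom-extract` regex looks for.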
process_biology = partial(process_docs, subject="biology")
process_business = partial(process_docs, subject="business")
process_chemistry = partial(process_docs, subject="chemistry")
process_computer_science = partial(process_docs, subject="computer science")
process_economics = partial(process_docs, subject="economics")
process_engineering = partial(process_docs, subject="engineering")
process_health = partial(process_docs, subject="health")
process_history = partial(process_docs, subject="history")
process_law = partial(process_docs, subject="law")
process_math = partial(process_docs, subject="math")
process_other = partial(process_docs, subject="other")
process_philosophy = partial(process_docs, subject="philosophy")
process_physics = partial(process_docs, subject="physics")
process_psychology = partial(process_docs, subject="psychology")
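The `process_*` helpers above are just `process_docs` with the subject pre-bound via `functools.partial`. A minimal sketch of the same pattern, with a plain list standing in for the Hugging Face dataset (which exposes a similar `.filter`); the names `filter_docs`/`filter_biology` are hypothetical:

```python
from functools import partial

def filter_docs(dataset, subject):
    # Stand-in for datasets.Dataset.filter: keep rows in the requested category.
    return [row for row in dataset if row["category"] == subject]

# Pre-bind the subject, mirroring process_biology = partial(process_docs, subject="biology").
filter_biology = partial(filter_docs, subject="biology")

docs = [
    {"category": "biology", "question": "Which organelle produces ATP?"},
    {"category": "math", "question": "What is 7 * 8?"},
]
print(filter_biology(docs))  # only the biology row survives
```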
dataset_path: li-lab/MMLU-ProX
dataset_name: bn
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'উত্তর হল \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "প্রশ্ন:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0