Unverified Commit 8aeff141 authored by heli-qi, committed by GitHub

Add MMLU-ProX task (#2811)

* update mmlu_prox configs

* update tasks/README

* correct hyphen to underscore in tasks/README

* update pre-commit codes
parent 8028a42f
@@ -96,6 +96,7 @@
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu-pro-plus](mmlu-pro-plus/README.md) | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Swahili, Thai, Arabic, Hindi, Bengali |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
# MMLU-ProX
### Paper
Title: `MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation`
Abstract: `Traditional benchmarks like MMLU and MMLU-Pro focus primarily on single-language evaluation, limiting their ability to assess language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark that builds upon MMLU-Pro by covering multiple typologically diverse languages with approximately 11,829 questions per language.`
Homepage: https://mmluprox.github.io/
### Citation
```bibtex
@misc{mmluprox,
title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Yun Xing and Junjue Wang and Huitao Li and Xin Li and Kunyu Yu and Nan Liu and Qingyu Chen and Douglas Teodoro and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
year={2025},
eprint={2503.10497},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.10497},
}
```
### Groups and Tasks
#### Groups
* `mmlu_prox_{lang}`: All 14 subjects of the MMLU-ProX dataset in `{lang}`, evaluated following the methodology of the original mmlu_pro implementation.
Available languages:
- ar
- bn
- de
- en
- es
- fr
- hi
- ja
- ko
- pt
- sw
- th
- zh
#### Tasks
The following tasks evaluate individual subjects of the mmlu_prox dataset:
- `mmlu_prox_{lang}_biology`
- `mmlu_prox_{lang}_business`
- `mmlu_prox_{lang}_chemistry`
- `mmlu_prox_{lang}_computer_science`
- `mmlu_prox_{lang}_economics`
- `mmlu_prox_{lang}_engineering`
- `mmlu_prox_{lang}_health`
- `mmlu_prox_{lang}_history`
- `mmlu_prox_{lang}_law`
- `mmlu_prox_{lang}_math`
- `mmlu_prox_{lang}_other`
- `mmlu_prox_{lang}_philosophy`
- `mmlu_prox_{lang}_physics`
- `mmlu_prox_{lang}_psychology`
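Every task name is fully determined by a language code and a subject; a quick sketch (plain Python, using the two lists above) to enumerate all per-language tasks:

```python
# Language codes and subjects exactly as listed above.
langs = ["ar", "bn", "de", "en", "es", "fr", "hi", "ja", "ko", "pt", "sw", "th", "zh"]
subjects = [
    "biology", "business", "chemistry", "computer_science", "economics",
    "engineering", "health", "history", "law", "math", "other",
    "philosophy", "physics", "psychology",
]

# Each individual task follows the mmlu_prox_{lang}_{subject} naming pattern.
tasks = [f"mmlu_prox_{lang}_{subject}" for lang in langs for subject in subjects]
print(len(tasks))  # 13 languages x 14 subjects = 182 tasks
```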
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: li-lab/MMLU-ProX
dataset_name: ar
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'الإجابة هي \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "سؤال:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0
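The `custom-extract` filter above pulls the chosen letter out of the model's free-form answer. A minimal standalone sketch of what that regex matches (plain `re`, not the harness's filter machinery; the sample response is illustrative):

```python
import re

# Same pattern as regex_pattern above: "the answer is (X)" in Arabic,
# with optional parentheses around the choice letter A-J.
pattern = re.compile(r"الإجابة هي \(?([ABCDEFGHIJ])\)?")

response = "لنحل المسألة خطوة بخطوة. الإجابة هي (B)."
match = pattern.search(response)
print(match.group(1))  # -> B
```

The `take_first` step that follows in the config keeps only the first such match per response.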
group: mmlu_prox_ar
task:
- mmlu_prox_ar_biology
- mmlu_prox_ar_business
- mmlu_prox_ar_chemistry
- mmlu_prox_ar_computer_science
- mmlu_prox_ar_economics
- mmlu_prox_ar_engineering
- mmlu_prox_ar_health
- mmlu_prox_ar_history
- mmlu_prox_ar_law
- mmlu_prox_ar_math
- mmlu_prox_ar_other
- mmlu_prox_ar_philosophy
- mmlu_prox_ar_physics
- mmlu_prox_ar_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 0.0
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول علم الأحياء. فكر خطوة
بخطوة ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_biology
task_alias: biology
process_docs: !function utils.process_biology
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الأعمال. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_business
task_alias: business
process_docs: !function utils.process_business
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الكيمياء. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_chemistry
task_alias: chemistry
process_docs: !function utils.process_chemistry
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول علوم الكمبيوتر. فكر خطوة
بخطوة ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_computer_science
task_alias: computer_science
process_docs: !function utils.process_computer_science
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الاقتصاد. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_economics
task_alias: economics
process_docs: !function utils.process_economics
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الهندسة. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_engineering
task_alias: engineering
process_docs: !function utils.process_engineering
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الصحة. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_health
task_alias: health
process_docs: !function utils.process_health
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول التاريخ. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_history
task_alias: history
process_docs: !function utils.process_history
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول القانون. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_law
task_alias: law
process_docs: !function utils.process_law
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الرياضيات. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_math
task_alias: math
process_docs: !function utils.process_math
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول أخرى. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_other
task_alias: other
process_docs: !function utils.process_other
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الفلسفة. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_philosophy
task_alias: philosophy
process_docs: !function utils.process_philosophy
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول الفيزياء. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_physics
task_alias: physics
process_docs: !function utils.process_physics
description: 'فيما يلي أسئلة اختيار من متعدد (مع إجابات) حول علم النفس. فكر خطوة بخطوة
ثم أنهِ إجابتك بـ ''الإجابة هي (X)'' حيث X هو حرف الخيار الصحيح.
'
include: _ar_template_yaml
task: mmlu_prox_ar_psychology
task_alias: psychology
process_docs: !function utils.process_psychology
from functools import partial
from os.path import basename, dirname

from lm_eval.tasks.mmlu_prox.lang_libs import LANG_LIBS

# The language code is inferred from the directory containing this file (e.g. "ar").
lang_abbr = basename(dirname(__file__))
lang_dict = LANG_LIBS[lang_abbr]

# Option letters; MMLU-ProX questions use at most 10 options (A-J).
choices = list("ABCDEFGHIJKLMNOP")
max_opt_num = 10
def format_cot_example(example, including_answer=True):
    """Format one question (and optionally its chain-of-thought answer) as a prompt."""
    prompt = f"{lang_dict[0]}\n"  # localized question header
    question = example["question"]
    prompt += question + "\n"
    prompt += f"{lang_dict[1]}\n"  # localized options header
    for i in range(max_opt_num):
        opt = example[f"option_{i}"]
        if opt is not None:
            prompt += "{}. {}\n".format(choices[i], opt)
    if including_answer:
        # Swap the answer lead-in in the CoT text for its localized counterpart.
        cot_content = example["cot_content"].replace(lang_dict[4], lang_dict[2])
        prompt += cot_content + "\n\n"
    else:
        prompt += lang_dict[2]  # localized answer lead-in for the model to complete
    return prompt


doc_to_text = partial(format_cot_example, including_answer=False)
fewshot_to_text = partial(format_cot_example, including_answer=True)


def process_docs(dataset, subject):
    """Keep only the documents belonging to the given subject category."""
    return dataset.filter(lambda x: x["category"] == subject)
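To illustrate what `format_cot_example` produces, here is a self-contained sketch using a toy English `lang_dict` (the real localized strings live in `lm_eval.tasks.mmlu_prox.lang_libs.LANG_LIBS`; the indices and field names below mirror the function above, but the string values and the example record are illustrative):

```python
# Toy stand-ins; the real values are language-specific.
toy_lang_dict = ["Question:", "Options:", "The answer is", None, "A:"]
toy_choices = list("ABCDEFGHIJ")

def toy_format_cot_example(example, including_answer=True):
    # Same structure as format_cot_example above, with toy localization strings.
    prompt = f"{toy_lang_dict[0]}\n" + example["question"] + "\n" + f"{toy_lang_dict[1]}\n"
    for i, letter in enumerate(toy_choices):
        opt = example.get(f"option_{i}")
        if opt is not None:
            prompt += f"{letter}. {opt}\n"
    if including_answer:
        prompt += example["cot_content"].replace(toy_lang_dict[4], toy_lang_dict[2]) + "\n\n"
    else:
        prompt += toy_lang_dict[2]
    return prompt

example = {
    "question": "What is 2 + 2?",
    "option_0": "3",
    "option_1": "4",
    "cot_content": "A: Adding 2 and 2 gives 4. The answer is (B).",
}
print(toy_format_cot_example(example, including_answer=False))
```

With `including_answer=False` the prompt ends with the bare answer lead-in, so the model is cued to emit exactly the phrase the `custom-extract` regex looks for.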
process_biology = partial(process_docs, subject="biology")
process_business = partial(process_docs, subject="business")
process_chemistry = partial(process_docs, subject="chemistry")
process_computer_science = partial(process_docs, subject="computer science")
process_economics = partial(process_docs, subject="economics")
process_engineering = partial(process_docs, subject="engineering")
process_health = partial(process_docs, subject="health")
process_history = partial(process_docs, subject="history")
process_law = partial(process_docs, subject="law")
process_math = partial(process_docs, subject="math")
process_other = partial(process_docs, subject="other")
process_philosophy = partial(process_docs, subject="philosophy")
process_physics = partial(process_docs, subject="physics")
process_psychology = partial(process_docs, subject="psychology")
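The `process_*` helpers above are just `process_docs` with the subject pre-bound via `functools.partial`. A minimal sketch of the same pattern, with a plain list standing in for the Hugging Face dataset (which exposes a similar `.filter`); the names `filter_docs`/`filter_biology` are hypothetical:

```python
from functools import partial

def filter_docs(dataset, subject):
    # Stand-in for datasets.Dataset.filter: keep rows in the requested category.
    return [row for row in dataset if row["category"] == subject]

# Pre-bind the subject, mirroring process_biology = partial(process_docs, subject="biology").
filter_biology = partial(filter_docs, subject="biology")

docs = [
    {"category": "biology", "question": "Which organelle produces ATP?"},
    {"category": "math", "question": "What is 7 * 8?"},
]
print(filter_biology(docs))  # only the biology row survives
```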
dataset_path: li-lab/MMLU-ProX
dataset_name: bn
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'উত্তর হল \(?([ABCDEFGHIJ])\)?'
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "প্রশ্ন:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
max_gen_toks: 2048
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 0.0