Unverified Commit 0bb8406f authored by asgsaeid, committed by GitHub

MMLU Pro Plus (#2366)



* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>
parent 1208afd3
@@ -85,6 +85,7 @@
| [mlqa](mlqa/README.md) | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu-pro-plus](mmlu-pro-plus/README.md) | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
# mmlu_pro_plus
### Paper
Title: `MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs`
Abstract: `Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between
top-performing models, underscoring the need for more challenging evaluation frameworks.
We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut
learning and higher-order reasoning in LLMs. By incorporating questions with multiple
correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex
reasoning and resist simplistic problem-solving strategies. Our results show that
MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of
model discrimination, particularly in multi-correct answer scenarios.
We introduce novel metrics like shortcut selection ratio and correct pair identification
ratio, offering deeper insights into model behavior and anchoring bias.
Evaluations of six state-of-the-art LLMs reveal significant performance gaps,
highlighting variations in reasoning abilities and bias susceptibility.`
Homepage: https://github.com/asgsaeid/mmlu-pro-plus
### Citation
```bibtex
@article{taghanaki2024mmlu,
  title={MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs},
  author={Taghanaki, Saeid Asgari and Khani, Aliasghar and Khasahmadi, Amir},
  journal={arXiv preprint arXiv:2409.02257},
  year={2024}
}
```
### Groups and Tasks
#### Groups
* `mmlu_pro_plus`: all 14 subjects of the mmlu_pro_plus dataset, evaluated following the methodology of mmlu's original implementation.
#### Tasks
The following tasks evaluate individual subjects in the mmlu_pro_plus dataset (a usage sketch follows the list):
- `mmlu_pro_plus_biology`
- `mmlu_pro_plus_business`
- `mmlu_pro_plus_chemistry`
- `mmlu_pro_plus_computer_science`
- `mmlu_pro_plus_economics`
- `mmlu_pro_plus_engineering`
- `mmlu_pro_plus_health`
- `mmlu_pro_plus_history`
- `mmlu_pro_plus_law`
- `mmlu_pro_plus_math`
- `mmlu_pro_plus_other`
- `mmlu_pro_plus_philosophy`
- `mmlu_pro_plus_physics`
- `mmlu_pro_plus_psychology`
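
Either the group or any single subject task can be selected by name. Below is a minimal sketch, assuming the harness's standard `lm_eval.simple_evaluate` Python entry point and the `hf` model backend; the checkpoint name is only a placeholder.

```python
# Hedged sketch: evaluate the mmlu_pro_plus group (or one subject task)
# through the harness's Python API. Swap the placeholder checkpoint for any
# model the "hf" backend can load.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder checkpoint
    tasks=["mmlu_pro_plus"],  # or e.g. ["mmlu_pro_plus_biology"]
    batch_size=8,
)
print(results["results"])
```

The equivalent CLI call is `lm_eval --model hf --model_args pretrained=<checkpoint> --tasks mmlu_pro_plus`.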
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
dataset_path: saeidasgari/mmlu-pro-plus
test_split: test
fewshot_split: validation
fewshot_config:
  sampler: first_n
  doc_to_text: !function utils.fewshot_to_text
  doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
  - name: "custom-extract"
    filter:
      - function: "regex"
        regex_pattern: 'answer is \(?([ABCDEFGHIJKL])\)?'
        # regex_pattern: r".*[aA]nswer:\s*([A-L])",
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
    - "Q:"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
num_fewshot: 5
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
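
The `custom-extract` filter above grabs the letter from a final "the answer is (X)" statement in the model's generation. A minimal standalone illustration of the same regular expression follows; the generation string is invented.

```python
import re

# Same pattern as regex_pattern in the filter above: match the letter inside
# "answer is (X)", with the parentheses optional.
ANSWER_RE = re.compile(r"answer is \(?([ABCDEFGHIJKL])\)?")

generation = (
    "Let's think step by step. Photosynthesis consumes carbon dioxide, "
    "so the answer is (B)."
)
match = ANSWER_RE.search(generation)
print(match.group(1) if match else None)  # -> B
```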
group: mmlu_pro_plus
task:
  - mmlu_pro_plus_biology
  - mmlu_pro_plus_business
  - mmlu_pro_plus_chemistry
  - mmlu_pro_plus_computer_science
  - mmlu_pro_plus_economics
  - mmlu_pro_plus_engineering
  - mmlu_pro_plus_health
  - mmlu_pro_plus_history
  - mmlu_pro_plus_law
  - mmlu_pro_plus_math
  - mmlu_pro_plus_other
  - mmlu_pro_plus_philosophy
  - mmlu_pro_plus_physics
  - mmlu_pro_plus_psychology
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: true
    filter_list: custom-extract
metadata:
  version: 1.0
description: "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_biology"
task_alias: "biology"
process_docs: !function utils.process_biology
description: "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_business"
task_alias: "business"
process_docs: !function utils.process_business
description: "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_chemistry"
task_alias: "chemistry"
process_docs: !function utils.process_chemistry
description: "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_computer_science"
task_alias: "computer_science"
process_docs: !function utils.process_computer_science
description: "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_economics"
task_alias: "economics"
process_docs: !function utils.process_economics
description: "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_engineering"
task_alias: "engineering"
process_docs: !function utils.process_engineering
description: "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_health"
task_alias: "health"
process_docs: !function utils.process_health
description: "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_history"
task_alias: "history"
process_docs: !function utils.process_history
description: "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_law"
task_alias: "law"
process_docs: !function utils.process_law
description: "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_math"
task_alias: "math"
process_docs: !function utils.process_math
description: "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_other"
task_alias: "other"
process_docs: !function utils.process_other
description: "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_philosophy"
task_alias: "philosophy"
process_docs: !function utils.process_philosophy
description: "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_physics"
task_alias: "physics"
process_docs: !function utils.process_physics
description: "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_psychology"
task_alias: "psychology"
process_docs: !function utils.process_psychology
from functools import partial

# Option letter labels; the list supports up to 16 answer choices per question.
choices = [
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
]


def format_cot_example(example, including_answer=True):
    """Render one document as a chain-of-thought prompt.

    With including_answer=True the worked rationale from cot_content is appended
    (few-shot exemplars); otherwise the prompt ends with the CoT cue the model
    must continue (test documents).
    """
    prompt = "Question:\n"
    question = example["question"]
    options = example["options"]
    prompt += question + "\n"
    prompt += "Options:\n"
    for i, opt in enumerate(options):
        prompt += "{}. {}\n".format(choices[i], opt)
    if including_answer:
        cot_content = example["cot_content"].replace(
            "A: Let's think step by step.", "Answer: Let's think step by step."
        )
        prompt += cot_content + "\n\n"
    else:
        prompt += "Answer: Let's think step by step."
    return prompt


doc_to_text = partial(format_cot_example, including_answer=False)
fewshot_to_text = partial(format_cot_example, including_answer=True)


def process_docs(dataset, subject):
    """Keep only the documents whose category matches the given subject."""
    return dataset.filter(lambda x: x["category"] == subject)


process_biology = partial(process_docs, subject="biology")
process_business = partial(process_docs, subject="business")
process_chemistry = partial(process_docs, subject="chemistry")
process_computer_science = partial(process_docs, subject="computer science")
process_economics = partial(process_docs, subject="economics")
process_engineering = partial(process_docs, subject="engineering")
process_health = partial(process_docs, subject="health")
process_history = partial(process_docs, subject="history")
process_law = partial(process_docs, subject="law")
process_math = partial(process_docs, subject="math")
process_other = partial(process_docs, subject="other")
process_philosophy = partial(process_docs, subject="philosophy")
process_physics = partial(process_docs, subject="physics")
process_psychology = partial(process_docs, subject="psychology")
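
For reference, a quick sketch of what the prompt helpers above produce on a made-up document. Only the field names (`question`, `options`, `category`, `cot_content`) come from the dataset schema used above; the values are invented.

```python
# Assumes the module above is importable as `utils` (as the YAML configs do).
from utils import doc_to_text, fewshot_to_text

example = {
    "category": "biology",  # hypothetical document
    "question": "Which gas do plants primarily take in during photosynthesis?",
    "options": ["Oxygen", "Carbon dioxide", "Nitrogen", "Methane"],
    "cot_content": "A: Let's think step by step. Photosynthesis fixes CO2. "
    "The answer is (B).",
}

# Few-shot rendering keeps the worked chain of thought and the final answer...
print(fewshot_to_text(example))
# ...while the test-time rendering ends with "Answer: Let's think step by step."
print(doc_to_text(example))
```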