Commit bf11ac93 authored by Baber

Merge branch 'main' into llama

parents 83b1c564 ade01428
include: mbpp.yaml
task: mbpp_plus
dataset_path: evalplus/mbppplus
dataset_name: null
doc_to_text: "You are an expert Python programmer, and here is your task: {{prompt if prompt is defined else text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
@@ -63,3 +63,6 @@ If other tasks on this dataset are already supported:
### Variant Wishlist
- [ ] zero-shot variant
### Changelog
version 2.0 (21-Feb-2025): added the math_verify (extraction) metric. For details, [see this blog post](https://huggingface.co/blog/math_verify_leaderboard).
@@ -19,9 +19,12 @@ metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: math_verify
    aggregation: mean
    higher_is_better: true
num_fewshot: 4
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
fewshot_config:
import logging
import re
import signal
from importlib.metadata import version
from typing import Dict, List, Optional
import datasets
eval_logger = logging.getLogger(__name__)
try:
    import antlr4
    import sympy
    from math_verify import parse, verify
    from sympy.parsing.latex import parse_latex

    assert version("antlr4-python3-runtime").startswith("4.11")
except (ModuleNotFoundError, AssertionError) as e:
    raise type(e)(
        "`sympy`, `math_verify` and `antlr4-python3-runtime==4.11` are required for generating translation task prompt templates. "
        "Please install the required packages via pip install lm-eval[math] or pip install -e .[math]"
    ) from e
# taken from
@@ -75,8 +82,13 @@ def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
    else:
        retval = 0

    # math_verify
    res = verify(parse(doc["answer"]), parse(candidates))
    mathval = 1 if res else 0

    results = {
        "exact_match": retval,
        "math_verify": mathval,
    }
    return results
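As a rough illustration of the new metric, the sketch below uses the `parse`/`verify` pair imported above directly; it assumes the behavior described in the Math-Verify blog post linked in the changelog, and the sample strings are made up.

```python
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")         # reference answer, e.g. doc["answer"]
pred = parse("The answer is $0.5$")    # model generation, e.g. candidates

# verify() returns True when the two parsed answers are mathematically
# equivalent; process_results maps that to the 0/1 "math_verify" score.
print(1 if verify(gold, pred) else 0)  # expected: 1
```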
# mmlu_pro_plus
### Paper
Title: `MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs`
Abstract: `Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between
top-performing models, underscoring the need for more challenging evaluation frameworks.
We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut
learning and higher-order reasoning in LLMs. By incorporating questions with multiple
correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex
reasoning and resist simplistic problem-solving strategies. Our results show that
MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of
model discrimination, particularly in multi-correct answer scenarios.
We introduce novel metrics like shortcut selection ratio and correct pair identification
ratio, offering deeper insights into model behavior and anchoring bias.
Evaluations of six state-of-the-art LLMs reveal significant performance gaps,
highlighting variations in reasoning abilities and bias susceptibility.`
Homepage: https://github.com/asgsaeid/mmlu-pro-plus
### Citation
```bibtex
@article{taghanaki2024mmlu,
title={MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs},
author={Taghanaki, Saeid Asgari and Khani, Aliasgahr and Khasahmadi, Amir},
journal={arXiv preprint arXiv:2409.02257},
year={2024}
}
```
### Groups and Tasks
#### Groups
* `mmlu_pro_plus`: all 14 subjects of the mmlu_pro_plus dataset, evaluated following the methodology of mmlu's original implementation
#### Tasks
The following tasks evaluate subjects in the mmlu_pro_plus dataset:
- `mmlu_pro_plus_biology`
- `mmlu_pro_plus_business`
- `mmlu_pro_plus_chemistry`
- `mmlu_pro_plus_computer_science`
- `mmlu_pro_plus_economics`
- `mmlu_pro_plus_engineering`
- `mmlu_pro_plus_health`
- `mmlu_pro_plus_history`
- `mmlu_pro_plus_law`
- `mmlu_pro_plus_math`
- `mmlu_pro_plus_other`
- `mmlu_pro_plus_philosophy`
- `mmlu_pro_plus_physics`
- `mmlu_pro_plus_psychology`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
dataset_path: saeidasgari/mmlu-pro-plus
test_split: test
fewshot_split: validation
fewshot_config:
  sampler: first_n
  doc_to_text: !function utils.fewshot_to_text
  doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
  - name: "custom-extract"
    filter:
      - function: "regex"
        regex_pattern: 'answer is \(?([ABCDEFGHIJKL])\)?'
        # regex_pattern: r".*[aA]nswer:\s*([A-L])",
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
    - "Q:"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
num_fewshot: 5
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
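To make the `custom-extract` filter concrete, here is a small sketch (not part of the config) of what the regex step does to a generation; the sample output string is hypothetical, and `take_first` then keeps the first filtered response when several generations are produced per document.

```python
import re

# Hypothetical model generation, for illustration only.
generation = "Let's think step by step... the answer is (C)."

match = re.search(r"answer is \(?([ABCDEFGHIJKL])\)?", generation)
extracted = match.group(1) if match else "[invalid]"
print(extracted)  # "C" -- compared against the gold letter by exact_match
```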
group: mmlu_pro_plus
task:
  - mmlu_pro_plus_biology
  - mmlu_pro_plus_business
  - mmlu_pro_plus_chemistry
  - mmlu_pro_plus_computer_science
  - mmlu_pro_plus_economics
  - mmlu_pro_plus_engineering
  - mmlu_pro_plus_health
  - mmlu_pro_plus_history
  - mmlu_pro_plus_law
  - mmlu_pro_plus_math
  - mmlu_pro_plus_other
  - mmlu_pro_plus_philosophy
  - mmlu_pro_plus_physics
  - mmlu_pro_plus_psychology
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: true
    filter_list: custom-extract
metadata:
  version: 1.0
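`weight_by_size: true` means the group score is a micro-average: each subtask's exact_match is weighted by its number of documents. A sketch with made-up numbers:

```python
# Illustrative numbers only; real sizes come from the per-subject test splits.
subtask_scores = {"mmlu_pro_plus_biology": 0.62, "mmlu_pro_plus_law": 0.41}
subtask_sizes = {"mmlu_pro_plus_biology": 800, "mmlu_pro_plus_law": 1200}

total = sum(subtask_sizes.values())
group_score = sum(subtask_scores[t] * subtask_sizes[t] for t in subtask_scores) / total
print(round(group_score, 3))  # 0.494
```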
description: "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_biology"
task_alias: "biology"
process_docs: !function utils.process_biology
description: "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_business"
task_alias: "business"
process_docs: !function utils.process_business
description: "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_chemistry"
task_alias: "chemistry"
process_docs: !function utils.process_chemistry
description: "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_computer_science"
task_alias: "computer_science"
process_docs: !function utils.process_computer_science
description: "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_economics"
task_alias: "economics"
process_docs: !function utils.process_economics
description: "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_engineering"
task_alias: "engineering"
process_docs: !function utils.process_engineering
description: "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_health"
task_alias: "health"
process_docs: !function utils.process_health
description: "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_history"
task_alias: "history"
process_docs: !function utils.process_history
description: "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_law"
task_alias: "law"
process_docs: !function utils.process_law
description: "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_math"
task_alias: "math"
process_docs: !function utils.process_math
description: "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_other"
task_alias: "other"
process_docs: !function utils.process_other
description: "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_philosophy"
task_alias: "philosophy"
process_docs: !function utils.process_philosophy
description: "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_physics"
task_alias: "physics"
process_docs: !function utils.process_physics
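The `utils.process_*` functions referenced above are not included in this diff. A plausible sketch, assuming they mirror the existing mmlu_pro pattern of filtering the dataset by a `category` field (an assumption), would be:

```python
import datasets


def _filter_by_subject(dataset: datasets.Dataset, subject: str) -> datasets.Dataset:
    # Hypothetical helper: keep only documents belonging to one subject.
    return dataset.filter(lambda doc: doc["category"] == subject)


def process_biology(dataset: datasets.Dataset) -> datasets.Dataset:
    return _filter_by_subject(dataset, "biology")


def process_business(dataset: datasets.Dataset) -> datasets.Dataset:
    return _filter_by_subject(dataset, "business")
```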