update `minerva_math` (#3259)

* update math_verify
* remove normalization
* use full solution in `parse`
* update version

Commit 18d2face, authored Aug 24, 2025 by Baber Abbasi; committed by GitHub Aug 24, 2025
3 changed files
--- a/lm_eval/tasks/minerva_math/README.md
+++ b/lm_eval/tasks/minerva_math/README.md
 # MATH
 ℹ️ This is the 4-shot variant!
 ## Paper
 Measuring Mathematical Problem Solving With the MATH Dataset
 https://arxiv.org/abs/2103.03874
-Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
+Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of
+computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging
+competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach
+models to generate answer derivations and explanations.
-NOTE: The few-shot and the generated answer extraction is based on the [Minerva](https://arxiv.org/abs/2206.14858) and exact match equivalence is calculated using the `sympy` library. This requires additional dependencies, which can be installed via the `lm-eval[math]` extra.
+NOTE: The few-shot and the generated answer extraction is based on the [Minerva](https://arxiv.org/abs/2206.14858) and
+exact match equivalence is calculated using the `sympy` library. This requires additional dependencies, which can be
+installed via the `lm-eval[math]` extra.
 Homepage: https://github.com/hendrycks/math
 ## Citation
 ```
 @article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
@@ -49,13 +57,18 @@ Eprint = {arXiv:2206.14858},
 The checklist is the following:
 For adding novel benchmarks/datasets to the library:
-* [x] Is the task an existing benchmark in the literature?
-  * [x] Have you referenced the original paper that introduced the task?
-  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
-    * The implementation in the original paper is one where the model is first fine-tuned on the data. They do have a few-shot evaluation for GPT-3, however the few-shot context used here is sourced from [Lewkowycz et al](https://arxiv.org/abs/2206.14858). The achieved accuracy on Llama-2 models is comparable to that provided in the paper, though not identical.
+* [x] Is the task an existing benchmark in the literature?
+    * [x] Have you referenced the original paper that introduced the task?
+    * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the
+      reference implementation and documented how to run such a test?
+        * The implementation in the original paper is one where the model is first fine-tuned on the data. They do have
+          a few-shot evaluation for GPT-3, however the few-shot context used here is sourced
+          from [Lewkowycz et al](https://arxiv.org/abs/2206.14858). The achieved accuracy on Llama-2 models is
+          comparable to that provided in the paper, though not identical.
 If other tasks on this dataset are already supported:
 * [x] Is the "Main" variant of this task clearly denoted?
 * [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
 * [x] Have you noted which, if any, published evaluation setups are matched by this variant?
@@ -65,4 +78,7 @@ If other tasks on this dataset are already supported:
 - [ ] zero-shot variant
 ### Changelog
-version 2.0: (21-Feb-2025); added math_verify (extraction) metric. For details [see](https://huggingface.co/blog/math_verify_leaderboard)
+- version 2.0: (21-Feb-2025); added math_verify (extraction) metric. For
+  details [see](https://huggingface.co/blog/math_verify_leaderboard)
+- version 3.0 (21-Aug-2025); pass the full solution and model generation to `math_verify`'s `parse`
--- a/lm_eval/tasks/minerva_math/minerva_math_algebra.yaml
+++ b/lm_eval/tasks/minerva_math/minerva_math_algebra.yaml
@@ -24,7 +24,7 @@ metric_list:
    higher_is_better: true
 num_fewshot: 4
 metadata:
-  version: 2.0
+  version: 3.0
 fewshot_config:
  sampler: first_n
  samples: !function utils.list_fewshot_samples
--- a/lm_eval/tasks/minerva_math/utils.py
+++ b/lm_eval/tasks/minerva_math/utils.py
@@ -71,7 +71,7 @@ def list_fewshot_samples() -> list[dict]:
    ]
-def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
+def process_results(doc: dict, results: list[str]) -> dict[str, int]:
    candidates = results[0]
    unnormalized_answer = get_unnormalized_answer(candidates)
@@ -83,14 +83,17 @@ def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
        retval = 0
    # math_verify
-    res = verify(parse(doc["answer"]), parse(candidates))
-    mathval = 1 if res else 0
+    _mvres = verify(
+        gold=parse(doc["solution"]),
+        target=parse(candidates),
+    )
+    mathval = 1 if _mvres else 0
-    results = {
+    res = {
        "exact_match": retval,
        "math_verify": mathval,
    }
-    return results
+    return res
 def last_boxed_only_string(string: str) -> Optional[str]:
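
The hunk above switches the gold input from the pre-extracted `doc["answer"]` to the full `doc["solution"]`, so answer extraction now happens on both sides of the comparison. The sketch below illustrates that idea using only the standard library. It is a toy stand-in, not `math_verify`'s logic and not the repository's `last_boxed_only_string`: the names `extract_last_boxed` and `toy_math_verify` are hypothetical, plain string equality replaces symbolic equivalence, and it assumes both the solution and the generation put their final answer in a `\boxed{...}` expression.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper for illustration only; braces are matched by
    counting nesting depth so that answers like \\frac{1}{2} survive.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    body_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for i in range(body_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[body_start:i]
    return None  # unbalanced braces: no complete boxed expression


def toy_math_verify(solution: str, generation: str) -> int:
    """Score 1 if both texts box the same final answer, else 0.

    Assumption: exact string equality stands in for the symbolic
    comparison that a real verifier would perform.
    """
    gold = extract_last_boxed(solution)
    pred = extract_last_boxed(generation)
    return 1 if gold is not None and gold == pred else 0
```

For example, `toy_math_verify("So $x=\\boxed{2}$.", "The answer is $\\boxed{2}$.")` scores the pair as a match without the caller ever extracting the gold answer by hand, which mirrors the shape of the updated `verify(gold=parse(...), target=parse(...))` call.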