Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks (#3124)

* Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks

* • Increased max_gen_toks to 2048 (matches Appendix B of original paper).
  • Added Evaluation Settings and Changelog sections.

* add some logs

---------

Co-authored-by: Baber <baber@hey.com>

Commit 250a04ec authored Jul 22, 2025 by Geun, Lim; committed by GitHub Jul 22, 2025
3 changed files
--- a/lm_eval/tasks/hrm8k/README.md
+++ b/lm_eval/tasks/hrm8k/README.md
@@ -8,6 +8,15 @@ Large language models (LLMs) demonstrate exceptional performance on complex reas
 Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
+### Evaluation Settings
+The authors suggest (Appendix B) using:
+* **Sampling temperature:** `0.7`  
+* **Top‑p:** `0.95`  
+* **Output length:** *min* `8` tokens, *max* `2048` tokens (`max_gen_toks`)
+We default to greedy decoding.
 ### Citation
@@ -20,7 +29,7 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
 }
 ```
-### Groups and and Tasks
+### Groups and Tasks
 #### Groups
@@ -37,10 +46,13 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
 For adding novel benchmarks/datasets to the library:
 * [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
-  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
 If other tasks on this dataset are already supported:
-* [ ] Is the "Main" variant of this task clearly denoted?
+* [x] Is the "Main" variant of this task clearly denoted?
-* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
-* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
+### Changelog
+- 2025-07-22: v1.1: increased `max_gen_toks` to 2048
--- a/lm_eval/tasks/hrm8k/default/_hrm8k_yaml
+++ b/lm_eval/tasks/hrm8k/default/_hrm8k_yaml
@@ -11,7 +11,7 @@ generation_kwargs:
    - "<|end_of_text|>"
    - "<|endoftext|>"
    - "<|im_end|>"
-  max_gen_toks: 512
+  max_gen_toks: 2048
  do_sample: false
  temperature: 0
 metric_list:
@@ -19,4 +19,4 @@ metric_list:
    aggregation: mean
    higher_is_better: true
 metadata:
-  version: 1.0
+  version: 1.1
--- a/lm_eval/tasks/hrm8k/en/_hrm8k_en_yaml
+++ b/lm_eval/tasks/hrm8k/en/_hrm8k_en_yaml
@@ -11,7 +11,7 @@ generation_kwargs:
    - "<|end_of_text|>"
    - "<|endoftext|>"
    - "<|im_end|>"
-  max_gen_toks: 512
+  max_gen_toks: 2048
  do_sample: false
  temperature: 0
 metric_list:
@@ -19,4 +19,4 @@ metric_list:
    aggregation: mean
    higher_is_better: true
 metadata:
-  version: 1.0
+  version: 1.1
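
Note on reproducing the paper's settings: the README hunk lists the Appendix B sampling values, while both YAML files keep `do_sample: false` and `temperature: 0`. The sketch below shows one way the sampling values might be passed as a run-time override instead of editing the task configs. It assumes the harness's `--gen_kwargs` flag, which takes comma-separated `key=value` pairs, and assumes the task group is registered as `hrm8k`. The helper names `format_gen_kwargs` and `build_command` and the `pretrained=<your-model>` placeholder are hypothetical. The minimum output length of 8 tokens is not covered here.

```python
# Sketch only: builds an lm_eval command line that overrides the greedy
# defaults with the sampling settings quoted in the README (Appendix B).
# Helper names and the default task name "hrm8k" are assumptions.

# Sampling values copied from the README's "Evaluation Settings" section.
APPENDIX_B_SAMPLING = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
}


def format_gen_kwargs(overrides: dict) -> str:
    """Join overrides as comma-separated key=value pairs."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


def build_command(model_args: str, task: str = "hrm8k") -> list[str]:
    """Return the argv list for an lm_eval run with the sampling overrides."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", task,
        "--gen_kwargs", format_gen_kwargs(APPENDIX_B_SAMPLING),
    ]


if __name__ == "__main__":
    # Replace the placeholder with a real checkpoint before running the command.
    print(" ".join(build_command("pretrained=<your-model>")))
```

`max_gen_toks` is left out of the override because this commit already sets it to 2048 in both YAML files.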