Commit 250a04ec authored by Geun, Lim; committed by GitHub

Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks (#3124)



* Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks

* Increased max_gen_toks to 2048 (matches Appendix B of the original paper) and added Evaluation Settings and Changelog sections.

* add some logs

---------
Co-authored-by: Baber <baber@hey.com>
parent 8c05cfe0
@@ -8,6 +8,15 @@ Large language models (LLMs) demonstrate exceptional performance on complex reas
 Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
+### Evaluation Settings
+The authors suggest (Appendix B) using:
+* **Sampling temperature:** `0.7`
+* **Top‑p:** `0.95`
+* **Output length:** *min* `8` tokens, *max* `2048` tokens (`max_gen_toks`)
+We default to greedy decoding.
 ### Citation
@@ -20,7 +29,7 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
 }
 ```
-### Groups and and Tasks
+### Groups and Tasks
 #### Groups
@@ -37,10 +46,13 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
 For adding novel benchmarks/datasets to the library:
 * [x] Is the task an existing benchmark in the literature?
 * [x] Have you referenced the original paper that introduced the task?
-* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
 If other tasks on this dataset are already supported:
-* [ ] Is the "Main" variant of this task clearly denoted?
-* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
-* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
+### Changelog
+- 2025-07-22: v1.1: increased `max_gen_toks` to 2048
@@ -11,7 +11,7 @@ generation_kwargs:
 - "<|end_of_text|>"
 - "<|endoftext|>"
 - "<|im_end|>"
-max_gen_toks: 512
+max_gen_toks: 2048
 do_sample: false
 temperature: 0
 metric_list:
@@ -19,4 +19,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-version: 1.0
+version: 1.1
@@ -11,7 +11,7 @@ generation_kwargs:
 - "<|end_of_text|>"
 - "<|endoftext|>"
 - "<|im_end|>"
-max_gen_toks: 512
+max_gen_toks: 2048
 do_sample: false
 temperature: 0
 metric_list:
@@ -19,4 +19,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-version: 1.0
+version: 1.1
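
Note on the Evaluation Settings added in this commit: the README lists the paper's sampling setup (temperature `0.7`, top-p `0.95`), while the task configs keep greedy decoding with the new `max_gen_toks: 2048`. The sketch below is one possible way to express the difference as an override on top of the defaults. The helper names (`with_paper_sampling`, `to_override_string`) are hypothetical and not part of this change, and the comma-separated `key=value` form is only an assumption about how a harness might accept such overrides. The suggested minimum of `8` tokens is not covered here.

```python
# Greedy defaults as they appear in the updated task configs
# (only the keys relevant to decoding are reproduced here).
GREEDY_DEFAULTS = {
    "max_gen_toks": 2048,
    "do_sample": False,
    "temperature": 0,
}

# Sampling settings the README attributes to Appendix B of the paper.
PAPER_SAMPLING = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
}


def with_paper_sampling(defaults: dict) -> dict:
    """Return a copy of the defaults with the paper's sampling settings applied."""
    return {**defaults, **PAPER_SAMPLING}


def to_override_string(overrides: dict) -> str:
    """Render overrides as a comma-separated key=value string (assumed format)."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


if __name__ == "__main__":
    print(with_paper_sampling(GREEDY_DEFAULTS))
    print(to_override_string(PAPER_SAMPLING))
```

Merging rather than editing the defaults in place keeps the greedy configuration reproducible, so results under both setups can be reported side by side.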