# HRM8K ### Paper Title: [Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap](https://www.arxiv.org/abs/2501.02448) Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models. Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K ### Citation ``` @article{ko2025understand, title={Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap}, author={Ko, Hyunwoo and Son, Guijin and Choi, Dasol}, journal={arXiv preprint arXiv:2501.02448}, year={2025} } ``` ### Groups and and Tasks #### Groups * `hrm8k`: HRM8K comprises 8,011 instances for evaluation, sourced through a combination of translations from established English benchmarks (e.g., GSM8K, MATH, OmniMath, MMMLU) and original problems curated from existing Korean math exams. This benchmark consists of Korean instruction and question. * `hrm8k_en`: English version of `hrm8k`. This benchmark consists of English instruction and question. #### Tasks * `hrm8k_{gsm8k|ksm|math|mmmlu|omni_math}` * `hrm8k_en_{gsm8k|ksm|math|mmmlu|omni_math}` ### Checklist For adding novel benchmarks/datasets to the library: * [x] Is the task an existing benchmark in the literature? * [x] Have you referenced the original paper that introduced the task? * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test? If other tasks on this dataset are already supported: * [ ] Is the "Main" variant of this task clearly denoted? * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?