Commit e6b798f9 authored by Baber

Merge branch 'main' into metrics

# Conflicts:
#	.pre-commit-config.yaml
#	lm_eval/api/task.py
#	lm_eval/models/huggingface.py
#	lm_eval/models/vllm_causallms.py
#	pyproject.toml
parents 14a29ade 4f8195f1
@@ -22,5 +22,3 @@ metric_list:
higher_is_better: true
metadata:
version: 3.0
dataset_kwargs:
trust_remote_code: true
@@ -20,5 +20,3 @@ metric_list:
higher_is_better: true
metadata:
version: 0.0
dataset_kwargs:
trust_remote_code: true
@@ -23,5 +23,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -14,5 +14,4 @@ metric_list:
aggregation: mean
higher_is_better: true
dataset_kwargs:
trust_remote_code: true
streaming: true
tag:
- multiple_choice
task: hellaswag
dataset_path: hellaswag
dataset_path: Rowan/hellaswag
dataset_name: null
output_type: multiple_choice
training_split: train
@@ -20,5 +20,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -7,7 +7,7 @@ dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
@@ -21,5 +21,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -8,6 +8,15 @@ Large language models (LLMs) demonstrate exceptional performance on complex reas
Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
### Evaluation Settings
The authors suggest (Appendix B) using:
* **Sampling temperature:** `0.7`
* **Top‑p:** `0.95`
* **Output length:** *min* `8` tokens, *max* `2048` tokens (`max_gen_toks`)
We default to greedy decoding.
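To reproduce the authors' suggested sampling setup instead of the greedy default, the task's `generation_kwargs` could be overridden along these lines (a sketch; `top_p` assumes the standard Hugging Face nucleus-sampling keyword, while the other keys appear in the configs below):
```yaml
generation_kwargs:
  do_sample: true     # sample rather than decode greedily
  temperature: 0.7
  top_p: 0.95         # assumed HF keyword for nucleus sampling
  max_gen_toks: 2048
```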
### Citation
@@ -20,7 +29,7 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
}
```
### Groups and and Tasks
### Groups and Tasks
#### Groups
@@ -37,10 +46,13 @@ Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
- 2025-07-22: v1.1: increased `max_gen_toks` to 2048
@@ -11,7 +11,7 @@ generation_kwargs:
- "<|end_of_text|>"
- "<|endoftext|>"
- "<|im_end|>"
max_gen_toks: 512
max_gen_toks: 2048
do_sample: false
temperature: 0
metric_list:
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
version: 1.1
@@ -111,7 +111,7 @@ def parse_math_answer(raw_string):
return retval
def get_answer_with_dollar_sign(s):
first_pattern = "\$(.*)\$"
first_pattern = r"\$(.*)\$"
last_match = None
matches = re.findall(first_pattern, s)
if matches:
@@ -127,7 +127,7 @@ def parse_math_answer(raw_string):
if "\\n" in last_match:
last_match = last_match.split("\\n")[0]
else:
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
pattern = "(?:\\$)?\\d+(?:\\.\\d+)?(?![\\w\\d])"
matches = re.findall(pattern, s)
if matches:
last_match = matches[-1]
@@ -250,7 +250,7 @@ def _strip_string(string):
# remove percentage
string = string.replace("\\%", "")
string = string.replace("\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
@@ -11,7 +11,7 @@ generation_kwargs:
- "<|end_of_text|>"
- "<|endoftext|>"
- "<|im_end|>"
max_gen_toks: 512
max_gen_toks: 2048
do_sample: false
temperature: 0
metric_list:
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
version: 1.1
@@ -111,7 +111,7 @@ def parse_math_answer(raw_string):
return retval
def get_answer_with_dollar_sign(s):
first_pattern = "\$(.*)\$"
first_pattern = r"\$(.*)\$"
last_match = None
matches = re.findall(first_pattern, s)
if matches:
@@ -127,7 +127,7 @@ def parse_math_answer(raw_string):
if "\\n" in last_match:
last_match = last_match.split("\\n")[0]
else:
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
pattern = "(?:\\$)?\\d+(?:\\.\\d+)?(?![\\w\\d])"
matches = re.findall(pattern, s)
if matches:
last_match = matches[-1]
@@ -250,7 +250,7 @@ def _strip_string(string):
# remove percentage
string = string.replace("\\%", "")
string = string.replace("\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
@@ -14,7 +14,5 @@ metric_list:
- metric: acc_norm
aggregation: mean
higher_is_better: true
dataset_kwargs:
trust_remote_code: true
metadata:
version: 0
@@ -19,5 +19,3 @@ metric_list:
higher_is_better: True
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -19,5 +19,3 @@ metric_list:
higher_is_better: True
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -29,5 +29,3 @@ generation_kwargs:
max_gen_toks: 1024
metadata:
version: 2.0
dataset_kwargs:
trust_remote_code: true
@@ -18,5 +18,3 @@ metric_list:
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
@@ -22,8 +22,6 @@ metric_list:
num_fewshot: 4
metadata:
version: 3.0
dataset_kwargs:
trust_remote_code: true
fewshot_config:
sampler: first_n
samples: !function utils.list_fewshot_samples
# LIBRA
### Paper
Title: `LIBRA: Long Input Benchmark for Russian Analysis`
Abstract: `Datasets for proper evaluation of long-context understanding in Russian. For the Russian language LIBRA comprises 21 adapted datasets to study the LLM's abilities to understand long texts thoroughly. The tests are divided into four complexity groups and allow the evaluation of models across various context lengths ranging from 4k up to 128k tokens.`
Homepage: `https://huggingface.co/datasets/ai-forever/LIBRA`
### Citation
```
@misc{churin2024longinputbenchmarkrussian,
title={Long Input Benchmark for Russian Analysis},
author={Igor Churin and Murat Apishev and Maria Tikhonova and Denis Shevelev and Aydar Bulatov and Yuri Kuratov and Sergei Averkiev and Alena Fenogenova},
year={2024},
eprint={2408.02439},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.02439},
}
```
### Groups, Tags, and Tasks
#### Groups
* `libra_simple_information_retrieval`
* `libra_question_answering_and_multiple_choice`
* `libra_multi_hop_question_answering`
* `libra_complex_reasoning_and_mathematical_problems`
#### Tags
* `libra`
#### Tasks
* `passkey`
* `passkey_with_librusec`
* `matreshka_yes_no`
* `matreshka_names`
* `librusec_history`
* `ru_trec`
* `ru_sci_abstract_retrieval`
* `ru_sci_fi`
* `ru_quality`
* `ru_tpo`
* `ru_babilong_qa1`
* `ru_babilong_qa2`
* `ru_babilong_qa3`
* `ru_babilong_qa4`
* `ru_babilong_qa5`
* `long_context_multiq`
* `librusec_mhqa`
* `ru_2wikimultihopqa`
* `ru_sci_passage_count`
* `ru_qasper`
* `ru_gsm100`
### Variants
Usage (**all with `--apply_chat_template`**):
```
lm_eval --model hf --model_args pretrained=EleutherAI/gpt-j-6B --tasks libra_simple_information_retrieval --device cpu --apply_chat_template --log_samples --output_path test_libra_simple_information_retrieval
```
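The same run can also be scripted; a minimal sketch, assuming the harness's Python entry point `lm_eval.simple_evaluate` accepts the same options as the CLI flags above:
```python
# Sketch: the CLI invocation above, driven from Python.
# Assumes lm_eval.simple_evaluate mirrors the CLI options used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["libra_simple_information_retrieval"],
    device="cpu",
    apply_chat_template=True,  # required for all LIBRA variants
    log_samples=True,
)
print(results["results"])
```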
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: libra_complex_reasoning_and_mathematical_problems
group_alias: Complex Reasoning and Mathematical Problems
task:
- ru_sci_passage_count
- ru_qasper
- ru_gsm100
group: libra_multi_hop_question_answering
group_alias: Multi-hop Question Answering
task:
- ru_babilong_qa1
- ru_babilong_qa2
- ru_babilong_qa3
- ru_babilong_qa4
- ru_babilong_qa5
- long_context_multiq
- librusec_mhqa
- ru_2wikimultihopqa