Unverified Commit cf51e699 authored by Yoonsoo Kim, committed by GitHub

mmlu pro generation_kwargs until Q: -> Question: (#2945)



* mmlu pro generation_kwargs until Q: -> Question:

* pacify pre-commit

* change stop token

---------
Co-authored-by: Baber <baber@hey.com>
parent af8b87cc
@@ -4,7 +4,7 @@ import os
 import numpy as np
 from metrics import (
-    classification_score,
+    # classification_score,
     code_sim_score,
     count_score,
     qa_f1_score,
@@ -29,10 +29,10 @@ dataset2metric = {
     "qmsum": rouge_score,
     "multi_news": rouge_score,
     "vcsum": rouge_zh_score,
-    "trec": classification_score,
+    # "trec": classification_score,
     "triviaqa": qa_f1_score,
     "samsum": rouge_score,
-    "lsht": classification_score,
+    # "lsht": classification_score,
     "passage_retrieval_en": retrieval_score,
     "passage_count": count_score,
     "passage_retrieval_zh": retrieval_zh_score,
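For context, the hunks above disable `classification_score` both at import time and in the `dataset2metric` map (the `trec` and `lsht` entries). Below is a minimal sketch of how a dataset-to-metric map like this is typically dispatched; the scorer and entries are simplified stand-ins for illustration, not the real metrics module.

```python
# Sketch only: simplified stand-in scorer and dispatch, not the actual metrics code.

def qa_f1_score(prediction: str, ground_truth: str) -> float:
    """Stand-in token-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = set(pred_tokens) & set(gold_tokens)
    if not pred_tokens or not gold_tokens or not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Dataset name -> metric function, mirroring the structure in the diff;
# commented-out entries are simply not scored.
dataset2metric = {
    "triviaqa": qa_f1_score,
    # "trec": classification_score,  # disabled, as in the hunk above
}


def score(dataset: str, prediction: str, ground_truth: str) -> float:
    """Look up the metric registered for `dataset` and apply it."""
    metric = dataset2metric.get(dataset)
    if metric is None:
        raise KeyError(f"No metric registered for dataset {dataset!r}")
    return metric(prediction, ground_truth)


print(score("triviaqa", "Paris is the capital", "Paris"))  # 0.4
```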
@@ -64,3 +64,5 @@ If other tasks on this dataset are already supported:
   * Added one newline to task description(s) as per [reference implementation](https://github.com/TIGER-AI-Lab/MMLU-Pro/blob/47b9891aacb8bd7cda29d5c5ba17b9434dd333bc/evaluate_from_local.py#L93)
 * (tasks, group) 2025-03-20 -- (version 2.0 --> version 2.1)
   * Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048.
+* (tasks, group) 2025-05-20 -- (version 2.1 --> version 3)
+  * changed stop sequence from "Q:" to "Question:" PR #2945
@@ -17,9 +17,7 @@ filter_list:
       - function: "take_first"
 generation_kwargs:
   until:
-    - "</s>"
-    - "Q:"
-    - "<|im_end|>"
+    - "Question:"
   max_gen_toks: 2048
   do_sample: false
   temperature: 0.0
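With this change, generation stops at "Question:" rather than "Q:", presumably to match the "Question:" prefix used by the MMLU-Pro reference prompt format linked in the changelog above. Below is a minimal sketch (an assumption about how `until` stop strings behave, not the harness's actual internals) of how such a stop sequence truncates a completion.

```python
# Sketch only: cut a generated continuation at the first occurrence
# of any stop string, the way an `until` list is typically applied.
from typing import Iterable


def apply_until(generation: str, until: Iterable[str]) -> str:
    """Truncate `generation` at the earliest stop string, if any appears."""
    cut = len(generation)
    for stop in until:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]


raw = "The answer is (B).\nQuestion: What is the capital of France?"
print(apply_until(raw, ["Question:"]))  # -> "The answer is (B).\n"
```

Stopping on "Question:" keeps the model's answer while discarding any follow-on question it begins to generate in the few-shot format.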