Unverified Commit 7f698a5a authored by Timur Aysin, committed by GitHub

Fix LongBench Evaluation (#3273)



* fix: set 'do_sample=False' and use double quotes in 'doc_to_text'

* feat: update versions and README for longbench

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
parent 0c134ee9
@@ -5,17 +5,17 @@ task: longbench_2wikimqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: 2wikimqa
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_2wikimqa_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: 2wikimqa_e
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -101,4 +101,7 @@ If other tasks on this dataset are already supported:
 ### Changelog
 v2.: fix doc_to_target; add vcsum
 v3: properly use all answers for metric calculation; trim whitespace from resps; fix stop sequences not parsing correctly.
+v4: fixed special characters in prompts; use greedy decoding by default.
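The quoting change behind the v4 "special characters" fix comes down to standard YAML semantics: inside single-quoted scalars, `\n` is two literal characters (a backslash and an `n`), while inside double-quoted scalars it is parsed as a real newline. A minimal illustration of that difference (generic YAML, not a fragment from this repo):

```yaml
# Single quotes: the model would see a literal backslash-n in the prompt.
single_quoted: 'Question: {{input}}\nAnswer:'
# Double quotes: \n is decoded to an actual line break, as the prompts intend.
double_quoted: "Question: {{input}}\nAnswer:"
```

This is why every `doc_to_text` in this commit switches from single to double quotes.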
@@ -149,7 +149,7 @@ task: {{ task }}
 dataset_path: {{ dataset_path }}
 test_split: {{ test_split }}
 dataset_name: {{ dataset_name }}
-doc_to_text: '{{ doc_to_text }}'
+doc_to_text: "{{ doc_to_text }}"
 doc_to_target: '{{ doc_to_target }}'
 process_results: {{ process_results }}
 generation_kwargs:
@@ -180,13 +180,14 @@ if __name__ == "__main__":
     generation_kwargs = {
         "max_gen_toks": dataset2maxlen[df],
         "temperature": 1,
-        "do_sample": True,
+        "do_sample": False,
         # We'll handle the until value directly in the template
     }
     raw_doc_to_text = (
         dataset2prompt[df]
         .replace("\n", "\\n")
+        .replace('"', '\\"')
         .replace("{", "{{")
         .replace("}", "}}")
     )
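A minimal sketch of the escaping round trip the generator performs above; the sample prompt here is hypothetical, not one of the LongBench prompts:

```python
# Newlines and double quotes are backslash-escaped so the prompt can sit on
# one line inside a double-quoted YAML scalar; single braces are doubled so
# dataset placeholders like {input} come out as Jinja-style {{input}}.
raw = 'Answer "briefly".\nQuestion: {input}\nAnswer:'
escaped = (
    raw.replace("\n", "\\n")
    .replace('"', '\\"')
    .replace("{", "{{")
    .replace("}", "}}")
)
print(escaped)  # Answer \"briefly\".\nQuestion: {{input}}\nAnswer:
```

The added `.replace('"', '\\"')` step is what keeps quotes inside a prompt from terminating the double-quoted YAML string early.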
@@ -210,7 +211,7 @@ if __name__ == "__main__":
         "generation_kwargs": generation_kwargs,
         "has_newline": has_newline,  # Add the flag to the template context
         "metric_list": metric_list,
-        "metadata": {"version": "3.0"},
+        "metadata": {"version": "4.0"},
     }
     # Render template
@@ -5,17 +5,17 @@ task: longbench_dureader
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: dureader
-doc_to_text: '请基于给定的文章回答下述问题。\n\n文章:{{context}}\n\n请基于上述文章回答下面的问题。\n\n问题:{{input}}\n回答:'
+doc_to_text: "请基于给定的文章回答下述问题。\n\n文章:{{context}}\n\n请基于上述文章回答下面的问题。\n\n问题:{{input}}\n回答:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_zh_score
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_zh_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_gov_report
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: gov_report
-doc_to_text: 'You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:'
+doc_to_text: "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_gov_report_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: gov_report_e
-doc_to_text: 'You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:'
+doc_to_text: "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_hotpotqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: hotpotqa
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_hotpotqa_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: hotpotqa_e
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_lcc
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: lcc
-doc_to_text: 'Please complete the code given below. \n{{context}}Next line of code:\n'
+doc_to_text: "Please complete the code given below. \n{{context}}Next line of code:\n"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_code_sim_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "code_sim_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_lcc_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: lcc_e
-doc_to_text: 'Please complete the code given below. \n{{context}}Next line of code:\n'
+doc_to_text: "Please complete the code given below. \n{{context}}Next line of code:\n"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_code_sim_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "code_sim_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_lsht
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: lsht
-doc_to_text: '请判断给定新闻的类别,下面是一些例子。\n\n{{context}}\n{{input}}'
+doc_to_text: "请判断给定新闻的类别,下面是一些例子。\n\n{{context}}\n{{input}}"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_classification_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: ["\n"]
 metric_list:
   - metric: "classification_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multi_news
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multi_news
-doc_to_text: 'You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:'
+doc_to_text: "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multi_news_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multi_news_e
-doc_to_text: 'You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:'
+doc_to_text: "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multifieldqa_en
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_en
-doc_to_text: 'Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multifieldqa_en_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_en_e
-doc_to_text: 'Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multifieldqa_zh
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_zh
-doc_to_text: '阅读以下文字并用中文简短回答:\n\n{{context}}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{{input}}\n回答:'
+doc_to_text: "阅读以下文字并用中文简短回答:\n\n{{context}}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{{input}}\n回答:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_zh_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_zh_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_musique
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: musique
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_narrativeqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: narrativeqa
-doc_to_text: 'You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {{context}}\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
+doc_to_text: "You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {{context}}\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_passage_count
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_count
-doc_to_text: 'There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: '
+doc_to_text: "There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: "
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_count_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "count_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0