[longbench] fix metric calculation (#2983)

* use all answers
* use middle truncation
* maybe fix classification score
* strip classification preds
* [vllm] remove stop tokens post-hoc
* strip all preds
* pacify pre-commit
* start on truncation utility
* add to readme
* add a footgun doc
* fix newline in yaml templates
* do not strip code_sim preds!
* fix pre-commit config
* fix instruction warning
* add note to longbench readme

Commit 147e9d61, authored Jun 08, 2025 by Baber Abbasi, committed by GitHub Jun 08, 2025

Showing with 9 additions and 7 deletions

lm_eval/tasks/longbench/triviaqa_e.yaml +5 -4
lm_eval/tasks/longbench/vcsum.yaml +4 -3

--- a/lm_eval/tasks/longbench/triviaqa_e.yaml
+++ b/lm_eval/tasks/longbench/triviaqa_e.yaml
@@ -6,15 +6,16 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: triviaqa_e
 doc_to_text: 'Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{{context}}\n\n{{input}}'
-doc_to_target: '{{answers[0]}}'
+doc_to_target: '{{answers}}'
+process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: True
-  until: ['\n']
+  until: ["\n"]
 metric_list:
-  - metric: !function metrics.qa_f1_score
+  - metric: "qa_f1_score"
    aggregation: mean
    higher_is_better: True
 metadata:
-  version: 2.0
+  version: 3.0
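
Note on the `until` change in the hunk above: in YAML, a single-quoted scalar does not process escape sequences, so `'\n'` loads as the two characters backslash and `n`, while `"\n"` loads as a real newline. The sketch below illustrates why that matters for a stop sequence. It is a minimal illustration, not the harness's code: `truncate_at_stop` is a hypothetical helper, and it assumes stop strings are matched literally against the generated text.

```python
from typing import Sequence


def truncate_at_stop(text: str, stop_sequences: Sequence[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence.

    Hypothetical helper for illustration only; it assumes stop strings
    are compared literally against the generation.
    """
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# What each YAML spelling loads as, written as Python literals:
single_quoted = "\\n"  # YAML '\n' -> backslash followed by 'n' (2 characters)
double_quoted = "\n"   # YAML "\n" -> one newline character

generation = "Paris\nQuestion: what comes next?"

# The two-character string never appears in the output, so nothing is cut.
untrimmed = truncate_at_stop(generation, [single_quoted])

# The real newline is found and the answer is cut at the first line.
trimmed = truncate_at_stop(generation, [double_quoted])
```

Under these assumptions the single-quoted form leaves the whole generation in place, so text after the first line would reach the scorer, while the double-quoted form stops at the end of the answer line.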
--- a/lm_eval/tasks/longbench/vcsum.yaml
+++ b/lm_eval/tasks/longbench/vcsum.yaml
@@ -6,15 +6,16 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: vcsum
 doc_to_text: '下面有一段会议记录，请你阅读后，写一段总结，总结会议的内容。\n会议记录：\n{{context}}\n\n会议总结：'
-doc_to_target: '{{answers[0]}}'
+doc_to_target: '{{answers}}'
+process_results: !function metrics.get_rouge_zh_score
 generation_kwargs:
  max_gen_toks: 512
  temperature: 1
  do_sample: True
  until: []
 metric_list:
-  - metric: !function metrics.rouge_zh_score
+  - metric: "rouge_zh_score"
    aggregation: mean
    higher_is_better: True
 metadata:
-  version: 2.0
+  version: 3.0
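
Both hunks above change `doc_to_target` from `{{answers[0]}}` to `{{answers}}` and route scoring through a `process_results` function, which matches the "use all answers" item in the commit message. The sketch below shows the general idea of scoring a prediction against every reference and keeping the best score. It is an illustration under stated assumptions, not the contents of `metrics.get_qa_f1_score`: the normalization is simplified, `token_f1`, `best_f1`, and `process_results_sketch` are hypothetical names, and the `(doc, results)` signature returning a metric-name dictionary is assumed from the usual task convention.

```python
import re
import string
from collections import Counter
from typing import Dict, List, Sequence


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between one prediction and one reference."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: Sequence[str]) -> float:
    """Score against every reference and keep the highest F1."""
    return max((token_f1(prediction, ref) for ref in references), default=0.0)


def process_results_sketch(doc: dict, results: Sequence[str]) -> Dict[str, float]:
    """Hypothetical stand-in: strip the prediction, then score against all answers."""
    prediction = results[0].strip()
    return {"qa_f1_score": best_f1(prediction, doc["answers"])}


doc = {"answers": ["Barack Obama", "Obama"]}
first_only = token_f1("Obama", doc["answers"][0])   # scores against answers[0] only
all_answers = process_results_sketch(doc, [" Obama\n"])["qa_f1_score"]
```

In this example, scoring against only the first answer penalizes a prediction that exactly matches the second answer, while taking the maximum over all answers does not. That difference would plausibly change reported scores, which is consistent with the version bump from 2.0 to 3.0.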