Longbench v2 (#3338)

* initial commit
* change to acc
* fix long-dialogue tasks
* fix versioning
* more fixes
* fix naming
* fix naming
* more renaming
* maybe a dataset fix
* fix dataset and use new dataset schema
* add README
* fix prompt and dataset naming
* lint
* remove utils.py
* lint
* more linting
* fix typo
* fix naming
* add longbenchv2

---------

Co-authored-by: Baber <baber@hey.com>

Commit 655718d0, authored Oct 14, 2025 by Janna, committed by GitHub Oct 14, 2025
20 changed files
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -6,9 +6,9 @@ For more information, including a full list of task names and their precise mean
 provided to the individual README.md files for each subfolder.
 | Task Family                                                              | Description                                                                                                                                                                                                                                                                                                                            | Language(s)                                                                                                                                                                                                                                                   |
-|--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
+|--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [eq-bench_es](eq_bench/README.md) | Spanish version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_es) |Spanish **Human Translated** |
+| [eq-bench_es](eq_bench/README.md)                                        | Spanish version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_es)                                                                                                                                                           | Spanish **Human Translated**                                                                                                                                                                                                                                  |
-| [eq-bench_ca](eq_bench/README.md) | Catalan version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_ca)| Catalan                                                                                                                        **Human Translated** |
+| [eq-bench_ca](eq_bench/README.md)                                        | Catalan version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_ca)                                                                                                                                                           | Catalan                                                                                                                        **Human Translated**                                                                                                           |
 | [aclue](aclue/README.md)                                                 | Tasks focusing on ancient Chinese language understanding and cultural aspects.                                                                                                                                                                                                                                                         | Ancient Chinese                                                                                                                                                                                                                                               |
 | [acp_bench](acpbench/README.md)                                          | Tasks evaluating the reasoning ability about Action, Change, and Planning                                                                                                                                                                                                                                                              | English                                                                                                                                                                                                                                                       |
 | [acp_bench_hard](acpbench/README.md)                                     | Tasks evaluating the reasoning ability about Action, Change, and Planning                                                                                                                                                                                                                                                              | English                                                                                                                                                                                                                                                       |
@@ -103,6 +103,7 @@ provided to the individual README.md files for each subfolder.
 | [logiqa](logiqa/README.md)                                               | Logical reasoning tasks requiring advanced inference and deduction.                                                                                                                                                                                                                                                                    | English, Chinese                                                                                                                                                                                                                                              |
 | [logiqa2](logiqa2/README.md)                                             | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination.                                                                                                                                                                                                                                              | English, Chinese                                                                                                                                                                                                                                              |
 | [longbench](longbench/README.md)                                         | LongBench evaluates language models' ability to understand lengthy texts across multiple tasks and languages.                                                                                                                                                                                                                          | English, Chinese                                                                                                                                                                                                                                              |
+| [longbenchv2](longbench/README.md)                                       | longbench v2, multiple-choice variant.                                                                                                                                                                                                                                                                                                 | English, Chinese                                                                                                                                                                                                                                              |
 | [mastermind](mastermind/README.md)                                       | Reasoning benchmark based on the board game of Mastermind.                                                                                                                                                                                                                                                                             | English                                                                                                                                                                                                                                                       |
 | [mathqa](mathqa/README.md)                                               | Question answering tasks involving mathematical reasoning and problem-solving.                                                                                                                                                                                                                                                         | English                                                                                                                                                                                                                                                       |
 | [mbpp](mbpp/README.md)                                                   | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions.                                                                                                                                                                                                                    | Python                                                                                                                                                                                                                                                        |
@@ -124,7 +125,7 @@ provided to the individual README.md files for each subfolder.
 | [mmlu_redux](mmlu-redux-spanish/README.md)                               | Refined Massive Multitask Language Understanding benchmark for broad domain evaluation with improved data quality.                                                                                                                                                                                                                     | Spanish                                                                                                                                                                                                                                                       |
 | [mmlu_pro](mmlu_pro/README.md)                                           | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.                                                                                                                                                                                                | English                                                                                                                                                                                                                                                       |
 | [mmlu-pro-plus](mmlu-pro-plus/README.md)                                 | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs.                                                                                                                                                                                                                                                    | English                                                                                                                                                                                                                                                       |
-| [mmlu_prox](mmlu_prox/README.md)                                         | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation.                                                                                                                                                                                                                      | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian|
+| [mmlu_prox](mmlu_prox/README.md)                                         | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation.                                                                                                                                                                                                                      | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
 | [mmlusr](mmlusr/README.md)                                               | Variation of MMLU designed to be more rigorous.                                                                                                                                                                                                                                                                                        | English                                                                                                                                                                                                                                                       |
 | model_written_evals                                                      | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns.                                                                                                                                                                                                                                                     |                                                                                                                                                                                                                                                               |
 | [moral_stories](moral_stories/README.md)                                 | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations.                                                                                                                                                      | English                                                                                                                                                                                                                                                       |
@@ -194,6 +195,6 @@ provided to the individual README.md files for each subfolder.
 ## Multimodal Tasks
 | Task Family                  | Description                                                                                             | Modality    |
-| ---------------------------- | ------------------------------------------------------------------------------------------------------- | ----------- |
+|------------------------------|---------------------------------------------------------------------------------------------------------|-------------|
 | [chartqa](chartqa/README.md) | A benchmark for question answering about charts that requires both visual and logical reasoning.        | Image, Text |
 | [mmmu](mmmu/README.md)       | Evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge. | Image, Text |
--- a/lm_eval/tasks/longbench2/README.md
+++ b/lm_eval/tasks/longbench2/README.md
+# LongBench v2
+### Paper
+Title: `LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks`
+Abstract: `This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.`
+Homepage: `https://github.com/THUDM/LongBench`
+### Citation
+```
+@article{bai2024longbench2,
+  title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks},
+  author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
+  journal={arXiv preprint arXiv:2412.15204},
+  year={2024}
+}
+```
+### Groups, Tags, and Tasks
+#### Groups
+* `longbench2_single`: Single-document QA tasks requiring comprehension of documents across various domains (government, legal, literature, finance, academic, detective stories, and order of events)
+* `longbench2_multi`: Multi-document QA tasks requiring information synthesis and reasoning across multiple documents in government, academic, finance, and news
+* `longbench2_incontext`: Long in-context learning tasks including user guide comprehension, translation with examples, and many-shot learning scenarios
+* `longbench2_history`: Long-dialogue history understanding tasks involving agent conversations and dialogue history comprehension
+* `longbench2_structured`: Long structured data understanding tasks for graph and table data processing
+#### Tags
+* `longbench2`: Run the full benchmark with 503 multiple-choice questions (8k-2M words) testing understanding and reasoning on long-context tasks
+#### Tasks
+**Single-Document QA:**
+* `longbench2_govt_single`: Question answering from single government documents
+* `longbench2_legal_single`: Question answering from single legal documents
+* `longbench2_lit_single`: Question answering from single literary documents
+* `longbench2_fin_single`: Question answering from single financial documents
+* `longbench2_academic_single`: Question answering from single academic papers and research documents
+* `longbench2_detective`: Question answering from detective stories requiring logical reasoning
+* `longbench2_event_order`: Temporal reasoning tasks about event ordering in narratives
+**Multi-Document QA:**
+* `longbench2_govt_multi`: Question answering across multiple government documents
+* `longbench2_academic_multi`: Question answering across multiple academic papers
+* `longbench2_fin_multi`: Question answering across multiple financial documents
+* `longbench2_news_multi`: Question answering across multiple news articles
+**Long In-context Learning:**
+* `longbench2_user_guide`: Comprehension and application of user guide instructions
+* `longbench2_translate`: Translation tasks in new languages with long examples
+* `longbench2_many_shot`: Many-shot learning with a large number of in-context examples
+**Long-dialogue History Understanding:**
+* `longbench2_agent_history`: Understanding and reasoning over extended agent conversation histories
+* `longbench2_dialogue_history`: Understanding and reasoning over long dialogue exchanges
+**Code Repository Understanding:**
+* `longbench2_code`: Question answering on code repositories requiring codebase comprehension
+**Long Structured Data Understanding:**
+* `longbench2_graph`: Understanding and reasoning over graph-structured data
+* `longbench2_table`: Understanding and reasoning over tabular data
+### Checklist
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/longbench2/_longbench2.yaml
+++ b/lm_eval/tasks/longbench2/_longbench2.yaml
+group: longbench2
+task:
+  - longbench2_history
+  - longbench2_incontext
+  - longbench2_multi
+  - longbench2_single
+  - longbench2_structured
+  - longbench2_code
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0.0
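
Reviewer note: the group config above sets `weight_by_size: True` for `acc`. A minimal sketch of what that option is assumed to mean here, namely that each subtask's accuracy is weighted by its number of documents rather than averaged uniformly. The function name and the numbers are hypothetical and are not taken from the harness or from any real LongBench v2 run.

```python
from typing import Sequence


def size_weighted_mean(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Average per-subtask accuracies, weighting each by its document count."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one subtask must contain documents")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical numbers for illustration only; not real benchmark results.
subtask_acc = [0.50, 0.25]
subtask_sizes = [30, 10]

weighted = size_weighted_mean(subtask_acc, subtask_sizes)
unweighted = sum(subtask_acc) / len(subtask_acc)
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

With unequal subtask sizes the two averages differ, which is why the choice matters for a benchmark whose categories vary widely in question count.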
--- a/lm_eval/tasks/longbench2/_longbench2_history.yaml
+++ b/lm_eval/tasks/longbench2/_longbench2_history.yaml
+group: longbench2_history
+group_alias: "Long-dialogue History Understanding"
+task:
+  - longbench2_agent_history
+  - longbench2_dialogue_history
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/longbench2/_longbench2_incontext.yaml
+++ b/lm_eval/tasks/longbench2/_longbench2_incontext.yaml
+group: longbench2_incontext
+group_alias: "Long In-context Learning"
+task:
+  - longbench2_user_guide
+  - longbench2_translate
+  - longbench2_many_shot
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/longbench2/_longbench2_multi.yaml
+++ b/lm_eval/tasks/longbench2/_longbench2_multi.yaml
+group: longbench2_multi
+group_alias: "Multi-Document QA"
+task:
+  - longbench2_govt_multi
+  - longbench2_academic_multi
+  - longbench2_fin_multi
+  - longbench2_news_multi
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/longbench2/_longbench2_single.yaml
+++ b/lm_eval/tasks/longbench2/_longbench2_single.yaml
+group: longbench2_single
+group_alias: "Single-Document QA"
+task:
+  - longbench2_govt_single
+  - longbench2_legal_single
+  - longbench2_lit_single
+  - longbench2_fin_single
+  - longbench2_event_order
+  - longbench2_academic_single
+  - longbench2_detective
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/longbench2/_longbench2_structured.yaml
+++ b/lm_eval/tasks/longbench2/_longbench2_structured.yaml
+group: longbench2_structured
+group_alias: "Long Structured Data Understanding"
+task:
+  - longbench2_graph
+  - longbench2_table
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/longbench2/_longbench_common_yaml
+++ b/lm_eval/tasks/longbench2/_longbench_common_yaml
+dataset_path: recursal/longbench-v2
+test_split: train
+output_type: multiple_choice
+doc_to_text: "Please read the following text and answer the question below.\n\n<text>\n{{context}}\n</text>\n\nWhat is the correct answer to this question: {{question.strip()}}\nChoices:\n(A) {{choices[0]}}\n(B) {{choices[1]}}\n(C) {{choices[2]}}\n(D) {{choices[3]}}\n\nAnswer:"
+doc_to_choice: ["A", "B", "C", "D"]
+doc_to_target: answer
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 0.0
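
Reviewer note: to make the `doc_to_text` template above easier to check, here is a rough plain-Python equivalent of the Jinja string. It assumes each document exposes `context`, `question`, and a four-element `choices` list, as the template implies. The sample document is invented for illustration, and `render_prompt` is a hypothetical helper, not part of the harness.

```python
PROMPT_TEMPLATE = (
    "Please read the following text and answer the question below.\n\n"
    "<text>\n{context}\n</text>\n\n"
    "What is the correct answer to this question: {question}\n"
    "Choices:\n"
    "(A) {a}\n(B) {b}\n(C) {c}\n(D) {d}\n\n"
    "Answer:"
)


def render_prompt(doc: dict) -> str:
    """Mirror the Jinja template: strip the question, enumerate four choices."""
    choices = doc["choices"]
    if len(choices) != 4:
        raise ValueError("the template expects exactly four choices")
    return PROMPT_TEMPLATE.format(
        context=doc["context"],
        question=doc["question"].strip(),
        a=choices[0],
        b=choices[1],
        c=choices[2],
        d=choices[3],
    )


# Invented example document; real contexts run from 8k to 2M words.
example_doc = {
    "context": "The ship left port on Monday and arrived on Thursday.",
    "question": "  On which day did the ship arrive? ",
    "choices": ["Monday", "Tuesday", "Wednesday", "Thursday"],
}

prompt = render_prompt(example_doc)
print(prompt)
```

Because `output_type` is `multiple_choice`, the continuations scored after `Answer:` are the letters listed in `doc_to_choice`, not the choice texts themselves.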
--- a/lm_eval/tasks/longbench2/academic_multi_doc.yaml
+++ b/lm_eval/tasks/longbench2/academic_multi_doc.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_multi
+task: longbench2_academic_multi
+dataset_name: academic_multi
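
Reviewer note: each per-task file is only four keys plus `include: _longbench_common_yaml`. The sketch below shows the merge this is assumed to produce: a shallow merge in which keys from the including file override the shared ones. `resolve_include` is a hypothetical helper written for illustration; the harness's own loader may behave differently in details.

```python
def resolve_include(base: dict, overrides: dict) -> dict:
    """Shallow-merge a task file onto the shared config; task keys win."""
    merged = dict(base)
    merged.update({k: v for k, v in overrides.items() if k != "include"})
    return merged


# A subset of the shared settings from _longbench_common_yaml.
common = {
    "dataset_path": "recursal/longbench-v2",
    "test_split": "train",
    "output_type": "multiple_choice",
    "doc_to_choice": ["A", "B", "C", "D"],
    "doc_to_target": "answer",
}

# Contents of academic_multi_doc.yaml.
task_file = {
    "include": "_longbench_common_yaml",
    "tag": ["longbench2", "longbench2_multi"],
    "task": "longbench2_academic_multi",
    "dataset_name": "academic_multi",
}

config = resolve_include(common, task_file)
print(sorted(config))
```

Under this reading, the only things that vary between the task files in this commit are `task`, `dataset_name`, and the second `tag` entry.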
--- a/lm_eval/tasks/longbench2/academic_single.yaml
+++ b/lm_eval/tasks/longbench2/academic_single.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_single
+task: longbench2_academic_single
+dataset_name: academic_single
--- a/lm_eval/tasks/longbench2/agent_history.yaml
+++ b/lm_eval/tasks/longbench2/agent_history.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_history
+task: longbench2_agent_history
+dataset_name: agent_history_qa
--- a/lm_eval/tasks/longbench2/detective.yaml
+++ b/lm_eval/tasks/longbench2/detective.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_single
+task: longbench2_detective
+dataset_name: detective
--- a/lm_eval/tasks/longbench2/dialogue_history.yaml
+++ b/lm_eval/tasks/longbench2/dialogue_history.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_history
+task: longbench2_dialogue_history
+dataset_name: dialogue_history_qa
--- a/lm_eval/tasks/longbench2/event_order.yaml
+++ b/lm_eval/tasks/longbench2/event_order.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_single
+task: longbench2_event_order
+dataset_name: event_ordering
--- a/lm_eval/tasks/longbench2/fin_multi_doc.yaml
+++ b/lm_eval/tasks/longbench2/fin_multi_doc.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_multi
+task: longbench2_fin_multi
+dataset_name: financial_multi
--- a/lm_eval/tasks/longbench2/fin_single_doc.yaml
+++ b/lm_eval/tasks/longbench2/fin_single_doc.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_single
+task: longbench2_fin_single
+dataset_name: financial_single
--- a/lm_eval/tasks/longbench2/govt_multi_doc.yaml
+++ b/lm_eval/tasks/longbench2/govt_multi_doc.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_multi
+task: longbench2_govt_multi
+dataset_name: government_multi
--- a/lm_eval/tasks/longbench2/govt_single_doc.yaml
+++ b/lm_eval/tasks/longbench2/govt_single_doc.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_single
+task: longbench2_govt_single
+dataset_name: government_single
--- a/lm_eval/tasks/longbench2/graph.yaml
+++ b/lm_eval/tasks/longbench2/graph.yaml
+include: _longbench_common_yaml
+tag:
+  - longbench2
+  - longbench2_structured
+task: longbench2_graph
+dataset_name: graph_reasoning