Unverified commit 655718d0 authored by Janna, committed by GitHub

Longbench v2 (#3338)



* initial commit

* change to acc

* fix long-dialogue tasks

* fix versioning

* more fixes

* fix naming

* fix naming

* more renaming

* maybe a dataset fix

* fix dataset and use new dataset schema

* add README

* fix prompt and dataset naming

* lint

* remove utils.py

* lint

* more linting

* fix typo

* fix naming

* add longbenchv2

---------
Co-authored-by: Baber <baber@hey.com>
parent 8efef8f1
@@ -6,9 +6,9 @@ For more information, including a full list of task names and their precise mean
provided to the individual README.md files for each subfolder.

| Task Family | Description | Language(s) |
|-------------|-------------|-------------|
| [eq-bench_es](eq_bench/README.md) | Spanish version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_es) | Spanish **Human Translated** |
| [eq-bench_ca](eq_bench/README.md) | Catalan version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_ca) | Catalan **Human Translated** |
| [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
| [acp_bench](acpbench/README.md) | Tasks evaluating reasoning about Action, Change, and Planning. | English |
| [acp_bench_hard](acpbench/README.md) | Tasks evaluating reasoning about Action, Change, and Planning. | English |

@@ -103,6 +103,7 @@ provided to the individual README.md files for each subfolder.
| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [longbench](longbench/README.md) | LongBench evaluates language models' ability to understand lengthy texts across multiple tasks and languages. | English, Chinese |
| [longbenchv2](longbench/README.md) | LongBench v2: challenging multiple-choice questions requiring deep understanding and reasoning over long contexts (8k to 2M words). | English, Chinese |
| [mastermind](mastermind/README.md) | Reasoning benchmark based on the board game Mastermind. | English |
| [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| [mbpp](mbpp/README.md) | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |

@@ -124,7 +125,7 @@ provided to the individual README.md files for each subfolder.
| [mmlu_redux](mmlu-redux-spanish/README.md) | Refined Massive Multitask Language Understanding benchmark for broad domain evaluation with improved data quality. | Spanish |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu-pro-plus](mmlu-pro-plus/README.md) | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |

@@ -194,6 +195,6 @@ provided to the individual README.md files for each subfolder.
## Multimodal Tasks

| Task Family | Description | Modality |
|-------------|-------------|----------|
| [chartqa](chartqa/README.md) | A benchmark for question answering about charts that requires both visual and logical reasoning. | Image, Text |
| [mmmu](mmmu/README.md) | Evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge. | Image, Text |
# LongBench v2
### Paper
Title: `LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks`
Abstract: `This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.`
Homepage: `https://github.com/THUDM/LongBench`
### Citation
```
@article{bai2024longbench2,
title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks},
author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
journal={arXiv preprint arXiv:2412.15204},
year={2024}
}
```
### Groups, Tags, and Tasks
#### Groups
* `longbench2_single`: Single-document QA tasks requiring comprehension of documents across various domains (government, legal, literature, finance, academic, detective stories, and order of events)
* `longbench2_multi`: Multi-document QA tasks requiring information synthesis and reasoning across multiple documents in government, academic, finance, and news
* `longbench2_incontext`: Long in-context learning tasks including user guide comprehension, translation with examples, and many-shot learning scenarios
* `longbench2_history`: Long-dialogue history understanding tasks involving agent conversations and dialogue history comprehension
* `longbench2_structured`: Long structured data understanding tasks for graph and table data processing
#### Tags
* `longbench2`: Run the full benchmark with 503 multiple-choice questions (8k to 2M words) testing understanding and reasoning on long-context tasks (a minimal usage sketch follows the task list below)
#### Tasks
**Single-Document QA:**
* `longbench2_govt_single`: Question answering from single government documents
* `longbench2_legal_single`: Question answering from single legal documents
* `longbench2_lit_single`: Question answering from single literary documents
* `longbench2_fin_single`: Question answering from single financial documents
* `longbench2_academic_single`: Question answering from single academic papers and research documents
* `longbench2_detective`: Question answering from detective stories requiring logical reasoning
* `longbench2_event_order`: Temporal reasoning tasks about event ordering in narratives
**Multi-Document QA:**
* `longbench2_govt_multi`: Question answering across multiple government documents
* `longbench2_academic_multi`: Question answering across multiple academic papers
* `longbench2_fin_multi`: Question answering across multiple financial documents
* `longbench2_news_multi`: Question answering across multiple news articles
**Long In-context Learning:**
* `longbench2_user_guide`: Comprehension and application of user guide instructions
* `longbench2_translate`: Translation tasks in new languages with long examples
* `longbench2_many_shot`: Many-shot in-context learning with a large number of examples provided in the prompt
**Long-dialogue History Understanding:**
* `longbench2_agent_history`: Understanding and reasoning over extended agent conversation histories
* `longbench2_dialogue_history`: Understanding and reasoning over long dialogue exchanges
**Code Repository Understanding:**
* `longbench2_code`: Question answering on code repositories requiring codebase comprehension
**Long Structured Data Understanding:**
* `longbench2_graph`: Understanding and reasoning over graph-structured data
* `longbench2_table`: Understanding and reasoning over tabular data
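
The groups, tag, and subtasks above are registered like any other harness task, so a run can target the `longbench2` tag, one of the groups, or an individual subtask. Below is a minimal sketch assuming the standard `lm_eval.simple_evaluate` Python API (the equivalent CLI invocation would pass `--tasks longbench2`); the model checkpoint name is a placeholder, and because contexts range from 8k to 2M words the model needs a correspondingly large context window.

```python
# Minimal usage sketch (not part of this PR): evaluate a Hugging Face model
# on LongBench v2 via the harness Python API. The checkpoint is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    tasks=["longbench2"],            # the tag: all 503 multiple-choice questions
    # tasks=["longbench2_single"],   # or target a single group/subtask instead
    batch_size=1,
)
print(results["results"])
```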
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?

group: longbench2
task:
  - longbench2_history
  - longbench2_incontext
  - longbench2_multi
  - longbench2_single
  - longbench2_structured
  - longbench2_code
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
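
In these group configs, `weight_by_size: True` asks the harness to micro-average: the group-level `acc` weights each subtask's accuracy by its number of documents instead of averaging the subtask scores uniformly. The sketch below only illustrates that aggregation; the subtask accuracies and sizes are invented, and the harness computes this internally.

```python
# Illustrative only: size-weighted (micro-averaged) accuracy across subtasks,
# mirroring what `weight_by_size: True` requests from the group aggregator.
# The accuracies and document counts below are made up for the example.
subtask_results = {
    "longbench2_graph": {"acc": 0.42, "size": 25},
    "longbench2_table": {"acc": 0.36, "size": 40},
}

total_docs = sum(r["size"] for r in subtask_results.values())
weighted_acc = sum(r["acc"] * r["size"] for r in subtask_results.values()) / total_docs
print(f"group acc (weighted by size): {weighted_acc:.4f}")  # -> 0.3831
```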

group: longbench2_history
group_alias: "Long-dialogue History Understanding"
task:
  - longbench2_agent_history
  - longbench2_dialogue_history
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0

group: longbench2_incontext
group_alias: "Long In-context Learning"
task:
  - longbench2_user_guide
  - longbench2_translate
  - longbench2_many_shot
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0

group: longbench2_multi
group_alias: "Multi-Document QA"
task:
  - longbench2_govt_multi
  - longbench2_academic_multi
  - longbench2_fin_multi
  - longbench2_news_multi
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0

group: longbench2_single
group_alias: "Single-Document QA"
task:
  - longbench2_govt_single
  - longbench2_legal_single
  - longbench2_lit_single
  - longbench2_fin_single
  - longbench2_event_order
  - longbench2_academic_single
  - longbench2_detective
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0

group: longbench2_structured
group_alias: "Long Structured Data Understanding"
task:
  - longbench2_graph
  - longbench2_table
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0

dataset_path: recursal/longbench-v2
test_split: train
output_type: multiple_choice
doc_to_text: "Please read the following text and answer the question below.\n\n<text>\n{{context}}\n</text>\n\nWhat is the correct answer to this question: {{question.strip()}}\nChoices:\n(A) {{choices[0]}}\n(B) {{choices[1]}}\n(C) {{choices[2]}}\n(D) {{choices[3]}}\n\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
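
The `doc_to_text` field above is a Jinja template over the dataset columns (`context`, `question`, `choices`), `doc_to_choice` fixes the four answer letters, and `doc_to_target` names the gold `answer` column. A small sketch of how one row renders into the final multiple-choice prompt; the example row is invented, and the harness performs this rendering itself.

```python
# Sketch of how the doc_to_text template turns one dataset row into a prompt.
# The example row is fabricated; real rows come from recursal/longbench-v2.
from jinja2 import Template

DOC_TO_TEXT = (
    "Please read the following text and answer the question below.\n\n"
    "<text>\n{{context}}\n</text>\n\n"
    "What is the correct answer to this question: {{question.strip()}}\n"
    "Choices:\n(A) {{choices[0]}}\n(B) {{choices[1]}}\n(C) {{choices[2]}}\n(D) {{choices[3]}}\n\n"
    "Answer:"
)

doc = {
    "context": "(a long document of 8k-2M words would go here)",
    "question": " Which department issued the report? ",
    "choices": ["Treasury", "Defense", "Education", "Energy"],
    "answer": "A",
}

prompt = Template(DOC_TO_TEXT).render(**doc)
print(prompt)
# The model then scores the continuations "A"-"D" (doc_to_choice) and
# accuracy is computed against doc["answer"] (doc_to_target).
```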

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_multi
task: longbench2_academic_multi
dataset_name: academic_multi

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_single
task: longbench2_academic_single
dataset_name: academic_single

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_history
task: longbench2_agent_history
dataset_name: agent_history_qa

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_single
task: longbench2_detective
dataset_name: detective

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_history
task: longbench2_dialogue_history
dataset_name: dialogue_history_qa

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_single
task: longbench2_event_order
dataset_name: event_ordering

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_multi
task: longbench2_fin_multi
dataset_name: financial_multi

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_single
task: longbench2_fin_single
dataset_name: financial_single

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_multi
task: longbench2_govt_multi
dataset_name: government_multi

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_single
task: longbench2_govt_single
dataset_name: government_single

include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_structured
task: longbench2_graph
dataset_name: graph_reasoning