[InfiniteBench](https://github.com/OpenBMB/InfiniteBench) is composed of 12 major tasks.
## Create Hugging Face dataset
The processed Hugging Face dataset for InfiniteBench can be found [here](https://huggingface.co/datasets/MaxJeblick/InfiniteBench). To reproduce this dataset, simply run the `create_huggingface_dataset.py` script.
| Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description |
| --- | --- | --- | --- | --- | --- |
| code_debug | Code Document | 394 | 114.7k | 4.8 | Finding which function in a code repo contains a crashing error (in multiple choice form). |
| math_find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. |
| longbook_qa_eng | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. |
| longdialogue_qa_eng | Script | 200 | 103.6k | 3.4 | Identification of talkers in partially anonymized scripts. |
| longbook_choice_eng | Fake Book | 229 | 184.4k | 5.3 | Multiple choice questions derived from the fake book. |
"""
"""
Examples:
passkey:
context: "There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there. The pass key is 71432. Remember it. 71432 is the pass key. The grass is green. The sky is blue."
question: "What is the pass key?"
answer: ["71432"]
kv_retrieval:
context: "Extract the value corresponding to the specified key in the JSON object below. JSON data: {"e6aa4656-0eb5-4e1d-ad33-1ded282e0a78" ..."
question: "Key: "ce06788c-71b4-4f7a-b196-0fd9965a59c5" The value associated with the specified key is:"
answer: ["00a00042-6bcb-494f-9c35-57180f1e7251"]
number_string:
context: "There is an important info hidden inside a lot of irrelevant text. Find it. I will quiz you about the important information there. The sequence of digits is 2200012222. Remember it. 2200012222 is the sequence of digits. The grass is green. The sky is blue."
question: "What is the sequence number?"
answer: ["2200012222"]
longdialogue_qa_eng:
context: "Below is a dialogue script where one random occurrence of a character name is replaced with "$$MASK$$", and you should try to guess who that character is. The dialogue: --- BEASTS OF THE SOUTHERN WILD Written by Lucy Alibar & Benh Zeitlin FINAL DRAFT: Based on the stage play "Juicy and Delicious"
question: "Which character is $$MASK$$ ?"
answer: [ "ACE", "ACE ROTHSTEIN" ]
longbook_qa_eng:
context: "Read the book below and answer a question. ‘Yes, of course, if it’s fine to-morrow,’ said Mrs Bronwyn. ‘But you’ll have to be up with the lark,’ she added. "
question: "Which among Annalisa, Seb, Peyton, and Gannonmarie is not Mrs. Bronwyn's child?"
answer: [ "\"Peyton\"" ]
longbook_choice_eng:
context: "Read the book and answer the question. With a single drop of ink for a mirror, the Egyptian sorcerer undertakes to reveal to any chance comer far-reaching visions of the past. This is what I undertake to do for you, reader. With this drop of ink at the end of my pen, I will show you the roomy workshop "
question: "Which of the following is NOT one of Alain's chores at Hall Farm? Only one of the following options is correct, tell me the answer using one single letter (A, B, C, or D). Don't say anything else. A. Walking Georgie B. Taking care of Totty C. Working in the dairy D. Light housework"
"passkey":"There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\n{context}\n\n",
"kv_retrieval":"Extract the value corresponding to the specified key in the JSON object below.\n\n{context}\n\n",
"number_string":"There is an important info hidden inside a lot of irrelevant text. Find it. I will quiz you about the important information there.\n\n{context}\n\n",
"longdialogue_qa_eng":'Below is a dialogue script where one random occurrence of a character name is replaced with "$$MASK$$", and you should try to guess who that character is.\n\n{context}\n\n',
"longbook_qa_eng":"Read the book below and answer a question. Be very concise in your answer.\n\n{context}\n\n",
"code_run":"There is a function called {func} in the following Python code.\n\n{context}\n\n",
}
question_template = {
    "longbook_choice_eng": "\n\nOnly one of the following options is correct, tell me the answer using one single letter (A, B, C, or D). Don't say anything else.\nA. {OPTION_A}\nB. {OPTION_B}\nC. {OPTION_C}\nD. {OPTION_D}",
}
## Create Hugging Face dataset
The processed Hugging Face dataset for LongBench can be found [here](https://huggingface.co/datasets/Xnhyacinth/LongBench). To reproduce this dataset, simply run the `create_huggingface_dataset.py` script.
"narrativeqa":"You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {context}\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\n",
"qasper":'You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\n',
"multifieldqa_en":"Read the following text and answer briefly.\n\n{context}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\n",
"hotpotqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\n",
"2wikimqa":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\n",
"musique":"Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\n",
"gov_report":"You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{context}\n\n",
"qmsum":"You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n{context}\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\n",
"multi_news":"You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{context}\n\n",
"trec":"Please determine the type of the question below. Here are some examples of questions.\n\n{context}\n",
"triviaqa":"Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{context}\n\n",
"samsum":"Summarize the dialogue into a few short sentences. The following are some examples.\n\n{context}\n\n",
"lsht":"请判断给定新闻的类别,下面是一些例子。\n\n{context}\n",
"passage_count":"There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{context}\n\n",
"passage_retrieval_en":"Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{context}\n\nThe following is an abstract.\n\n",
"lcc":"Please complete the code given below. \n{context}",
"repobench-p":"Please complete the code given below. \n{context}",
}
question_template = {
    "narrativeqa": "Question: {input}\n\n",
    "qasper": "Question: {input}\n\n",
    "multifieldqa_en": "Question: {input}\n",
    "multifieldqa_zh": "问题:{input}\n",
    "hotpotqa": "Question: {input}\n",
    "2wikimqa": "Question: {input}\n",
    "musique": "Question: {input}\n",
    "dureader": "问题:{input}\n",
    "gov_report": "Now, write a one-page summary of the report.\n\n",
    "qmsum": "Query: {input}\n",
    "multi_news": "Now, write a one-page summary of all the news.\n\n",
    "vcsum": "",
    "trec": "{input}",
    "triviaqa": "{input}",
    "samsum": "{input}",
    "lsht": "{input}",
    "passage_count": "Please enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\n",
    "passage_retrieval_en": '{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like "Paragraph 1", "Paragraph 2", etc.\n\n',
}
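A practical consequence of this context/question split is that one long formatted context can be shared across several questions, with only the short question suffix changing. A sketch using the `hotpotqa` templates above (reproduced here so the snippet is self-contained; `split_prompts` is an illustrative helper, not a function from the script):

```python
# Sketch: format the long context once and pair it with several question
# suffixes, using the LongBench hotpotqa templates shown above.
context_template = {
    "hotpotqa": (
        "Answer the question based on the given passages. Only give me the "
        "answer and do not output any other words.\n\nThe following are given "
        "passages.\n{context}\n\nAnswer the question based on the given "
        "passages. Only give me the answer and do not output any other "
        "words.\n\n"
    ),
}
question_template = {"hotpotqa": "Question: {input}\n"}


def split_prompts(task: str, context: str, questions: list[str]):
    """Return the shared context prefix and one question suffix per question."""
    prefix = context_template[task].format(context=context)
    return prefix, [question_template[task].format(input=q) for q in questions]


prefix, suffixes = split_prompts(
    "hotpotqa", "Passage 1: ...", ["Who wrote X?", "Where is Y?"]
)
full_prompts = [prefix + s for s in suffixes]
```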
## Create Hugging Face dataset
The processed Hugging Face dataset for LongBench-v2 can be found [here](https://huggingface.co/datasets/simonjegou/LongBench-v2). To reproduce this dataset, simply run the `create_huggingface_dataset.py` script.
[LooGLE](https://github.com/bigai-nlco/LooGLE/tree/main) is composed of 7 major tasks to evaluate LLMs' ability to understand both short and long dependency content.
## Create Hugging Face dataset
The Hugging Face dataset for LooGLE can be found [here](https://huggingface.co/datasets/simonjegou/loogle). To reproduce this dataset, simply run the `create_huggingface_dataset.py` script.
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import json

import pandas as pd
from datasets import Dataset, load_dataset
# Templates based on https://github.com/bigai-nlco/LooGLE/blob/main/config/task2prompt.json
context_prompt = {
    "shortdep_qa": "Please answer the question based on the long texts below. \n{input}",
    "longdep_qa": "Please answer the question based on the long texts below. \n{input}",
    "shortdep_cloze": "Please fill in the clozes based on the given long texts below. Each of the placeholder '<mask-n>' in the question could be an entity of Person, Location or Organization. The same masks represent the same entity. Output a json format answer, for example: {{'<mask-0>': 'Bob', '<mask-1>': 'Gorrosion Magazine', '<mask-2>': 'Bethel Horizon'}}\n{input}",  # noqa
    "longdep_summarization": "Please generate a summary of the below paper. \n{input}",
}
question_prompt = {
    "shortdep_qa": "\nQuestion: {Q}\n",
    "longdep_qa": "\nQuestion: {Q}\n",
    "shortdep_cloze": "\n Question: {Q} What are the masked entities?\n",
}
Adapted from https://huggingface.co/datasets/HuggingFaceH4/MATH-500
This dataset contains the subset of 500 problems from the MATH benchmark that OpenAI created for their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits
This benchmark evaluates a model's ability to retrieve a specific piece of information, the "needle," hidden within a large body of text, the "haystack." The test challenges a model's long-context understanding and its ability to maintain information accuracy over increasing document lengths.
We follow the vast majority of the literature and use [Paul Graham's essays](https://huggingface.co/datasets/alessiodevoto/paul_graham_essays) as the haystack.
> The default needle is a sentence defined in the dataset itself, but it can be replaced by a custom sentence (e.g. for the passkey retrieval or similar tests). To do that, check [utils.py](./utils.py).
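The mechanics of the test can be sketched as follows; `insert_needle` is an illustrative helper under the assumptions above, not the exact implementation in [utils.py](./utils.py):

```python
# Sketch of the needle-in-a-haystack setup: hide a needle sentence at a
# chosen relative depth of the haystack, then quiz the model about it.
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end)."""
    sentences = haystack.split(". ")
    position = int(len(sentences) * depth)  # sentence index at the target depth
    sentences.insert(position, needle.rstrip(". "))
    return ". ".join(sentences)


haystack = "The grass is green. The sky is blue. The sun is yellow"
needle = "The pass key is 71432."
text = insert_needle(haystack, needle, depth=0.5)
```

Sweeping `depth` from 0.0 to 1.0 over increasing haystack lengths yields the familiar retrieval-accuracy heatmap over (context length, needle depth).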