Unverified commit ccfa4ad1 authored by Janna, committed by GitHub

Add BabiLong (#3287)

* create babilong tasks

* lint

* add clarification

* fix typo

* add babilong description
parent fec9dde7
@@ -22,6 +22,7 @@ provided to the individual README.md files for each subfolder.
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English |
| [babilong](babilong/README.md) | Tasks designed to test whether models can find and reason over facts in long contexts. | English |
| [basque_bench](basque_bench/README.md) | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque |
| [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German |
@@ -29,7 +30,7 @@ provided to the individual README.md files for each subfolder.
| [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
| benchmarks | General benchmarking tasks that test a wide range of language understanding capabilities. | |
| [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
| [bhs](bhs/README.md) | Grammatical knowledge evaluation for low-resource languages. | Basque, Hindi, Swahili |
| [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English |
| [blimp_nl](blimp_nl/README.md) | A benchmark evaluating language models' grammatical capabilities in Dutch based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Dutch |
...
# BABILong

### Paper

Title: BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Abstract: https://arxiv.org/abs/2406.10149

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers after fine-tuning, enabling the processing of lengths up to 50 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 10 million token lengths.

Homepage: https://github.com/booydar/babilong
### Citation
```
@article{kuratov2024babilong,
  title={BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack},
author={Kuratov, Yuri and Bulatov, Aydar and Anokhin, Petr and Rodkin, Ivan and Sorokin, Dmitry and Burtsev, Mikhail},
journal={arXiv preprint arXiv:2406.10149},
year={2024}
}
```
### Groups and Tasks
#### Groups
* `babilong`: all BABILong tasks (qa1-qa20) at the default 0k context length
* `babilong_longctx`: BABILong tasks qa1-qa5 at context lengths up to 128k
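As a rough illustration (not part of this task's files), a minimal sketch of selecting these groups through the harness's Python API; the `hf` model type and checkpoint name below are placeholders only:

```python
# Minimal sketch: evaluate a BABILong group with the lm-evaluation-harness
# Python API. The model type and checkpoint are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["babilong"],  # or ["babilong_longctx"] for the qa1-qa5 long-context group
)
print(results["results"])
```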
#### Tasks
The benchmark includes 1000 samples per context length for each of 20 reasoning tasks:

**QA Tasks (qa1-qa20):**
* `babilong_qa1`: Single supporting fact QA
* `babilong_qa2`: Two supporting facts QA
* `babilong_qa3`: Three supporting facts QA
* `babilong_qa4`: Two argument relations
* `babilong_qa5`: Three argument relations
* `babilong_qa6`: Yes/No questions
* `babilong_qa7`: Counting
* `babilong_qa8`: Lists and sets
* `babilong_qa9`: Simple negation
* `babilong_qa10`: Indefinite knowledge
* `babilong_qa11`: Track person through temporal references
* `babilong_qa12`: Conjunction
* `babilong_qa13`: Compound coreference
* `babilong_qa14`: Time reasoning
* `babilong_qa15`: Basic deduction
* `babilong_qa16`: Basic induction
* `babilong_qa17`: Positional reasoning
* `babilong_qa18`: Size reasoning
* `babilong_qa19`: Path finding
* `babilong_qa20`: Motivation deduction
> [!NOTE]
> When using the BABILong tasks, please note:
> 1. This implementation uses the dataset with 1000 samples per length. To use the dataset with 100 samples per length, which supports context lengths of up to 10M tokens, change the dataset path to `RMT-team/babilong` in `common_utils.py`.
> 2. Tasks qa1-qa5 support lengths of 0k, 1k, 2k, 4k, 8k, 16k, 32k, 64k, and 128k tokens. Tasks qa6-qa20 are only available at 0k.
> 3. The default maximum sequence length is 0k. To compute metrics at other maximum sequence lengths, specify them with the metadata parameter, e.g.
> `--metadata '{"max_seq_lengths":"0k,1k,2k,4k,8k,16k,32k,128k"}'`. The config currently takes only one context length at a time. The metadata parameter can also be passed to the `TaskManager` (`metadata: dict`), as sketched below.
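For example, a minimal sketch of point 3 using the Python API instead of the CLI flag; the checkpoint is a placeholder and the metadata string mirrors the `--metadata` example above:

```python
# Sketch: pass the length metadata to the TaskManager programmatically,
# as described in the note above. The checkpoint name is a placeholder.
import lm_eval
from lm_eval.tasks import TaskManager

task_manager = TaskManager(metadata={"max_seq_lengths": "0k,1k,2k,4k"})
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["babilong_qa1"],
    task_manager=task_manager,
)
print(results["results"])
```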
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
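# Shared defaults for the BABILong tasks; the per-task configs below pull these in via `include: _babilong_common_yaml`.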
dataset_path: RMT-team/babilong-1k-samples
output_type: generate_until
doc_to_target: "{{target}}"
target_delimiter: " "
num_fewshot: 2
process_results: !function common_utils.process_results
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
generation_kwargs:
do_sample: false
temperature: 0.0
max_gen_toks: 16
until: []
metadata:
version: 0.0
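# Group config: aggregates all 20 BABILong tasks at the default 0k context length.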
group: babilong
task:
- babilong_qa1
- babilong_qa2
- babilong_qa3
- babilong_qa4
- babilong_qa5
- babilong_qa6
- babilong_qa7
- babilong_qa8
- babilong_qa9
- babilong_qa10
- babilong_qa11
- babilong_qa12
- babilong_qa13
- babilong_qa14
- babilong_qa15
- babilong_qa16
- babilong_qa17
- babilong_qa18
- babilong_qa19
- babilong_qa20
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0.0
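# Group config: long-context group covering tasks qa1-qa5 (context lengths up to 128k).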
group: babilong_longctx
task:
- babilong_qa1
- babilong_qa2
- babilong_qa3
- babilong_qa4
- babilong_qa5
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0.0
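# Per-task configs follow; each includes the shared defaults above and sets its own prompt, split, and few-shot examples.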
include: _babilong_common_yaml
task: babilong_qa1
test_split: qa1
custom_dataset: !function common_utils.load_dataset
dataset_kwargs:
qa_split: qa1
description: "I will give you context with the facts about positions of different persons hidden in some random text and a question. You need to answer the question based only on the information from the facts. If a person was in different locations, use the latest location to answer the question.\nAlways return your answer in the following format:\nThe most recent location of 'person' is 'location'. Do not write anything else after that.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Charlie went to the hallway. Judith come back to the kitchen. Charlie travelled to balcony."
question: "Where is Charlie?"
target: "The most recent location of Charlie is balcony."
- input: "Alan moved to the garage. Charlie went to the beach. Alan went to the shop. Rouse travelled to balcony."
question: "Where is Alan?"
target: "The most recent location of Alan is shop."
include: _babilong_common_yaml
task: babilong_qa10
test_split: qa10
custom_dataset: !function common_utils.load_dataset
dataset_kwargs:
qa_split: qa10
description: "I will give you context with the facts about people and their locations hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nIf a person was in different locations, use the latest location the person was in to answer the question.\nYour answer should contain only one word - $yes$ or $no$ or $maybe$. Do not write anything else. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Bill is in the kitchen. Julie is either in the school or the cinema."
question: "Is Bill in the bedroom?"
target: "no"
- input: "Fred is in the bedroom. Mary is either in the school or the cinema."
question: "Is Mary in the school?"
target: "maybe"
- input: "Fred is either in the kitchen or the park. Bill moved to the cinema."
question: "Is Bill in the cinema?"
target: "yes"
include: _babilong_common_yaml
task: babilong_qa11
test_split: qa11
dataset_name: 0k
description: "I will give you context with the facts about people and their locations hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nIf a person was in different locations, use the latest location the person was in to answer the question.\nYour answer should contain only one word - location. Do not write anything else after that. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Daniel journeyed to the hallway. After that he journeyed to the garden."
question: "Where is Daniel?"
target: "garden"
- input: "Mary moved to the office. Afterwards she journeyed to the kitchen. Daniel went to the hallway. Then he journeyed to the garden."
question: "Where is Mary?"
target: "kitchen"
- input: "Sandra moved to the kitchen. After that she went back to the hallway. Sandra moved to the bedroom. Then she went to the hallway. Mary moved to the bedroom. Afterwards she travelled to the bathroom."
question: "Where is Sandra?"
target: "hallway"
include: _babilong_common_yaml
task: babilong_qa12
test_split: qa12
dataset_name: 0k
description: "I will give you context with the facts about people and their locations hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nIf a person was in different locations, use the latest location the person was in to answer the question.\nYour answer should contain only one word - location. Do not write anything else after that. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Mary and Daniel travelled to the bathroom. John and Daniel travelled to the office."
question: "Where is Daniel?"
target: "office"
- input: "Sandra and Mary went back to the office. Daniel and Sandra went to the bedroom. Sandra and Mary travelled to the hallway. John and Mary went to the kitchen."
question: "Where is Mary?"
target: "kitchen"
- input: "Daniel and Sandra went back to the hallway. Daniel and John moved to the office. Daniel and John moved to the garden. Daniel and Mary went back to the bathroom. Daniel and John went back to the kitchen. Daniel and Sandra went to the bathroom."
question: "Where is John?"
target: "kitchen"
include: _babilong_common_yaml
task: babilong_qa13
test_split: qa13
dataset_name: 0k
description: "I will give you context with the facts about people and their locations hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nIf a person was in different locations, use the latest location the person was in to answer the question.\nYour answer should contain only one word - location. Do not write anything else after that. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Mary and Daniel travelled to the bathroom. Then they journeyed to the hallway."
question: "Where is Daniel?"
target: "hallway"
- input: "Daniel and Sandra travelled to the kitchen. After that they journeyed to the hallway. Mary and Daniel travelled to the bedroom. After that they travelled to the hallway."
question: "Where is Sandra?"
target: "hallway"
- input: "John and Mary moved to the bathroom. Then they travelled to the office. John and Mary went to the kitchen. Afterwards they went to the bedroom. John and Sandra moved to the bathroom. Following that they went back to the kitchen."
question: "Where is Mary?"
target: "bedroom"
include: _babilong_common_yaml
task: babilong_qa14
test_split: qa14
dataset_name: 0k
description: "I will give you context with the facts about people and their locations hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nIf a person was in different locations, use the latest location the person was in to answer the question.\nYour answer should contain only one word - location. Do not write anything else after that. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Bill went back to the cinema yesterday. Julie went to the school this morning. Fred went to the park yesterday. Yesterday Julie went to the office."
question: "Where was Julie before the school?"
target: "office"
- input: "This morning Fred went to the kitchen. Fred journeyed to the bedroom yesterday. Mary travelled to the bedroom this morning. Yesterday Mary went to the cinema."
question: "Where was Mary before the bedroom?"
target: "cinema"
- input: "Yesterday Julie went back to the park. Julie went to the bedroom this morning. Bill journeyed to the cinema yesterday. This morning Bill went back to the park. This evening Julie went to the school. This afternoon Julie went back to the park."
question: "Where was Julie before the bedroom?"
target: "park"
include: _babilong_common_yaml
task: babilong_qa15
test_split: qa15
dataset_name: 0k
description: "I will give you context with the facts about animals, their names and relations. The facts and a question are hidden in some random text. You need to answer the question based only on the information from the facts.\nYour answer should contain only one word - an animal species. Do not write anything else after that. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Mice are afraid of wolves. Gertrude is a mouse. Cats are afraid of sheep. Winona is a mouse. Sheep are afraid of wolves. Emily is a mouse. Jessica is a wolf."
question: "What is gertrude afraid of?"
target: "wolf"
- input: "Mice are afraid of wolves. Gertrude is a mouse. Cats are afraid of sheep. Winona is a mouse. Sheep are afraid of wolves. Emily is a mouse. Jessica is a wolf."
question: "What is jessica afraid of?"
target: "cat"
- input: "Mice are afraid of cats. Wolves are afraid of sheep. Emily is a wolf. Cats are afraid of sheep. Gertrude is a wolf. Sheep are afraid of cats. Winona is a wolf."
question: "What is emily afraid of?"
target: "sheep"
include: _babilong_common_yaml
task: babilong_qa16
test_split: qa16
dataset_name: 0k
description: "I will give you context with the facts about animals, their names and colors. The facts and a question are hidden in some random text. You need to answer the question based only on the information from the facts.\nYour answer should contain only one word - a color. Do not write anything else after that.\nDo not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Lily is a frog. Bernhard is a frog. Bernhard is green. Brian is a lion. Brian is white. Julius is a swan. Julius is green. Lily is green. Greg is a swan."
question: "What color is Greg?"
target: "green"
- input: "Julius is a lion. Lily is a rhino. Bernhard is a swan. Lily is white. Bernhard is green. Greg is a rhino. Greg is gray. Julius is white. Brian is a lion."
question: "What color is Brian?"
target: "white"
- input: "Brian is a rhino. Julius is a lion. Bernhard is a lion. Greg is a swan. Brian is gray. Greg is white. Lily is a rhino. Bernhard is yellow. Lily is gray."
question: "What color is Julius?"
target: "yellow"
include: _babilong_common_yaml
task: babilong_qa17
test_split: qa17
dataset_name: 0k
description: "I will give you context with the facts about different figures, their location and colors, hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nYour answer should contain only one word - $yes$ or $no$. Do not write anything else.\nDo not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "The triangle is above the pink rectangle. The blue square is to the left of the triangle."
question: "Is the pink rectangle to the right of the blue square?"
target: "yes"
- input: "The red sphere is to the left of the yellow square. The red sphere is below the pink rectangle."
question: "Is the pink rectangle to the left of the yellow square?"
target: "yes"
- input: "The red sphere is above the pink rectangle. The red sphere is to the right of the red square."
question: "Is the pink rectangle above the red square?"
target: "no"
include: _babilong_common_yaml
task: babilong_qa18
test_split: qa18
dataset_name: 0k
description: "I will give you context with the facts about different objects and their sizes, hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nYour answer should contain only one word - $yes$ or $no$. Do not write anything else.\nDo not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "The box of chocolates fits inside the chest. The box is bigger than the chest. The box is bigger than the suitcase. The suitcase fits inside the box. The container is bigger than the box of chocolates."
question: "Does the box fit in the box of chocolates?"
target: "no"
- input: "The suitcase is bigger than the container. The container fits inside the box. The chest is bigger than the chocolate. The suitcase fits inside the box. The chest fits inside the box."
question: "Does the chocolate fit in the box?"
target: "yes"
- input: "The chocolate fits inside the box of chocolates. The suitcase fits inside the box. The chocolate fits inside the box. The box is bigger than the box of chocolates. The suitcase is bigger than the box of chocolates."
question: "Is the chocolate bigger than the box?"
target: "no"
include: _babilong_common_yaml
task: babilong_qa19
test_split: qa19
dataset_name: 0k
description: "I will give you context with the facts about different places and their locations, hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nYour answer should contain only two letters, separated by a comma - ordinal directions. You can choose the letters from $n$, $s$, $e$ and $w$. Do not write anything else after that.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "The office is east of the hallway. The kitchen is north of the office. The garden is west of the bedroom. The office is west of the garden. The bathroom is north of the garden."
question: "How do you go from the kitchen to the garden?"
target: "s,e"
- input: "The bedroom is west of the hallway. The office is east of the garden. The garden is north of the kitchen. The kitchen is north of the bathroom. The hallway is west of the garden."
question: "How do you go from the kitchen to the hallway?"
target: "n,w"
- input: "The bedroom is south of the hallway. The bathroom is east of the office. The kitchen is west of the garden. The garden is south of the office. The office is south of the bedroom."
question: "How do you go from the garden to the bedroom?"
target: "n,n"
include: _babilong_common_yaml
task: babilong_qa2
test_split: qa2
custom_dataset: !function common_utils.load_dataset
dataset_kwargs:
qa_split: qa2
description: "I will give you context with the facts about locations and actions of different persons hidden in some random text and a question. You need to answer the question based only on the information from the facts. If a person got an item in the first location and travelled to the second location the item is also in the second location. If a person dropped an item in the first location and moved to the second location the item remains in the first location.\nAlways return your answer in the following format:\nThe 'item' is in 'location'. Do not write anything else after that.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Charlie went to the kitchen. Charlie got a bottle. Charlie moved to the balcony."
question: "Where is the bottle?"
target: "The bottle is in the balcony."
- input: "Alan moved to the garage. Alan got a screw driver. Alan moved to the kitchen."
question: "Where is the screw driver?"
target: "The screw driver is in the kitchen."
include: _babilong_common_yaml
task: babilong_qa20
test_split: qa20
dataset_name: 0k
description: "I will give you context with the facts about people, their locations and condition hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nIf a person was in different locations, use the latest location the person was in to answer the question.\nYour answer should contain only one word - a person condition or a place. Do not write anything else after that. Do not explain your answer.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "Sumit is tired."
question: "Where will sumit go?"
target: "bedroom"
- input: "Yann is hungry. Yann journeyed to the kitchen."
question: "Why did yann go to the kitchen?"
target: "hungry"
- input: "Antoine is thirsty. Yann is tired. Yann went back to the bedroom. Yann picked up the pajamas there. Jason is thirsty. Antoine went back to the kitchen."
question: "Why did antoine go to the kitchen?"
target: "thirsty"
include: _babilong_common_yaml
task: babilong_qa3
test_split: qa3
custom_dataset: !function common_utils.load_dataset
dataset_kwargs:
qa_split: qa3
description: "I give you context with the facts about locations and actions of different persons hidden in some random text and a question. You need to answer the question based only on the information from the facts. If a person got an item in the first location and travelled to the second location the item is also in the second location. If a person dropped an item in the first location and moved to the second location the item remains in the first location.\nAlways return your answer in the following format:\nBefore the $location_1$ the $item$ was in the $location_2$. Do not write anything else after that.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "John journeyed to the bedroom. Mary grabbed the apple. Mary went back to the bathroom. Daniel journeyed to the bedroom. Daniel moved to the garden. Mary travelled to the kitchen."
question: "Where was the apple before the kitchen?"
target: "Before the kitchen the apple was in the bathroom."
- input: "John went back to the bedroom. John went back to the garden. John went back to the kitchen. Sandra took the football. Sandra travelled to the garden. Sandra journeyed to the bedroom."
question: "Where was the football before the bedroom?"
target: "Before the bedroom the football was in the garden."
include: _babilong_common_yaml
task: babilong_qa4
test_split: qa4
custom_dataset: !function common_utils.load_dataset
dataset_kwargs:
qa_split: qa4
description: "I will give you context with the facts about different people, their location and actions, hidden in some random text and a question. You need to answer the question based only on the information from the facts.\nYour answer should contain only one word - location. Do not write anything else after that.\n\n"
doc_to_text: "{{input.strip()}}\n{{question.strip()}}"
fewshot_config:
sampler: first_n
samples:
- input: "The hallway is south of the kitchen. The bedroom is north of the kitchen."
question: "What is the kitchen south of?"
target: "bedroom"
- input: "The garden is west of the bedroom. The bedroom is west of the kitchen."
question: "What is west of the bedroom?"
target: "garden"