Title: ACPBench: Reasoning About Action, Change, and Planning
Abstract: https://arxiv.org/pdf/2410.05669
There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on the core skills required for planning. ACPBench is a benchmark for evaluating reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language, which allows the synthesized problems to have provably correct solutions across many tasks and domains. It also allows the benchmark to scale without additional human effort, i.e., many additional problems can be created automatically.
Homepage: https://ibm.github.io/ACPBench/
### Citation
```
@inproceedings{kokel2025acp,
author = {Harsha Kokel and
Michael Katz and
Kavitha Srinivas and
Shirin Sohrabi},
title = {ACPBench: Reasoning about Action, Change, and Planning},
booktitle = {{AAAI}},
publisher = {{AAAI} Press},
year = {2025}
}
```
### Groups, Tags, and Tasks
#### Groups
* None
#### Tags
* `acp_bench` : Evaluates `acp_bool_cot_2shot` and `acp_mcq_cot_2shot` (see the usage sketch below)
* `acp_bool_cot_2shot` : Evaluates `acp_areach_bool`, `acp_app_bool`, `acp_just_bool`, `acp_land_bool`, `acp_prog_bool`, `acp_reach_bool`, `acp_val_bool` with chain-of-thought prompting and 2 shots
* `acp_mcq_cot_2shot` : Evaluates `acp_areach_mcq`, `acp_app_mcq`, `acp_just_mcq`, `acp_land_mcq`, `acp_prog_mcq`, `acp_reach_mcq`, `acp_val_mcq` with chain-of-thought prompting and 2 shots
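The tags above can be passed to the harness in place of individual task names. The following is a minimal sketch of running the `acp_bench` tag through the lm-evaluation-harness Python API; the model backend, checkpoint name, and batch size are illustrative placeholders, and keyword arguments may differ slightly across harness versions.

```python
# Minimal sketch, assuming lm-evaluation-harness is installed.
# The pretrained checkpoint and batch size below are placeholders, not recommendations.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # HuggingFace backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tasks=["acp_bench"],  # tag expands to the Boolean and MCQ CoT 2-shot groups
    batch_size=8,
)
print(results["results"])  # per-task metrics
```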
#### Tasks
7 Boolean tasks
* `acp_areach_bool`
* `acp_app_bool`
* `acp_just_bool`
* `acp_land_bool`
* `acp_prog_bool`
* `acp_reach_bool`
* `acp_val_bool`

7 MCQ tasks
* `acp_areach_mcq`
* `acp_app_mcq`
* `acp_just_mcq`
* `acp_land_mcq`
* `acp_prog_mcq`
* `acp_reach_mcq`
* `acp_val_mcq`
> [!NOTE]
> The evaluation scripts are taken from the original GitHub repository: https://github.com/IBM/ACPBench
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?