gaoqiong / lm-evaluation-harness · Commit 470059f6
Authored Nov 24, 2023 by lintangsutawika
merge conflict
Parents: b8d7d6c3, 9d030712
Showing 20 changed files with 117 additions and 0 deletions (+117 −0)
lm_eval/tasks/bbh/flan_cot_fewshot/web_of_lies.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_fewshot/word_sorting.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/_flan_cot_zeroshot_template_yaml (+22 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/boolean_expressions.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/causal_judgement.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/date_understanding.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/disambiguation_qa.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/dyck_languages.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/formal_fallacies.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/geometric_shapes.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/hyperbaton.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_five_objects.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_seven_objects.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_three_objects.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/movie_recommendation.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/multistep_arithmetic_two.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/navigate.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/object_counting.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/penguins_in_a_table.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/reasoning_about_colored_objects.yaml (+5 −0)
Too many changes to show: to preserve performance, only 1000 of 1000+ files are displayed.
lm_eval/tasks/bbh/flan_cot_fewshot/web_of_lies.yaml (new file, mode 100644)
"dataset_name": "web_of_lies"
"description": "Evaluate a random boolean function expressed as a word problem.\n\n"
"doc_to_text": "Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent lies. Does Raymond tell the truth?\nA: Let's think step by step.\n(1) Fidel tells the truth. So, we know that Fidel tells the truth.\n(2) Jerry says Fidel tells the truth. Since we know from (1) that Fidel tells the truth, if Jerry says that Fidel tells the truth, then Jerry tells the truth.\n(3) Vina says Jerry tells the truth. Since we know from (2) that Jerry tells the truth, if Vina says Jerry tells the truth, then Vine tells the truth.\n(4) Millicent says Vina lies. Since we know from (3) that Vina tells the truth, if Millicent says Vina lies, then Millicent lies.\n(5) Raymond says Millicent lies. Since we know from (4) that Millicent lies, if Raymond says Millicent lies, then Raymond tells the truth.\nNow, the question asks: Does Raymond tell the truth? We know from (5) that Raymond tells the truth. So the answer is Yes. Question: Kristian lies. Millie says Kristian lies. Maybelle says Millie tells the truth. Fidel says Maybelle lies. Leda says Fidel lies. Does Leda tell the truth?\nA: Let's think step by step.\n(1) Kristian lies. So, we know that Kristian lies.\n(2) Millie says Kristian lies. Since we know from (1) that Kristian lies, if Millie says Kristian lies, then Millie tells the truth.\n(3) Maybelle says Millie tells the truth. Since we know from (2) that Millie tells the truth, if Maybelle says Millie tells the truth, then Maybelle tells the truth.\n(4) Fidel says Maybelle lies. Since we know from (3) that Maybelle tells the truth, if Fidel says Maybelle lies, then Fidel lies.\n(5) Leda says Fidel lies. Since we know from (4) that Fidel lies, if Leda says Fidel lies, then Leda tells the truth.\nNow, the question asks: Does Leda tell the truth? We know from (5) that Leda tells the truth. So the answer is Yes. Question: Kristian tells the truth. Michaela says Kristian lies. Raymond says Michaela tells the truth. Osvaldo says Raymond tells the truth. Jamey says Osvaldo tells the truth. Does Jamey tell the truth?\nA: Let's think step by step.\n(1) Kristian tells the truth. So, we know that Kristian tells the truth.\n(2) Michaela says Kristian lies. Since we know from (1) that Kristian tells the truth, if Michaela says Kristian lies, then Michaela lies.\n(3) Raymond says Michaela tells the truth. Since we know from (2) that Michaela lies, if Raymond says Michaela tells the truth, then Raymond lies.\n(4) Osvaldo says Raymond tells the truth. Since we know from (3) that Raymond lies, if Osvaldo says Raymond tells the truth, then Osvaldo lies.\n(5) Jamey says Osvaldo tells the truth. Since we know from (4) that Osvaldo lies, if Jamey says Osvaldo tells the truth, then Jamey lies.\nNow, the question asks: Does Jamey tell the truth? We know from (5) that Jamey lies. So the answer is No.Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_fewshot_template_yaml"
"task": "bbh_flan_cot_fewshot_web_of_lies"
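Each few-shot walkthrough above propagates truthfulness down a chain: every later speaker comments on the one before, and a speaker is truthful exactly when their claim matches reality. A minimal illustrative sketch of that logic (the helper `tells_truth` is not part of the harness; it only mirrors the reasoning in the prompts):

```python
def tells_truth(first_is_truthful, claims):
    """Propagate truthfulness down a chain of speakers.

    claims[i] is True if speaker i+1 says the previous speaker tells
    the truth, False if they say the previous speaker lies.
    """
    truthful = first_is_truthful
    for says_truthful in claims:
        # A speaker is truthful exactly when their claim matches reality.
        truthful = (truthful == says_truthful)
    return truthful

# First example: Fidel tells the truth; Jerry says truth, Vina says truth,
# Millicent says lies, Raymond says lies.
print(tells_truth(True, [True, True, False, False]))  # True  -> "Yes"

# Third example: Kristian tells the truth; Michaela says lies, then
# Raymond, Osvaldo, and Jamey each vouch for the previous speaker.
print(tells_truth(True, [False, True, True, True]))   # False -> "No"
```

The same helper reproduces the second example too: Leda comes out truthful, matching the prompt's "Yes".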
lm_eval/tasks/bbh/flan_cot_fewshot/word_sorting.yaml (new file, mode 100644)
"dataset_name": "word_sorting"
"description": "Sort a list of words.\n\n"
"doc_to_text": "Sort the following words alphabetically: List: oven costume counterpart\nA: Let's think step by step.\nThe first letter: \"oven\": \"o\" (15). \"costume\": \"c\" (3). \"counterpart\": \"c\" (3). We now have: (3) [\"costume\" ? \"counterpart\"] < (15) \"oven\". Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their second letters.\nThe second letter: \"costume\": \"o\" (15). \"counterpart\": \"o\" (15). We now have: (15) [\"costume\" ? \"counterpart\"]. Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their third letters.\nThe third letter: \"costume\": \"s\" (19). \"counterpart\": \"u\" (21). We now have: (19) \"costume\" < (21) \"counterpart\". Hence, we have [\"costume\" < \"counterpart\"] < \"oven\". So the answer is costume counterpart oven. Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: Let's think step by step.\nThe first letter: \"hypochlorite\": \"h\" (8). \"ponderosa\": \"p\" (16). \"phone\": \"p\" (16). \"credulity\": \"c\" (3). We now have: (3) \"credulity\" < (8) \"hypochlorite\" < (16) [\"ponderosa\" ? \"phone\"]. Now let's sort this subpart [\"ponderosa\" ? \"phone\"] by looking at their second letters.\nThe second letter: \"ponderosa\": \"o\" (15). \"phone\": \"h\" (8). We now have: (8) \"phone\" < (15) \"ponderosa\". Hence, we have \"credulity\" < \"hypochlorite\" < [\"phone\" < \"ponderosa\"]. So the answer is credulity hypochlorite phone ponderosa. Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: Let's think step by step.\nThe first letter: \"newt\": \"n\" (14). \"arson\": \"a\" (1). \"parthia\": \"p\" (16). \"seismography\": \"s\" (19). \"mugho\": \"m\" (13). \"aspect\": \"a\" (1). \"census\": \"c\" (3). We now have: (1) [\"arson\" ? \"aspect\"] < (3) \"census\" < (13) \"mugho\" < (14) \"newt\" < (16) \"parthia\" < (19) \"seismography\". Now let's sort this subpart [\"arson\" ? \"aspect\"] by looking at their second letters.\nThe second letter: \"arson\": \"r\" (18). \"aspect\": \"s\" (19). We now have: (18) \"arson\" < (19) \"aspect\". Hence, we have [\"arson\" < \"aspect\"] < \"census\" < \"mugho\" < \"newt\" < \"parthia\" < \"seismography\". So the answer is arson aspect census mugho newt parthia seismography.Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_fewshot_template_yaml"
"task": "bbh_flan_cot_fewshot_word_sorting"
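The walkthroughs in this prompt spell out ordinary lexicographic comparison, annotating each letter with its 1-based position in the alphabet (e.g. "o" is 15). A short illustrative sketch, not part of the harness, showing that a plain built-in sort reproduces each example's answer:

```python
def alphabet_position(letter):
    # 1-based position used in the prompt's annotations: "o" -> 15, "c" -> 3
    return ord(letter.lower()) - ord("a") + 1

print(alphabet_position("o"), alphabet_position("c"))  # 15 3

# Letter-by-letter comparison as described in the walkthrough is just
# lexicographic order, so sorted() gives the same result.
print(" ".join(sorted(["oven", "costume", "counterpart"])))
# costume counterpart oven
print(" ".join(sorted(["hypochlorite", "ponderosa", "phone", "credulity"])))
# credulity hypochlorite phone ponderosa
```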
lm_eval/tasks/bbh/flan_cot_zeroshot/_flan_cot_zeroshot_template_yaml (new file, mode 100644)
group: bbh_flan_cot_zeroshot
dataset_path: lukaemon/bbh
output_type: generate_until
test_split: test
doc_to_target: "{{target}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    # ignore_case: true
    # ignore_punctuation: true
generation_kwargs:
  until:
    - "</s>"
  do_sample: false
  temperature: 0.0
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
      - function: "take_first"
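The "get-answer" filter extracts whatever follows one of the "The answer is" / "the answer is" / "The answer:" / "The final answer:" cues; the unescaped `(?=.)` lookahead forces the greedy `(.*)` to back off by one character, which in practice strips the trailing period. A quick illustrative check of the pattern with plain `re`, outside the harness:

```python
import re

# regex_pattern copied from the template above
pattern = (
    r"((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
)

completion = "We know from (5) that Raymond tells the truth. So the answer is Yes."
match = re.search(pattern, completion)  # "take_first" keeps the first match
print(match.group(0))  # Yes
```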
lm_eval/tasks/bbh/flan_cot_zeroshot/boolean_expressions.yaml (new file, mode 100644)
"dataset_name": "boolean_expressions"
"description": "Evaluate the result of a random Boolean expression.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_boolean_expressions"
lm_eval/tasks/bbh/flan_cot_zeroshot/causal_judgement.yaml (new file, mode 100644)
"dataset_name": "causal_judgement"
"description": "Answer questions about causal attribution.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_causal_judgement"
lm_eval/tasks/bbh/flan_cot_zeroshot/date_understanding.yaml (new file, mode 100644)
"dataset_name": "date_understanding"
"description": "Infer the date from context.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_date_understanding"
lm_eval/tasks/bbh/flan_cot_zeroshot/disambiguation_qa.yaml (new file, mode 100644)
"dataset_name": "disambiguation_qa"
"description": "Clarify the meaning of sentences with ambiguous pronouns.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_disambiguation_qa"
lm_eval/tasks/bbh/flan_cot_zeroshot/dyck_languages.yaml (new file, mode 100644)
"dataset_name": "dyck_languages"
"description": "Correctly close a Dyck-n word.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_dyck_languages"
lm_eval/tasks/bbh/flan_cot_zeroshot/formal_fallacies.yaml (new file, mode 100644)
"dataset_name": "formal_fallacies"
"description": "Distinguish deductively valid arguments from formal fallacies.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_formal_fallacies"
lm_eval/tasks/bbh/flan_cot_zeroshot/geometric_shapes.yaml (new file, mode 100644)
"dataset_name": "geometric_shapes"
"description": "Name geometric shapes from their SVG paths.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_geometric_shapes"
lm_eval/tasks/bbh/flan_cot_zeroshot/hyperbaton.yaml (new file, mode 100644)
"dataset_name": "hyperbaton"
"description": "Order adjectives correctly in English sentences.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_hyperbaton"
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_five_objects.yaml (new file, mode 100644)
"dataset_name": "logical_deduction_five_objects"
"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_logical_deduction_five_objects"
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_seven_objects.yaml (new file, mode 100644)
"dataset_name": "logical_deduction_seven_objects"
"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_logical_deduction_seven_objects"
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_three_objects.yaml (new file, mode 100644)
"dataset_name": "logical_deduction_three_objects"
"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_logical_deduction_three_objects"
lm_eval/tasks/bbh/flan_cot_zeroshot/movie_recommendation.yaml (new file, mode 100644)
"dataset_name": "movie_recommendation"
"description": "Recommend movies similar to the given list of movies.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_movie_recommendation"
lm_eval/tasks/bbh/flan_cot_zeroshot/multistep_arithmetic_two.yaml (new file, mode 100644)
"dataset_name": "multistep_arithmetic_two"
"description": "Solve multi-step arithmetic problems.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_multistep_arithmetic_two"
lm_eval/tasks/bbh/flan_cot_zeroshot/navigate.yaml (new file, mode 100644)
"dataset_name": "navigate"
"description": "Given a series of navigation instructions, determine whether one would end up back at the starting point.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_navigate"
lm_eval/tasks/bbh/flan_cot_zeroshot/object_counting.yaml (new file, mode 100644)
"dataset_name": "object_counting"
"description": "Questions that involve enumerating objects and asking the model to count them.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_object_counting"
lm_eval/tasks/bbh/flan_cot_zeroshot/penguins_in_a_table.yaml (new file, mode 100644)
"dataset_name": "penguins_in_a_table"
"description": "Answer questions about a table of penguins and their attributes.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_penguins_in_a_table"
lm_eval/tasks/bbh/flan_cot_zeroshot/reasoning_about_colored_objects.yaml (new file, mode 100644)
"dataset_name": "reasoning_about_colored_objects"
"description": "Answer extremely simple questions about the colors of objects on a surface.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_reasoning_about_colored_objects"