init

Commit 3041681f authored Mar 19, 2025 by silencealiang
20 changed files
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/tracking_shuffled_objects_five_objects.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/tracking_shuffled_objects_five_objects.yaml
+"dataset_name": "tracking_shuffled_objects_five_objects"
+"description": "A task requiring determining the final positions of a set of objects given their initial positions and a description of a sequence of swaps.\n\n"
+"doc_to_text": "Q: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire has a pink ball.\nAs the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the\nOptions:\n(A) yellow ball\n(B) blue ball\n(C) pink ball\nA: (A)\n\nQ: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a white ball, Bob has a purple ball, and Claire has a pink ball.\nAs the game progresses, pairs of players trade balls. First, Bob and Alice swap balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At the end of the game, Alice has the\nOptions:\n(A) white ball\n(B) purple ball\n(C) pink ball\nA: (C)\n\nQ: Alice, Bob, and Claire are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Lola, Bob is dancing with Rodrigo, and Claire is dancing with Patrick.\nThroughout the song, the dancers often trade partners. First, Alice and Bob switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Lola\n(B) Rodrigo\n(C) Patrick\nA: (C)\n\nQ: {{input}}\nA:"
+"include": "_fewshot_template_yaml"
+"task": "bbh_fewshot_tracking_shuffled_objects_five_objects"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/tracking_shuffled_objects_seven_objects.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/tracking_shuffled_objects_seven_objects.yaml
+"dataset_name": "tracking_shuffled_objects_seven_objects"
+"description": "A task requiring determining the final positions of a set of objects given their initial positions and a description of a sequence of swaps.\n\n"
+"doc_to_text": "Q: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire has a pink ball.\nAs the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the\nOptions:\n(A) yellow ball\n(B) blue ball\n(C) pink ball\nA: (A)\n\nQ: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a white ball, Bob has a purple ball, and Claire has a pink ball.\nAs the game progresses, pairs of players trade balls. First, Bob and Alice swap balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At the end of the game, Alice has the\nOptions:\n(A) white ball\n(B) purple ball\n(C) pink ball\nA: (C)\n\nQ: Alice, Bob, and Claire are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Lola, Bob is dancing with Rodrigo, and Claire is dancing with Patrick.\nThroughout the song, the dancers often trade partners. First, Alice and Bob switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Lola\n(B) Rodrigo\n(C) Patrick\nA: (C)\n\nQ: {{input}}\nA:"
+"include": "_fewshot_template_yaml"
+"task": "bbh_fewshot_tracking_shuffled_objects_seven_objects"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/tracking_shuffled_objects_three_objects.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/tracking_shuffled_objects_three_objects.yaml
+"dataset_name": "tracking_shuffled_objects_three_objects"
+"description": "A task requiring determining the final positions of a set of objects given their initial positions and a description of a sequence of swaps.\n\n"
+"doc_to_text": "Q: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire has a pink ball.\nAs the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the\nOptions:\n(A) yellow ball\n(B) blue ball\n(C) pink ball\nA: (A)\n\nQ: Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a white ball, Bob has a purple ball, and Claire has a pink ball.\nAs the game progresses, pairs of players trade balls. First, Bob and Alice swap balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At the end of the game, Alice has the\nOptions:\n(A) white ball\n(B) purple ball\n(C) pink ball\nA: (C)\n\nQ: Alice, Bob, and Claire are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Lola, Bob is dancing with Rodrigo, and Claire is dancing with Patrick.\nThroughout the song, the dancers often trade partners. First, Alice and Bob switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Lola\n(B) Rodrigo\n(C) Patrick\nA: (C)\n\nQ: {{input}}\nA:"
+"include": "_fewshot_template_yaml"
+"task": "bbh_fewshot_tracking_shuffled_objects_three_objects"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/web_of_lies.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/web_of_lies.yaml
+"dataset_name": "web_of_lies"
+"description": "Evaluate a random boolean function expressed as a word problem.\n\n"
+"doc_to_text": "Q: Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent lies. Does Raymond tell the truth?\nA: Yes\n\nQ: Question: Kristian lies. Millie says Kristian lies. Maybelle says Millie tells the truth. Fidel says Maybelle lies. Leda says Fidel lies. Does Leda tell the truth?\nA: Yes\n\nQ: Question: Kristian tells the truth. Michaela says Kristian lies. Raymond says Michaela tells the truth. Osvaldo says Raymond tells the truth. Jamey says Osvaldo tells the truth. Does Jamey tell the truth?\nA: No\n\nQ: {{input}}\nA:"
+"include": "_fewshot_template_yaml"
+"task": "bbh_fewshot_web_of_lies"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/word_sorting.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/fewshot/word_sorting.yaml
+"dataset_name": "word_sorting"
+"description": "Sort a list of words.\n\n"
+"doc_to_text": "Q: Sort the following words alphabetically: List: oven costume counterpart\nA: costume counterpart oven\n\nQ: Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: credulity hypochlorite phone ponderosa\n\nQ: Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: arson aspect census mugho newt parthia seismography\n\nQ: {{input}}\nA:"
+"include": "_fewshot_template_yaml"
+"task": "bbh_fewshot_word_sorting"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/_zeroshot_template_yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/_zeroshot_template_yaml
+group: bbh_zeroshot
+dataset_path: lukaemon/bbh
+output_type: generate_until
+test_split: test
+doc_to_target: "{{target}}"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    # ignore_punctuation: true
+    regexes_to_ignore:
+      - "\\.$"
+      - ","
+      - "\n"
+      - "\\\\"
+      - '"'
+generation_kwargs:
+  until:
+    - "</s>"
+    - "Q:"
+    - "<|im_end|>"
+  do_sample: false
+  temperature: 0.0
+num_fewshot: 0
+metadata:
+  version: 2.0
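
Review note on `_zeroshot_template_yaml`: the `exact_match` settings above (`ignore_case` plus `regexes_to_ignore`) imply that both prediction and target are normalized before comparison. The Python sketch below shows one plausible reading of that normalization, assuming each pattern is removed in list order with `re.sub` and the text is lowercased afterwards. The function names `normalize` and `exact_match` are hypothetical and are not the harness's API; the real metric may differ in details such as whitespace handling.

```python
import re

# The template's regexes_to_ignore as they read after YAML unescaping
# (assumption: the patterns are applied in this order).
REGEXES_TO_IGNORE = [r"\.$", ",", "\n", r"\\", '"']


def normalize(text, ignore_case=True):
    """Strip every ignored pattern, then optionally lowercase (hypothetical helper)."""
    for pattern in REGEXES_TO_IGNORE:
        text = re.sub(pattern, "", text)
    return text.lower() if ignore_case else text


def exact_match(prediction, reference):
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


if __name__ == "__main__":
    print(exact_match("True.", "true"))      # trailing period and case are ignored
    print(exact_match("1,000\n", "1000"))    # comma and newline are ignored
    print(exact_match("(A)", "(B)"))         # different answers still differ
```

Under this reading the order of the patterns matters: `\.$` only removes a period that is the last character, so a period followed by a closing quote would survive even though the quote itself is stripped later.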
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/boolean_expressions.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/boolean_expressions.yaml
+"dataset_name": "boolean_expressions"
+"description": "Evaluate the result of a random Boolean expression.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_boolean_expressions"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: 0
+        regex_pattern: "\\b(True|False)\\b"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/causal_judgement.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/causal_judgement.yaml
+"dataset_name": "causal_judgement"
+"description": "Answer questions about causal attribution.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_causal_judgement"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: 0
+        regex_pattern: "\\b(Yes|No|yes|no)\\b"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/date_understanding.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/date_understanding.yaml
+"dataset_name": "date_understanding"
+"description": "Infer the date from context.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_date_understanding"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/disambiguation_qa.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/disambiguation_qa.yaml
+"dataset_name": "disambiguation_qa"
+"description": "Clarify the meaning of sentences with ambiguous pronouns.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_disambiguation_qa"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/dyck_languages.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/dyck_languages.yaml
+"dataset_name": "dyck_languages"
+"description": "Correctly close a Dyck-n word.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_dyck_languages"
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: 0
+        regex_pattern: "(?<= )([\" \\[\\(<{}>\\)\\]]+)|([\" \\[\\(<{}>\\)\\]]+)"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/formal_fallacies.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/formal_fallacies.yaml
+"dataset_name": "formal_fallacies"
+"description": "Distinguish deductively valid arguments from formal fallacies.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_formal_fallacies"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: 0
+        regex_pattern: "\\b(valid|invalid)\\b"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/geometric_shapes.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/geometric_shapes.yaml
+"dataset_name": "geometric_shapes"
+"description": "Name geometric shapes from their SVG paths.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_geometric_shapes"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/hyperbaton.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/hyperbaton.yaml
+"dataset_name": "hyperbaton"
+"description": "Order adjectives correctly in English sentences.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_hyperbaton"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/logical_deduction_five_objects.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/logical_deduction_five_objects.yaml
+"dataset_name": "logical_deduction_five_objects"
+"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_logical_deduction_five_objects"
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/logical_deduction_seven_objects.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/logical_deduction_seven_objects.yaml
+"dataset_name": "logical_deduction_seven_objects"
+"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_logical_deduction_seven_objects"
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/logical_deduction_three_objects.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/logical_deduction_three_objects.yaml
+"dataset_name": "logical_deduction_three_objects"
+"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_logical_deduction_three_objects"
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/movie_recommendation.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/movie_recommendation.yaml
+"dataset_name": "movie_recommendation"
+"description": "Recommend movies similar to the given list of movies.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_movie_recommendation"
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: 0
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/multistep_arithmetic_two.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/multistep_arithmetic_two.yaml
+"dataset_name": "multistep_arithmetic_two"
+"description": "Solve multi-step arithmetic problems.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_multistep_arithmetic_two"
+
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.NumberParseRegexFilter
+        group_select: 0
+        regex_pattern: "([-0-9]+)"
+      - function: "take_first"
--- a/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/navigate.yaml
+++ b/LM-Evaluation-Harness-240310/lm_eval/tasks/bbh/zeroshot/navigate.yaml
+"dataset_name": "navigate"
+"description": "Given a series of navigation instructions, determine whether one would end up back at the starting point.\n\n"
+"doc_to_text": "Q: {{input}}\nA:"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_navigate"
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: 0
+        regex_pattern: "\\b(Yes|No|yes|no)\\b"
+      - function: "take_first"
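
Review note on the `filter_list` blocks in the zero-shot files: each task defines a `strict-match` pipeline (take the first response as is) and a `flexible-extract` pipeline (apply a regex, then take the first response). The Python sketch below illustrates what the plain `"regex"` step with `group_select: 0` could do to a single response, assuming it selects the first match and returns a placeholder when nothing matches. The function name `flexible_extract` and the `"[invalid]"` fallback are assumptions for illustration; the custom `utils.MultiChoiceRegexFilter` and `utils.NumberParseRegexFilter` are not modeled here.

```python
import re


def flexible_extract(response, pattern, group_select=0, fallback="[invalid]"):
    """Return the selected regex match from a response (hypothetical helper)."""
    matches = re.findall(pattern, response)
    if not matches:
        return fallback
    match = matches[group_select]
    # With several capture groups, findall yields tuples; keep the first
    # non-empty group (the dyck_languages pattern has two alternatives).
    if isinstance(match, tuple):
        non_empty = [group for group in match if group]
        match = non_empty[0] if non_empty else fallback
    return match.strip()


if __name__ == "__main__":
    yes_no = r"\b(Yes|No|yes|no)\b"       # pattern from navigate.yaml
    choice = r"(\([A-Z]\))"               # pattern from the multiple-choice tasks
    print(flexible_extract("The answer is No, since the path never closes.", yes_no))
    print(flexible_extract("I would pick (B) rather than (A).", choice))
    print(flexible_extract("It is impossible to say.", yes_no))
```

Selecting index 0 means the first answer-like token wins, so a response that restates the options before answering could be scored on the wrong token. That trade-off is presumably why `strict-match` is reported alongside it.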