Unverified Commit 3816796e authored by Alexandre Marques's avatar Alexandre Marques Committed by GitHub

Adds MMLU CoT, gsm8k and arc_challenge for llama instruct (#2829)

* llama-style MMLU CoT

* Refactor MMLU CoT template YAML to simplify 'until' structure

* Add GSM8K task configuration for LLaMA3 with few-shot examples

* Fix missing newline at end of MMLU CoT YAML file

* Add ARC-Challenge task configuration and processing utility

* Add additional MMLU and ARC-Challenge task variants to README

* Update README with notes on arc_challenge_llama dataset preprocessing
parent 1514ac1e
@@ -30,7 +30,24 @@ BibTeX-formatted citation goes here
#### Tasks
* `mmlu_llama`: `generation variant of MMLU`
* `mmlu_pro_llama`: `generation variant of MMLU-PRO`
* `mmlu_cot_llama`: `Chain-of-thought variant of MMLU`
* `mmlu_it_llama`: `Italian version of generation MMLU`
* `mmlu_fr_llama`: `French version of generation MMLU`
* `mmlu_pt_llama`: `Portuguese version of generation MMLU`
* `mmlu_th_llama`: `Thai version of generation MMLU`
* `mmlu_hi_llama`: `Hindi version of generation MMLU`
* `mmlu_es_llama`: `Spanish version of generation MMLU`
* `mmlu_de_llama`: `German version of generation MMLU`
* `arc_challenge_chat`: `generation variant of ARC-Challenge using MMLU format`
* `arc_challenge_llama`: `generation variant of ARC-Challenge following Meta pre-processing`
* `gsm8k_llama`: `Chain-of-thought variant of GSM8K`

**Notes regarding arc_challenge_llama:**
- The original ARC-Challenge dataset contains 8 samples with fewer than 4 options. Meta filtered these samples out, and `arc_challenge_llama` does the same.
- A small number of samples use 1, 2, 3, 4 as answer labels. During doc preprocessing these are replaced by A, B, C, D to match the rest of the dataset.
### Checklist
task: arc_challenge_llama
dataset_name: ARC-Challenge
dataset_path: allenai/ai2_arc
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "Given the following question and four candidate answers (A, B, C and D), choose the best answer.\nQuestion: {{ question }}\nA. {{ choices.text[0] }}\nB. {{ choices.text[1] }}\nC. {{ choices.text[2] }}\nD. {{ choices.text[3] }}\nYour response should end with \"The best answer is [the_answer_letter]\" where the [the_answer_letter] is one of A, B, C or D."
doc_to_target: "{{answerKey}}"
gen_prefix: "The best answer is"
num_fewshot: 0
output_type: generate_until
generation_kwargs:
  do_sample: false
  max_gen_toks: 100
  until: []
filter_list:
  - name: strict_match
    filter:
      - function: remove_whitespace
      - function: take_first
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
    regexes_to_ignore:
      - "\\$"
      - "\\.$"
metadata:
  version: 1.0
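Under the options above, predictions and targets are compared after lowercasing and stripping the `regexes_to_ignore` patterns. A minimal sketch of that normalization (a hypothetical helper, not the harness's actual implementation):

```python
import re

# Hypothetical sketch of the normalization implied by the metric options:
# regexes_to_ignore strips "$" and a trailing ".", ignore_case lowercases.
def normalize(text: str) -> str:
    for rx in (r"\$", r"\.$"):
        text = re.sub(rx, "", text)
    return text.strip().lower()

# A model answer "C." then matches the gold target "C".
print(normalize("C.") == normalize("C"))  # True
```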
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    label = ["A", "B", "C", "D"]

    def _process_doc(doc):
        choices = doc["choices"]
        choices["label"] = label
        answerKey = doc["answerKey"]
        # Numeric labels ("1".."4") are remapped to the letters "A".."D".
        if answerKey not in label:
            answerKey = label[int(answerKey) - 1]
        return {
            "question": doc["question"],
            "choices": choices,
            "answerKey": answerKey,
        }

    # Meta drops the 8 samples with fewer than 4 options; replicate that here.
    return dataset.filter(lambda x: len(x["choices"]["label"]) == 4).map(_process_doc)
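To illustrate what this preprocessing does to a numerically-labeled sample, here is a dependency-free sketch of the same remapping applied to a made-up doc (the sample content is hypothetical; the real function operates on a `datasets.Dataset`):

```python
label = ["A", "B", "C", "D"]

def remap(doc):
    # Same remapping logic as _process_doc, minus the `datasets` dependency.
    doc = dict(doc)
    doc["choices"] = dict(doc["choices"], label=label)
    key = doc["answerKey"]
    doc["answerKey"] = key if key in label else label[int(key) - 1]
    return doc

sample = {
    "question": "2 + 2 = ?",
    "choices": {"text": ["3", "4", "5", "6"], "label": ["1", "2", "3", "4"]},
    "answerKey": "2",
}
print(remap(sample)["answerKey"])  # B
```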
include: ../../../gsm8k/gsm8k-cot.yaml
doc_to_text: "Given the following problem, reason and give a final answer to the problem.\nProblem: {{question}}\nYour response should end with \"The final answer is [answer]\" where [answer] is the response to the problem."
doc_to_target: '{{answer.split(''####'')[-1].strip() if answer is defined else target}}'
fewshot_config:
  sampler: first_n
  samples:
    - question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
      target: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The final answer is 6.
    - question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
      target: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The final answer is 5.
    - question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
      target: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The final answer is 39.
    - question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
      target: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The final answer is 8.
    - question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
      target: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The final answer is 9.
    - question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
      target: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The final answer is 29.
    - question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
      target: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The final answer is 33.
    - question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
      target: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The final answer is 8.
filter_list:
  - name: strict_match
    filter:
      - function: "regex"
        regex_pattern: final answer is (\-?[0-9\.\,]+)
        group_select: -1
      - function: take_first
  - name: flexible_extract
    filter:
      - function: regex
        group_select: -1
        regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+)
      - function: take_first
num_fewshot: 8
generation_kwargs:
  do_sample: false
  max_gen_toks: 1024
  temperature: 0
  until: []
task: gsm8k_llama
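The `strict_match` filter in this config extracts the last occurrence of the answer pattern from the model's completion (`group_select: -1`). Roughly equivalent Python, using an invented completion string:

```python
import re

# The regex from the strict_match filter; group_select: -1 keeps the last match.
pattern = re.compile(r"final answer is (\-?[0-9\.\,]+)")
completion = "5 bagels cost 5 x 3 = 15 dollars. 23 - 15 is 8. The final answer is 8"
matches = pattern.findall(completion)
print(matches[-1] if matches else "")  # 8
```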
group: mmlu_cot_llama
task:
  - mmlu_cot_llama_stem
  - mmlu_cot_llama_other
  - mmlu_cot_llama_social_sciences
  - mmlu_cot_llama_humanities
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
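With `weight_by_size: True`, the group score is a document-count-weighted mean of the subtask scores rather than a plain average. A sketch with made-up numbers:

```python
# Hypothetical subtask accuracies and document counts; not real benchmark results.
def weighted_mean(scores, sizes):
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# e.g. 0.8 accuracy over 100 docs and 0.5 over 300 docs
print(weighted_mean([0.8, 0.5], [100, 300]))  # 0.575
```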
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
test_split: test
output_type: generate_until
doc_to_text: "Given the following question and four candidate answers (A, B, C and D), choose the best answer.\n\nQuestion: {{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\n\n- For simple problems:\nDirectly provide the answer with minimal explanation.\n\n- For complex problems:\nUse this step-by-step format:\n## Step 1: [Concise description]\n[Brief explanation]\n## Step 2: [Concise description]\n[Brief explanation]\n\nRegardless of the approach, always conclude with:\nThe best answer is [the_answer_letter].\nwhere the [the_answer_letter] is one of A, B, C or D.\n\nLet's think step by step."
doc_to_target: "{{['A','B','C','D'][answer]}}"
generation_kwargs:
  do_sample: false
  temperature: 0
  max_gen_toks: 1024
  until: []
filter_list:
  - name: strict_match
    filter:
      - function: "regex"
        regex_pattern: "best answer is ([A-Z])"
        group_select: -1
      - function: take_first
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
num_fewshot: 0
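The template's `strict_match` filter pulls the answer letter out of the chain-of-thought text, taking the last match so a letter mentioned mid-reasoning does not shadow the final conclusion. An approximate reimplementation (the model response shown is fabricated):

```python
import re

# Pattern from the config above; group_select: -1 takes the last match.
pattern = re.compile(r"best answer is ([A-Z])")
response = (
    "## Step 1: Check closure\n"
    "One might think the best answer is A, but the operation is not closed.\n"
    "The best answer is C."
)
matches = pattern.findall(response)
print(matches[-1] if matches else "")  # C
```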
group: mmlu_cot_llama_humanities
group_alias: humanities
task:
  - mmlu_cot_llama_humanities_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
group: mmlu_cot_llama_other
group_alias: other
task:
  - mmlu_cot_llama_other_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
group: mmlu_cot_llama_social_sciences
group_alias: social sciences
task:
  - mmlu_cot_llama_social_sciences_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
group: mmlu_cot_llama_stem
group_alias: stem
task:
  - mmlu_cot_llama_stem_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
"dataset_name": "abstract_algebra"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_abstract_algebra"
"task_alias": "abstract algebra"
"dataset_name": "anatomy"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_anatomy"
"task_alias": "anatomy"
"dataset_name": "astronomy"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_astronomy"
"task_alias": "astronomy"
"dataset_name": "business_ethics"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_other_tasks"
"task": "mmlu_cot_llama_business_ethics"
"task_alias": "business ethics"
"dataset_name": "clinical_knowledge"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_other_tasks"
"task": "mmlu_cot_llama_clinical_knowledge"
"task_alias": "clinical knowledge"
"dataset_name": "college_biology"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_biology"
"task_alias": "college biology"
"dataset_name": "college_chemistry"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_chemistry"
"task_alias": "college chemistry"
"dataset_name": "college_computer_science"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_computer_science"
"task_alias": "college computer science"
"dataset_name": "college_mathematics"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_mathematics"
"task_alias": "college mathematics"
"dataset_name": "college_medicine"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_other_tasks"
"task": "mmlu_cot_llama_college_medicine"
"task_alias": "college medicine"