Unverified Commit 3816796e authored by Alexandre Marques's avatar Alexandre Marques Committed by GitHub

Adds MMLU CoT, gsm8k and arc_challenge for llama instruct (#2829)

* llama-style MMLU CoT

* Refactor MMLU CoT template YAML to simplify 'until' structure

* Add GSM8K task configuration for LLaMA3 with few-shot examples

* Fix missing newline at end of MMLU CoT YAML file

* Add ARC-Challenge task configuration and processing utility

* Add additional MMLU and ARC-Challenge task variants to README

* Update README with notes on arc_challenge_llama dataset preprocessing
parent 1514ac1e
@@ -30,7 +30,24 @@ BibTeX-formatted citation goes here
#### Tasks
* `mmlu_llama`: `generation variant of MMLU`
* `mmlu_pro_llama`: `generation variant of MMLU-PRO`
* `mmlu_cot_llama`: `Chain-of-thought variant of MMLU`
* `mmlu_it_llama`: `Italian version of generation MMLU`
* `mmlu_fr_llama`: `French version of generation MMLU`
* `mmlu_pt_llama`: `Portuguese version of generation MMLU`
* `mmlu_th_llama`: `Thai version of generation MMLU`
* `mmlu_hi_llama`: `Hindi version of generation MMLU`
* `mmlu_es_llama`: `Spanish version of generation MMLU`
* `mmlu_de_llama`: `German version of generation MMLU`
* `arc_challenge_chat`: `generation variant of ARC-Challenge using MMLU format`
* `arc_challenge_llama`: `generation variant of ARC-Challenge following Meta pre-processing`
* `gsm8k_llama`: `Chain-of-thought variant of GSM8K`

**Notes regarding arc_challenge_llama:**
- The original ARC-Challenge dataset contains 8 samples with fewer than 4 options. Meta filtered these samples out, and `arc_challenge_llama` does the same.
- A small number of samples use 1, 2, 3, 4 as answer labels. During doc preprocessing these are replaced by A, B, C, D to match the rest of the dataset.
### Checklist
task: arc_challenge_llama
dataset_name: ARC-Challenge
dataset_path: allenai/ai2_arc
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "Given the following question and four candidate answers (A, B, C and D), choose the best answer.\nQuestion: {{ question }}\nA. {{ choices.text[0] }}\nB. {{ choices.text[1] }}\nC. {{ choices.text[2] }}\nD. {{ choices.text[3] }}\nYour response should end with \"The best answer is [the_answer_letter]\" where the [the_answer_letter] is one of A, B, C or D."
doc_to_target: "{{answerKey}}"
gen_prefix: "The best answer is"
num_fewshot: 0
output_type: generate_until
generation_kwargs:
  do_sample: false
  max_gen_toks: 100
  until: []
filter_list:
  - name: strict_match
    filter:
      - function: remove_whitespace
      - function: take_first
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
    regexes_to_ignore:
      - "\\$"
      - "\\.$"
metadata:
  version: 1.0
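Under the options above, predictions and targets are compared after lowercasing and stripping the `regexes_to_ignore` patterns. A minimal sketch of that normalization (a hypothetical helper, not the harness's actual implementation):

```python
import re

# Hypothetical sketch of the normalization implied by the metric options:
# regexes_to_ignore strips "$" and a trailing ".", ignore_case lowercases.
def normalize(text: str) -> str:
    for rx in (r"\$", r"\.$"):
        text = re.sub(rx, "", text)
    return text.strip().lower()

# A model answer "C." then matches the gold target "C".
print(normalize("C.") == normalize("C"))  # True
```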
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    label = ["A", "B", "C", "D"]

    def _process_doc(doc):
        choices = doc["choices"]
        choices["label"] = label
        answerKey = doc["answerKey"]
        # Numeric labels ("1".."4") are remapped to the letters "A".."D".
        if answerKey not in label:
            answerKey = label[int(answerKey) - 1]
        return {
            "question": doc["question"],
            "choices": choices,
            "answerKey": answerKey,
        }

    # Meta drops the 8 samples with fewer than 4 options; replicate that here.
    return dataset.filter(lambda x: len(x["choices"]["label"]) == 4).map(_process_doc)
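To illustrate what this preprocessing does to a numerically-labeled sample, here is a dependency-free sketch of the same remapping applied to a made-up doc (the sample content is hypothetical; the real function operates on a `datasets.Dataset`):

```python
label = ["A", "B", "C", "D"]

def remap(doc):
    # Same remapping logic as _process_doc, minus the `datasets` dependency.
    doc = dict(doc)
    doc["choices"] = dict(doc["choices"], label=label)
    key = doc["answerKey"]
    doc["answerKey"] = key if key in label else label[int(key) - 1]
    return doc

sample = {
    "question": "2 + 2 = ?",
    "choices": {"text": ["3", "4", "5", "6"], "label": ["1", "2", "3", "4"]},
    "answerKey": "2",
}
print(remap(sample)["answerKey"])  # B
```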
include: ../../../gsm8k/gsm8k-cot.yaml
doc_to_text: "Given the following problem, reason and give a final answer to the problem.\nProblem: {{question}}\nYour response should end with \"The final answer is [answer]\" where [answer] is the response to the problem."
doc_to_target: '{{answer.split(''####'')[-1].strip() if answer is defined else target}}'
fewshot_config:
  sampler: first_n
  samples:
    - question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
      target: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The final answer is 6.
    - question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
      target: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The final answer is 5.
    - question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
      target: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The final answer is 39.
    - question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
      target: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The final answer is 8.
    - question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
      target: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The final answer is 9.
    - question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
      target: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The final answer is 29.
    - question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
      target: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The final answer is 33.
    - question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
      target: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The final answer is 8.
filter_list:
  - name: strict_match
    filter:
      - function: "regex"
        regex_pattern: final answer is (\-?[0-9\.\,]+)
        group_select: -1
      - function: take_first
  - name: flexible_extract
    filter:
      - function: regex
        group_select: -1
        regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+)
      - function: take_first
num_fewshot: 8
generation_kwargs:
  do_sample: false
  max_gen_toks: 1024
  temperature: 0
  until: []
task: gsm8k_llama
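The `strict_match` filter in this config extracts the last occurrence of the answer pattern from the model's completion (`group_select: -1`). Roughly equivalent Python, using an invented completion string:

```python
import re

# The regex from the strict_match filter; group_select: -1 keeps the last match.
pattern = re.compile(r"final answer is (\-?[0-9\.\,]+)")
completion = "5 bagels cost 5 x 3 = 15 dollars. 23 - 15 is 8. The final answer is 8"
matches = pattern.findall(completion)
print(matches[-1] if matches else "")  # 8
```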
group: mmlu_cot_llama
task:
  - mmlu_cot_llama_stem
  - mmlu_cot_llama_other
  - mmlu_cot_llama_social_sciences
  - mmlu_cot_llama_humanities
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
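With `weight_by_size: True`, the group score is a document-count-weighted mean of the subtask scores rather than a plain average. A sketch with made-up numbers:

```python
# Hypothetical subtask accuracies and document counts; not real benchmark results.
def weighted_mean(scores, sizes):
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# e.g. 0.8 accuracy over 100 docs and 0.5 over 300 docs
print(weighted_mean([0.8, 0.5], [100, 300]))  # 0.575
```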
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
test_split: test
output_type: generate_until
doc_to_text: "Given the following question and four candidate answers (A, B, C and D), choose the best answer.\n\nQuestion: {{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\n\n- For simple problems:\nDirectly provide the answer with minimal explanation.\n\n- For complex problems:\nUse this step-by-step format:\n## Step 1: [Concise description]\n[Brief explanation]\n## Step 2: [Concise description]\n[Brief explanation]\n\nRegardless of the approach, always conclude with:\nThe best answer is [the_answer_letter].\nwhere the [the_answer_letter] is one of A, B, C or D.\n\nLet's think step by step."
doc_to_target: "{{['A','B','C','D'][answer]}}"
generation_kwargs:
  do_sample: false
  temperature: 0
  max_gen_toks: 1024
  until: []
filter_list:
  - name: strict_match
    filter:
      - function: "regex"
        regex_pattern: "best answer is ([A-Z])"
        group_select: -1
      - function: take_first
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
num_fewshot: 0
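The template's `strict_match` filter pulls the answer letter out of the chain-of-thought text, taking the last match so a letter mentioned mid-reasoning does not shadow the final conclusion. An approximate reimplementation (the model response shown is fabricated):

```python
import re

# Pattern from the config above; group_select: -1 takes the last match.
pattern = re.compile(r"best answer is ([A-Z])")
response = (
    "## Step 1: Check closure\n"
    "One might think the best answer is A, but the operation is not closed.\n"
    "The best answer is C."
)
matches = pattern.findall(response)
print(matches[-1] if matches else "")  # C
```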
group: mmlu_cot_llama_humanities
group_alias: humanities
task:
  - mmlu_cot_llama_humanities_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
group: mmlu_cot_llama_other
group_alias: other
task:
  - mmlu_cot_llama_other_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
group: mmlu_cot_llama_social_sciences
group_alias: social sciences
task:
  - mmlu_cot_llama_social_sciences_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
group: mmlu_cot_llama_stem
group_alias: stem
task:
  - mmlu_cot_llama_stem_tasks
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
    filter_list: strict_match
metadata:
  version: 1
"dataset_name": "abstract_algebra"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_abstract_algebra"
"task_alias": "abstract algebra"
"dataset_name": "anatomy"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_anatomy"
"task_alias": "anatomy"
"dataset_name": "astronomy"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_astronomy"
"task_alias": "astronomy"
"dataset_name": "business_ethics"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_other_tasks"
"task": "mmlu_cot_llama_business_ethics"
"task_alias": "business ethics"
"dataset_name": "clinical_knowledge"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_other_tasks"
"task": "mmlu_cot_llama_clinical_knowledge"
"task_alias": "clinical knowledge"
"dataset_name": "college_biology"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_biology"
"task_alias": "college biology"
"dataset_name": "college_chemistry"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_chemistry"
"task_alias": "college chemistry"
"dataset_name": "college_computer_science"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_computer_science"
"task_alias": "college computer science"
"dataset_name": "college_mathematics"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_stem_tasks"
"task": "mmlu_cot_llama_college_mathematics"
"task_alias": "college mathematics"
"dataset_name": "college_medicine"
"description": ""
"include": "_mmlu_cot_llama_template_yaml"
"tag": "mmlu_cot_llama_other_tasks"
"task": "mmlu_cot_llama_college_medicine"
"task_alias": "college medicine"