add mmlu variants

Commit f23ae748 authored Sep 03, 2023 by lintangsutawika
20 changed files
--- a/lm_eval/tasks/mmlu/_cot_prompts.json
+++ b/lm_eval/tasks/mmlu/_cot_prompts.json
--- a/lm_eval/tasks/mmlu/gen_all_splits.py
+++ b/lm_eval/tasks/mmlu/gen_all_splits.py
@@ -11,7 +11,7 @@ from lm_eval import utils
 from lm_eval.logger import eval_logger
 SUBJECTS = [
-    # "abstract_algebra",
+    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
@@ -73,11 +73,14 @@ SUBJECTS = [
 def parse_args():
    parser = argparse.ArgumentParser()
-    # parser.add_argument("--benchmark_name", required=True)
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument(
-        "--task_save_path", default="lm_eval/tasks/mmlu/hendrycks_test_original"
+        "--save_prefix_path", default="flan"
    )
+    parser.add_argument(
+        "--cot_prompt_path", default=None
+    )
+    parser.add_argument("--task_prefix", default="")
    return parser.parse_args()
@@ -91,16 +94,25 @@ if __name__ == "__main__":
        base_yaml = yaml.full_load(f)
    print(base_yaml)
+    if args.cot_prompt_path is not None:
+        import json
+        with open(args.cot_prompt_path) as f:
+            cot_file = json.load(f)
    for subject in tqdm(SUBJECTS):
+        if args.cot_prompt_path is not None:
+            description = cot_file[subject]
+        else:
+            description = f"The following are multiple choice questions (with answers) about {' '.join(subject.split('_'))}.\n\n"
        yaml_dict = {
            "include": base_yaml_name,
-            "task": base_yaml["task"].strip("abstract_algebra") + "subject",
+            "task": f"mmlu_{args.task_prefix}_{subject}",
            "dataset_name": subject,
-            "description": f"The following are multiple choice questions (with answers) about {' '.join(subject.split('_'))}.\n\n",
+            "description": description,
        }
-        file_save_path = args.task_save_path + f"_{subject}.yaml"
+        file_save_path = args.save_prefix_path + f"_{subject}.yaml"
        eval_logger.info(f"Saving yaml for subset {subject} to {file_save_path}")
        with open(file_save_path, "w") as yaml_file:
            yaml.dump(yaml_dict, yaml_file)
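Reviewer note: the per-subject dictionary built in the loop above can be read as a pure function of the subject name. The sketch below is a standalone approximation, not the script itself. `build_subject_config` and its parameter names are hypothetical, and skipping an empty prefix when joining the task name is a choice made for this sketch only.

```python
from typing import Optional


def build_subject_config(
    subject: str,
    base_yaml_name: str,
    task_prefix: str = "",
    cot_prompts: Optional[dict] = None,
) -> dict:
    """Hypothetical helper mirroring the per-subject dict built in the loop."""
    if cot_prompts is not None:
        # CoT variants take the full few-shot prompt from the loaded JSON mapping.
        description = cot_prompts[subject]
    else:
        # Default header: underscores in the subject name become spaces.
        readable = " ".join(subject.split("_"))
        description = (
            "The following are multiple choice questions (with answers) "
            f"about {readable}.\n\n"
        )
    # Assumption of this sketch: drop an empty prefix rather than emit "mmlu__<subject>".
    task = "_".join(part for part in ("mmlu", task_prefix, subject) if part)
    return {
        "include": base_yaml_name,
        "task": task,
        "dataset_name": subject,
        "description": description,
    }


config = build_subject_config(
    "abstract_algebra",
    "_mmlu_flan_cot_fewshot_template_yaml",
    task_prefix="flan_cot_fewshot",
)
print(config["task"])  # mmlu_flan_cot_fewshot_abstract_algebra
```

Passing a `cot_prompts` mapping corresponds to supplying `--cot_prompt_path`; leaving it as `None` reproduces the plain description header.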
--- a/lm_eval/tasks/mmlu/hendrycks_test_original_default.yaml
+++ b/lm_eval/tasks/mmlu/hendrycks_test_original_default.yaml
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu_flan_cot_fewshot_template_yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu_flan_cot_fewshot_template_yaml
+group: mmlu_flan_cot_fewshot
+dataset_path: cais/mmlu
+validation_split: validation
+fewshot_split: dev
+doc_to_text: "\n\nQ: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA: Let's think step by step."
+fewshot_delimiter: ""
+output_type: greedy_until
+doc_to_target: "{{['(A)', '(B)', '(C)', '(D)'][answer]}}"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+generation_kwargs:
+  until:
+    - "</s>"
+  do_sample: false
+  temperature: 0.0
+filter_list:
+  - name: "get-answer"
+    filter:
+      - function: "regex"
+        regex_pattern: "(?<=The answer is )(.*)(.)"
+      - function: "take_first"
\ No newline at end of file
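Reviewer note: the `get-answer` filter in the template above depends on how the pattern's greedy group backtracks. The sketch below uses only Python's `re` module to show what the first capture group contains for a typical generation. `extract_answer` is a hypothetical helper, and taking the first capture group of the first match is an assumption about how the harness applies the pattern.

```python
import re
from typing import Optional

# Pattern copied from the template's filter_list.
ANSWER_PATTERN = re.compile(r"(?<=The answer is )(.*)(.)")


def extract_answer(generation: str) -> Optional[str]:
    """Return the first capture group of the first match, or None if nothing matches."""
    match = ANSWER_PATTERN.search(generation)
    if match is None:
        return None
    return match.group(1).strip()


# The greedy (.*) gives back one character so that the trailing (.) can match
# the final period, leaving "(B)" in the first group.
print(extract_answer("Centromeres segregate chromosomes. The answer is (B)."))  # (B)

# Without a trailing period the last character of the answer is consumed instead.
print(extract_answer("The answer is (B)"))  # (B
```

Because the metric sets `ignore_punctuation: true`, a truncated extraction such as `(B` would presumably still compare equal to the `(B)` target after normalization; that is an inference from the config, not something verified here.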
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_abstract_algebra.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_abstract_algebra.yaml
+dataset_name: abstract_algebra
+description: "The following are multiple choice questions (with answers) about abstract\
+  \ algebra.\n\nQ: Statement 1 | Every element of a group generates a cyclic subgroup\
+  \ of the group. Statement 2 | The symmetric group S_10 has 10 elements.\n(A) True,\
+  \ True (B) False, False (C) True, False (D) False, True\nA: Let's think step by\
+  \ step. A cyclic group is a group that is generated by a single element. Hence a\
+  \ subgroup generated by a single element of a group is cyclic and Statement 1 is\
+  \ True. The symmetric group $S_n$ has $n!$ elements, hence it is not true that\
+  \ $S_{10}$ has 10 elements and Statement 2 is False. The answer is (C).\n\nQ: Find\
+  \ the characteristic of the ring 2Z.\n(A) 0 (B) 3 (C) 12 (D) 30\nA: Let's think\
+  \ step by step. The characteristic of a ring R is $n$ if the statement $ka = 0$\
+  \ for all $a\\in 2Z$ implies that\
+  \ $k$ is a multiple of $n$. Assume that $ka = 0$ for all $a\\in 2Z$ for some $k$.\
+  \ In particular $2k = 0$. Hence $k=0$ and $n=0$. The answer is (A).\n\nQ: Statement\
+  \ 1| Every function from a finite set onto itself must be one to one. Statement\
+  \ 2 | Every subgroup of an abelian group is abelian.\n(A) True, True (B) False,\
+  \ False (C) True, False (D) False, True\nA: Let's think step by step. Statement\
+  \ 1 is true. Let $S$ be a finite set. If $f:S \\rightarrow S$ is an onto function,\
+  \ then $|S| = |f(S)|$. If $f$ were not one to one, then for finite domain $S$ the\
+  \ image would have fewer than $|S|$ elements, a contradiction.\nStatement 2 is true.\
+  \ Let $G$ be an abelian group and $H$ be a subgroup of $G$. We need to show that\
+  \ $H$ is abelian. Let $a,b \\in H$. Then $a,b \\in G$ and $ab=ba$. Since $G$ is\
+  \ abelian, $ab=ba$. Since $H$ is a subgroup of $G$, $ab \\in H$. Therefore, $ab=ba$\
+  \ and $H$ is abelian. The answer is (A).\n\nQ: Statement 1 | If aH is an element\
+  \ of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups\
+  \ of G then HK is a subgroup of G.\n(A) True, True (B) False, False (C) True, False\
+  \ (D) False, True\nA: Let's think step by step. Statement 2 is false. Let $H$ be\
+  \ a subgroup of $S_3$ generated by the cycle $(1,2)$ and $K$ be a subgroup of $S_3$\
+  \ generated by the cycle $(1,3)$. Both $H$ and $K$ have two elements, the generators\
+  \ and the identity. However $HK$ contains cycles (1,2), (1,3) and (2,3,1), but the\
+  \ inverse of (2,3,1) is (2,1,3) and it does not belong to HK, hence HK is not a\
+  \ subgroup. The answer is (B).\n\nQ: Find all c in Z_3 such that Z_3[x]/(x^2 + c)\
+  \ is a field.\n(A) 0 (B) 1 (C) 2 (D) 3\nA: Let's think step by step. Z_3[x]/(x^2\
+  \ + c) is a field if and only if x^2 + c does not have roots in Z_3. That is x^2\
+  \ + c != 0 for every x in Z_3. If c = 0, then x^2 + c = x^2 has root 0. If c = 1\
+  \ then x^2 + c = x^2 + 1 = 0 + 1 for x = 0, 1 + 1 = 2 for x = 1 and 1 + 1 = 2 for\
+  \ x = 2, hence x^2 + 1 does not have any roots. For c = 2 the polynomial x^2 + 2\
+  \ has two roots at x = 1 and x = 2. Hence Z_3[x]/(x^2 + c) is a field if and only\
+  \ if c = 1. The answer is (B)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_abstract_algebra
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_anatomy.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_anatomy.yaml
+dataset_name: anatomy
+description: "The following are multiple choice questions (with answers) about anatomy.\n\
+  \nQ: Which of the following is the body cavity that contains the pituitary gland?\n\
+  (A) Abdominal (B) Cranial (C) Pleural (D) Spinal\nA: Let's think step by step. We\
+  \ refer to Wikipedia articles on anatomy for help. Let\u2019s solve this problem\
+  \ step by step. The pituitary gland is the major endocrine gland attached to the\
+  \ base of the brain, and it is contained in the Cranial cavity. The answer is (B).\n\
+  \nQ: Which of these branches of the trigeminal nerve contain somatic motor processes?\n\
+  (A) The supraorbital nerve (B) The infraorbital nerve (C) The mental nerve (D) None\
+  \ of the above\nA: Let's think step by step. We refer to Wikipedia articles on anatomy\
+  \ for help. Let\u2019s solve this problem step by step. \nWe know the following:\
+  \ (A) The supraorbital nerve (also known as the frontal nerve) is the largest branch\
+  \ of the ophthalmic nerve and branch of ophthalmic division of the trigeminal nerve.\
+  \ (B) The infraorbital nerve is a branch of the maxillary division of the trigeminal\
+  \ nerve. (C) The mental nerve is a branch of the mandibular division of the trigeminal\
+  \ nerve. All these nerves are purely sensory nerves and do not contain any\
+  \ somatic motor processes. Therefore, the answer should be none of the above, which\
+  \ is (D). The answer is (D).\n\nQ: In Angle's Class II Div 2 occlusion there is\n\
+  (A) excess overbite of the upper lateral incisors. (B) negative overjet of the upper\
+  \ central incisors. (C) excess overjet of the upper lateral incisors. (D) excess\
+  \ overjet of the upper central incisors.\nA: Let's think step by step. We refer\
+  \ to Wikipedia articles on anatomy for help. Let\u2019s solve this problem step\
+  \ by step. This is a question related to anatomy and orthodontics. Excess overjet\
+  \ is associated with Class II occlusions; therefore, we can safely eliminate (B)\
+  \ from the list, as negative overjet is often associated with Class III occlusions.\
+  \ Now, we need to determine the location of the excess overjet, and that would be\
+  \ the upper (maxillary) lateral incisors. Only (C) has the correct information.\
+  \ The answer is (C).\n\nQ: The pleura\n(A) have no sensory innervation. (B) are\
+  \ separated by a 2 mm space. (C) extend into the neck. (D) are composed of respiratory\
+  \ epithelium.\nA: Let's think step by step. We refer to Wikipedia articles on anatomy\
+  \ for help. Let\u2019s solve this problem step by step. First, recall that the pleura\
+  \ refers to the thin layer of tissue that covers the lungs and lines the interior\
+  \ wall of the chest cavity. Now, let\u2019s look at each option:\nOption (A): \u201C\
+  The pleura have no sensory innervation.\u201D This information is not correct. The\
+  \ pleura do have a sensory innervation.\nOption (B): \u201CThe pleura are separated\
+  \ by a 2 mm space.\u201D This information is not correct. There is a very thin \u201C\
+  potential\u201D space between the layers of the pleura; however, it is typically\
+  \ filled with serous pleural fluid. \nOption (C): \u201CThe pleura extend into the\
+  \ neck.\u201D This information is actually true. The cervical pleura, also known\
+  \ as the dome of the pleura, lines the extension of the pleural\
+  \ cavity into the neck.\nOption (D): \u201CThe pleura are composed of respiratory\
+  \ epithelium.\u201D This information is not correct. The pleura are composed of\
+  \ connective tissue (CT).\nBecause (A), (B), and (D) are all incorrect, (C) is the\
+  \ only correct answer. The answer is (C).\n\nQ: What is the embryological origin\
+  \ of the hyoid bone?\n(A) The first pharyngeal arch (B) The first and second pharyngeal\
+  \ arches (C) The second pharyngeal arch (D) The second and third pharyngeal arches\n\
+  A: Let's think step by step. We refer to Wikipedia articles on anatomy for help.\
+  \ Let\u2019s solve this problem step by step. The hyoid bone, which is also known\
+  \ as the hyooid, is a small U-shaped bone located in the anterior neck. In its\
+  \ resting position, it lies between the base of the mandible and the third cervical\
+  \ vertebrae. We know that the second and the third pharyngeal arches give rise to\
+  \ the horns of the hyoid bone; therefore, the embryological origin of the hyoid\
+  \ bone are the second and the third pharyngeal arches\u2014this information is covered\
+  \ in the last option (D). Therefore, we conclude that (D) must be the correct answer.\
+  \ The answer is (D)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_anatomy
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_astronomy.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_astronomy.yaml
+dataset_name: astronomy
+description: "The following are multiple choice questions (with answers) about astronomy.\n\
+  \nQ: Where do most short-period comets come from and how do we know?\n(A) The Kuiper\
+  \ belt; short period comets tend to be in the plane of the solar system just like\
+  \ the Kuiper belt. (B) The Kuiper belt; short period comets tend to come from random\
+  \ directions indicating a spherical distribution of comets called the Kuiper belt.\
+  \ (C) The asteroid belt; short period comets have orbital periods similar to asteroids\
+  \ like Vesta and are found in the plane of the solar system just like the asteroid\
+  \ belt. (D) The Oort cloud; short period comets tend to be in the plane of the solar\
+  \ system just like the Oort cloud.\nA: Let's think step by step. Most short-period\
+  \ comets come from the Kuiper belt, and we know because short period comets tend\
+  \ to be in the plane of the solar system, just like the Kuiper belt is. The answer\
+  \ is (A).\n\nQ: You are pushing a truck along a road. Would it be easier to accelerate\
+  \ this truck on Mars? Why? (Assume there is no friction)\n(A) It would be harder\
+  \ since the truck is heavier on Mars. (B) It would be easier since the truck is\
+  \ lighter on Mars. (C) It would be harder since the truck is lighter on Mars. (D)\
+  \ It would be the same no matter where you are.\nA: Let's think step by step. If\
+  \ we assume that there is no friction, the force needed to accelerate the truck\
+  \ is by Newton\u2019s second law only dependent on the mass of the truck. Hence\
+  \ (A), (B) and (C) are incorrect since it doesn\u2019t matter that it\u2019s on\
+  \ Mars, and (D) is the correct answer. The answer is (D).\n\nQ: Say the pupil of\
+  \ your eye has a diameter of 5 mm and you have a telescope with an aperture of 50\
+  \ cm. How much more light can the telescope gather than your eye?\n(A) 10000 times\
+  \ more (B) 100 times more (C) 1000 times more (D) 10 times more\nA: Let's think\
+  \ step by step. The amount of light is proportional to the aperture area $A = \\\
+  pi D^2/4$ for a lens with diameter $D$, so the relative amounts of light between\
+  \ the eye with diameter 5mm and the telescope with diameter 50 cm is $(50 cm)^2/(5mm)^2\
+  \ = 10000$. The answer is (A).\n\nQ: Why isn't there a planet where the asteroid\
+  \ belt is located?\n(A) A planet once formed here but it was broken apart by a catastrophic\
+  \ collision. (B) There was not enough material in this part of the solar nebula\
+  \ to form a planet. (C) There was too much rocky material to form a terrestrial\
+  \ planet but not enough gaseous material to form a jovian planet. (D) Resonance\
+  \ with Jupiter prevented material from collecting together to form a planet.\nA:\
+  \ Let's think step by step. The asteroid belt is a stellar disc consisting of a\
+  \ large number of asteroids between Mars and Jupiter's orbits. The asteroids in\
+  \ this belt are affected by the gravitational pull from both other asteroids and\
+  \ nearby planets. Due to the strong gravitational force of Jupiter there are resonances\
+  \ that give rise to low density regions of asteroids known as the Kirkwood gap.\
+  \ So (B) and (C) are not correct since it\u2019s not a lack of material that prevents\
+  \ a planet from being formed, and (A) is incorrect because the Kirkwood gap would\
+  \ have prevented a planet from forming in the first place, and (D) is the correct\
+  \ option. The answer is (D).\n\nQ: Why is Mars red?\n(A) Because the surface is\
+  \ covered with heavily oxidized (\"rusted\") minerals. (B) Because the atmosphere\
+  \ scatters more light at bluer wavelengths transmitting mostly red light. (C) Because\
+  \ Mars is covered with ancient lava flows which are red in color. (D) Because flowing\
+  \ water on Mars's surface altered the surface minerals several billion years ago.\n\
+  A: Let's think step by step. Option (B) is not correct because if the red color\
+  \ was caused by the scattering off the atmosphere, then the earth with a much thicker\
+  \ atmosphere would also look red. Options (C) and (D) are not specific enough about\
+  \ why the color of the surface would be red, while (A) is correct because it explains\
+  \ that the surface is red due to the rusted materials on the surface and the red\
+  \ color comes from the rust. So the correct option is (A). The answer is (A)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_astronomy
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_business_ethics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_business_ethics.yaml
+dataset_name: business_ethics
+description: "The following are multiple choice questions (with answers) about business\
+  \ ethics.\n\nQ: In contrast to _______, _______ aim to reward favourable behaviour\
+  \ by companies. The success of such campaigns have been heightened through the use\
+  \ of ___________, which allow campaigns to facilitate the company in achieving _________\
+  \ .\n(A) Buycotts, Boycotts, Blockchain technology, Charitable donations (B) Buycotts,\
+  \ Boycotts, Digital technology, Increased Sales (C) Boycotts, Buyalls, Blockchain\
+  \ technology, Charitable donations (D) Boycotts, Buycotts, Digital technology, Increased\
+  \ Sales\nA: Let's think step by step. We refer to Wikipedia articles on business\
+  \ ethics for help. The sentence that best uses the possible options above is \u201C\
+  In contrast to *boycotts*, *buycotts* aim to reward favourable behavior by companies.\
+  \ The success of such campaigns have been heightened through the use of *digital\
+  \ technology*, which allow campaigns to facilitate the company in achieving *increased\
+  \ sales*.\u201D The answer is (D).\n\nQ: _______ is the direct attempt to formally\
+  \ or informally manage ethical issues or problems, through specific policies, practices\
+  \ and programmes.\n(A) Corporate social responsibility (B) Business ethics management\
+  \ (C) Sustainability (D) Environmental management\nA: Let's think step by step.\
+  \ We refer to Wikipedia articles on business ethics for help. The direct attempt\
+  \ to manage ethical issues through specific policies, practices, and programs is business\
+  \ ethics management. The answer is (B).\n\nQ: Three contrasting tactics that CSO's\
+  \ can engage in to meet their aims are ________ which typically involves research\
+  \ and communication, ________, which may involve physically attacking a company's\
+  \ operations or ________, often involving some form of _______.\n(A) Non-violent\
+  \ direct action, Violent direct action, Indirect action, Boycott (B) Indirect action,\
+  \ Instrumental action, Non-violent direct action, Information campaign (C) Indirect\
+  \ action, Violent direct action, Non-violent direct-action Boycott (D) Non-violent\
+  \ direct action, Instrumental action, Indirect action, Information campaign\nA:\
+  \ Let's think step by step. We refer to Wikipedia articles on business ethics for\
+  \ help. The sentence that best uses the possible options above is \u201CThree contrasting\
+  \ tactics that CSO's can engage in to meet their aims are *indirect action*, which\
+  \ typically involves research and communication, *violent direct action*, which\
+  \ may involve physically attacking a company's operations or *non-violent direct\
+  \ action*, often involving some form of *boycott*.\u201D The answer is (C).\n\n\
+  Q: To ensure the independence of the non-executive board members, there are a number\
+  \ of steps which can be taken, which include non-executives being drawn from _______\
+  \ the company, being appointed for a _________ time period as well as being appointed\
+  \ _________.\n(A) Outside, Limited, Independently (B) Inside, Limited, Intermittently\
+  \ (C) Outside, Unlimited, Intermittently (D) Inside, Unlimited, Independently\n\
+  A: Let's think step by step. We refer to Wikipedia articles on business ethics for\
+  \ help. The sentence that best uses the possible options above is \u201CTo ensure\
+  \ the independence of the non-executive board members, there are a number of steps\
+  \ which can be taken, which include non-executives being drawn from *outside* the\
+  \ company, being appointed for a *limited* time period as well as being appointed\
+  \ *independently*.\u201D The answer is (A).\n\nQ: Beyond the business case for engaging\
+  \ in CSR there are a number of moral arguments relating to: negative _______, the\
+  \ _______that corporations possess and the ________ of business and society.\n(A)\
+  \ Externalities, Power, Independence (B) Publicity, Insubstantial resources, Mutual\
+  \ dependence (C) Publicity, Power, Independence (D) Externalities, Power, Mutual\
+  \ dependence\nA: Let's think step by step. We refer to Wikipedia articles on business\
+  \ ethics for help. The sentence that best uses the possible options above is \u201C\
+  Beyond the business case for engaging in CSR there are a number of moral arguments\
+  \ relating to: negative *externalities*, the *power* that corporations possess and\
+  \ the *mutual dependence* of business and society.\u201D The answer is (D)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_business_ethics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_clinical_knowledge.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_clinical_knowledge.yaml
+dataset_name: clinical_knowledge
+description: 'The following are multiple choice questions (with answers) about clinical
+  knowledge.
+  Q: Glycolysis is the name given to the pathway involving the conversion of:
+  (A) glycogen to glucose-1-phosphate. (B) glycogen or glucose to fructose. (C) glycogen
+  or glucose to pyruvate or lactate. (D) glycogen or glucose to pyruvate or acetyl
+  CoA.
+  A: Let''s think step by step. We refer to Wikipedia articles on clinical knowledge
+  for help. Glycolysis is the name given to the pathway involving conversion of glycogen
+  or glucose to pyruvate or lactate. The answer is (C).
+  Q: What is the difference between a male and a female catheter?
+  (A) Male and female catheters are different colours. (B) Male catheters are longer
+  than female catheters. (C) Male catheters are bigger than female catheters. (D)
+  Female catheters are longer than male catheters.
+  A: Let''s think step by step. We refer to Wikipedia articles on clinical knowledge
+  for help. The difference between a male and female catheter is that male catheters
+  tend to be longer than female catheters. The answer is (B).
+  Q: How many attempts should you make to cannulate a patient before passing the job
+  on to a senior colleague, according to the medical knowledge of 2020?
+  (A) 4 (B) 3 (C) 2 (D) 1
+  A: Let''s think step by step. We refer to Wikipedia articles on clinical knowledge
+  for help. According to the medical protocol as of 2020, you should make two attempts
+  to cannulate a patient before passing the job on to a more-senior practitioner.
+  The answer is (C).
+  Q: In the assessment of the hand function which of the following is true?
+  (A) Abduction of the thumb is supplied by spinal root T2 (B) Opposition of the thumb
+  by opponens policis is supplied by spinal root T1 (C) Finger adduction is supplied
+  by the median nerve (D) Finger abduction is mediated by the palmar interossei
+  A: Let''s think step by step. We refer to Wikipedia articles on clinical knowledge
+  for help. Of all the options, it is only true that the opposition of the thumb by
+  opponens pollicis is supplied by spinal root T1. The answer is (B).
+  Q: The energy for all forms of muscle contraction is provided by:
+  (A) ATP. (B) ADP. (C) phosphocreatine. (D) oxidative phosphorylation.
+  A: Let''s think step by step. We refer to Wikipedia articles on clinical knowledge
+  for help. The energy for muscular contraction is provided by ATP (adenosine triphosphate),
+  which is the powerhouse of the cell. The answer is (A).'
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_clinical_knowledge
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_biology.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_biology.yaml
+dataset_name: college_biology
+description: "The following are multiple choice questions (with answers) about college\
+  \ biology.\n\nQ: Which of the following represents an accurate statement concerning\
+  \ arthropods?\n(A) They possess an exoskeleton composed primarily of peptidoglycan.\
+  \ (B) They possess an open circulatory system with a dorsal heart. (C) They are\
+  \ members of a biologically unsuccessful phylum incapable of exploiting diverse\
+  \ habitats and nutrition sources. (D) They lack paired, jointed appendages.\nA:\
+  \ Let's think step by step. Peptidoglycan is known to comprise the plasma membrane\
+  \ of most bacteria, rather than the exoskeleton of arthropods, which is made of\
+  \ chitin, which rules out (A). The answer (C) is false because arthropods are a\
+  \ highly successful phylum. Likewise, arthropods have paired, jointed appendages,\
+  \ which rules out (D). The only remaining option is (B), as arthropods have an open\
+  \ circulatory system with a dorsal tubular heart. The answer is (B).\n\nQ: In a\
+  \ given population, 1 out of every 400 people has a cancer caused by a completely\
+  \ recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium,\
+  \ which of the following is the expected proportion of individuals who carry the\
+  \ b allele but are not expected to develop the cancer?\n(A) 1/400 (B) 19/400 (C)\
+  \ 20/400 (D) 38/400\nA: Let's think step by step. According to the Hardy Weinberg\
+  \ Law, $p^2 + 2 p q + q^2 = 1$, and $p + q = 1$ where $p$ is the frequency of the\
+  \ dominant allele, $q$ is the frequency of the recessive allele, and $p^2$, $q^2$,\
+  \ and $2pq$ are the frequencies of dominant homozygous, recessive homozygous, and\
+  \ heterozygous individuals, respectively. The frequency of the recessive allele\
+  \ (q) is $\\sqrt{\\frac{1}{400}} = 0.05$. We have $p = 1 - q = 0.95$. The frequency\
+  \ of heterozygous individuals is $2pq = 2 \\cdot 0.05 \\cdot 0.95 = 0.095$. The\
+  \ number of heterozygous individuals is equal to the frequency of heterozygous individuals\
+  \ times the size of the population, or $0.095 * 400 = 38$. So we end up with 38/400.\
+  \ The answer is (D).\n\nQ: According to the pressure-flow model of movement of phloem\
+  \ contents, photosynthate movement from source to sink is driven by\n(A) an ATP-dependent\
+  \ pressure-flow pump (B) a water-pressure potential gradient (C) transpiration (D)\
+  \ apoplastic diffusion\nA: Let's think step by step. It is a gradient in water pressure\
+  \ that induces the movement of phloem content, which refers to answer (B). The mechanism\
+  \ of movement does not rely on metabolism, which rules out (A). Transpiration refers\
+  \ to the exhalation of water vapor through plant stomata, and is also not related,\
+  \ which rules out (C). While the apoplastic pathway is one of two main pathways\
+  \ for water transport in plants, it is not central to the pressure flow model, which\
+  \ rules out (D). The answer is (B).\n\nQ: Which of the following contain DNA sequences\
+  \ required for the segregation of chromosomes in mitosis and meiosis?\n(A) Telomeres\
+  \ (B) Centromeres (C) Nucleosomes (D) Spliceosomes\nA: Let's think step by step.\
+  \ The genetic material in Telomeres is not used, which rules out (A). Nucleosomes\
+  \ are the repeating subunit that comprises chromatin packed in a cell nucleus, and\
+  \ do not specifically refer to DNA sequences necessary for segregating chromosomes\
+  \ in cell division, which rules out (C). A spliceosome is a large ribonucleoprotein\
+  \ that removes introns from transcribed pre-mRNA rather than governing chromosome\
+  \ segregation. Centromeres are directly responsible for segregating chromosomes\
+  \ in cell division. The answer is (B).\n\nQ: The presence of homologous structures\
+  \ in two different organisms, such as the humerus in the front limb of a human and\
+  \ a bird, indicates that\n(A) the human and bird are polyphyletic species (B) a\
+  \ human's and bird's evolution is convergent (C) the human and bird belong to a\
+  \ clade (D) the human and bird developed by analogy\nA: Let's think step by step.\
+  \ Polyphyletic species are organisms that are grouped due to having similar characteristics\
+  \ but which do not have a common ancestor. This is not the case for humans and birds,\
+  \ which rules out (A). Convergent evolution refers to the independent development\
+  \ of similar features in different species at different periods, which is also not\
+  \ the case for humans and birds, which rules out (B). Analogy refers to the superficial\
+  \ resemblance of structures that have different origins, which is not the case for\
+  \ the human and bird forearms, which rules out (D). Humans and birds do belong to\
+  \ the same clade - a group of organisms composed of a common ancestor. The answer\
+  \ is (C)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_biology
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_chemistry.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_chemistry.yaml
+dataset_name: college_chemistry
+description: "The following are multiple choice questions (with answers) about college\
+  \ chemistry.\n\nQ: 3 Cl\u2212(aq) + 4 CrO_4^2\u2212(aq) + 23 H+(aq) \u2192 3 HClO2(aq)\
+  \ + 4 Cr3+(aq) + 10 H2O(l). In the reaction shown above, Cl\u2212(aq) behaves as\n\
+  (A) an acid (B) a base (C) a catalyst (D) a reducing agent\nA: Let's think step\
+  \ by step. A molecule that behaves as a base accepts an H+ ion (or proton) from\
+  \ another molecule, whereas a molecule that behaves as an acid donates an H+ ion\
+  \ (or proton) to another molecule. Neither of these is the case for Cl in this reaction,\
+  \ which rules out (A) and (B). A catalyst is a substance that only accelerates a\
+  \ reaction without itself undergoing chemical change, which is not the case here.\
+  \ This rules out (C). Instead, the $Cl^{-}$ ions carry a negative charge, which\
+  \ they donate in the reaction to form 3 HClO2. This is the behavior of a reducing\
+  \ agent, or (D). The answer is (D).\n\nQ: Which of the following statements about\
+  \ the lanthanide elements is NOT true?\n(A) The most common oxidation state for\
+  \ the lanthanide elements is +3. (B) Lanthanide complexes often have high coordination\
+  \ numbers (> 6). (C) All of the lanthanide elements react with aqueous acid to liberate\
+  \ hydrogen. (D) The atomic radii of the lanthanide elements increase across the\
+  \ period from La to Lu.\nA: Let's think step by step. The atomic radii of the lanthanide\
+  \ elements in fact decrease across the period from La to Lu. Options (A), (B), and\
+  \ (C) are all true. This means that only (D) is NOT true. The answer is (D).\n\n\
+  Q: Which of the following lists the hydrides of group-14 elements in order of thermal\
+  \ stability, from lowest to highest?\n(A) PbH4 < SnH4 < GeH4 < SiH4 < CH4 (B) PbH4\
+  \ < SnH4 < CH4 < GeH4 < SiH4 (C) CH4 < SiH4 < GeH4 < SnH4 < PbH4 (D) CH4 < PbH4\
+  \ < GeH4 < SnH4 < SiH4\nA: Let's think step by step. The thermal stability of group-14\
+  \ hydrides decreases as we move from the top of group 14 to the bottom. The order\
+  \ of elements in the group from top to bottom is C, Si, Ge, Sn, Pb. Therefore in\
+  \ order of increasing thermal stability we have PbH4, SnH4, GeH4, SiH4, and CH4,\
+  \ or answer (A). The answer is (A).\n\nQ: Predict the number of lines in the EPR\
+  \ spectrum of a solution of 13C-labelled methyl radical (13CH3\u2022), assuming\
+  \ the lines do not overlap.\n(A) 4 (B) 3 (C) 6 (D) 24 (E) 8\nA: Let's think step\
+  \ by step. The electron paramagnetic resonance spectrum will be split by two forms\
+  \ of interactions. The first is the hyperfine interaction with the 13C (nuclear\
+  \ spin $I = \\frac{1}{2}$) which will split the spectrum into 2 lines. This will\
+  \ be further split into 4 lines by the interaction with three equivalent 1H nuclei.\
+  \ The total number of lines is therefore $2 \\cdot 4 = 8$. The answer is (E)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_chemistry
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_computer_science.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_computer_science.yaml
+dataset_name: college_computer_science
+description: 'The following are multiple choice questions (with answers) about college
+  computer science.
+  Q: Which of the following regular expressions is equivalent to (describes the same
+  set of strings as) (a* + b)*(c + d)?
+  (A) a*(c + d)+ b(c + d)
+  (B) a*(c + d)* + b(c + d)*
+  (C) a*(c + d)+ b*(c + d)
+  (D) (a + b)*c +(a + b)*d
+  A: Let''s think step by step. We know that:
+  1. (X* + Y)* = (X + Y)*
+  2. X(Y + Z) = XY + XZ
+  Using equation 1 we can rewrite (a* + b)*(c + d) as:
+  3. (a + b)*(c + d)
+  Using equation 2 we can rewrite equation 3 as:
+  (a + b)*c + (a + b)*d The answer is (D).
+  Q: The Singleton design pattern is used to guarantee that only a single instance
+  of a class may be instantiated. Which of the following is (are) true of this design
+  pattern?
+  I. The Singleton class has a static factory method to provide its instance.
+  II. The Singleton class can be a subclass of another class.
+  III. The Singleton class has a private constructor.
+  (A) I only
+  (B) II only
+  (C) III only
+  (D) I, II, and III
+  A: Let''s think step by step. Statement I is a correct statement about a Singleton,
+  because a Singleton restricts instantiation to a single, static method. Statement
+  II is also correct, because there is no inherent restriction regarding the inheritance
+  of a Singleton. Statement III is also correct, because a Singleton must be instantiated
+  only once, so its constructor is made private to prevent any construction except
+  via its static factory method.
+  Given these facts, statements I, II, and III are all correct. The answer is (D).
+  Q: A certain pipelined RISC machine has 8 general-purpose registers R0, R1, . .
+  . , R7 and supports the following operations:
+  ADD Rs1, Rs2, Rd (Add Rs1 to Rs2 and put the sum in Rd)
+  MUL Rs1, Rs2, Rd (Multiply Rs1 by Rs2 and put the product in Rd)
+  An operation normally takes one cycle; however, an operation takes two cycles if
+  it produces a result required by the immediately following operation in an operation
+  sequence.
+  Consider the expression AB + ABC + BC, where variables A, B, C are located in registers
+  R0, R1, R2. If the contents of these three registers must not be modified, what
+  is the minimum number of clock cycles required for an operation sequence that computes
+  the value of AB + ABC + BC?
+  (A) 5 (B) 6 (C) 7 (D) 8
+  A: Let''s think step by step. First, we are given that A is in R0, B is in R1, and
+  C is in R2.
+  Next, we can see that we must compute three multiplies (AB, BC, and ABC) and two
+  adds (AB + ABC, (AB + ABC) + BC) to compute our final answer, resulting in a minimum
+  of five clock cycles.
+  Next, we can see that there is no way to avoid at least one pipeline stall when
+  computing our final answer, because to compute our final sum we must wait at least
+  one cycle for the results from the previous stage to be ready. Thus, our minimum
+  number of cycles must be 6.
+  We can verify that we can create a solution that requires only six cycles as follows:
+  compute AB: MUL R0, R1, R3
+  compute BC: MUL R1, R2, R4
+  compute ABC: MUL R3, R2, R5
+  compute AB + BC: ADD R3, R4, R6
+  STALL
+  compute AB + ABC + BC: ADD R5, R6, R7
+  So there are 6 cycles. The answer is (B).
+  Q: A compiler generates code for the following assignment statement.
+  G := (A + B) * C - (D + E) * F
+  The target machine has a single accumulator and a single-address instruction set
+  consisting of instructions load, store, add, subtract, and multiply. For the arithmetic
+  operations, the left operand is taken from the accumulator and the result appears
+  in the accumulator. The smallest possible number of instructions in the resulting
+  code is
+  (A) 5 (B) 6 (C) 7 (D) 9
+  A: Let''s think step by step. We can compute the final answer with the following
+  sequence of operations:
+  1. LOAD D  (accumulator = D)
+  2. ADD E  (accumulator = D+E)
+  3. MUL F  (accumulator = (D+E)*F)
+  4. STORE X (X = (D+E)*F)
+  5. LOAD A  (accumulator = A)
+  6. ADD B  (accumulator = A+B)
+  7. MUL C  (accumulator = (A+B)*C)
+  8. SUB X  (accumulator = (A+B)*C - (D+E)*F)
+  9. STORE G (G = (A+B)*C - (D+E)*F)
+  This sequence takes 9 instructions. The answer is (D).
+  Q: Consider a computer design in which multiple processors, each with a private
+  cache memory, share global memory using a single bus. This bus is the critical system
+  resource. Each processor can execute one instruction every 500 nanoseconds as long
+  as memory references are satisfied by its local cache. When a cache miss occurs,
+  the processor is delayed for an additional 2,000 nanoseconds. During half of this
+  additional delay, the bus is dedicated to serving the cache miss. During the other
+  half, the processor cannot continue, but the bus is free to service requests from
+  other processors. On average, each instruction requires 2 memory references. On
+  average, cache misses occur on 1 percent of references. What proportion of the capacity
+  of the bus would a single processor consume, ignoring delays due to competition
+  from other processors?
+  (A) 1/50 (B) 1/27 (C) 1/25 (D) 2/27
+  A: Let''s think step by step. We know that each instruction requires two memory
+  references, and that there is an average cache miss rate of one
+  percent.
+  Thus a given processor has:
+  (1 cache miss / 100 references) * (2 references / instruction) =
+  (2 cache misses / 100 instructions), so:
+  misses_per_instruction = 1 cache miss / 50 instructions.
+  Next, we know that each instruction requires 500 nanoseconds when there is no cache
+  miss, and 500 + 2000 = 2500 nanoseconds when there is a cache miss. Thus:
+  50 instructions / ((49 * 500) + (1 * 2500)) nanoseconds, so:
+  instructions_per_ns = 50 instructions / 27000 nanoseconds.
+  Now, we know that each cache miss locks the bus for half of the 2000 nanosecond
+  cache miss delay, or 1000 nanoseconds, so:
+  lock_ns_per_miss = 1000 nanoseconds / cache miss.
+  Thus we can see that on average a single processor will lock the bus for:
+  lock_ns_per_miss * misses_per_instruction * instructions_per_ns =
+  (1000 nanoseconds / cache miss) * (1 cache miss / 50 instructions) * (50 instructions
+  / 27000 nanoseconds) = 1000 * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer
+  is (B).'
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_computer_science
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_mathematics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_mathematics.yaml
+dataset_name: college_mathematics
+description: "The following are multiple choice questions (with answers) about college\
+  \ mathematics.\n\nQ: Let V be the set of all real polynomials p(x). Let transformations\
+  \ T, S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p'(x) = d/dx p(x), and interpret\
+  \ (ST)(p(x)) as S(T(p(x))). Which of the following is true?\n(A) ST = 0 (B) ST =\
+  \ T (C) ST = TS (D) ST - TS is the identity map of V onto itself.\nA: Let's think\
+  \ step by step. For a given polynomial $p$ we have\n\\[ST(p) = (xp(x))\u2019 = p(x)\
+  \ + xp\u2019(x)\\]\nand\n\\[TS(p) = xp\u2019(x).\\]\nHence \\[ST(p) - TS(p) = p(x)\
+  \ + xp\u2019(x) - xp\u2019(x) = p(x),\\] so ST - TS is the identity map. The answer is (D).\n\nQ: Suppose that f(1 + x)\
+  \ = f(x) for all real x. If f is a polynomial and f(5) = 11, then f(15/2)\n(A) -11\
+  \ (B) 0 (C) 11 (D) 33/2\nA: Let's think step by step. The only polynomial so that\
+  \ $f(1 + x) = f(x)$ is a constant polynomial. Hence $f(5) = 11 = f(15/2)$. The answer\
+  \ is (C).\n\nQ: Let A be a real 2x2 matrix. Which of the following statements must\
+  \ be true?\nI. All of the entries of A^2 are nonnegative.\nII. The determinant of\
+  \ A^2 is nonnegative.\nIII. If A has two distinct eigenvalues, then A^2 has two\
+  \ distinct eigenvalues.\n(A) I only (B) II only (C) III only (D) II and III only\n\
+  A: Let's think step by step. We have \\[ det(A^2) = (det(A))^2 \\geq 0,\\] hence\
+  \ II holds.\nIII is false: as a counterexample take a diagonal matrix with -1 and\
+  \ 1 on the diagonal. Then $A^2$ is the identity matrix. The answer is (B).\n\nQ:\
+  \ Let A be the set of all ordered pairs of integers (m, n) such that 7m + 12n =\
+  \ 22. What is the greatest negative number in the set B = {m + n : (m, n) \\in A}?\n\
+  (A) -5 (B) -4 (C) -3 (D) -2\nA: Let's think step by step. We have 12n = 22 - 7m\
+  \ and one of the solutions is $m = -2$, $n = 3$. Then $m + n = 1$, hence we need\
+  \ to look for smaller $m$ in order to make $m + n$ negative. The next solution is\
+  \ $m = -14$ and $n = 10$. For smaller $m$ we have $m + n$ smaller than $-4$. The\
+  \ answer is (B).\n\nQ: A tank initially contains a salt solution of 3 grams of salt\
+  \ dissolved in 100 liters of water. A salt solution containing 0.02 grams of salt\
+  \ per liter of water is sprayed into the tank at a rate of 4 liters per minute.\
+  \ The sprayed solution is continually mixed with the salt solution in the tank,\
+  \ and the mixture flows out of the tank at a rate of 4 liters per minute. If the\
+  \ mixing is instantaneous, how many grams of salt are in the tank after 100 minutes\
+  \ have elapsed?\n(A) 2 (B) 2 - e^-2 (C) 2 + e^-2 (D) 2 + e^-4\nA: Let's think step\
+  \ by step. For all $t \\in \\mathbb{R}$, let $s(t)$ denote the number of grams of salt\
+  \ in the tank at the $t$ minute mark. Then $s(0) = 3$.\nWe use $s$ and $s(t)$ interchangeably.\
+  \ We also use $s^{\\prime}$ and $s^{\\prime}(t)$ interchangeably. The solution sprayed\
+  \ into the tank adds $(0.02) 4=2 / 25$ grams of salt per minute. There are always\
+  \ 100 liters of liquid in the tank, containing $s$ grams of salt. So the density\
+  \ of salt in the tank is $s / 100$ grams per liter. The flow of water out of the\
+  \ tank therefore subtracts $4(s / 100)=s / 25$ grams of salt per minute. Then, for\
+  \ all $t \\in \\mathbb{R}$, we have $s^{\\prime}(t)=(2 / 25)-(s / 25)=(2-s) / 25$,\
+  \ and so $[s(t)=2] \\Rightarrow\\left[s^{\\prime}(t)=0\\right]$. For all $t \\in\
+  \ \\mathbb{R}$,\n$$\n\\frac{d}{d t}[\\ln (s-2)]=\\frac{s^{\\prime}}{s-2}=\\frac{-1}{25}=\\frac{d}{d\
+  \ t}\\left[-\\frac{t}{25}\\right] .\n$$\nChoose $C \\in \\mathbb{R}$ such that,\
+  \ for all $t \\in \\mathbb{R}, \\ln ((s(t)-2))=-[t / 25]+C$. Let $K:=e^{C}$. Then,\
+  \ for all $t \\in \\mathbb{R}$, we have $(s(t))-2=K e^{-t / 25}$, and so $s(t)=2+K\
+  \ e^{-t / 25}$. Then $3=s(0)=2+K e^{0}=2+K$, so $K=1$. Then $s(100)=2+K e^{-100\
+  \ / 25}=2+1 \\cdot e^{-4}=2+e^{-4}$. The answer is (D)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_mathematics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_medicine.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_medicine.yaml
+dataset_name: college_medicine
+description: "The following are multiple choice questions (with answers) about college\
+  \ medicine.\n\nQ: An expected side effect of creatine supplementation is:\n(A) muscle\
+  \ weakness. (B) gain in body mass. (C) muscle cramps. (D) loss of electrolytes.\n\
+  A: Let's think step by step. We refer to Wikipedia articles on medicine for help.\
+  \ Creatine supplementation is a dietary supplement that results in body mass gain.\
+  \ The answer is (B).\n\nQ: Which of the following is not a true statement?\n(A)\
+  \ Muscle glycogen is broken down enzymatically to glucose-1-phosphate (B) Elite\
+  \ endurance runners have a high proportion of Type I fibres in their leg muscles\
+  \ (C) Liver glycogen is important in the maintenance of the blood glucose concentration\
+  \ (D) Insulin promotes glucose uptake by all tissues in the body\nA: Let's think\
+  \ step by step. We refer to Wikipedia articles on medicine for help. Let\u2019s\
+  \ solve this step by step and go over each choice: \n(A) \u201CMuscle glycogen is\
+  \ broken down enzymatically to glucose-1-phosphate\u201D: This is a correct statement.\n\
+  (B) \u201CElite endurance runners have a high proportion of Type I fibres in their\
+  \ leg muscles\u201D: This is a correct statement.\n(C) \u201CLiver glycogen is important\
+  \ in the maintenance of the blood glucose concentration\u201D: This is a correct\
+  \ statement. \n(D) \u201CInsulin promotes glucose uptake by all tissues in the body\u201D\
+  : This is not a correct statement, because insulin promotes glucose uptake by the\
+  \ liver, adipose tissue, and muscle, but not all tissues. For instance, the tissues\
+  \ in the brain and red blood cells are not affected by insulin. The answer is (D).\n\
+  \nQ: A high school science teacher fills a 1 liter bottle with pure nitrogen and\
+  \ seals the lid. The pressure is 1.70 atm, and the room temperature is 25\xB0C.\
+  \ Which two variables will both increase the pressure of the system, if all other\
+  \ variables are held constant?\n(A) Increasing temperature, increasing moles of\
+  \ gas (B) Increasing temperature, increasing volume (C) Decreasing volume, decreasing\
+  \ temperature (D) Decreasing moles of gas, increasing volume\nA: Let's think step\
+  \ by step. We refer to Wikipedia articles on medicine for help. The relevant equation\
+  \ for this is the ideal gas law: PV=nRT. To increase the pressure of the system\
+  \ (P), either n (number of moles of the gas) or T (temperature) has to increase.\
+  \ The answer is (A).\n\nQ: In a genetic test of a newborn, a rare genetic disorder\
+  \ is found that has X-linked recessive transmission. Which of the following statements\
+  \ is likely true regarding the pedigree of this disorder?\n(A) All descendants on\
+  \ the maternal side will have the disorder. (B) Females will be approximately twice\
+  \ as affected as males in this family. (C) All daughters of an affected male will\
+  \ be affected. (D) There will be equal distribution of males and females affected.\n\
+  A: Let's think step by step. We refer to Wikipedia articles on medicine for help.\
+  \ Let\u2019s solve this step by step. Let's recall first that females have two X\
+  \ chromosomes, while males have one X and one Y chromosome. This is an important\
+  \ fact we need to know before answering this question. \nBecause a male can pass\
+  \ only his single X chromosome to a daughter, if he is affected by this rare genetic\
+  \ disorder, then we know for sure that he will pass this rare genetic disorder to\
+  \ all his future-born daughters. Therefore, \u201C(C): All daughters of an affected\
+  \ male will be affected\u201D is a correct statement. The answer is (C).\n\nQ: Glucose\
+  \ is transported into the muscle cell:\n(A) via protein transporters called GLUT4.\
+  \ (B) only in the presence of insulin. (C) via hexokinase. (D) via monocarboxylic\
+  \ acid transporters.\nA: Let's think step by step. We refer to Wikipedia articles\
+  \ on medicine for help. Glucose (also known as blood sugar) is the main sugar\
+  \ found in the human body. It is transported into the muscle cell via diffusion\
+  \ through protein transporters called GLUT4. The answer is (A)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_medicine
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_physics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_physics.yaml
+dataset_name: college_physics
+description: 'The following are multiple choice questions (with answers) about college
+  physics.
+  Q: A refracting telescope consists of two converging lenses separated by 100 cm.
+  The eye-piece lens has a focal length of 20 cm. The angular magnification of the
+  telescope is
+  (A) 4 (B) 5 (C) 6 (D) 20
+  A: Let''s think step by step. In a refracting telescope, if both lenses are converging,
+  the focus of both lenses must be between the two lenses, and thus the focal lengths
+  of the two lenses must add up to their separation. Since the focal length of one
+  lens is 20 cm, the focal length of the other must be 80 cm. The magnification is
+  the ratio of these two focal lengths, or 4. The answer is (A).
+  Q: The muon decays with a characteristic lifetime of about 10^-6 second into an
+  electron, a muon neutrino, and an electron antineutrino. The muon is forbidden from
+  decaying into an electron and just a single neutrino by the law of conservation
+  of
+  (A) charge (B) mass (C) energy and momentum (D) lepton number
+  A: Let''s think step by step. Lepton number must be conserved, meaning the total
+  number of leptons minus the number of antileptons. If a muon decays into an electron
+  and a single neutrino, the total lepton number would go from one to two, violating
+  lepton number conservation. The answer is (D).
+  Q: One end of a Nichrome wire of length 2L and cross-sectional area A is attached
+  to an end of another Nichrome wire of length L and cross-sectional area 2A. If
+  the free end of the longer wire is at an electric potential of 8.0 volts, and the
+  free end of the shorter wire is at an electric potential of 1.0 volt, the potential
+  at the junction of the two wires is most nearly equal to
+  (A) 2.4 V (B) 3.3 V (C) 4.5 V (D) 5.7 V
+  A: Let''s think step by step. This is a simple voltage divider problem, where the
+  longer wire has a resistance four times that of the shorter wire. So the voltage
+  divider ratio is 1 / 5, meaning that the potential in the middle is 1.0 V + (8.0
+  V - 1.0 V) * 1/5 = 2.4 V. The answer is (A).
+  Q: For which of the following thermodynamic processes is the increase in the internal
+  energy of an ideal gas equal to the heat added to the gas?
+  (A) Constant temperature (B) Constant volume (C) Constant pressure (D) Adiabatic
+  A: Let''s think step by step. Heat added to the gas can go into the gas''s internal
+  energy or work done against an external force. However, if the volume of the gas
+  container is constant, no work will be done (since work is pressure times change
+  in volume). So, at constant volume, all of the heat goes into the internal energy.
+  The answer is (B).'
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_physics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_computer_security.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_computer_security.yaml
+dataset_name: computer_security
+description: "The following are multiple choice questions (with answers) about computer\
+  \ security.\n\nQ: SHA-1 has a message digest of\n(A) 160 bits (B) 512 bits (C) 628\
+  \ bits (D) 820 bits\nA: Let's think step by step. Since SHA-1 is a hash function\
+  \ which takes an input and produces a 160-bit (20-byte) hash value, its message\
+  \ digest is 160 bits. The answer is (A).\n\nQ: _____________ can modify data on\
+  \ your system \u2013 so that your system doesn\u2019t run correctly or you can no\
+  \ longer access specific data, or it may even ask for ransom in order to give you\
+  \ access.\n(A) IM \u2013 Trojans (B) Backdoor Trojans (C) Trojan-Downloader (D)\
+  \ Ransom Trojan\nA: Let's think step by step. The question describes a Trojan\
+  \ that demands a ransom, which is a Ransom Trojan. The answer is (D).\n\nQ: What\
+  \ is ethical hacking?\n(A) \"Hacking\" ethics so they justify unintended selfish\
+  \ behavior (B) Hacking systems (e.g., during penetration testing) to expose vulnerabilities\
+  \ so they can be fixed, rather than exploited (C) Hacking into systems run by those\
+  \ whose ethics you disagree with (D) A slang term for rapid software development,\
+  \ e.g., as part of hackathons\nA: Let's think step by step. Ethical hacking is a\
+  \ process of detecting vulnerabilities in an application, system, or organization's\
+  \ infrastructure that an attacker can use to exploit an individual or organization.\
+  \ They use this process to prevent cyberattacks and security breaches by lawfully\
+  \ hacking into the systems and looking for weak points. The answer is (B).\n\nQ:\
+  \ The ____________ is anything which your search engine cannot search.\n(A) Haunted\
+  \ web (B) World Wide Web (C) Surface web (D) Deep Web\nA: Let's think step by step.\
+  \ The search engine searches on the Surface Web, which is the portion of the world\
+  \ wide web which is visible so (B,C) are wrong. The Haunted Web doesn\u2019t correspond\
+  \ to an internet concept. The Deep Web is the part of the World Wide Web which is\
+  \ not indexed. The answer is (D).\n\nQ: Exploitation of the Heartbleed bug permits\n\
+  (A) overwriting cryptographic keys in memory (B) a kind of code injection (C) a\
+  \ read outside bounds of a buffer (D) a format string attack\nA: Let's think step\
+  \ by step. The Heartbleed Bug is a serious vulnerability in the popular OpenSSL\
+  \ cryptographic software library. Heartbleed resulted from improper input validation\
+  \ (due to a missing bounds check) in the implementation of the TLS heartbeat extension.\
+  \ The vulnerability was classified as a buffer over-read, a situation where more\
+  \ data can be read than should be allowed. The answer is (C)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_computer_security
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_conceptual_physics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_conceptual_physics.yaml
+dataset_name: conceptual_physics
+description: "\nThe following are multiple choice questions (with answers) about conceptual\
+  \ physics.\n\nQ: Colors in a soap bubble result from light\n(A) converted to a different\
+  \ frequency (B) deflection (C) interference (D) polarization\nA: Let's think step\
+  \ by step. In a soap bubble film, the light bounces between the two soap-air interfaces\
+  \ many times, interfering with itself constructively or destructively depending\
+  \ on the width of the film. This results in different colors being visible. The\
+  \ answer is (C).\n\nQ: Compared with the mass of a uranium atom undergoing fission,\
+  \ the combined masses of the products after fission are\n(A) less (B) more (C) the\
+  \ same (D) zero\nA: Let's think step by step. Fission releases energy, which comes\
+  \ from the rest mass of its initial nucleus. Thus the mass of the products is less\
+  \ than the mass of the reactant uranium nucleus. The answer is (A).\n\nQ: Things\
+  \ that are equivalent according to the equivalence principle are\n(A) space and\
+  \ time. (B) a traveling twin and a stay-at-home twin. (C) gravity and acceleration.\
+  \ (D) mass and energy.\nA: Let's think step by step. Einstein\u2019s famous equivalence\
+  \ principle states that gravity and acceleration are equivalent. The answer is (C).\n\
+  \nQ: Which of these three elements has the most mass per nucleon?\n(A) Hydrogen\
+  \ (B) Iron (C) Uranium (D) Same in each\nA: Let's think step by step. Due to nuclear\
+  \ binding energy, the mass of an atomic nucleus is less than the sum of individual\
+  \ masses of the free constituent protons and neutrons; this is known as the mass\
+  \ defect. Hydrogen has no mass defect because it has only a single nucleon, so it\
+  \ will have the most mass per nucleon. The answer is (A).\n\nQ: A model airplane\
+  \ flies slower when flying into the wind and faster with wind at its back. When\
+  \ launched at right angles to the wind, a cross wind, its groundspeed compared with\
+  \ flying in still air is\n(A) the same (B) greater (C) less (D) either greater or\
+  \ less depending on wind speed\nA: Let's think step by step. The plane\u2019s speed\
+  \ in the direction of the wind is greater than it would be in the absence of wind,\
+  \ and its speed in the direction orthogonal to the wind is the same as it would be in the absence\
+  \ of the wind. The total speed, which is these two components added in quadrature,\
+  \ is thus greater than the speed in still air. The answer is (B)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_conceptual_physics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_econometrics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_econometrics.yaml
+dataset_name: econometrics
+description: "The following are multiple choice questions (with answers) about econometrics.\n\
+  \nQ: Suppose now that a researcher wishes to use information criteria to determine\
+  \ the optimal lag length for a VAR. 500 observations are available for the bi-variate\
+  \ VAR, and the values of the determinant of the variance-covariance matrix of residuals\
+  \ are 0.0336, 0.0169, 0.0084, and 0.0062 for 1, 2, 3, and 4 lags respectively. What\
+  \ is the optimal model order according to Akaike's information criterion?\n(A) 1\
+  \ lag (B) 2 lags (C) 3 lags (D) 4 lags\nA: Let's think step by step. We refer to\
+  \ Wikipedia articles on econometrics for help. Let\u2019s solve this problem step\
+  \ by step. First of all, let\u2019s recall that for a given set of data, Akaike's\
+  \ information criterion (AIC) allows us to measure how well a statistical model\
+  \ fits the data; it is an estimator of prediction error. Here in this problem we\
+  \ will need to use the formula ln(det(sigma_hat)) + (2 * k / T) to determine the\
+  \ values of Akaike\u2019s criterion, where ln denotes the natural log function,\
+  \ det the determinant function, k the total number of parameters (across\
+  \ both equations), and T the number of observations (which, in this case, is equal\
+  \ to 500). For 1 lag, the number of parameters in total is equal to 6; for 2 lags,\
+  \ it is 10; for 3 lags, it is 14; and for 4 lags, it is 18. Now, let\u2019s calculate\
+  \ the values of the criterion for each lag:\n(A) 1 lag: ln(0.0336) + (2 * 6 / 500)\
+  \ = ln(0.0336) + (12 / 500) = -3.369\n(B) 2 lags: ln(0.0169) + (2 * 10 / 500) =\
+  \ ln(0.0169) + (20 / 500) = -4.040\n(C) 3 lags: ln(0.0084) + (2 * 14 / 500) = ln(0.0084)\
+  \ + (28 / 500) = -4.724\n(D) 4 lags: ln(0.0062) + (2 * 18 / 500) = ln(0.0062) + (36\
+  \ / 500) = -5.011\nBecause the optimal model order according to AIC minimizes the\
+  \ information criterion, the answer should be the one with the lowest value. In\
+  \ this case, (D) has the lowest value. The answer is (D).\n\nQ: Consider the following\
+  \ AR(1) model with the disturbances having zero mean and unit variance\nyt = 0.2\
+  \ + 0.4 yt-1 + ut\nThe (unconditional) mean of y will be given by\n(A) 0.2 (B) 0.4\
+  \ (C) 0.5 (D) 0.33\nA: Let's think step by step. We refer to Wikipedia articles\
+  \ on econometrics for help. Let\u2019s solve this problem step by step. If we have\
+  \ an AR(1) model with the disturbances having zero mean and unit variance, then\
+  \ the unconditional mean of y is equal to the following:\nunconditional mean of\
+  \ y = (the intercept term) / (1 - autoregressive coefficient)\nWe know that the\
+  \ intercept term is 0.2 and the autoregressive coefficient is 0.4; thus, we have:\n\
+  unconditional mean of y = (0.2) / (1 - 0.4) = (0.2) / (0.6) = 2 / 6 = 1 / 3, which\
+  \ is approximately 0.33. That means that the answer should be (D) 0.33. The answer\
+  \ is (D).\n\nQ: What would then be the consequences for the OLS estimator if heteroscedasticity\
+  \ is present in a regression model but ignored?\n(A) It will be biased (B) It will\
+  \ be inconsistent (C) It will be inefficient (D) All of (A), (B) and (C) will be\
+  \ true.\nA: Let's think step by step. We refer to Wikipedia articles on econometrics\
+  \ for help. Heteroscedasticity refers to the condition where the variance of the\
+  \ error terms is not constant across multiple observations. If heteroscedasticity\
+  \ is present in a regression model, then the OLS coefficient estimates\
+  \ will remain unbiased and consistent but will be inefficient. Because (A) and\
+  \ (B) are incorrect choices and (C) is a correct choice, (D) cannot be the right\
+  \ answer. Ultimately, (C) is the only true choice. The answer is (C).\n\nQ: Suppose\
+  \ that a test statistic has associated with it a p-value of 0.08. Which one of the\
+  \ following statements is true?\n(i) If the size of the test were exactly 8%, we\
+  \ would be indifferent between rejecting and not rejecting the null hypothesis\n\
+  (ii) The null would be rejected if a 10% size of test were used\n(iii) The null\
+  \ would not be rejected if a 1% size of test were used\n(iv) The null would be rejected\
+  \ if a 5% size of test were used.\n(A) (ii) and (iv) only (B) (i) and (iii) only\
+  \ (C) (i), (ii), and (iii) only (D) (i), (ii), (iii), and (iv).\nA: Let's think\
+  \ step by step. We refer to Wikipedia articles on econometrics for help. Let\u2019\
+  s reason about each of the options.\n(i) is a true statement.\n(ii) is a true statement.\n\
+  (iii) is a true statement.\n(iv) is not a true statement. Thus, (i), (ii), and (iii)\
+  \ are true. The answer is (C).\n\nQ: For a stationary autoregressive process, shocks\
+  \ will\n(A) Eventually die away (B) Persist indefinitely (C) Grow exponentially\
+  \ (D) Never occur\nA: Let's think step by step. We refer to Wikipedia articles on\
+  \ econometrics for help. This is a question about stationary processes.\
+  \ For a stationary autoregressive process, shocks will eventually die away. The\
+  \ answer is (A)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_econometrics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_electrical_engineering.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_electrical_engineering.yaml
+dataset_name: electrical_engineering
+description: "\nThe following are multiple choice questions (with answers) about electrical\
+  \ engineering.\n\nQ: A point pole has a strength of 4\u03C0 * 10^-4 weber. The force\
+  \ in newtons on a point pole of 4\u03C0 * 1.5 * 10^-4 weber placed at a distance\
+  \ of 10 cm from it will be\n(A) 15 N. (B) 20 N. (C) 7.5 N. (D) 3.75 N.\nA: Let's\
+  \ think step by step. The force between two point poles is given by m_1m_2/(mu_0\
+  \ 4 \\pi r^2), in analogy to Coulomb\u2019s law. Plugging in the values given in\
+  \ the question, we calculate that the force is approximately 15 N. The answer is\
+  \ (A).\n\nQ: The coil of a moving coil meter has 100 turns, is 40 mm long and 30\
+  \ mm wide. The control torque is 240*10^-6 N-m on full scale. If magnetic flux density\
+  \ is 1 Wb/m^2, the range of the meter is\n(A) 1 mA. (B) 2 mA. (C) 3 mA. (D) 4 mA.\nA: Let's\
+  \ think step by step. The torque on a coil in a uniform magnetic field is given\
+  \ by BANI, where B is the magnetic flux density, A is the area of the coil, N is\
+  \ the number of turns, and I is the current. So we have that I = (Torque)/(BAN),\
+  \ or 240e-6/(1200e-6 * 100 * 1) = 2e-3. The answer is (B).\n\nQ: In an SR latch\
+  \ built from NOR gates, which condition is not allowed\n(A) S=0, R=0 (B) S=0, R=1\
+  \ (C) S=1, R=0 (D) S=1, R=1\nA: Let's think step by step. An SR latch is a set-reset\
+  \ latch; in the case where S=1 and R=1, the circuit has no stable state; instead\
+  \ a race condition will be produced within the circuit, so the device will be in\
+  \ an undefined state. So S=1, R=1 is an illegal input. The answer is (D).\n\nQ:\
+  \ Two long parallel conductors carry 100 A. If the conductors are separated by 20\
+  \ mm, the force per meter of length of each conductor will be\n(A) 100 N. (B) 0.1\
+  \ N. (C) 1 N. (D) 0.01 N.\nA: Let's think step by step. The magnetic force-per-length\
+  \ between two current-carrying conductors is given by \\mu_0 I_1 I_2 / (2 \\pi r),\
+  \ where r is the separation distance and I_1 and I_2 are the currents. Plugging\
+  \ in 100 A for I_1 and I_2, and 20 mm for r, gives 0.1 N. The answer is (B).\n\n\
+  Q: In a 2 pole lap winding dc machine, the resistance of one conductor is 2\u03A9\
+  \ and total number of conductors is 100. Find the total resistance\n(A) 200\u03A9\
+  \ (B) 100\u03A9 (C) 50\u03A9 (D) 10\u03A9\nA: Let's think step by step. In lap winding,\
+  \ effectively two resistors are connected in parallel, so the actual resistance\
+  \ of each pair is 1 Ohm. Since we have 50 pairs, we get a total resistance of 50\
+  \ Ohms. The answer is (C)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_electrical_engineering
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_elementary_mathematics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_elementary_mathematics.yaml
+dataset_name: elementary_mathematics
+description: "The following are multiple choice questions (with answers) about elementary\
+  \ mathematics.\n\nQ: Olivia used the rule \"Add 11\" to create the number pattern\
+  \ shown below. 10, 21, 32, 43, 54. Which statement about the number pattern is true?\n\
+  (A) The 10th number in the pattern will be an even number.\n(B) The number pattern\
+  \ will never have two even numbers next to each other.\n(C) The next two numbers\
+  \ in the pattern will be an even number then an odd number.\n(D) If the number pattern\
+  \ started with an odd number then the pattern would have only odd numbers in it.\n\
+  A: Let's think step by step. Choice A is incorrect because every even-numbered term\
+  \ in the pattern is odd, and 10 is an even number. Choice B is correct, because\
+  \ adding an odd number (in this case 11) to an odd number produces an even number,\
+  \ and adding an odd number to an even number produces an odd number. Thus the terms\
+  \ in the pattern will alternate between odd and even, so there will never be two\
+  \ even numbers next to each other. Choice C is incorrect because the last term in\
+  \ the example is even (54), and we know that the terms will alternate between even\
+  \ and odd. Choice D is incorrect because the terms in the pattern will alternate\
+  \ between odd and even, regardless of the value of the first term. The answer is\
+  \ (B).\n\nQ: The population of the city where Michelle was born is 145,826. What\
+  \ is the value of the 5 in the number 145,826?\n(A) 5 thousands\n(B) 5 hundreds\n\
+  (C) 5 tens\n(D) 5 ones\nA: Let's think step by step. Choice A is correct, because\
+  \ there are three digits following the 5, so the 5 is in the thousands place. Thus\
+  \ the other choices are incorrect. The answer is (A).\n\nQ: A store sells 107 different\
+  \ colors of paint. They have 25 cans of each color in storage. The number of cans\
+  \ of paint the store has in storage can be found using the expression below. 107\
+  \ \xD7 25. How many cans of paint does the store have in storage?\n(A) 749\n(B)\
+  \ 2,675\n(C) 2,945\n(D) 4,250\nA: Let's think step by step. We can calculate 107\
+  \ x 25 = (100 x 25) + (7 x 25) = 2500 + 175 = 2675. The answer is (B).\n\nQ: A total\
+  \ of 30 players will play basketball at a park. There will be exactly 5 players\
+  \ on each team. Which statement correctly explains how to find the number of teams\
+  \ needed?\n(A) Add 5 to 30 to find 35 teams.\n(B) Divide 30 by 5 to find 6 teams.\n\
+  (C) Multiply 30 and 5 to find 150 teams.\n(D) Subtract 5 from 30 to find 25 teams.\n\
+  A: Let's think step by step. We want to find the number of teams. We know that there\
+  \ are 5 players/team, and 30 players. Thus to get the number of teams we divide\
+  \ players by players/team, so 30 players / 5 players/team = 6 teams. The answer\
+  \ is (B).\n\nQ: Which expression is equivalent to 5 x 9?\n(A) (5 x 4) x (6 x 5)\n\
+  (B) (5 x 5) + (5 x 4)\n(C) (5 x 5) + (5 x 9)\n(D) (5 x 9) x (6 x 9)\nA: Let's think\
+  \ step by step. We know that 9 = (5 + 4), so 5 x 9 = 5 x (5 + 4) = (5 x 5) + (5\
+  \ x 4). The answer is (B)."
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_elementary_mathematics