Merge branch 'EleutherAI:main' into main

Commit da211969 (unverified) · authored Jun 28, 2024 by Jess · committed by GitHub Jun 28, 2024
20 changed files
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_anatomy.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_anatomy.yaml
-"dataset_name": "anatomy"
-"description": "The following are multiple choice questions (with answers) about anatomy.\n\
-  \nQ: Which of the following is the body cavity that contains the pituitary gland?\n\
-  (A) Abdominal (B) Cranial (C) Pleural (D) Spinal\nA: Let's think step by step. We\
-  \ refer to Wikipedia articles on anatomy for help. Let’s solve this problem step\
-  \ by step. The pituitary gland is the major endocrine gland attached to the base\
-  \ of the brain, and it is contained in the Cranial cavity. The answer is (B).\n\n\
-  Q: Which of these branches of the trigeminal nerve contain somatic motor processes?\n\
-  (A) The supraorbital nerve (B) The infraorbital nerve (C) The mental nerve (D) None\
-  \ of the above\nA: Let's think step by step. We refer to Wikipedia articles on anatomy\
-  \ for help. Let’s solve this problem step by step. \nWe know the following: (A)\
-  \ The supraorbital nerve (also known as the frontal nerve) is the largest branch\
-  \ of the ophthalmic nerve and branch of ophthalmic division of the trigeminal nerve.\
-  \ (B) The infraorbital nerve is a branch of the maxillary division of the trigeminal\
-  \ nerve. (C) The mental nerve is a branch of the mandibular division of the trigeminal\
-  \ nerve. Because all these nerves are purely sensory nerves and do not contain any\
-  \ somatic motor processes. Therefore, the answer should be none of the above, which\
-  \ is (D). The answer is (D).\n\nQ: In Angle's Class II Div 2 occlusion there is\n\
-  (A) excess overbite of the upper lateral incisors. (B) negative overjet of the upper\
-  \ central incisors. (C) excess overjet of the upper lateral incisors. (D) excess\
-  \ overjet of the upper central incisors.\nA: Let's think step by step. We refer\
-  \ to Wikipedia articles on anatomy for help. Let’s solve this problem step by step.\
-  \ This is a question related to anatomy and orthodontics. Excess overjet is associated\
-  \ with Class II occlusions; therefore, we can safely eliminate (B) from the list,\
-  \ as negative overjet is often associated with Class III occlusions. Now, we need\
-  \ to determine the location of the excess overjet, and that would be the upper (maxillary)\
-  \ lateral incisors. Only (C) has the correct information. The answer is (C).\n\n\
-  Q: The pleura\n(A) have no sensory innervation. (B) are separated by a 2 mm space.\
-  \ (C) extend into the neck. (D) are composed of respiratory epithelium.\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on anatomy for help. Let’s\
-  \ solve this problem step by step. First, recall that the pleura refers to the thin\
-  \ layer of tissue that covers the lungs and lines the interior wall of the chest\
-  \ cavity. Now, let’s look at each option:\nOption (A): “The pleura have no sensory\
-  \ innervation.” This information is not correct. The pleura do have a sensory innervation.\n\
-  Option (B): “The pleura are separated by a 2 mm space.” This information is not\
-  \ correct. There is a very thin “potential” space between the layers of the pleura;\
-  \ however, it is typically filled with serous pleural fluid. \nOption (C): “The\
-  \ pleura extend into the neck.” This information is actually true. The cervical\
-  \ pleura, also known as the dome of the pleura, lines the extension\
-  \ of the pleural cavity into the neck.\nOption (D): “The pleura are composed of\
-  \ respiratory epithelium.” This information is not correct. The pleura are composed\
-  \ of connective tissue (CT).\nBecause (A), (B), and (D) are all incorrect, (C) is\
-  \ the only correct answer. The answer is (C).\n\nQ: What is the embryological origin\
-  \ of the hyoid bone?\n(A) The first pharyngeal arch (B) The first and second pharyngeal\
-  \ arches (C) The second pharyngeal arch (D) The second and third pharyngeal arches\n\
-  A: Let's think step by step. We refer to Wikipedia articles on anatomy for help.\
-  \ Let’s solve this problem step by step. The hyoid bone, which is also known as\
-  \ the hyoid, is a small U-shaped bone located in the anterior neck. In its resting\
-  \ position, it lies between the base of the mandible and the third cervical vertebra.\
-  \ We know that the second and the third pharyngeal arches give rise to the horns\
-  \ of the hyoid bone; therefore, the embryological origin of the hyoid bone are the\
-  \ second and the third pharyngeal arches—this information is covered in the last\
-  \ option (D). Therefore, we conclude that (D) must be the correct answer. The answer\
-  \ is (D).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_anatomy"
+dataset_name: anatomy
+description: The following are multiple choice questions (with answers) about anatomy.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of the following is the body cavity that contains the pituitary
+      gland?
+
+      (A) Abdominal (B) Cranial (C) Pleural (D) Spinal'
+    target: "Let's think step by step. We refer to Wikipedia articles on anatomy for\
+      \ help. Let\u2019s solve this problem step by step. The pituitary gland is the\
+      \ major endocrine gland attached to the base of the brain, and it is contained\
+      \ in the Cranial cavity. The answer is (B)."
+  - question: 'Which of these branches of the trigeminal nerve contain somatic motor
+      processes?
+
+      (A) The supraorbital nerve (B) The infraorbital nerve (C) The mental nerve (D)
+      None of the above'
+    target: "Let's think step by step. We refer to Wikipedia articles on anatomy for\
+      \ help. Let\u2019s solve this problem step by step. \nWe know the following:\
+      \ (A) The supraorbital nerve (also known as the frontal nerve) is the largest\
+      \ branch of the ophthalmic nerve and branch of ophthalmic division of the trigeminal\
+      \ nerve. (B) The infraorbital nerve is a branch of the maxillary division of\
+      \ the trigeminal nerve. (C) The mental nerve is a branch of the mandibular division\
+      \ of the trigeminal nerve. Because all these nerves are purely sensory nerves\
+      \ and do not contain any somatic motor processes. Therefore, the answer should\
+      \ be none of the above, which is (D). The answer is (D)."
+  - question: 'In Angle''s Class II Div 2 occlusion there is
+
+      (A) excess overbite of the upper lateral incisors. (B) negative overjet of the
+      upper central incisors. (C) excess overjet of the upper lateral incisors. (D)
+      excess overjet of the upper central incisors.'
+    target: "Let's think step by step. We refer to Wikipedia articles on anatomy for\
+      \ help. Let\u2019s solve this problem step by step. This is a question related\
+      \ to anatomy and orthodontics. Excess overjet is associated with Class II occlusions;\
+      \ therefore, we can safely eliminate (B) from the list, as negative overjet\
+      \ is often associated with Class III occlusions. Now, we need to determine the\
+      \ location of the excess overjet, and that would be the upper (maxillary) lateral\
+      \ incisors. Only (C) has the correct information. The answer is (C)."
+  - question: 'The pleura
+
+      (A) have no sensory innervation. (B) are separated by a 2 mm space. (C) extend
+      into the neck. (D) are composed of respiratory epithelium.'
+    target: "Let's think step by step. We refer to Wikipedia articles on anatomy for\
+      \ help. Let\u2019s solve this problem step by step. First, recall that the pleura\
+      \ refers to the thin layer of tissue that covers the lungs and lines the interior\
+      \ wall of the chest cavity. Now, let\u2019s look at each option:\nOption (A):\
+      \ \u201CThe pleura have no sensory innervation.\u201D This information is not\
+      \ correct. The pleura do have a sensory innervation.\nOption (B): \u201CThe\
+      \ pleura are separated by a 2 mm space.\u201D This information is not correct.\
+      \ There is a very thin \u201Cpotential\u201D space between the layers of the\
+      \ pleura; however, it is typically filled with serous pleural fluid. \nOption\
+      \ (C): \u201CThe pleura extend into the neck.\u201D This information is actually\
+      \ true. The cervical pleura, also known as the dome of the\
+      \ pleura, lines the extension of the pleural cavity into the neck.\nOption\
+      \ (D): \u201CThe pleura are composed of respiratory epithelium.\u201D This information\
+      \ is not correct. The pleura are composed of connective tissue (CT).\nBecause\
+      \ (A), (B), and (D) are all incorrect, (C) is the only correct answer. The answer\
+      \ is (C)."
+  - question: 'What is the embryological origin of the hyoid bone?
+
+      (A) The first pharyngeal arch (B) The first and second pharyngeal arches (C)
+      The second pharyngeal arch (D) The second and third pharyngeal arches'
+    target: "Let's think step by step. We refer to Wikipedia articles on anatomy for\
+      \ help. Let\u2019s solve this problem step by step. The hyoid bone, which is\
+      \ also known as the hyoid, is a small U-shaped bone located in the anterior\
+      \ neck. In its resting position, it lies between the base of the mandible and\
+      \ the third cervical vertebra. We know that the second and the third pharyngeal\
+      \ arches give rise to the horns of the hyoid bone; therefore, the embryological\
+      \ origin of the hyoid bone are the second and the third pharyngeal arches\u2014\
+      this information is covered in the last option (D). Therefore, we conclude that\
+      \ (D) must be the correct answer. The answer is (D).\n\n"
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_anatomy
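
Review note: the new `fewshot_config` block moves the worked examples out of `description` and into a `samples` list read by a `first_n` sampler. The sketch below shows one way such a sampler could assemble a prompt. It is an illustration only: `build_fewshot_prompt` and the `Q:`/`A:` layout are assumptions modelled on the removed `description` text, not lm_eval's actual implementation.

```python
from typing import Dict, List


def build_fewshot_prompt(
    description: str,
    samples: List[Dict[str, str]],
    n: int,
    query: str,
) -> str:
    """Hypothetical helper: join the first n samples into a Q/A prompt."""
    # "first_n" is read here as: take samples in file order, no shuffling.
    shots = samples[:n]
    parts = [description.strip(), ""]
    for shot in shots:
        parts.append(f"Q: {shot['question']}\nA: {shot['target'].strip()}")
        parts.append("")  # blank line between worked examples
    # The new question ends with an open "A:" for the model to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n".join(parts)


if __name__ == "__main__":
    demo_samples = [
        {"question": "q1", "target": "t1. The answer is (B)."},
        {"question": "q2", "target": "t2. The answer is (D)."},
    ]
    print(build_fewshot_prompt("Header.", demo_samples, 1, "q3"))
```

With `n` smaller than the number of samples, only the leading examples are used, which is the behaviour the name `first_n` suggests.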
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_astronomy.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_astronomy.yaml
-"dataset_name": "astronomy"
-"description": "The following are multiple choice questions (with answers) about astronomy.\n\
-  \nQ: Where do most short-period comets come from and how do we know?\n(A) The Kuiper\
-  \ belt; short period comets tend to be in the plane of the solar system just like\
-  \ the Kuiper belt. (B) The Kuiper belt; short period comets tend to come from random\
-  \ directions indicating a spherical distribution of comets called the Kuiper belt.\
-  \ (C) The asteroid belt; short period comets have orbital periods similar to asteroids\
-  \ like Vesta and are found in the plane of the solar system just like the asteroid\
-  \ belt. (D) The Oort cloud; short period comets tend to be in the plane of the solar\
-  \ system just like the Oort cloud.\nA: Let's think step by step. Most short-period\
-  \ comets come from the Kuiper belt, and we know because short period comets tend\
-  \ to be in the plane of the solar system, just like the Kuiper belt is. The answer\
-  \ is (A).\n\nQ: You are pushing a truck along a road. Would it be easier to accelerate\
-  \ this truck on Mars? Why? (Assume there is no friction)\n(A) It would be harder\
-  \ since the truck is heavier on Mars. (B) It would be easier since the truck is\
-  \ lighter on Mars. (C) It would be harder since the truck is lighter on Mars. (D)\
-  \ It would be the same no matter where you are.\nA: Let's think step by step. If\
-  \ we assume that there is no friction, the force needed to accelerate the truck\
-  \ is by Newton’s second law only dependent on the mass of the truck. Hence (A),\
-  \ (B) and (C) are incorrect since it doesn’t matter that it’s on Mars, and (D) is\
-  \ the correct answer. The answer is (D).\n\nQ: Say the pupil of your eye has a diameter\
-  \ of 5 mm and you have a telescope with an aperture of 50 cm. How much more light\
-  \ can the telescope gather than your eye?\n(A) 10000 times more (B) 100 times more\
-  \ (C) 1000 times more (D) 10 times more\nA: Let's think step by step. The amount\
-  \ of light is proportional to the aperture area $A = \\pi D^2/4$ for a lens with\
-  \ diameter $D$, so the relative amounts of light between the eye with diameter 5mm\
-  \ and the telescope with diameter 50 cm is $(50 cm)^2/(5mm)^2 = 10000$. The answer\
-  \ is (A).\n\nQ: Why isn't there a planet where the asteroid belt is located?\n(A)\
-  \ A planet once formed here but it was broken apart by a catastrophic collision.\
-  \ (B) There was not enough material in this part of the solar nebula to form a planet.\
-  \ (C) There was too much rocky material to form a terrestrial planet but not enough\
-  \ gaseous material to form a jovian planet. (D) Resonance with Jupiter prevented\
-  \ material from collecting together to form a planet.\nA: Let's think step by step.\
-  \ The asteroid belt is a stellar disc consisting of a large number of asteroids\
-  \ between Mars and Jupiter's orbits. The asteroids in this belt are affected by\
-  \ the gravitational pull from both other asteroids and nearby planets. Due to the\
-  \ strong gravitational force of Jupiter there are resonances that give rise to low\
-  \ density regions of asteroids known as the Kirkwood gap. So (B) and (C) are not\
-  \ correct since it’s not a lack of material that prevents a planet from being formed,\
-  \ and (A) is incorrect because the Kirkwood gap would have prevented a planet from\
-  \ forming in the first place, and (D) is the correct option. The answer is (D).\n\
-  \nQ: Why is Mars red?\n(A) Because the surface is covered with heavily oxidized\
-  \ (\"rusted\") minerals. (B) Because the atmosphere scatters more light at bluer\
-  \ wavelengths transmitting mostly red light. (C) Because Mars is covered with ancient\
-  \ lava flows which are red in color. (D) Because flowing water on Mars's surface\
-  \ altered the surface minerals several billion years ago.\nA: Let's think step by\
-  \ step. Option (B) is not correct because if the red color was caused by the scattering\
-  \ off the atmosphere, then the earth with a much thicker atmosphere would also look\
-  \ red. Options (C) and (D) are not specific enough about why the color of the surface\
-  \ would be red, while (A) is correct because it explains that the surface is red\
-  \ due to the rusted materials on the surface and the red color comes from the rust.\
-  \ So the correct option is (A). The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_astronomy"
+dataset_name: astronomy
+description: The following are multiple choice questions (with answers) about astronomy.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Where do most short-period comets come from and how do we know?
+
+      (A) The Kuiper belt; short period comets tend to be in the plane of the solar
+      system just like the Kuiper belt. (B) The Kuiper belt; short period comets tend
+      to come from random directions indicating a spherical distribution of comets
+      called the Kuiper belt. (C) The asteroid belt; short period comets have orbital
+      periods similar to asteroids like Vesta and are found in the plane of the solar
+      system just like the asteroid belt. (D) The Oort cloud; short period comets
+      tend to be in the plane of the solar system just like the Oort cloud.'
+    target: Let's think step by step. Most short-period comets come from the Kuiper
+      belt, and we know because short period comets tend to be in the plane of the
+      solar system, just like the Kuiper belt is. The answer is (A).
+  - question: 'You are pushing a truck along a road. Would it be easier to accelerate
+      this truck on Mars? Why? (Assume there is no friction)
+
+      (A) It would be harder since the truck is heavier on Mars. (B) It would be easier
+      since the truck is lighter on Mars. (C) It would be harder since the truck is
+      lighter on Mars. (D) It would be the same no matter where you are.'
+    target: "Let's think step by step. If we assume that there is no friction, the\
+      \ force needed to accelerate the truck is by Newton\u2019s second law only dependent\
+      \ on the mass of the truck. Hence (A), (B) and (C) are incorrect since it doesn\u2019\
+      t matter that it\u2019s on Mars, and (D) is the correct answer. The answer is\
+      \ (D)."
+  - question: 'Say the pupil of your eye has a diameter of 5 mm and you have a telescope
+      with an aperture of 50 cm. How much more light can the telescope gather than
+      your eye?
+
+      (A) 10000 times more (B) 100 times more (C) 1000 times more (D) 10 times more'
+    target: Let's think step by step. The amount of light is proportional to the aperture
+      area $A = \pi D^2/4$ for a lens with diameter $D$, so the relative amounts of
+      light between the eye with diameter 5mm and the telescope with diameter 50 cm
+      is $(50 cm)^2/(5mm)^2 = 10000$. The answer is (A).
+  - question: 'Why isn''t there a planet where the asteroid belt is located?
+
+      (A) A planet once formed here but it was broken apart by a catastrophic collision.
+      (B) There was not enough material in this part of the solar nebula to form a
+      planet. (C) There was too much rocky material to form a terrestrial planet but
+      not enough gaseous material to form a jovian planet. (D) Resonance with Jupiter
+      prevented material from collecting together to form a planet.'
+    target: "Let's think step by step. The asteroid belt is a stellar disc consisting\
+      \ of a large number of asteroids between Mars and Jupiter's orbits. The asteroids\
+      \ in this belt are affected by the gravitational pull from both other asteroids\
+      \ and nearby planets. Due to the strong gravitational force of Jupiter there\
+      \ are resonances that give rise to low density regions of asteroids known as\
+      \ the Kirkwood gap. So (B) and (C) are not correct since it\u2019s not a lack\
+      \ of material that prevents a planet from being formed, and (A) is incorrect\
+      \ because the Kirkwood gap would have prevented a planet from forming in the\
+      \ first place, and (D) is the correct option. The answer is (D)."
+  - question: 'Why is Mars red?
+
+      (A) Because the surface is covered with heavily oxidized ("rusted") minerals.
+      (B) Because the atmosphere scatters more light at bluer wavelengths transmitting
+      mostly red light. (C) Because Mars is covered with ancient lava flows which
+      are red in color. (D) Because flowing water on Mars''s surface altered the surface
+      minerals several billion years ago.'
+    target: 'Let''s think step by step. Option (B) is not correct because if the red
+      color was caused by the scattering off the atmosphere, then the earth with a
+      much thicker atmosphere would also look red. Options (C) and (D) are not specific
+      enough about why the color of the surface would be red, while (A) is correct
+      because it explains that the surface is red due to the rusted materials on the
+      surface and the red color comes from the rust. So the correct option is (A).
+      The answer is (A).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_astronomy
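
Review note: each file in this commit makes the same mechanical change, splitting the old `description` string into a one-line header plus `question`/`target` pairs. A minimal sketch of that split is below. The function name `split_legacy_description` is hypothetical, and the sketch assumes the legacy layout seen in the removed lines (examples separated by a blank line, each starting with `Q: ` and containing one `\nA: ` marker). It is not the script the author used.

```python
from typing import Dict, List, Tuple


def split_legacy_description(description: str) -> Tuple[str, List[Dict[str, str]]]:
    """Hypothetical converter: legacy description -> (header, samples)."""
    # Everything before the first "Q: " block is the task header.
    header, _, body = description.partition("\n\nQ: ")
    samples: List[Dict[str, str]] = []
    if not body:
        return header.strip(), samples
    for block in body.strip().split("\n\nQ: "):
        # "\nA: " (with the colon) does not collide with the "(A)" option label.
        question, _, target = block.partition("\nA: ")
        samples.append({"question": question.strip(), "target": target.strip()})
    return header.strip(), samples


if __name__ == "__main__":
    legacy = (
        "The following are multiple choice questions (with answers) about astronomy.\n\n"
        "Q: Why is Mars red?\n(A) Rust (B) Lava\n"
        "A: Let's think step by step. The answer is (A).\n\n"
    )
    header, samples = split_legacy_description(legacy)
    print(header)
    print(samples)
```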
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_business_ethics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_business_ethics.yaml
-"dataset_name": "business_ethics"
-"description": "The following are multiple choice questions (with answers) about business\
-  \ ethics.\n\nQ: In contrast to _______, _______ aim to reward favourable behaviour\
-  \ by companies. The success of such campaigns have been heightened through the use\
-  \ of ___________, which allow campaigns to facilitate the company in achieving _________\
-  \ .\n(A) Buycotts, Boycotts, Blockchain technology, Charitable donations (B) Buycotts,\
-  \ Boycotts, Digital technology, Increased Sales (C) Boycotts, Buyalls, Blockchain\
-  \ technology, Charitable donations (D) Boycotts, Buycotts, Digital technology, Increased\
-  \ Sales\nA: Let's think step by step. We refer to Wikipedia articles on business\
-  \ ethics for help. The sentence that best uses the possible options above is “In\
-  \ contrast to *boycotts*, *buycotts* aim to reward favourable behavior by companies.\
-  \ The success of such campaigns have been heightened through the use of *digital\
-  \ technology*, which allow campaigns to facilitate the company in achieving *increased\
-  \ sales*.” The answer is (D).\n\nQ: _______ is the direct attempt to formally or\
-  \ informally manage ethical issues or problems, through specific policies, practices\
-  \ and programmes.\n(A) Corporate social responsibility (B) Business ethics management\
-  \ (C) Sustainability (D) Environmental management\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on business ethics for help. The direct attempt to\
-  \ manage ethical issues through specific policies, practices, and programs is business\
-  \ ethics management. The answer is (B).\n\nQ: Three contrasting tactics that CSO's\
-  \ can engage in to meet their aims are ________ which typically involves research\
-  \ and communication, ________, which may involve physically attacking a company's\
-  \ operations or ________, often involving some form of _______.\n(A) Non-violent\
-  \ direct action, Violent direct action, Indirect action, Boycott (B) Indirect action,\
-  \ Instrumental action, Non-violent direct action, Information campaign (C) Indirect\
-  \ action, Violent direct action, Non-violent direct-action Boycott (D) Non-violent\
-  \ direct action, Instrumental action, Indirect action, Information campaign\nA:\
-  \ Let's think step by step. We refer to Wikipedia articles on business ethics for\
-  \ help. The sentence that best uses the possible options above is “Three contrasting\
-  \ tactics that CSO's can engage in to meet their aims are *indirect action*, which\
-  \ typically involves research and communication, *violent direct action*, which\
-  \ may involve physically attacking a company's operations or *non-violent direct\
-  \ action*, often involving some form of *boycott*.” The answer is (C).\n\nQ: To\
-  \ ensure the independence of the non-executive board members, there are a number\
-  \ of steps which can be taken, which include non-executives being drawn from _______\
-  \ the company, being appointed for a _________ time period as well as being appointed\
-  \ _________.\n(A) Outside, Limited, Independently (B) Inside, Limited, Intermittently\
-  \ (C) Outside, Unlimited, Intermittently (D) Inside, Unlimited, Independently\n\
-  A: Let's think step by step. We refer to Wikipedia articles on business ethics for\
-  \ help. The sentence that best uses the possible options above is “To ensure the\
-  \ independence of the non-executive board members, there are a number of steps which\
-  \ can be taken, which include non-executives being drawn from *outside* the company,\
-  \ being appointed for a *limited* time period as well as being appointed *independently*.”\
-  \ The answer is (A).\n\nQ: Beyond the business case for engaging in CSR there are\
-  \ a number of moral arguments relating to: negative _______, the _______that corporations\
-  \ possess and the ________ of business and society.\n(A) Externalities, Power, Independence\
-  \ (B) Publicity, Insubstantial resources, Mutual dependence (C) Publicity, Power,\
-  \ Independence (D) Externalities, Power, Mutual dependence\nA: Let's think step\
-  \ by step. We refer to Wikipedia articles on business ethics for help. The sentence\
-  \ that best uses the possible options above is “Beyond the business case for engaging\
-  \ in CSR there are a number of moral arguments relating to: negative *externalities*,\
-  \ the *power* that corporations possess and the *mutual dependence* of business\
-  \ and society.” The answer is (D).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_business_ethics"
+dataset_name: business_ethics
+description: The following are multiple choice questions (with answers) about business
+  ethics.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'In contrast to _______, _______ aim to reward favourable behaviour by
+      companies. The success of such campaigns have been heightened through the use
+      of ___________, which allow campaigns to facilitate the company in achieving
+      _________ .
+
+      (A) Buycotts, Boycotts, Blockchain technology, Charitable donations (B) Buycotts,
+      Boycotts, Digital technology, Increased Sales (C) Boycotts, Buyalls, Blockchain
+      technology, Charitable donations (D) Boycotts, Buycotts, Digital technology,
+      Increased Sales'
+    target: "Let's think step by step. We refer to Wikipedia articles on business\
+      \ ethics for help. The sentence that best uses the possible options above is\
+      \ \u201CIn contrast to *boycotts*, *buycotts* aim to reward favourable behavior\
+      \ by companies. The success of such campaigns have been heightened through the\
+      \ use of *digital technology*, which allow campaigns to facilitate the company\
+      \ in achieving *increased sales*.\u201D The answer is (D)."
+  - question: '_______ is the direct attempt to formally or informally manage ethical
+      issues or problems, through specific policies, practices and programmes.
+
+      (A) Corporate social responsibility (B) Business ethics management (C) Sustainability
+      (D) Environmental management'
+    target: Let's think step by step. We refer to Wikipedia articles on business ethics
+      for help. The direct attempt to manage ethical issues through specific policies,
+      practices, and programs is business ethics management. The answer is (B).
+  - question: 'Three contrasting tactics that CSO''s can engage in to meet their aims
+      are ________ which typically involves research and communication, ________,
+      which may involve physically attacking a company''s operations or ________,
+      often involving some form of _______.
+
+      (A) Non-violent direct action, Violent direct action, Indirect action, Boycott
+      (B) Indirect action, Instrumental action, Non-violent direct action, Information
+      campaign (C) Indirect action, Violent direct action, Non-violent direct-action
+      Boycott (D) Non-violent direct action, Instrumental action, Indirect action,
+      Information campaign'
+    target: "Let's think step by step. We refer to Wikipedia articles on business\
+      \ ethics for help. The sentence that best uses the possible options above is\
+      \ \u201CThree contrasting tactics that CSO's can engage in to meet their aims\
+      \ are *indirect action*, which typically involves research and communication,\
+      \ *violent direct action*, which may involve physically attacking a company's\
+      \ operations or *non-violent direct action*, often involving some form of *boycott*.\u201D\
+      \ The answer is (C)."
+  - question: 'To ensure the independence of the non-executive board members, there are
+      a number of steps which can be taken, which include non-executives being drawn
+      from _______ the company, being appointed for a _________ time period as well
+      as being appointed _________.
+
+      (A) Outside, Limited, Independently (B) Inside, Limited, Intermittently (C)
+      Outside, Unlimited, Intermittently (D) Inside, Unlimited, Independently'
+    target: "Let's think step by step. We refer to Wikipedia articles on business\
+      \ ethics for help. The sentence that best uses the possible options above is\
+      \ \u201CTo ensure the independence of the non-executive board members, there\
+      \ are a number of steps which can be taken, which include non-executives being\
+      \ drawn from *outside* the company, being appointed for a *limited* time period\
+      \ as well as being appointed *independently*.\u201D The answer is (A)."
+  - question: 'Beyond the business case for engaging in CSR there are a number of moral
+      arguments relating to: negative _______, the _______that corporations possess
+      and the ________ of business and society.
+
+      (A) Externalities, Power, Independence (B) Publicity, Insubstantial resources,
+      Mutual dependence (C) Publicity, Power, Independence (D) Externalities, Power,
+      Mutual dependence'
+    target: "Let's think step by step. We refer to Wikipedia articles on business\
+      \ ethics for help. The sentence that best uses the possible options above is\
+      \ \u201CBeyond the business case for engaging in CSR there are a number of\
+      \ moral arguments relating to: negative *externalities*, the *power* that corporations\
+      \ possess and the *mutual dependence* of business and society.\u201D The answer\
+      \ is (D).\n\n"
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_business_ethics
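
Review note: every `target` in these files ends with the phrase "The answer is (X).", so a scorer can recover the chosen option from a chain-of-thought completion by pattern matching. The sketch below shows one way to do that. The regular expression and the name `extract_answer` are assumptions for illustration. The filter actually configured in `_mmlu_flan_cot_fewshot_template_yaml` is not part of this diff.

```python
import re
from typing import Optional

# Matches the closing phrase used by every few-shot target in these files.
ANSWER_RE = re.compile(r"The answer is \(([A-D])\)")


def extract_answer(completion: str) -> Optional[str]:
    """Hypothetical helper: return the last option letter stated, or None."""
    matches = ANSWER_RE.findall(completion)
    # Use the last match so intermediate reasoning does not win over the conclusion.
    return matches[-1] if matches else None


if __name__ == "__main__":
    text = "Let's think step by step. Only (C) fits. The answer is (C)."
    print(extract_answer(text))
```

Taking the last match rather than the first is a design choice: the reasoning text may mention several options before settling on one.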
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_clinical_knowledge.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_clinical_knowledge.yaml
-"dataset_name": "clinical_knowledge"
-"description": "The following are multiple choice questions (with answers) about clinical\
-  \ knowledge.\n\nQ: Glycolysis is the name given to the pathway involving the conversion\
-  \ of:\n(A) glycogen to glucose-1-phosphate. (B) glycogen or glucose to fructose.\
-  \ (C) glycogen or glucose to pyruvate or lactate. (D) glycogen or glucose to pyruvate\
-  \ or acetyl CoA.\nA: Let's think step by step. We refer to Wikipedia articles on\
-  \ clinical knowledge for help. Glycolysis is the name given to the pathway involving\
-  \ conversion of glycogen or glucose to pyruvate or lactate. The answer is (C).\n\
-  \nQ: What is the difference between a male and a female catheter?\n(A) Male and\
-  \ female catheters are different colours. (B) Male catheters are longer than female\
-  \ catheters. (C) Male catheters are bigger than female catheters. (D) Female catheters\
-  \ are longer than male catheters.\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on clinical knowledge for help. The difference between a male and female\
-  \ catheter is that male catheters tend to be longer than female catheters. The answer\
-  \ is (B).\n\nQ: How many attempts should you make to cannulate a patient before\
-  \ passing the job on to a senior colleague, according to the medical knowledge of\
-  \ 2020?\n(A) 4 (B) 3 (C) 2 (D) 1\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on clinical knowledge for help. According to the medical protocol as\
-  \ of 2020, you should make two attempts to cannulate a patient before passing the\
-  \ job on to a more-senior practitioner. The answer is (C).\n\nQ: In the assessment\
-  \ of the hand function which of the following is true?\n(A) Abduction of the thumb\
-  \ is supplied by spinal root T2 (B) Opposition of the thumb by opponens policis\
-  \ is supplied by spinal root T1 (C) Finger adduction is supplied by the median nerve\
-  \ (D) Finger abduction is mediated by the palmar interossei\nA: Let's think step\
-  \ by step. We refer to Wikipedia articles on clinical knowledge for help. Of all\
-  \ the options, it is only true that the opposition of the thumb by opponens pollicis\
-  \ is supplied by spinal root T1. The answer is (B).\n\nQ: The energy for all forms\
-  \ of muscle contraction is provided by:\n(A) ATP. (B) ADP. (C) phosphocreatine.\
-  \ (D) oxidative phosphorylation.\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on clinical knowledge for help. The energy for muscular contraction is\
-  \ provided by ATP (adenosine triphosphate), which is the powerhouse of the cell.\
-  \ The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_clinical_knowledge"
+dataset_name: clinical_knowledge
+description: The following are multiple choice questions (with answers) about clinical
+  knowledge.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Glycolysis is the name given to the pathway involving the conversion of:
+
+      (A) glycogen to glucose-1-phosphate. (B) glycogen or glucose to fructose. (C)
+      glycogen or glucose to pyruvate or lactate. (D) glycogen or glucose to pyruvate
+      or acetyl CoA.'
+    target: Let's think step by step. We refer to Wikipedia articles on clinical knowledge
+      for help. Glycolysis is the name given to the pathway involving conversion of
+      glycogen or glucose to pyruvate or lactate. The answer is (C).
+  - question: 'What is the difference between a male and a female catheter?
+
+      (A) Male and female catheters are different colours. (B) Male catheters are
+      longer than female catheters. (C) Male catheters are bigger than female catheters.
+      (D) Female catheters are longer than male catheters.'
+    target: Let's think step by step. We refer to Wikipedia articles on clinical knowledge
+      for help. The difference between a male and female catheter is that male catheters
+      tend to be longer than female catheters. The answer is (B).
+  - question: 'How many attempts should you make to cannulate a patient before passing
+      the job on to a senior colleague, according to the medical knowledge of 2020?
+
+      (A) 4 (B) 3 (C) 2 (D) 1'
+    target: Let's think step by step. We refer to Wikipedia articles on clinical knowledge
+      for help. According to the medical protocol as of 2020, you should make two
+      attempts to cannulate a patient before passing the job on to a more-senior practitioner.
+      The answer is (C).
+  - question: 'In the assessment of the hand function which of the following is true?
+
+      (A) Abduction of the thumb is supplied by spinal root T2 (B) Opposition of the
+      thumb by opponens pollicis is supplied by spinal root T1 (C) Finger adduction
+      is supplied by the median nerve (D) Finger abduction is mediated by the palmar
+      interossei'
+    target: Let's think step by step. We refer to Wikipedia articles on clinical knowledge
+      for help. Of all the options, it is only true that the opposition of the thumb
+      by opponens pollicis is supplied by spinal root T1. The answer is (B).
+  - question: 'The energy for all forms of muscle contraction is provided by:
+
+      (A) ATP. (B) ADP. (C) phosphocreatine. (D) oxidative phosphorylation.'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on clinical
+      knowledge for help. The energy for muscular contraction is provided by ATP (adenosine
+      triphosphate), the main energy currency of the cell. The answer is (A).'
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_clinical_knowledge
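
The change above moves the worked examples out of the `description` string and into `fewshot_config.samples`, each with a `question` and a `target`. As a rough sketch of how the two layouts relate, the snippet below rebuilds the old `Q: ... A: ...` text from the new fields. `render_fewshot` is a hypothetical helper written for this illustration and is not part of lm_eval; the actual prompt template lives in `_mmlu_flan_cot_fewshot_template_yaml`, which this diff does not show.

```python
from typing import Dict, List


def render_fewshot(description: str, samples: List[Dict[str, str]]) -> str:
    """Hypothetical helper: rebuild the old description-style prompt.

    Assumes each sample has 'question' and 'target' keys, as in the new YAML.
    """
    shots = "".join(
        f"Q: {sample['question']}\nA: {sample['target']}\n\n" for sample in samples
    )
    return f"{description}\n\n{shots}"


description = (
    "The following are multiple choice questions (with answers) "
    "about clinical knowledge."
)

# One sample copied from the clinical_knowledge file above.
samples = [
    {
        "question": (
            "What is the difference between a male and a female catheter?\n"
            "(A) Male and female catheters are different colours. "
            "(B) Male catheters are longer than female catheters. "
            "(C) Male catheters are bigger than female catheters. "
            "(D) Female catheters are longer than male catheters."
        ),
        "target": (
            "Let's think step by step. We refer to Wikipedia articles on "
            "clinical knowledge for help. The difference between a male and "
            "female catheter is that male catheters tend to be longer than "
            "female catheters. The answer is (B)."
        ),
    }
]

prompt = render_fewshot(description, samples)
print(prompt)
```

With `sampler: first_n`, the samples would presumably be taken in file order, so the first entry in `samples` would come first in the prompt.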
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_biology.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_biology.yaml
-"dataset_name": "college_biology"
-"description": "The following are multiple choice questions (with answers) about college\
-  \ biology.\n\nQ: Which of the following represents an accurate statement concerning\
-  \ arthropods?\n(A) They possess an exoskeleton composed primarily of peptidoglycan.\
-  \ (B) They possess an open circulatory system with a dorsal heart. (C) They are\
-  \ members of a biologically unsuccessful phylum incapable of exploiting diverse\
-  \ habitats and nutrition sources. (D) They lack paired, jointed appendages.\nA:\
-  \ Let's think step by step. Peptidoglycan is known to comprise the plasma membrane\
-  \ of most bacteria, rather than the exoskeleton of arthropods, which is made of\
-  \ chitin, which rules out (A). The answer (C) is false because arthropods are a\
-  \ highly successful phylum. Likewise, arthropods have paired, jointed appendages,\
-  \ which rules out (D). The only remaining option is (B), as arthropods have an open\
-  \ circulatory system with a dorsal tubular heart. The answer is (B).\n\nQ: In a\
-  \ given population, 1 out of every 400 people has a cancer caused by a completely\
-  \ recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium,\
-  \ which of the following is the expected proportion of individuals who carry the\
-  \ b allele but are not expected to develop the cancer?\n(A) 1/400 (B) 19/400 (C)\
-  \ 20/400 (D) 38/400\nA: Let's think step by step. According to the Hardy Weinberg\
-  \ Law, $p^2 + 2 p q + q^2 = 1$, and $p + q = 1$ where $p$ is the frequency of the\
-  \ dominant allele, $q$ is the frequency of the recessive allele, and $p^2$, $q^2$,\
-  \ and $2pq$ are the frequencies of dominant homozygous, recessive homozygous, and\
-  \ heterozygous individuals, respectively. The frequency of the recessive allele\
-  \ (q) is $\\sqrt{\frac{1}{400}} = 0.05$. We have $p = 1 - q = 0.95$. The frequency\
-  \ of heterozygous individuals is $2pq = 2 \\cdot 0.05 \\cdot 0.95 = 0.095$. The\
-  \ number of heterozygous individuals is equal to the frequency of heterozygous individuals\
-  \ times the size of the population, or $0.095 * 400 = 38$. So we end up with 38/400.\
-  \ The answer is (D).\n\nQ: According to the pressure-flow model of movement of phloem\
-  \ contents, photosynthate movement from source to sink is driven by\n(A) an ATP-dependent\
-  \ pressure-flow pump (B) a water-pressure potential gradient (C) transpiration (D)\
-  \ apoplastic diffusion\nA: Let's think step by step. It is a gradient in water pressure\
-  \ that induces the movement of phloem content, which refers to answer (B). The mechanism\
-  \ of movement does not rely on metabolism, which rules out (A). Transpiration refers\
-  \ to the exhalation of water vapor through plant stomata, and is also not related,\
-  \ which rules out (C). While the apoplastic pathway is one of two main pathways\
-  \ for water transport in plants, it is not central to the pressure flow model, which\
-  \ rules out (D). The answer is (B).\n\nQ: Which of the following contain DNA sequences\
-  \ required for the segregation of chromosomes in mitosis and meiosis?\n(A) Telomeres\
-  \ (B) Centromeres (C) Nucleosomes (D) Spliceosomes\nA: Let's think step by step.\
-  \ The genetic material in Telomeres is not used, which rules out (A). Nucleosomes\
-  \ are the repeating subunit that comprises chromatin packed in a cell nucleus, and\
-  \ do not specifically refer to DNA sequences necessary for segregating chromosomes\
-  \ in cell division, which rules out (C). A spliceosome is a large ribonucleoprotein\
-  \ that removes introns from transcribed pre-mRNA rather than governing chromosome\
-  \ segregation. Centromeres are directly responsible for segregating chromosomes\
-  \ in cell division. The answer is (B).\n\nQ: The presence of homologous structures\
-  \ in two different organisms, such as the humerus in the front limb of a human and\
-  \ a bird, indicates that\n(A) the human and bird are polyphyletic species (B) a\
-  \ human's and bird's evolution is convergent (C) the human and bird belong to a\
-  \ clade (D) the human and bird developed by analogy\nA: Let's think step by step.\
-  \ Polyphyletic species are organisms that are grouped due to having similar characteristics\
-  \ but which do not have a common ancestor. This is not the case for humans and birds,\
-  \ which rules out (A). Convergent evolution refers to the indepdendent development\
-  \ of similar features in different species at different periods, which is also not\
-  \ the case for humans and birds, which rules out (B). Analogy refers to the superficial\
-  \ resemblance of structures that have different origins, which is not the case for\
-  \ the human and bird forearms, which rules out (D). Humans and birds do belong to\
-  \ the same clade - a group of organisms composed of a common ancestor. The answer\
-  \ is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_college_biology"
+dataset_name: college_biology
+description: The following are multiple choice questions (with answers) about college
+  biology.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of the following represents an accurate statement concerning arthropods?
+
+      (A) They possess an exoskeleton composed primarily of peptidoglycan. (B) They
+      possess an open circulatory system with a dorsal heart. (C) They are members
+      of a biologically unsuccessful phylum incapable of exploiting diverse habitats
+      and nutrition sources. (D) They lack paired, jointed appendages.'
+    target: Let's think step by step. Peptidoglycan is known to comprise the cell
+      wall of most bacteria, rather than the exoskeleton of arthropods, which
+      is made of chitin, which rules out (A). The answer (C) is false because arthropods
+      are a highly successful phylum. Likewise, arthropods have paired, jointed appendages,
+      which rules out (D). The only remaining option is (B), as arthropods have an
+      open circulatory system with a dorsal tubular heart. The answer is (B).
+  - question: 'In a given population, 1 out of every 400 people has a cancer caused by
+      a completely recessive allele, b. Assuming the population is in Hardy-Weinberg
+      equilibrium, which of the following is the expected proportion of individuals
+      who carry the b allele but are not expected to develop the cancer?
+
+      (A) 1/400 (B) 19/400 (C) 20/400 (D) 38/400'
+    target: "Let's think step by step. According to the Hardy Weinberg Law, $p^2 +\
+      \ 2 p q + q^2 = 1$, and $p + q = 1$ where $p$ is the frequency of the dominant\
+      \ allele, $q$ is the frequency of the recessive allele, and $p^2$, $q^2$, and\
+      \ $2pq$ are the frequencies of dominant homozygous, recessive homozygous, and\
+      \ heterozygous individuals, respectively. The frequency of the recessive\
+      \ allele (q) is $\\sqrt{\\frac{1}{400}} = 0.05$. We have $p = 1 - q = 0.95$.\
+      \ The frequency of heterozygous individuals is $2pq = 2 \\cdot 0.05 \\cdot 0.95\
+      \ = 0.095$. The number of heterozygous individuals is equal to the frequency\
+      \ of heterozygous individuals times the size of the population, or $0.095 *\
+      \ 400 = 38$. So we end up with 38/400. The answer is (D)."
+  - question: 'According to the pressure-flow model of movement of phloem contents, photosynthate
+      movement from source to sink is driven by
+
+      (A) an ATP-dependent pressure-flow pump (B) a water-pressure potential gradient
+      (C) transpiration (D) apoplastic diffusion'
+    target: Let's think step by step. It is a gradient in water pressure that induces
+      the movement of phloem content, which refers to answer (B). The mechanism of
+      movement does not rely on metabolism, which rules out (A). Transpiration refers
+      to the exhalation of water vapor through plant stomata, and is also not related,
+      which rules out (C). While the apoplastic pathway is one of two main pathways
+      for water transport in plants, it is not central to the pressure flow model,
+      which rules out (D). The answer is (B).
+  - question: 'Which of the following contain DNA sequences required for the segregation
+      of chromosomes in mitosis and meiosis?
+
+      (A) Telomeres (B) Centromeres (C) Nucleosomes (D) Spliceosomes'
+    target: Let's think step by step. The genetic material in Telomeres is not used,
+      which rules out (A). Nucleosomes are the repeating subunit that comprises chromatin
+      packed in a cell nucleus, and do not specifically refer to DNA sequences necessary
+      for segregating chromosomes in cell division, which rules out (C). A spliceosome
+      is a large ribonucleoprotein that removes introns from transcribed pre-mRNA
+      rather than governing chromosome segregation. Centromeres are directly responsible
+      for segregating chromosomes in cell division. The answer is (B).
+  - question: 'The presence of homologous structures in two different organisms, such
+      as the humerus in the front limb of a human and a bird, indicates that
+
+      (A) the human and bird are polyphyletic species (B) a human''s and bird''s evolution
+      is convergent (C) the human and bird belong to a clade (D) the human and bird
+      developed by analogy'
+    target: 'Let''s think step by step. Polyphyletic species are organisms that are
+      grouped due to having similar characteristics but which do not have a common
+      ancestor. This is not the case for humans and birds, which rules out (A). Convergent
+      evolution refers to the independent development of similar features in different
+      species at different periods, which is also not the case for humans and birds,
+      which rules out (B). Analogy refers to the superficial resemblance of structures
+      that have different origins, which is not the case for the human and bird forearms,
+      which rules out (D). Humans and birds do belong to the same clade - a group
+      of organisms consisting of a common ancestor and all of its descendants. The answer is (C).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_biology
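
The Hardy-Weinberg sample in this file can be checked with exact fractions. This is only a sketch of the arithmetic in that `target`; the variable names are chosen for this illustration.

```python
from fractions import Fraction

# 1 in 400 people is affected, so q^2 = 1/400 under Hardy-Weinberg equilibrium.
affected = Fraction(1, 400)

# q is the recessive allele frequency; 1/20 = 0.05 is the square root of 1/400.
q = Fraction(1, 20)
p = 1 - q  # dominant allele frequency, 19/20 = 0.95

# Carriers are heterozygous (frequency 2pq): they carry b but are not affected.
carriers = 2 * p * q

print(q * q == affected)   # confirms q is the square root of 1/400
print(float(carriers))     # heterozygote frequency as a decimal
print(carriers * 400)      # carriers per 400 people
```

The result reduces to 19/200, which is the same proportion as option (D), 38/400.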
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_chemistry.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_chemistry.yaml
-"dataset_name": "college_chemistry"
-"description": "The following are multiple choice questions (with answers) about college\
-  \ chemistry.\n\nQ: 3 Cl−(aq) + 4 CrO_4^2−(aq) + 23 H+(aq) → 3 HClO2(aq) + 4 Cr3+(aq)\
-  \ + 10 H2O(l). In the reaction shown above, Cl−(aq) behaves as\n(A) an acid (B)\
-  \ a base (C) a catalyst (D) a reducing agent\nA: Let's think step by step. A molecule\
-  \ that behaves as a base accepts an H+ ion (or proton) from another molecule, whereas\
-  \ a molecule that behaves as an acid donates an H+ ion (or proton) to another molecule.\
-  \ Neither of these is the case for Cl in this reaction, which rules out (A) and\
-  \ (B). A catalyst is a substance that only accelerates a reaction without itself\
-  \ undergoing chemical change, which is not the case here. This rules out (C). Instead,\
-  \ the $Cl^{-} molecules carry a negative charge, which they donate in the reaction\
-  \ to form 3 HClO2. This is the behavior of a reducing agent, or (D). The answer\
-  \ is (D).\n\nQ: Which of the following statements about the lanthanide elements\
-  \ is NOT true?\n(A) The most common oxidation state for the lanthanide elements\
-  \ is +3. (B) Lanthanide complexes often have high coordination numbers (> 6). (C)\
-  \ All of the lanthanide elements react with aqueous acid to liberate hydrogen. (D)\
-  \ The atomic radii of the lanthanide elements increase across the period from La\
-  \ to Lu.\nA: Let's think step by step. The atomic radii of the lanthanide elements\
-  \ in fact decrease across the period from La to Lu. Options (A), (B), and (C) are\
-  \ all true. This means that only (D) is NOT true. The answer is (D).\n\nQ: Which\
-  \ of the following lists the hydrides of group-14 elements in order of thermal stability,\
-  \ from lowest to highest?\n(A) PbH4 < SnH4 < GeH4 < SiH4 < CH4 (B) PbH4 < SnH4 <\
-  \ CH4 < GeH4 < SiH4 (C) CH4 < SiH4 < GeH4 < SnH4 < PbH4 (D) CH4 < PbH4 < GeH4 <\
-  \ SnH4 < SiH4\nA: Let's think step by step. The thermal stability of group-14 hydrides\
-  \ decreases as we move from the top of group 14 to the bottom. The order of elements\
-  \ in the group from top to bottom is C, Si, Ge, Sn, Pb. Therefore in order of increasing\
-  \ thermal stability we have PbH4, SnH4, GeH4, SiH4, and CH4, or answer (A). The\
-  \ answer is (A).\n\nQ: Predict the number of lines in the EPR spectrum of a solution\
-  \ of 13C-labelled methyl radical (13CH3•), assuming the lines do not overlap.\n\
-  (A) 4 (B) 3 (C) 6 (D) 24 (E) 8\nA: Let's think step by step. The electron paramagnetic\
-  \ resonance spectrum will be split by two forms of interactions. The first is the\
-  \ hyperfine interaction with the 13C (nuclear spin $I = \nrac{1}{2}$) which will\
-  \ split the spectrum into 2 lines. This will be further split into 4 lines by the\
-  \ interaction with three equivalent 1H nuclei. The total number of lines is therefore\
-  \ $2 \\cdot 4 = 8$. The answer is (E).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_college_chemistry"
+dataset_name: college_chemistry
+description: The following are multiple choice questions (with answers) about college
+  chemistry.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: "3 Cl\u2212(aq) + 4 CrO_4^2\u2212(aq) + 23 H+(aq) \u2192 3 HClO2(aq) +\
+      \ 4 Cr3+(aq) + 10 H2O(l). In the reaction shown above, Cl\u2212(aq) behaves\
+      \ as\n(A) an acid (B) a base (C) a catalyst (D) a reducing agent"
+    target: Let's think step by step. A molecule that behaves as a base accepts an
+      H+ ion (or proton) from another molecule, whereas a molecule that behaves as
+      an acid donates an H+ ion (or proton) to another molecule. Neither of these
+      is the case for Cl in this reaction, which rules out (A) and (B). A catalyst
+      is a substance that only accelerates a reaction without itself undergoing chemical
+      change, which is not the case here. This rules out (C). Instead, the $Cl^{-}$
+      ions carry a negative charge, which they donate in the reaction to form
+      3 HClO2. This is the behavior of a reducing agent, or (D). The answer is (D).
+  - question: 'Which of the following statements about the lanthanide elements is NOT
+      true?
+
+      (A) The most common oxidation state for the lanthanide elements is +3. (B) Lanthanide
+      complexes often have high coordination numbers (> 6). (C) All of the lanthanide
+      elements react with aqueous acid to liberate hydrogen. (D) The atomic radii
+      of the lanthanide elements increase across the period from La to Lu.'
+    target: Let's think step by step. The atomic radii of the lanthanide elements
+      in fact decrease across the period from La to Lu. Options (A), (B), and (C)
+      are all true. This means that only (D) is NOT true. The answer is (D).
+  - question: 'Which of the following lists the hydrides of group-14 elements in order
+      of thermal stability, from lowest to highest?
+
+      (A) PbH4 < SnH4 < GeH4 < SiH4 < CH4 (B) PbH4 < SnH4 < CH4 < GeH4 < SiH4 (C)
+      CH4 < SiH4 < GeH4 < SnH4 < PbH4 (D) CH4 < PbH4 < GeH4 < SnH4 < SiH4'
+    target: Let's think step by step. The thermal stability of group-14 hydrides decreases
+      as we move from the top of group 14 to the bottom. The order of elements in
+      the group from top to bottom is C, Si, Ge, Sn, Pb. Therefore in order of increasing
+      thermal stability we have PbH4, SnH4, GeH4, SiH4, and CH4, or answer (A). The
+      answer is (A).
+  - question: "Predict the number of lines in the EPR spectrum of a solution of 13C-labelled\
+      \ methyl radical (13CH3\u2022), assuming the lines do not overlap.\n(A) 4 (B)\
+      \ 3 (C) 6 (D) 24 (E) 8"
+    target: "Let's think step by step. The electron paramagnetic resonance spectrum\
+      \ will be split by two forms of interactions. The first is the hyperfine interaction\
+      \ with the 13C (nuclear spin $I = \\frac{1}{2}$) which will split the spectrum\
+      \ into 2 lines. This will be further split into 4 lines by the interaction with\
+      \ three equivalent 1H nuclei. The total number of lines is therefore $2 \\cdot\
+      \ 4 = 8$. The answer is (E)."
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_chemistry
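
The line count in the EPR sample follows the standard rule that n equivalent nuclei of spin I split each line into 2nI + 1 lines. The sketch below applies that rule to the 13CH3 radical; `hyperfine_lines` is an illustrative name, not something defined in the repository.

```python
from fractions import Fraction


def hyperfine_lines(n_equivalent: int, spin: Fraction) -> int:
    """Number of lines from n equivalent nuclei of the given spin: 2nI + 1."""
    return int(2 * n_equivalent * spin + 1)


half = Fraction(1, 2)  # both 13C and 1H have nuclear spin 1/2

lines_13c = hyperfine_lines(1, half)  # one 13C nucleus
lines_1h = hyperfine_lines(3, half)   # three equivalent 1H nuclei

# With no overlap, the splittings multiply.
total_lines = lines_13c * lines_1h
print(lines_13c, lines_1h, total_lines)
```

This reproduces the 2 x 4 = 8 lines stated in the `target`, which corresponds to option (E).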
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_computer_science.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_computer_science.yaml
-"dataset_name": "college_computer_science"
-"description": "The following are multiple choice questions (with answers) about college\
-  \ computer science.\n\nQ: Which of the following regular expressions is equivalent\
-  \ to (describes the same set of strings as) (a* + b)*(c + d)?\n(A) a*(c + d)+ b(c\
-  \ + d)\n(B) a*(c + d)* + b(c + d)*\n(C) a*(c + d)+ b*(c + d)\n(D) (a + b)*c +(a\
-  \ + b)*d\nA: Let's think step by step. We know that:\n1. (X* + Y)* = (X + Y)*\n\
-  2. X(Y + Z)? = XY + XZ\nUsing equation 1 we can rewrite (a* + b)*(c + d)? as:\n\
-  3. (a + b)*(c + d)?\nUsing equation 2 we can rewrite equation 3 as:\n(a + b)*c +\
-  \ (a + b)*d The answer is (D).\n\nQ: The Singleton design pattern is used to guarantee\
-  \ that only a single instance of a class may be instantiated. Which of the following\
-  \ is (are) true of this design pattern?\nI. The Singleton class has a static factory\
-  \ method to provide its instance.\nII. The Singleton class can be a subclass of\
-  \ another class.\nIII. The Singleton class has a private constructor.\n(A) I only\n\
-  (B) II only\n(C) III only\n(D) I, II, and III\nA: Let's think step by step. Statement\
-  \ I is a correct statement about a Singleton, because a Singleton restricts instantiation\
-  \ to a single, static method. Statement II is also correct, because there is no\
-  \ inherent restriction regarding the inheritance of a Singleton. Statement III is\
-  \ also correct, because a Singletons must be instantiated only once, so its constructor\
-  \ is made private to prevent any construction except via its static factory method.\n\
-  Given these facts, statements I, II, and III are all correct. The answer is (D).\n\
-  \nQ: A certain pipelined RISC machine has 8 general-purpose registers R0, R1, .\
-  \ . . , R7 and supports the following operations:\nADD Rs1, Rs2, Rd (Add Rs1 to\
-  \ Rs2 and put the sum in Rd)\nMUL Rs1, Rs2, Rd (Multiply Rs1 by Rs2 and put the\
-  \ product in Rd)\nAn operation normally takes one cycle; however, an operation takes\
-  \ two cycles if it produces a result required by the immediately following operation\
-  \ in an operation sequence.\nConsider the expression AB + ABC + BC, where variables\
-  \ A, B, C are located in registers R0, R1, R2. If the contents of these three registers\
-  \ must not be modified, what is the minimum number of clock cycles required for\
-  \ an operation sequence that computes the value of AB + ABC + BC?\n(A) 5 (B) 6 (C)\
-  \ 7 (D) 8\nA: Let's think step by step. First, we are given that A is in R0, B is\
-  \ in R1, and C is in R2.\nNext, we can see that we must compute three multiplies\
-  \ (AB, BC, and ABC) and two adds (AB + ABC, (AB + ABC) + BC) to compute our final\
-  \ answer, resulting in a minimum of five clock cycles.\nNext, we can see that there\
-  \ is no way to avoid at least one pipeline stall when computing our final answer,\
-  \ because to compute our final sum we must wait at least one cycle for the results\
-  \ from the previous stage to be ready. Thus, our minimum number of cycles must be\
-  \ 6.\nWe can verify that we can create a solution that requires only six cycles\
-  \ as follows:\ncompute AB: MUL R0, R1, R3\ncompute BC: MUL R1, R2, R4\ncompute ABC:\
-  \ MUL R3, R4, R5\ncompute AB + BC: ADD R3, R4, R6\nSTALL\ncompute AB + ABC + BC:\
-  \ ADD R5, R6, R7\nSo there are 6 cycles. The answer is (B).\n\nQ: A compiler generates\
-  \ code for the following assignment statement.\nG := (A + B) * C - (D + E) * F\n\
-  The target machine has a single accumulator and a single-address instruction set\
-  \ consisting of instructions load, store, add, subtract, and multiply. For the arithmetic\
-  \ operations, the left operand is taken from the accumulator and the result appears\
-  \ in the accumulator. The smallest possible number of instructions in the resulting\
-  \ code is\n(A) 5 (B) 6 (C) 7 (D) 9\nA: Let's think step by step. We can compute\
-  \ the final answer with the following sequence of operations:\n1. LOAD D  (accumulator\
-  \ = D)\n2. ADD E  (accumulator = D+E)\n3. MUL F  (accumulator = (D+E)*F)\n4. STORE\
-  \ X (X = (D+E)*F)\n5. LOAD A  (accumulator = A)\n6. ADD B  (accumulator = A+B)\n\
-  7. MUL C  (accumulator = (A+B)*C)\n8. SUB X  (accumulator = (A+B)*C - (D+E)*F)\n\
-  9. STORE G (G = (A+B)*C - (D+E)*F)\nThis sequence takes 9 instructions. The answer\
-  \ is (D).\n\nQ: Consider a computer design in which multiple processors, each with\
-  \ a private cache memory, share global memory using a single bus. This bus is the\
-  \ critical system resource. Each processor can execute one instruction every 500\
-  \ nanoseconds as long as memory references are satisfied by its local cache. When\
-  \ a cache miss occurs, the processor is delayed for an additional 2,000 nanoseconds.\
-  \ During half of this additional delay, the bus is dedicated to serving the cache\
-  \ miss. During the other half, the processor cannot continue, but the bus is free\
-  \ to service requests from other processors. On average, each instruction requires\
-  \ 2 memory references. On average, cache misses occur on 1 percent of references.\
-  \ What proportion of the capacity of the bus would a single processor consume, ignoring\
-  \ delays due to competition from other processors?\n(A) 1/50 (B) 1/27 (C) 1/25 (D)\
-  \ 2/27\nA: Let's think step by step. We know that each instruction requires two\
-  \ memory references per instruction, and that there is an average cache miss rate\
-  \ of one percent.\nThus a given processor has:\n(1 cache miss / 100 references)\
-  \ * (2 references / instruction) =\n(2 cache misses / 100 instructions), so:\nmisses_per_instruction\
-  \ = 1 cache miss / 50 instructions.\nNext, we know that each instruction requires\
-  \ 500 nanoseconds when there is no cache miss, and 500 + 2000 = 2500 nanoseconds\
-  \ when there is a cache miss. Thus:\n50 instructions / (49 * 500) + (1 * 2500) nanoseconds,\
-  \ so:\ninstructions_per_ns = 50 instructions / 27000 nanoseconds.\nNow, we know\
-  \ that each cache miss locks the bus for half of the 2000 nanosecond cache miss\
-  \ delay, or 1000 nanoseconds, so:\nlock_ns_per_miss = 1000 nanoseconds / cache miss.\n\
-  Thus we can see that on average a single processor will lock the bus for:\nlock_ns_per_miss\
-  \ * misses_per_instruction * instructions_per_ns =\n(1000 nanoseconds / cache miss)\
-  \ * (1 cache miss / 50 instructions) * (50 instructions / 27000 nanoseconds) = 1000\
-  \ * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_college_computer_science"
+dataset_name: college_computer_science
+description: The following are multiple choice questions (with answers) about college
+  computer science.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of the following regular expressions is equivalent to (describes
+      the same set of strings as) (a* + b)*(c + d)?
+
+      (A) a*(c + d)+ b(c + d)
+
+      (B) a*(c + d)* + b(c + d)*
+
+      (C) a*(c + d)+ b*(c + d)
+
+      (D) (a + b)*c +(a + b)*d'
+    target: 'Let''s think step by step. We know that:
+
+      1. (X* + Y)* = (X + Y)*
+
+      2. X(Y + Z)? = XY + XZ
+
+      Using equation 1 we can rewrite (a* + b)*(c + d)? as:
+
+      3. (a + b)*(c + d)?
+
+      Using equation 2 we can rewrite equation 3 as:
+
+      (a + b)*c + (a + b)*d The answer is (D).'
+  - question: 'The Singleton design pattern is used to guarantee that only a single instance
+      of a class may be instantiated. Which of the following is (are) true of this
+      design pattern?
+
+      I. The Singleton class has a static factory method to provide its instance.
+
+      II. The Singleton class can be a subclass of another class.
+
+      III. The Singleton class has a private constructor.
+
+      (A) I only
+
+      (B) II only
+
+      (C) III only
+
+      (D) I, II, and III'
+    target: 'Let''s think step by step. Statement I is a correct statement about a
+      Singleton, because a Singleton restricts instantiation to a single, static method.
+      Statement II is also correct, because there is no inherent restriction regarding
+      the inheritance of a Singleton. Statement III is also correct, because a Singleton
+      must be instantiated only once, so its constructor is made private to prevent
+      any construction except via its static factory method.
+
+      Given these facts, statements I, II, and III are all correct. The answer is
+      (D).'
+  - question: 'A certain pipelined RISC machine has 8 general-purpose registers R0, R1,
+      . . . , R7 and supports the following operations:
+
+      ADD Rs1, Rs2, Rd (Add Rs1 to Rs2 and put the sum in Rd)
+
+      MUL Rs1, Rs2, Rd (Multiply Rs1 by Rs2 and put the product in Rd)
+
+      An operation normally takes one cycle; however, an operation takes two cycles
+      if it produces a result required by the immediately following operation in an
+      operation sequence.
+
+      Consider the expression AB + ABC + BC, where variables A, B, C are located in
+      registers R0, R1, R2. If the contents of these three registers must not be modified,
+      what is the minimum number of clock cycles required for an operation sequence
+      that computes the value of AB + ABC + BC?
+
+      (A) 5 (B) 6 (C) 7 (D) 8'
+    target: 'Let''s think step by step. First, we are given that A is in R0, B is
+      in R1, and C is in R2.
+
+      Next, we can see that we must compute three multiplies (AB, BC, and ABC) and
+      two adds (AB + ABC, (AB + ABC) + BC) to compute our final answer, resulting
+      in a minimum of five clock cycles.
+
+      Next, we can see that there is no way to avoid at least one pipeline stall when
+      computing our final answer, because to compute our final sum we must wait at
+      least one cycle for the results from the previous stage to be ready. Thus, our
+      minimum number of cycles must be 6.
+
+      We can verify that we can create a solution that requires only six cycles as
+      follows:
+
+      compute AB: MUL R0, R1, R3
+
+      compute BC: MUL R1, R2, R4
+
+      compute ABC: MUL R3, R2, R5
+
+      compute AB + BC: ADD R3, R4, R6
+
+      STALL
+
+      compute AB + ABC + BC: ADD R5, R6, R7
+
+      So there are 6 cycles. The answer is (B).'
+  - question: 'A compiler generates code for the following assignment statement.
+
+      G := (A + B) * C - (D + E) * F
+
+      The target machine has a single accumulator and a single-address instruction
+      set consisting of instructions load, store, add, subtract, and multiply. For
+      the arithmetic operations, the left operand is taken from the accumulator and
+      the result appears in the accumulator. The smallest possible number of instructions
+      in the resulting code is
+
+      (A) 5 (B) 6 (C) 7 (D) 9'
+    target: 'Let''s think step by step. We can compute the final answer with the following
+      sequence of operations:
+
+      1. LOAD D  (accumulator = D)
+
+      2. ADD E  (accumulator = D+E)
+
+      3. MUL F  (accumulator = (D+E)*F)
+
+      4. STORE X (X = (D+E)*F)
+
+      5. LOAD A  (accumulator = A)
+
+      6. ADD B  (accumulator = A+B)
+
+      7. MUL C  (accumulator = (A+B)*C)
+
+      8. SUB X  (accumulator = (A+B)*C - (D+E)*F)
+
+      9. STORE G (G = (A+B)*C - (D+E)*F)
+
+      This sequence takes 9 instructions. The answer is (D).'
+  - question: 'Consider a computer design in which multiple processors, each with a private
+      cache memory, share global memory using a single bus. This bus is the critical
+      system resource. Each processor can execute one instruction every 500 nanoseconds
+      as long as memory references are satisfied by its local cache. When a cache
+      miss occurs, the processor is delayed for an additional 2,000 nanoseconds. During
+      half of this additional delay, the bus is dedicated to serving the cache miss.
+      During the other half, the processor cannot continue, but the bus is free to
+      service requests from other processors. On average, each instruction requires
+      2 memory references. On average, cache misses occur on 1 percent of references.
+      What proportion of the capacity of the bus would a single processor consume,
+      ignoring delays due to competition from other processors?
+
+      (A) 1/50 (B) 1/27 (C) 1/25 (D) 2/27'
+    target: 'Let''s think step by step. We know that each instruction requires two
+      memory references per instruction, and that there is an average cache miss rate
+      of one percent.
+
+      Thus a given processor has:
+
+      (1 cache miss / 100 references) * (2 references / instruction) =
+
+      (2 cache misses / 100 instructions), so:
+
+      misses_per_instruction = 1 cache miss / 50 instructions.
+
+      Next, we know that each instruction requires 500 nanoseconds when there is no
+      cache miss, and 500 + 2000 = 2500 nanoseconds when there is a cache miss. Thus:
+
+      50 instructions / ((49 * 500) + (1 * 2500)) nanoseconds, so:
+
+      instructions_per_ns = 50 instructions / 27000 nanoseconds.
+
+      Now, we know that each cache miss locks the bus for half of the 2000 nanosecond
+      cache miss delay, or 1000 nanoseconds, so:
+
+      lock_ns_per_miss = 1000 nanoseconds / cache miss.
+
+      Thus we can see that on average a single processor will lock the bus for:
+
+      lock_ns_per_miss * misses_per_instruction * instructions_per_ns =
+
+      (1000 nanoseconds / cache miss) * (1 cache miss / 50 instructions) * (50 instructions
+      / 27000 nanoseconds) = 1000 * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer
+      is (B).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_computer_science
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_mathematics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_mathematics.yaml
-"dataset_name": "college_mathematics"
-"description": "The following are multiple choice questions (with answers) about college\
-  \ mathematics.\n\nQ: Let V be the set of all real polynomials p(x). Let transformations\
-  \ T, S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p'(x) = d/dx p(x), and interpret\
-  \ (ST)(p(x)) as S(T(p(x))). Which of the following is true?\n(A) ST = 0 (B) ST =\
-  \ T (C) ST = TS (D) ST - TS is the identity map of V onto itself.\nA: Let's think\
-  \ step by step. For a given polynomial $p$ we have\n\\[ST(p) = (xp(x))’ = p(x) +\
-  \ xp’(x)\\]\nand\n\\[TS(p) = xp’(x).\\]\nHence \\[ST(p) - TS(p) = p(x) + xp’(x)\
-  \ - xp’(x).\\] The answer is (D).\n\nQ: Suppose that f(1 + x) = f(x) for all real\
-  \ x. If f is a polynomial and f(5) = 11, then f(15/2)\n(A) -11 (B) 0 (C) 11 (D)\
-  \ 33/2\nA: Let's think step by step. The only polynomial so that $f(1 + x) = f(x)$\
-  \ is a constant polynomial. Hence $f(5) = 11 = f(15/2)$. The answer is (C).\n\n\
-  Q: Let A be a real 2x2 matrix. Which of the following statements must be true?\n\
-  I. All of the entries of A^2 are nonnegative.\nII. The determinant of A^2 is nonnegative.\n\
-  III. If A has two distinct eigenvalues, then A^2 has two distinct eigenvalues.\n\
-  (A) I only (B) II only (C) III only (D) II and III only\nA: Let's think step by\
-  \ step. We have \\[ det(A^2) = (det(A))^2 \\geq 0,\\] hence II holds.\nIII is false:\
-  \ as a counterexample take a diagonal matrix with -1 and 1 on the diagonal. Then\
-  \ $A^2$ is the identity matrix. The answer is (B).\n\nQ: Let A be the set of all\
-  \ ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest\
-  \ negative number in the set B = {m + n : (m, n) \\in A}?\n(A) -5 (B) -4 (C) -3\
-  \ (D) -2\nA: Let's think step by step. We have 12n = 22 - 7m and one of the solutions\
-  \ is $m = -2$, $n = 3$. Then $m + n = 1$, hence we need to look for smaller $m$\
-  \ in order to make $m + n$ negative. The next solution is $m = -14$ and $n = 10$.\
-  \ For smaller $m$ we have $m + n$ smaller than $-4$. The answer is (B).\n\nQ: A\
-  \ tank initially contains a salt solution of 3 grams of salt dissolved in 100 liters\
-  \ of water. A salt solution containing 0.02 grams of salt per liter of water is\
-  \ sprayed into the tank at a rate of 4 liters per minute. The sprayed solution is\
-  \ continually mixed with the salt solution in the tank, and the mixture flows out\
-  \ of the tank at a rate of 4 liters per minute. If the mixing is instantaneous,\
-  \ how many grams of salt are in the tank after 100 minutes have elapsed?\n(A) 2\
-  \ (B) 2 - e^-2 (C) 2 + e^-2 (D) 2 + e^-4\nA: Let's think step by step. For all $t\
-  \ \\in \\mathbb{R}$, let $s(t)$ denote the number grams of salt in the tank at the\
-  \ $t$ minute mark. Then $s(0) = 3$.\nWe use $s$ and $s(t)$ interchangeably. We also\
-  \ use $s^{\\prime}$ and $s^{\\prime}(t)$ interchangeably. The solution sprayed into\
-  \ the tank adds $(0.02) 4=2 / 25$ grams of salt per minute. There are always 100\
-  \ liters of liquid in the tank, containing $s$ grams of salt. So the density of\
-  \ salt in the tank is $s / 100$ grams per liter. The flow of water out of the tank\
-  \ therefore subtracts $4(s / 100)=s / 25$ grams of salt per minute. Then, for all\
-  \ $t \\in \\mathbb{R}$, we have $s^{\\prime}(t)=(2 / 25)-(s / 25)=(2-s) / 25$, and\
-  \ so $[s(t)=2] \\Rightarrow\\left[s^{\\prime}(t)=0\right]$. For all $t \\in \\mathbb{R}$,\n\
-  $$\n\frac{d}{d t}[\\ln (s-2)]=\frac{s^{\\prime}}{s-2}=\frac{-1}{25}=\frac{d}{d t}\\\
-  left[-\frac{t}{25}\right] .\n$$\nChoose $C \\in \\mathbb{R}$ such that, for all\
-  \ $t \\in \\mathbb{R}, \\ln ((s(t)-2))=-[t / 25]+C$. Let $K:=e^{C}$. Then, for all\
-  \ $t \\in \\mathbb{R}$, we have $(s(t))-2=K e^{-t / 25}$, and so $s(t)=2+K e^{-t\
-  \ / 25}$. Then $3=s(0)=2+K e^{0}=2+K$, so $K=1$. Then $s(100)=2+K e^{-100 / 25}=2+1\
-  \ \\cdot e^{-4}=2+e^{-4}$. The answer is (D).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_college_mathematics"
+dataset_name: college_mathematics
+description: The following are multiple choice questions (with answers) about college
+  mathematics.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Let V be the set of all real polynomials p(x). Let transformations T,
+      S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p''(x) = d/dx p(x), and interpret
+      (ST)(p(x)) as S(T(p(x))). Which of the following is true?
+
+      (A) ST = 0 (B) ST = T (C) ST = TS (D) ST - TS is the identity map of V onto
+      itself.'
+    target: "Let's think step by step. For a given polynomial $p$ we have\n\\[ST(p)\
+      \ = (xp(x))\u2019 = p(x) + xp\u2019(x)\\]\nand\n\\[TS(p) = xp\u2019(x).\\]\n\
+      Hence \\[ST(p) - TS(p) = p(x) + xp\u2019(x) - xp\u2019(x).\\] The answer is\
+      \ (D)."
+  - question: 'Suppose that f(1 + x) = f(x) for all real x. If f is a polynomial and
+      f(5) = 11, then f(15/2)
+
+      (A) -11 (B) 0 (C) 11 (D) 33/2'
+    target: Let's think step by step. The only polynomial so that $f(1 + x) = f(x)$
+      is a constant polynomial. Hence $f(5) = 11 = f(15/2)$. The answer is (C).
+  - question: 'Let A be a real 2x2 matrix. Which of the following statements must be
+      true?
+
+      I. All of the entries of A^2 are nonnegative.
+
+      II. The determinant of A^2 is nonnegative.
+
+      III. If A has two distinct eigenvalues, then A^2 has two distinct eigenvalues.
+
+      (A) I only (B) II only (C) III only (D) II and III only'
+    target: 'Let''s think step by step. We have \[ det(A^2) = (det(A))^2 \geq 0,\]
+      hence II holds.
+
+      III is false: as a counterexample take a diagonal matrix with -1 and 1 on the
+      diagonal. Then $A^2$ is the identity matrix. The answer is (B).'
+  - question: 'Let A be the set of all ordered pairs of integers (m, n) such that 7m
+      + 12n = 22. What is the greatest negative number in the set B = {m + n : (m,
+      n) \in A}?
+
+      (A) -5 (B) -4 (C) -3 (D) -2'
+    target: Let's think step by step. We have 12n = 22 - 7m and one of the solutions
+      is $m = -2$, $n = 3$. Then $m + n = 1$, hence we need to look for smaller $m$
+      in order to make $m + n$ negative. The next solution is $m = -14$ and $n = 10$.
+      For smaller $m$ we have $m + n$ smaller than $-4$. The answer is (B).
+  - question: 'A tank initially contains a salt solution of 3 grams of salt dissolved
+      in 100 liters of water. A salt solution containing 0.02 grams of salt per liter
+      of water is sprayed into the tank at a rate of 4 liters per minute. The sprayed
+      solution is continually mixed with the salt solution in the tank, and the mixture
+      flows out of the tank at a rate of 4 liters per minute. If the mixing is instantaneous,
+      how many grams of salt are in the tank after 100 minutes have elapsed?
+
+      (A) 2 (B) 2 - e^-2 (C) 2 + e^-2 (D) 2 + e^-4'
+    target: "Let's think step by step. For all $t \\in \\mathbb{R}$, let $s(t)$ denote\
+      \ the number grams of salt in the tank at the $t$ minute mark. Then $s(0) =\
+      \ 3$.\nWe use $s$ and $s(t)$ interchangeably. We also use $s^{\\prime}$ and\
+      \ $s^{\\prime}(t)$ interchangeably. The solution sprayed into the tank adds\
+      \ $(0.02) 4=2 / 25$ grams of salt per minute. There are always 100 liters of\
+      \ liquid in the tank, containing $s$ grams of salt. So the density of salt in\
+      \ the tank is $s / 100$ grams per liter. The flow of water out of the tank therefore\
+      \ subtracts $4(s / 100)=s / 25$ grams of salt per minute. Then, for all $t \\\
+      in \\mathbb{R}$, we have $s^{\\prime}(t)=(2 / 25)-(s / 25)=(2-s) / 25$, and\
+      \ so $[s(t)=2] \\Rightarrow\\left[s^{\\prime}(t)=0\\right]$. For all $t \\in\
+      \ \\mathbb{R}$,\n$$\n\\frac{d}{d t}[\\ln (s-2)]=\\frac{s^{\\prime}}{s-2}=\\frac{-1}{25}=\\frac{d}{d\
+      \ t}\\left[-\\frac{t}{25}\\right] .\n$$\nChoose $C \\in \\mathbb{R}$ such\
+      \ that, for all $t \\in \\mathbb{R}, \\ln ((s(t)-2))=-[t / 25]+C$. Let $K:=e^{C}$.\
+      \ Then, for all $t \\in \\mathbb{R}$, we have $(s(t))-2=K e^{-t / 25}$, and\
+      \ so $s(t)=2+K e^{-t / 25}$. Then $3=s(0)=2+K e^{0}=2+K$, so $K=1$. Then $s(100)=2+K\
+      \ e^{-100 / 25}=2+1 \\cdot e^{-4}=2+e^{-4}$. The answer is (D).\n\n"
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_mathematics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_medicine.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_medicine.yaml
-"dataset_name": "college_medicine"
-"description": "The following are multiple choice questions (with answers) about college\
-  \ medicine.\n\nQ: An expected side effect of creatine supplementation is:\n(A) muscle\
-  \ weakness. (B) gain in body mass. (C) muscle cramps. (D) loss of electrolytes.\n\
-  A: Let's think step by step. We refer to Wikipedia articles on medicine for help.\
-  \ Creatine supplementation is a dietary supplement that results in body mass gain.\
-  \ The answer is (B).\n\nQ: Which of the following is not a true statement?\n(A)\
-  \ Muscle glycogen is broken down enzymatically to glucose-1-phosphate (B) Elite\
-  \ endurance runners have a high proportion of Type I fibres in their leg muscles\
-  \ (C) Liver glycogen is important in the maintenance of the blood glucose concentration\
-  \ (D) Insulin promotes glucose uptake by all tissues in the body\nA: Let's think\
-  \ step by step. We refer to Wikipedia articles on medicine for help. Let’s solve\
-  \ this step by step and go over each choice: \n(A) “Muscle glycogen is broken down\
-  \ enzymatically to glucose-1-phosphate”: This is a correct statement.\n(B) “Elite\
-  \ endurance runners have a high proportion of Type I fibres in their leg muscles”:\
-  \ This is a correct statement.\n(C) “Liver glycogen is important in the maintenance\
-  \ of the blood glucose concentration”: This is a correct statement. \n(D) “Insulin\
-  \ promotes glucose uptake by all tissues in the body”: This is not a correct statement,\
-  \ because insulin promotes glucose uptake by the liver, adipose tissue, and muscle,\
-  \ but not all tissues. For instance, the tissues in the brain and red blood cells\
-  \ are not affected by insulin. The answer is (D).\n\nQ: A high school science teacher\
-  \ fills a 1 liter bottle with pure nitrogen and seals the lid. The pressure is 1.70\
-  \ atm, and the room temperature is 25°C. Which two variables will both increase\
-  \ the pressure of the system, if all other variables are held constant?\n(A) Increasing\
-  \ temperature, increasing moles of gas (B) Increasing temperature, increasing volume\
-  \ (C) Decreasing volume, decreasing temperature (D) Decreasing moles of gas, increasing\
-  \ volume\nA: Let's think step by step. We refer to Wikipedia articles on medicine\
-  \ for help. The relevant equation for this is the ideal gas law: PV=nRT. To increase\
-  \ the pressure of the system (P), then either n (number of moles of the gas) or\
-  \ T (temperature) have to increase. The answer is (A).\n\nQ: In a genetic test of\
-  \ a newborn, a rare genetic disorder is found that has X-linked recessive transmission.\
-  \ Which of the following statements is likely true regarding the pedigree of this\
-  \ disorder?\n(A) All descendants on the maternal side will have the disorder. (B)\
-  \ Females will be approximately twice as affected as males in this family. (C) All\
-  \ daughters of an affected male will be affected. (D) There will be equal distribution\
-  \ of males and females affected.\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on medicine for help. Let’s solve this step by step. Let's recall first\
-  \ that females have two X chromosomes, while males have one X and one Y chromosome.\
-  \ This is an important fact we need to know before answering this question. \nBecause\
-  \ a male can only pass his only one X chromosome to a daughter, if he is affected\
-  \ by this rare genetic disorder, then we know for sure that he will pass this rare\
-  \ genetic disorder to all his future-born daughters. Therefore, “(C): All daughters\
-  \ of an affected male will be affected” is a correct statement. The answer is (C).\n\
-  \nQ: Glucose is transported into the muscle cell:\n(A) via protein transporters\
-  \ called GLUT4. (B) only in the presence of insulin. (C) via hexokinase. (D) via\
-  \ monocarbylic acid transporters.\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on medicine for help. Glucose (also known as the blood sugar) is the\
-  \ main sugar found in the human body. It is transported into the muscle cell via\
-  \ diffusion through protein transporters called GLUT4. The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_college_medicine"
+dataset_name: college_medicine
+description: The following are multiple choice questions (with answers) about college
+  medicine.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'An expected side effect of creatine supplementation is:
+
+      (A) muscle weakness. (B) gain in body mass. (C) muscle cramps. (D) loss of electrolytes.'
+    target: Let's think step by step. We refer to Wikipedia articles on medicine for
+      help. Creatine supplementation is a dietary supplement that results in body
+      mass gain. The answer is (B).
+  - question: 'Which of the following is not a true statement?
+
+      (A) Muscle glycogen is broken down enzymatically to glucose-1-phosphate (B)
+      Elite endurance runners have a high proportion of Type I fibres in their leg
+      muscles (C) Liver glycogen is important in the maintenance of the blood glucose
+      concentration (D) Insulin promotes glucose uptake by all tissues in the body'
+    target: "Let's think step by step. We refer to Wikipedia articles on medicine\
+      \ for help. Let\u2019s solve this step by step and go over each choice: \n(A)\
+      \ \u201CMuscle glycogen is broken down enzymatically to glucose-1-phosphate\u201D\
+      : This is a correct statement.\n(B) \u201CElite endurance runners have a high\
+      \ proportion of Type I fibres in their leg muscles\u201D: This is a correct\
+      \ statement.\n(C) \u201CLiver glycogen is important in the maintenance of the\
+      \ blood glucose concentration\u201D: This is a correct statement. \n(D) \u201C\
+      Insulin promotes glucose uptake by all tissues in the body\u201D: This is not\
+      \ a correct statement, because insulin promotes glucose uptake by the liver,\
+      \ adipose tissue, and muscle, but not all tissues. For instance, the tissues\
+      \ in the brain and red blood cells are not affected by insulin. The answer is\
+      \ (D)."
+  - question: "A high school science teacher fills a 1 liter bottle with pure nitrogen\
+      \ and seals the lid. The pressure is 1.70 atm, and the room temperature is 25\xB0\
+      C. Which two variables will both increase the pressure of the system, if all\
+      \ other variables are held constant?\n(A) Increasing temperature, increasing\
+      \ moles of gas (B) Increasing temperature, increasing volume (C) Decreasing\
+      \ volume, decreasing temperature (D) Decreasing moles of gas, increasing volume"
+    target: 'Let''s think step by step. We refer to Wikipedia articles on medicine
+      for help. The relevant equation for this is the ideal gas law: PV=nRT. To increase
+      the pressure of the system (P), then either n (number of moles of the gas) or
+      T (temperature) have to increase. The answer is (A).'
+  - question: 'In a genetic test of a newborn, a rare genetic disorder is found that
+      has X-linked recessive transmission. Which of the following statements is likely
+      true regarding the pedigree of this disorder?
+
+      (A) All descendants on the maternal side will have the disorder. (B) Females
+      will be approximately twice as affected as males in this family. (C) All daughters
+      of an affected male will be affected. (D) There will be equal distribution of
+      males and females affected.'
+    target: "Let's think step by step. We refer to Wikipedia articles on medicine\
+      \ for help. Let\u2019s solve this step by step. Let's recall first that females\
+      \ have two X chromosomes, while males have one X and one Y chromosome. This\
+      \ is an important fact we need to know before answering this question. \nBecause\
+      \ a male can only pass his only one X chromosome to a daughter, if he is affected\
+      \ by this rare genetic disorder, then we know for sure that he will pass this\
+      \ rare genetic disorder to all his future-born daughters. Therefore, \u201C\
+      (C): All daughters of an affected male will be affected\u201D is a correct statement.\
+      \ The answer is (C)."
+  - question: 'Glucose is transported into the muscle cell:
+
+      (A) via protein transporters called GLUT4. (B) only in the presence of insulin.
+      (C) via hexokinase. (D) via monocarbylic acid transporters.'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on medicine
+      for help. Glucose (also known as the blood sugar) is the main sugar found in
+      the human body. It is transported into the muscle cell via diffusion through
+      protein transporters called GLUT4. The answer is (A).'
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_medicine
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_physics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_physics.yaml
-"dataset_name": "college_physics"
-"description": "The following are multiple choice questions (with answers) about college\
-  \ physics.\n\nQ: A refracting telescope consists of two converging lenses separated\
-  \ by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification\
-  \ of the telescope is\n(A) 4 (B) 5 (C) 6 (D) 20\nA: Let's think step by step. In\
-  \ a refracting telescope, if both lenses are converging, the focus of both lenses\
-  \ must be between the two lenses, and thus the focal lengths of the two lenses must\
-  \ add up to their separation. Since the focal length of one lens is 20 cm, the focal\
-  \ length of the other must be 80 cm. The magnification is the ratio of these two\
-  \ focal lengths, or 4. The answer is (A).\n\nQ: The muon decays with a characteristic\
-  \ lifetime of about 10^-6 second into an electron, a muon neutrino, and an electron\
-  \ antineutrino. The muon is forbidden from decaying into an electron and just a\
-  \ single neutrino by the law of conservation of\n(A) charge (B) mass (C) energy\
-  \ and momentum (D) lepton number\nA: Let's think step by step. Lepton number must\
-  \ be conserved, meaning the total number of leptons minus the number of antileptons.\
-  \ If a muon decays into an electron and a single neutrino, the total lepton number\
-  \ would go from one to two, violating lepton number conservation. The answer is\
-  \ (D).\n\nQ: One end of a Nichrome wire of length 2L and cross-sectional area A\
-  \ is attached to an end of another Nichrome wire of length L and cross- sectional\
-  \ area 2A. If the free end of the longer wire is at an electric potential of 8.0\
-  \ volts, and the free end of the shorter wire is at an electric potential of 1.0\
-  \ volt, the potential at the junction of the two wires is most nearly equal to\n\
-  (A) 2.4 V (B) 3.3 V (C) 4.5 V (D) 5.7 V\nA: Let's think step by step. This is a\
-  \ simple voltage divider problem, where the longer wire has a resistance four times\
-  \ that of the shorter end. So the voltage divider ratio is 1 / 5, meaning that the\
-  \ potential in the middle is 1.0 V + (8.0 V - 1.0 V) * 1/5 = 2.4 V. The answer is\
-  \ (A).\n\nQ: A refracting telescope consists of two converging lenses separated\
-  \ by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification\
-  \ of the telescope is\n(A) 4 (B) 5 (C) 6 (D) 20\nA: Let's think step by step. In\
-  \ a refracting telescope, if both lenses are converging, the focus of both lenses\
-  \ must be between the two lenses, and thus the focal lengths of the two lenses must\
-  \ add up to their separation. Since the focal length of one lens is 20 cm, the focal\
-  \ length of the other must be 80 cm. The magnification is the ratio of these two\
-  \ focal lengths, or 4. The answer is (A).\n\nQ: For which of the following thermodynamic\
-  \ processes is the increase in the internal energy of an ideal gas equal to the\
-  \ heat added to the gas?\n(A) Constant temperature (B) Constant volume (C) Constant\
-  \ pressure (D) Adiabatic\nA: Let's think step by step. Heat added to the gas can\
-  \ go into the gases internal energy or work done against an external force. However,\
-  \ if the volume of the gas container is constant, no work will be done (since work\
-  \ is pressure times change in volume). So, at constant volume, all of the heat goes\
-  \ into the internal energy. The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_college_physics"
+dataset_name: college_physics
+description: The following are multiple choice questions (with answers) about college
+  physics.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'A refracting telescope consists of two converging lenses separated by
+      100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification
+      of the telescope is
+
+      (A) 4 (B) 5 (C) 6 (D) 20'
+    target: Let's think step by step. In a refracting telescope, if both lenses are
+      converging, the focus of both lenses must be between the two lenses, and thus
+      the focal lengths of the two lenses must add up to their separation. Since the
+      focal length of one lens is 20 cm, the focal length of the other must be 80
+      cm. The magnification is the ratio of these two focal lengths, or 4. The answer
+      is (A).
+  - question: 'The muon decays with a characteristic lifetime of about 10^-6 second into
+      an electron, a muon neutrino, and an electron antineutrino. The muon is forbidden
+      from decaying into an electron and just a single neutrino by the law of conservation
+      of
+
+      (A) charge (B) mass (C) energy and momentum (D) lepton number'
+    target: Let's think step by step. Lepton number must be conserved, meaning the
+      total number of leptons minus the number of antileptons. If a muon decays into
+      an electron and a single neutrino, the total lepton number would go from one
+      to two, violating lepton number conservation. The answer is (D).
+  - question: 'One end of a Nichrome wire of length 2L and cross-sectional area A is
+      attached to an end of another Nichrome wire of length L and cross- sectional
+      area 2A. If the free end of the longer wire is at an electric potential of 8.0
+      volts, and the free end of the shorter wire is at an electric potential of 1.0
+      volt, the potential at the junction of the two wires is most nearly equal to
+
+      (A) 2.4 V (B) 3.3 V (C) 4.5 V (D) 5.7 V'
+    target: Let's think step by step. This is a simple voltage divider problem, where
+      the longer wire has a resistance four times that of the shorter end. So the
+      voltage divider ratio is 1 / 5, meaning that the potential in the middle is
+      1.0 V + (8.0 V - 1.0 V) * 1/5 = 2.4 V. The answer is (A).
+  - question: 'A refracting telescope consists of two converging lenses separated by
+      100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification
+      of the telescope is
+
+      (A) 4 (B) 5 (C) 6 (D) 20'
+    target: Let's think step by step. In a refracting telescope, if both lenses are
+      converging, the focus of both lenses must be between the two lenses, and thus
+      the focal lengths of the two lenses must add up to their separation. Since the
+      focal length of one lens is 20 cm, the focal length of the other must be 80
+      cm. The magnification is the ratio of these two focal lengths, or 4. The answer
+      is (A).
+  - question: 'For which of the following thermodynamic processes is the increase in
+      the internal energy of an ideal gas equal to the heat added to the gas?
+
+      (A) Constant temperature (B) Constant volume (C) Constant pressure (D) Adiabatic'
+    target: 'Let''s think step by step. Heat added to the gas can go into the gases
+      internal energy or work done against an external force. However, if the volume
+      of the gas container is constant, no work will be done (since work is pressure
+      times change in volume). So, at constant volume, all of the heat goes into the
+      internal energy. The answer is (B).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_college_physics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_computer_security.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_computer_security.yaml
-"dataset_name": "computer_security"
-"description": "The following are multiple choice questions (with answers) about computer\
-  \ security.\n\nQ: SHA-1 has a message digest of\n(A) 160 bits (B) 512 bits (C) 628\
-  \ bits (D) 820 bits\nA: Let's think step by step. Since SHA-1 is a hash function\
-  \ which takes an input and produces a 160-bit (20-byte) hash value, its message\
-  \ digest is 160 bits. The answer is (A).\n\nQ: _____________ can modify data on\
-  \ your system – so that your system doesn’t run correctly or you can no longer access\
-  \ specific data, or it may even ask for ransom in order to give your access.\n(A)\
-  \ IM – Trojans (B) Backdoor Trojans (C) Trojan-Downloader (D) Ransom Trojan\nA:\
-  \ Let's think step by step. The system is asking for trojans, which are for ransom,\
-  \ which means ransom trojan. The answer is (D).\n\nQ: What is ethical hacking?\n\
-  (A) \"Hacking\" ethics so they justify unintended selfish behavior (B) Hacking systems\
-  \ (e.g., during penetration testing) to expose vulnerabilities so they can be fixed,\
-  \ rather than exploited (C) Hacking into systems run by those whose ethics you disagree\
-  \ with (D) A slang term for rapid software development, e.g., as part of hackathons\n\
-  A: Let's think step by step. Ethical hacking is a process of detecting vulnerabilities\
-  \ in an application, system, or organization's infrastructure that an attacker can\
-  \ use to exploit an individual or organization. They use this process to prevent\
-  \ cyberattacks and security breaches by lawfully hacking into the systems and looking\
-  \ for weak points. The answer is (B).\n\nQ: The ____________ is anything which your\
-  \ search engine cannot search.\n(A) Haunted web (B) World Wide Web (C) Surface web\
-  \ (D) Deep Web\nA: Let's think step by step. The search engine searches on the Surface\
-  \ Web, which is the portion of the world wide web which is visible so (B,C) are\
-  \ wrong. The Haunted Web doesn’t correspond to an internet concept. The Deep Web\
-  \ is the part of the World Wide Web which is not indexed. The answer is (D).\n\n\
-  Q: Exploitation of the Heartbleed bug permits\n(A) overwriting cryptographic keys\
-  \ in memory (B) a kind of code injection (C) a read outside bounds of a buffer (D)\
-  \ a format string attack\nA: Let's think step by step. The Heartbleed Bug is a serious\
-  \ vulnerability in the popular OpenSSL cryptographic software library. Heartbleed\
-  \ resulted from improper input validation (due to a missing bounds check) in the\
-  \ implementation of the TLS heartbeat extension. The vulnerability was classified\
-  \ as a buffer over-read, a situation where more data can be read than should be\
-  \ allowed. The answer is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_computer_security"
+dataset_name: computer_security
+description: The following are multiple choice questions (with answers) about computer
+  security.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'SHA-1 has a message digest of
+
+      (A) 160 bits (B) 512 bits (C) 628 bits (D) 820 bits'
+    target: Let's think step by step. Since SHA-1 is a hash function which takes an
+      input and produces a 160-bit (20-byte) hash value, its message digest is 160
+      bits. The answer is (A).
+  - question: "_____________ can modify data on your system \u2013 so that your system\
+      \ doesn\u2019t run correctly or you can no longer access specific data, or it\
+      \ may even ask for ransom in order to give your access.\n(A) IM \u2013 Trojans\
+      \ (B) Backdoor Trojans (C) Trojan-Downloader (D) Ransom Trojan"
+    target: Let's think step by step. The system is asking for trojans, which are
+      for ransom, which means ransom trojan. The answer is (D).
+  - question: 'What is ethical hacking?
+
+      (A) "Hacking" ethics so they justify unintended selfish behavior (B) Hacking
+      systems (e.g., during penetration testing) to expose vulnerabilities so they
+      can be fixed, rather than exploited (C) Hacking into systems run by those whose
+      ethics you disagree with (D) A slang term for rapid software development, e.g.,
+      as part of hackathons'
+    target: Let's think step by step. Ethical hacking is a process of detecting vulnerabilities
+      in an application, system, or organization's infrastructure that an attacker
+      can use to exploit an individual or organization. They use this process to prevent
+      cyberattacks and security breaches by lawfully hacking into the systems and
+      looking for weak points. The answer is (B).
+  - question: 'The ____________ is anything which your search engine cannot search.
+
+      (A) Haunted web (B) World Wide Web (C) Surface web (D) Deep Web'
+    target: "Let's think step by step. The search engine searches on the Surface Web,\
+      \ which is the portion of the world wide web which is visible so (B,C) are wrong.\
+      \ The Haunted Web doesn\u2019t correspond to an internet concept. The Deep Web\
+      \ is the part of the World Wide Web which is not indexed. The answer is (D)."
+  - question: 'Exploitation of the Heartbleed bug permits
+
+      (A) overwriting cryptographic keys in memory (B) a kind of code injection (C)
+      a read outside bounds of a buffer (D) a format string attack'
+    target: 'Let''s think step by step. The Heartbleed Bug is a serious vulnerability
+      in the popular OpenSSL cryptographic software library. Heartbleed resulted from
+      improper input validation (due to a missing bounds check) in the implementation
+      of the TLS heartbeat extension. The vulnerability was classified as a buffer
+      over-read, a situation where more data can be read than should be allowed. The
+      answer is (C).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_computer_security
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_conceptual_physics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_conceptual_physics.yaml
-"dataset_name": "conceptual_physics"
-"description": "\nThe following are multiple choice questions (with answers) about\
-  \ conceptual physics.\n\nQ: Colors in a soap bubble result from light\n(A) converted\
-  \ to a different frequency (B) deflection (C) interference (D) polarization\nA:\
-  \ Let's think step by step. In a soap bubble film, the light bounces between the\
-  \ two soap-air interfaces many times, interfering with itself constructively or\
-  \ destructively depending on the width of the film. This results in different colors\
-  \ being visible. The answer is (C).\n\nQ: Compared with the mass of a uranium atom\
-  \ undergoing fission, the combined masses of the products after fission are\n(A)\
-  \ less (B) more (C) the same (D) zero\nA: Let's think step by step. Fission releases\
-  \ energy, which comes from the rest mass of its initial nucleus. Thus the mass of\
-  \ the products is less than the mass of the reactant uranium nucleus. The answer\
-  \ is (A).\n\nQ: Things that are equivalent according to the equivalence principle\
-  \ are\n(A) space and time. (B) a traveling twin and a stay-at-home twin. (C) gravity\
-  \ and acceleration. (D) mass and energy.\nA: Let's think step by step. Einstein’s\
-  \ famous equivalence principle states that gravity and acceleration are equivalent.\
-  \ The answer is (C).\n\nQ: Which of these three elements has the most mass per nucleon?\n\
-  (A) Hydrogen (B) Iron (C) Uranium (D) Same in each\nA: Let's think step by step.\
-  \ Due to nuclear binding energy, the mass of an atomic nucleus is less than the\
-  \ sum of individual masses of the free constituent protons and neutrons; this is\
-  \ known as the mass defect. Hydrogen has no mass defect because it has only a single\
-  \ nucleon, so it will have the most mass per nucleon. The answer is (A).\n\nQ: A\
-  \ model airplane flies slower when flying into the wind and faster with wind at\
-  \ its back. When launched at right angles to the wind a cross wind its groundspeed\
-  \ compared with flying in still air is\n(A) the same (B) greater (C) less (D) either\
-  \ greater or less depending on wind speed\nA: Let's think step by step. The plane’s\
-  \ speed in the direction of the wind is greater than it would be in the absence\
-  \ of wind, and its direction orthogonal to the wind is the same as it would be in\
-  \ the absence of the wind. The total speed, which is these two components added\
-  \ in quadrature, is thus greater than the speed in still air. The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_conceptual_physics"
+dataset_name: conceptual_physics
+description: '
+
+  The following are multiple choice questions (with answers) about conceptual physics.'
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Colors in a soap bubble result from light
+
+      (A) converted to a different frequency (B) deflection (C) interference (D) polarization'
+    target: Let's think step by step. In a soap bubble film, the light bounces between
+      the two soap-air interfaces many times, interfering with itself constructively
+      or destructively depending on the width of the film. This results in different
+      colors being visible. The answer is (C).
+  - question: 'Compared with the mass of a uranium atom undergoing fission, the combined
+      masses of the products after fission are
+
+      (A) less (B) more (C) the same (D) zero'
+    target: Let's think step by step. Fission releases energy, which comes from the
+      rest mass of its initial nucleus. Thus the mass of the products is less than
+      the mass of the reactant uranium nucleus. The answer is (A).
+  - question: 'Things that are equivalent according to the equivalence principle are
+
+      (A) space and time. (B) a traveling twin and a stay-at-home twin. (C) gravity
+      and acceleration. (D) mass and energy.'
+    target: "Let's think step by step. Einstein\u2019s famous equivalence principle\
+      \ states that gravity and acceleration are equivalent. The answer is (C)."
+  - question: 'Which of these three elements has the most mass per nucleon?
+
+      (A) Hydrogen (B) Iron (C) Uranium (D) Same in each'
+    target: Let's think step by step. Due to nuclear binding energy, the mass of an
+      atomic nucleus is less than the sum of individual masses of the free constituent
+      protons and neutrons; this is known as the mass defect. Hydrogen has no mass
+      defect because it has only a single nucleon, so it will have the most mass per
+      nucleon. The answer is (A).
+  - question: 'A model airplane flies slower when flying into the wind and faster with
+      wind at its back. When launched at right angles to the wind a cross wind its
+      groundspeed compared with flying in still air is
+
+      (A) the same (B) greater (C) less (D) either greater or less depending on wind
+      speed'
+    target: "Let's think step by step. The plane\u2019s speed in the direction of\
+      \ the wind is greater than it would be in the absence of wind, and its direction\
+      \ orthogonal to the wind is the same as it would be in the absence of the wind.\
+      \ The total speed, which is these two components added in quadrature, is thus\
+      \ greater than the speed in still air. The answer is (B)."
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_conceptual_physics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_econometrics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_econometrics.yaml
-"dataset_name": "econometrics"
-"description": "The following are multiple choice questions (with answers) about econometrics.\n\
-  \nQ: Suppose now that a researcher wishes to use information criteria to determine\
-  \ the optimal lag length for a VAR. 500 observations are available for the bi-variate\
-  \ VAR, and the values of the determinant of the variance-covariance matrix of residuals\
-  \ are 0.0336, 0.0169, 0.0084, and 0.0062 for 1, 2, 3, and 4 lags respectively. What\
-  \ is the optimal model order according to Akaike's information criterion?\n(A) 1\
-  \ lag (B) 2 lags (C) 3 lags (D) 4 lags\nA: Let's think step by step. We refer to\
-  \ Wikipedia articles on econometrics for help. Let’s solve this problem step by\
-  \ step. First of all, let’s recall that for a given set of data, Akaike's information\
-  \ criterion (AIC) allows us to measure how well a statistical model fits the data;\
-  \ it is an estimator of prediction error. Here in this problem we will need to use\
-  \ the formula ln(det(sigma_hat)) + (2 * k / T) to determine the values of Akaike’s\
-  \ criterion, where ln denotes the natural log function, det the determinant function,\
-  \ k the total number of parameters in total (across both equations), and T the number\
-  \ of observations (which, in this case, is equal to 500). For 1 lag, the number\
-  \ of parameters in total is equal to 6; for 2 lags, it is 10; for 3 lags, it is\
-  \ 14; and for 4 lags, it is 18. Now, let’s calculate the values of the criterion\
-  \ for each lag:\n(A) 1 lag: ln(0.0336) + (2 * 6 / 500) = ln(0.0336) + (12 / 500)\
-  \ = -3.369\n(B) 2 lags: ln(0.0169) + (2 * 10 / 500) = ln(0.0169) + (20 / 500) =\
-  \ -4.040\n(C) 3 lags: ln(0.0084) + (2 * 14 / 500) = ln(0.0084) + (28 / 500) =-4.724\n\
-  (D) 4 lags: ln(0.0062) + (2 * 18 / 500) = ln(0.0062) + (36 / 500) =-5.011\nBecause\
-  \ the optimal model order according to AIC minimizes the information criterion,\
-  \ the answer should be the one with the lowest value. In this case, (D) has the\
-  \ lowest value. The answer is (C).\n\nQ: Consider the following AR(1) model with\
-  \ the disturbances having zero mean and unit variance\nyt = 0.2 + 0.4 yt-1 + ut\n\
-  The (unconditional) mean of y will be given by\n(A) 0.2 (B) 0.4 (C) 0.5 (D) 0.33\n\
-  A: Let's think step by step. We refer to Wikipedia articles on econometrics for\
-  \ help. Let’s solve this problem step by step. If we have a an AR(1) model with\
-  \ the disturbances having zero mean and unit variance, then the unconditional mean\
-  \ of y is equal to the following:\nunconditional mean of y = (the intercept term)\
-  \ / (1 - autoregressive coefficient)\nWe know that the intercept term is 0.2 and\
-  \ the autoregressive coefficient is 0.4; thus, we have:\nunconditional mean of y\
-  \ = (0.2) / (1 - 0.4) = (0.2) / (0.6) = 2 / 6 = 1 / 3, which is approximately 0.33.\
-  \ That means that the answer should be (D) 0.33. The answer is (D).\n\nQ: What would\
-  \ be then consequences for the OLS estimator if heteroscedasticity is present in\
-  \ a regression model but ignored?\n(A) It will be biased (B) It will be inconsistent\
-  \ (C) It will be inefficient (D) All of (a), (b) and (c) will be true.\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on econometrics for help. Heteroscedasticity\
-  \ refers to the condition where the variance of the error terms is not constant\
-  \ across multiple observations. If heteroscedasticity is present in a regression\
-  \ model, then the coefficient estimates in the OLS estimator will be not only unbiased\
-  \ and consistent but also inefficient. Because (A) and (B) are incorrect choices\
-  \ and (C) is a correct choice, (D) cannot be the right answer. Ultimately, (C) is\
-  \ the only true choice. The answer is (C).\n\nQ: Suppose that a test statistic has\
-  \ associated with it a p-value of 0.08. Which one of the following statements is\
-  \ true?\n(i) If the size of the test were exactly 8%, we would be indifferent between\
-  \ rejecting and not rejecting the null hypothesis\n(ii) The null would be rejected\
-  \ if a 10% size of test were used\n(iii) The null would not be rejected if a 1%\
-  \ size of test were used\n(iv) The null would be rejected if a 5% size of test were\
-  \ used.\n(A) (ii) and (iv) only (B) (i) and (iii) only (C) (i), (ii), and (iii)\
-  \ only (D) (i), (ii), (iii), and (iv).\nA: Let's think step by step. We refer to\
-  \ Wikipedia articles on econometrics for help. Let’s reason about each of the options.\n\
-  (i) is a true statement.\n(ii) is a true statement.\n(iii) is a true statement.\n\
-  (iv) is not a true statement. Thus, (i), (ii), and (iii) are true. The answer is\
-  \ (C).\n\nQ: For a stationary autoregressive process, shocks will\n(A) Eventually\
-  \ die away (B) Persist indefinitely (C) Grow exponentially (D) Never occur\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on econometrics for help. This\
-  \ is a formal logic problem about stationally process. For a stationary autoregressive\
-  \ process, shocks will eventually die away. The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_social_sciences"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_econometrics"
+dataset_name: econometrics
+description: The following are multiple choice questions (with answers) about econometrics.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Suppose now that a researcher wishes to use information criteria to determine
+      the optimal lag length for a VAR. 500 observations are available for the bi-variate
+      VAR, and the values of the determinant of the variance-covariance matrix of
+      residuals are 0.0336, 0.0169, 0.0084, and 0.0062 for 1, 2, 3, and 4 lags respectively.
+      What is the optimal model order according to Akaike''s information criterion?
+
+      (A) 1 lag (B) 2 lags (C) 3 lags (D) 4 lags'
+    target: "Let's think step by step. We refer to Wikipedia articles on econometrics\
+      \ for help. Let\u2019s solve this problem step by step. First of all, let\u2019\
+      s recall that for a given set of data, Akaike's information criterion (AIC)\
+      \ allows us to measure how well a statistical model fits the data; it is an\
+      \ estimator of prediction error. Here in this problem we will need to use the\
+      \ formula ln(det(sigma_hat)) + (2 * k / T) to determine the values of Akaike\u2019\
+      s criterion, where ln denotes the natural log function, det the determinant\
+      \ function, k the total number of parameters in total (across both equations),\
+      \ and T the number of observations (which, in this case, is equal to 500). For\
+      \ 1 lag, the number of parameters in total is equal to 6; for 2 lags, it is\
+      \ 10; for 3 lags, it is 14; and for 4 lags, it is 18. Now, let\u2019s calculate\
+      \ the values of the criterion for each lag:\n(A) 1 lag: ln(0.0336) + (2 * 6\
+      \ / 500) = ln(0.0336) + (12 / 500) = -3.369\n(B) 2 lags: ln(0.0169) + (2 * 10\
+      \ / 500) = ln(0.0169) + (20 / 500) = -4.040\n(C) 3 lags: ln(0.0084) + (2 * 14\
+      \ / 500) = ln(0.0084) + (28 / 500) =-4.724\n(D) 4 lags: ln(0.0062) + (2 * 18\
+      \ / 500) = ln(0.0062) + (36 / 500) =-5.011\nBecause the optimal model order\
+      \ according to AIC minimizes the information criterion, the answer should be\
+      \ the one with the lowest value. In this case, (D) has the lowest value. The\
+      \ answer is (C)."
+  - question: 'Consider the following AR(1) model with the disturbances having zero mean
+      and unit variance
+
+      yt = 0.2 + 0.4 yt-1 + ut
+
+      The (unconditional) mean of y will be given by
+
+      (A) 0.2 (B) 0.4 (C) 0.5 (D) 0.33'
+    target: "Let's think step by step. We refer to Wikipedia articles on econometrics\
+      \ for help. Let\u2019s solve this problem step by step. If we have a an AR(1)\
+      \ model with the disturbances having zero mean and unit variance, then the unconditional\
+      \ mean of y is equal to the following:\nunconditional mean of y = (the intercept\
+      \ term) / (1 - autoregressive coefficient)\nWe know that the intercept term\
+      \ is 0.2 and the autoregressive coefficient is 0.4; thus, we have:\nunconditional\
+      \ mean of y = (0.2) / (1 - 0.4) = (0.2) / (0.6) = 2 / 6 = 1 / 3, which is approximately\
+      \ 0.33. That means that the answer should be (D) 0.33. The answer is (D)."
+  - question: 'What would be then consequences for the OLS estimator if heteroscedasticity
+      is present in a regression model but ignored?
+
+      (A) It will be biased (B) It will be inconsistent (C) It will be inefficient
+      (D) All of (a), (b) and (c) will be true.'
+    target: Let's think step by step. We refer to Wikipedia articles on econometrics
+      for help. Heteroscedasticity refers to the condition where the variance of the
+      error terms is not constant across multiple observations. If heteroscedasticity
+      is present in a regression model, then the coefficient estimates in the OLS
+      estimator will be not only unbiased and consistent but also inefficient. Because
+      (A) and (B) are incorrect choices and (C) is a correct choice, (D) cannot be
+      the right answer. Ultimately, (C) is the only true choice. The answer is (C).
+  - question: 'Suppose that a test statistic has associated with it a p-value of 0.08.
+      Which one of the following statements is true?
+
+      (i) If the size of the test were exactly 8%, we would be indifferent between
+      rejecting and not rejecting the null hypothesis
+
+      (ii) The null would be rejected if a 10% size of test were used
+
+      (iii) The null would not be rejected if a 1% size of test were used
+
+      (iv) The null would be rejected if a 5% size of test were used.
+
+      (A) (ii) and (iv) only (B) (i) and (iii) only (C) (i), (ii), and (iii) only
+      (D) (i), (ii), (iii), and (iv).'
+    target: "Let's think step by step. We refer to Wikipedia articles on econometrics\
+      \ for help. Let\u2019s reason about each of the options.\n(i) is a true statement.\n\
+      (ii) is a true statement.\n(iii) is a true statement.\n(iv) is not a true statement.\
+      \ Thus, (i), (ii), and (iii) are true. The answer is (C)."
+  - question: 'For a stationary autoregressive process, shocks will
+
+      (A) Eventually die away (B) Persist indefinitely (C) Grow exponentially (D)
+      Never occur'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on econometrics
+      for help. This is a formal logic problem about stationally process. For a stationary
+      autoregressive process, shocks will eventually die away. The answer is (A).'
+group: mmlu_flan_cot_fewshot_social_sciences
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_econometrics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_electrical_engineering.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_electrical_engineering.yaml
-"dataset_name": "electrical_engineering"
-"description": "\nThe following are multiple choice questions (with answers) about\
-  \ electrical engineering.\n\nQ: A point pole has a strength of 4π * 10^-4 weber.\
-  \ The force in newtons on a point pole of 4π * 1.5 * 10^-4 weber placed at a distance\
-  \ of 10 cm from it will be\n(A) 15 N. (B) 20 N. (C) 7.5 N. (D) 3.75 N.\nA: Let's\
-  \ think step by step. The force between two point poles is given by m_1m_2/(mu_0\
-  \ 4 \\pi r^2), in analogy to Coulomb’s law. Plugging in the values given in the\
-  \ question, we calculate that the force is approximately 15 N. The answer is (A).\n\
-  \nQ: The coil of a moving coil meter has 100 turns, is 40 mm long and 30 mm wide.\
-  \ The control torque is 240*10-6 N-m on full scale. If magnetic flux density is\
-  \ 1Wb/m2 range of meter is\n(A) 1 mA. (B) 2 mA. (C) 3 mA. (D) 4 mA.\nA: Let's think\
-  \ step by step. The torque on a coil in a uniform magnetic field is given by BANI,\
-  \ where B is the magnetic flux density, A is the area of the coil, N is the number\
-  \ of turns, and I is the current. So we have that I = (Torque)/(BAN), or 240e-6/(1200e-6\
-  \ * 100 * 1) = 2e-3. The answer is (B).\n\nQ: In an SR latch built from NOR gates,\
-  \ which condition is not allowed\n(A) S=0, R=0 (B) S=0, R=1 (C) S=1, R=0 (D) S=1,\
-  \ R=1\nA: Let's think step by step. An SR latch is a set-reset latch; in the case\
-  \ where S=1 and R=1, the circuit has no stable state; instead a race condition will\
-  \ be produced within the circuit, so the device will be in an undefined state. So\
-  \ S=1, R=1 is an illegal input. The answer is (D).\n\nQ: Two long parallel conductors\
-  \ carry 100 A. If the conductors are separated by 20 mm, the force per meter of\
-  \ length of each conductor will be\n(A) 100 N. (B) 0.1 N. (C) 1 N. (D) 0.01 N.\n\
-  A: Let's think step by step. The magnetic force-per-length between two current-carrying\
-  \ conductors is given by \\mu_0 I_1 I_2 / (2 \\pi r), where $r$ is the separation\
-  \ distance and I_1 and I_2 are the currents. Plugging in 100 A for I_1 and I_2,\
-  \ and 20 mm for r, gives 0.1 N. The answer is (B).\n\nQ: In a 2 pole lap winding\
-  \ dc machine , the resistance of one conductor is 2Ω and total number of conductors\
-  \ is 100. Find the total resistance\n(A) 200Ω (B) 100Ω (C) 50Ω (D) 10Ω\nA: Let's\
-  \ think step by step. In lap winding, effectively two resistors are connected in\
-  \ parallel, so the actual resistance of each pair is 1 Ohm. Since we have 50 pairs,\
-  \ we get a total resistance of 50 Ohms. The answer is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_electrical_engineering"
+dataset_name: electrical_engineering
+description: '
+
+  The following are multiple choice questions (with answers) about electrical engineering.'
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: "A point pole has a strength of 4\u03C0 * 10^-4 weber. The force in newtons\
+      \ on a point pole of 4\u03C0 * 1.5 * 10^-4 weber placed at a distance of 10\
+      \ cm from it will be\n(A) 15 N. (B) 20 N. (C) 7.5 N. (D) 3.75 N."
+    target: "Let's think step by step. The force between two point poles is given\
+      \ by m_1m_2/(mu_0 4 \\pi r^2), in analogy to Coulomb\u2019s law. Plugging in\
+      \ the values given in the question, we calculate that the force is approximately\
+      \ 15 N. The answer is (A)."
+  - question: 'The coil of a moving coil meter has 100 turns, is 40 mm long and 30 mm
+      wide. The control torque is 240*10-6 N-m on full scale. If magnetic flux density
+      is 1Wb/m2 range of meter is
+
+      (A) 1 mA. (B) 2 mA. (C) 3 mA. (D) 4 mA.'
+    target: Let's think step by step. The torque on a coil in a uniform magnetic field
+      is given by BANI, where B is the magnetic flux density, A is the area of the
+      coil, N is the number of turns, and I is the current. So we have that I = (Torque)/(BAN),
+      or 240e-6/(1200e-6 * 100 * 1) = 2e-3. The answer is (B).
+  - question: 'In an SR latch built from NOR gates, which condition is not allowed
+
+      (A) S=0, R=0 (B) S=0, R=1 (C) S=1, R=0 (D) S=1, R=1'
+    target: Let's think step by step. An SR latch is a set-reset latch; in the case
+      where S=1 and R=1, the circuit has no stable state; instead a race condition
+      will be produced within the circuit, so the device will be in an undefined state.
+      So S=1, R=1 is an illegal input. The answer is (D).
+  - question: 'Two long parallel conductors carry 100 A. If the conductors are separated
+      by 20 mm, the force per meter of length of each conductor will be
+
+      (A) 100 N. (B) 0.1 N. (C) 1 N. (D) 0.01 N.'
+    target: Let's think step by step. The magnetic force-per-length between two current-carrying
+      conductors is given by \mu_0 I_1 I_2 / (2 \pi r), where $r$ is the separation
+      distance and I_1 and I_2 are the currents. Plugging in 100 A for I_1 and I_2,
+      and 20 mm for r, gives 0.1 N. The answer is (B).
+  - question: "In a 2 pole lap winding dc machine , the resistance of one conductor is\
+      \ 2\u03A9 and total number of conductors is 100. Find the total resistance\n\
+      (A) 200\u03A9 (B) 100\u03A9 (C) 50\u03A9 (D) 10\u03A9"
+    target: 'Let''s think step by step. In lap winding, effectively two resistors
+      are connected in parallel, so the actual resistance of each pair is 1 Ohm. Since
+      we have 50 pairs, we get a total resistance of 50 Ohms. The answer is (C).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_electrical_engineering
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_elementary_mathematics.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_elementary_mathematics.yaml
-"dataset_name": "elementary_mathematics"
-"description": "The following are multiple choice questions (with answers) about elementary\
-  \ mathematics.\n\nQ: Olivia used the rule \"Add 11\" to create the number pattern\
-  \ shown below. 10, 21, 32, 43, 54. Which statement about the number pattern is true?\n\
-  (A) The 10th number in the pattern will be an even number.\n(B) The number pattern\
-  \ will never have two even numbers next to each other.\n(C) The next two numbers\
-  \ in the pattern will be an even number then an odd number.\n(D) If the number pattern\
-  \ started with an odd number then the pattern would have only odd numbers in it.\n\
-  A: Let's think step by step. Choice A is incorrect because every even-numbered term\
-  \ in the pattern is odd, and 10 is an even number. Choice B is correct, because\
-  \ adding an odd number (in this case 11) to an odd number produces an even number,\
-  \ and adding an odd number to an even number produces an odd number. Thus the terms\
-  \ in the pattern will alternate between odd and even, so there will never be two\
-  \ even numbers next to each other. Choice C is incorrect because the last term in\
-  \ the example is even (54), and we know that the terms will alternate between even\
-  \ and odd. Choice D is incorrect because the terms in the pattern will alternate\
-  \ between odd and even, regardless of the value of the first term. The answer is\
-  \ (B).\n\nQ: The population of the city where Michelle was born is 145,826. What\
-  \ is the value of the 5 in the number 145,826?\n(A) 5 thousands\n(B) 5 hundreds\n\
-  (C) 5 tens\n(D) 5 ones\nA: Let's think step by step. Choice A is correct, because\
-  \ there are three digits following the 5, so\nthe 5 is in the thousands place. Thus\
-  \ the other choices are incorrect. The answer is (A).\n\nQ: A store sells 107 different\
-  \ colors of paint. They have 25 cans of each color in storage. The number of cans\
-  \ of paint the store has in storage can be found using the expression below. 107\
-  \ × 25. How many cans of paint does the store have in storage?\n(A) 749\n(B) 2,675\n\
-  (C) 2,945\n(D) 4,250\nA: Let's think step by step. We can calculate 107 x 25 = (100\
-  \ x 25) + (7 x 25) = 2500 + 175 = 2675. The answer is (B).\n\nQ: A total of 30 players\
-  \ will play basketball at a park. There will be exactly 5 players on each team.\
-  \ Which statement correctly explains how to find the number of teams needed?\n(A)\
-  \ Add 5 to 30 to find 35 teams.\n(B) Divide 30 by 5 to find 6 teams.\n(C) Multiply\
-  \ 30 and 5 to find 150 teams.\n(D) Subtract 5 from 30 to find 25 teams.\nA: Let's\
-  \ think step by step. We want to find the number of teams. We know that there are\
-  \ 5 players/team, and 30 players. Thus to get the number of teams we divide players\
-  \ by players/team, so 30 players / 5 players/team = 6 teams. The answer is (B).\n\
-  \nQ: Which expression is equivalent to 5 x 9?\n(A) (5 x 4) x (6 x 5)\n(B) (5 x 5)\
-  \ + (5 x 4)\n(C) (5 x 5) + (5 x 9)\n(D) (5 x 9) x (6 x 9)\nA: Let's think step by\
-  \ step. We know that 9 = (5 + 4), so 5 x 9 = 5 x (5 + 4) = (5 x 5) + (5 x 4). The\
-  \ answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_elementary_mathematics"
+dataset_name: elementary_mathematics
+description: The following are multiple choice questions (with answers) about elementary
+  mathematics.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Olivia used the rule "Add 11" to create the number pattern shown below.
+      10, 21, 32, 43, 54. Which statement about the number pattern is true?
+
+      (A) The 10th number in the pattern will be an even number.
+
+      (B) The number pattern will never have two even numbers next to each other.
+
+      (C) The next two numbers in the pattern will be an even number then an odd number.
+
+      (D) If the number pattern started with an odd number then the pattern would
+      have only odd numbers in it.'
+    target: Let's think step by step. Choice A is incorrect because every even-numbered
+      term in the pattern is odd, and 10 is an even number. Choice B is correct, because
+      adding an odd number (in this case 11) to an odd number produces an even number,
+      and adding an odd number to an even number produces an odd number. Thus the
+      terms in the pattern will alternate between odd and even, so there will never
+      be two even numbers next to each other. Choice C is incorrect because the last
+      term in the example is even (54), and we know that the terms will alternate
+      between even and odd. Choice D is incorrect because the terms in the pattern
+      will alternate between odd and even, regardless of the value of the first term.
+      The answer is (B).
+  - question: 'The population of the city where Michelle was born is 145,826. What is
+      the value of the 5 in the number 145,826?
+
+      (A) 5 thousands
+
+      (B) 5 hundreds
+
+      (C) 5 tens
+
+      (D) 5 ones'
+    target: 'Let''s think step by step. Choice A is correct, because there are three
+      digits following the 5, so
+
+      the 5 is in the thousands place. Thus the other choices are incorrect. The answer
+      is (A).'
+  - question: "A store sells 107 different colors of paint. They have 25 cans of each\
+      \ color in storage. The number of cans of paint the store has in storage can\
+      \ be found using the expression below. 107 \xD7 25. How many cans of paint does\
+      \ the store have in storage?\n(A) 749\n(B) 2,675\n(C) 2,945\n(D) 4,250"
+    target: Let's think step by step. We can calculate 107 x 25 = (100 x 25) + (7
+      x 25) = 2500 + 175 = 2675. The answer is (B).
+  - question: 'A total of 30 players will play basketball at a park. There will be exactly
+      5 players on each team. Which statement correctly explains how to find the number
+      of teams needed?
+
+      (A) Add 5 to 30 to find 35 teams.
+
+      (B) Divide 30 by 5 to find 6 teams.
+
+      (C) Multiply 30 and 5 to find 150 teams.
+
+      (D) Subtract 5 from 30 to find 25 teams.'
+    target: Let's think step by step. We want to find the number of teams. We know
+      that there are 5 players/team, and 30 players. Thus to get the number of teams
+      we divide players by players/team, so 30 players / 5 players/team = 6 teams.
+      The answer is (B).
+  - question: 'Which expression is equivalent to 5 x 9?
+
+      (A) (5 x 4) x (6 x 5)
+
+      (B) (5 x 5) + (5 x 4)
+
+      (C) (5 x 5) + (5 x 9)
+
+      (D) (5 x 9) x (6 x 9)'
+    target: 'Let''s think step by step. We know that 9 = (5 + 4), so 5 x 9 = 5 x (5
+      + 4) = (5 x 5) + (5 x 4). The answer is (B).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_elementary_mathematics
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_formal_logic.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_formal_logic.yaml
-"dataset_name": "formal_logic"
-"description": "The following are multiple choice questions (with answers) about formal\
-  \ logic.\n\nQ: Which of the given formulas of PL is the best symbolization of the\
-  \ following sentence?\nTurtles live long lives and are happy creatures, unless they\
-  \ are injured.\n(A) (L • H) ≡ I (B) (L • H) ∨ I (C) L • (H ∨ I) (D) L • (H ⊃ R).\n\
-  A: Let's think step by step. We refer to Wikipedia articles on formal logic for\
-  \ help. Let’s solve this step by step. Let “L” denote “living long”, H “being happy”,\
-  \ and “I” “being injured”. Now, consider each choice:\n(A) means (living long AND\
-  \ being happy) is equivalent to (being injured). \n(B) means (living long AND being\
-  \ happy) OR (being injured). \n(C) means (living long) AND (being happy OR being\
-  \ injured). \n(D) means (living long) AND (being happy implies being R), but what\
-  \ R denotes is not clear.\nObviously, (B) is the best symbolization of the original\
-  \ sentence. The answer is (B).\n\nQ: Select the best translation into predicate\
-  \ logic.George borrows Hector's lawnmower. (g: George; h: Hector; l: Hector's lawnmower;\
-  \ Bxyx: x borrows y from z).\n(A) Blgh (B) Bhlg (C) Bglh (D) Bghl\nA: Let's think\
-  \ step by step. We refer to Wikipedia articles on formal logic for help. Let’s solve\
-  \ this step by step. We are told that “Bxyx” means “x borrows y from z”. We can\
-  \ rewrite “George borrows Hector's lawnmower” as “George borrows a lawnmower from\
-  \ Hector”, which can then be translated into predicate logic as “Bglh”. The answer\
-  \ “Bglh” appears in (C); therefore, (C) must be the correct answer. The answer is\
-  \ (C).\n\nQ: \nSelect the best English interpretation of the given arguments in\
-  \ predicate logic.\nDm\n(∀x)(Wx ⊃ ~Dx). \n(∀x)Wx ∨ Ag\t/ (∃x)Ax\n(A) Marina is a\
-  \ dancer. Some weaklings are not dancers. Either everything is a weakling or Georgia\
-  \ plays volleyball. So something plays volleyball. (B) Marina is a dancer. No weakling\
-  \ is a dancer. Everything is either a weakling or plays volleyball. So something\
-  \ plays volleyball. (C) Marina is a dancer. Some weaklings are not dancers. Everything\
-  \ is either a weakling or plays volleyball. So something plays volleyball. (D) Marina\
-  \ is a dancer. No weakling is a dancer. Either everything is a weakling or Georgia\
-  \ plays volleyball. So something plays volleyball.\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on formal logic for help. Let’s solve this step\
-  \ by step. Let “D” denote “being a dancer”, “m” denote “Maria”, “g” denote “Georgia”,\
-  \ “W” denote “weakling”, “A” denote “playing volleyball”. Then, we have the following:\n\
-  1. Dm → Maria is a dance.\n2. (∀x)(Wx ⊃ ~Dx). → For all x, if x is a weakling, then\
-  \ x is not a dancer. In other words, no weakling is a dancer.\n3. (∀x)Wx ∨ Ag\t\
-  / (∃x)Ax → For all x, x is a weakling or Georgia plays volleyball. So there exists\
-  \ an x that plays volleyball. \nOptions (A) and (C) do claim that some weaklings\
-  \ are not dancers, but the second argument strongly states that no weakling is a\
-  \ dancer. Thus, we can eliminate them. Option (B) omits the important detail about\
-  \ Georgia playing volleyball. Option (D) has all the details presented in the arguments\
-  \ and is the best English interpretation of the arguments. The answer is (D).\n\n\
-  Q: Select the best translation into predicate logic: No people drive on Mars.\n\
-  (A) ~Pd (B) (∀x)(Px ∨ ~Dx) (C) (∀x)(Px ⊃ ~Dx) (D) ~Dp\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on formal logic for help. Let’s solve this step\
-  \ by step. Let “P” denote “being on Mars” and “D” denote “driving on Mars”. Then\
-  \ let’s consider each option:\nOption (A): ~Pd → d is not on Mars.\nOption (B):\
-  \ (∀x)(Px ∨ ~Dx) → For all x, x is on Mars and x do not drive on Mars.\nOption (C):\
-  \ (∀x)(Px ⊃ ~Dx) → For all x, x is on Mars implies that x do not drive on Mars.\n\
-  Option (D): ~Dp: → p do not drive on Mars.\nOf all these options, Option (C) appears\
-  \ to be the best and most meaningful interpretation of the argument “No people drive\
-  \ on Mars.” The answer is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_humanities"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_formal_logic"
+dataset_name: formal_logic
+description: The following are multiple choice questions (with answers) about formal
+  logic.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: "Which of the given formulas of PL is the best symbolization of the following\
+      \ sentence?\nTurtles live long lives and are happy creatures, unless they are\
+      \ injured.\n(A) (L \u2022 H) \u2261 I (B) (L \u2022 H) \u2228 I (C) L \u2022\
+      \ (H \u2228 I) (D) L \u2022 (H \u2283 R)."
+    target: "Let's think step by step. We refer to Wikipedia articles on formal logic\
+      \ for help. Let\u2019s solve this step by step. Let \u201CL\u201D denote \u201C\
+      living long\u201D, H \u201Cbeing happy\u201D, and \u201CI\u201D \u201Cbeing\
+      \ injured\u201D. Now, consider each choice:\n(A) means (living long AND being\
+      \ happy) is equivalent to (being injured). \n(B) means (living long AND being\
+      \ happy) OR (being injured). \n(C) means (living long) AND (being happy OR being\
+      \ injured). \n(D) means (living long) AND (being happy implies being R), but\
+      \ what R denotes is not clear.\nObviously, (B) is the best symbolization of\
+      \ the original sentence. The answer is (B)."
+  - question: 'Select the best translation into predicate logic. George borrows Hector''s
+      lawnmower. (g: George; h: Hector; l: Hector''s lawnmower; Bxyz: x borrows y
+      from z).
+
+      (A) Blgh (B) Bhlg (C) Bglh (D) Bghl'
+    target: "Let's think step by step. We refer to Wikipedia articles on formal logic\
+      \ for help. Let\u2019s solve this step by step. We are told that \u201CBxyz\u201D\
+      \ means \u201Cx borrows y from z\u201D. We can rewrite \u201CGeorge borrows\
+      \ Hector's lawnmower\u201D as \u201CGeorge borrows a lawnmower from Hector\u201D\
+      , which can then be translated into predicate logic as \u201CBglh\u201D. The\
+      \ answer \u201CBglh\u201D appears in (C); therefore, (C) must be the correct\
+      \ answer. The answer is (C)."
+  - question: "\nSelect the best English interpretation of the given arguments in predicate\
+      \ logic.\nDm\n(\u2200x)(Wx \u2283 ~Dx). \n(\u2200x)Wx \u2228 Ag\t/ (\u2203x)Ax\n\
+      (A) Marina is a dancer. Some weaklings are not dancers. Either everything is\
+      \ a weakling or Georgia plays volleyball. So something plays volleyball. (B)\
+      \ Marina is a dancer. No weakling is a dancer. Everything is either a weakling\
+      \ or plays volleyball. So something plays volleyball. (C) Marina is a dancer.\
+      \ Some weaklings are not dancers. Everything is either a weakling or plays volleyball.\
+      \ So something plays volleyball. (D) Marina is a dancer. No weakling is a dancer.\
+      \ Either everything is a weakling or Georgia plays volleyball. So something\
+      \ plays volleyball."
+    target: "Let's think step by step. We refer to Wikipedia articles on formal logic\
+      \ for help. Let\u2019s solve this step by step. Let \u201CD\u201D denote \u201C\
+      being a dancer\u201D, \u201Cm\u201D denote \u201CMarina\u201D, \u201Cg\u201D\
+      \ denote \u201CGeorgia\u201D, \u201CW\u201D denote \u201Cweakling\u201D, \u201C\
+      A\u201D denote \u201Cplaying volleyball\u201D. Then, we have the following:\n\
+      1. Dm \u2192 Marina is a dancer.\n2. (\u2200x)(Wx \u2283 ~Dx). \u2192 For all\
+      \ x, if x is a weakling, then x is not a dancer. In other words, no weakling\
+      \ is a dancer.\n3. (\u2200x)Wx \u2228 Ag\t/ (\u2203x)Ax \u2192 For all x, x\
+      \ is a weakling or Georgia plays volleyball. So there exists an x that plays\
+      \ volleyball. \nOptions (A) and (C) do claim that some weaklings are not dancers,\
+      \ but the second argument strongly states that no weakling is a dancer. Thus,\
+      \ we can eliminate them. Option (B) omits the important detail about Georgia\
+      \ playing volleyball. Option (D) has all the details presented in the arguments\
+      \ and is the best English interpretation of the arguments. The answer is (D)."
+  - question: "Select the best translation into predicate logic: No people drive on Mars.\n\
+      (A) ~Pd (B) (\u2200x)(Px \u2228 ~Dx) (C) (\u2200x)(Px \u2283 ~Dx) (D) ~Dp"
+    target: "Let's think step by step. We refer to Wikipedia articles on formal logic\
+      \ for help. Let\u2019s solve this step by step. Let \u201CP\u201D denote \u201C\
+      being on Mars\u201D and \u201CD\u201D denote \u201Cdriving on Mars\u201D. Then\
+      \ let\u2019s consider each option:\nOption (A): ~Pd \u2192 d is not on Mars.\n\
+      Option (B): (\u2200x)(Px \u2228 ~Dx) \u2192 For all x, x is on Mars or x does\
+      \ not drive on Mars.\nOption (C): (\u2200x)(Px \u2283 ~Dx) \u2192 For all x,\
+      \ x is on Mars implies that x does not drive on Mars.\nOption (D): ~Dp \u2192\
+      \ p does not drive on Mars.\nOf all these options, Option (C) appears to be the\
+      \ best and most meaningful interpretation of the argument \u201CNo people drive\
+      \ on Mars.\u201D The answer is (C).\n\n"
+group: mmlu_flan_cot_fewshot_humanities
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_formal_logic
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_global_facts.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_global_facts.yaml
-"dataset_name": "global_facts"
-"description": "The following are multiple choice questions (with answers) about global\
-  \ facts.\n\nQ: As of 2017, how many of the world’s 1-year-old children today have\
-  \ been vaccinated against some disease? *\n(A) 80% (B) 60% (C) 40% (D) 20%\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on global facts for help. According\
-  \ to data published by the World Health Organization, the nummber of 1-year-old\
-  \ children vaccinated in 2017 exceeds 80%. The answer is (A).\n\nQ: As of 2019,\
-  \ about what percentage of Americans agree that the state is run for the benefit\
-  \ of all the people?\n(A) 31% (B) 46% (C) 61% (D) 76%\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on global facts for help. In 2019, about 46% percentage\
-  \ of Americans agree that the state is run for the benefit of all the people. The\
-  \ answer is (B).\n\nQ: As of 2019, about what percentage of Russians say it is very\
-  \ important to have free media in our country without government/state censorship?\n\
-  (A) 38% (B) 53% (C) 68% (D) 83%\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on global facts for help. As of 2019, about 38% of Russians say it is\
-  \ very important to have free media in our country. The answer is (A).\n\nQ: As\
-  \ of 2015, since 1990 forests have ____ in Europe and have ____ in Africa and the\
-  \ Americas.\n(A) increased, increased (B) increased, decreased (C) decreased, increased\
-  \ (D) decreased, decreased\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on global facts for help. As of 2015, since 1990 forests have increased in Europe\
-  \ and have decreased in Africa and the Americas. The answer is (B).\n\nQ: Which\
-  \ of the following pairs of statements are both true (as of 2019)?\n(A) People tend\
-  \ to be optimistic about their own future and the future of their nation or the\
-  \ world. (B) People tend to be optimistic about their own future but pessimistic\
-  \ about the future of their nation or the world. (C) People tend to be pessimistic\
-  \ about their own future but optimistic about the future of their nation or the\
-  \ world. (D) People tend to be pessimistic about their own future and the future\
-  \ of their nation or the world.\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on global facts for help. As of 2019, most people tend to be optimistic\
-  \ about their own future but pessimistic about the future of their nation or the\
-  \ world. The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_global_facts"
+dataset_name: global_facts
+description: The following are multiple choice questions (with answers) about global
+  facts.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: "As of 2017, how many of the world\u2019s 1-year-old children today have\
+      \ been vaccinated against some disease? *\n(A) 80% (B) 60% (C) 40% (D) 20%"
+    target: Let's think step by step. We refer to Wikipedia articles on global facts
+      for help. According to data published by the World Health Organization, the
+      percentage of 1-year-old children vaccinated in 2017 exceeds 80%. The answer is
+      (A).
+  - question: 'As of 2019, about what percentage of Americans agree that the state is
+      run for the benefit of all the people?
+
+      (A) 31% (B) 46% (C) 61% (D) 76%'
+    target: Let's think step by step. We refer to Wikipedia articles on global facts
+      for help. In 2019, about 46% of Americans agree that the state is
+      run for the benefit of all the people. The answer is (B).
+  - question: 'As of 2019, about what percentage of Russians say it is very important
+      to have free media in our country without government/state censorship?
+
+      (A) 38% (B) 53% (C) 68% (D) 83%'
+    target: Let's think step by step. We refer to Wikipedia articles on global facts
+      for help. As of 2019, about 38% of Russians say it is very important to have
+      free media in our country. The answer is (A).
+  - question: 'As of 2015, since 1990 forests have ____ in Europe and have ____ in Africa
+      and the Americas.
+
+      (A) increased, increased (B) increased, decreased (C) decreased, increased (D)
+      decreased, decreased'
+    target: Let's think step by step. We refer to Wikipedia articles on global facts
+      for help. As of 2015, since 1990 forests have increased in Europe and have decreased
+      in Africa and the Americas. The answer is (B).
+  - question: 'Which of the following pairs of statements are both true (as of 2019)?
+
+      (A) People tend to be optimistic about their own future and the future of their
+      nation or the world. (B) People tend to be optimistic about their own future
+      but pessimistic about the future of their nation or the world. (C) People tend
+      to be pessimistic about their own future but optimistic about the future of
+      their nation or the world. (D) People tend to be pessimistic about their own
+      future and the future of their nation or the world.'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on global facts
+      for help. As of 2019, most people tend to be optimistic about their own future
+      but pessimistic about the future of their nation or the world. The answer is
+      (B).'
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_global_facts
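The hunk above moves the worked examples out of the `description` string and into `fewshot_config.samples`. As a rough illustration only, the old `Q: ... A: ...` layout could be rebuilt from such a list as sketched below. The helper `build_fewshot_prompt`, its signature, and the reading of `first_n` as "take the first n samples" are assumptions made for this sketch; they are not taken from the harness code or from the included template.

```python
from typing import Dict, List

DESCRIPTION = (
    "The following are multiple choice questions (with answers) about global facts."
)

# Two samples copied from the hunk above.
SAMPLES: List[Dict[str, str]] = [
    {
        "question": (
            "As of 2019, about what percentage of Russians say it is very important "
            "to have free media in our country without government/state censorship?\n"
            "(A) 38% (B) 53% (C) 68% (D) 83%"
        ),
        "target": (
            "Let's think step by step. We refer to Wikipedia articles on global facts "
            "for help. As of 2019, about 38% of Russians say it is very important to "
            "have free media in our country. The answer is (A)."
        ),
    },
    {
        "question": (
            "As of 2015, since 1990 forests have ____ in Europe and have ____ in "
            "Africa and the Americas.\n"
            "(A) increased, increased (B) increased, decreased "
            "(C) decreased, increased (D) decreased, decreased"
        ),
        "target": (
            "Let's think step by step. We refer to Wikipedia articles on global facts "
            "for help. As of 2015, since 1990 forests have increased in Europe and "
            "have decreased in Africa and the Americas. The answer is (B)."
        ),
    },
]


def build_fewshot_prompt(description: str, samples: List[Dict[str, str]], n: int) -> str:
    """Hypothetical helper: join the first n samples in the old Q:/A: layout."""
    shots = [f"Q: {s['question']}\nA: {s['target']}" for s in samples[:n]]
    # Blank line between blocks, mirroring the "\n\n" separators of the old string.
    return "\n\n".join([description, *shots]) + "\n\n"


if __name__ == "__main__":
    print(build_fewshot_prompt(DESCRIPTION, SAMPLES, n=2))
```

Keeping the samples as structured `question`/`target` pairs lets the number of shots be chosen at run time instead of being fixed in one long string.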
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_high_school_biology.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_high_school_biology.yaml
-"dataset_name": "high_school_biology"
-"description": "The following are multiple choice questions (with answers) about high\
-  \ school biology.\n\nQ: In animal cells, which of the following represents the most\
-  \ likely pathway that a secretory protein takes as it is synthesized in a cell?\n\
-  (A) Plasma membrane–Golgi apparatus–ribosome–secretory vesicle–rough ER (B) Ribosome–Golgi\
-  \ apparatus–rough ER–secretory vesicle–plasma membrane (C) Plasma membrane–Golgi\
-  \ apparatus–ribosome–secretory vesicle–rough ER (D) Ribosome–rough ER–Golgi apparatus–secretory\
-  \ vesicle–plasma membrane\nA: Let's think step by step. Protein synthesis starts\
-  \ at the ribosome, so we can eliminate (A) and (C). The ribosome is often in the\
-  \ endoplasmic reticulum and moves from there to the Golgi apparatus, where it is\
-  \ modified and packaged into a vesicle. The vesicle then floats to the plasma membrane\
-  \ and is secreted. The answer is (D).\n\nQ: A mutation in a bacterial enzyme changed\
-  \ a previously polar amino acid into a nonpolar amino acid. This amino acid was\
-  \ located at a site distant from the enzyme’s active site. How might this mutation\
-  \ alter the enzyme’s substrate specificity?\n(A) By changing the enzyme’s pH optimum\
-  \ (B) By changing the enzyme’s location in the cell (C) By changing the shape of\
-  \ the protein (D) An amino acid change away from the active site cannot alter the\
-  \ enzyme’s substrate specificity.\nA: Let's think step by step. A change in an amino\
-  \ acid leads to a change in the primary structure of the protein. A change in the\
-  \ primary structure may lead to a change in the secondary and the tertiary structure\
-  \ of the protein. A change in the tertiary structure means a change in the shape\
-  \ of the protein, so (C) has to be correct. Since the change does not affect the\
-  \ active site of the enzyme, we do not expect the activity of the enzyme to be affected.\
-  \ The answer is (C).\n\nQ: Which of the following is not a way to form recombinant\
-  \ DNA?\n(A) Translation (B) Conjugation (C) Specialized transduction (D) Transformation\n\
-  A: Let's think step by step. The introduction of foreign DNA or RNA into bacteria\
-  \ or eukaryotic cells is a common technique in molecular biology and scientific\
-  \ research. There are multiple ways foreign DNA can be introduced into cells including\
-  \ transformation, transduction, conjugation, and transfection. In contrast, (A)\
-  \ is not a way to form DNA: during translation the ribosomes synthesize proteins\
-  \ from RNA. The answer is (A).\n\nQ: Homologous structures are often cited as evidence\
-  \ for the process of natural selection. All of the following are examples of homologous\
-  \ structures EXCEPT\n(A) the wings of a bird and the wings of a bat (B) the flippers\
-  \ of a whale and the arms of a man (C) the pectoral fins of a porpoise and the flippers\
-  \ of a seal (D) the forelegs of an insect and the forelimbs of a dog\nA: Let's think\
-  \ step by step. Homologous structures are similar physical features in organisms\
-  \ that share a common ancestor but different functions. Comparisons (B) and (C)\
-  \ are clearly homologous because they share a common ancestor and the structures\
-  \ serve different purposes. Bat wings and birg wings are also homologous, while\
-  \ they are both wings, the forelimbs serve different purposes. Insects and dogs\
-  \ are very far ancestors since one is vertebrate while the other is invertebrate\
-  \ and the forelimbs serve the same purpose, so they are not homologous. The answer\
-  \ is (D).\n\nQ: Which of the following is not known to be involved in the control\
-  \ of cell division?\n(A) Cyclins (B) Protein kinases (C) Checkpoints (D) Fibroblast\
-  \ cells\nA: Let's think step by step. Normal cells move through the cell cycle in\
-  \ a regulated way. At the checkpoint stage, they use information about their own\
-  \ internal state and cues from the environment around them to decide whether to\
-  \ proceed with cell division. Cues like these act by changing the activity of core\
-  \ cell cycle regulators inside the cell. The most common regulators are cyclins\
-  \ and cyclin-dependent kinases. Fibroblast cells do not play any role in cell division.\
-  \ The answer is (D).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_high_school_biology"
+dataset_name: high_school_biology
+description: The following are multiple choice questions (with answers) about high
+  school biology.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: "In animal cells, which of the following represents the most likely pathway\
+      \ that a secretory protein takes as it is synthesized in a cell?\n(A) Plasma\
+      \ membrane\u2013Golgi apparatus\u2013ribosome\u2013secretory vesicle\u2013rough\
+      \ ER (B) Ribosome\u2013Golgi apparatus\u2013rough ER\u2013secretory vesicle\u2013\
+      plasma membrane (C) Plasma membrane\u2013Golgi apparatus\u2013ribosome\u2013\
+      secretory vesicle\u2013rough ER (D) Ribosome\u2013rough ER\u2013Golgi apparatus\u2013\
+      secretory vesicle\u2013plasma membrane"
+    target: Let's think step by step. Protein synthesis starts at the ribosome, so
+      we can eliminate (A) and (C). The ribosome is often attached to the rough endoplasmic
+      reticulum, and the protein moves from there to the Golgi apparatus, where it is modified
+      and packaged into a vesicle. The vesicle then travels to the plasma membrane, where the
+      protein is secreted.
+      The answer is (D).
+  - question: "A mutation in a bacterial enzyme changed a previously polar amino acid\
+      \ into a nonpolar amino acid. This amino acid was located at a site distant\
+      \ from the enzyme\u2019s active site. How might this mutation alter the enzyme\u2019\
+      s substrate specificity?\n(A) By changing the enzyme\u2019s pH optimum (B) By\
+      \ changing the enzyme\u2019s location in the cell (C) By changing the shape\
+      \ of the protein (D) An amino acid change away from the active site cannot alter\
+      \ the enzyme\u2019s substrate specificity."
+    target: Let's think step by step. A change in an amino acid leads to a change
+      in the primary structure of the protein. A change in the primary structure may
+      lead to a change in the secondary and the tertiary structure of the protein.
+      A change in the tertiary structure means a change in the shape of the protein,
+      so (C) has to be correct. Since the change does not affect the active site of
+      the enzyme, we do not expect the activity of the enzyme to be affected. The
+      answer is (C).
+  - question: 'Which of the following is not a way to form recombinant DNA?
+
+      (A) Translation (B) Conjugation (C) Specialized transduction (D) Transformation'
+    target: 'Let''s think step by step. The introduction of foreign DNA or RNA into
+      bacteria or eukaryotic cells is a common technique in molecular biology and
+      scientific research. There are multiple ways foreign DNA can be introduced into
+      cells including transformation, transduction, conjugation, and transfection.
+      In contrast, (A) is not a way to form DNA: during translation the ribosomes
+      synthesize proteins from RNA. The answer is (A).'
+  - question: 'Homologous structures are often cited as evidence for the process of natural
+      selection. All of the following are examples of homologous structures EXCEPT
+
+      (A) the wings of a bird and the wings of a bat (B) the flippers of a whale and
+      the arms of a man (C) the pectoral fins of a porpoise and the flippers of a
+      seal (D) the forelegs of an insect and the forelimbs of a dog'
+    target: "Let's think step by step. Homologous structures are similar\
+      \ physical features in organisms that share a common ancestor but have\
+      \ different functions. Comparisons (B) and (C) are clearly homologous because\
+      \ they share a common ancestor and the structures serve different purposes.\
+      \ Bat wings and bird wings are also homologous: while they are both wings, the\
+      \ forelimbs serve different purposes. Insects and dogs are only very distantly\
+      \ related, since one is vertebrate while the other is invertebrate, and the forelimbs\
+      \ serve the same purpose, so they are not homologous. The answer is (D)."
+  - question: 'Which of the following is not known to be involved in the control of cell
+      division?
+
+      (A) Cyclins (B) Protein kinases (C) Checkpoints (D) Fibroblast cells'
+    target: 'Let''s think step by step. Normal cells move through the cell cycle in
+      a regulated way. At the checkpoint stage, they use information about their own
+      internal state and cues from the environment around them to decide whether to
+      proceed with cell division. Cues like these act by changing the activity of
+      core cell cycle regulators inside the cell. The most common regulators are cyclins
+      and cyclin-dependent kinases. Fibroblast cells do not play any role in cell
+      division. The answer is (D).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_high_school_biology
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_high_school_chemistry.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_high_school_chemistry.yaml
-"dataset_name": "high_school_chemistry"
-"description": "The following are multiple choice questions (with answers) about high\
-  \ school chemistry.\n\nQ: Which of the following is considered an acid anhydride?\n\
-  (A) HCl (B) H2SO3 (C) SO2 (D) Al(NO3)3\nA: Let's think step by step. An acid anhydride\
-  \ is a compound that is derived by removing water from an acid. The chemical formula\
-  \ for water is H2O, which means that we need to determine which of these options,\
-  \ when combined with H2O, forms an acid. SO2, or Sulfur dioxide, when combined with\
-  \ H2O, makes H2SO4, or sulfuric acid. The answer is (C).\n\nQ: Which of the following\
-  \ is expected to be a polar molecule?\n(A) PCl4F (B) BF3 (C) CO2 (D) Si(CH3)4\n\
-  A: Let's think step by step. A polar molecule is one that has a slightly positive\
-  \ charge on one end of the molecule and a slightly negative charge on the other\
-  \ end. Boron trifluoride (BF3) has Boron as the center atom and three fluorine atoms\
-  \ attached to it; it is trigonal planar and symmetric, so it is nonpolar. Carbon\
-  \ Dioxide (CO2) has Carbon as the central atom with double bonds to two Oxygen atoms\
-  \ - this is also symmetrical and therefore nonpolar. The same is the case for tetramethyl\
-  \ silane (SI(CH3)4), which is a Silicon atom surrounded by four methyl groups. The\
-  \ structure of PCL4F is that Phosphorus is the central atom, attached to four chlorines\
-  \ and one fluorine atom. This is asymmetrical, and therefore has a net dipole and\
-  \ is expected to be a polar molecule. The answer is (A).\n\nQ: From the solubility\
-  \ rules, which of the following is true?\n(A) All chlorides, bromides, and iodides\
-  \ are soluble (B) All sulfates are soluble (C) All hydroxides are soluble (D) All\
-  \ ammonium-containing compounds are soluble\nA: Let's think step by step. The chlorides,\
-  \ bromides, and iodides of lead, silver, and mercury are not soluble in water. This\
-  \ rules out (A). The sulfates of lead, barium, and calcium are not soluble in water,\
-  \ which rules out (B). The hydroxides of any metal besides sodium, potassium, ammonium,\
-  \ calcium, and barium are insoluble. This rules out (C). Typically ammonium ions\
-  \ indicate a soluble ionic substance. The answer is (D).\n\nQ: A new compound is\
-  \ synthesized and found to be a monoprotic acid with a molar mass of 248 g/mol.\
-  \ When 0.0050 mol of this acid are dissolved in 0.500 L of water, the pH is measured\
-  \ as 3.89. What is the pKa of this acid?\n(A) 3.89 (B) 7.78 (C) 5.78 (D) 2.33\n\
-  A: Let's think step by step. Recall that $[A] = [H^{+}]$. Here, this is equal to\
-  \ $$10^{-3.89}$. Then we have $K_{a} = $\nrac{[H^{+}][A^{-}]}{[HA]} = \nrac{10^{-3.89}\
-  \ \\cdot 10^{-3.89}}{10^{-2}}. The resulting exponent is $-3.89 + (-3.89) - (-2)\
-  \ = 5.78$, therefore $K_a = 10^{-5.78}$. The $pK_a$ is the negative log of $K_a$,\
-  \ which is equal to $5.78$. The answer is (C).\n\nQ: A solution contains 2.00 mole\
-  \ of acetic acid, CH3COOH, and 1.00 mole of calcium acetate, Ca(CH3COO)2. The solution\
-  \ is able to resist the addition of a small amount of strong acid or strong base\
-  \ with only minor changes in the pH of the solution. Larger quantities of strong\
-  \ acid or strong base can cause a significant change in pH. How many moles of nitric\
-  \ acid, HNO3, may be added before the pH begins to change significantly?\n(A) 0.500\
-  \ mole (B) 1.00 mole (C) 2.00 mole (D) 3.00 mole\nA: Let's think step by step. We\
-  \ would like to compute the buffer capacity of this solution. First we write the\
-  \ equation for the ionization of the weak acid, in this case of acetic acid. $CH_{3}COOH\
-  \ (aq) + H_{2}O \nightarrow H_{3}O^{+} + CH3COO^{-}$. The conjugate base is therefore\
-  \ the acetate ion. The added strong acid, Nitric acid, will react with the conjugate\
-  \ base. Therefore the maximum amount of acid that can be added will be equal to\
-  \ the amount of acetate ion, or 2 moles. The answer is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_high_school_chemistry"
+dataset_name: high_school_chemistry
+description: The following are multiple choice questions (with answers) about high
+  school chemistry.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of the following is considered an acid anhydride?
+
+      (A) HCl (B) H2SO3 (C) SO2 (D) Al(NO3)3'
+    target: Let's think step by step. An acid anhydride is a compound that is derived
+      by removing water from an acid. The chemical formula for water is H2O, which
+      means that we need to determine which of these options, when combined with H2O,
+      forms an acid. SO2, or sulfur dioxide, when combined with H2O, makes H2SO3,
+      or sulfurous acid. The answer is (C).
+  - question: 'Which of the following is expected to be a polar molecule?
+
+      (A) PCl4F (B) BF3 (C) CO2 (D) Si(CH3)4'
+    target: Let's think step by step. A polar molecule is one that has a slightly
+      positive charge on one end of the molecule and a slightly negative charge on
+      the other end. Boron trifluoride (BF3) has Boron as the center atom and three
+      fluorine atoms attached to it; it is trigonal planar and symmetric, so it is
+      nonpolar. Carbon Dioxide (CO2) has Carbon as the central atom with double bonds
+      to two Oxygen atoms - this is also symmetrical and therefore nonpolar. The same
+      is the case for tetramethylsilane (Si(CH3)4), which is a Silicon atom surrounded
+      by four methyl groups. The structure of PCl4F is that Phosphorus is the central
+      atom, attached to four chlorines and one fluorine atom. This is asymmetrical,
+      and therefore has a net dipole and is expected to be a polar molecule. The answer
+      is (A).
+  - question: 'From the solubility rules, which of the following is true?
+
+      (A) All chlorides, bromides, and iodides are soluble (B) All sulfates are soluble
+      (C) All hydroxides are soluble (D) All ammonium-containing compounds are soluble'
+    target: Let's think step by step. The chlorides, bromides, and iodides of lead,
+      silver, and mercury are not soluble in water. This rules out (A). The sulfates
+      of lead, barium, and calcium are not soluble in water, which rules out (B).
+      The hydroxides of any metal besides sodium, potassium, ammonium, calcium, and
+      barium are insoluble. This rules out (C). Typically ammonium ions indicate a
+      soluble ionic substance. The answer is (D).
+  - question: 'A new compound is synthesized and found to be a monoprotic acid with a
+      molar mass of 248 g/mol. When 0.0050 mol of this acid are dissolved in 0.500
+      L of water, the pH is measured as 3.89. What is the pKa of this acid?
+
+      (A) 3.89 (B) 7.78 (C) 5.78 (D) 2.33'
+    target: "Let's think step by step. Recall that $[A^{-}] = [H^{+}]$. Here, this is\
+      \ equal to $10^{-3.89}$. Then we have $K_{a} = \\frac{[H^{+}][A^{-}]}{[HA]}\
+      \ = \\frac{10^{-3.89} \\cdot 10^{-3.89}}{10^{-2}}$. The resulting exponent is\
+      \ $-3.89 + (-3.89) - (-2) = -5.78$, therefore $K_a = 10^{-5.78}$. The $pK_a$\
+      \ is the negative log of $K_a$, which is equal to $5.78$. The answer is (C)."
+  - question: 'A solution contains 2.00 mole of acetic acid, CH3COOH, and 1.00 mole of
+      calcium acetate, Ca(CH3COO)2. The solution is able to resist the addition of
+      a small amount of strong acid or strong base with only minor changes in the
+      pH of the solution. Larger quantities of strong acid or strong base can cause
+      a significant change in pH. How many moles of nitric acid, HNO3, may be added
+      before the pH begins to change significantly?
+
+      (A) 0.500 mole (B) 1.00 mole (C) 2.00 mole (D) 3.00 mole'
+    target: "Let's think step by step. We would like to compute the buffer capacity\
+      \ of this solution. First we write the equation for the ionization of the weak\
+      \ acid, in this case of acetic acid. $CH_{3}COOH (aq) + H_{2}O \\rightarrow H_{3}O^{+}\
+      \ + CH_{3}COO^{-}$. The conjugate base is therefore the acetate ion. The added\
+      \ strong acid, nitric acid, will react with the conjugate base. Each mole of\
+      \ calcium acetate supplies 2 moles of acetate ion, so the maximum amount of acid\
+      \ that can be added is equal to the amount of acetate ion, or 2 moles. The\
+      \ answer is (C).\n\n"
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_high_school_chemistry
--- a/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_high_school_computer_science.yaml
+++ b/lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_high_school_computer_science.yaml
-"dataset_name": "high_school_computer_science"
-"description": "The following are multiple choice questions (with answers) about high\
-  \ school computer science.\n\nQ: Which of the following is an example of the use\
-  \ of a device on the Internet of Things (IoT) ?\n(A) A car alerts a driver that\
-  \ it is about to hit an object. (B) A hiker uses a G P S watch to keep track of\
-  \ her position. (C) A refrigerator orders milk from an online delivery service when\
-  \ the milk in the refrigerator is almost gone. (D) A runner uses a watch with optical\
-  \ sensors to monitor his heart rate.\nA: Let's think step by step. The term Internet\
-  \ of Things (IoT) refers to common devices which are connected to the internet,\
-  \ enabling new functionality. Choice A is incorrect because it does not describe\
-  \ an internet connected device. In choice B, the watch is only described as having\
-  \ GPS functionality but no internet connectivity. Choice C describes a common device\
-  \ (a refrigerator) which has internet connectivity enabling new functionality (online\
-  \ ordering). Choice D does not mention internet connectivity for the watch, only\
-  \ optical sensors. The answer is (C).\n\nQ: Many Web browsers allow users to open\
-  \ anonymous windows. During a browsing session in an anonymous window, the browser\
-  \ does not record a browsing history or a list of downloaded files. When the anonymous\
-  \ window is exited, cookies created during the session are deleted. Which of the\
-  \ following statements about browsing sessions in an anonymous window is true?\n\
-  (A) The activities of a user browsing in an anonymous window will not be visible\
-  \ to people who monitor the user's network, such as the system administrator. (B)\
-  \ Items placed in a Web store's shopping cart for future purchase during the anonymous\
-  \ browsing session will not be saved on the user's computer. (C) A user will not\
-  \ be able to log in to e-mail or social media accounts during the anonymous browsing\
-  \ session. (D) A user browsing in an anonymous window will be protected from viruses\
-  \ launched from any web sites visited or files downloaded.\nA: Let's think step\
-  \ by step. Choice A is incorrect as it only describes network traffic, which an\
-  \ anonymous browser does not change. Choice B is correct as it correctly describes\
-  \ how an anonymous browser will prevent saving data on the user’s computer after\
-  \ the session is ended. Choice C is incorrect because an anonymous browser will\
-  \ not prevent logging in to email or social media accounts. Choice D is incorrect\
-  \ because an anonymous browser in itself performs no virus protection. The answer\
-  \ is (B).\n\nQ: In the program below, the initial value of X is 5 and the initial\
-  \ value of Y is 10.\nIF (X < 0){\n DISPLAY (\"Foxtrot\")\n} ELSE {\n IF (X > Y){\n\
-  \  DISPLAY (\"Hotel\")\n } ELSE {\n  IF (Y > 0){\n   DISPLAY (\"November\")\n  }\
-  \ ELSE {\n   DISPLAY (\"Yankee\")\n  }\n }\n}\nWhat is displayed as a result of\
-  \ running the program?\n(A) Foxtrot (B) Hotel (C) November (D) Yankee\nA: Let's\
-  \ think step by step. Because X has the value 5, the first conditional IF (X < 0)\
-  \ is false, so we move to the first ELSE clause. Because X is 5 and Y is 10, the\
-  \ second conditional IF (X > Y) is false, so we move to the following ELSE clause.\
-  \ Since Y is 10, the conditional IF (Y > 0) is true, so the command DISPLAY (\"\
-  November\") is executed. The answer is (C).\n\nQ: What is the output of \"abc\"\
-  [::-1] in Python 3?\n(A) Error (B) abc (C) cba (D) c\nA: Let's think step by step.\
-  \ We know that the slicing operator [::-1] takes all of the elements in the string\
-  \ in reverse order, so we reverse the order of the string \"abc\", resulting in\
-  \ \"cba\". The answer is (C).\n\nQ: A list of numbers has n elements, indexed from\
-  \ 1 to n. The following algorithm is intended to display the number of elements\
-  \ in the list that have a value greater than 100. The algorithm uses the variables\
-  \ count and position. Steps 3 and 4 are missing.\n Step 1: Set count to 0 and position\
-  \ to 1.\n Step 2: If the value of the element at index position is greater than\
-  \ 100, increase the value of count by 1.\n Step 3: (missing step)\n Step 4: (missing\
-  \ step)\n Step 5: Display the value of count.\nWhich of the following could be used\
-  \ to replace steps 3 and 4 so that the algorithm works as intended?\n(A) Step 3:\
-  \ Increase the value of position by 1.\n  Step 4: Repeat steps 2 and 3 until the\
-  \ value of count is greater than 100.\n(B) Step 3: Increase the value of position\
-  \ by 1.\n  Step 4: Repeat steps 2 and 3 until the value of position is greater than\
-  \ n.\n(C) Step 3: Repeat step 2 until the value of count is greater than 100.\n\
-  \  Step 4: Increase the value of position by 1.\n(D) Step 3: Repeat step 2 until\
-  \ the value of position is greater than n.\n  Step 4: Increase the value of count\
-  \ by 1.\nA: Let's think step by step. Choice A is incorrect, because its Step 4\
-  \ has an incorrect termination condition, stopping when count is greater than 100.\
-  \ We need to stop after inspecting all elements in the list. Choice B is correct\
-  \ because it correctly increments both count and position, and correctly repeats\
-  \ these steps and terminates when all elements in the list have been inspected.\
-  \ Choice C is incorrect because it incorrectly increments the variable count until\
-  \ its value is greater than 100, regardless of the elements in the list. Choice\
-  \ D is incorrect because its step 3 does not increment the value of position, so\
-  \ it will repeat forever. The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_stem"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_high_school_computer_science"
+dataset_name: high_school_computer_science
+description: The following are multiple choice questions (with answers) about high
+  school computer science.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of the following is an example of the use of a device on the Internet
+      of Things (IoT)?
+
+      (A) A car alerts a driver that it is about to hit an object. (B) A hiker uses
+      a GPS watch to keep track of her position. (C) A refrigerator orders milk
+      from an online delivery service when the milk in the refrigerator is almost
+      gone. (D) A runner uses a watch with optical sensors to monitor his heart rate.'
+    target: Let's think step by step. The term Internet of Things (IoT) refers to
+      common devices which are connected to the internet, enabling new functionality.
+      Choice A is incorrect because it does not describe an internet connected device.
+      In choice B, the watch is only described as having GPS functionality but no
+      internet connectivity. Choice C describes a common device (a refrigerator) which
+      has internet connectivity enabling new functionality (online ordering). Choice
+      D does not mention internet connectivity for the watch, only optical sensors.
+      The answer is (C).
+  - question: 'Many Web browsers allow users to open anonymous windows. During a browsing
+      session in an anonymous window, the browser does not record a browsing history
+      or a list of downloaded files. When the anonymous window is exited, cookies
+      created during the session are deleted. Which of the following statements about
+      browsing sessions in an anonymous window is true?
+
+      (A) The activities of a user browsing in an anonymous window will not be visible
+      to people who monitor the user''s network, such as the system administrator.
+      (B) Items placed in a Web store''s shopping cart for future purchase during
+      the anonymous browsing session will not be saved on the user''s computer. (C)
+      A user will not be able to log in to e-mail or social media accounts during
+      the anonymous browsing session. (D) A user browsing in an anonymous window will
+      be protected from viruses launched from any web sites visited or files downloaded.'
+    target: "Let's think step by step. Choice A is incorrect as it only describes\
+      \ network traffic, which an anonymous browser does not change. Choice B is correct\
+      \ as it correctly describes how an anonymous browser will prevent saving data\
+      \ on the user\u2019s computer after the session is ended. Choice C is incorrect\
+      \ because an anonymous browser will not prevent logging in to email or social\
+      \ media accounts. Choice D is incorrect because an anonymous browser in itself\
+      \ performs no virus protection. The answer is (B)."
+  - question: "In the program below, the initial value of X is 5 and the initial value\
+      \ of Y is 10.\nIF (X < 0){\n DISPLAY (\"Foxtrot\")\n} ELSE {\n IF (X > Y){\n\
+      \  DISPLAY (\"Hotel\")\n } ELSE {\n  IF (Y > 0){\n   DISPLAY (\"November\")\n\
+      \  } ELSE {\n   DISPLAY (\"Yankee\")\n  }\n }\n}\nWhat is displayed as a result\
+      \ of running the program?\n(A) Foxtrot (B) Hotel (C) November (D) Yankee"
+    target: Let's think step by step. Because X has the value 5, the first conditional
+      IF (X < 0) is false, so we move to the first ELSE clause. Because X is 5 and
+      Y is 10, the second conditional IF (X > Y) is false, so we move to the following
+      ELSE clause. Since Y is 10, the conditional IF (Y > 0) is true, so the command
+      DISPLAY ("November") is executed. The answer is (C).
+  - question: 'What is the output of "abc"[::-1] in Python 3?
+
+      (A) Error (B) abc (C) cba (D) c'
+    target: Let's think step by step. We know that the slicing operator [::-1] takes
+      all of the elements in the string in reverse order, so we reverse the order
+      of the string "abc", resulting in "cba". The answer is (C).
+  - question: "A list of numbers has n elements, indexed from 1 to n. The following algorithm\
+      \ is intended to display the number of elements in the list that have a value\
+      \ greater than 100. The algorithm uses the variables count and position. Steps\
+      \ 3 and 4 are missing.\n Step 1: Set count to 0 and position to 1.\n Step 2:\
+      \ If the value of the element at index position is greater than 100, increase\
+      \ the value of count by 1.\n Step 3: (missing step)\n Step 4: (missing step)\n\
+      \ Step 5: Display the value of count.\nWhich of the following could be used\
+      \ to replace steps 3 and 4 so that the algorithm works as intended?\n(A) Step\
+      \ 3: Increase the value of position by 1.\n  Step 4: Repeat steps 2 and 3 until\
+      \ the value of count is greater than 100.\n(B) Step 3: Increase the value of\
+      \ position by 1.\n  Step 4: Repeat steps 2 and 3 until the value of position\
+      \ is greater than n.\n(C) Step 3: Repeat step 2 until the value of count is\
+      \ greater than 100.\n  Step 4: Increase the value of position by 1.\n(D) Step\
+      \ 3: Repeat step 2 until the value of position is greater than n.\n  Step 4:\
+      \ Increase the value of count by 1."
+    target: 'Let''s think step by step. Choice A is incorrect, because its Step 4
+      has an incorrect termination condition, stopping when count is greater than
+      100. We need to stop after inspecting all elements in the list. Choice B is
+      correct because it correctly increments both count and position, and correctly
+      repeats these steps and terminates when all elements in the list have been inspected.
+      Choice C is incorrect because it incorrectly increments the variable count until
+      its value is greater than 100, regardless of the elements in the list. Choice
+      D is incorrect because its step 3 does not increment the value of position,
+      so it will repeat forever. The answer is (B).'
+group: mmlu_flan_cot_fewshot_stem
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_high_school_computer_science