Commit 6ac42518 authored by lintangsutawika
Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into openai_completions
parents 9c3ba7d4 e3644fcc
"dataset_name": "moral_scenarios"
"description": "The following are multiple choice questions (with answers) about moral\
\ scenarios.\n\nQ: For which of these two scenarios does the main character (who\
\ uses I/me/my) do something clearly morally wrong, according to ordinary moral\
\ standards in the US as of 2020? Scenario 1 | I took his wallet after he harassed\
\ me at the bar. Scenario 2 | I made love to my newborn baby boy.\n(A) Wrong, Wrong\
\ (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong\nA: Let's think\
\ step by step. We refer to Wikipedia articles on moral scenarios for help. Stealing\
\ from someone is not considered an ethical response to harassment. And having\
\ sex with a minor is egregiously unethical. The answer is (A).\n\nQ: For which\
\ of these two scenarios does the main character (who uses I/me/my) do something\
\ clearly morally wrong, according to ordinary moral standards in the US as of 2020?\
\ Scenario 1 | I asked for a cigarette from the guy outside the store. Scenario\
\ 2 | When the security guard left his post, I pocketed a rare coin.\n(A) Wrong,\
\ Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong\nA: Let's\
\ think step by step. We refer to Wikipedia articles on moral scenarios for help.\
\ Asking for something is not morally wrong. Stealing is morally wrong. The answer\
\ is (C).\n\nQ: For which of these two scenarios does the main character (who uses\
\ I/me/my) do something clearly morally wrong, according to ordinary moral standards\
\ in the US as of 2020? Scenario 1 | During the house's demolition, I pushed him\
\ into a wall. Scenario 2 | When I choose a babysitter for my twin nieces, I don't\
\ care about her racial background.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not\
\ wrong, Wrong (D) Not wrong, Not wrong\nA: Let's think step by step. We refer to\
\ Wikipedia articles on moral scenarios for help. Physical violence is morally wrong.\
\ Being open to hiring people from different racial backgrounds is an expected moral\
\ stance. The answer is (B).\n\nQ: For which of these two scenarios does the main\
\ character (who uses I/me/my) do something clearly morally wrong, according to\
\ ordinary moral standards in the US as of 2020? Scenario 1 | I bought an old pick-up\
\ truck for work. Scenario 2 | I looked up at the gorgeous waterfall and threw a\
\ rock at it.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not\
\ wrong, Not wrong\nA: Let's think step by step. We refer to Wikipedia articles\
\ on moral scenarios for help. Neither of these are wrong as no one is wronged by\
\ buying a pick-up truck or throwing a rock into a stream. The answer is (D).\n\n\
Q: For which of these two scenarios does the main character (who uses I/me/my) do\
\ something clearly morally wrong, according to ordinary moral standards in the\
\ US as of 2020? Scenario 1 | I really loved my newborn baby boy. Scenario 2 | The\
\ new girl is embarrassed to live in a trailer park, so I invite the class to her\
\ house.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong,\
\ Not wrong\nA: Let's think step by step. We refer to Wikipedia articles on moral\
\ scenarios for help. Loving someone is not wrong. However, exposing something that\
\ someone is embarrassed about could be considered quite mean. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_moral_scenarios"
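Every few-shot exemplar in the description above closes with the phrase "The answer is (X)." A minimal sketch of how a scorer could recover that letter from a model completion follows. The function name `extract_choice` and the regular expression are assumptions made for illustration; this is not necessarily how lm-evaluation-harness filters responses.

```python
import re
from typing import Optional

# Matches the closing phrase used by the few-shot exemplars.
ANSWER_RE = re.compile(r"The answer is \(([A-D])\)")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter found in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


if __name__ == "__main__":
    sample = "Asking for something is not morally wrong. Stealing is morally wrong. The answer is (C)."
    print(extract_choice(sample))
```

Taking the last match rather than the first is a design choice: a completion that runs on into a further question would otherwise be scored on the wrong answer.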
"dataset_name": "nutrition"
"description": "The following are multiple choice questions (with answers) about nutrition.\n\
\nQ: What is the first-line drug for patients with type 2 diabetes and obesity,\
\ as of 2020?\n(A) Acarbose (B) Metformin (C) Sulphonylureas (D) Insulin\nA: Let's\
\ think step by step. We refer to Wikipedia articles on nutrition for help. Metformin\
\ (Fortamet, Glumetza, or others) is usually the first medication prescribed for\
\ type 2 diabetes, as well as obesity. It works by lowering glucose production in\
\ the liver and improving the body's sensitivity to insulin. The answer is (B).\n\
\nQ: Which of the following statements is correct (according to knowledge in 2020)?\n\
(A) Consumers with phenylketonuria must avoid the consumption of the sweetener aspartame\
\ (B) Consumers with phenylketonuria must avoid the consumption of the sweetener\
\ saccharin (C) Consumers with phenylketonuria must avoid the consumption of the\
\ sweetener sucralose (D) Consumers with phenylketonuria must avoid the consumption\
\ of the sweetener acesulfame K\nA: Let's think step by step. We refer to Wikipedia\
\ articles on nutrition for help. People with phenylketonuria (PKU) cannot break\
\ down the amino acid phenylalanine. As it builds up in the blood and brain it can\
\ lead to brain damage. People with PKU should avoid foods that are converted to\
\ phenylalanine in the body, such as aspartame. The answer is (A).\n\nQ: Which of\
\ the following statements about iodine is correct, as of 2020?\n(A) 50% of adults\
\ consume iodine at levels below the RNI (B) Dairy products are a poor source of\
\ iodine (C) The iodine content of organic milk is generally lower than the level\
\ in non-organic milk (D) UK dietary reference values recommend an increase in iodine\
\ intake in pregnancy\nA: Let's think step by step. We refer to Wikipedia articles\
\ on nutrition for help. Organic milk usually has less iodine content than non-organic\
\ milk. The answer is (C).\n\nQ: Which of the following is the most plausible explanation\
\ for the protective effect of dietary fibre against cancer of the colon, as of\
\ 2020?\n(A) Propionic acid, formed during colonic fibre fermentation inhibits liver\
\ fatty acid synthesis (B) Butyric acid, formed during colonic fibre fermentation\
\ stimulates \"silencing\" of the SLC5A8 tumour suppressor gene (C) None of these\
\ options are correct (D) Butyric acid, formed during colonic fibre fermentation\
\ stimulates anti-oxidant defences in the colon\nA: Let's think step by step. We\
\ refer to Wikipedia articles on nutrition for help. Dietary fibre is inversely\
\ proportional to the risk of colorectal cancer. This is presumed because butyric\
\ acid (BA) stimulates antioxidants which help protect the colon from cancerous\
\ tumors. The answer is (D).\n\nQ: In a cohort study, the risk ratio of developing\
\ diabetes was 0.86 when comparing consumers of tea (the exposed) to those who did\
\ not drink tea (the unexposed). Which one statement is correct (according to knowledge\
\ in 2020)?\n(A) The tea drinkers have lower risk of developing diabetes. (B) The\
\ tea drinkers have higher risk of developing diabetes. (C) Based on the information\
\ given we cannot tell if the observed difference in disease risk is the result\
\ of chance. (D) The risk ratio is close to the value one, so there is no difference\
\ in disease risk between the two groups.\nA: Let's think step by step. We refer\
\ to Wikipedia articles on nutrition for help. The risk ratio is not sufficiently\
\ reduced that it could not be explained by random chance given the study's sample\
\ size. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_nutrition"
"dataset_name": "philosophy"
"description": "The following are multiple choice questions (with answers) about philosophy.\n\
\nQ: The study of reality in the broadest sense, an inquiry into the elemental nature\
\ of the universe and the things in it, is known as _____.\n(A) metaphysics (B)\
\ epistemology (C) quantum physics (D) axiology\nA: Let's think step by step. We\
\ refer to Wikipedia articles on philosophy for help. Among the options, only metaphysics\
\ studies the nature of reality and existence. The answer is (A).\n\nQ: According\
\ to Moore’s “ideal utilitarianism,” the right action is the one that brings about\
\ the greatest amount of:\n(A) pleasure. (B) happiness. (C) good. (D) virtue.\n\
A: Let's think step by step. We refer to Wikipedia articles on philosophy for help.\
\ Moore's \"ideal utilitarianism\" states that one's actions should maximize intrinsic\
\ goods. The answer is (C).\n\nQ: Before Tolstoy's Christian conversion, what was\
\ his perspective on the meaning of life?\n(A) optimist (B) satisfied (C) nominally\
\ religious (D) pessimist\nA: Let's think step by step. We refer to Wikipedia articles\
\ on philosophy for help. Before his conversion, Tolstoy feels that life was uncertain,\
\ which is a pessimist's point of view. The answer is (D).\n\nQ: According to d'Holbach,\
\ people always act according to _____.\n(A) free choices (B) dictates of the soul\
\ (C) necessary natural laws (D) undetermined will\nA: Let's think step by step.\
\ We refer to Wikipedia articles on philosophy for help. d'Holbach believes that\
\ people act according to necessary laws, and it proves nothing about people's free\
\ will. The answer is (C).\n\nQ: Psychological egoism is:\n(A) an ethical theory\
\ about how we ought to behave. (B) a generalization concerning the way people tend\
\ to behave. (C) a claim about human nature and the ways people are capable of behaving.\
\ (D) none of the above.\nA: Let's think step by step. We refer to Wikipedia articles\
\ on philosophy for help. Psychological egoism suggests that one behaves based on\
\ what makes one feels good, hence it is a claim about human nature and how humans\
\ are capable of behaving. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_philosophy"
"dataset_name": "prehistory"
"description": "The following are multiple choice questions (with answers) about prehistory.\n\
\nQ: What is the approximate mean cranial capacity of Homo erectus?\n(A) under 650\
\ cc (B) about 800 cc (C) just under 1000 cc (D) 1200 cc\nA: Let's think step by\
\ step. We refer to Wikipedia articles on prehistory for help. The average cranium\
\ capacity of Homo erectus is less than 1000 cubic cm. The answer is (C).\n\nQ:\
\ According to Timothy Pauketat, the evidence for social stratification and political\
\ power at Cahokia suggests:\n(A) a center of Mississippian civilization with conditions\
\ similar to the rise of early states. (B) the limitations of authority in a Native\
\ American society of egalitarian foragers. (C) a simple chiefdom or perhaps a complex\
\ chiefdom had evolved by A.D. 1500. (D) a center of Mississippian civilization\
\ with conditions similar to societies on the Northwest Coast of North America.\n\
A: Let's think step by step. We refer to Wikipedia articles on prehistory for help.\
\ Timothy Pauketat is known for his research on Cahokia, the center of the Mississippian\
\ culture, where he found similar conditions to the rise of early states. The answer\
\ is (A).\n\nQ: Recent research on hominid species dating from the Middle Pliocene\
\ indicates there was (as of 2020):\n(A) a great amount of species diversity, or\
\ a single species that exhibited a lot of diversity. (B) very little species diversity\
\ during this period and very few hominids. (C) decreased species diversity due\
\ to a prolonged ice age followed by a severe drought. (D) decreased species diversity\
\ but increased numbers of hammerstones and flakes, indicating stone tool manufacture.\n\
A: Let's think step by step. We refer to Wikipedia articles on prehistory for help.\
\ Recent research has recognized multiple hominid species from the Middle Pliocene,\
\ meaning that there is a great amount of species diversity or diversity in a single\
\ species. The answer is (A).\n\nQ: Researchers now believe that the decline of\
\ the Maya was caused chiefly by:\n(A) a cataclysm of some kind, such as an earthquake,\
\ volcano, or tsunami. (B) ecological degradation resulting from slash-and-burn\
\ farming techniques. (C) endless wars between neighboring Mayan city-states. (D)\
\ practices of interbreeding that led to a steep rise in congenital disorders.\n\
A: Let's think step by step. We refer to Wikipedia articles on prehistory for help.\
\ Researchers believe that the Maya collapse was mainly caused by over-exploitation\
\ of natural resources like the slash-and-burn farming techniques. The answer is\
\ (B).\n\nQ: The great Mayan king Pacal built temples in the city of Palenque in\
\ order to:\n(A) satisfy the powerful Mayan astronomer priests. (B) display his\
\ generosity to the common people, since they were allowed to live in the temples.\
\ (C) frighten away enemies, in particular the Spaniards. (D) legitimize his kingship,\
\ since his father was not royal.\nA: Let's think step by step. We refer to Wikipedia\
\ articles on prehistory for help. Pacal built the temples as the funerary monument\
\ to legitimize his kingship. The answer is (D)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_prehistory"
"dataset_name": "professional_accounting"
"description": "The following are multiple choice questions (with answers) about professional\
\ accounting.\n\nQ: An auditor traces the serial numbers on equipment to a nonissuer’s\
\ subledger. Which of the following management assertions is supported by this test?\n\
(A) Valuation and allocation (B) Completeness (C) Rights and obligations (D) Presentation\
\ and disclosure\nA: Let's think step by step. We refer to Wikipedia articles on\
\ accounting for help. The completeness assertion is tested by tracing supporting\
\ documents to the record entries. The answer is (B).\n\nQ: One hundred years ago,\
\ your great-great-grandmother invested $100 at 5% yearly interest. What is the\
\ investment worth today?\n(A) $13,000 (B) $600 (C) $15,000 (D) $28,000\nA: Let's\
\ think step by step. We refer to Wikipedia articles on accounting for help. A $100\
\ investment at 5% yearly interest is worth 100*(1.05)^100=13150 after 100 years,\
\ which is around $13,000. The answer is (A).\n\nQ: On January 1, year 1, Alpha\
\ Co. signed an annual maintenance agreement with a software provider for $15,000\
\ and the maintenance period begins on March 1, year 1. Alpha also incurred $5,000\
\ of costs on January 1, year 1, related to software modification requests that\
\ will increase the functionality of the software. Alpha depreciates and amortizes\
\ its computer and software assets over five years using the straight-line method.\
\ What amount is the total expense that Alpha should recognize related to the maintenance\
\ agreement and the software modifications for the year ended December 31, year\
\ 1?\n(A) $5,000 (B) $13,500 (C) $16,000 (D) $20,000\nA: Let's think step by step.\
\ We refer to Wikipedia articles on accounting for help. The maintenance period\
\ begins on March 1, so only 10 months of expenses should be recognized, which is\
\ $15,000/12*10=$12,500. The software modification cost is amortized over 5 years,\
\ so each year is $5,000/5=$1,000. So the total expense is $12,500+$1,000=$13,500.\
\ The answer is (B).\n\nQ: Krete is an unmarried taxpayer with income exclusively\
\ from wages. By December 31, year 1, Krete's employer has withheld $16,000 in federal\
\ income taxes and Krete has made no estimated tax payments. On April 15, year 2,\
\ Krete timely filed for an extension request to file her individual tax return,\
\ and paid $300 of additional taxes. Krete's year 1 tax liability was $16,500 when\
\ she timely filed her return on April 30, year 2, and paid the remaining tax liability\
\ balance. What amount would be subject to the penalty for underpayment of estimated\
\ taxes?\n(A) $0 (B) $500 (C) $1,650 (D) $16,500\nA: Let's think step by step. We\
\ refer to Wikipedia articles on accounting for help. The tax due after withholding\
\ is $16,500-$16,000=$500, which is less than $1000, hence there is no underpayment\
\ penalty of estimated taxes. The answer is (A).\n\nQ: Box a nongovernmental not-for-profit\
\ organization had the following transactions during the year: Proceeds from sale\
\ of investments $80000 Purchase of property plant and equipment $10000 Proceeds\
\ from long-term debt $100000 Loss on sale of investment $5000 What amount should\
\ be reported as net cash provided by financing activities in Box's statement of\
\ cash flows?\n(A) $70,000 (B) $75,000 (C) $80,000 (D) 100000\nA: Let's think step\
\ by step. We refer to Wikipedia articles on accounting for help. Among the four\
\ transactions, only Proceeds from long-term debt belongs to the financing activities\
\ section of cashflow, hence the amount reported should be $100000. The answer is\
\ (D)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_accounting"
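Two of the accounting exemplars above rest on short calculations. The sketch below simply re-derives the figures quoted in those exemplars; the variable names are illustrative and not part of the task config.

```python
# Compound interest exemplar: $100 invested at 5% yearly interest for 100 years.
future_value = 100 * 1.05 ** 100

# Maintenance exemplar: a $15,000 annual agreement starting March 1,
# so 10 of its 12 months fall in year 1.
maintenance_expense = 15_000 / 12 * 10

# Software modification exemplar: $5,000 amortized straight-line over 5 years.
amortization_expense = 5_000 / 5

total_expense = maintenance_expense + amortization_expense

if __name__ == "__main__":
    print(round(future_value))   # close to the $13,000 option
    print(total_expense)
```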
"dataset_name": "professional_law"
"description": "The following are multiple choice questions (with answers) about professional\
\ law.\n\nQ: A son owed a creditor $5,000. The son's father contacted the creditor\
\ and told him that he wanted to pay the son's debt. The father signed a document\
\ that stated the father would pay the son's debt at a rate of $500 a month for\
\ 10 months. The creditor made no written or oral commitment to forbear to sue the\
\ son to collect the $5,000 debt, and the father made no oral or written request\
\ for any such forbearance. For the next five months, the father made and the creditor\
\ accepted the $500 monthly payments as agreed. During that period, the creditor,\
\ in fact, did forbear to take any legal action against the son. However, the father\
\ then informed the creditor that he would make no further payments on the debt.\
\ Which of the following is the most persuasive argument that the father is liable\
\ to the creditor under the terms of their agreement?\n(A) The father's promise\
\ and the creditor's reliance thereon, if proved, gave rise to a valid claim by\
\ the creditor against the father based on the doctrine of promissory estoppel.\
\ (B) Because it was foreseeable that the father's promise would induce the creditor\
\ to forbear taking any action against the son, such forbearance was, as a matter\
\ of law, a bargained-for consideration for the father's promise. (C) The father's\
\ five payments to the creditor totaling $2,500 manifested a serious intent on the\
\ father's part to be contractually bound, and such manifestation is generally recognized\
\ as an effective substitute for consideration. (D) By assuming the antecedent debt\
\ obligation that the son owed to the creditor, the father became a surety whose\
\ promise to the creditor was enforceable, since it was in writing and supported\
\ by adequate consideration. \nA: Let's think step by step. We refer to Wikipedia\
\ articles on law for help. The doctrine of promissory estoppel stops a person from\
\ going back on a promise in contract law, hence option (A) should be the most persuasive\
\ argument. The answer is (A).\n\nQ: A state has recently enacted a statute prohibiting\
\ the disposal of any nuclear wastes within the state. This law does not contravene\
\ or conflict with any federal statutes. A man operates a company in the state that\
\ is engaged in the disposal of nuclear wastes. Subsequent to the passage of the\
\ state statute, the man, not yet aware of the new law, entered into contracts with\
\ many out-of-state firms to dispose of their nuclear wastes in the state. On account\
\ of this new law, however, the man will be unable to perform these contracts. Assume\
\ that the man has standing to challenge this state law. Which of the following\
\ presents his strongest constitutional grounds to challenge the state law prohibiting\
\ the disposal of nuclear wastes within the state?\n(A) The commerce clause. (B)\
\ The equal protection clause of the Fourteenth Amendment. (C) The privileges and\
\ immunities clause of Article IV, Section 2. (D) The contract clause.\nA: Let's\
\ think step by step. We refer to Wikipedia articles on law for help. The commerce\
\ clause states that Congress shall have the power to regulate commerce with foreign\
\ Nations, and among the several States, and with the Indian Tribes. The statute\
\ affects inter-state commerce which puts it into question. Hence the man's strongest\
\ argument should be the commerce clause. The answer is (A).\n\nQ: On October 1,\
\ 1980, a developer, owner of several hundred acres in a rural county, drafted a\
\ general development plan for the area. The duly recorded plan imposed elaborate\
\ limitations and restrictions upon the land in the plan, which was to be developed\
\ as a residential district. The restrictions were to extend to all persons acquiring\
\ any of the lots and to their heirs, assigns, and lessees. It was further provided\
\ that all subsequent owners would be charged with due notice of the restrictions.\
\ Among those restrictions in the general plan were the following:(22) A franchise\
\ right is created in a strip of land 10 feet in width along the rear of each lot\
\ for the use of public utility companies with right of ingress and egress. (23)\
\ No house or structure of any kind shall be built on the aforementioned strip of\
\ land running through the said blocks. In 2000, a retiree purchased one of the\
\ lots, built a house, and erected a fence in the rear of his property within the\
\ restricted area. In 2004, a teacher purchased a lot adjacent to the retiree's\
\ property and built a new house. Two years later, a librarian purchased the lot\
\ that adjoined the teacher's property. The three deeds to those properties each\
\ contained references to the deed book where the general plan was recorded. In\
\ 2008, the librarian began the construction of a seven-foot post-and-rail fence\
\ along the line dividing his lot with the teacher's, and along the center of the\
\ area subject to the franchise right. Although the teacher objected to its construction,\
\ the fence was completed. If the teacher seeks a mandatory injunction to compel\
\ removal of the librarian's fence, the court will most likely\n(A) grant relief,\
\ because the fence was in violation of the easement restriction. (B) grant relief,\
\ because the encroachment of the fence violated the restriction in the original\
\ plan. (C) deny relief, because the teacher failed to enforce the restriction against\
\ the retiree. (D) deny relief, because the fence would not be construed as \"a\
\ structure\" within the terms of the restriction. \nA: Let's think step by step.\
\ We refer to Wikipedia articles on law for help. The restrictions in the original\
\ plan say no house or structure of any kind shall be built on the aforementioned\
\ strip of land running through the said blocks. Hence the court will most likely\
\ grant relief because the fence violated the restriction in the original plan.\
\ The answer is (B).\n\nQ: Judge took judicial notice of some facts at the beginning\
\ of the trial. Which of the following is not an appropriate kind of fact for judicial\
\ notice?\n(A) Indisputable facts. (B) Facts that have been asserted by individual\
\ political organizations. (C) Facts recognized to be true by common knowledge.\
\ (D) Facts capable of scientific verification.\nA: Let's think step by step. We\
\ refer to Wikipedia articles on law for help. Among the options, facts that have\
\ been asserted by individual political organizations is not an appropriate kind\
\ of fact for judicial notice. The answer is (B).\n\nQ: A state legislature has\
\ recently enacted a statute making it a misdemeanor to curse or revile or use obscene\
\ or opprobrious language toward or in reference to a police officer performing\
\ his duties. A student at a state university organized a demonstration on campus\
\ to protest the war. The rally was attended by a group of 50 students who shouted\
\ anti-war messages at cars passing by. To show his contempt for the United States,\
\ the student sewed the American flag to the rear of his jeans. When a police officer\
\ saw the flag sewn on the student's jeans, he approached and told him to remove\
\ the flag or he would be placed under arrest. The student became angered and shouted\
\ at the police officer, \"Listen, you bastard, I'll wear this rag anywhere I please.\
\ \" The student was subsequently placed under arrest and charged with violating\
\ the state statute. The student subsequently brings suit in state court challenging\
\ the constitutionality of the statute. The strongest constitutional argument for\
\ the student is that\n(A) the statute is void for vagueness under the Fourteenth\
\ Amendment's due process clause. (B) the statute is invalid because it violates\
\ the petitioner's freedom of speech under the First Amendment. (C) the statute\
\ is an abridgment of freedom of speech under the First Amendment because less restrictive\
\ means are available for achieving the same purpose. (D) the statute is overbroad\
\ and consequently invalid under the First and Fourteenth Amendments.\nA: Let's\
\ think step by step. We refer to Wikipedia articles on law for help. The Fourteenth\
\ Amendment further supports the First Amendment by establishing a due process clause.\
\ Hence the strongest argument should be the statute is overbroad and consequently\
\ invalid under the First and Fourteenth Amendments. The answer is (D)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_law"
"dataset_name": "professional_medicine"
"description": "The following are multiple choice questions (with answers) about professional\
\ medicine.\n\nQ: A 22-year-old male marathon runner presents to the office with\
\ the complaint of right-sided rib pain when he runs long distances. Physical examination\
\ reveals normal heart and lung findings and an exhalation dysfunction at ribs 4-5\
\ on the right. Which of the following muscles or muscle groups will be most useful\
\ in correcting this dysfunction utilizing a direct method?\n(A) anterior scalene\
\ (B) latissimus dorsi (C) pectoralis minor (D) quadratus lumborum\nA: Let's think\
\ step by step. We refer to Wikipedia articles on medicine for help. Among the options,\
\ only pectoralis minor muscle origins from the outer surfaces of the 3rd to 5th\
\ ribs. The answer is (C).\n\nQ: A 36-year-old male presents to the office with\
\ a 3-week history of low back pain. He denies any recent trauma but says that he\
\ climbs in and out of his truck numerous times a day for his job. Examination of\
\ the patient in the prone position reveals a deep sacral sulcus on the left, a\
\ posterior inferior lateral angle on the right, and a lumbosacral junction that\
\ springs freely on compression. The most likely diagnosis is\n(A) left-on-left\
\ sacral torsion (B) left-on-right sacral torsion (C) right unilateral sacral flexion\
\ (D) right-on-right sacral torsion\nA: Let's think step by step. We refer to Wikipedia\
\ articles on medicine for help. The deep sulcus on the left, a posterior ILA on\
\ the right, with a negative spring test suggests a right-on-right sacral torsion.\
\ All other options have a deep sulcus on the right. The answer is (D).\n\nQ: A\
\ 44-year-old man comes to the office because of a 3-day history of sore throat,\
\ nonproductive cough, runny nose, and frontal headache. He says the headache is\
\ worse in the morning and ibuprofen does provide some relief. He has not had shortness\
\ of breath. Medical history is unremarkable. He takes no medications other than\
\ the ibuprofen for pain. Vital signs are temperature 37.4°C (99.4°F), pulse 88/min,\
\ respirations 18/min, and blood pressure 120/84 mm Hg. Examination of the nares\
\ shows erythematous mucous membranes. Examination of the throat shows erythema\
\ and follicular lymphoid hyperplasia on the posterior oropharynx. There is no palpable\
\ cervical adenopathy. Lungs are clear to auscultation. Which of the following is\
\ the most likely cause of this patient's symptoms?\n(A) Allergic rhinitis (B) Epstein-Barr\
\ virus (C) Mycoplasma pneumonia (D) Rhinovirus\nA: Let's think step by step. We\
\ refer to Wikipedia articles on medicine for help. The symptoms, especially the\
\ headache, suggest that the most likely cause is Rhinovirus. Epstein-Barr virus\
\ will cause swollen lymph nodes but there is no palpable cervical adenopathy. Lungs\
\ are clear to auscultation suggests it's not Mycoplasma pneumonia. The answer is\
\ (D).\n\nQ: A previously healthy 32-year-old woman comes to the physician 8 months\
\ after her husband was killed in a car crash. Since that time, she has had a decreased\
\ appetite and difficulty falling asleep. She states that she is often sad and cries\
\ frequently. She has been rechecking the door lock five times before leaving her\
\ house and has to count exactly five pieces of toilet paper before she uses it.\
\ She says that she has always been a perfectionist but these urges and rituals\
\ are new. Pharmacotherapy should be targeted to which of the following neurotransmitters?\n\
(A) Dopamine (B) Glutamate (C) Norepinephrine (D) Serotonin\nA: Let's think step\
\ by step. We refer to Wikipedia articles on medicine for help. The patient feels\
\ sad and among the options, only Dopamine and Serotonin can help increase positive\
\ emotions. Serotonin also affects digestion and metabolism, which can help the\
\ patient's decreased appetite and sleep difficulty. The answer is (D).\n\nQ: A\
\ 42-year-old man comes to the office for preoperative evaluation prior to undergoing\
\ adrenalectomy scheduled in 2 weeks. One month ago, he received care in the emergency\
\ department for pain over his right flank following a motor vehicle collision.\
\ At that time, blood pressure was 160/100 mm Hg and CT scan of the abdomen showed\
\ an incidental 10-cm left adrenal mass. Results of laboratory studies, including\
\ complete blood count, serum electrolyte concentrations, and liver function tests,\
\ were within the reference ranges. The patient otherwise had been healthy and had\
\ never been told that he had elevated blood pressure. He takes no medications.\
\ A follow-up visit in the office 2 weeks ago disclosed elevated urinary normetanephrine\
\ and metanephrine and plasma aldosterone concentrations. The patient was referred\
\ to a surgeon, who recommended the adrenalectomy. Today, vital signs are temperature\
\ 36.6°C (97.9°F), pulse 100/min, respirations 14/min, and blood pressure 170/95\
\ mm Hg. Physical examination discloses no significant findings. Initial preoperative\
\ preparation should include treatment with which of the following?\n(A) Labetalol\
\ (B) A loading dose of potassium chloride (C) Nifedipine (D) Phenoxybenzamine\n\
A: Let's think step by step. We refer to Wikipedia articles on medicine for help.\
\ The symptoms and the adrenal mass suggested pheochromocytoma, and the blood pressure\
\ indicates hypertension. Phenoxybenzamine is used to treat hypertension caused\
\ by pheochromocytoma. The answer is (D)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_medicine"
"dataset_name": "professional_psychology"
"description": "The following are multiple choice questions (with answers) about professional\
\ psychology.\n\nQ: In the construction of a multiple regression equation for purposes\
\ of prediction, the optimal combination of measures is one in which the predictors\n\
(A) are uncorrelated with each other but are moderately correlated with the criterion\
\ (B) have low correlations with each other and low correlations with the criterion\
\ (C) are highly intercorrelated with each other and moderately correlated with\
\ the criterion (D) have low correlations with the criterion but are moderately\
\ correlated with each other\nA: Let's think step by step. We refer to Wikipedia\
\ articles on psychology for help. The basis of multiple regression is to assess\
\ the relationship between one continuous variable and a set of independent variables.\
\ So the predictors should be uncorrelated with each other but are moderately correlated\
\ with the criterion. The answer is (A).\n\nQ: There are three ways to measure the\
\ Central Tendency: the Mean, the Median and the Mode. From your knowledge about\
\ them, what is the mode?\n(A) less sensitive to extreme scores than the mean (B)\
\ more useful for skewed distributions (C) sensitive to extreme values and highly\
\ skewed distributions (D) the most frequently occurring number\nA: Let's think\
\ step by step. We refer to Wikipedia articles on psychology for help. The definition\
\ of mode is the most frequently occurring number. The answer is (D).\n\nQ: Carl\
\ Jung believed that a client's transference:\n(A) is a fantasy that distracts the\
\ client from reality. (B) represents “mixed feelings” toward the therapist. (C)\
\ is a form of \"acting out.\" (D) reflects the client’s personal and collective\
\ unconscious.\nA: Let's think step by step. We refer to Wikipedia articles on psychology\
\ for help. Transference is a phenomenon in which a person's feelings are unconsciously\
\ redirected, so it reflects the client's personal and collective unconscious. The\
\ answer is (D).\n\nQ: In terms of Hofstede’s (1980) five cultural dimensions, the\
\ United States scores at the top of the scale on:\n(A) individualism. (B) individualism\
\ and power distance. (C) power distance and masculinity. (D) uncertainty avoidance.\n\
A: Let's think step by step. We refer to Wikipedia articles on psychology for help.\
\ US scores highest on individualism among the five cultural dimensions. The answer\
\ is (A).\n\nQ: One of your therapy clients asks your advice about a good weight-reduction\
\ program. You have investigated the programs in the community and are\
\ enrolled in the one you consider the best. This program offers a $50 bonus to\
\ its patrons for each new person they bring into the program. Under these circumstances,\
\ your most appropriate response would be to\n(A) tell your client the pros and\
\ cons of each program you know about except for the one in which you are enrolled\
\ (B) recommend to your client the program in which you are enrolled and explain\
\ the $50 bonus you will receive (C) recommend to your client the program in which\
\ you are enrolled and offer to have the $50 bonus credited to your client's account\
\ in the program (D) tell your client the pros and cons of each program you know\
\ about, but do not claim the $50 bonus if your client enrolls in your program\n\
A: Let's think step by step. We refer to Wikipedia articles on psychology for help.\
\ Based on the circumstances, you should tell your client about the pros and cons\
\ of each program, but it would be inappropriate to receive the bonus, so you should\
\ not claim the $50 bonus. The answer is (D)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_psychology"
"dataset_name": "public_relations"
"description": "The following are multiple choice questions (with answers) about public\
\ relations.\n\nQ: Earth Hour was a campaign launched by which organization?\n(A)\
\ Greenpeace (B) The UN (C) Oxfam (D) World Wildlife Fund\nA: Let's think step by\
\ step. We refer to Wikipedia articles on public relations for help. Earth Hour\
\ is a worldwide movement organized and launched by the World Wildlife Fund. The answer\
\ is (D).\n\nQ: In issues management, what is the most proactive approach to addressing\
\ negative or misleading information posted online about your organization?\n(A)\
\ Buy domain names that could be used by opposition groups. (B) Post anonymous comments\
\ on blogs to combat this information. (C) Prepare a news release that discredits\
\ the inaccurate information. (D) Make policy changes to address complaints highlighted\
\ on these sites.\nA: Let's think step by step. We refer to Wikipedia articles on\
\ public relations for help. In issues management, the most proactive approach to\
\ addressing negative or misleading information posted online is to make policy\
\ changes to address complaints highlighted on those sites. The answer is (D).\n\
\nQ: At which stage in the planning process would a situation analysis be carried\
\ out?\n(A) Defining the program (B) Planning the program (C) Taking action and\
\ implementing ideas (D) Evaluation of the program\nA: Let's think step by step.\
\ We refer to Wikipedia articles on public relations for help. Situation analyses\
\ are typically carried out during the planning process stage of defining the program.\
\ The answer is (A).\n\nQ: Which of these statements is true of the Vatican in 2010\
\ at the time of the accusations of child abuse cover-ups?\n(A) There was a coordinated\
\ media response. (B) Consistent messages were communicated. (C) Criticisms were\
\ taken as attacks on the Catholic Church. (D) The credibility of the Vatican was\
\ upheld.\nA: Let's think step by step. We refer to Wikipedia articles on public\
\ relations for help. In 2010 when there were accusations of child abuse cover-ups,\
\ the Vatican took those criticisms as attacks on the Catholic Church. The answer\
\ is (C).\n\nQ: What should a public relations media practitioner do if she does\
\ not know the answer to a reporter's question?\n(A) Give the reporter other information\
\ she is certain is correct. (B) Say that the information is 'off the record' and\
\ will be disseminated later. (C) Say 'I don't know' and promise to provide the\
\ information later. (D) Say 'no comment,' rather than appear uninformed.\nA: Let's\
\ think step by step. We refer to Wikipedia articles on public relations for help.\
\ If a public relations media practitioner does not know the answer to a reporter's\
\ question, they should say 'I don't know' and offer to provide the information\
\ later. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_public_relations"
"dataset_name": "security_studies"
"description": "The following are multiple choice questions (with answers) about security\
\ studies.\n\nQ: What are the frameworks of analysis within which terrorism has\
\ been considered (as of 2020)?\n(A) Competition between larger nations has resulted\
\ in some countries actively supporting terrorist groups to undermine the strength\
\ of rival states. Terrorist networks are extended patronage clubs maintained and\
\ paid for by their donor states and are conceptualised as being like state actors,\
\ to be dealt with using military force. (B) Globalization has enabled the internationalization\
\ of terrorist activities by opening up their operational space, although coordination\
\ is still managed from a geographical base. This suggests that terrorist groups\
\ are nationally structured which means that terrorism cannot be considered in terms\
\ of a war to be defeated militarily without having serious implications on the\
\ indigenous population. (C) Terrorism can be viewed as a problem to be resolved\
\ by military means (war on terrorism), by normal police techniques (terrorism as\
\ crime), or as a medical problem with underlying causes and symptoms (terrorism\
\ as disease). (D) Terrorism is viewed as a criminal problem. The criminalization\
\ of terrorism has two important implications. Firstly, it suggests that terrorism\
\ can be eradicated - terrorists can be caught and brought to trial by normal judicial\
\ proceedings thereby removing the threat from society - and secondly, it suggests\
\ that preventative crime techniques are applicable to prevent its development.\n\
A: Let's think step by step. We refer to Wikipedia articles on security studies\
\ for help. (A) is wrong because it is not competition between larger nations that\
\ causes terrorism. \n(B) is wrong because globalization is not the cause of terrorism.\n\
(C) is correct because the US undertook the war on terrorism. \n(D) is wrong because\
\ preventative crime techniques will likely not end terrorism. The answer is (C).\n\
\nQ: Which of the following is the best lens through which to investigate the role\
\ of child soldiers?\n(A) Child soldiers are victims of combat that need re-education\
\ and rehabilitation. (B) Children and their mothers are not active subjects in\
\ warfare and are best considered as subjects in the private sphere. (C) Children\
\ are most often innocent bystanders in war and are best used as signifiers of peace.\
\ (D) Children have political subjecthood that is missed when they are considered\
\ as passive victims of warfare.\nA: Let's think step by step. We refer to Wikipedia\
\ articles on security studies for help. Child soldiers as a political topic can\
\ be missed when they are considered passive victims of warfare. The answer is (D).\n\
\nQ: How can we best describe the relationship between the state-centric approach\
\ and the concept of human security?\n(A) There are such wide divisions within the\
\ human security framework regarding the nature of threats and referent objects\
\ that no widely applicable comparisons between state-centric approaches and human\
\ security can be drawn. (B) By adopting the framework of human security, the limitations\
\ of the realist state-centric approach become evident. Whilst human security defines\
\ the referent object as the person or population, state-centric approaches prioritise\
\ the security of the state, de-prioritizing the pursuit of human security. (C)\
\ The state-centric approach to security is a faction of human security, usually\
\ defined within the broad school of human security. By being state-centric this\
\ approach prioritises the individual as the referent object in security studies.\
\ (D) Both the state-centric and human-centric approaches to security are mutually\
\ exclusive and offer a sufficient analytic framework with which to understand the\
\ international security system. It is therefore the role of security analysts to\
\ determine which of these substantial concepts is correct, and which should be\
\ discarded.\nA: Let's think step by step. We refer to Wikipedia articles on security\
\ studies for help. Human security focuses on a person or population whereas state-centric\
\ approaches focus on the state while deprioritizing human security. The answer\
\ is (B).\n\nQ: In order to become securitized, a threat must be presented in which\
\ of these ways?\n(A) As an existential threat that requires immediate and extraordinary\
\ action, posing a threat to the survival of the state or to societal security.\
\ (B) As requiring immediate and extraordinary action by the state, threatening\
\ the survival of a referent object and therefore warranting the use of measures\
\ not normally employed in the political realm. (C) As an urgent threat to the survival\
\ of the referent object, so serious that it legitimises the employment of extraordinary\
\ action in response. (D) As an urgent threat to the survival of the audience that\
\ requires extraordinary or emergency measures.\nA: Let's think step by step. We\
\ refer to Wikipedia articles on security studies for help. To be securitized, a\
\ threat must be an urgent threat to the survival of the referent object. The answer\
\ is (C).\n\nQ: What distinguishes coercive diplomacy from military force?\n(A)\
\ Compellence is another term for coercive diplomacy, but covering a narrower set\
\ of criteria; compellence covers those threats aimed at initiating adversary action.\
\ A threat to coerce a state to give up part of its territory would count as coercive\
\ diplomacy, as long as that threat proactively initiates action before reactive\
\ diplomacy is taken. (B) Coercive diplomacy constitutes the threats of limited\
\ force to induce adversary's incentive to comply with the coercer's demands. It\
\ is an influence strategy that is intended to obtain compliance: the use of force\
\ to defeat an opponent first does not count. It leaves an element of choice with\
\ the target to comply, or to continue. (C) Military force, or the threat of military\
\ force, utilises fear to achieve strategic objectives. Coercive diplomacy is differentiated\
\ from this approach, because it does not use fear as a tool for coercing an adversary.\
\ (D) Coercive diplomacy is employed to use force but to limit its effects on the\
\ international community. Coercive diplomacy is an aggressive strategy that is\
\ intended to obtain compliance through defeat. It does not leave an element of\
\ choice with the target, the target either being forced to comply or engage in\
\ conflict. It seeks to control by imposing compliance by removing any opportunity\
\ for negotiation or concession.\nA: Let's think step by step. We refer to Wikipedia\
\ articles on security studies for help. Coercive diplomacy uses the threat of force\
\ to induce the opponent to comply with demands. The answer is (B)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_security_studies"
"dataset_name": "sociology"
"description": "The following are multiple choice questions (with answers) about sociology.\n\
\nQ: Which of the following is not a problem associated with official statistics\
\ on strike action?\n(A) most strikes go unnoticed by employers and the mass media\
\ (B) not all industrial disputes will be reported by the employer (C) the definition\
\ of strikes excludes those that involve fewer than ten workers or last less than\
\ one day (D) it is hard to compare strikes that were measured in different ways\n\
A: Let's think step by step. We refer to Wikipedia articles on sociology for help.\
\ Official statistics on strike action can be problematic because not all industrial\
\ disputes will be reported by employers, the definition of strikes excludes those\
\ that involve fewer than ten workers or last less than one day, and it is hard\
\ to compare strikes that were measured in different ways. Thus, (A) is not a problem\
\ associated with official statistics on strike action. The answer is (A).\n\nQ:\
\ What does Berger (1963) describe as a metaphor for social reality?\n(A) a fairground\
\ ride (B) a circus (C) a puppet theatre (D) a ballet\nA: Let's think step by step.\
\ We refer to Wikipedia articles on sociology for help. Berger describes social\
\ reality using the metaphor of a puppet theatre. The answer is (C).\n\nQ: The term\
\ 'hegemony' refers to:\n(A) the tendency for the working class not to realize their\
\ own interests (B) a dominant ideology that legitimates economic, political and\
\ cultural power (C) a form of dual consciousness based on ideology and everyday\
\ experiences (D) a mode of payment given for outstanding topiary\nA: Let's think\
\ step by step. We refer to Wikipedia articles on sociology for help. Hegemony refers\
\ to a dominant ideology that legitimates economic, political, and cultural power.\
\ The answer is (B).\n\nQ: The shift from 'civil religion' to 'common religion'\
\ means that:\n(A) the increasing bureaucracy of the state has made religion only\
\ a marginal part of our lives (B) despite the weakening of traditional authority,\
\ our everyday lives and 'common sense' remain shaped by religious beliefs and values\
\ (C) religious participation in collective worship may have declined, but people\
\ still practise their faiths in private (D) people are much more likely to discuss\
\ their religious beliefs in public, informal settings\nA: Let's think step by step.\
\ We refer to Wikipedia articles on sociology for help. The shift from civil religion\
\ to common religion means that despite the weakening of traditional authority,\
\ our everyday lives and common sense remain shaped by religious beliefs and values.\
\ The answer is (B).\n\nQ: Which of the following did the post-war welfare state\
\ of 1948 not aim to provide:\n(A) free health care and education for all (B) a\
\ minimum wage (C) full employment (D) universal welfare\nA: Let's think step by\
\ step. We refer to Wikipedia articles on sociology for help. The post-war welfare\
\ state of 1948 aimed to provide free healthcare and education, full employment,\
\ and universal welfare. But it did not aim to provide a minimum wage. The answer\
\ is (B)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_sociology"
"dataset_name": "us_foreign_policy"
"description": "The following are multiple choice questions (with answers) about us\
\ foreign policy.\n\nQ: How did Donald Trump attack globalization in the 2016 campaign?\n\
(A) Globalization had made men like him too rich (B) Globalization only benefited\
\ certain American states, such as New York (C) Liberal elites had encouraged globalization,\
\ while 'ordinary Americans' lost jobs because of it (D) Globalization encouraged\
\ damaging trade wars\nA: Let's think step by step. We refer to Wikipedia articles\
\ on us foreign policy for help. Trump attacked globalization because he believed\
\ ordinary Americans lost jobs due to it, and so he wanted to blame liberals who\
\ had encouraged it. The answer is (C).\n\nQ: How did NSC-68 change U.S. strategy?\n\
(A) It globalized containment. (B) It militarized containment. (C) It called for\
\ the development of the hydrogen bomb. (D) All of the above\nA: Let's think step\
\ by step. We refer to Wikipedia articles on us foreign policy for help. NSC-68\
\ outlined a variety of courses of action, including globalization of containment,\
\ militarization of containment, and the development of the hydrogen bomb. The answer\
\ is (D).\n\nQ: How do Defensive Realism and Offensive Realism differ in their explanation\
\ of state behaviour?\n(A) Defensive realists place greater emphasis on the role\
\ of international institutions (B) Defensive realists place less emphasis on geographical\
\ factors (C) Offensive realists give more priority to the national interest than\
\ Defensive realists. (D) Defensive realists believe states are security maximizers,\
\ while Offensive realists believe states to be power maximizers\nA: Let's think\
\ step by step. We refer to Wikipedia articles on us foreign policy for help. While\
\ defensive realism advocates that states are security maximizers, offensive realists\
\ think of states as power maximizers. The answer is (D).\n\nQ: The realm of policy\
\ decisions concerned primarily with relations between the United States and the\
\ rest of the world is known as\n(A) terrorism policy. (B) economic policy. (C)\
\ foreign policy. (D) international policy.\nA: Let's think step by step. We refer\
\ to Wikipedia articles on us foreign policy for help. The topic of policy decisions\
\ concerned with relations between the US and the rest of the world is known as foreign\
\ policy. The answer is (C).\n\nQ: How did the 2008 financial crisis affect America's\
\ international reputation?\n(A) It damaged support for the US model of political\
\ economy and capitalism (B) It created anger at the United States for exaggerating\
\ the crisis (C) It increased support for American global leadership under President\
\ Obama (D) It reduced global use of the US dollar\nA: Let's think step by step.\
\ We refer to Wikipedia articles on us foreign policy for help. The 2008 financial\
\ crisis damaged the international reputation of the American model of political\
\ economy and capitalism. The answer is (A)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_us_foreign_policy"
"dataset_name": "virology"
"description": "The following are multiple choice questions (with answers) about virology.\n\
\nQ: The median survival time to AIDS and death was established by following:\n\
(A) Seroprevalent HIV-infected individuals (B) Seronegatives (C) Seroconverters\
\ (D) High-risk seronegatives\nA: Let's think step by step. We refer to Wikipedia\
\ articles on virology for help. The median survival time to AIDS and death was\
\ established by following seroconverters. The answer is (C).\n\
\nQ: Which of the following is a morphological characteristic of the paramyxoviruses.\n\
(A) Fragile viruses often visualised with RNA spewing from the inside (B) Elongate\
\ viruses (C) Icosahedral viruses with envelope (D) Very large viruses\nA: Let's\
\ think step by step. We refer to Wikipedia articles on virology for help. Paramyxoviruses\
\ are fragile viruses often visualised with RNA spewing from the inside. The answer\
\ is (A).\n\nQ: The most important goal of a behavioral intervention is:\n(A) Change\
\ in behavior (B) Comprehensive coverage (C) Effective use of behavioral theory\
\ (D) Sustained behavior change\nA: Let's think step by step. We refer to Wikipedia\
\ articles on virology for help. The primary goal of a behavioral intervention is to\
\ cause sustained behavior change. The answer is (D).\n\nQ: A key factor facilitating\
\ the application of nested case-control studies from the MACS was:\n(A) Data collection\
\ (B) Establishment of a repository of biologic specimens (C) Participant interest\
\ (D) Administration of the questionnaire by staff\nA: Let's think step by step.\
\ We refer to Wikipedia articles on virology for help. The Multicenter AIDS Cohort\
\ Study's use of nested case-control studies was facilitated by the establishment\
\ of a repository of biologic specimens. The answer is (B).\n\nQ: Why are parvoviruses\
\ a highly impactful parasite?\n(A) Because they have no nucleic acid (B) They require\
\ a helper virus (C) Only replicate in dividing cells (D) Can integrate into host\
\ chromosomes\nA: Let's think step by step. We refer to Wikipedia articles on virology\
\ for help. Parvoviruses are highly impactful because they do not have nucleic acid.\
\ The answer is (A)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_virology"
"dataset_name": "world_religions"
"description": "The following are multiple choice questions (with answers) about world\
\ religions.\n\nQ: How can the Upanishads be characterized?\n(A) Ritual texts (B)\
\ Philosophical texts (C) Hymns (D) Origin stories\nA: Let's think step by step.\
\ We refer to Wikipedia articles on world religions for help. The Upanishads are\
\ the most recent part of Vedas (the oldest scriptures in Hinduism) and supplied\
\ the basis of later Hindu philosophy. So they are philosophical texts. The answer\
\ is (B).\n\nQ: What is the Second Gem in Buddhism?\n(A) The Dharma (B) The Sangha\
\ (C) The Buddha (D) The Bodhisattva\nA: Let's think step by step. We refer to Wikipedia\
\ articles on world religions for help. The Second Gem in Buddhism is The Dharma.\
\ The answer is (A).\n\nQ: Which Japanese government promoted a kind of national\
\ cult based on the emperor and his associations with kami?\n(A) Honen (B) Tanaka\
\ (C) Tokugawa (D) Meiji\nA: Let's think step by step. We refer to Wikipedia articles\
\ on world religions for help. The promotion of a national cult based on the emperor\
\ and his associations with Kami happened during the reign of Emperor Meiji (1852-1912).\
\ The answer is (D).\n\nQ: In which dynasty was the \"Mandate of Heaven\" developed\
\ to legitimatize the new rulers?\n(A) Shang (B) Zhou (C) Han (D) Xia\nA: Let's\
\ think step by step. We refer to Wikipedia articles on world religions for help.\
\ The \"Mandate of Heaven\" was developed as an ancient Chinese philosophical concept\
\ during the Zhou Dynasty (1046-256 BCE). The answer is (B).\n\nQ: What is the sign\
\ of the covenant for Jewish males?\n(A) The rainbow (B) Circumcision (C) A son\
\ (D) Bar mitzvah\nA: Let's think step by step. We refer to Wikipedia articles on\
\ world religions for help. In Judaism, the most distinctive sign of the covenant\
\ is circumcision (brit milah). The answer is (B)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_world_religions"
group: mmlu_flan_cot_zeroshot
task:
- mmlu_flan_cot_zeroshot_stem
- mmlu_flan_cot_zeroshot_other
- mmlu_flan_cot_zeroshot_social_sciences
- mmlu_flan_cot_zeroshot_humanities
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
validation_split: validation
fewshot_split: dev
output_type: generate_until
doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA: Let's think step by step."
doc_to_target: "{{['(A)', '(B)', '(C)', '(D)'][answer]}}"
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
  do_sample: false
  temperature: 0.0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
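To illustrate what the `get-answer` filter above extracts, here is a minimal Python sketch that applies the same `regex_pattern` with the standard `re` module. It is not the harness's own filter code: the helper name `extract_answer` and the `[invalid]` fallback string are assumptions made for illustration.

```python
import re

# Same pattern as `regex_pattern` in the template config above.
PATTERN = (
    r"((?<=The answer is )(.*)(?=.)"
    r"|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)"
    r"|(?<=The final answer: )(.*)(?=.))"
)


def extract_answer(generation: str, fallback: str = "[invalid]") -> str:
    """Return the text captured after an answer marker, or `fallback`.

    Hypothetical helper: it mimics a regex filter followed by taking the
    first match, but it is only a sketch of that idea.
    """
    match = re.search(PATTERN, generation)
    return match.group(1) if match else fallback


if __name__ == "__main__":
    # The trailing period is consumed by the lookahead, not the capture.
    print(extract_answer("Stealing is morally wrong. The answer is (C)."))
    # Without a trailing period the last character is dropped.
    print(extract_answer("The answer is (C)"))
    # No answer marker at all.
    print(extract_answer("I am not sure."))
```

Because the lookahead `(?=.)` uses an unescaped dot, it accepts any character rather than only a period, so a completion that ends without a trailing period loses its final character (`(C` rather than `(C)`). The `ignore_punctuation: true` metric setting would presumably hide that difference at scoring time.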
"dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_abstract_algebra"
"dataset_name": "anatomy"
"description": "The following are multiple choice questions (with answers) about anatomy.\n\
\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_anatomy"
"dataset_name": "astronomy"
"description": "The following are multiple choice questions (with answers) about astronomy.\n\
\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_astronomy"
"dataset_name": "business_ethics"
"description": "The following are multiple choice questions (with answers) about business\
\ ethics.\n\n"
"group": "mmlu_flan_cot_zeroshot_other"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_business_ethics"