Unverified commit c9bbec6e, authored by Hailey Schoelkopf, committed by GitHub

Merge pull request #1060 from EleutherAI/fix-mmlu

[Refactor] Fix fewshot cot mmlu descriptions
parents 7afae7b5 57e017ff
Pipeline #2992 failed with stages
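
Each description hunk in this merge appends `\n\n` to the last few-shot answer. A minimal Python sketch of why that matters, assuming the description is joined to the test question by plain string concatenation (the helper `build_prompt` and the sample strings are hypothetical and are not taken from the harness):

```python
def build_prompt(description: str, question: str) -> str:
    """Hypothetical helper: prepend a task description to a question.

    Assumption: the two strings are joined with no separator of their own,
    so any spacing has to come from the description itself.
    """
    return description + question


# Tail of a few-shot description before and after the change.
old_description = "The answer is (B)."
new_description = "The answer is (B).\n\n"

# Placeholder test question; the wording is illustrative only.
question = "Q: <test question>\nA: Let's think step by step."

old_prompt = build_prompt(old_description, question)
new_prompt = build_prompt(new_description, question)

# repr() makes the newline characters visible.
print(repr(old_prompt))
print(repr(new_prompt))
```

Under that assumption, the old description runs the new question onto the same line as the previous answer, while the new one leaves a blank line between them.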
@@ -150,7 +150,7 @@ It is on our roadmap to create task variants designed to enable models which do
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
### Additional Features
@@ -158,7 +158,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
> --tasks all_tasks \
@@ -166,7 +166,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> --num_examples 10 \
> --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
......
This source diff could not be displayed because it is too large. You can view the blob instead.
@@ -35,7 +35,7 @@
\ then x^2 + c = x^2 + 1 = 0 + 1 for x = 0, 1 + 1 = 2 for x = 1 and 1 + 1 = 2 for\
\ x = 2, hence x^2 + 1 does not have any roots. For c = 2 the polynomial x^2 + 2\
\ has two roots at x = 1 and x = 2. Hence Z_3[x]/(x^2 + c) is a field if and only\
-\ if c = 1. The answer is (B)."
+\ if c = 1. The answer is (B).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_abstract_algebra"
@@ -51,7 +51,7 @@
\ of the hyoid bone; therefore, the embryological origin of the hyoid bone are the\
\ second and the third pharyngeal arches—this information is covered in the last\
\ option (D). Therefore, we conclude that (D) must be the correct answer. The answer\
-\ is (D)."
+\ is (D).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_anatomy"
@@ -49,7 +49,7 @@
\ red. Options (C) and (D) are not specific enough about why the color of the surface\
\ would be red, while (A) is correct because it explains that the surface is red\
\ due to the rusted materials on the surface and the red color comes from the rust.\
-\ So the correct option is (A). The answer is (A)."
+\ So the correct option is (A). The answer is (A).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_astronomy"
@@ -50,7 +50,7 @@
\ that best uses the possible options above is “Beyond the business case for engaging\
\ the CSR there are a number of moral arguments relating to: negative *externalities*,\
\ the *power* that corporations possess and the *mutual independence* of business\
-\ and society. The answer is (D)."
+\ and society. The answer is (D).\n\n"
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_business_ethics"
@@ -29,7 +29,7 @@
\ (D) oxidative phosphorylation.\nA: Let's think step by step. We refer to Wikipedia\
\ articles on clinical knowledge for help. The energy for muscular contraction is\
\ provided by ATP (adenosine triphosphate), which is the powerhouse of the cell.\
-\ The answer is (A)."
+\ The answer is (A).\n\n"
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_clinical_knowledge"
@@ -55,7 +55,7 @@
\ resemblance of structures that have different origins, which is not the case for\
\ the human and bird forearms, which rules out (D). Humans and birds do belong to\
\ the same clade - a group of organisms composed of a common ancestor. The answer\
-\ is (C)."
+\ is (C).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_college_biology"
@@ -32,7 +32,7 @@
\ hyperfine interaction with the 13C (nuclear spin $I = \nrac{1}{2}$) which will\
\ split the spectrum into 2 lines. This will be further split into 4 lines by the\
\ interaction with three equivalent 1H nuclei. The total number of lines is therefore\
-\ $2 \\cdot 4 = 8$. The answer is (E)."
+\ $2 \\cdot 4 = 8$. The answer is (E).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_college_chemistry"
@@ -73,7 +73,7 @@
Thus we can see that on average a single processor will lock the bus for:\nlock_ns_per_miss\
\ * misses_per_instruction * instructions_per_ns =\n(1000 nanoseconds / cache miss)\
\ * (1 cache miss / 50 instructions) * (50 instructions / 27000 nanoseconds) = 1000\
-\ * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer is (B)."
+\ * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer is (B).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_college_computer_science"
@@ -44,7 +44,7 @@
\ $t \\in \\mathbb{R}, \\ln ((s(t)-2))=-[t / 25]+C$. Let $K:=e^{C}$. Then, for all\
\ $t \\in \\mathbb{R}$, we have $(s(t))-2=K e^{-t / 25}$, and so $s(t)=2+K e^{-t\
\ / 25}$. Then $3=s(0)=2+K e^{0}=2+K$, so $K=1$. Then $s(100)=2+K e^{-100 / 25}=2+1\
-\ \\cdot e^{-4}=2+e^{-4}$. The answer is (D)."
+\ \\cdot e^{-4}=2+e^{-4}$. The answer is (D).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_college_mathematics"
@@ -46,7 +46,7 @@
\ monocarbylic acid transporters.\nA: Let's think step by step. We refer to Wikipedia\
\ articles on medicine for help. Glucose (also known as the blood sugar) is the\
\ main sugar found in the human body. It is transported into the muscle cell via\
-\ diffusion through protein transporters called GLUT4. The answer is (A)."
+\ diffusion through protein transporters called GLUT4. The answer is (A).\n\n"
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_college_medicine"
@@ -38,7 +38,7 @@
\ go into the gases internal energy or work done against an external force. However,\
\ if the volume of the gas container is constant, no work will be done (since work\
\ is pressure times change in volume). So, at constant volume, all of the heat goes\
-\ into the internal energy. The answer is (B)."
+\ into the internal energy. The answer is (B).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_college_physics"
@@ -30,7 +30,7 @@
\ resulted from improper input validation (due to a missing bounds check) in the\
\ implementation of the TLS heartbeat extension. The vulnerability was classified\
\ as a buffer over-read, a situation where more data can be read than should be\
-\ allowed. The answer is (C)."
+\ allowed. The answer is (C).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_computer_security"
@@ -27,7 +27,7 @@
\ speed in the direction of the wind is greater than it would be in the absence\
\ of wind, and its direction orthogonal to the wind is the same as it would be in\
\ the absence of the wind. The total speed, which is these two components added\
-\ in quadrature, is thus greater than the speed in still air. The answer is (B)."
+\ in quadrature, is thus greater than the speed in still air. The answer is (B).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_conceptual_physics"
@@ -57,7 +57,7 @@
\ die away (B) Persist indefinitely (C) Grow exponentially (D) Never occur\nA: Let's\
\ think step by step. We refer to Wikipedia articles on econometrics for help. This\
\ is a formal logic problem about stationally process. For a stationary autoregressive\
-\ process, shocks will eventually die away. The answer is (A)."
+\ process, shocks will eventually die away. The answer is (A).\n\n"
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_econometrics"
@@ -28,7 +28,7 @@
\ is 100. Find the total resistance\n(A) 200Ω (B) 100Ω (C) 50Ω (D) 10Ω\nA: Let's\
\ think step by step. In lap winding, effectively two resistors are connected in\
\ parallel, so the actual resistance of each pair is 1 Ohm. Since we have 50 pairs,\
-\ we get a total resistance of 50 Ohms. The answer is (C)."
+\ we get a total resistance of 50 Ohms. The answer is (C).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_electrical_engineering"
@@ -35,7 +35,7 @@
\nQ: Which expression is equivalent to 5 x 9?\n(A) (5 x 4) x (6 x 5)\n(B) (5 x 5)\
\ + (5 x 4)\n(C) (5 x 5) + (5 x 9)\n(D) (5 x 9) x (6 x 9)\nA: Let's think step by\
\ step. We know that 9 = (5 + 4), so 5 x 9 = 5 x (5 + 4) = (5 x 5) + (5 x 4). The\
-\ answer is (B)."
+\ answer is (B).\n\n"
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_elementary_mathematics"
@@ -47,7 +47,7 @@
\ (∀x)(Px ~Dx) For all x, x is on Mars implies that x do not drive on Mars.\n\
Option (D): ~Dp: p do not drive on Mars.\nOf all these options, Option (C) appears\
\ to be the best and most meaningful interpretation of the argument “No people drive\
-\ on Mars.” The answer is (C)."
+\ on Mars.” The answer is (C).\n\n"
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_formal_logic"
@@ -28,7 +28,7 @@
\ of their nation or the world.\nA: Let's think step by step. We refer to Wikipedia\
\ articles on global facts for help. As of 2019, most people tend to be optimistic\
\ about their own future but pessimistic about the future of their nation or the\
-\ world. The answer is (B)."
+\ world. The answer is (B).\n\n"
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_global_facts"