Unverified commit 517aadc4, authored by Lintang Sutawika and committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring to use ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group with tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* make task_id updatable and use the group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee

 "dataset_name": "world_religions"
 "description": "The following are multiple choice questions (with answers) about world\
 \ religions.\n\n"
-"group": "mmlu_humanities"
-"group_alias": "humanities"
+"tag": "mmlu_humanities_tasks"
 "include": "_default_template_yaml"
 "task": "mmlu_world_religions"
 "task_alias": "world_religions"
 group: mmlu_flan_cot_fewshot
+group_alias: mmlu (flan style, fewshot cot)
 task:
-  - mmlu_flan_cot_fewshot_stem
-  - mmlu_flan_cot_fewshot_other
-  - mmlu_flan_cot_fewshot_social_sciences
-  - mmlu_flan_cot_fewshot_humanities
+  - group: stem
+    task:
+      - mmlu_flan_cot_fewshot_stem
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: other
+    task:
+      - mmlu_flan_cot_fewshot_other
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: social sciences
+    task:
+      - mmlu_flan_cot_fewshot_social_sciences
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: humanities
+    task:
+      - mmlu_flan_cot_fewshot_humanities
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 1
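
The `aggregate_metric_list` entries above ask for `acc` to be aggregated with `weight_by_size: True`. As a rough illustration of what size weighting means, the sketch below compares a size-weighted mean with a plain mean over subtask scores. It is not the harness's implementation: the function name `aggregate_subtask_scores`, its signature, and the sample numbers are assumptions made up for this example.

```python
def aggregate_subtask_scores(scores, sizes, weight_by_size=True):
    """Combine per-subtask scores into one group score (illustrative only).

    Hypothetical helper, not part of the harness API.
    """
    if not scores or len(scores) != len(sizes):
        raise ValueError("scores and sizes must be non-empty and the same length")
    if not weight_by_size:
        # Plain mean: every subtask counts equally.
        return sum(scores) / len(scores)
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    # Size-weighted mean: every example counts equally.
    return sum(score * size for score, size in zip(scores, sizes)) / total


acc = [0.5, 0.8]    # made-up subtask accuracies
sizes = [100, 300]  # made-up subtask sizes (number of examples)

weighted = aggregate_subtask_scores(acc, sizes, weight_by_size=True)
unweighted = aggregate_subtask_scores(acc, sizes, weight_by_size=False)
print(f"weighted={weighted:.3f} unweighted={unweighted:.3f}")
# prints: weighted=0.725 unweighted=0.650
```

With weighting on, the larger subtask pulls the group score toward its own accuracy. With weighting off, both subtasks contribute equally regardless of size.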
@@ -54,6 +54,6 @@ fewshot_config:
 not have any roots. For c = 2 the polynomial x^2 + 2 has two roots at x = 1
 and x = 2. Hence Z_3[x]/(x^2 + c) is a field if and only if c = 1. The answer
 is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_abstract_algebra
@@ -70,6 +70,6 @@ fewshot_config:
 \ origin of the hyoid bone are the second and the third pharyngeal arches\u2014\
 this information is covered in the last option (D). Therefore, we conclude that\
 \ (D) must be the correct answer. The answer is (D).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_anatomy
@@ -65,6 +65,6 @@ fewshot_config:
 because it explains that the surface is red due to the rusted materials on the
 surface and the red color comes from the rust. So the correct option is (A).
 The answer is (A).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_astronomy
@@ -70,6 +70,6 @@ fewshot_config:
 \ moral arguments relating to: negative *externalities*, the *power* that corporations\
 \ possess and the *mutual independence* of business and society. The answer\
 \ is (D).\n\n"
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_business_ethics
@@ -43,6 +43,6 @@ fewshot_config:
 target: 'Let''s think step by step. We refer to Wikipedia articles on clinical
 knowledge for help. The energy for muscular contraction is provided by ATP (adenosine
 triphosphate), which is the powerhouse of the cell. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_clinical_knowledge
@@ -70,6 +70,6 @@ fewshot_config:
 that have different origins, which is not the case for the human and bird forearms,
 which rules out (D). Humans and birds do belong to the same clade - a group
 of organisms composed of a common ancestor. The answer is (C).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_biology
@@ -44,6 +44,6 @@ fewshot_config:
 \ into 2 lines. This will be further split into 4 lines by the interaction with\
 \ three equivalent 1H nuclei. The total number of lines is therefore $2 \\cdot\
 \ 4 = 8$. The answer is (E).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_chemistry
@@ -175,6 +175,6 @@ fewshot_config:
 (1000 nanoseconds / cache miss) * (1 cache miss / 50 instructions) * (50 instructions
 / 27000 nanoseconds) = 1000 * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer
 is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_computer_science
@@ -68,6 +68,6 @@ fewshot_config:
 \ Then, for all $t \\in \\mathbb{R}$, we have $(s(t))-2=K e^{-t / 25}$, and\
 \ so $s(t)=2+K e^{-t / 25}$. Then $3=s(0)=2+K e^{0}=2+K$, so $K=1$. Then $s(100)=2+K\
 \ e^{-100 / 25}=2+1 \\cdot e^{-4}=2+e^{-4}$. The answer is (D).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_mathematics
@@ -63,6 +63,6 @@ fewshot_config:
 for help. Glucose (also known as the blood sugar) is the main sugar found in
 the human body. It is transported into the muscle cell via diffusion through
 protein transporters called GLUT4. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_medicine
@@ -56,6 +56,6 @@ fewshot_config:
 of the gas container is constant, no work will be done (since work is pressure
 times change in volume). So, at constant volume, all of the heat goes into the
 internal energy. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_physics
@@ -45,6 +45,6 @@ fewshot_config:
 of the TLS heartbeat extension. The vulnerability was classified as a buffer
 over-read, a situation where more data can be read than should be allowed. The
 answer is (C).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_computer_security
@@ -44,6 +44,6 @@ fewshot_config:
 \ orthogonal to the wind is the same as it would be in the absence of the wind.\
 \ The total speed, which is these two components added in quadrature, is thus\
 \ greater than the speed in still air. The answer is (B).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_conceptual_physics
@@ -82,6 +82,6 @@ fewshot_config:
 target: 'Let''s think step by step. We refer to Wikipedia articles on econometrics
 for help. This is a formal logic problem about stationally process. For a stationary
 autoregressive process, shocks will eventually die away. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_econometrics
@@ -42,6 +42,6 @@ fewshot_config:
 target: 'Let''s think step by step. In lap winding, effectively two resistors
 are connected in parallel, so the actual resistance of each pair is 1 Ohm. Since
 we have 50 pairs, we get a total resistance of 50 Ohms. The answer is (C).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_electrical_engineering
@@ -72,6 +72,6 @@ fewshot_config:
 (D) (5 x 9) x (6 x 9)'
 target: 'Let''s think step by step. We know that 9 = (5 + 4), so 5 x 9 = 5 x (5
 + 4) = (5 x 5) + (5 x 4). The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_elementary_mathematics
@@ -65,6 +65,6 @@ fewshot_config:
 \ p do not drive on Mars.\nOf all these options, Option (C) appears to be the\
 \ best and most meaningful interpretation of the argument \u201CNo people drive\
 \ on Mars.\u201D The answer is (C).\n\n"
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_formal_logic
@@ -44,6 +44,6 @@ fewshot_config:
 for help. As of 2019, most people tend to be optimistic about their own future
 but pessimistic about the future of their nation or the world. The answer is
 (B).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_global_facts
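
Each per-subject file above swaps its `group:` key for a `tag:` key with the same value. One way to picture a tag is as a label that expands to the tasks carrying it. The sketch below builds such a lookup from a few task/tag pairs copied from the diff. The helper `index_tasks_by_tag` is hypothetical, and the assumption that `tag` may be either a string or a list of strings is ours. How the harness actually resolves tags is not shown in this commit page.

```python
from collections import defaultdict


def index_tasks_by_tag(task_configs):
    """Map each tag to the task names that carry it (hypothetical helper)."""
    index = defaultdict(list)
    for cfg in task_configs:
        tags = cfg.get("tag", [])
        # Assumption: `tag` may be a single string or a list of strings.
        if isinstance(tags, str):
            tags = [tags]
        for tag in tags:
            index[tag].append(cfg["task"])
    return dict(index)


# Task/tag pairs copied from the diff above.
configs = [
    {"task": "mmlu_flan_cot_fewshot_abstract_algebra", "tag": "mmlu_flan_cot_fewshot_stem"},
    {"task": "mmlu_flan_cot_fewshot_anatomy", "tag": "mmlu_flan_cot_fewshot_stem"},
    {"task": "mmlu_flan_cot_fewshot_business_ethics", "tag": "mmlu_flan_cot_fewshot_other"},
    {"task": "mmlu_flan_cot_fewshot_formal_logic", "tag": "mmlu_flan_cot_fewshot_humanities"},
]

tag_index = index_tasks_by_tag(configs)
print(tag_index["mmlu_flan_cot_fewshot_stem"])
```

Under this reading, the nested `group: stem` entry in the group config can list the single name `mmlu_flan_cot_fewshot_stem` and still cover every task that carries that tag.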