Commit 517aadc4 authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
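
Several of the messages above concern how a group score is aggregated from its subtasks and whether subtask size is taken into account. As a rough illustration only, here is a minimal sketch of a size-weighted versus unweighted mean. The function name `aggregate_group_score` and the `weight_by_size` parameter are hypothetical names chosen for this example; they are not taken from the commit and may not match the harness's actual code.

```python
from typing import Sequence


def aggregate_group_score(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask scores into one group score (illustrative sketch).

    `weight_by_size` is a hypothetical flag: when True, each subtask counts
    in proportion to its number of examples; when False, every subtask
    counts equally.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not scores:
        raise ValueError("at least one subtask score is required")
    # Fall back to equal weights when size weighting is disabled.
    weights = list(sizes) if weight_by_size else [1] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Two subtasks with 100 and 300 examples respectively.
weighted = aggregate_group_score([0.5, 0.8], [100, 300])
unweighted = aggregate_group_score([0.5, 0.8], [100, 300], weight_by_size=False)
print(weighted, unweighted)
```

With size weighting the larger subtask dominates the group score; without it, both subtasks contribute equally.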
@@ -64,6 +64,6 @@ fewshot_config:
core cell cycle regulators inside the cell. The most common regulators are cyclins
and cyclin-dependent kinases. Fibroblast cells do not play any role in cell
division. The answer is (D).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_biology
@@ -61,6 +61,6 @@ fewshot_config:
\ strong acid, Nitric acid, will react with the conjugate base. Therefore the\
\ maximum amount of acid that can be added will be equal to the amount of acetate\
\ ion, or 2 moles. The answer is (C).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_chemistry
@@ -79,6 +79,6 @@ fewshot_config:
its value is greater than 100, regardless of the elements in the list. Choice
D is incorrect because its step 3 does not increment the value of position,
so it will repeat forever. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_computer_science
@@ -194,6 +194,6 @@ fewshot_config:
wrote extensively against the monoplization of power and advocated for a system
of checks and balances in government to prevent the rise of despotism. The answer
is (B).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_european_history
@@ -48,6 +48,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on geography
for help. The difference between number of births and deaths gives the population
increase at any given time. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_geography
@@ -56,6 +56,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on government
and politics for help. The US Constitution is not very specific about the powers
of the president, leading to uncertainty over its limits. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_government_and_politics
@@ -48,6 +48,6 @@ fewshot_config:
for help. The economic transactions related to the performance of the American
pop-singer in Paris happens entirely outside the U.S. and hence is not included
in the GDP numbers. The answer is (C).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_macroeconomics
@@ -46,6 +46,6 @@ fewshot_config:
target: 'Let''s think step by step. The least common multiple of 2, 3 and 5 is
30, so during a 7 minute dance, all the three lights will come on at the same
time $2*7+1=15$ times. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_mathematics
@@ -51,6 +51,6 @@ fewshot_config:
for help. An increase in the construction of new houses means an increase demand
of in-house painting, thus increases the demand for housepainters. The answer
is (C).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_microeconomics
@@ -45,6 +45,6 @@ fewshot_config:
cannot have any net displacement because the pipe closure stops them. So the
particle displacement is at a node. This closure also causes the pressure to
be maximal, i.e. an antinode. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_physics
@@ -59,6 +59,6 @@ fewshot_config:
for help. People with an external locus of control believes fate and luck play
an important role in their lives, while people with an internal locus of control
believes they control their lives. The answer is (D).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_psychology
@@ -76,6 +76,6 @@ fewshot_config:
not perfectly correlated. Statement B is false because uncorrelated variables
regression lines can have slope zero. Statement C is false because correlation
is symmetric in the two random variables. The answer is (D).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_statistics
@@ -151,6 +151,6 @@ fewshot_config:
suspect Washington''s military response to Whiskey Rebellion. Bacon''s Rebellion
and Pontiac''s Rebellion happen before the Revolution and they can be ruled
out. The answer is (C).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_us_history
@@ -95,6 +95,6 @@ fewshot_config:
for help. Brahman refers to the ultimate reality of all things in the Hindu
religion. In contrast, Buddhism does not have a concept of supreme God. The
answer is (A).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_high_school_world_history
@@ -37,6 +37,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on human aging
for help. Texas does not have state tax, and has low cost of living compared
with the other three options. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_human_aging
@@ -45,6 +45,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on human sexuality
for help. Morning sickness usually begins by nine weeks after conception, corresponding
to the first trimester. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_human_sexuality
@@ -65,6 +65,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on international
law for help. Article 2(4) of the UN Charter prohibits states from using armed
forces in their international relations. The answer is (A).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_international_law
@@ -54,6 +54,6 @@ fewshot_config:
principle'', and reject the ''system of natural liberty'', but the POP would
not choose equality above liberty, since the POP assume both equal and free
citizens. The answer is (A).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_jurisprudence
@@ -56,6 +56,6 @@ fewshot_config:
fallacies for help. The argument against the person fallacy occurs when someone
irrelevantly attacks the character of an opposing arguer, instead of addressing
that opponent''s arguments. The answer is (C).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_logical_fallacies
@@ -69,6 +69,6 @@ fewshot_config:
\ epsilon when we have N samples if 2 exp(-2 epsilon^2 N)<0.05, this implies\
\ that N > -1/(2*epsilon**2) log ( 0.05/2 )= log (40)*5000. Since log(40)>1,\
\ we have that one needs more than 1000 examples. The answer is (D).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_machine_learning
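
Every hunk in this diff makes the same one-line change: the top-level `group:` key of a single-task config becomes `tag:` with the value unchanged. For anyone carrying out-of-tree task configs, a rename like this could be scripted. The following is only a sketch, not the tooling used for this commit: the helper name `rename_group_to_tag` is hypothetical, and it assumes the input is a single-task YAML file like the ones in the hunks. Files that define an actual group (and so should keep `group:`) would have to be excluded by the caller.

```python
def rename_group_to_tag(yaml_text: str) -> str:
    """Rename a top-level scalar `group:` key to `tag:` (illustrative sketch).

    Only unindented lines of the form `group: <value>` are rewritten.
    A bare `group:` followed by a list, indented keys, and keys such as
    `group_alias:` are left untouched.
    """
    renamed = []
    for line in yaml_text.splitlines(keepends=True):
        # Text after the key, or "" when the line is not a top-level `group:`.
        value = line[len("group:"):] if line.startswith("group:") else ""
        if value.strip():
            renamed.append("tag:" + value)
        else:
            renamed.append(line)
    return "".join(renamed)


example = (
    "group: mmlu_flan_cot_fewshot_stem\n"
    "include: _mmlu_flan_cot_fewshot_template_yaml\n"
    "task: mmlu_flan_cot_fewshot_high_school_biology\n"
)
print(rename_group_to_tag(example), end="")
```

Working on text lines rather than parsing and re-dumping the YAML keeps the long quoted `fewshot_config` strings byte-for-byte identical, so the resulting diff stays as small as the hunks above.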