Commit 517aadc4 authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
@@ -41,6 +41,6 @@ fewshot_config:
for help. The Ohio State Leadership Studies conducted in the 1940s identified
initiating structure and consideration as the two main dimensions of leader
behavior. The answer is (D).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_management
@@ -51,6 +51,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on marketing
for help. Geodemographics is a natural outcome when combining demographic and
geographic variables. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_marketing
@@ -46,6 +46,6 @@ fewshot_config:
genetics for help. A Southern blot is a method in molecular biology for detecting
specific DNA sequences in a sample. Large triplet repeat expansions are usually
detected with this method. The answer is (C).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_medical_genetics
@@ -38,6 +38,6 @@ fewshot_config:
(A) one (B) two (C) four (D) eight'
target: 'Let''s think step by step. We refer to Wikipedia for help. Most cars
have two axles to rotate the wheels.. The answer is (B).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_miscellaneous
@@ -59,6 +59,6 @@ fewshot_config:
as it treats individuals as incapable of communal relations. It is unclear that
capital punishment is to the benefit of, or a deterrent of harm to the community.
The answer is (A).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_moral_disputes
@@ -57,6 +57,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on moral scenarios
for help. Loving someone is not wrong. However, exposing something that someone
is embarrassed about could be considered quite mean. The answer is (C).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_moral_scenarios
@@ -58,6 +58,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on nutrition
for help. The risk ratio is not sufficiently reduced that it could not be explained
by random chance given the studies sample size. The answer is (C).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_nutrition
@@ -39,6 +39,6 @@ fewshot_config:
for help. Psychological egoism suggests that one behaves based on what makes
one feels good, hence it is a claim about human nature and how humans are capable
of behaving. The answer is (C).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_philosophy
@@ -54,6 +54,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on prehistory
for help. Pacal built the temples as the funerary monument to legitimize his
kingship. The answer is (D).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_prehistory
@@ -58,6 +58,6 @@ fewshot_config:
for help. Among the four transactions, only Proceeds from long-term debt belongs
to the financing activities section of cashflow, hence the amount reported should
be $100000. The answer is (D).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_accounting
@@ -117,6 +117,6 @@ fewshot_config:
a due process clause. Hence the strongest argument should be the statute is
overbroad and consequently invalid under the First and Fourteenth Amendments.
The answer is (D).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_law
@@ -77,6 +77,6 @@ fewshot_config:
for help. The symptoms and the adrenal mass suggested pheochromocytoma, and
the blood pressure indicates hypertension. Phenoxybenzamine is used to treat
hypertension caused by pheochromocytoma. The answer is (D).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_medicine
@@ -57,6 +57,6 @@ fewshot_config:
for help. Based on the circumstances, you should tell your client about the
pros and cons of each program, but it would be inappropriate to receive the
bonus, so you should not claim the $50 bonus. The answer is (D).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_psychology
@@ -50,6 +50,6 @@ fewshot_config:
for help. If a public relations media practitioner does not know the answer
to a reporter''s question, they should say ''I don''t know'' and offer to provide
the information later. The answer is (C).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_public_relations
@@ -99,6 +99,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on security
studies for help. Coercive diplomacy uses the threat of force to induce the
opponent to comply with demands. The answer is (B).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_security_studies
@@ -53,6 +53,6 @@ fewshot_config:
for help. The post-war welfare state of 1948 aimed to provide free healthcare
and education, full employment, and universal welfare. But it did not aim to
provide a minimum wage. The answer is (B).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_sociology
@@ -51,6 +51,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on us foreign
policy for help. The 2008 financial crisis damanged the international reputation
of the American model of political economy and capitalism. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_us_foreign_policy
@@ -40,6 +40,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on virology
for help. Paroviruses are highly impactful because they do not have nucleic
acid. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_virology
@@ -37,6 +37,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on world religions
for help. In Judaism, the most distinctive sign of the covenant is circumcision
(brit milah). The answer is (B).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_world_religions
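
Note: every hunk above makes the same one-line change, turning the per-task `group:` key into `tag:`. A minimal sketch of applying that rename to the text of a task YAML, assuming the key sits at column 0 on its own line. `rename_group_to_tag` is a hypothetical helper for illustration and is not part of this commit:

```python
import re

# Hypothetical helper (not part of this commit): rename a top-level
# `group:` key to `tag:` in the text of a per-task YAML file.
# The lookahead keeps keys such as `group_alias:` from matching.
_GROUP_KEY = re.compile(r"^group:(?=\s)", flags=re.MULTILINE)


def rename_group_to_tag(yaml_text: str) -> str:
    """Return yaml_text with any column-0 `group:` key renamed to `tag:`.

    Indented keys (for example a nested `- group: stem` entry inside a
    group config) are left alone because the pattern only matches at the
    start of a line.
    """
    return _GROUP_KEY.sub("tag:", yaml_text)


example = (
    "group: mmlu_flan_cot_fewshot_other\n"
    "include: _mmlu_flan_cot_fewshot_template_yaml\n"
    "task: mmlu_flan_cot_fewshot_virology\n"
)
print(rename_group_to_tag(example))
```

This is a plain text substitution, so it should only be run on per-task files. A group config file keeps its own top-level `group:` key and must not be rewritten this way.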
group: mmlu_flan_cot_zeroshot
group_alias: mmlu (flan style, zeroshot cot)
task:
-  - mmlu_flan_cot_zeroshot_stem
-  - mmlu_flan_cot_zeroshot_other
-  - mmlu_flan_cot_zeroshot_social_sciences
-  - mmlu_flan_cot_zeroshot_humanities
+  - group: stem
+    task:
+      - mmlu_flan_cot_zeroshot_stem
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: other
+    task:
+      - mmlu_flan_cot_zeroshot_other
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: social sciences
+    task:
+      - mmlu_flan_cot_zeroshot_social_sciences
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: humanities
+    task:
+      - mmlu_flan_cot_zeroshot_humanities
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
metadata:
  version: 1
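
Note: the group config above asks for `acc` to be aggregated with `weight_by_size: True`. A minimal sketch of what that setting plausibly means, assuming it selects a size-weighted mean over subtasks (each subtask counts in proportion to its number of examples) and that turning it off gives a plain mean. `aggregate_acc` is a hypothetical function for illustration, not the harness's implementation:

```python
from typing import Sequence


def aggregate_acc(
    accs: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask accuracies into a single group score.

    Assumption: weight_by_size=True weights each subtask by its example
    count; weight_by_size=False treats every subtask equally.
    """
    if not accs or len(accs) != len(sizes):
        raise ValueError("accs and sizes must be non-empty and equal in length")
    if not weight_by_size:
        return sum(accs) / len(accs)
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    return sum(a * n for a, n in zip(accs, sizes)) / total


# Two subtasks: 300 examples at 0.5 accuracy, 100 examples at 1.0 accuracy.
print(aggregate_acc([0.5, 1.0], [300, 100], weight_by_size=True))   # 0.625
print(aggregate_acc([0.5, 1.0], [300, 100], weight_by_size=False))  # 0.75
```

The two settings diverge whenever subtasks differ in size, which is the case for the MMLU subject groups.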