Commit 517aadc4 authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
-group:
+tag:
   - anli
 task: anli_r1
 dataset_path: anli
......
@@ -5,3 +5,8 @@ task:
   - arabicmmlu_humanities
   - arabicmmlu_stem
   - arabicmmlu_language
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0

+group: arabicmmlu_humanities
+group_alias: Humanities
+task:
+  - arabicmmlu_humanities_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0

+group: arabicmmlu_language
+group_alias: Language
+task:
+  - arabicmmlu_language_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0

+group: arabicmmlu_other
+group_alias: Other
+task:
+  - arabicmmlu_other_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0

+group: arabicmmlu_social_science
+group_alias: Social Science
+task:
+  - arabicmmlu_social_science_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0

+group: arabicmmlu_stem
+group_alias: STEM
+task:
+  - arabicmmlu_stem_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
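
Each group config above sets `weight_by_size: True` for the `acc` metric. The sketch below illustrates what such a setting plausibly means: a size-weighted (micro) mean over subtask scores, versus an unweighted (macro) mean when it is disabled. The function name `weighted_group_score` and its signature are hypothetical, chosen for illustration only; they are not taken from the commit, and the harness's own aggregation code may differ.

```python
from typing import Sequence


def weighted_group_score(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Aggregate per-subtask scores into one group score.

    Hypothetical helper: assumes weight_by_size=True means each subtask
    contributes in proportion to its number of samples.
    """
    if not scores:
        raise ValueError("need at least one subtask score")
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")

    if not weight_by_size:
        # Macro average: every subtask counts equally.
        return sum(scores) / len(scores)

    total = sum(sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    # Micro average: weight each subtask score by its sample count.
    return sum(score * n for score, n in zip(scores, sizes)) / total


# Two subtasks: 100 samples at 0.5 accuracy, 300 samples at 1.0 accuracy.
micro = weighted_group_score([0.5, 1.0], [100, 300])
macro = weighted_group_score([0.5, 1.0], [100, 300], weight_by_size=False)
```

With these inputs the size-weighted score is pulled toward the larger subtask, while the unweighted score sits halfway between the two.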
@@ -11,3 +11,5 @@ metric_list:
   - metric: acc
     aggregation: mean
     higher_is_better: true
+metadata:
+  version: 0.0
@@ -59,7 +59,7 @@ SUBJECTS = {
 def parse_args():
     parser = argparse.ArgumentParser()
-    parser.add_argument("--base_yaml_path", default="_default_template_yaml")
+    parser.add_argument("--base_yaml_path", default="_default_arabicmmlu_template_yaml")
     parser.add_argument("--save_prefix_path", default="arabicmmlu")
     return parser.parse_args()
@@ -81,8 +81,7 @@ if __name__ == "__main__":
         yaml_dict = {
             "include": base_yaml_name,
-            "group": f"arabicmmlu_{category}",
-            "group_alias": category.replace("_", " "),
+            "tag": f"arabicmmlu_{category}",
             "task": f"arabicmmlu_{subject.lower().replace(' ', '_')}",
             "task_alias": subject,
             "dataset_name": subject,
......
 "dataset_name": "Arabic Language (General)"
-"group": "arabicmmlu_language"
-"group_alias": "language"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_language_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_arabic_language_(general)"
 "task_alias": "Arabic Language (General)"

 "dataset_name": "Arabic Language (Grammar)"
-"group": "arabicmmlu_language"
-"group_alias": "language"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_language_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_arabic_language_(grammar)"
 "task_alias": "Arabic Language (Grammar)"

 "dataset_name": "Driving Test"
-"group": "arabicmmlu_other"
-"group_alias": "other"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_other_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_driving_test"
 "task_alias": "Driving Test"

 "dataset_name": "General Knowledge"
-"group": "arabicmmlu_other"
-"group_alias": "other"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_other_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_general_knowledge"
 "task_alias": "General Knowledge"

 "dataset_name": "High Arabic Language"
-"group": "arabicmmlu_language"
-"group_alias": "language"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_language_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_arabic_language"
 "task_alias": "High Arabic Language"

 "dataset_name": "High Biology"
-"group": "arabicmmlu_stem"
-"group_alias": "stem"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_stem_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_biology"
 "task_alias": "High Biology"

 "dataset_name": "High Civics"
-"group": "arabicmmlu_social_science"
-"group_alias": "social science"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_social_science_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_civics"
 "task_alias": "High Civics"

 "dataset_name": "High Computer Science"
-"group": "arabicmmlu_stem"
-"group_alias": "stem"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_stem_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_computer_science"
 "task_alias": "High Computer Science"

 "dataset_name": "High Economics"
-"group": "arabicmmlu_social_science"
-"group_alias": "social science"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_social_science_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_economics"
 "task_alias": "High Economics"

 "dataset_name": "High Geography"
-"group": "arabicmmlu_social_science"
-"group_alias": "social science"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_social_science_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_geography"
 "task_alias": "High Geography"

 "dataset_name": "High History"
-"group": "arabicmmlu_humanities"
-"group_alias": "humanities"
-"include": "_default_template_yaml"
+"tag": "arabicmmlu_humanities_tasks"
+"include": "_default_arabicmmlu_template_yaml"
 "task": "arabicmmlu_high_history"
 "task_alias": "High History"
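
The per-subject configs in this diff all follow one pattern: the `group` and `group_alias` keys are dropped, a `tag` key is added, and `include` points at the renamed template. The sketch below shows how one such entry could be built. The function `build_task_config` is hypothetical and is not the generator script itself. Note one assumption: the generated files carry a `_tasks` suffix on the tag (for example `arabicmmlu_stem_tasks`), whereas the script hunk shown in the diff builds the tag without that suffix. The sketch follows the generated files.

```python
from typing import Dict


def build_task_config(
    subject: str,
    category: str,
    base_yaml_name: str = "_default_arabicmmlu_template_yaml",
) -> Dict[str, str]:
    """Build the config entries for one subject (hypothetical helper).

    Assumes the tag is "arabicmmlu_<category>_tasks", matching the
    generated YAML files rather than the script hunk in the diff.
    """
    return {
        "include": base_yaml_name,
        # Tags only label tasks; aggregation lives in the separate group configs.
        "tag": f"arabicmmlu_{category}_tasks",
        "task": f"arabicmmlu_{subject.lower().replace(' ', '_')}",
        "task_alias": subject,
        "dataset_name": subject,
    }


config = build_task_config("High Biology", "stem")
for key in sorted(config):
    print(f"{key}: {config[key]}")
```

Keeping aggregation settings out of the per-subject files means each group config names a tag once, instead of every subject file repeating group metadata.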