Group agg rework (#1741)

* add group_config arg
* add a group config that allows disabling table for group score and group aggregate in general
* fixed size configuration
* adjust config
* add group config
* adjust mmlu to use group_config
* fixed args input in aggregate_subtask_metrics
* fixed issues related to printing alias of group and updated yaml
* update all mmlu variants to include group_config
* edit format
* modify mmlu tasks
* adjust group to also be a configurable group
* add configurable group
* simplify get_task_list
* adjust group scoring with using ConfigurableGroup
* adjust args
* update mmlu
* update mmlu
* update to work with new group and task configuration
* readd group_agg
* readd files
* move prepare_print_tasks to evaluator_utils
* sort set to False by default, fix predict_only arg
* add version for groups
* reversed task list
* update additional condition when loading a group in a group yaml
* update truthfulqa
* add description regarding tags replacing group
* replace group to tag
* fixed conditional statement
* remove warning
* update loading of task group and newly added tags
* reformat with pre-commit
* fixed info log
* update
* fix bug
* fix bug
* use task id to differentiate tasks
* convert all groups to configurable groups
* use task_id
* reformat
* add task_id for python tasks as well
* add task_id for python tasks as well
* add task_id for python tasks as well
* revert truthfulqa
* revert mmlu tasks
* new mmlu config
* new group config parameter `tag_to_task`
* Update truthfulqa_mc2.yaml
* reformat
* add _process_group_config
* adjust task_id
* add get_subtask_list function to get proper subtask list
* group config to_dict update
* remove tag check
* update mmlu
* fix config passing issues
* add test yaml
* format fix
* add documentation
* corner case for single tag being called
* fix indentation
* formatting
* update all mmlu variants
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove group_alias
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove version for metadata
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* update mmlu/
* removed " " in make_table
* change how aggregate_metric is loaded
* change how aggregate_metric is loaded
* update aggregate_metric arg
* update format
* update format
* some docs fixes
* add groups for agieval, aexams, aclue
* add more explicit aggregation groups
* add more groupings / tags distinctions
* add more groupings
* more groupings
* add many explicit group configs
* add many explicit group configs
* add more explicit group configs
* add more explicit group configs
* add more error msgs, agg_metric -> agg_metric_list
* some docs updates
* update task_id to be updateable and uses group:task format
* make KMMLU a tag for now
* update docs
* don't duplicate task names
* fix merge conflicts?
* giving this a try
* clean up diff
* switch mmlu variants over to using
* don't use to-be-deprecated group: config field in overview notebook
* Python tasks which subclass ConfigurableTask now run
* update mmlu
* pre-commit format
* fixed sorting for multi-level printing
* move group api to separate file
* fix bbh aggregation filter usage
* track api/group.py
* adjust group and tags loading
* make explicit group configs for leaderboard and other newer tasks
* fix arabicmmlu
* update
* change arabicmmlu template name???
* update group alias
* fix printing bugs
* check table printing is correct ; update tests
* use mmlu_stem to have a group included in print tests

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

Commit 517aadc4 (unverified), authored Jul 08, 2024 by Lintang Sutawika; committed by GitHub Jul 08, 2024
20 changed files
--- a/lm_eval/tasks/anli/anli_r1.yaml
+++ b/lm_eval/tasks/anli/anli_r1.yaml
-group:
+tag:
  - anli
 task: anli_r1
 dataset_path: anli

--- a/lm_eval/tasks/arabicmmlu/arabicmmlu.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu.yaml
@@ -5,3 +5,8 @@ task:
 - arabicmmlu_humanities
 - arabicmmlu_stem
 - arabicmmlu_language
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
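
Note on the hunk above: the new `aggregate_metric_list` entry requests a group-level `acc` with `weight_by_size: True`. The following is a minimal sketch of what such an aggregation could compute, under the assumption that `weight_by_size: True` weights each subtask by its number of documents (a micro-average) and `False` weights every subtask equally (a macro-average). The function name `aggregate_accuracy` and its signature are hypothetical and are not the harness's own API.

```python
from typing import Sequence


def aggregate_accuracy(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask accuracies into a single group score.

    Assumption (not confirmed by the diff): weight_by_size=True weights each
    subtask by its document count; False gives every subtask equal weight.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not scores:
        raise ValueError("at least one subtask is required")

    # Equal weights reduce the weighted mean to a plain mean.
    weights = list(sizes) if weight_by_size else [1] * len(scores)
    total = sum(weights)
    if total == 0:
        raise ValueError("total weight must be positive")

    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical subtasks: 300 documents at 0.5 accuracy, 100 documents at 1.0.
micro = aggregate_accuracy([0.5, 1.0], [300, 100], weight_by_size=True)
macro = aggregate_accuracy([0.5, 1.0], [300, 100], weight_by_size=False)
```

With size weighting the larger subtask dominates the group score, so the two settings can differ noticeably when subtask sizes are uneven.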
--- a/lm_eval/tasks/arabicmmlu/_arabicmmlu_humanities.yaml
+++ b/lm_eval/tasks/arabicmmlu/_arabicmmlu_humanities.yaml
+group: arabicmmlu_humanities
+group_alias: Humanities
+task:
+  - arabicmmlu_humanities_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
--- a/lm_eval/tasks/arabicmmlu/_arabicmmlu_language.yaml
+++ b/lm_eval/tasks/arabicmmlu/_arabicmmlu_language.yaml
+group: arabicmmlu_language
+group_alias: Language
+task:
+  - arabicmmlu_language_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
--- a/lm_eval/tasks/arabicmmlu/_arabicmmlu_other.yaml
+++ b/lm_eval/tasks/arabicmmlu/_arabicmmlu_other.yaml
+group: arabicmmlu_other
+group_alias: Other
+task:
+  - arabicmmlu_other_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
--- a/lm_eval/tasks/arabicmmlu/_arabicmmlu_social_science.yaml
+++ b/lm_eval/tasks/arabicmmlu/_arabicmmlu_social_science.yaml
+group: arabicmmlu_social_science
+group_alias: Social Science
+task:
+  - arabicmmlu_social_science_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
--- a/lm_eval/tasks/arabicmmlu/_arabicmmlu_stem.yaml
+++ b/lm_eval/tasks/arabicmmlu/_arabicmmlu_stem.yaml
+group: arabicmmlu_stem
+group_alias: STEM
+task:
+  - arabicmmlu_stem_tasks
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 0
--- a/lm_eval/tasks/arabicmmlu/_default_template_yaml
+++ b/lm_eval/tasks/arabicmmlu/_default_template_yaml
@@ -11,3 +11,5 @@ metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
+metadata:
+  version: 0.0
--- a/lm_eval/tasks/arabicmmlu/_generate_configs.py
+++ b/lm_eval/tasks/arabicmmlu/_generate_configs.py
@@ -59,7 +59,7 @@ SUBJECTS = {
 def parse_args():
    parser = argparse.ArgumentParser()
-    parser.add_argument("--base_yaml_path", default="_default_template_yaml")
+    parser.add_argument("--base_yaml_path", default="_default_arabicmmlu_template_yaml")
    parser.add_argument("--save_prefix_path", default="arabicmmlu")
    return parser.parse_args()
@@ -81,8 +81,7 @@ if __name__ == "__main__":
        yaml_dict = {
            "include": base_yaml_name,
-            "group": f"arabicmmlu_{category}",
+            "tag": f"arabicmmlu_{category}",
-            "group_alias": category.replace("_", " "),
            "task": f"arabicmmlu_{subject.lower().replace(' ', '_')}",
            "task_alias": subject,
            "dataset_name": subject,
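
Note on the hunk above: after this change the generator emits a `tag` key in place of `group` and no longer writes `group_alias`. A minimal standalone sketch of the resulting per-subject dictionary follows. The wrapper function `build_task_config` is hypothetical and introduced only for illustration; the keys and f-strings are copied from the hunk. The per-task YAML files changed in this commit carry tags with a `_tasks` suffix (for example `arabicmmlu_language_tasks`), which the expression in this hunk does not add, so the sketch follows the hunk as written.

```python
def build_task_config(subject: str, category: str, base_yaml_name: str) -> dict:
    """Build one per-subject config dict the way the updated hunk does.

    Hypothetical helper: the real script builds this dict inline.
    """
    return {
        "include": base_yaml_name,
        # `tag` replaces the former `group` key; `group_alias` is dropped.
        "tag": f"arabicmmlu_{category}",
        "task": f"arabicmmlu_{subject.lower().replace(' ', '_')}",
        "task_alias": subject,
        "dataset_name": subject,
    }


config = build_task_config(
    subject="High Biology",
    category="stem",
    base_yaml_name="_default_arabicmmlu_template_yaml",
)
```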

--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_arabic_language_general.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_arabic_language_general.yaml
 "dataset_name": "Arabic Language (General)"
-"group": "arabicmmlu_language"
+"tag": "arabicmmlu_language_tasks"
-"group_alias": "language"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_arabic_language_(general)"
 "task_alias": "Arabic Language (General)"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_arabic_language_grammar.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_arabic_language_grammar.yaml
 "dataset_name": "Arabic Language (Grammar)"
-"group": "arabicmmlu_language"
+"tag": "arabicmmlu_language_tasks"
-"group_alias": "language"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_arabic_language_(grammar)"
 "task_alias": "Arabic Language (Grammar)"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_driving_test.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_driving_test.yaml
 "dataset_name": "Driving Test"
-"group": "arabicmmlu_other"
+"tag": "arabicmmlu_other_tasks"
-"group_alias": "other"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_driving_test"
 "task_alias": "Driving Test"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_general_knowledge.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_general_knowledge.yaml
 "dataset_name": "General Knowledge"
-"group": "arabicmmlu_other"
+"tag": "arabicmmlu_other_tasks"
-"group_alias": "other"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_general_knowledge"
 "task_alias": "General Knowledge"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_arabic_language.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_arabic_language.yaml
 "dataset_name": "High Arabic Language"
-"group": "arabicmmlu_language"
+"tag": "arabicmmlu_language_tasks"
-"group_alias": "language"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_arabic_language"
 "task_alias": "High Arabic Language"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_biology.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_biology.yaml
 "dataset_name": "High Biology"
-"group": "arabicmmlu_stem"
+"tag": "arabicmmlu_stem_tasks"
-"group_alias": "stem"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_biology"
 "task_alias": "High Biology"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_civics.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_civics.yaml
 "dataset_name": "High Civics"
-"group": "arabicmmlu_social_science"
+"tag": "arabicmmlu_social_science_tasks"
-"group_alias": "social science"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_civics"
 "task_alias": "High Civics"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_computer_science.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_computer_science.yaml
 "dataset_name": "High Computer Science"
-"group": "arabicmmlu_stem"
+"tag": "arabicmmlu_stem_tasks"
-"group_alias": "stem"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_computer_science"
 "task_alias": "High Computer Science"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_economics.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_economics.yaml
 "dataset_name": "High Economics"
-"group": "arabicmmlu_social_science"
+"tag": "arabicmmlu_social_science_tasks"
-"group_alias": "social science"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_economics"
 "task_alias": "High Economics"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_geography.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_geography.yaml
 "dataset_name": "High Geography"
-"group": "arabicmmlu_social_science"
+"tag": "arabicmmlu_social_science_tasks"
-"group_alias": "social science"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_geography"
 "task_alias": "High Geography"
--- a/lm_eval/tasks/arabicmmlu/arabicmmlu_high_history.yaml
+++ b/lm_eval/tasks/arabicmmlu/arabicmmlu_high_history.yaml
 "dataset_name": "High History"
-"group": "arabicmmlu_humanities"
+"tag": "arabicmmlu_humanities_tasks"
-"group_alias": "humanities"
+"include": "_default_arabicmmlu_template_yaml"
-"include": "_default_template_yaml"
 "task": "arabicmmlu_high_history"
 "task_alias": "High History"
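
Note on the per-task files above: each one receives the same three edits, namely `group` becomes a `tag` with a `_tasks` suffix, `group_alias` is removed, and `include` points at the renamed template. A minimal sketch of that transformation on an already-loaded config dictionary is given below. The helper `migrate_task_config` is hypothetical and not part of the commit; it only mirrors the pattern visible in these diffs.

```python
def migrate_task_config(
    old: dict,
    include_name: str = "_default_arabicmmlu_template_yaml",
) -> dict:
    """Apply the group -> tag pattern seen in the per-task diffs.

    Hypothetical helper for illustration; the commit edits the files directly.
    """
    # Drop the keys that the diffs remove.
    new = {k: v for k, v in old.items() if k not in ("group", "group_alias")}
    if "group" in old:
        # The diffs suffix the old group name with `_tasks` to form the tag.
        new["tag"] = f"{old['group']}_tasks"
    if "include" in new:
        # The diffs point every task at the renamed template file.
        new["include"] = include_name
    return new


before = {
    "dataset_name": "High History",
    "group": "arabicmmlu_humanities",
    "group_alias": "humanities",
    "include": "_default_template_yaml",
    "task": "arabicmmlu_high_history",
    "task_alias": "High History",
}
after = migrate_task_config(before)
```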