Group agg rework (#1741)

* add group_config arg
* add a group config that allows disabling table for group score and group aggregate in general
* fixed size configuration
* adjust config
* add group config
* adjust mmlu to use group_config
* fixed args input in aggregate_subtask_metrics
* fixed issues related to printing alias of group and updated yaml
* update all mmlu variants to include group_config
* edit format
* modify mmlu tasks
* adjust group to also be a configurable group
* add configurable group
* simplify get_task_list
* adjust group scoring with using ConfigurableGroup
* adjust args
* update mmlu
* update mmlu
* update to work with new group and task configuration
* readd group_agg
* readd files
* move prepare_print_tasks to evaluator_utils
* sort set to False by default, fix predict_only arg
* add version for groups
* reversed task list
* update additional condition when loading a group in a group yaml
* update truthfulqa
* add description regarding tags replacing group
* replace group to tag
* fixed conditional statement
* remove warning
* update loading of task group and newly added tags
* reformat with pre-commit
* fixed info log
* update
* fix bug
* fix bug
* use task id to differentiate tasks
* convert all groups to configurable groups
* use task_id
* reformat
* add task_id for python tasks as well
* add task_id for python tasks as well
* add task_id for python tasks as well
* revert truthfulqa
* revert mmlu tasks
* new mmlu config
* new group config parameter `tag_to_task`
* Update truthfulqa_mc2.yaml
* reformat
* add _process_group_config
* adjust task_id
* add get_subtask_list function to get proper subtask list
* group config to_dict update
* remove tag check
* update mmlu
* fix config passing issues
* add test yaml
* format fix
* add documentation
* corner case for single tag being called
* fix indentation
* formatting
* update all mmlu variants
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove group_alias
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove version for metadata
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* update mmlu/
* removed " " in make_table
* change how aggregate_metric is loaded
* change how aggregate_metric is loaded
* update aggregate_metric arg
* update format
* update format
* some docs fixes
* add groups for agieval, aexams, aclue
* add more explicit aggregation groups
* add more groupings / tags distinctions
* add more groupings
* more groupings
* add many explicit group configs
* add many explicit group configs
* add more explicit group configs
* add more explicit group configs
* add more error msgs, agg_metric -> agg_metric_list
* some docs updates
* update task_id to be updateable and uses group:task format
* make KMMLU a tag for now
* update docs
* don't duplicate task names
* fix merge conflicts?
* giving this a try
* clean up diff
* switch mmlu variants over to using
* don't use to-be-deprecated group: config field in overview notebook
* Python tasks which subclass ConfigurableTask now run
* update mmlu
* pre-commit format
* fixed sorting for multi-level printing
* move group api to separate file
* fix bbh aggregation filter usage
* track api/group.py
* adjust group and tags loading
* make explicit group configs for leaderboard and other newer tasks
* fix arabicmmlu
* update
* change arabicmmlu template name???
* update group alias
* fix printing bugs
* check table printing is correct ; update tests
* use mmlu_stem to have a group included in print tests

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

Commit 517aadc4 authored Jul 08, 2024 by Lintang Sutawika, committed by GitHub Jul 08, 2024
20 changed files
--- a/lm_eval/tasks/mmlu/generative/mmlu_philosophy.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_philosophy.yaml
 "dataset_name": "philosophy"
 "description": "The following are multiple choice questions (with answers) about philosophy.\n\
  \n"
-"group": "mmlu_humanities_generative"
+"tag": "mmlu_humanities_generative"
-"group_alias": "humanities"
 "include": "_default_template_yaml"
 "task": "mmlu_philosophy_generative"
 "task_alias": "philosophy"
--- a/lm_eval/tasks/mmlu/generative/mmlu_prehistory.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_prehistory.yaml
 "dataset_name": "prehistory"
 "description": "The following are multiple choice questions (with answers) about prehistory.\n\
  \n"
-"group": "mmlu_humanities_generative"
+"tag": "mmlu_humanities_generative"
-"group_alias": "humanities"
 "include": "_default_template_yaml"
 "task": "mmlu_prehistory_generative"
 "task_alias": "prehistory"
--- a/lm_eval/tasks/mmlu/generative/mmlu_professional_accounting.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_professional_accounting.yaml
 "dataset_name": "professional_accounting"
 "description": "The following are multiple choice questions (with answers) about professional\
  \ accounting.\n\n"
-"group": "mmlu_other_generative"
+"tag": "mmlu_other_generative"
-"group_alias": "other"
 "include": "_default_template_yaml"
 "task": "mmlu_professional_accounting_generative"
 "task_alias": "professional_accounting"
--- a/lm_eval/tasks/mmlu/generative/mmlu_professional_law.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_professional_law.yaml
 "dataset_name": "professional_law"
 "description": "The following are multiple choice questions (with answers) about professional\
  \ law.\n\n"
-"group": "mmlu_humanities_generative"
+"tag": "mmlu_humanities_generative"
-"group_alias": "humanities"
 "include": "_default_template_yaml"
 "task": "mmlu_professional_law_generative"
 "task_alias": "professional_law"
--- a/lm_eval/tasks/mmlu/generative/mmlu_professional_medicine.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_professional_medicine.yaml
 "dataset_name": "professional_medicine"
 "description": "The following are multiple choice questions (with answers) about professional\
  \ medicine.\n\n"
-"group": "mmlu_other_generative"
+"tag": "mmlu_other_generative"
-"group_alias": "other"
 "include": "_default_template_yaml"
 "task": "mmlu_professional_medicine_generative"
 "task_alias": "professional_medicine"
--- a/lm_eval/tasks/mmlu/generative/mmlu_professional_psychology.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_professional_psychology.yaml
 "dataset_name": "professional_psychology"
 "description": "The following are multiple choice questions (with answers) about professional\
  \ psychology.\n\n"
-"group": "mmlu_social_sciences_generative"
+"tag": "mmlu_social_sciences_generative"
-"group_alias": "social_sciences"
 "include": "_default_template_yaml"
 "task": "mmlu_professional_psychology_generative"
 "task_alias": "professional_psychology"
--- a/lm_eval/tasks/mmlu/generative/mmlu_public_relations.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_public_relations.yaml
 "dataset_name": "public_relations"
 "description": "The following are multiple choice questions (with answers) about public\
  \ relations.\n\n"
-"group": "mmlu_social_sciences_generative"
+"tag": "mmlu_social_sciences_generative"
-"group_alias": "social_sciences"
 "include": "_default_template_yaml"
 "task": "mmlu_public_relations_generative"
 "task_alias": "public_relations"
--- a/lm_eval/tasks/mmlu/generative/mmlu_security_studies.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_security_studies.yaml
 "dataset_name": "security_studies"
 "description": "The following are multiple choice questions (with answers) about security\
  \ studies.\n\n"
-"group": "mmlu_social_sciences_generative"
+"tag": "mmlu_social_sciences_generative"
-"group_alias": "social_sciences"
 "include": "_default_template_yaml"
 "task": "mmlu_security_studies_generative"
 "task_alias": "security_studies"
--- a/lm_eval/tasks/mmlu/generative/mmlu_sociology.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_sociology.yaml
 "dataset_name": "sociology"
 "description": "The following are multiple choice questions (with answers) about sociology.\n\
  \n"
-"group": "mmlu_social_sciences_generative"
+"tag": "mmlu_social_sciences_generative"
-"group_alias": "social_sciences"
 "include": "_default_template_yaml"
 "task": "mmlu_sociology_generative"
 "task_alias": "sociology"
--- a/lm_eval/tasks/mmlu/generative/mmlu_us_foreign_policy.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_us_foreign_policy.yaml
 "dataset_name": "us_foreign_policy"
 "description": "The following are multiple choice questions (with answers) about us\
  \ foreign policy.\n\n"
-"group": "mmlu_social_sciences_generative"
+"tag": "mmlu_social_sciences_generative"
-"group_alias": "social_sciences"
 "include": "_default_template_yaml"
 "task": "mmlu_us_foreign_policy_generative"
 "task_alias": "us_foreign_policy"
--- a/lm_eval/tasks/mmlu/generative/mmlu_virology.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_virology.yaml
 "dataset_name": "virology"
 "description": "The following are multiple choice questions (with answers) about virology.\n\
  \n"
-"group": "mmlu_other_generative"
+"tag": "mmlu_other_generative"
-"group_alias": "other"
 "include": "_default_template_yaml"
 "task": "mmlu_virology_generative"
 "task_alias": "virology"
--- a/lm_eval/tasks/mmlu/generative/mmlu_world_religions.yaml
+++ b/lm_eval/tasks/mmlu/generative/mmlu_world_religions.yaml
 "dataset_name": "world_religions"
 "description": "The following are multiple choice questions (with answers) about world\
  \ religions.\n\n"
-"group": "mmlu_humanities_generative"
+"tag": "mmlu_humanities_generative"
-"group_alias": "humanities"
 "include": "_default_template_yaml"
 "task": "mmlu_world_religions_generative"
 "task_alias": "world_religions"
--- a/lm_eval/tasks/model_written_evals/advanced_ai_risk/_template_yaml
+++ b/lm_eval/tasks/model_written_evals/advanced_ai_risk/_template_yaml
-group: advanced_ai_risk
+tag: advanced_ai_risk
 dataset_path: EleutherAI/advanced_ai_risk
 output_type: multiple_choice
 validation_split: validation

--- a/lm_eval/tasks/model_written_evals/persona/_template_yaml
+++ b/lm_eval/tasks/model_written_evals/persona/_template_yaml
-group: persona
+tag: persona
 dataset_path: EleutherAI/persona
 output_type: multiple_choice
 validation_split: validation

--- a/lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_nlp_survey.yaml
+++ b/lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_nlp_survey.yaml
-group: sycophancy
+tag: sycophancy
 task: sycophancy_on_nlp_survey
 dataset_path: EleutherAI/sycophancy
 dataset_name: sycophancy_on_nlp_survey

--- a/lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_philpapers2020.yaml
+++ b/lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_philpapers2020.yaml
-group: sycophancy
+tag: sycophancy
 task: sycophancy_on_philpapers2020
 dataset_path: EleutherAI/sycophancy
 dataset_name: sycophancy_on_philpapers2020

--- a/lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_political_typology_quiz.yaml
+++ b/lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_political_typology_quiz.yaml
-group: sycophancy
+tag: sycophancy
 task: sycophancy_on_political_typology_quiz
 dataset_path: EleutherAI/sycophancy
 dataset_name: sycophancy_on_political_typology_quiz

--- a/lm_eval/tasks/model_written_evals/winogenerated/_template_yaml
+++ b/lm_eval/tasks/model_written_evals/winogenerated/_template_yaml
-group: winogenerated
+tag: winogenerated
 dataset_path: EleutherAI/winogenerated
 output_type: multiple_choice
 validation_split: validation
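
Review note: the hunks in this diff repeat one mechanical edit per file: the top-level `group` key is renamed to `tag`, and any `group_alias` line is dropped. For out-of-tree task configs that still use the old keys, the same edit could be scripted. The following is a minimal sketch, not part of this commit: `migrate_group_to_tag` is a hypothetical helper name, it works on raw text rather than a YAML parser so that comments and quoting are preserved, and it assumes the keys sit at column 0 with single-line values, as they do in the files shown here.

```python
import re

# Top-level `group` key, optionally double-quoted, as it appears in these hunks.
# The lookahead keeps `group_alias` and other longer keys from matching.
_GROUP_KEY = re.compile(r'^(?P<q>"?)group(?P=q)(?=\s*:)')

# Top-level `group_alias` key; the hunks delete this line outright.
_GROUP_ALIAS_KEY = re.compile(r'^"?group_alias"?\s*:')


def migrate_group_to_tag(yaml_text: str) -> str:
    """Rename a top-level `group` key to `tag` and drop `group_alias`.

    Hypothetical helper for illustration: only column-0 keys are touched,
    and `group_alias` is assumed to be a single-line scalar.
    """
    migrated = []
    for line in yaml_text.splitlines():
        if _GROUP_ALIAS_KEY.match(line):
            continue  # mirror the `-"group_alias": ...` deletions
        # Preserve whatever quoting the original key used.
        migrated.append(_GROUP_KEY.sub(r"\g<q>tag\g<q>", line))
    trailing_newline = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(migrated) + trailing_newline


if __name__ == "__main__":
    old = (
        '"group": "mmlu_other_generative"\n'
        '"group_alias": "other"\n'
        '"task": "mmlu_virology_generative"\n'
    )
    print(migrate_group_to_tag(old), end="")
```

The rename alone only relabels a task. Per the commit message, aggregate group scores now come from explicit group configs, so a tag produced this way should not be expected to yield an aggregated score on its own.
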

--- a/lm_eval/tasks/okapi/arc_multilingual/_arc_yaml
+++ b/lm_eval/tasks/okapi/arc_multilingual/_arc_yaml
-group:
+tag:
  - arc_multilingual
 dataset_path: null
 dataset_name: null

--- a/lm_eval/tasks/okapi/hellaswag_multilingual/_hellaswag_yaml
+++ b/lm_eval/tasks/okapi/hellaswag_multilingual/_hellaswag_yaml
-group:
+tag:
  - hellaswag_multilingual
 dataset_path: null
 dataset_name: null