Unverified Commit 517aadc4 authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
group: blimp
task:
- "blimp_adjunct_island"
- "blimp_anaphor_gender_agreement"
- "blimp_anaphor_number_agreement"
- "blimp_animate_subject_passive"
- "blimp_animate_subject_trans"
- "blimp_causative"
- "blimp_complex_NP_island"
- "blimp_coordinate_structure_constraint_complex_left_branch"
- "blimp_coordinate_structure_constraint_object_extraction"
- "blimp_determiner_noun_agreement_1"
- "blimp_determiner_noun_agreement_2"
- "blimp_determiner_noun_agreement_irregular_1"
- "blimp_determiner_noun_agreement_irregular_2"
- "blimp_determiner_noun_agreement_with_adj_2"
- "blimp_determiner_noun_agreement_with_adj_irregular_1"
- "blimp_determiner_noun_agreement_with_adj_irregular_2"
- "blimp_determiner_noun_agreement_with_adjective_1"
- "blimp_distractor_agreement_relational_noun"
- "blimp_distractor_agreement_relative_clause"
- "blimp_drop_argument"
- "blimp_ellipsis_n_bar_1"
- "blimp_ellipsis_n_bar_2"
- "blimp_existential_there_object_raising"
- "blimp_existential_there_quantifiers_1"
- "blimp_existential_there_quantifiers_2"
- "blimp_existential_there_subject_raising"
- "blimp_expletive_it_object_raising"
- "blimp_inchoative"
- "blimp_intransitive"
- "blimp_irregular_past_participle_adjectives"
- "blimp_irregular_past_participle_verbs"
- "blimp_irregular_plural_subject_verb_agreement_1"
- "blimp_irregular_plural_subject_verb_agreement_2"
- "blimp_left_branch_island_echo_question"
- "blimp_left_branch_island_simple_question"
- "blimp_matrix_question_npi_licensor_present"
- "blimp_npi_present_1"
- "blimp_npi_present_2"
- "blimp_only_npi_licensor_present"
- "blimp_only_npi_scope"
- "blimp_passive_1"
- "blimp_passive_2"
- "blimp_principle_A_c_command"
- "blimp_principle_A_case_1"
- "blimp_principle_A_case_2"
- "blimp_principle_A_domain_1"
- "blimp_principle_A_domain_2"
- "blimp_principle_A_domain_3"
- "blimp_principle_A_reconstruction"
- "blimp_regular_plural_subject_verb_agreement_1"
- "blimp_regular_plural_subject_verb_agreement_2"
- "blimp_sentential_negation_npi_licensor_present"
- "blimp_sentential_negation_npi_scope"
- "blimp_sentential_subject_island"
- "blimp_superlative_quantifiers_1"
- "blimp_superlative_quantifiers_2"
- "blimp_tough_vs_raising_1"
- "blimp_tough_vs_raising_2"
- "blimp_transitive"
- "blimp_wh_island"
- "blimp_wh_questions_object_gap"
- "blimp_wh_questions_subject_gap"
- "blimp_wh_questions_subject_gap_long_distance"
- "blimp_wh_vs_that_no_gap"
- "blimp_wh_vs_that_no_gap_long_distance"
- "blimp_wh_vs_that_with_gap"
- "blimp_wh_vs_that_with_gap_long_distance"
aggregate_metric_list:
- metric: acc
  aggregation: mean
  weight_by_size: False
metadata:
  version: 2.0
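
The `weight_by_size` flag in `aggregate_metric_list` decides whether the group score is a plain mean over subtasks or a mean weighted by subtask size. The sketch below illustrates that difference only; `aggregate_mean` and the numbers are hypothetical and this is not the harness's own implementation.

```python
from typing import Sequence


def aggregate_mean(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Mean of per-subtask scores, optionally weighted by subtask size.

    Hypothetical helper for illustration; not taken from lm_eval.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not scores:
        raise ValueError("need at least one subtask")
    # With weighting disabled, every subtask counts equally.
    weights = list(sizes) if weight_by_size else [1] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Made-up example: accuracies 0.5 and 1.0 on 300 and 100 documents.
unweighted = aggregate_mean([0.5, 1.0], [300, 100], weight_by_size=False)
weighted = aggregate_mean([0.5, 1.0], [300, 100], weight_by_size=True)
print(unweighted, weighted)  # 0.75 0.625
```

With `weight_by_size: False`, as in the blimp group above, a small subtask moves the group score as much as a large one.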
group: blimp
dataset_path: blimp
output_type: multiple_choice
validation_split: train
......
aggregate_metric_list:
- aggregation: mean
  metric: acc
  weight_by_size: true
- aggregation: mean
  metric: acc_norm
  weight_by_size: true
group: ceval-valid
metadata:
  version: 1.0
task:
- ceval-valid_computer_network
- ceval-valid_operating_system
- ceval-valid_computer_architecture
- ceval-valid_college_programming
- ceval-valid_college_physics
- ceval-valid_college_chemistry
- ceval-valid_advanced_mathematics
- ceval-valid_probability_and_statistics
- ceval-valid_discrete_mathematics
- ceval-valid_electrical_engineer
- ceval-valid_metrology_engineer
- ceval-valid_high_school_mathematics
- ceval-valid_high_school_physics
- ceval-valid_high_school_chemistry
- ceval-valid_high_school_biology
- ceval-valid_middle_school_mathematics
- ceval-valid_middle_school_biology
- ceval-valid_middle_school_physics
- ceval-valid_middle_school_chemistry
- ceval-valid_veterinary_medicine
- ceval-valid_college_economics
- ceval-valid_business_administration
- ceval-valid_marxism
- ceval-valid_mao_zedong_thought
- ceval-valid_education_science
- ceval-valid_teacher_qualification
- ceval-valid_high_school_politics
- ceval-valid_high_school_geography
- ceval-valid_middle_school_politics
- ceval-valid_middle_school_geography
- ceval-valid_modern_chinese_history
- ceval-valid_ideological_and_moral_cultivation
- ceval-valid_logic
- ceval-valid_law
- ceval-valid_chinese_language_and_literature
- ceval-valid_art_studies
- ceval-valid_professional_tour_guide
- ceval-valid_legal_professional
- ceval-valid_high_school_chinese
- ceval-valid_high_school_history
- ceval-valid_middle_school_history
- ceval-valid_civil_servant
- ceval-valid_sports_science
- ceval-valid_plant_protection
- ceval-valid_basic_medicine
- ceval-valid_clinical_medicine
- ceval-valid_urban_and_rural_planner
- ceval-valid_accountant
- ceval-valid_fire_engineer
- ceval-valid_environmental_impact_assessment_engineer
- ceval-valid_tax_accountant
- ceval-valid_physician
group: ceval-valid
dataset_path: ceval/ceval-exam
validation_split: val
fewshot_split: dev
......
@@ -8,7 +8,7 @@ import os
import yaml
from tqdm import tqdm
-from lm_eval.logger import eval_logger
+from lm_eval.utils import eval_logger
SUBJECTS = {
@@ -117,3 +117,26 @@ if __name__ == "__main__":
                allow_unicode=True,
                default_style='"',
            )
    # write group config out
    group_yaml_dict = {
        "group": "ceval-valid",
        "task": [f"ceval-valid_{task_name}" for task_name in SUBJECTS.keys()],
        "aggregate_metric_list": [
            {"metric": "acc", "aggregation": "mean", "weight_by_size": True},
            {"metric": "acc_norm", "aggregation": "mean", "weight_by_size": True},
        ],
        "metadata": {"version": 1.0},
    }
    file_save_path = "_" + args.save_prefix_path + ".yaml"
    with open(file_save_path, "w", encoding="utf-8") as group_yaml_file:
        yaml.dump(
            group_yaml_dict,
            group_yaml_file,
            width=float("inf"),
            allow_unicode=True,
            default_style='"',
        )
group: cmmlu
task:
- cmmlu_agronomy
- cmmlu_anatomy
- cmmlu_ancient_chinese
- cmmlu_arts
- cmmlu_astronomy
- cmmlu_business_ethics
- cmmlu_chinese_civil_service_exam
- cmmlu_chinese_driving_rule
- cmmlu_chinese_food_culture
- cmmlu_chinese_foreign_policy
- cmmlu_chinese_history
- cmmlu_chinese_literature
- cmmlu_chinese_teacher_qualification
- cmmlu_clinical_knowledge
- cmmlu_college_actuarial_science
- cmmlu_college_education
- cmmlu_college_engineering_hydrology
- cmmlu_college_law
- cmmlu_college_mathematics
- cmmlu_college_medical_statistics
- cmmlu_college_medicine
- cmmlu_computer_science
- cmmlu_computer_security
- cmmlu_conceptual_physics
- cmmlu_construction_project_management
- cmmlu_economics
- cmmlu_education
- cmmlu_electrical_engineering
- cmmlu_elementary_chinese
- cmmlu_elementary_commonsense
- cmmlu_elementary_information_and_technology
- cmmlu_elementary_mathematics
- cmmlu_ethnology
- cmmlu_food_science
- cmmlu_genetics
- cmmlu_global_facts
- cmmlu_high_school_biology
- cmmlu_high_school_chemistry
- cmmlu_high_school_geography
- cmmlu_high_school_mathematics
- cmmlu_high_school_physics
- cmmlu_high_school_politics
- cmmlu_human_sexuality
- cmmlu_international_law
- cmmlu_journalism
- cmmlu_jurisprudence
- cmmlu_legal_and_moral_basis
- cmmlu_logical
- cmmlu_machine_learning
- cmmlu_management
- cmmlu_marketing
- cmmlu_marxist_theory
- cmmlu_modern_chinese
- cmmlu_nutrition
- cmmlu_philosophy
- cmmlu_professional_accounting
- cmmlu_professional_law
- cmmlu_professional_medicine
- cmmlu_professional_psychology
- cmmlu_public_relations
- cmmlu_security_study
- cmmlu_sociology
- cmmlu_sports_science
- cmmlu_traditional_chinese_medicine
- cmmlu_virology
- cmmlu_world_history
- cmmlu_world_religions
aggregate_metric_list:
- aggregation: mean
  metric: acc
  weight_by_size: true
- aggregation: mean
  metric: acc_norm
  weight_by_size: true
metadata:
  version: 0.0
group: cmmlu
dataset_path: haonan-li/cmmlu
test_split: test
fewshot_split: dev
......
@@ -132,3 +132,33 @@ if __name__ == "__main__":
                allow_unicode=True,
                default_style='"',
            )
    # write group config out
    group_yaml_dict = {
        "group": "cmmlu",
        "task": [
            (
                f"cmmlu_{args.task_prefix}_{subject_eng}"
                if args.task_prefix != ""
                else f"cmmlu_{subject_eng}"
            )
            for subject_eng in SUBJECTS.keys()
        ],
        "aggregate_metric_list": [
            {"metric": "acc", "aggregation": "mean", "weight_by_size": True},
            {"metric": "acc_norm", "aggregation": "mean", "weight_by_size": True},
        ],
        "metadata": {"version": 0.0},
    }
    file_save_path = "_" + args.save_prefix_path + ".yaml"
    with open(file_save_path, "w", encoding="utf-8") as group_yaml_file:
        yaml.dump(
            group_yaml_dict,
            group_yaml_file,
            width=float("inf"),
            allow_unicode=True,
            default_style='"',
        )
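
The group-config dictionary built in the generator hunk above can be sketched as a small standalone helper, which makes the task-naming rule easy to check without writing any files. `build_group_config` and its parameters are hypothetical names for illustration and are not part of the commit; the YAML dump step is left out so the sketch needs only the standard library.

```python
from typing import Dict, Iterable, List


def build_group_config(
    group: str,
    subjects: Iterable[str],
    task_prefix: str = "",
    version: float = 0.0,
) -> Dict[str, object]:
    """Build a group config dict shaped like the one in the hunk above.

    Hypothetical helper; the commit builds the dict inline instead.
    """
    # An optional prefix sits between the group name and the subject.
    stem = f"{group}_{task_prefix}" if task_prefix != "" else group
    tasks: List[str] = [f"{stem}_{subject}" for subject in subjects]
    return {
        "group": group,
        "task": tasks,
        "aggregate_metric_list": [
            {"metric": metric, "aggregation": "mean", "weight_by_size": True}
            for metric in ("acc", "acc_norm")
        ],
        "metadata": {"version": version},
    }


config = build_group_config("cmmlu", ["agronomy", "anatomy"])
print(config["task"])  # ['cmmlu_agronomy', 'cmmlu_anatomy']
```

The task names produced this way have to match the `task` field of each per-subject YAML file, such as `cmmlu_agronomy` in the files that follow.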
"dataset_name": "agronomy"
"description": "以下是关于农学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_agronomy"
"dataset_name": "anatomy"
"description": "以下是关于解剖学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_anatomy"
"dataset_name": "ancient_chinese"
"description": "以下是关于古汉语的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_ancient_chinese"
"dataset_name": "arts"
"description": "以下是关于艺术学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_arts"
"dataset_name": "astronomy"
"description": "以下是关于天文学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_astronomy"
"dataset_name": "business_ethics"
"description": "以下是关于商业伦理的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_business_ethics"
"dataset_name": "chinese_civil_service_exam"
"description": "以下是关于中国公务员考试的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_chinese_civil_service_exam"
"dataset_name": "chinese_driving_rule"
"description": "以下是关于中国驾驶规则的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_chinese_driving_rule"
"dataset_name": "chinese_food_culture"
"description": "以下是关于中国饮食文化的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_chinese_food_culture"
"dataset_name": "chinese_foreign_policy"
"description": "以下是关于中国外交政策的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_chinese_foreign_policy"
"dataset_name": "chinese_history"
"description": "以下是关于中国历史的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_chinese_history"
"dataset_name": "chinese_literature"
"description": "以下是关于中国文学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_chinese_literature"