Group agg rework (#1741)

* add group_config arg
* add a group config that allows disabling table for group score and group aggregate in general
* fixed size configuration
* adjust config
* add group config
* adjust mmlu to use group_config
* fixed args input in aggregate_subtask_metrics
* fixed issues related to printing alias of group and updated yaml
* update all mmlu variants to include group_config
* edit format
* modify mmlu tasks
* adjust group to also be a configurable group
* add configurable group
* simplify get_task_list
* adjust group scoring with using ConfigurableGroup
* adjust args
* update mmlu
* update mmlu
* update to work with new group and task configuration
* readd group_agg
* readd files
* move prepare_print_tasks to evaluator_utils
* sort set to False by default, fix predict_only arg
* add version for groups
* reversed task list
* update additional condition when loading a group in a group yaml
* update truthfulqa
* add description regarding tags replacing group
* replace group to tag
* fixed conditional statement
* remove warning
* update loading of task group and newly added tags
* reformat with pre-commit
* fixed info log
* update
* fix bug
* fix bug
* use task id to differentiate tasks
* convert all groups to configurable groups
* use task_id
* reformat
* add task_id for python tasks as well
* add task_id for python tasks as well
* add task_id for python tasks as well
* revert truthfulqa
* revert mmlu tasks
* new mmlu config
* new group config parameter `tag_to_task`
* Update truthfulqa_mc2.yaml
* reformat
* add _process_group_config
* adjust task_id
* add get_subtask_list function to get proper subtask list
* group config to_dict update
* remove tag check
* update mmlu
* fix config passing issues
* add test yaml
* format fix
* add documentation
* corner case for single tag being called
* fix indentation
* formatting
* update all mmlu variants
* Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove group_alias
* Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove version for metadata
* Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* update mmlu/
* removed " " in make_table
* change how aggregate_metric is loaded
* change how aggregate_metric is loaded
* update aggregate_metric arg
* update format
* update format
* some docs fixes
* add groups for agieval, aexams, aclue
* add more explicit aggregation groups
* add more groupings / tags distinctions
* add more groupings
* more groupings
* add many explicit group configs
* add many explicit group configs
* add more explicit group configs
* add more explicit group configs
* add more error msgs, agg_metric -> agg_metric_list
* some docs updates
* update task_id to be updateable and uses group:task format
* make KMMLU a tag for now
* update docs
* don't duplicate task names
* fix merge conflicts?
* giving this a try
* clean up diff
* switch mmlu variants over to using
* don't use to-be-deprecated group: config field in overview notebook
* Python tasks which subclass ConfigurableTask now run
* update mmlu
* pre-commit format
* fixed sorting for multi-level printing
* move group api to separate file
* fix bbh aggregation filter usage
* track api/group.py
* adjust group and tags loading
* make explicit group configs for leaderboard and other newer tasks
* fix arabicmmlu
* update
* change arabicmmlu template name???
* update group alias
* fix printing bugs
* check table printing is correct ; update tests
* use mmlu_stem to have a group included in print tests

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

Unverified Commit 517aadc4 authored Jul 08, 2024 by Lintang Sutawika Committed by GitHub Jul 08, 2024
20 changed files
--- a/lm_eval/tasks/truthfulqa/truthfulqa_gen.yaml
+++ b/lm_eval/tasks/truthfulqa/truthfulqa_gen.yaml
-group:
+tag:
  - truthfulqa
 task: truthfulqa_gen
 dataset_path: truthful_qa

--- a/lm_eval/tasks/truthfulqa/truthfulqa_mc1.yaml
+++ b/lm_eval/tasks/truthfulqa/truthfulqa_mc1.yaml
-group:
+tag:
  - truthfulqa
 task: truthfulqa_mc1
 dataset_path: truthful_qa

--- a/lm_eval/tasks/unscramble/anagrams1.yaml
+++ b/lm_eval/tasks/unscramble/anagrams1.yaml
-group:
+tag:
  - unscramble
 task: anagrams1
 dataset_path: EleutherAI/unscramble

--- a/lm_eval/tasks/unscramble/anagrams2.yaml
+++ b/lm_eval/tasks/unscramble/anagrams2.yaml
-group:
+tag:
  - unscramble
 task: anagrams2
 dataset_path: EleutherAI/unscramble

--- a/lm_eval/tasks/unscramble/cycle_letters.yaml
+++ b/lm_eval/tasks/unscramble/cycle_letters.yaml
-group:
+tag:
  - unscramble
 task: cycle_letters
 dataset_path: EleutherAI/unscramble

--- a/lm_eval/tasks/unscramble/random_insertion.yaml
+++ b/lm_eval/tasks/unscramble/random_insertion.yaml
-group:
+tag:
  - unscramble
 task: random_insertion
 dataset_path: EleutherAI/unscramble

--- a/lm_eval/tasks/unscramble/reversed_words.yaml
+++ b/lm_eval/tasks/unscramble/reversed_words.yaml
-group:
+tag:
  - unscramble
 task: reversed_words
 dataset_path: EleutherAI/unscramble

--- a/lm_eval/tasks/webqs/webqs.yaml
+++ b/lm_eval/tasks/webqs/webqs.yaml
-group:
+tag:
  - freebase
 task: webqs
 dataset_path: web_questions
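
Note: the hunks above all apply the same rename, turning a per-task `group:` key into `tag:`. A minimal sketch of that rename on an already-loaded config dict follows. The helper name `group_key_to_tag` is hypothetical and not part of the harness, and the rule "only rename when `task` is a single string" is an assumption made for illustration. It does not cover the cases later in this diff where `group:` is dropped from a task file and the membership moves into a separate group YAML.

```python
from typing import Any, Dict


def group_key_to_tag(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a per-task config with a legacy `group` key renamed to `tag`.

    Hypothetical helper: configs whose `task` is a list are treated as group
    definitions and returned unchanged.
    """
    migrated = dict(config)  # shallow copy so the caller's dict is untouched
    if "group" in migrated and isinstance(migrated.get("task"), str):
        migrated["tag"] = migrated.pop("group")
    return migrated


# Mirrors the anagrams1.yaml hunk above, written as a plain dict.
old_config = {
    "group": ["unscramble"],
    "task": "anagrams1",
    "dataset_path": "EleutherAI/unscramble",
}
new_config = group_key_to_tag(old_config)
print(new_config)
```

A config whose `task` is a list of subtask names (a group definition) passes through unchanged, which matches how `group:` is still used in the `_wmdp.yaml`-style files added by this commit.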

--- a/lm_eval/tasks/wmdp/README.md
+++ b/lm_eval/tasks/wmdp/README.md
@@ -24,7 +24,7 @@ Homepage: https://wmdp.ai
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

 #### Groups


--- a/lm_eval/tasks/wmdp/_default_template_yaml
+++ b/lm_eval/tasks/wmdp/_default_template_yaml
 dataset_path: cais/wmdp
-group: wmdp
 test_split: test
 training_split: null
 validation_split: null

--- a/lm_eval/tasks/wmdp/_wmdp.yaml
+++ b/lm_eval/tasks/wmdp/_wmdp.yaml
+group: wmdp
+task:
+  - wmdp_bio
+  - wmdp_chem
+  - wmdp_cyber
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: True
+metadata:
+  version: 0
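
Note: the new `_wmdp.yaml` asks for `aggregation: mean` with `weight_by_size: True`. As a rough sketch of what a size-weighted mean over subtask scores computes (not the harness's own implementation), the function below compares it with a plain mean. The function name `aggregate_mean` and the scores and sizes are made up for illustration; they are not real WMDP results.

```python
from typing import Sequence


def aggregate_mean(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool,
) -> float:
    """Combine per-subtask scores into one group score.

    With weight_by_size=True each score is weighted by its subtask's number
    of examples; otherwise every subtask counts equally.
    """
    if not scores or len(scores) != len(sizes):
        raise ValueError("scores and sizes must be non-empty and the same length")
    if not weight_by_size:
        return sum(scores) / len(scores)
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical accuracies and example counts for three subtasks.
subtask_acc = [0.5, 0.25, 0.75]
subtask_sizes = [100, 300, 100]

weighted = aggregate_mean(subtask_acc, subtask_sizes, weight_by_size=True)
unweighted = aggregate_mean(subtask_acc, subtask_sizes, weight_by_size=False)
print(f"weighted={weighted:.3f} unweighted={unweighted:.3f}")
```

With these made-up numbers the large, low-scoring subtask pulls the weighted score below the unweighted one, which is the practical difference the `weight_by_size` flag controls.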
--- a/lm_eval/tasks/wmt2016/README.md
+++ b/lm_eval/tasks/wmt2016/README.md
@@ -27,11 +27,7 @@ Homepage: https://huggingface.co/datasets/wmt16
 }
 ```

-### Groups and Tasks
-
-#### Groups
-
-* `wmt-t5-prompt`: Group for all wmt tasks with prompt templates used for T5 (`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`)
+### Groups, Tags, and Tasks

 #### Tasks


--- a/lm_eval/tasks/wmt2016/ro_en-t5_prompt.yaml
+++ b/lm_eval/tasks/wmt2016/ro_en-t5_prompt.yaml
-group:
-  - wmt-t5-prompt
 task: wmt-ro-en-t5-prompt
 dataset_path: wmt16
 dataset_name: ro-en

--- a/lm_eval/tasks/xcopa/_xcopa.yaml
+++ b/lm_eval/tasks/xcopa/_xcopa.yaml
+group: xcopa
+task:
+  - xcopa_et
+  - xcopa_ht
+  - xcopa_id
+  - xcopa_it
+  - xcopa_qu
+  - xcopa_sw
+  - xcopa_ta
+  - xcopa_th
+  - xcopa_tr
+  - xcopa_vi
+  - xcopa_zh
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: True
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/xcopa/default_et.yaml
+++ b/lm_eval/tasks/xcopa/default_et.yaml
-group: xcopa
 task: xcopa_et
 dataset_path: xcopa
 dataset_name: et

--- a/lm_eval/tasks/xnli/_xnli.yaml
+++ b/lm_eval/tasks/xnli/_xnli.yaml
+group: xnli
+task:
+  - xnli_ar
+  - xnli_bg
+  - xnli_de
+  - xnli_el
+  - xnli_en
+  - xnli_es
+  - xnli_fr
+  - xnli_hi
+  - xnli_ru
+  - xnli_sw
+  - xnli_th
+  - xnli_tr
+  - xnli_ur
+  - xnli_vi
+  - xnli_zh
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
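
Note: the group files added in this commit share one shape: a `group` name, a `task` list, an `aggregate_metric_list`, and `metadata`. The sketch below shows one way such a file could be checked after loading it into a dict. The class and function names (`AggMetricSpec`, `GroupSpec`, `parse_group_config`) are hypothetical and are not the harness's own group API; the validation rules are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AggMetricSpec:
    # One entry of `aggregate_metric_list`.
    metric: str
    aggregation: str
    weight_by_size: bool


@dataclass
class GroupSpec:
    group: str
    task: List[str]
    aggregate_metric_list: List[AggMetricSpec] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_group_config(raw: Dict[str, Any]) -> GroupSpec:
    """Validate a loaded group YAML (as a dict) and convert it to GroupSpec."""
    if not isinstance(raw.get("group"), str):
        raise ValueError("`group` must be a string name")
    tasks = raw.get("task")
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("`task` must be a non-empty list of subtask names")
    metrics = [AggMetricSpec(**entry) for entry in raw.get("aggregate_metric_list", [])]
    return GroupSpec(
        group=raw["group"],
        task=list(tasks),
        aggregate_metric_list=metrics,
        metadata=dict(raw.get("metadata", {})),
    )


# The _xnli.yaml content above, shortened to its first three subtasks.
raw_xnli = {
    "group": "xnli",
    "task": ["xnli_ar", "xnli_bg", "xnli_de"],
    "aggregate_metric_list": [
        {"metric": "acc", "aggregation": "mean", "weight_by_size": True}
    ],
    "metadata": {"version": 1.0},
}
spec = parse_group_config(raw_xnli)
print(spec.group, len(spec.task), spec.aggregate_metric_list[0].metric)
```

A per-task file that still carries a bare `group: xnli` string next to `task: xnli_ar` would be rejected by this sketch, since its `task` is not a list; that is the distinction between group definitions and task configs that this commit makes explicit.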
--- a/lm_eval/tasks/xnli/xnli_common_yaml
+++ b/lm_eval/tasks/xnli/xnli_common_yaml
 # This file will be included in the generated language-specific task configs.
 # It doesn't have a yaml file extension as it is not meant to be imported directly
 # by the harness.
-group: xnli
 task: null
 dataset_path: xnli
 dataset_name: null

--- a/lm_eval/tasks/xnli_eu/README.md
+++ b/lm_eval/tasks/xnli_eu/README.md
@@ -24,9 +24,9 @@ Homepage: https://github.com/hitz-zentroa/xnli-eu
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

-#### Groups
+#### Tags

 * `xnli_eu_mt_native`: Includes MT and Native variants of the XNLIeu dataset.


--- a/lm_eval/tasks/xnli_eu/xnli_common_yaml
+++ b/lm_eval/tasks/xnli_eu/xnli_common_yaml
-group: xnli
 task: null
 dataset_path: xnli
 dataset_name: null

--- a/lm_eval/tasks/xnli_eu/xnli_eu_mt.yaml
+++ b/lm_eval/tasks/xnli_eu/xnli_eu_mt.yaml
 include: xnli_eu.yaml
-group: xnli_eu_mt_native
+tag: xnli_eu_mt_native
 task: xnli_eu_mt
 dataset_name: eu_mt