Group agg rework (#1741)

* add group_config arg
* add a group config that allows disabling table for group score and group aggregate in general
* fixed size configuration
* adjust config
* add group config
* adjust mmlu to use group_config
* fixed args input in aggregate_subtask_metrics
* fixed issues related to printing alias of group and updated yaml
* update all mmlu variants to include group_config
* edit format
* modify mmlu tasks
* adjust group to also be a configurable group
* add configurable group
* simplify get_task_list
* adjust group scoring with using ConfigurableGroup
* adjust args
* update mmlu
* update mmlu
* update to work with new group and task configuration
* readd group_agg
* readd files
* move prepare_print_tasks to evaluator_utils
* sort set to False by default, fix predict_only arg
* add version for groups
* reversed task list
* update additional condition when loading a group in a group yaml
* update truthfulqa
* add description regarding tags replacing group
* replace group to tag
* fixed conditional statement
* remove warning
* update loading of task group and newly added tags
* reformat with pre-commit
* fixed info log
* update
* fix bug
* fix bug
* use task id to differentiate tasks
* convert all groups to configurable groups
* use task_id
* reformat
* add task_id for python tasks as well
* add task_id for python tasks as well
* add task_id for python tasks as well
* revert truthfulqa
* revert mmlu tasks
* new mmlu config
* new group config parameter `tag_to_task`
* Update truthfulqa_mc2.yaml
* reformat
* add _process_group_config
* adjust task_id
* add get_subtask_list function to get proper subtask list
* group config to_dict update
* remove tag check
* update mmlu
* fix config passing issues
* add test yaml
* format fix
* add documentation
* corner case for single tag being called
* fix indentation
* formatting
* update all mmlu variants
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove group_alias
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove version for metadata
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* update mmlu/
* removed " " in make_table
* change how aggregate_metric is loaded
* change how aggregate_metric is loaded
* update aggregate_metric arg
* update format
* update format
* some docs fixes
* add groups for agieval, aexams, aclue
* add more explicit aggregation groups
* add more groupings / tags distinctions
* add more groupings
* more groupings
* add many explicit group configs
* add many explicit group configs
* add more explicit group configs
* add more explicit group configs
* add more error msgs, agg_metric -> agg_metric_list
* some docs updates
* update task_id to be updateable and uses group:task format
* make KMMLU a tag for now
* update docs
* don't duplicate task names
* fix merge conflicts?
* giving this a try
* clean up diff
* switch mmlu variants over to using
* don't use to-be-deprecated group: config field in overview notebook
* Python tasks which subclass ConfigurableTask now run
* update mmlu
* pre-commit format
* fixed sorting for multi-level printing
* move group api to separate file
* fix bbh aggregation filter usage
* track api/group.py
* adjust group and tags loading
* make explicit group configs for leaderboard and other newer tasks
* fix arabicmmlu
* update
* change arabicmmlu template name???
* update group alias
* fix printing bugs
* check table printing is correct ; update tests
* use mmlu_stem to have a group included in print tests

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

Commit 517aadc4, authored Jul 08, 2024 by Lintang Sutawika, committed by GitHub Jul 08, 2024
20 changed files
--- a/lm_eval/tasks/super_glue/copa/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/copa/t5-prompt.yaml
-group:
+tag:
  - super-glue-t5-prompt
 task: super_glue-copa-t5-prompt
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/multirc/default.yaml
+++ b/lm_eval/tasks/super_glue/multirc/default.yaml
-group:
+tag:
  - super-glue-lm-eval-v1
 task: multirc
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/multirc/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/multirc/t5-prompt.yaml
-group:
+tag:
  - super-glue-t5-prompt
 task: super_glue-multirc-t5-prompt
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/record/default.yaml
+++ b/lm_eval/tasks/super_glue/record/default.yaml
-group:
+tag:
  - super-glue-lm-eval-v1
 task: record
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/record/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/record/t5-prompt.yaml
-group:
+tag:
  - super-glue-t5-prompt
 task: super_glue-record-t5-prompt
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/rte/default.yaml
+++ b/lm_eval/tasks/super_glue/rte/default.yaml
-group:
+tag:
  - super-glue-lm-eval-v1
 task: sglue_rte
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/rte/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/rte/t5-prompt.yaml
-group:
+tag:
  - super-glue-t5-prompt
 task: super_glue-rte-t5-prompt
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/wic/default.yaml
+++ b/lm_eval/tasks/super_glue/wic/default.yaml
-group:
+tag:
  - super-glue-lm-eval-v1
 task: "wic"
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/wic/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/wic/t5-prompt.yaml
-group:
+tag:
  - super-glue-t5-prompt
 task: super_glue-wic-t5-prompt
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/wsc/default.yaml
+++ b/lm_eval/tasks/super_glue/wsc/default.yaml
-group:
+tag:
  - super-glue-lm-eval-v1
 task: wsc
 dataset_path: super_glue

--- a/lm_eval/tasks/super_glue/wsc/t5-prompt.yaml
+++ b/lm_eval/tasks/super_glue/wsc/t5-prompt.yaml
-group:
+tag:
  - super-glue-t5-prompt
 task: super_glue-wsc-t5-prompt
 dataset_path: super_glue

--- a/lm_eval/tasks/swde/task.py
+++ b/lm_eval/tasks/swde/task.py
@@ -12,7 +12,7 @@ class SWDE(ConfigurableTask):
    DATASET_PATH = "hazyresearch/based-swde-v2"
    DATASET_NAME = "default"

-    def __init__(self):
+    def __init__(self, **kwargs):
        super().__init__(config={"metadata": {"version": self.VERSION}})

    def has_training_docs(self):
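
Note on the hunk above: the only change is that `SWDE.__init__` now accepts `**kwargs`. The commit message says "Python tasks which subclass ConfigurableTask now run", which suggests the task loader passes keyword arguments when instantiating Python tasks; which arguments it passes is not shown in this diff. The sketch below illustrates the effect with hypothetical stand-in classes (`BaseTask`, `OldStyleTask`, `NewStyleTask`, `try_build` are illustrative names, not part of lm_eval).

```python
class BaseTask:
    """Hypothetical stand-in for ConfigurableTask; it only stores the config."""

    def __init__(self, config=None):
        self.config = config or {}


class OldStyleTask(BaseTask):
    VERSION = 0

    def __init__(self):
        # Signature before the change: no keyword arguments accepted.
        super().__init__(config={"metadata": {"version": self.VERSION}})


class NewStyleTask(BaseTask):
    VERSION = 0

    def __init__(self, **kwargs):
        # Signature after the change: extra keyword arguments are accepted
        # and ignored; the task still builds its own fixed config.
        super().__init__(config={"metadata": {"version": self.VERSION}})


def try_build(task_cls, **kwargs):
    """Instantiate task_cls with kwargs; return None if the signature rejects them."""
    try:
        return task_cls(**kwargs)
    except TypeError:
        return None


# Assumption for illustration: the caller passes a `config` keyword argument.
old = try_build(OldStyleTask, config={"task": "swde"})
new = try_build(NewStyleTask, config={"task": "swde"})
print(old is None)   # the old signature raises TypeError, so nothing is built
print(new.config)    # the new signature ignores the kwarg and uses its own config
```

As in the hunk, the keyword arguments are swallowed rather than forwarded to `super().__init__`, so whatever the caller passes has no effect on the task's config.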

--- a/lm_eval/tasks/translation/iwslt2017_ar-en.yaml
+++ b/lm_eval/tasks/translation/iwslt2017_ar-en.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["en"]}}'
 doc_to_text: 'Arabic phrase: {{translation["ar"]}}

  English phrase:'
-group:
- generate_until
+tag:
 - translation
 - iwslt2017
 include: wmt_common_yaml

--- a/lm_eval/tasks/translation/iwslt2017_en-ar.yaml
+++ b/lm_eval/tasks/translation/iwslt2017_en-ar.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["ar"]}}'
 doc_to_text: 'English phrase: {{translation["en"]}}

  Arabic phrase:'
-group:
- generate_until
+tag:
 - translation
 - iwslt2017
 include: wmt_common_yaml

--- a/lm_eval/tasks/translation/wmt14_en-fr.yaml
+++ b/lm_eval/tasks/translation/wmt14_en-fr.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["fr"]}}'
 doc_to_text: 'English phrase: {{translation["en"]}}

  French phrase:'
-group:
- generate_until
+tag:
 - translation
 - wmt14
 - gpt3_translation_benchmarks

--- a/lm_eval/tasks/translation/wmt14_fr-en.yaml
+++ b/lm_eval/tasks/translation/wmt14_fr-en.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["en"]}}'
 doc_to_text: 'French phrase: {{translation["fr"]}}

  English phrase:'
-group:
- generate_until
+tag:
 - translation
 - wmt14
 - gpt3_translation_benchmarks

--- a/lm_eval/tasks/translation/wmt16_de-en.yaml
+++ b/lm_eval/tasks/translation/wmt16_de-en.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["en"]}}'
 doc_to_text: 'German phrase: {{translation["de"]}}

  English phrase:'
-group:
- generate_until
+tag:
 - translation
 - wmt16
 - gpt3_translation_benchmarks

--- a/lm_eval/tasks/translation/wmt16_en-de.yaml
+++ b/lm_eval/tasks/translation/wmt16_en-de.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["de"]}}'
 doc_to_text: 'English phrase: {{translation["en"]}}

  German phrase:'
-group:
- generate_until
+tag:
 - translation
 - wmt16
 - gpt3_translation_benchmarks

--- a/lm_eval/tasks/translation/wmt16_en-ro.yaml
+++ b/lm_eval/tasks/translation/wmt16_en-ro.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["ro"]}}'
 doc_to_text: 'English phrase: {{translation["en"]}}

  Romanian phrase:'
-group:
- generate_until
+tag:
 - translation
 - wmt16
 - gpt3_translation_benchmarks

--- a/lm_eval/tasks/translation/wmt16_ro-en.yaml
+++ b/lm_eval/tasks/translation/wmt16_ro-en.yaml
@@ -5,8 +5,7 @@ doc_to_target: ' {{translation["en"]}}'
 doc_to_text: 'Romanian phrase: {{translation["ro"]}}

  English phrase:'
-group:
- generate_until
+tag:
 - translation
 - wmt16
 - gpt3_translation_benchmarks
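
The YAML hunks above all follow one pattern: the top-level `group:` key becomes `tag:`, and in the translation configs the `generate_until` entry is dropped from the list. A minimal sketch of that rewrite as a line-based Python helper follows. It is not the script used for this commit (the source does not say how the edits were made); `migrate_group_to_tag` is a hypothetical name, and it only handles the block-list form shown in these hunks, not an inline `group: name` scalar.

```python
def migrate_group_to_tag(yaml_text, drop=("generate_until",)):
    """Rename a top-level `group:` key to `tag:` and drop unwanted list entries."""
    out = []
    in_list = False  # True while reading the list items under the renamed key
    for line in yaml_text.splitlines():
        if line.rstrip() == "group:":
            out.append("tag:")
            in_list = True
            continue
        if in_list:
            stripped = line.strip()
            if stripped.startswith("- "):
                if stripped[2:].strip() in drop:
                    continue  # skip entries such as `generate_until`
            else:
                in_list = False  # first non-item line ends the list
        out.append(line)
    return "\n".join(out)


before = "\n".join([
    "group:",
    "- generate_until",
    "- translation",
    "- wmt16",
    "- gpt3_translation_benchmarks",
])
print(migrate_group_to_tag(before))
```

For the super_glue configs, where the list has a single indented entry and nothing to drop, the helper only renames the key and leaves the entry line untouched.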