Commit 517aadc4, authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring to use ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group with tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updatable and use group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
-group:
+tag:
   - m_mmlu
 dataset_path: alexandrainst/m_mmlu
 test_split: test
......
-group:
+tag:
   - truthfulqa_multilingual
 dataset_path: null
 dataset_name: null
......
-group:
+tag:
   - paloma
 dataset_path: allenai/paloma
 output_type: loglikelihood_rolling
......
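The hunks above all make the same mechanical change: a task config's `group` key becomes `tag`, with the value left untouched. A minimal sketch of that rename on an already-parsed config, assuming configs are plain dictionaries; the helper name `group_to_tag` is hypothetical and is not part of the harness.

```python
from typing import Any


def group_to_tag(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a task config with a legacy `group` key renamed to `tag`.

    Hypothetical helper for illustration only; not a harness function.
    """
    if "group" not in config:
        return dict(config)
    if "tag" in config:
        raise ValueError("config defines both 'group' and 'tag'")
    # Rebuild the dict so `tag` keeps the position `group` had.
    return {("tag" if key == "group" else key): value for key, value in config.items()}


# Values copied from the paloma hunk above.
old = {
    "group": ["paloma"],
    "dataset_path": "allenai/paloma",
    "output_type": "loglikelihood_rolling",
}
new = group_to_tag(old)
```

The rename only fits configs where the old `group` was used as a label. Where an aggregate score is wanted, this commit adds an explicit group config instead, as in the `pawsx` file that follows.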
+group: pawsx
+task:
+  - paws_en
+  - paws_de
+  - paws_es
+  - paws_fr
+  - paws_ja
+  - paws_ko
+  - paws_zh
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 0.0
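The `aggregate_metric_list` entry in the `pawsx` group config above asks for a mean of `acc` with `weight_by_size: true`. A minimal sketch of what a size-weighted mean could look like, assuming the weight is each subtask's number of evaluated documents. The function name `aggregate_mean`, the scores, and the sizes are made up for illustration, and the harness's own implementation may differ.

```python
def aggregate_mean(
    scores: dict[str, float],
    sizes: dict[str, int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask scores into one group score.

    Illustrative only: with weight_by_size=True each subtask counts in
    proportion to its size; otherwise every subtask counts equally.
    """
    if not scores:
        raise ValueError("no subtask scores to aggregate")
    if weight_by_size:
        total = sum(sizes[name] for name in scores)
        return sum(scores[name] * sizes[name] for name in scores) / total
    return sum(scores.values()) / len(scores)


# Made-up numbers, not real PAWS-X results.
scores = {"paws_en": 0.5, "paws_de": 1.0}
sizes = {"paws_en": 100, "paws_de": 300}

weighted = aggregate_mean(scores, sizes, weight_by_size=True)
unweighted = aggregate_mean(scores, sizes, weight_by_size=False)
```

With these numbers the larger subtask pulls the weighted mean toward its own score, which is the practical difference between the two settings.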
 # This file will be included in the generated language-specific task configs.
 # It doesn't have a yaml file extension as it is not meant to be imported directly
 # by the harness.
-group: pawsx
 task: null
 dataset_path: paws-x
 dataset_name: null
......
-group:
-  - pile
 task: pile_arxiv
 dataset_path: EleutherAI/pile
 dataset_name: pile_arxiv
......
-group:
+tag:
   - polemo2
 task: polemo2_in
 dataset_path: allegro/klej-polemo2-in
......
-group:
+tag:
   - qa4mre
 task: qa4mre_2011
 dataset_path: qa4mre
......
-group: qasper
+tag: qasper
 task: qasper_bool
 dataset_path: allenai/qasper
 output_type: multiple_choice
......
-group: qasper
+tag: qasper
 task: qasper_freeform
 dataset_path: allenai/qasper
 output_type: generate_until
......
@@ -12,7 +12,7 @@ class SQUADCompletion(ConfigurableTask):
     DATASET_PATH = "hazyresearch/based-squad"
     DATASET_NAME = "default"
 
-    def __init__(self):
+    def __init__(self, **kwargs):
         super().__init__(config={"metadata": {"version": self.VERSION}})
 
     def has_training_docs(self):
......
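The hunk above widens `SQUADCompletion.__init__` to accept arbitrary keyword arguments. A minimal sketch of why that matters, assuming the caller constructs tasks with keyword arguments such as a config. `BaseTask`, `OldStyleTask`, `NewStyleTask`, and `build` are stand-ins written for this example, not the real `lm_eval` classes or loader.

```python
class BaseTask:
    """Stand-in for a configurable task base class (not the real lm_eval class)."""

    def __init__(self, config=None):
        self.config = config or {}


class OldStyleTask(BaseTask):
    VERSION = 0

    def __init__(self):
        super().__init__(config={"metadata": {"version": self.VERSION}})


class NewStyleTask(BaseTask):
    VERSION = 0

    def __init__(self, **kwargs):
        # As in the diff, extra keyword arguments are accepted but not forwarded.
        super().__init__(config={"metadata": {"version": self.VERSION}})


def build(task_cls, **kwargs):
    """Hypothetical loader step: construct a task, passing keyword arguments through."""
    return task_cls(**kwargs)


task = build(NewStyleTask, config={"task": "example"})

try:
    build(OldStyleTask, config={"task": "example"})
    old_style_accepts_kwargs = True
except TypeError:
    old_style_accepts_kwargs = False
```

In this sketch the old signature raises `TypeError` as soon as the caller passes any keyword argument, while the new one constructs normally and keeps its hard-coded config.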
-group: storycloze
+tag: storycloze
 task: storycloze_2016
 dataset_path: story_cloze
 dataset_name: 2016
......
-group: storycloze
+tag: storycloze
 task: storycloze_2018
 dataset_path: story_cloze
 dataset_name: 2018
......
@@ -26,10 +26,14 @@ Homepage: https://super.gluebenchmark.com/
 }
 ```
 
-### Groups and Tasks
+### Groups, Tags, and Tasks
 
 #### Groups
 
+None.
+
+#### Tags
+
 * `super-glue-lm-eval-v1`: SuperGLUE eval adapted from LM Eval V1
 * `super-glue-t5-prompt`: SuperGLUE prompt and evaluation that matches the T5 paper (if using accelerate, will error if record is included.)
......
-group:
+tag:
   - super-glue-lm-eval-v1
 task: boolq
 dataset_path: super_glue
......
-group:
+tag:
   - super-glue-lm-eval-v1-seq2seq
 task: "boolq-seq2seq"
 dataset_path: super_glue
......
-group:
+tag:
   - super-glue-t5-prompt
 task: super_glue-boolq-t5-prompt
 dataset_path: super_glue
......
-group:
+tag:
   - super-glue-lm-eval-v1
 task: cb
 dataset_path: super_glue
......
-group:
+tag:
   - super-glue-t5-prompt
 task: super_glue-cb-t5-prompt
 dataset_path: super_glue
......
-group:
+tag:
   - super-glue-lm-eval-v1
 task: copa
 dataset_path: super_glue
......
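After this change the SuperGLUE configs above carry tags rather than groups, so a tag names a set of tasks without implying an aggregate score. A minimal sketch of how a tag could be resolved to its tasks, assuming configs are parsed into dictionaries and that `tag` may be either a string or a list, as both forms appear in this diff. The helper `index_by_tag` is hypothetical.

```python
from collections import defaultdict


def index_by_tag(configs):
    """Map each tag to the task names that declare it, in input order.

    Hypothetical helper for illustration; not a harness function.
    """
    index = defaultdict(list)
    for cfg in configs:
        tags = cfg.get("tag", [])
        if isinstance(tags, str):
            # `tag: qasper` (a bare string) and `tag: [qasper]` are treated alike.
            tags = [tags]
        for tag in tags:
            index[tag].append(cfg["task"])
    return dict(index)


# Tag and task names copied from the SuperGLUE hunks above.
configs = [
    {"tag": ["super-glue-lm-eval-v1"], "task": "boolq"},
    {"tag": ["super-glue-lm-eval-v1-seq2seq"], "task": "boolq-seq2seq"},
    {"tag": ["super-glue-t5-prompt"], "task": "super_glue-boolq-t5-prompt"},
    {"tag": ["super-glue-lm-eval-v1"], "task": "cb"},
    {"tag": ["super-glue-t5-prompt"], "task": "super_glue-cb-t5-prompt"},
    {"tag": ["super-glue-lm-eval-v1"], "task": "copa"},
]
tag_index = index_by_tag(configs)
```

Here `tag_index["super-glue-lm-eval-v1"]` lists `boolq`, `cb`, and `copa`, the three tasks that carry that tag in the hunks shown.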