Unverified commit 517aadc4 authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
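The diff below replaces many former `group:` registrations with `tag:` (see the "replace group to tag" and "add description regarding tags replacing group" commits above). Roughly: a tag is a bare alias that expands to its member tasks, while a group is a configurable object that can also define how subtask scores are aggregated. A minimal sketch of that distinction, with illustrative member-task names and data structures (not the harness internals):

```python
# Hedged sketch of the tag/group distinction introduced in this PR.
# The registry shapes and member task names below are illustrative only.

tags = {"kmmlu": ["kmmlu_accounting", "kmmlu_biology"]}  # alias only

groups = {
    "mmlu_stem": {
        "task": ["abstract_algebra", "anatomy"],
        # groups may additionally carry config, e.g. an aggregate metric:
        "aggregate_metric_list": [
            {"metric": "acc", "aggregation": "mean", "weight_by_size": True}
        ],
    }
}

def expand(name):
    """Resolve a name to the list of concrete tasks it selects."""
    if name in tags:
        return tags[name]
    if name in groups:
        return groups[name]["task"]
    return [name]  # already a concrete task name

print(expand("kmmlu"))      # tag: tasks only, no group-level score reported
print(expand("mmlu_stem"))  # group: tasks, plus an aggregate row for the group
```

Running a tag reports only its member tasks' rows; running a group can additionally report an aggregate row for the group itself, which is why MMLU stays a group while KMMLU becomes a tag in this PR.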
include: xnli_eu.yaml
-group: xnli_eu_mt_native
+tag: xnli_eu_mt_native
task: xnli_eu_native
training_split: null
validation_split: null
......
group: xstorycloze
task:
- xstorycloze_ar
- xstorycloze_en
- xstorycloze_es
- xstorycloze_eu
- xstorycloze_hi
- xstorycloze_id
- xstorycloze_my
- xstorycloze_ru
- xstorycloze_sw
- xstorycloze_te
- xstorycloze_zh
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
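The `aggregate_metric_list` entry above requests a size-weighted mean of the subtasks' `acc` scores. A minimal sketch of that computation with made-up inputs (the real signature of the harness's `aggregate_subtask_metrics` may differ):

```python
# Hedged sketch of `aggregation: mean` with `weight_by_size: true`:
# each subtask's score is weighted by its number of examples.

def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    """Aggregate per-subtask metric values into one group-level score."""
    if not weight_by_size:
        sizes = [1] * len(metrics)  # plain unweighted mean
    total = sum(sizes)
    return sum(m * s for m, s in zip(metrics, sizes)) / total

# e.g. three subtasks with 100, 50, and 50 examples:
score = aggregate_subtask_metrics([0.8, 0.6, 0.7], [100, 50, 50])
print(score)  # weighted: (0.8*100 + 0.6*50 + 0.7*50) / 200 = 0.725
```

With `weight_by_size: false` the same inputs would average to 0.7, so the flag matters whenever subtask sizes are uneven.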
group: xstorycloze
task: xstorycloze_ar
dataset_path: juletxara/xstory_cloze
dataset_name: ar
......
group: xwinograd
task:
- xwinograd_en
- xwinograd_fr
- xwinograd_jp
- xwinograd_pt
- xwinograd_ru
- xwinograd_zh
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group:
- xwinograd
dataset_path: Muennighoff/xwinograd
dataset_name: null # Overridden by language-specific config.
output_type: multiple_choice
......
@@ -308,7 +308,7 @@ class Reorderer:
return res
-def make_table(result_dict, column: str = "results", sort_results: bool = True):
+def make_table(result_dict, column: str = "results", sort_results: bool = False):
"""Generate table of results."""
from pytablewriter import LatexTableWriter, MarkdownTableWriter
@@ -338,20 +338,21 @@ def make_table(result_dict, column: str = "results", sort_results: bool = True):
keys = result_dict[column].keys()
if sort_results:
-# sort entries alphabetically
+# sort entries alphabetically by task or group name.
+# NOTE: we default here to false, because order matters for multi-level table printing a la mmlu.
+# sorting here would mess that up
keys = sorted(keys)
for k in keys:
dic = result_dict[column][k]
-version = result_dict["versions"].get(k, "N/A")
-n = str(result_dict["n-shot"][k])
+version = result_dict["versions"].get(k, " N/A")
+n = str(result_dict.get("n-shot", " ").get(k, " "))
higher_is_better = result_dict.get("higher_is_better", {}).get(k, {})
if "alias" in dic:
k = dic.pop("alias")
metric_items = dic.items()
-if sort_results:
-    metric_items = sorted(metric_items)
+metric_items = sorted(metric_items)
for (mf), v in metric_items:
m, _, f = mf.partition(",")
@@ -362,8 +363,7 @@ def make_table(result_dict, column: str = "results", sort_results: bool = True):
if m + "_stderr" + "," + f in dic:
se = dic[m + "_stderr" + "," + f]
-if se != "N/A":
-    se = "%.4f" % se
+se = " N/A" if se == "N/A" else "%.4f" % se
values.append([k, version, f, n, m, hib, "%.4f" % v, "±", se])
else:
values.append([k, version, f, n, m, hib, "%.4f" % v, "", ""])
......
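The flip to `sort_results: bool = False` above exists because group tables print a parent row immediately followed by its indented ` - subtask` rows, and alphabetical sorting would tear those apart. A small illustration (row names are examples only, not harness output):

```python
# Rows as emitted for a multi-level table: each parent group is
# immediately followed by its " - " children, a la the mmlu layout.
rows = ["mmlu_stem", " - abstract_algebra", " - anatomy", "ai2_arc", " - arc_easy"]

print(rows)          # hierarchical order: children stay under their parent
print(sorted(rows))  # leading-space children all sort to the front, orphaned
```

Because the leading-space child rows sort before any parent name, sorting would strand every subtask row away from its group header, which is why the default is now off and callers must opt in.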
@@ -17,12 +17,16 @@ Homepage: `homepage to the benchmark's website goes here, if applicable`
BibTeX-formatted citation goes here
```
-### Groups and Tasks
+### Groups, Tags, and Tasks
#### Groups
* `group_name`: `Short description`
+#### Tags
+* `tag_name`: `Short description`
#### Tasks
* `task_name`: `1-sentence description of what this particular task does`
......
@@ -90,7 +90,7 @@ def test_evaluator(
"pretrained=EleutherAI/pythia-14m,dtype=float32,device=cpu",
),
(
-["mmlu_abstract_algebra", "mmlu_global_facts", "mmlu_public_relations"],
+["mmlu_stem"],
10,
"hf",
"pretrained=EleutherAI/pythia-14m,dtype=float32,device=cpu",
......
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|----------------|-------|------|-----:|--------|---|----:|---|------|
|ai2_arc |N/A |none | 0|acc |↑ | 0.15|± |N/A |
| | |none | 0|acc_norm|↑ | 0.05|± |N/A |
| - arc_challenge| 1|none | 0|acc |↑ | 0.00|± |N/A |
| | |none | 0|acc_norm|↑ | 0.00|± |N/A |
| - arc_easy | 1|none | 0|acc |↑ | 0.30|± |N/A |
| | |none | 0|acc_norm|↑ | 0.10|± |N/A |
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|-------------|------:|------|-----:|--------|---|----:|---|-----:|
|arc_challenge| 1|none | 0|acc |↑ | 0.0|± |0.0000|
| | |none | 0|acc_norm|↑ | 0.0|± |0.0000|
|arc_easy | 1|none | 0|acc |↑ | 0.3|± |0.1528|
| | |none | 0|acc_norm|↑ | 0.1|± |0.1000|
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------------|------:|------|-----:|----------|---|-------:|---|------|
|lambada_openai| 1|none | 0|acc |↑ | 0.1000|± |N/A |
| | |none | 0|perplexity|↓ |605.4879|± |N/A |
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | | Value | | Stderr |
|--------------|------:|------|-----:|----------|---|-------:|---|--------:|
|lambada_openai| 1|none | 0|acc |↑ | 0.1000|± | 0.1000|
| | |none | 0|perplexity|↓ |605.3866|± |1636.6987|
\ No newline at end of file
| Tasks |Version|Filter|n-shot|Metric| |Value| |Stderr|
|----------------|------:|------|-----:|------|---|----:|---|------|
|abstract_algebra| 0|none | 0|acc |↑ | 0.2|± |N/A |
|global_facts | 0|none | 0|acc |↑ | 0.2|± |N/A |
|public_relations| 0|none | 0|acc |↑ | 0.2|± |N/A |
\ No newline at end of file
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|stem | 1|none | |acc |↑ |0.2474|± |0.0315|
| - abstract_algebra | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - anatomy | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - astronomy | 0|none | 0|acc |↑ |0.1000|± |0.1000|
| - college_biology | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - college_chemistry | 0|none | 0|acc |↑ |0.1000|± |0.1000|
| - college_computer_science | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - college_mathematics | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - college_physics | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - computer_security | 0|none | 0|acc |↑ |0.5000|± |0.1667|
| - conceptual_physics | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - electrical_engineering | 0|none | 0|acc |↑ |0.4000|± |0.1633|
| - elementary_mathematics | 0|none | 0|acc |↑ |0.0000|± |0.0000|
| - high_school_biology | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - high_school_chemistry | 0|none | 0|acc |↑ |0.4000|± |0.1633|
| - high_school_computer_science| 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - high_school_mathematics | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - high_school_physics | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - high_school_statistics | 0|none | 0|acc |↑ |0.0000|± |0.0000|
| - machine_learning | 0|none | 0|acc |↑ |0.3000|± |0.1528|
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|-------:|---|------|
-|wikitext| 2|none | 0|bits_per_byte |↓ | 1.3394|± |N/A |
-| | |none | 0|byte_perplexity|↓ | 2.5304|± |N/A |
-| | |none | 0|word_perplexity|↓ |130.4812|± |N/A |
\ No newline at end of file
+|wikitext| 2|none | 0|bits_per_byte |↓ | 1.3394|± | N/A|
+| | |none | 0|byte_perplexity|↓ | 2.5304|± | N/A|
+| | |none | 0|word_perplexity|↓ |130.4801|± | N/A|
\ No newline at end of file
group: test-1
group_alias: test 1
task:
- piqa # string task
- ai2_arc # string tag
- task: super-glue-lm-eval-v1 # Should this be spread out?
num_fewshot: 3
- task: swag # dict registered task
num_fewshot: 2
- task: mmlu
num_fewshot: 5
- group: nli-tasks # dict group
task:
- anli
- boolq
- sglue_rte
num_fewshot: 4
metric_list:
- metric: brier_score
- task: sciq # dict registered task duplicate
task_alias: sciq 2-shot
num_fewshot: 2
- task: sciq # dict registered task duplicate
task_alias: sciq 4-shot
num_fewshot: 4
- task: sciq # dict registered task duplicate
task_alias: sciq 6-shot
num_fewshot: 6
- task: siqa_custom # dict task
dataset_path: social_i_qa
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "Question: {{context}} {{question}}\nAnswer:"
target_delimiter: " "
doc_to_choice:
- "{{answerA}}"
- "{{answerB}}"
- "{{answerC}}"
doc_to_target: "{{ (label|int) - 1 }}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
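The `test-1` config above mixes bare task names, per-task override dicts, a nested group, and duplicated tasks distinguished by alias. A hedged sketch of how such a mixed `task:` list could be flattened into `(task, overrides)` pairs; the harness does this via `ConfigurableGroup` and `get_subtask_list`, and the function below is illustrative only:

```python
# Illustrative flattening of a mixed group `task:` list into
# (task_name, overrides) pairs, with nested-group settings inherited.

def flatten_tasks(entries, inherited=None):
    inherited = inherited or {}
    out = []
    for entry in entries:
        if isinstance(entry, str):  # bare task or tag name
            out.append((entry, dict(inherited)))
        elif "group" in entry:      # nested group: recurse, pushing its config down
            overrides = {k: v for k, v in entry.items()
                         if k not in ("group", "task")}
            out.extend(flatten_tasks(entry["task"], {**inherited, **overrides}))
        else:                       # task dict with per-task overrides
            overrides = {k: v for k, v in entry.items() if k != "task"}
            out.append((entry["task"], {**inherited, **overrides}))
    return out

cfg = [
    "piqa",
    {"task": "swag", "num_fewshot": 2},
    {"group": "nli-tasks", "task": ["anli", "boolq"], "num_fewshot": 4},
]
print(flatten_tasks(cfg))
```

Note how the nested group's `num_fewshot: 4` is inherited by `anli` and `boolq`, mirroring how the `nli-tasks` sub-group in the YAML above applies its settings to its members.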