Unverified commit 517aadc4 authored by Lintang Sutawika, committed by GitHub

Group agg rework (#1741)



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
parent 5a7ed3ee
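The diff below replaces many former `group:` registrations with `tag:` (see the "replace group to tag" and "add description regarding tags replacing group" commits above). Roughly: a tag is a bare alias that expands to its member tasks, while a group is a configurable object that can also define how subtask scores are aggregated. A minimal sketch of that distinction, with illustrative member-task names and data structures (not the harness internals):

```python
# Hedged sketch of the tag/group distinction introduced in this PR.
# The registry shapes and member task names below are illustrative only.

tags = {"kmmlu": ["kmmlu_accounting", "kmmlu_biology"]}  # alias only

groups = {
    "mmlu_stem": {
        "task": ["abstract_algebra", "anatomy"],
        # groups may additionally carry config, e.g. an aggregate metric:
        "aggregate_metric_list": [
            {"metric": "acc", "aggregation": "mean", "weight_by_size": True}
        ],
    }
}

def expand(name):
    """Resolve a name to the list of concrete tasks it selects."""
    if name in tags:
        return tags[name]
    if name in groups:
        return groups[name]["task"]
    return [name]  # already a concrete task name

print(expand("kmmlu"))      # tag: tasks only, no group-level score reported
print(expand("mmlu_stem"))  # group: tasks, plus an aggregate row for the group
```

Running a tag reports only its member tasks' rows; running a group can additionally report an aggregate row for the group itself, which is why MMLU stays a group while KMMLU becomes a tag in this PR.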
include: xnli_eu.yaml
-group: xnli_eu_mt_native
+tag: xnli_eu_mt_native
task: xnli_eu_native
training_split: null
validation_split: null
......
group: xstorycloze
task:
- xstorycloze_ar
- xstorycloze_en
- xstorycloze_es
- xstorycloze_eu
- xstorycloze_hi
- xstorycloze_id
- xstorycloze_my
- xstorycloze_ru
- xstorycloze_sw
- xstorycloze_te
- xstorycloze_zh
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
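The `aggregate_metric_list` entry above requests a size-weighted mean of the subtasks' `acc` scores. A minimal sketch of that computation with made-up inputs (the real signature of the harness's `aggregate_subtask_metrics` may differ):

```python
# Hedged sketch of `aggregation: mean` with `weight_by_size: true`:
# each subtask's score is weighted by its number of examples.

def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    """Aggregate per-subtask metric values into one group-level score."""
    if not weight_by_size:
        sizes = [1] * len(metrics)  # plain unweighted mean
    total = sum(sizes)
    return sum(m * s for m, s in zip(metrics, sizes)) / total

# e.g. three subtasks with 100, 50, and 50 examples:
score = aggregate_subtask_metrics([0.8, 0.6, 0.7], [100, 50, 50])
print(score)  # weighted: (0.8*100 + 0.6*50 + 0.7*50) / 200 = 0.725
```

With `weight_by_size: false` the same inputs would average to 0.7, so the flag matters whenever subtask sizes are uneven.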
group: xstorycloze
task: xstorycloze_ar
dataset_path: juletxara/xstory_cloze
dataset_name: ar
......
group: xwinograd
task:
- xwinograd_en
- xwinograd_fr
- xwinograd_jp
- xwinograd_pt
- xwinograd_ru
- xwinograd_zh
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group:
- xwinograd
dataset_path: Muennighoff/xwinograd
dataset_name: null # Overridden by language-specific config.
output_type: multiple_choice
......
@@ -308,7 +308,7 @@ class Reorderer:
return res
-def make_table(result_dict, column: str = "results", sort_results: bool = True):
+def make_table(result_dict, column: str = "results", sort_results: bool = False):
"""Generate table of results."""
from pytablewriter import LatexTableWriter, MarkdownTableWriter
@@ -338,20 +338,21 @@ def make_table(result_dict, column: str = "results", sort_results: bool = True):
keys = result_dict[column].keys()
if sort_results:
-# sort entries alphabetically
+# sort entries alphabetically by task or group name.
+# NOTE: we default here to false, because order matters for multi-level table printing a la mmlu.
+# sorting here would mess that up
keys = sorted(keys)
for k in keys:
dic = result_dict[column][k]
-version = result_dict["versions"].get(k, "N/A")
-n = str(result_dict["n-shot"][k])
+version = result_dict["versions"].get(k, " N/A")
+n = str(result_dict.get("n-shot", " ").get(k, " "))
higher_is_better = result_dict.get("higher_is_better", {}).get(k, {})
if "alias" in dic:
k = dic.pop("alias")
metric_items = dic.items()
-if sort_results:
-    metric_items = sorted(metric_items)
+metric_items = sorted(metric_items)
for (mf), v in metric_items:
m, _, f = mf.partition(",")
@@ -362,8 +363,7 @@ def make_table(result_dict, column: str = "results", sort_results: bool = True):
if m + "_stderr" + "," + f in dic:
se = dic[m + "_stderr" + "," + f]
-if se != "N/A":
-    se = "%.4f" % se
+se = " N/A" if se == "N/A" else "%.4f" % se
values.append([k, version, f, n, m, hib, "%.4f" % v, "±", se])
else:
values.append([k, version, f, n, m, hib, "%.4f" % v, "", ""])
......
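The flip to `sort_results: bool = False` above exists because group tables print a parent row immediately followed by its indented ` - subtask` rows, and alphabetical sorting would tear those apart. A small illustration (row names are examples only, not harness output):

```python
# Rows as emitted for a multi-level table: each parent group is
# immediately followed by its " - " children, a la the mmlu layout.
rows = ["mmlu_stem", " - abstract_algebra", " - anatomy", "ai2_arc", " - arc_easy"]

print(rows)          # hierarchical order: children stay under their parent
print(sorted(rows))  # leading-space children all sort to the front, orphaned
```

Because the leading-space child rows sort before any parent name, sorting would strand every subtask row away from its group header, which is why the default is now off and callers must opt in.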
@@ -17,12 +17,16 @@ Homepage: `homepage to the benchmark's website goes here, if applicable`
BibTeX-formatted citation goes here
```
-### Groups and Tasks
+### Groups, Tags, and Tasks
#### Groups
* `group_name`: `Short description`
+#### Tags
+* `tag_name`: `Short description`
#### Tasks
* `task_name`: `1-sentence description of what this particular task does`
......
@@ -90,7 +90,7 @@ def test_evaluator(
"pretrained=EleutherAI/pythia-14m,dtype=float32,device=cpu",
),
(
-["mmlu_abstract_algebra", "mmlu_global_facts", "mmlu_public_relations"],
+["mmlu_stem"],
10,
"hf",
"pretrained=EleutherAI/pythia-14m,dtype=float32,device=cpu",
......
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|----------------|-------|------|-----:|--------|---|----:|---|------|
|ai2_arc |N/A |none | 0|acc |↑ | 0.15|± |N/A |
| | |none | 0|acc_norm|↑ | 0.05|± |N/A |
| - arc_challenge| 1|none | 0|acc |↑ | 0.00|± |N/A |
| | |none | 0|acc_norm|↑ | 0.00|± |N/A |
| - arc_easy | 1|none | 0|acc |↑ | 0.30|± |N/A |
| | |none | 0|acc_norm|↑ | 0.10|± |N/A |
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|-------------|------:|------|-----:|--------|---|----:|---|-----:|
|arc_challenge| 1|none | 0|acc |↑ | 0.0|± |0.0000|
| | |none | 0|acc_norm|↑ | 0.0|± |0.0000|
|arc_easy | 1|none | 0|acc |↑ | 0.3|± |0.1528|
| | |none | 0|acc_norm|↑ | 0.1|± |0.1000|
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------------|------:|------|-----:|----------|---|-------:|---|------|
|lambada_openai| 1|none | 0|acc |↑ | 0.1000|± |N/A |
| | |none | 0|perplexity|↓ |605.4879|± |N/A |
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | | Value | | Stderr |
|--------------|------:|------|-----:|----------|---|-------:|---|--------:|
|lambada_openai| 1|none | 0|acc |↑ | 0.1000|± | 0.1000|
| | |none | 0|perplexity|↓ |605.3866|± |1636.6987|
\ No newline at end of file
| Tasks |Version|Filter|n-shot|Metric| |Value| |Stderr|
|----------------|------:|------|-----:|------|---|----:|---|------|
|abstract_algebra| 0|none | 0|acc |↑ | 0.2|± |N/A |
|global_facts | 0|none | 0|acc |↑ | 0.2|± |N/A |
|public_relations| 0|none | 0|acc |↑ | 0.2|± |N/A |
\ No newline at end of file
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|stem | 1|none | |acc |↑ |0.2474|± |0.0315|
| - abstract_algebra | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - anatomy | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - astronomy | 0|none | 0|acc |↑ |0.1000|± |0.1000|
| - college_biology | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - college_chemistry | 0|none | 0|acc |↑ |0.1000|± |0.1000|
| - college_computer_science | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - college_mathematics | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - college_physics | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - computer_security | 0|none | 0|acc |↑ |0.5000|± |0.1667|
| - conceptual_physics | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - electrical_engineering | 0|none | 0|acc |↑ |0.4000|± |0.1633|
| - elementary_mathematics | 0|none | 0|acc |↑ |0.0000|± |0.0000|
| - high_school_biology | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - high_school_chemistry | 0|none | 0|acc |↑ |0.4000|± |0.1633|
| - high_school_computer_science| 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - high_school_mathematics | 0|none | 0|acc |↑ |0.2000|± |0.1333|
| - high_school_physics | 0|none | 0|acc |↑ |0.3000|± |0.1528|
| - high_school_statistics | 0|none | 0|acc |↑ |0.0000|± |0.0000|
| - machine_learning | 0|none | 0|acc |↑ |0.3000|± |0.1528|
\ No newline at end of file
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|-------:|---|------|
-|wikitext| 2|none | 0|bits_per_byte |↓ | 1.3394|± |N/A |
-| | |none | 0|byte_perplexity|↓ | 2.5304|± |N/A |
-| | |none | 0|word_perplexity|↓ |130.4812|± |N/A |
\ No newline at end of file
+|wikitext| 2|none | 0|bits_per_byte |↓ | 1.3394|± | N/A|
+| | |none | 0|byte_perplexity|↓ | 2.5304|± | N/A|
+| | |none | 0|word_perplexity|↓ |130.4801|± | N/A|
\ No newline at end of file
group: test-1
group_alias: test 1
task:
- piqa # string task
- ai2_arc # string tag
- task: super-glue-lm-eval-v1 # Should this be spread out?
num_fewshot: 3
- task: swag # dict registered task
num_fewshot: 2
- task: mmlu
num_fewshot: 5
- group: nli-tasks # dict group
task:
- anli
- boolq
- sglue_rte
num_fewshot: 4
metric_list:
- metric: brier_score
- task: sciq # dict registered task duplicate
task_alias: sciq 2-shot
num_fewshot: 2
- task: sciq # dict registered task duplicate
task_alias: sciq 4-shot
num_fewshot: 4
- task: sciq # dict registered task duplicate
task_alias: sciq 6-shot
num_fewshot: 6
- task: siqa_custom # dict task
dataset_path: social_i_qa
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "Question: {{context}} {{question}}\nAnswer:"
target_delimiter: " "
doc_to_choice:
- "{{answerA}}"
- "{{answerB}}"
- "{{answerC}}"
doc_to_target: "{{ (label|int) - 1 }}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
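The `test-1` config above mixes bare task names, per-task override dicts, a nested group, and duplicated tasks distinguished by alias. A hedged sketch of how such a mixed `task:` list could be flattened into `(task, overrides)` pairs; the harness does this via `ConfigurableGroup` and `get_subtask_list`, and the function below is illustrative only:

```python
# Illustrative flattening of a mixed group `task:` list into
# (task_name, overrides) pairs, with nested-group settings inherited.

def flatten_tasks(entries, inherited=None):
    inherited = inherited or {}
    out = []
    for entry in entries:
        if isinstance(entry, str):  # bare task or tag name
            out.append((entry, dict(inherited)))
        elif "group" in entry:      # nested group: recurse, pushing its config down
            overrides = {k: v for k, v in entry.items()
                         if k not in ("group", "task")}
            out.extend(flatten_tasks(entry["task"], {**inherited, **overrides}))
        else:                       # task dict with per-task overrides
            overrides = {k: v for k, v in entry.items() if k != "task"}
            out.append((entry["task"], {**inherited, **overrides}))
    return out

cfg = [
    "piqa",
    {"task": "swag", "num_fewshot": 2},
    {"group": "nli-tasks", "task": ["anli", "boolq"], "num_fewshot": 4},
]
print(flatten_tasks(cfg))
```

Note how the nested group's `num_fewshot: 4` is inherited by `anli` and `boolq`, mirroring how the `nli-tasks` sub-group in the YAML above applies its settings to its members.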