Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-evaluation-harness into multiprompt

Commit 88486e57 authored Jul 05, 2024 by lintangsutawika
20 changed files
--- a/lm_eval/tasks/arc_mt/arc_challenge_mt_sv.yaml
+++ b/lm_eval/tasks/arc_mt/arc_challenge_mt_sv.yaml
+include: arc_challenge_mt_fi.yaml
+task: arc_challenge_mt_sv
+dataset_name: sv
--- a/lm_eval/tasks/arithmetic/README.md
+++ b/lm_eval/tasks/arithmetic/README.md
@@ -27,9 +27,9 @@ Homepage: https://github.com/openai/gpt-3/tree/master/data
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

-#### Groups
+#### Tags

 * `arithmetic`: Evaluates `1dc` to `5ds`


--- a/lm_eval/tasks/arithmetic/arithmetic_1dc.yaml
+++ b/lm_eval/tasks/arithmetic/arithmetic_1dc.yaml
-group:
+tag:
  - arithmetic
 task: arithmetic_1dc
 dataset_path: EleutherAI/arithmetic

--- a/lm_eval/tasks/asdiv/README.md
+++ b/lm_eval/tasks/asdiv/README.md
@@ -32,7 +32,7 @@ Homepage: https://github.com/chaochun/nlu-asdiv-dataset
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

 #### Groups


--- a/lm_eval/tasks/babi/README.md
+++ b/lm_eval/tasks/babi/README.md
@@ -21,12 +21,16 @@ Homepage: https://github.com/facebookarchive/bAbI-tasks
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

 #### Groups

 * Not part of a group yet

+#### Tags
+
+* No tags applied.
+
 #### Tasks

 * `babi`

--- a/lm_eval/tasks/basqueglue/README.md
+++ b/lm_eval/tasks/basqueglue/README.md
@@ -43,20 +43,24 @@ Homepage: `https://github.com/hitz-zentroa/latxa`
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

 #### Groups

-* `basque-glue`: First version of the implementation
+None.
+
+#### Tags
+
+* `basque-glue`: First version of the implementation. Calls all subtasks, but does not average.

 #### Tasks

 * `bhtc_v2`: Topic classification of news extracts with 12 categories.
-* `bec`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections.
+* `bec2016eu`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections.
 * `vaxx_stance`: Stance detection on tweets around the anti-vaccine movement.
 * `qnlieu`: Q&A NLI as in [glue/qnli](../glue/qnli).
 * `wiceu`: Word-in-Context as in [super_glue/wic](../super_glue/wic).
-* `epec_korref_bin`: Correference detection as in [super_glue/wsc](../super_glue/wsc).
+* `epec_koref_bin`: Correference detection as in [super_glue/wsc](../super_glue/wsc).

 ### Checklist


--- a/lm_eval/tasks/basqueglue/bec.yaml
+++ b/lm_eval/tasks/basqueglue/bec.yaml
-group: basque-glue
+tag: basque-glue
 task: bec2016eu
 dataset_path: orai-nlp/basqueGLUE
 dataset_name: bec
@@ -13,4 +13,4 @@ metric_list:
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
--- a/lm_eval/tasks/basqueglue/bhtc.yaml
+++ b/lm_eval/tasks/basqueglue/bhtc.yaml
-group: basque-glue
+tag: basque-glue
 task: bhtc_v2
 dataset_path: orai-nlp/basqueGLUE
 dataset_name: bhtc
@@ -13,4 +13,4 @@ metric_list:
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
--- a/lm_eval/tasks/basqueglue/coref.yaml
+++ b/lm_eval/tasks/basqueglue/coref.yaml
-group: basque-glue
+tag: basque-glue
 task: epec_koref_bin
 dataset_path: orai-nlp/basqueGLUE
 dataset_name: coref
@@ -13,4 +13,4 @@ metric_list:
    aggregation: mean
    higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
--- a/lm_eval/tasks/basqueglue/qnli.yaml
+++ b/lm_eval/tasks/basqueglue/qnli.yaml
-group: basque-glue
+tag: basque-glue
 task: qnlieu
 dataset_path: orai-nlp/basqueGLUE
 dataset_name: qnli
@@ -13,4 +13,4 @@ metric_list:
    aggregation: mean
    higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
--- a/lm_eval/tasks/basqueglue/vaxx.yaml
+++ b/lm_eval/tasks/basqueglue/vaxx.yaml
-group: basque-glue
+tag: basque-glue
 task: vaxx_stance
 dataset_path: orai-nlp/basqueGLUE
 dataset_name: vaxx
@@ -13,4 +13,4 @@ metric_list:
    aggregation: !function utils.vaxx_f1_score
    higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
--- a/lm_eval/tasks/basqueglue/wic.yaml
+++ b/lm_eval/tasks/basqueglue/wic.yaml
-group: basque-glue
+tag: basque-glue
 task: wiceu
 dataset_path: orai-nlp/basqueGLUE
 dataset_name: wic
@@ -14,4 +14,4 @@ metric_list:
    aggregation: mean
    higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
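
The `metadata` change repeated across the basqueglue configs above swaps a YAML sequence (`- version: 1.0`) for a plain mapping (`version: 1.0`). A minimal sketch of why that matters, assuming a consumer that reads the version with a dictionary lookup; the helper `get_version` is hypothetical and is not taken from the harness:

```python
# What a YAML parser produces for each form of the `metadata` block.
# Old form ("- version: 1.0"): a list holding one mapping.
old_style = [{"version": 1.0}]
# New form ("version: 1.0"): a plain mapping.
new_style = {"version": 1.0}


def get_version(metadata):
    """Hypothetical reader: return the version only when metadata is a mapping."""
    if isinstance(metadata, dict):
        return metadata.get("version")
    # A list has no "version" key, so the old form yields nothing here.
    return None


print(get_version(old_style))
print(get_version(new_style))
```

Under this assumption the old list form silently hides the version, while the mapping form exposes it directly.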
--- a/lm_eval/tasks/bbh/README.md
+++ b/lm_eval/tasks/bbh/README.md
@@ -21,15 +21,19 @@ Homepage: https://github.com/suzgunmirac/BIG-Bench-Hard
 }
 ```

-### Groups and Tasks
+### Groups, Tags, and Tasks

 #### Groups

+- `bbh`: is the same as `bbh_cot_fewshot`.
 - `bbh_zeroshot`
 - `bbh_fewshot`
 - `bbh_cot_fewshot`
 - `bbh_cot_zeroshot`

+#### Tags
+
+None.

 #### Tasks


--- a/lm_eval/tasks/bbh/_generate_configs.py
+++ b/lm_eval/tasks/bbh/_generate_configs.py
 """
 Take in a YAML, and output all other splits with this YAML
 """
+
 import argparse
 import os
 import re

--- a/lm_eval/tasks/bbh/cot_fewshot/_bbh.yaml
+++ b/lm_eval/tasks/bbh/cot_fewshot/_bbh.yaml
+group: bbh
+task:
+  - bbh_cot_fewshot_boolean_expressions
+  - bbh_cot_fewshot_causal_judgement
+  - bbh_cot_fewshot_date_understanding
+  - bbh_cot_fewshot_disambiguation_qa
+  - bbh_cot_fewshot_dyck_languages
+  - bbh_cot_fewshot_formal_fallacies
+  - bbh_cot_fewshot_geometric_shapes
+  - bbh_cot_fewshot_hyperbaton
+  - bbh_cot_fewshot_logical_deduction_five_objects
+  - bbh_cot_fewshot_logical_deduction_seven_objects
+  - bbh_cot_fewshot_logical_deduction_three_objects
+  - bbh_cot_fewshot_movie_recommendation
+  - bbh_cot_fewshot_multistep_arithmetic_two
+  - bbh_cot_fewshot_navigate
+  - bbh_cot_fewshot_object_counting
+  - bbh_cot_fewshot_penguins_in_a_table
+  - bbh_cot_fewshot_reasoning_about_colored_objects
+  - bbh_cot_fewshot_ruin_names
+  - bbh_cot_fewshot_salient_translation_error_detection
+  - bbh_cot_fewshot_snarks
+  - bbh_cot_fewshot_sports_understanding
+  - bbh_cot_fewshot_temporal_sequences
+  - bbh_cot_fewshot_tracking_shuffled_objects_five_objects
+  - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects
+  - bbh_cot_fewshot_tracking_shuffled_objects_three_objects
+  - bbh_cot_fewshot_web_of_lies
+  - bbh_cot_fewshot_word_sorting
+aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
+    filter_list: get-answer
+metadata:
+  version: 2.0
--- a/lm_eval/tasks/bbh/cot_fewshot/_bbh_cot_fewshot.yaml
+++ b/lm_eval/tasks/bbh/cot_fewshot/_bbh_cot_fewshot.yaml
+group: bbh_cot_fewshot
+task:
+  - bbh_cot_fewshot_boolean_expressions
+  - bbh_cot_fewshot_causal_judgement
+  - bbh_cot_fewshot_date_understanding
+  - bbh_cot_fewshot_disambiguation_qa
+  - bbh_cot_fewshot_dyck_languages
+  - bbh_cot_fewshot_formal_fallacies
+  - bbh_cot_fewshot_geometric_shapes
+  - bbh_cot_fewshot_hyperbaton
+  - bbh_cot_fewshot_logical_deduction_five_objects
+  - bbh_cot_fewshot_logical_deduction_seven_objects
+  - bbh_cot_fewshot_logical_deduction_three_objects
+  - bbh_cot_fewshot_movie_recommendation
+  - bbh_cot_fewshot_multistep_arithmetic_two
+  - bbh_cot_fewshot_navigate
+  - bbh_cot_fewshot_object_counting
+  - bbh_cot_fewshot_penguins_in_a_table
+  - bbh_cot_fewshot_reasoning_about_colored_objects
+  - bbh_cot_fewshot_ruin_names
+  - bbh_cot_fewshot_salient_translation_error_detection
+  - bbh_cot_fewshot_snarks
+  - bbh_cot_fewshot_sports_understanding
+  - bbh_cot_fewshot_temporal_sequences
+  - bbh_cot_fewshot_tracking_shuffled_objects_five_objects
+  - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects
+  - bbh_cot_fewshot_tracking_shuffled_objects_three_objects
+  - bbh_cot_fewshot_web_of_lies
+  - bbh_cot_fewshot_word_sorting
+aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
+    filter_list: get-answer
+metadata:
+  version: 2.0
--- a/lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml
+++ b/lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml
-group:
- bbh
- bbh_cot_fewshot
 dataset_path: lukaemon/bbh
 output_type: generate_until
 test_split: test

--- a/lm_eval/tasks/bbh/cot_zeroshot/_bbh_cot_zeroshot.yaml
+++ b/lm_eval/tasks/bbh/cot_zeroshot/_bbh_cot_zeroshot.yaml
+group: bbh_cot_zeroshot
+task:
+  - bbh_cot_zeroshot_boolean_expressions
+  - bbh_cot_zeroshot_causal_judgement
+  - bbh_cot_zeroshot_date_understanding
+  - bbh_cot_zeroshot_disambiguation_qa
+  - bbh_cot_zeroshot_dyck_languages
+  - bbh_cot_zeroshot_formal_fallacies
+  - bbh_cot_zeroshot_geometric_shapes
+  - bbh_cot_zeroshot_hyperbaton
+  - bbh_cot_zeroshot_logical_deduction_five_objects
+  - bbh_cot_zeroshot_logical_deduction_seven_objects
+  - bbh_cot_zeroshot_logical_deduction_three_objects
+  - bbh_cot_zeroshot_movie_recommendation
+  - bbh_cot_zeroshot_multistep_arithmetic_two
+  - bbh_cot_zeroshot_navigate
+  - bbh_cot_zeroshot_object_counting
+  - bbh_cot_zeroshot_penguins_in_a_table
+  - bbh_cot_zeroshot_reasoning_about_colored_objects
+  - bbh_cot_zeroshot_ruin_names
+  - bbh_cot_zeroshot_salient_translation_error_detection
+  - bbh_cot_zeroshot_snarks
+  - bbh_cot_zeroshot_sports_understanding
+  - bbh_cot_zeroshot_temporal_sequences
+  - bbh_cot_zeroshot_tracking_shuffled_objects_five_objects
+  - bbh_cot_zeroshot_tracking_shuffled_objects_seven_objects
+  - bbh_cot_zeroshot_tracking_shuffled_objects_three_objects
+  - bbh_cot_zeroshot_web_of_lies
+  - bbh_cot_zeroshot_word_sorting
+aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
+    filter_list: flexible-extract
+metadata:
+  version: 2.0
--- a/lm_eval/tasks/bbh/cot_zeroshot/_cot_zeroshot_template_yaml
+++ b/lm_eval/tasks/bbh/cot_zeroshot/_cot_zeroshot_template_yaml
-group: bbh_cot_zeroshot
 dataset_path: lukaemon/bbh
 output_type: generate_until
 test_split: test

--- a/lm_eval/tasks/bbh/fewshot/_bbh_fewshot.yaml
+++ b/lm_eval/tasks/bbh/fewshot/_bbh_fewshot.yaml
+group: bbh_fewshot
+task:
+  - bbh_fewshot_boolean_expressions
+  - bbh_fewshot_causal_judgement
+  - bbh_fewshot_date_understanding
+  - bbh_fewshot_disambiguation_qa
+  - bbh_fewshot_dyck_languages
+  - bbh_fewshot_formal_fallacies
+  - bbh_fewshot_geometric_shapes
+  - bbh_fewshot_hyperbaton
+  - bbh_fewshot_logical_deduction_five_objects
+  - bbh_fewshot_logical_deduction_seven_objects
+  - bbh_fewshot_logical_deduction_three_objects
+  - bbh_fewshot_movie_recommendation
+  - bbh_fewshot_multistep_arithmetic_two
+  - bbh_fewshot_navigate
+  - bbh_fewshot_object_counting
+  - bbh_fewshot_penguins_in_a_table
+  - bbh_fewshot_reasoning_about_colored_objects
+  - bbh_fewshot_ruin_names
+  - bbh_fewshot_salient_translation_error_detection
+  - bbh_fewshot_snarks
+  - bbh_fewshot_sports_understanding
+  - bbh_fewshot_temporal_sequences
+  - bbh_fewshot_tracking_shuffled_objects_five_objects
+  - bbh_fewshot_tracking_shuffled_objects_seven_objects
+  - bbh_fewshot_tracking_shuffled_objects_three_objects
+  - bbh_fewshot_web_of_lies
+  - bbh_fewshot_word_sorting
+aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 2.0
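
The `aggregate_metric_list` entries in the new group configs above combine `aggregation: mean` with `weight_by_size: true`. A minimal sketch of what a size-weighted mean over subtask scores looks like, assuming each subtask reports one score and one sample count; the function name `aggregate_mean` and its signature are illustrative assumptions, not the harness's own implementation:

```python
from typing import Sequence


def aggregate_mean(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask scores into one group score (illustrative only)."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not scores:
        raise ValueError("at least one subtask score is required")
    if not weight_by_size:
        # Unweighted: every subtask counts equally, whatever its size.
        return sum(scores) / len(scores)
    total = sum(sizes)
    if total == 0:
        raise ValueError("total sample count must be positive")
    # Weighted: each subtask contributes in proportion to its sample count.
    return sum(score * n for score, n in zip(scores, sizes)) / total


# Two hypothetical subtasks: 300 samples at 0.5 and 100 samples at 1.0.
weighted = aggregate_mean([0.5, 1.0], [300, 100], weight_by_size=True)
unweighted = aggregate_mean([0.5, 1.0], [300, 100], weight_by_size=False)
print(weighted, unweighted)
```

With weighting, the larger subtask pulls the group score toward its own result; without it, both subtasks count the same.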