Commit 88486e57 authored by lintangsutawika

Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-evaluation-harness into multiprompt
parents 5971f2ca ba73d131
include: arc_challenge_mt_fi.yaml
task: arc_challenge_mt_sv
dataset_name: sv
...@@ -27,9 +27,9 @@ Homepage: https://github.com/openai/gpt-3/tree/master/data ...@@ -27,9 +27,9 @@ Homepage: https://github.com/openai/gpt-3/tree/master/data
} }
``` ```
### Groups and Tasks ### Groups, Tags, and Tasks
#### Groups #### Tags
* `arithmetic`: Evaluates `1dc` to `5ds` * `arithmetic`: Evaluates `1dc` to `5ds`
......
group: tag:
- arithmetic - arithmetic
task: arithmetic_1dc task: arithmetic_1dc
dataset_path: EleutherAI/arithmetic dataset_path: EleutherAI/arithmetic
......
...@@ -32,7 +32,7 @@ Homepage: https://github.com/chaochun/nlu-asdiv-dataset ...@@ -32,7 +32,7 @@ Homepage: https://github.com/chaochun/nlu-asdiv-dataset
} }
``` ```
### Groups and Tasks ### Groups, Tags, and Tasks
#### Groups #### Groups
......
...@@ -21,12 +21,16 @@ Homepage: https://github.com/facebookarchive/bAbI-tasks ...@@ -21,12 +21,16 @@ Homepage: https://github.com/facebookarchive/bAbI-tasks
} }
``` ```
### Groups and Tasks ### Groups, Tags, and Tasks
#### Groups #### Groups
* Not part of a group yet * Not part of a group yet
#### Tags
* No tags applied.
#### Tasks #### Tasks
* `babi` * `babi`
......
...@@ -43,20 +43,24 @@ Homepage: `https://github.com/hitz-zentroa/latxa` ...@@ -43,20 +43,24 @@ Homepage: `https://github.com/hitz-zentroa/latxa`
} }
``` ```
### Groups and Tasks ### Groups, Tags, and Tasks
#### Groups #### Groups
* `basque-glue`: First version of the implementation None.
#### Tags
* `basque-glue`: First version of the implementation. Calls all subtasks, but does not average.
#### Tasks #### Tasks
* `bhtc_v2`: Topic classification of news extracts with 12 categories. * `bhtc_v2`: Topic classification of news extracts with 12 categories.
* `bec`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections. * `bec2016eu`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections.
* `vaxx_stance`: Stance detection on tweets around the anti-vaccine movement. * `vaxx_stance`: Stance detection on tweets around the anti-vaccine movement.
* `qnlieu`: Q&A NLI as in [glue/qnli](../glue/qnli). * `qnlieu`: Q&A NLI as in [glue/qnli](../glue/qnli).
* `wiceu`: Word-in-Context as in [super_glue/wic](../super_glue/wic). * `wiceu`: Word-in-Context as in [super_glue/wic](../super_glue/wic).
* `epec_korref_bin`: Correference detection as in [super_glue/wsc](../super_glue/wsc). * `epec_koref_bin`: Correference detection as in [super_glue/wsc](../super_glue/wsc).
### Checklist ### Checklist
......
group: basque-glue tag: basque-glue
task: bec2016eu task: bec2016eu
dataset_path: orai-nlp/basqueGLUE dataset_path: orai-nlp/basqueGLUE
dataset_name: bec dataset_name: bec
...@@ -13,4 +13,4 @@ metric_list: ...@@ -13,4 +13,4 @@ metric_list:
aggregation: !function utils.micro_f1_score aggregation: !function utils.micro_f1_score
higher_is_better: true higher_is_better: true
metadata: metadata:
- version: 1.0 version: 1.0
group: basque-glue tag: basque-glue
task: bhtc_v2 task: bhtc_v2
dataset_path: orai-nlp/basqueGLUE dataset_path: orai-nlp/basqueGLUE
dataset_name: bhtc dataset_name: bhtc
...@@ -13,4 +13,4 @@ metric_list: ...@@ -13,4 +13,4 @@ metric_list:
aggregation: !function utils.micro_f1_score aggregation: !function utils.micro_f1_score
higher_is_better: true higher_is_better: true
metadata: metadata:
- version: 1.0 version: 1.0
group: basque-glue tag: basque-glue
task: epec_koref_bin task: epec_koref_bin
dataset_path: orai-nlp/basqueGLUE dataset_path: orai-nlp/basqueGLUE
dataset_name: coref dataset_name: coref
...@@ -13,4 +13,4 @@ metric_list: ...@@ -13,4 +13,4 @@ metric_list:
aggregation: mean aggregation: mean
higher_is_better: true higher_is_better: true
metadata: metadata:
- version: 1.0 version: 1.0
group: basque-glue tag: basque-glue
task: qnlieu task: qnlieu
dataset_path: orai-nlp/basqueGLUE dataset_path: orai-nlp/basqueGLUE
dataset_name: qnli dataset_name: qnli
...@@ -13,4 +13,4 @@ metric_list: ...@@ -13,4 +13,4 @@ metric_list:
aggregation: mean aggregation: mean
higher_is_better: true higher_is_better: true
metadata: metadata:
- version: 1.0 version: 1.0
group: basque-glue tag: basque-glue
task: vaxx_stance task: vaxx_stance
dataset_path: orai-nlp/basqueGLUE dataset_path: orai-nlp/basqueGLUE
dataset_name: vaxx dataset_name: vaxx
...@@ -13,4 +13,4 @@ metric_list: ...@@ -13,4 +13,4 @@ metric_list:
aggregation: !function utils.vaxx_f1_score aggregation: !function utils.vaxx_f1_score
higher_is_better: true higher_is_better: true
metadata: metadata:
- version: 1.0 version: 1.0
group: basque-glue tag: basque-glue
task: wiceu task: wiceu
dataset_path: orai-nlp/basqueGLUE dataset_path: orai-nlp/basqueGLUE
dataset_name: wic dataset_name: wic
...@@ -14,4 +14,4 @@ metric_list: ...@@ -14,4 +14,4 @@ metric_list:
aggregation: mean aggregation: mean
higher_is_better: true higher_is_better: true
metadata: metadata:
- version: 1.0 version: 1.0
...@@ -21,15 +21,19 @@ Homepage: https://github.com/suzgunmirac/BIG-Bench-Hard ...@@ -21,15 +21,19 @@ Homepage: https://github.com/suzgunmirac/BIG-Bench-Hard
} }
``` ```
### Groups and Tasks ### Groups, Tags, and Tasks
#### Groups #### Groups
- `bbh`: is the same as `bbh_cot_fewshot`.
- `bbh_zeroshot` - `bbh_zeroshot`
- `bbh_fewshot` - `bbh_fewshot`
- `bbh_cot_fewshot` - `bbh_cot_fewshot`
- `bbh_cot_zeroshot` - `bbh_cot_zeroshot`
#### Tags
None.
#### Tasks #### Tasks
......
""" """
Take in a YAML, and output all other splits with this YAML Take in a YAML, and output all other splits with this YAML
""" """
import argparse import argparse
import os import os
import re import re
......
group: bbh
task:
- bbh_cot_fewshot_boolean_expressions
- bbh_cot_fewshot_causal_judgement
- bbh_cot_fewshot_date_understanding
- bbh_cot_fewshot_disambiguation_qa
- bbh_cot_fewshot_dyck_languages
- bbh_cot_fewshot_formal_fallacies
- bbh_cot_fewshot_geometric_shapes
- bbh_cot_fewshot_hyperbaton
- bbh_cot_fewshot_logical_deduction_five_objects
- bbh_cot_fewshot_logical_deduction_seven_objects
- bbh_cot_fewshot_logical_deduction_three_objects
- bbh_cot_fewshot_movie_recommendation
- bbh_cot_fewshot_multistep_arithmetic_two
- bbh_cot_fewshot_navigate
- bbh_cot_fewshot_object_counting
- bbh_cot_fewshot_penguins_in_a_table
- bbh_cot_fewshot_reasoning_about_colored_objects
- bbh_cot_fewshot_ruin_names
- bbh_cot_fewshot_salient_translation_error_detection
- bbh_cot_fewshot_snarks
- bbh_cot_fewshot_sports_understanding
- bbh_cot_fewshot_temporal_sequences
- bbh_cot_fewshot_tracking_shuffled_objects_five_objects
- bbh_cot_fewshot_tracking_shuffled_objects_seven_objects
- bbh_cot_fewshot_tracking_shuffled_objects_three_objects
- bbh_cot_fewshot_web_of_lies
- bbh_cot_fewshot_word_sorting
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: true
filter_list: get-answer
metadata:
version: 2.0
group: bbh_cot_fewshot
task:
- bbh_cot_fewshot_boolean_expressions
- bbh_cot_fewshot_causal_judgement
- bbh_cot_fewshot_date_understanding
- bbh_cot_fewshot_disambiguation_qa
- bbh_cot_fewshot_dyck_languages
- bbh_cot_fewshot_formal_fallacies
- bbh_cot_fewshot_geometric_shapes
- bbh_cot_fewshot_hyperbaton
- bbh_cot_fewshot_logical_deduction_five_objects
- bbh_cot_fewshot_logical_deduction_seven_objects
- bbh_cot_fewshot_logical_deduction_three_objects
- bbh_cot_fewshot_movie_recommendation
- bbh_cot_fewshot_multistep_arithmetic_two
- bbh_cot_fewshot_navigate
- bbh_cot_fewshot_object_counting
- bbh_cot_fewshot_penguins_in_a_table
- bbh_cot_fewshot_reasoning_about_colored_objects
- bbh_cot_fewshot_ruin_names
- bbh_cot_fewshot_salient_translation_error_detection
- bbh_cot_fewshot_snarks
- bbh_cot_fewshot_sports_understanding
- bbh_cot_fewshot_temporal_sequences
- bbh_cot_fewshot_tracking_shuffled_objects_five_objects
- bbh_cot_fewshot_tracking_shuffled_objects_seven_objects
- bbh_cot_fewshot_tracking_shuffled_objects_three_objects
- bbh_cot_fewshot_web_of_lies
- bbh_cot_fewshot_word_sorting
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: true
filter_list: get-answer
metadata:
version: 2.0
group:
- bbh
- bbh_cot_fewshot
dataset_path: lukaemon/bbh dataset_path: lukaemon/bbh
output_type: generate_until output_type: generate_until
test_split: test test_split: test
......
group: bbh_cot_zeroshot
task:
- bbh_cot_zeroshot_boolean_expressions
- bbh_cot_zeroshot_causal_judgement
- bbh_cot_zeroshot_date_understanding
- bbh_cot_zeroshot_disambiguation_qa
- bbh_cot_zeroshot_dyck_languages
- bbh_cot_zeroshot_formal_fallacies
- bbh_cot_zeroshot_geometric_shapes
- bbh_cot_zeroshot_hyperbaton
- bbh_cot_zeroshot_logical_deduction_five_objects
- bbh_cot_zeroshot_logical_deduction_seven_objects
- bbh_cot_zeroshot_logical_deduction_three_objects
- bbh_cot_zeroshot_movie_recommendation
- bbh_cot_zeroshot_multistep_arithmetic_two
- bbh_cot_zeroshot_navigate
- bbh_cot_zeroshot_object_counting
- bbh_cot_zeroshot_penguins_in_a_table
- bbh_cot_zeroshot_reasoning_about_colored_objects
- bbh_cot_zeroshot_ruin_names
- bbh_cot_zeroshot_salient_translation_error_detection
- bbh_cot_zeroshot_snarks
- bbh_cot_zeroshot_sports_understanding
- bbh_cot_zeroshot_temporal_sequences
- bbh_cot_zeroshot_tracking_shuffled_objects_five_objects
- bbh_cot_zeroshot_tracking_shuffled_objects_seven_objects
- bbh_cot_zeroshot_tracking_shuffled_objects_three_objects
- bbh_cot_zeroshot_web_of_lies
- bbh_cot_zeroshot_word_sorting
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: true
filter_list: flexible-extract
metadata:
version: 2.0
group: bbh_cot_zeroshot
dataset_path: lukaemon/bbh dataset_path: lukaemon/bbh
output_type: generate_until output_type: generate_until
test_split: test test_split: test
......
group: bbh_fewshot
task:
- bbh_fewshot_boolean_expressions
- bbh_fewshot_causal_judgement
- bbh_fewshot_date_understanding
- bbh_fewshot_disambiguation_qa
- bbh_fewshot_dyck_languages
- bbh_fewshot_formal_fallacies
- bbh_fewshot_geometric_shapes
- bbh_fewshot_hyperbaton
- bbh_fewshot_logical_deduction_five_objects
- bbh_fewshot_logical_deduction_seven_objects
- bbh_fewshot_logical_deduction_three_objects
- bbh_fewshot_movie_recommendation
- bbh_fewshot_multistep_arithmetic_two
- bbh_fewshot_navigate
- bbh_fewshot_object_counting
- bbh_fewshot_penguins_in_a_table
- bbh_fewshot_reasoning_about_colored_objects
- bbh_fewshot_ruin_names
- bbh_fewshot_salient_translation_error_detection
- bbh_fewshot_snarks
- bbh_fewshot_sports_understanding
- bbh_fewshot_temporal_sequences
- bbh_fewshot_tracking_shuffled_objects_five_objects
- bbh_fewshot_tracking_shuffled_objects_seven_objects
- bbh_fewshot_tracking_shuffled_objects_three_objects
- bbh_fewshot_web_of_lies
- bbh_fewshot_word_sorting
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: true
metadata:
version: 2.0
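
Note on `weight_by_size`: the group configs above pair `aggregation: mean` with `weight_by_size: true`. Read literally, this suggests that the group-level `exact_match` is a mean over subtasks in which each subtask counts in proportion to its number of examples, rather than a plain average of subtask scores. The following is a minimal sketch of that idea only; the function name `size_weighted_mean` and the example numbers are hypothetical and are not taken from the harness code.

```python
from typing import Sequence


def size_weighted_mean(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Average per-subtask scores, weighting each by its example count.

    Hypothetical helper for illustration; not the harness implementation.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("total size must be positive")
    # Each subtask contributes score * size; dividing by the total size
    # gives the same result as pooling all examples before averaging.
    return sum(score * size for score, size in zip(scores, sizes)) / total


def unweighted_mean(scores: Sequence[float]) -> float:
    """Plain average over subtasks, ignoring their sizes."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up numbers: two subtasks with 250 and 750 examples.
    scores = [0.5, 1.0]
    sizes = [250, 750]
    print(size_weighted_mean(scores, sizes))
    print(unweighted_mean(scores))
```

With these made-up inputs the larger subtask pulls the weighted result toward its own score, which is the practical difference between the two averaging modes.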