gaoqiong / lm-evaluation-harness
Commit 2cc968ba
Authored Feb 12, 2022 by Leo Gao; committed by GitHub on Feb 12, 2022
Merge pull request #288 from EleutherAI/leogao2-patch-1
Update README.md
Parents: c5131e8a, 71647775
Showing 1 changed file with 263 additions and 251 deletions
README.md
@@ -91,7 +91,7 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
### Full Task List
### Full Task List
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |
|---------------------------------------------------------|-----|---|----|------------:|------------------------------------------------------------------------------|
|---------------------------------------------------------|-----|---|----|------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|cola |✓ |✓ | | 1043|mcc |
|cola |✓ |✓ | | 1043|mcc |
|mnli |✓ |✓ | | 9815|acc |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
@@ -112,9 +112,15 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
|drop |✓ |✓ | | 9536|em, f1 |
|drop |✓ |✓ | | 9536|em, f1 |
|lambada | |✓ | | 5153|ppl, acc |
|lambada | |✓ | | 5153|ppl, acc |
|lambada_cloze | |✓ | | 5153|ppl, acc |
|lambada_cloze | |✓ | | 5153|ppl, acc |
|lambada_mt_en | |✓ | | 5153|ppl, acc |
|lambada_mt_fr | |✓ | | 5153|ppl, acc |
|lambada_mt_de | |✓ | | 5153|ppl, acc |
|lambada_mt_it | |✓ | | 5153|ppl, acc |
|lambada_mt_es | |✓ | | 5153|ppl, acc |
|wikitext | |✓ |✓ | 62|word_perplexity, byte_perplexity, bits_per_byte |
|wikitext | |✓ |✓ | 62|word_perplexity, byte_perplexity, bits_per_byte |
|piqa |✓ |✓ | | 1838|acc, acc_norm |
|piqa |✓ |✓ | | 1838|acc, acc_norm |
|prost | | |✓ | 18736|acc, acc_norm |
|prost | | |✓ | 18736|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|pubmedqa | | |✓ | 1000|acc |
|pubmedqa | | |✓ | 1000|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|qa4mre_2011 | | |✓ | 120|acc, acc_norm |
|qa4mre_2011 | | |✓ | 120|acc, acc_norm |
@@ -126,11 +132,12 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
|hellaswag |✓ |✓ | | 10042|acc, acc_norm |
|hellaswag |✓ |✓ | | 10042|acc, acc_norm |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1|
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|race |✓ |✓ |✓ | 1045|acc |
|race |✓ |✓ |✓ | 1045|acc |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|webqs |✓ | |✓ | 2032|acc |
|webqs |✓ | |✓ | 2032|acc |
|wsc273 | | |✓ | 273|acc |
|wsc273 | | |✓ | 273|acc |
|winogrande |✓ |✓ | | 1267|acc |
|winogrande |✓ |✓ | | 1267|acc |
@@ -143,6 +150,10 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
|ethics_utilitarianism_original | | |✓ | 4808|acc |
|ethics_utilitarianism_original | | |✓ | 4808|acc |
|ethics_utilitarianism |✓ | |✓ | 4808|acc |
|ethics_utilitarianism |✓ | |✓ | 4808|acc |
|ethics_virtue |✓ | |✓ | 4975|acc, em |
|ethics_virtue |✓ | |✓ | 4975|acc, em |
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|math_algebra |✓ | |✓ | 1187|acc |
|math_algebra |✓ | |✓ | 1187|acc |
|math_counting_and_prob |✓ | |✓ | 474|acc |
|math_counting_and_prob |✓ | |✓ | 474|acc |
|math_geometry |✓ | |✓ | 479|acc |
|math_geometry |✓ | |✓ | 479|acc |
@@ -150,6 +161,7 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
|math_num_theory |✓ | |✓ | 540|acc |
|math_num_theory |✓ | |✓ | 540|acc |
|math_prealgebra |✓ | |✓ | 871|acc |
|math_prealgebra |✓ | |✓ | 871|acc |
|math_precalc |✓ | |✓ | 546|acc |
|math_precalc |✓ | |✓ | 546|acc |
|math_asdiv | |✓ | | 2305|acc |
|arithmetic_2da | |✓ | | 2000|acc |
|arithmetic_2da | |✓ | | 2000|acc |
|arithmetic_2ds | |✓ | | 2000|acc |
|arithmetic_2ds | |✓ | | 2000|acc |
|arithmetic_3da | |✓ | | 2000|acc |
|arithmetic_3da | |✓ | | 2000|acc |
@@ -273,74 +285,74 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_youtubesubtitles | |✓ | | 1000|acc
|pile_youtubesubtitles | |✓ |✓ | 342|word_perplexity, byte_perplexity, bits_per_byte |
|blimp_adjunct_island | |✓ | | 1000|acc
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
|blimp_animate_subject_passive | |✓ | | 1000|acc
|blimp_animate_subject_passive | |✓ | | 1000|acc |
|blimp_animate_subject_trans | |✓ | | 1000|acc
|blimp_animate_subject_trans | |✓ | | 1000|acc |
|blimp_causative | |✓ | | 1000|acc
|blimp_causative | |✓ | | 1000|acc |
|blimp_complex_NP_island | |✓ | | 1000|acc
|blimp_complex_NP_island | |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc |
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc |
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc |
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc |
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc |
|blimp_drop_argument | |✓ | | 1000|acc
|blimp_drop_argument | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc |
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc |
|blimp_existential_there_object_raising | |✓ | | 1000|acc
|blimp_existential_there_object_raising | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc |
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc |
|blimp_existential_there_subject_raising | |✓ | | 1000|acc
|blimp_existential_there_subject_raising | |✓ | | 1000|acc |
|blimp_expletive_it_object_raising | |✓ | | 1000|acc
|blimp_expletive_it_object_raising | |✓ | | 1000|acc |
|blimp_inchoative | |✓ | | 1000|acc
|blimp_inchoative | |✓ | | 1000|acc |
|blimp_intransitive | |✓ | | 1000|acc
|blimp_intransitive | |✓ | | 1000|acc |
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc |
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc |
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc |
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc |
|blimp_npi_present_1 | |✓ | | 1000|acc
|blimp_npi_present_1 | |✓ | | 1000|acc |
|blimp_npi_present_2 | |✓ | | 1000|acc
|blimp_npi_present_2 | |✓ | | 1000|acc |
|blimp_only_npi_licensor_present | |✓ | | 1000|acc
|blimp_only_npi_licensor_present | |✓ | | 1000|acc |
|blimp_only_npi_scope | |✓ | | 1000|acc
|blimp_only_npi_scope | |✓ | | 1000|acc |
|blimp_passive_1 | |✓ | | 1000|acc
|blimp_passive_1 | |✓ | | 1000|acc |
|blimp_passive_2 | |✓ | | 1000|acc
|blimp_passive_2 | |✓ | | 1000|acc |
|blimp_principle_A_c_command | |✓ | | 1000|acc
|blimp_principle_A_c_command | |✓ | | 1000|acc |
|blimp_principle_A_case_1 | |✓ | | 1000|acc
|blimp_principle_A_case_1 | |✓ | | 1000|acc |
|blimp_principle_A_case_2 | |✓ | | 1000|acc
|blimp_principle_A_case_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_1 | |✓ | | 1000|acc
|blimp_principle_A_domain_1 | |✓ | | 1000|acc |
|blimp_principle_A_domain_2 | |✓ | | 1000|acc
|blimp_principle_A_domain_2 | |✓ | | 1000|acc |
|blimp_principle_A_domain_3 | |✓ | | 1000|acc
|blimp_principle_A_domain_3 | |✓ | | 1000|acc |
|blimp_principle_A_reconstruction | |✓ | | 1000|acc
|blimp_principle_A_reconstruction | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc |
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc |
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc |
|blimp_sentential_subject_island | |✓ | | 1000|acc
|blimp_sentential_subject_island | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc |
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc |
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc |
|blimp_transitive | |✓ | | 1000|acc
|blimp_transitive | |✓ | | 1000|acc |
|blimp_wh_island | |✓ | | 1000|acc
|blimp_wh_island | |✓ | | 1000|acc |
|blimp_wh_questions_object_gap | |✓ | | 1000|acc
|blimp_wh_questions_object_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc |
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc |
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc |
## Usage
## Usage