Commit ff58b389 authored by Leo Gao's avatar Leo Gao
Browse files

Merge branch 'master' of github.com:EleutherAI/lm-evaluation-harness into bits_per_byte

parents 33315a1f a67c17e0
...@@ -21,7 +21,7 @@ pip install lm-eval ...@@ -21,7 +21,7 @@ pip install lm-eval
## Basic Usage ## Basic Usage
To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. **When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.
```bash ```bash
python main.py \ python main.py \
...@@ -91,7 +91,7 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele ...@@ -91,7 +91,7 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
### Full Task List ### Full Task List
| Task Name |Train|Val|Test|Val/Test Docs| Metrics | | Task Name |Train|Val|Test|Val/Test Docs| Metrics |
|-------------------------------------------------|-----|---|----|------------:|------------------------------------------------------------------------------| |---------------------------------------------------------|-----|---|----|------------:|------------------------------------------------------------------------------|
|cola |✓ |✓ | | 1043|mcc | |cola |✓ |✓ | | 1043|mcc |
|mnli |✓ |✓ | | 9815|acc | |mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc | |mnli_mismatched |✓ |✓ | | 9832|acc |
...@@ -128,8 +128,9 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele ...@@ -128,8 +128,9 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm | |openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1| |squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1|
|race |✓ |✓ |✓ | 1045|acc | |race |✓ |✓ |✓ | 1045|acc |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm | |mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|webqs |✓ | |✓ | 2032|acc | |webqs |✓ | |✓ | 2032|acc |
|wsc273 | | |✓ | 273|acc | |wsc273 | | |✓ | 273|acc |
|winogrande |✓ |✓ | | 1267|acc | |winogrande |✓ |✓ | | 1267|acc |
...@@ -182,7 +183,7 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele ...@@ -182,7 +183,7 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
|hendrycksTest-high_school_computer_science |✓ |✓ |✓ | 100|acc, acc_norm | |hendrycksTest-high_school_computer_science |✓ |✓ |✓ | 100|acc, acc_norm |
|hendrycksTest-high_school_european_history |✓ |✓ |✓ | 165|acc, acc_norm | |hendrycksTest-high_school_european_history |✓ |✓ |✓ | 165|acc, acc_norm |
|hendrycksTest-high_school_geography |✓ |✓ |✓ | 198|acc, acc_norm | |hendrycksTest-high_school_geography |✓ |✓ |✓ | 198|acc, acc_norm |
|hendrycksTest-high_school_government_and_politics|✓ |✓ |✓ | 193|acc, acc_norm | |hendrycksTest-high_school_government_and_politics |✓ |✓ |✓ | 193|acc, acc_norm |
|hendrycksTest-high_school_macroeconomics |✓ |✓ |✓ | 390|acc, acc_norm | |hendrycksTest-high_school_macroeconomics |✓ |✓ |✓ | 390|acc, acc_norm |
|hendrycksTest-high_school_mathematics |✓ |✓ |✓ | 270|acc, acc_norm | |hendrycksTest-high_school_mathematics |✓ |✓ |✓ | 270|acc, acc_norm |
|hendrycksTest-high_school_microeconomics |✓ |✓ |✓ | 238|acc, acc_norm | |hendrycksTest-high_school_microeconomics |✓ |✓ |✓ | 238|acc, acc_norm |
...@@ -272,9 +273,74 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele ...@@ -272,9 +273,74 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
|pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte | |pile_uspto | |✓ |✓ | 11415|word_perplexity, byte_perplexity, bits_per_byte |
|pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte | |pile_ubuntu-irc | |✓ |✓ | 22|word_perplexity, byte_perplexity, bits_per_byte |
|pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte | |pile_wikipedia | |✓ |✓ | 17511|word_perplexity, byte_perplexity, bits_per_byte |
|pile_youtubesubtitles | |✓ |✓ | 342|word_perplexity, byte_perplexity, bits_per_byte | |pile_youtubesubtitles | |✓ | | 1000|acc
|blimp_adjunct_island | |✓ | | 1000|acc
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc
|blimp_anaphor_number_agreement | |✓ | | 1000|acc
|blimp_animate_subject_passive | |✓ | | 1000|acc
|blimp_animate_subject_trans | |✓ | | 1000|acc
|blimp_causative | |✓ | | 1000|acc
|blimp_complex_NP_island | |✓ | | 1000|acc
|blimp_coordinate_structure_constraint_complex_left_branch| |✓ | | 1000|acc
|blimp_coordinate_structure_constraint_object_extraction | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_irregular_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_irregular_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adj_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adj_irregular_1 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adj_irregular_2 | |✓ | | 1000|acc
|blimp_determiner_noun_agreement_with_adjective_1 | |✓ | | 1000|acc
|blimp_distractor_agreement_relational_noun | |✓ | | 1000|acc
|blimp_distractor_agreement_relative_clause | |✓ | | 1000|acc
|blimp_drop_argument | |✓ | | 1000|acc
|blimp_ellipsis_n_bar_1 | |✓ | | 1000|acc
|blimp_ellipsis_n_bar_2 | |✓ | | 1000|acc
|blimp_existential_there_object_raising | |✓ | | 1000|acc
|blimp_existential_there_quantifiers_1 | |✓ | | 1000|acc
|blimp_existential_there_quantifiers_2 | |✓ | | 1000|acc
|blimp_existential_there_subject_raising | |✓ | | 1000|acc
|blimp_expletive_it_object_raising | |✓ | | 1000|acc
|blimp_inchoative | |✓ | | 1000|acc
|blimp_intransitive | |✓ | | 1000|acc
|blimp_irregular_past_participle_adjectives | |✓ | | 1000|acc
|blimp_irregular_past_participle_verbs | |✓ | | 1000|acc
|blimp_irregular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc
|blimp_irregular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc
|blimp_left_branch_island_echo_question | |✓ | | 1000|acc
|blimp_left_branch_island_simple_question | |✓ | | 1000|acc
|blimp_matrix_question_npi_licensor_present | |✓ | | 1000|acc
|blimp_npi_present_1 | |✓ | | 1000|acc
|blimp_npi_present_2 | |✓ | | 1000|acc
|blimp_only_npi_licensor_present | |✓ | | 1000|acc
|blimp_only_npi_scope | |✓ | | 1000|acc
|blimp_passive_1 | |✓ | | 1000|acc
|blimp_passive_2 | |✓ | | 1000|acc
|blimp_principle_A_c_command | |✓ | | 1000|acc
|blimp_principle_A_case_1 | |✓ | | 1000|acc
|blimp_principle_A_case_2 | |✓ | | 1000|acc
|blimp_principle_A_domain_1 | |✓ | | 1000|acc
|blimp_principle_A_domain_2 | |✓ | | 1000|acc
|blimp_principle_A_domain_3 | |✓ | | 1000|acc
|blimp_principle_A_reconstruction | |✓ | | 1000|acc
|blimp_regular_plural_subject_verb_agreement_1 | |✓ | | 1000|acc
|blimp_regular_plural_subject_verb_agreement_2 | |✓ | | 1000|acc
|blimp_sentential_negation_npi_licensor_present | |✓ | | 1000|acc
|blimp_sentential_negation_npi_scope | |✓ | | 1000|acc
|blimp_sentential_subject_island | |✓ | | 1000|acc
|blimp_superlative_quantifiers_1 | |✓ | | 1000|acc
|blimp_superlative_quantifiers_2 | |✓ | | 1000|acc
|blimp_tough_vs_raising_1 | |✓ | | 1000|acc
|blimp_tough_vs_raising_2 | |✓ | | 1000|acc
|blimp_transitive | |✓ | | 1000|acc
|blimp_wh_island | |✓ | | 1000|acc
|blimp_wh_questions_object_gap | |✓ | | 1000|acc
|blimp_wh_questions_subject_gap | |✓ | | 1000|acc
|blimp_wh_questions_subject_gap_long_distance | |✓ | | 1000|acc
|blimp_wh_vs_that_no_gap | |✓ | | 1000|acc
|blimp_wh_vs_that_no_gap_long_distance | |✓ | | 1000|acc
|blimp_wh_vs_that_with_gap | |✓ | | 1000|acc
|blimp_wh_vs_that_with_gap_long_distance | |✓ | | 1000|acc
## Usage ## Usage
......
...@@ -44,6 +44,7 @@ from . import wikitext ...@@ -44,6 +44,7 @@ from . import wikitext
from . import lambada_multilingual from . import lambada_multilingual
from . import mutual from . import mutual
from . import truthfulqa from . import truthfulqa
from . import blimp
######################################## ########################################
# Translation tasks # Translation tasks
...@@ -132,7 +133,9 @@ TASK_REGISTRY = { ...@@ -132,7 +133,9 @@ TASK_REGISTRY = {
"squad2": squad.SQuAD2, "squad2": squad.SQuAD2,
"race": race.RACE, "race": race.RACE,
# "naturalqs": naturalqs.NaturalQs, # not implemented yet # "naturalqs": naturalqs.NaturalQs, # not implemented yet
"headqa": headqa.HeadQA, "headqa": headqa.HeadQAEsDeprecated, # for backwards compat - headqa used to default to es
"headqa_es": headqa.HeadQAEs,
"headqa_en": headqa.HeadQAEn,
"mathqa": mathqa.MathQA, "mathqa": mathqa.MathQA,
"webqs": webqs.WebQs, "webqs": webqs.WebQs,
"wsc273": wsc273.WinogradSchemaChallenge273, "wsc273": wsc273.WinogradSchemaChallenge273,
...@@ -217,6 +220,74 @@ TASK_REGISTRY = { ...@@ -217,6 +220,74 @@ TASK_REGISTRY = {
"pile_wikipedia": pile.PileWikipedia, "pile_wikipedia": pile.PileWikipedia,
"pile_youtubesubtitles": pile.PileYoutubeSubtitles, "pile_youtubesubtitles": pile.PileYoutubeSubtitles,
# BLiMP
"blimp_adjunct_island": blimp.BlimpAdjunctIsland,
"blimp_anaphor_gender_agreement": blimp.BlimpAnaphorGenderAgreement,
"blimp_anaphor_number_agreement": blimp.BlimpAnaphorNumberAgreement,
"blimp_animate_subject_passive": blimp.BlimpAnimateSubjectPassive,
"blimp_animate_subject_trans": blimp.BlimpAnimateSubjectTrans,
"blimp_causative": blimp.BlimpCausative,
"blimp_complex_NP_island": blimp.BlimpComplex_NPIsland,
"blimp_coordinate_structure_constraint_complex_left_branch": blimp.BlimpCoordinateStructureConstraintComplexLeftBranch,
"blimp_coordinate_structure_constraint_object_extraction": blimp.BlimpCoordinateStructureConstraintObjectExtraction,
"blimp_determiner_noun_agreement_1": blimp.BlimpDeterminerNounAgreement_1,
"blimp_determiner_noun_agreement_2": blimp.BlimpDeterminerNounAgreement_2,
"blimp_determiner_noun_agreement_irregular_1": blimp.BlimpDeterminerNounAgreementIrregular_1,
"blimp_determiner_noun_agreement_irregular_2": blimp.BlimpDeterminerNounAgreementIrregular_2,
"blimp_determiner_noun_agreement_with_adj_2": blimp.BlimpDeterminerNounAgreementWithAdj_2,
"blimp_determiner_noun_agreement_with_adj_irregular_1": blimp.BlimpDeterminerNounAgreementWithAdjIrregular_1,
"blimp_determiner_noun_agreement_with_adj_irregular_2": blimp.BlimpDeterminerNounAgreementWithAdjIrregular_2,
"blimp_determiner_noun_agreement_with_adjective_1": blimp.BlimpDeterminerNounAgreementWithAdjective_1,
"blimp_distractor_agreement_relational_noun": blimp.BlimpDistractorAgreementRelationalNoun,
"blimp_distractor_agreement_relative_clause": blimp.BlimpDistractorAgreementRelativeClause,
"blimp_drop_argument": blimp.BlimpDropArgument,
"blimp_ellipsis_n_bar_1": blimp.BlimpEllipsisNBar_1,
"blimp_ellipsis_n_bar_2": blimp.BlimpEllipsisNBar_2,
"blimp_existential_there_object_raising": blimp.BlimpExistentialThereObjectRaising,
"blimp_existential_there_quantifiers_1": blimp.BlimpExistentialThereQuantifiers_1,
"blimp_existential_there_quantifiers_2": blimp.BlimpExistentialThereQuantifiers_2,
"blimp_existential_there_subject_raising": blimp.BlimpExistentialThereSubjectRaising,
"blimp_expletive_it_object_raising": blimp.BlimpExpletiveItObjectRaising,
"blimp_inchoative": blimp.BlimpInchoative,
"blimp_intransitive": blimp.BlimpIntransitive,
"blimp_irregular_past_participle_adjectives": blimp.BlimpIrregularPastParticipleAdjectives,
"blimp_irregular_past_participle_verbs": blimp.BlimpIrregularPastParticipleVerbs,
"blimp_irregular_plural_subject_verb_agreement_1": blimp.BlimpIrregularPluralSubjectVerbAgreement_1,
"blimp_irregular_plural_subject_verb_agreement_2": blimp.BlimpIrregularPluralSubjectVerbAgreement_2,
"blimp_left_branch_island_echo_question": blimp.BlimpLeftBranchIslandEchoQuestion,
"blimp_left_branch_island_simple_question": blimp.BlimpLeftBranchIslandSimpleQuestion,
"blimp_matrix_question_npi_licensor_present": blimp.BlimpMatrixQuestionNpiLicensorPresent,
"blimp_npi_present_1": blimp.BlimpNpiPresent_1,
"blimp_npi_present_2": blimp.BlimpNpiPresent_2,
"blimp_only_npi_licensor_present": blimp.BlimpOnlyNpiLicensorPresent,
"blimp_only_npi_scope": blimp.BlimpOnlyNpiScope,
"blimp_passive_1": blimp.BlimpPassive_1,
"blimp_passive_2": blimp.BlimpPassive_2,
"blimp_principle_A_c_command": blimp.BlimpPrinciple_ACCommand,
"blimp_principle_A_case_1": blimp.BlimpPrinciple_ACase_1,
"blimp_principle_A_case_2": blimp.BlimpPrinciple_ACase_2,
"blimp_principle_A_domain_1": blimp.BlimpPrinciple_ADomain_1,
"blimp_principle_A_domain_2": blimp.BlimpPrinciple_ADomain_2,
"blimp_principle_A_domain_3": blimp.BlimpPrinciple_ADomain_3,
"blimp_principle_A_reconstruction": blimp.BlimpPrinciple_AReconstruction,
"blimp_regular_plural_subject_verb_agreement_1": blimp.BlimpRegularPluralSubjectVerbAgreement_1,
"blimp_regular_plural_subject_verb_agreement_2": blimp.BlimpRegularPluralSubjectVerbAgreement_2,
"blimp_sentential_negation_npi_licensor_present": blimp.BlimpSententialNegationNpiLicensorPresent,
"blimp_sentential_negation_npi_scope": blimp.BlimpSententialNegationNpiScope,
"blimp_sentential_subject_island": blimp.BlimpSententialSubjectIsland,
"blimp_superlative_quantifiers_1": blimp.BlimpSuperlativeQuantifiers_1,
"blimp_superlative_quantifiers_2": blimp.BlimpSuperlativeQuantifiers_2,
"blimp_tough_vs_raising_1": blimp.BlimpToughVsRaising_1,
"blimp_tough_vs_raising_2": blimp.BlimpToughVsRaising_2,
"blimp_transitive": blimp.BlimpTransitive,
"blimp_wh_island": blimp.BlimpWhIsland,
"blimp_wh_questions_object_gap": blimp.BlimpWhQuestionsObjectGap,
"blimp_wh_questions_subject_gap": blimp.BlimpWhQuestionsSubjectGap,
"blimp_wh_questions_subject_gap_long_distance": blimp.BlimpWhQuestionsSubjectGapLongDistance,
"blimp_wh_vs_that_no_gap": blimp.BlimpWhVsThatNoGap,
"blimp_wh_vs_that_no_gap_long_distance": blimp.BlimpWhVsThatNoGapLongDistance,
"blimp_wh_vs_that_with_gap": blimp.BlimpWhVsThatWithGap,
"blimp_wh_vs_that_with_gap_long_distance": blimp.BlimpWhVsThatWithGapLongDistance,
} }
......
"""
BLiMP: A Benchmark of Linguistic Minimal Pairs for English
https://arxiv.org/abs/1912.00582
@article{warstadt2019blimp,
title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei, and Wang, Sheng-Fu and Bowman, Samuel R},
journal={arXiv preprint arXiv:1912.00582},
year={2019}
}
"""
from lm_eval.base import rf
from lm_eval.metrics import mean
from .common import HFTask
class BlimpTask(HFTask):
VERSION = 0
DATASET_PATH = "blimp"
def download(self):
super().download()
# The HF dataset only contains a "train" dataset, but the harness expects a "validation"
# dataset. Let's use the training dataset, on the assumption that the model wasn't actually
# trained on this data.
self.data["validation"] = self.data["train"]
del self.data["train"]
def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
assert num_fewshot == 0
assert not provide_description
return ""
def doc_to_text(self, doc):
# this method is invoked by tests only
return ""
def doc_to_target(self, doc):
# this method is invoked by tests only
return ""
def construct_requests(self, doc, ctx):
assert not ctx
# Calculate the loglikelihood for the good and the bad sentence.
# Note that loglikelihood translates the "" prefix to the "<|endoftext|>" token
return [
rf.loglikelihood("", doc["sentence_good"]),
rf.loglikelihood("", doc["sentence_bad"]),
]
def process_results(self, doc, results):
likelihood1, likelihood2 = results
# the model got this case right iff the good sentence scored higher than the bad sentence
acc = 1.0 if likelihood1 > likelihood2 else 0.0
return {
"acc": acc,
}
def higher_is_better(self):
return {
"acc": True,
}
def aggregation(self):
return {
"acc": mean,
}
class BlimpAdjunctIsland(BlimpTask):
DATASET_NAME = "adjunct_island"
class BlimpAnaphorGenderAgreement(BlimpTask):
DATASET_NAME = "anaphor_gender_agreement"
class BlimpAnaphorNumberAgreement(BlimpTask):
DATASET_NAME = "anaphor_number_agreement"
class BlimpAnimateSubjectPassive(BlimpTask):
DATASET_NAME = "animate_subject_passive"
class BlimpAnimateSubjectTrans(BlimpTask):
DATASET_NAME = "animate_subject_trans"
class BlimpCausative(BlimpTask):
DATASET_NAME = "causative"
class BlimpComplex_NPIsland(BlimpTask):
DATASET_NAME = "complex_NP_island"
class BlimpCoordinateStructureConstraintComplexLeftBranch(BlimpTask):
DATASET_NAME = "coordinate_structure_constraint_complex_left_branch"
class BlimpCoordinateStructureConstraintObjectExtraction(BlimpTask):
DATASET_NAME = "coordinate_structure_constraint_object_extraction"
class BlimpDeterminerNounAgreement_1(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_1"
class BlimpDeterminerNounAgreement_2(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_2"
class BlimpDeterminerNounAgreementIrregular_1(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_irregular_1"
class BlimpDeterminerNounAgreementIrregular_2(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_irregular_2"
class BlimpDeterminerNounAgreementWithAdj_2(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_with_adj_2"
class BlimpDeterminerNounAgreementWithAdjIrregular_1(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_with_adj_irregular_1"
class BlimpDeterminerNounAgreementWithAdjIrregular_2(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_with_adj_irregular_2"
class BlimpDeterminerNounAgreementWithAdjective_1(BlimpTask):
DATASET_NAME = "determiner_noun_agreement_with_adjective_1"
class BlimpDistractorAgreementRelationalNoun(BlimpTask):
DATASET_NAME = "distractor_agreement_relational_noun"
class BlimpDistractorAgreementRelativeClause(BlimpTask):
DATASET_NAME = "distractor_agreement_relative_clause"
class BlimpDropArgument(BlimpTask):
DATASET_NAME = "drop_argument"
class BlimpEllipsisNBar_1(BlimpTask):
DATASET_NAME = "ellipsis_n_bar_1"
class BlimpEllipsisNBar_2(BlimpTask):
DATASET_NAME = "ellipsis_n_bar_2"
class BlimpExistentialThereObjectRaising(BlimpTask):
DATASET_NAME = "existential_there_object_raising"
class BlimpExistentialThereQuantifiers_1(BlimpTask):
DATASET_NAME = "existential_there_quantifiers_1"
class BlimpExistentialThereQuantifiers_2(BlimpTask):
DATASET_NAME = "existential_there_quantifiers_2"
class BlimpExistentialThereSubjectRaising(BlimpTask):
DATASET_NAME = "existential_there_subject_raising"
class BlimpExpletiveItObjectRaising(BlimpTask):
DATASET_NAME = "expletive_it_object_raising"
class BlimpInchoative(BlimpTask):
DATASET_NAME = "inchoative"
class BlimpIntransitive(BlimpTask):
DATASET_NAME = "intransitive"
class BlimpIrregularPastParticipleAdjectives(BlimpTask):
DATASET_NAME = "irregular_past_participle_adjectives"
class BlimpIrregularPastParticipleVerbs(BlimpTask):
DATASET_NAME = "irregular_past_participle_verbs"
class BlimpIrregularPluralSubjectVerbAgreement_1(BlimpTask):
DATASET_NAME = "irregular_plural_subject_verb_agreement_1"
class BlimpIrregularPluralSubjectVerbAgreement_2(BlimpTask):
DATASET_NAME = "irregular_plural_subject_verb_agreement_2"
class BlimpLeftBranchIslandEchoQuestion(BlimpTask):
DATASET_NAME = "left_branch_island_echo_question"
class BlimpLeftBranchIslandSimpleQuestion(BlimpTask):
DATASET_NAME = "left_branch_island_simple_question"
class BlimpMatrixQuestionNpiLicensorPresent(BlimpTask):
DATASET_NAME = "matrix_question_npi_licensor_present"
class BlimpNpiPresent_1(BlimpTask):
DATASET_NAME = "npi_present_1"
class BlimpNpiPresent_2(BlimpTask):
DATASET_NAME = "npi_present_2"
class BlimpOnlyNpiLicensorPresent(BlimpTask):
DATASET_NAME = "only_npi_licensor_present"
class BlimpOnlyNpiScope(BlimpTask):
DATASET_NAME = "only_npi_scope"
class BlimpPassive_1(BlimpTask):
DATASET_NAME = "passive_1"
class BlimpPassive_2(BlimpTask):
DATASET_NAME = "passive_2"
class BlimpPrinciple_ACCommand(BlimpTask):
DATASET_NAME = "principle_A_c_command"
class BlimpPrinciple_ACase_1(BlimpTask):
DATASET_NAME = "principle_A_case_1"
class BlimpPrinciple_ACase_2(BlimpTask):
DATASET_NAME = "principle_A_case_2"
class BlimpPrinciple_ADomain_1(BlimpTask):
DATASET_NAME = "principle_A_domain_1"
class BlimpPrinciple_ADomain_2(BlimpTask):
DATASET_NAME = "principle_A_domain_2"
class BlimpPrinciple_ADomain_3(BlimpTask):
DATASET_NAME = "principle_A_domain_3"
class BlimpPrinciple_AReconstruction(BlimpTask):
DATASET_NAME = "principle_A_reconstruction"
class BlimpRegularPluralSubjectVerbAgreement_1(BlimpTask):
DATASET_NAME = "regular_plural_subject_verb_agreement_1"
class BlimpRegularPluralSubjectVerbAgreement_2(BlimpTask):
DATASET_NAME = "regular_plural_subject_verb_agreement_2"
class BlimpSententialNegationNpiLicensorPresent(BlimpTask):
DATASET_NAME = "sentential_negation_npi_licensor_present"
class BlimpSententialNegationNpiScope(BlimpTask):
DATASET_NAME = "sentential_negation_npi_scope"
class BlimpSententialSubjectIsland(BlimpTask):
DATASET_NAME = "sentential_subject_island"
class BlimpSuperlativeQuantifiers_1(BlimpTask):
DATASET_NAME = "superlative_quantifiers_1"
class BlimpSuperlativeQuantifiers_2(BlimpTask):
DATASET_NAME = "superlative_quantifiers_2"
class BlimpToughVsRaising_1(BlimpTask):
DATASET_NAME = "tough_vs_raising_1"
class BlimpToughVsRaising_2(BlimpTask):
DATASET_NAME = "tough_vs_raising_2"
class BlimpTransitive(BlimpTask):
DATASET_NAME = "transitive"
class BlimpWhIsland(BlimpTask):
DATASET_NAME = "wh_island"
class BlimpWhQuestionsObjectGap(BlimpTask):
DATASET_NAME = "wh_questions_object_gap"
class BlimpWhQuestionsSubjectGap(BlimpTask):
DATASET_NAME = "wh_questions_subject_gap"
class BlimpWhQuestionsSubjectGapLongDistance(BlimpTask):
DATASET_NAME = "wh_questions_subject_gap_long_distance"
class BlimpWhVsThatNoGap(BlimpTask):
DATASET_NAME = "wh_vs_that_no_gap"
class BlimpWhVsThatNoGapLongDistance(BlimpTask):
DATASET_NAME = "wh_vs_that_no_gap_long_distance"
class BlimpWhVsThatWithGap(BlimpTask):
DATASET_NAME = "wh_vs_that_with_gap"
class BlimpWhVsThatWithGapLongDistance(BlimpTask):
DATASET_NAME = "wh_vs_that_with_gap_long_distance"
...@@ -227,7 +227,7 @@ class QNLI(HFTask): ...@@ -227,7 +227,7 @@ class QNLI(HFTask):
class WNLI(HFTask): class WNLI(HFTask):
VERSION = 0 VERSION = 1
DATASET_PATH = "glue" DATASET_PATH = "glue"
DATASET_NAME = "wnli" DATASET_NAME = "wnli"
...@@ -241,26 +241,25 @@ class WNLI(HFTask): ...@@ -241,26 +241,25 @@ class WNLI(HFTask):
return False return False
def doc_to_text(self, doc): def doc_to_text(self, doc):
return "{}\nQuestion: {} True, False or Neither?\nAnswer:".format( return "{}\nQuestion: {} True or False?\nAnswer:".format(
doc["sentence1"], doc["sentence1"],
doc["sentence2"], doc["sentence2"],
) )
def doc_to_target(self, doc): def doc_to_target(self, doc):
# True = entailment # True = entailment
# False = contradiction # False = not_entailment
# Neither = neutral return " {}".format({0: "False", 1: "True"}[doc["label"]])
return " {}".format({0: "True", 1: "Neither", 2: "False"}[doc["label"]])
def construct_requests(self, doc, ctx): def construct_requests(self, doc, ctx):
ll_true, _ = rf.loglikelihood(ctx, " True") ll_true, _ = rf.loglikelihood(ctx, " True")
ll_neither, _ = rf.loglikelihood(ctx, " Neither")
ll_false, _ = rf.loglikelihood(ctx, " False") ll_false, _ = rf.loglikelihood(ctx, " False")
return ll_true, ll_neither, ll_false return ll_true, ll_false
def process_results(self, doc, results): def process_results(self, doc, results):
ll_true, ll_false = results
pred = ll_true > ll_false
gold = doc["label"] gold = doc["label"]
pred = np.argmax(results)
return { return {
"acc": pred == gold "acc": pred == gold
} }
......
...@@ -2,10 +2,9 @@ from . common import HFTask ...@@ -2,10 +2,9 @@ from . common import HFTask
from lm_eval.base import MultipleChoiceTask from lm_eval.base import MultipleChoiceTask
class HeadQA(HFTask, MultipleChoiceTask): class HeadQABase(HFTask, MultipleChoiceTask):
VERSION = 0 VERSION = 0
DATASET_PATH = "head_qa" DATASET_PATH = "head_qa"
DATASET_NAME = None
def has_training_docs(self): def has_training_docs(self):
return True return True
...@@ -31,3 +30,15 @@ class HeadQA(HFTask, MultipleChoiceTask): ...@@ -31,3 +30,15 @@ class HeadQA(HFTask, MultipleChoiceTask):
def doc_to_text(self, doc): def doc_to_text(self, doc):
return doc["query"] return doc["query"]
class HeadQAEn(HeadQABase):
DATASET_NAME = "en"
class HeadQAEs(HeadQABase):
DATASET_NAME = "es"
# for backwards compatibility
class HeadQAEsDeprecated(HeadQABase):
DATASET_NAME = "es"
print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
\ No newline at end of file
...@@ -18,9 +18,12 @@ class PilePerplexityTask(PerplexityTask, abc.ABC): ...@@ -18,9 +18,12 @@ class PilePerplexityTask(PerplexityTask, abc.ABC):
def download(self): def download(self):
# TODO: separate pile val/test out by component so we don't have to scan the entire file once per set # TODO: separate pile val/test out by component so we don't have to scan the entire file once per set
if not os.path.exists("data/pile/test.jsonl.zst"):
# todo use new best_download fallback api
os.makedirs("data/pile/", exist_ok=True) os.makedirs("data/pile/", exist_ok=True)
download_file("https://the-eye.eu/public/AI/pile/val.jsonl.zst", self.VAL_PATH, "264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92") download_file("http://eaidata.bmk.sh/data/pile/val.jsonl.zst", self.VAL_PATH, "264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92")
download_file("https://the-eye.eu/public/AI/pile/test.jsonl.zst", self.TEST_PATH, "0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e") download_file("http://eaidata.bmk.sh/data/pile/test.jsonl.zst", self.TEST_PATH, "0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e")
def validation_docs(self): def validation_docs(self):
rdr = lm_dataformat.Reader(self.VAL_PATH) rdr = lm_dataformat.Reader(self.VAL_PATH)
......
...@@ -13,7 +13,7 @@ from ..utils import general_detokenize ...@@ -13,7 +13,7 @@ from ..utils import general_detokenize
class BoolQ(HFTask): class BoolQ(HFTask):
VERSION = 0 VERSION = 1
DATASET_PATH = "super_glue" DATASET_PATH = "super_glue"
DATASET_NAME = "boolq" DATASET_NAME = "boolq"
...@@ -31,7 +31,7 @@ class BoolQ(HFTask): ...@@ -31,7 +31,7 @@ class BoolQ(HFTask):
return "Read the following passages and answer each question with a yes or a no." return "Read the following passages and answer each question with a yes or a no."
def doc_to_text(self, doc): def doc_to_text(self, doc):
return f"{doc['passage']}\nQuestion: {doc['question']}\nAnswer:" return f"{doc['passage']}\nQuestion: {doc['question']}?\nAnswer:"
def doc_to_target(self, doc): def doc_to_target(self, doc):
return " " + yesno(doc['label']) return " " + yesno(doc['label'])
......
...@@ -32,7 +32,7 @@ def test_basic_interface(taskname, task_class): ...@@ -32,7 +32,7 @@ def test_basic_interface(taskname, task_class):
limit = None limit = None
if taskname in ["triviaqa"]: if taskname in ["triviaqa"] or taskname.startswith("pile_"):
limit = 10000 limit = 10000
if task.has_validation_docs(): if task.has_validation_docs():
arr = list(islice(task.validation_docs(), limit)) arr = list(islice(task.validation_docs(), limit))
......
6577e0d88572772ef08e64f624c0e3df0953286ae1f118ccef15623b59ffeabf
\ No newline at end of file
{"results": {"boolq": {"acc": 0.5048929663608562, "acc_stderr": 0.00874463623355505}}, "versions": {"boolq": 1}}
\ No newline at end of file
09da45119b12a0144e3081f8fb790c2a22af7b9c3aac42f54423d348a711fbf5
\ No newline at end of file
{"results": {"headqa_en": {"acc": 0.23559445660102116, "acc_norm": 0.2447118891320204, "acc_norm_stderr": 0.008211629406841468, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_en": 0}}
\ No newline at end of file
767ca34d9714edd9fb030ddbcc35a64e5180d1e247b0cb557fbb22fdf971ad1f
\ No newline at end of file
{"results": {"headqa_es": {"acc": 0.23559445660102116, "acc_norm": 0.25018234865062, "acc_norm_stderr": 0.008272783230806014, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_es": 0}}
\ No newline at end of file
8a0f81661d2ab2334bbc8031fac31c0c8882f1d9271dd51599d21dfdbb726dea
\ No newline at end of file
{"results": {"wnli": {"acc": 0.5633802816901409, "acc_stderr": 0.0592793555841297}}, "versions": {"wnli": 1}}
\ No newline at end of file
976a5cac4bdb724632eebd4cb9e522203ce3da8d5525288a597c86e80469f3f2
\ No newline at end of file
{"results": {"blimp_adjunct_island": {"acc": 0.485, "acc_stderr": 0.0158121796418149}}, "versions": {"blimp_adjunct_island": 0}}
\ No newline at end of file
2d8964e56a17661502ecf3f09c0befba63915360ddf2145b0bd845816950515d
\ No newline at end of file
{"results": {"blimp_anaphor_gender_agreement": {"acc": 0.485, "acc_stderr": 0.0158121796418149}}, "versions": {"blimp_anaphor_gender_agreement": 0}}
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment