Merge branch 'master' of github.com:EleutherAI/lm-evaluation-harness into bits_per_byte

ff58b389 · Leo Gao · 33315a1f · a67c17e0 · ff58b389 · ff58b389
Commit ff58b389 authored Jan 03, 2022 by Leo Gao
20 changed files
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ pip install lm-eval
 ## Basic Usage
-To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command.
+To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. **When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.
 ```bash
 python main.py \
@@ -91,7 +91,7 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
 ### Full Task List
 |                    Task Name                            |Train|Val|Test|Val/Test Docs|                                   Metrics                                    |
-|-------------------------------------------------|-----|---|----|------------:|------------------------------------------------------------------------------|
+|---------------------------------------------------------|-----|---|----|------------:|------------------------------------------------------------------------------|
 |cola                                                     |✓    |✓  |    |         1043|mcc                                                                           |
 |mnli                                                     |✓    |✓  |    |         9815|acc                                                                           |
 |mnli_mismatched                                          |✓    |✓  |    |         9832|acc                                                                           |
@@ -128,8 +128,9 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
 |openbookqa                                               |✓    |✓  |✓   |          500|acc, acc_norm                                                                 |
 |squad2                                                   |✓    |✓  |    |        11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1|
 |race                                                     |✓    |✓  |✓   |         1045|acc                                                                           |
-|headqa                                           |✓    |✓  |✓   |         2742|acc, acc_norm                                                                 |
 |mathqa                                                   |✓    |✓  |✓   |         2985|acc, acc_norm                                                                 |
+|headqa_es                                                |✓    |✓  |✓   |         2742|acc, acc_norm                                                                 |
+|headqa_en                                                |✓    |✓  |✓   |         2742|acc, acc_norm                                                                 |
 |webqs                                                    |✓    |   |✓   |         2032|acc                                                                           |
 |wsc273                                                   |     |   |✓   |          273|acc                                                                           |
 |winogrande                                               |✓    |✓  |    |         1267|acc                                                                           |
@@ -182,7 +183,7 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
 |hendrycksTest-high_school_computer_science               |✓    |✓  |✓   |          100|acc, acc_norm                                                                 |
 |hendrycksTest-high_school_european_history               |✓    |✓  |✓   |          165|acc, acc_norm                                                                 |
 |hendrycksTest-high_school_geography                      |✓    |✓  |✓   |          198|acc, acc_norm                                                                 |
-|hendrycksTest-high_school_government_and_politics|✓    |✓  |✓   |          193|acc, acc_norm                                                                 |
+|hendrycksTest-high_school_government_and_politics        |✓    |✓  |✓   |          193|acc, acc_norm                                                                 |
 |hendrycksTest-high_school_macroeconomics                 |✓    |✓  |✓   |          390|acc, acc_norm                                                                 |
 |hendrycksTest-high_school_mathematics                    |✓    |✓  |✓   |          270|acc, acc_norm                                                                 |
 |hendrycksTest-high_school_microeconomics                 |✓    |✓  |✓   |          238|acc, acc_norm                                                                 |
@@ -272,9 +273,74 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
 |pile_uspto                                               |     |✓  |✓   |        11415|word_perplexity, byte_perplexity, bits_per_byte                               |
 |pile_ubuntu-irc                                          |     |✓  |✓   |           22|word_perplexity, byte_perplexity, bits_per_byte                               |
 |pile_wikipedia                                           |     |✓  |✓   |        17511|word_perplexity, byte_perplexity, bits_per_byte                               |
-|pile_youtubesubtitles                            |     |✓  |✓   |          342|word_perplexity, byte_perplexity, bits_per_byte                               |
+|pile_youtubesubtitles                                    |     |✓  |✓   |          342|word_perplexity, byte_perplexity, bits_per_byte                               |
+|blimp_adjunct_island                                     |     |✓  |    |         1000|acc
+|blimp_anaphor_gender_agreement                           |     |✓  |    |         1000|acc
+|blimp_anaphor_number_agreement                           |     |✓  |    |         1000|acc
+|blimp_animate_subject_passive                            |     |✓  |    |         1000|acc
+|blimp_animate_subject_trans                              |     |✓  |    |         1000|acc
+|blimp_causative                                          |     |✓  |    |         1000|acc
+|blimp_complex_NP_island                                  |     |✓  |    |         1000|acc
+|blimp_coordinate_structure_constraint_complex_left_branch|     |✓  |    |         1000|acc
+|blimp_coordinate_structure_constraint_object_extraction  |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_1                        |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_2                        |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_irregular_1              |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_irregular_2              |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_with_adj_2               |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_with_adj_irregular_1     |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_with_adj_irregular_2     |     |✓  |    |         1000|acc
+|blimp_determiner_noun_agreement_with_adjective_1         |     |✓  |    |         1000|acc
+|blimp_distractor_agreement_relational_noun               |     |✓  |    |         1000|acc
+|blimp_distractor_agreement_relative_clause               |     |✓  |    |         1000|acc
+|blimp_drop_argument                                      |     |✓  |    |         1000|acc
+|blimp_ellipsis_n_bar_1                                   |     |✓  |    |         1000|acc
+|blimp_ellipsis_n_bar_2                                   |     |✓  |    |         1000|acc
+|blimp_existential_there_object_raising                   |     |✓  |    |         1000|acc
+|blimp_existential_there_quantifiers_1                    |     |✓  |    |         1000|acc
+|blimp_existential_there_quantifiers_2                    |     |✓  |    |         1000|acc
+|blimp_existential_there_subject_raising                  |     |✓  |    |         1000|acc
+|blimp_expletive_it_object_raising                        |     |✓  |    |         1000|acc
+|blimp_inchoative                                         |     |✓  |    |         1000|acc
+|blimp_intransitive                                       |     |✓  |    |         1000|acc
+|blimp_irregular_past_participle_adjectives               |     |✓  |    |         1000|acc
+|blimp_irregular_past_participle_verbs                    |     |✓  |    |         1000|acc
+|blimp_irregular_plural_subject_verb_agreement_1          |     |✓  |    |         1000|acc
+|blimp_irregular_plural_subject_verb_agreement_2          |     |✓  |    |         1000|acc
+|blimp_left_branch_island_echo_question                   |     |✓  |    |         1000|acc
+|blimp_left_branch_island_simple_question                 |     |✓  |    |         1000|acc
+|blimp_matrix_question_npi_licensor_present               |     |✓  |    |         1000|acc
+|blimp_npi_present_1                                      |     |✓  |    |         1000|acc
+|blimp_npi_present_2                                      |     |✓  |    |         1000|acc
+|blimp_only_npi_licensor_present                          |     |✓  |    |         1000|acc
+|blimp_only_npi_scope                                     |     |✓  |    |         1000|acc
+|blimp_passive_1                                          |     |✓  |    |         1000|acc
+|blimp_passive_2                                          |     |✓  |    |         1000|acc
+|blimp_principle_A_c_command                              |     |✓  |    |         1000|acc
+|blimp_principle_A_case_1                                 |     |✓  |    |         1000|acc
+|blimp_principle_A_case_2                                 |     |✓  |    |         1000|acc
+|blimp_principle_A_domain_1                               |     |✓  |    |         1000|acc
+|blimp_principle_A_domain_2                               |     |✓  |    |         1000|acc
+|blimp_principle_A_domain_3                               |     |✓  |    |         1000|acc
+|blimp_principle_A_reconstruction                         |     |✓  |    |         1000|acc
+|blimp_regular_plural_subject_verb_agreement_1            |     |✓  |    |         1000|acc
+|blimp_regular_plural_subject_verb_agreement_2            |     |✓  |    |         1000|acc
+|blimp_sentential_negation_npi_licensor_present           |     |✓  |    |         1000|acc
+|blimp_sentential_negation_npi_scope                      |     |✓  |    |         1000|acc
+|blimp_sentential_subject_island                          |     |✓  |    |         1000|acc
+|blimp_superlative_quantifiers_1                          |     |✓  |    |         1000|acc
+|blimp_superlative_quantifiers_2                          |     |✓  |    |         1000|acc
+|blimp_tough_vs_raising_1                                 |     |✓  |    |         1000|acc
+|blimp_tough_vs_raising_2                                 |     |✓  |    |         1000|acc
+|blimp_transitive                                         |     |✓  |    |         1000|acc
+|blimp_wh_island                                          |     |✓  |    |         1000|acc
+|blimp_wh_questions_object_gap                            |     |✓  |    |         1000|acc
+|blimp_wh_questions_subject_gap                           |     |✓  |    |         1000|acc
+|blimp_wh_questions_subject_gap_long_distance             |     |✓  |    |         1000|acc
+|blimp_wh_vs_that_no_gap                                  |     |✓  |    |         1000|acc
+|blimp_wh_vs_that_no_gap_long_distance                    |     |✓  |    |         1000|acc
+|blimp_wh_vs_that_with_gap                                |     |✓  |    |         1000|acc
+|blimp_wh_vs_that_with_gap_long_distance                  |     |✓  |    |         1000|acc
 ## Usage
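
Review note on the README hunk: the new sentence asks that scores be reported together with `results["versions"]`. A minimal sketch of doing that is shown here. The JSON literal copies the shape of `tests/testdata/boolq-v1-res.json` from this commit; the helper name `format_report` is hypothetical and not part of the harness.

```python
import json

# Shape copied from tests/testdata/boolq-v1-res.json in this commit.
RESULTS_JSON = (
    '{"results": {"boolq": {"acc": 0.5048929663608562, '
    '"acc_stderr": 0.00874463623355505}}, "versions": {"boolq": 1}}'
)


def format_report(results):
    """Return one line per task metric, tagged with the task version.

    Hypothetical helper for illustration only.
    """
    lines = []
    for task, metrics in sorted(results["results"].items()):
        version = results["versions"][task]
        for metric, value in sorted(metrics.items()):
            if metric.endswith("_stderr"):
                continue  # standard errors are printed next to their metric
            stderr = metrics.get(metric + "_stderr")
            suffix = "" if stderr is None else f" +/- {stderr:.4f}"
            lines.append(f"{task} (v{version}) {metric}: {value:.4f}{suffix}")
    return lines


report = format_report(json.loads(RESULTS_JSON))
print("\n".join(report))
```

Keeping the version next to each number makes it clear which revision of a task (for example `boolq` v0 versus v1 in this commit) produced the score.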

--- a/lm_eval/tasks/__init__.py
+++ b/lm_eval/tasks/__init__.py
@@ -44,6 +44,7 @@ from . import wikitext
 from . import lambada_multilingual
 from . import mutual
 from . import truthfulqa
+from . import blimp
 ########################################
 # Translation tasks
@@ -132,7 +133,9 @@ TASK_REGISTRY = {
    "squad2": squad.SQuAD2,
    "race": race.RACE,
    # "naturalqs": naturalqs.NaturalQs, # not implemented yet
-    "headqa": headqa.HeadQA,
+    "headqa": headqa.HeadQAEsDeprecated, # for backwards compat - headqa used to default to es
+    "headqa_es": headqa.HeadQAEs,
+    "headqa_en": headqa.HeadQAEn,
    "mathqa": mathqa.MathQA,
    "webqs": webqs.WebQs,
    "wsc273": wsc273.WinogradSchemaChallenge273,
@@ -217,6 +220,74 @@ TASK_REGISTRY = {
    "pile_wikipedia": pile.PileWikipedia,
    "pile_youtubesubtitles": pile.PileYoutubeSubtitles,
+    # BLiMP
+    "blimp_adjunct_island": blimp.BlimpAdjunctIsland,
+    "blimp_anaphor_gender_agreement": blimp.BlimpAnaphorGenderAgreement,
+    "blimp_anaphor_number_agreement": blimp.BlimpAnaphorNumberAgreement,
+    "blimp_animate_subject_passive": blimp.BlimpAnimateSubjectPassive,
+    "blimp_animate_subject_trans": blimp.BlimpAnimateSubjectTrans,
+    "blimp_causative": blimp.BlimpCausative,
+    "blimp_complex_NP_island": blimp.BlimpComplex_NPIsland,
+    "blimp_coordinate_structure_constraint_complex_left_branch": blimp.BlimpCoordinateStructureConstraintComplexLeftBranch,
+    "blimp_coordinate_structure_constraint_object_extraction": blimp.BlimpCoordinateStructureConstraintObjectExtraction,
+    "blimp_determiner_noun_agreement_1": blimp.BlimpDeterminerNounAgreement_1,
+    "blimp_determiner_noun_agreement_2": blimp.BlimpDeterminerNounAgreement_2,
+    "blimp_determiner_noun_agreement_irregular_1": blimp.BlimpDeterminerNounAgreementIrregular_1,
+    "blimp_determiner_noun_agreement_irregular_2": blimp.BlimpDeterminerNounAgreementIrregular_2,
+    "blimp_determiner_noun_agreement_with_adj_2": blimp.BlimpDeterminerNounAgreementWithAdj_2,
+    "blimp_determiner_noun_agreement_with_adj_irregular_1": blimp.BlimpDeterminerNounAgreementWithAdjIrregular_1,
+    "blimp_determiner_noun_agreement_with_adj_irregular_2": blimp.BlimpDeterminerNounAgreementWithAdjIrregular_2,
+    "blimp_determiner_noun_agreement_with_adjective_1": blimp.BlimpDeterminerNounAgreementWithAdjective_1,
+    "blimp_distractor_agreement_relational_noun": blimp.BlimpDistractorAgreementRelationalNoun,
+    "blimp_distractor_agreement_relative_clause": blimp.BlimpDistractorAgreementRelativeClause,
+    "blimp_drop_argument": blimp.BlimpDropArgument,
+    "blimp_ellipsis_n_bar_1": blimp.BlimpEllipsisNBar_1,
+    "blimp_ellipsis_n_bar_2": blimp.BlimpEllipsisNBar_2,
+    "blimp_existential_there_object_raising": blimp.BlimpExistentialThereObjectRaising,
+    "blimp_existential_there_quantifiers_1": blimp.BlimpExistentialThereQuantifiers_1,
+    "blimp_existential_there_quantifiers_2": blimp.BlimpExistentialThereQuantifiers_2,
+    "blimp_existential_there_subject_raising": blimp.BlimpExistentialThereSubjectRaising,
+    "blimp_expletive_it_object_raising": blimp.BlimpExpletiveItObjectRaising,
+    "blimp_inchoative": blimp.BlimpInchoative,
+    "blimp_intransitive": blimp.BlimpIntransitive,
+    "blimp_irregular_past_participle_adjectives": blimp.BlimpIrregularPastParticipleAdjectives,
+    "blimp_irregular_past_participle_verbs": blimp.BlimpIrregularPastParticipleVerbs,
+    "blimp_irregular_plural_subject_verb_agreement_1": blimp.BlimpIrregularPluralSubjectVerbAgreement_1,
+    "blimp_irregular_plural_subject_verb_agreement_2": blimp.BlimpIrregularPluralSubjectVerbAgreement_2,
+    "blimp_left_branch_island_echo_question": blimp.BlimpLeftBranchIslandEchoQuestion,
+    "blimp_left_branch_island_simple_question": blimp.BlimpLeftBranchIslandSimpleQuestion,
+    "blimp_matrix_question_npi_licensor_present": blimp.BlimpMatrixQuestionNpiLicensorPresent,
+    "blimp_npi_present_1": blimp.BlimpNpiPresent_1,
+    "blimp_npi_present_2": blimp.BlimpNpiPresent_2,
+    "blimp_only_npi_licensor_present": blimp.BlimpOnlyNpiLicensorPresent,
+    "blimp_only_npi_scope": blimp.BlimpOnlyNpiScope,
+    "blimp_passive_1": blimp.BlimpPassive_1,
+    "blimp_passive_2": blimp.BlimpPassive_2,
+    "blimp_principle_A_c_command": blimp.BlimpPrinciple_ACCommand,
+    "blimp_principle_A_case_1": blimp.BlimpPrinciple_ACase_1,
+    "blimp_principle_A_case_2": blimp.BlimpPrinciple_ACase_2,
+    "blimp_principle_A_domain_1": blimp.BlimpPrinciple_ADomain_1,
+    "blimp_principle_A_domain_2": blimp.BlimpPrinciple_ADomain_2,
+    "blimp_principle_A_domain_3": blimp.BlimpPrinciple_ADomain_3,
+    "blimp_principle_A_reconstruction": blimp.BlimpPrinciple_AReconstruction,
+    "blimp_regular_plural_subject_verb_agreement_1": blimp.BlimpRegularPluralSubjectVerbAgreement_1,
+    "blimp_regular_plural_subject_verb_agreement_2": blimp.BlimpRegularPluralSubjectVerbAgreement_2,
+    "blimp_sentential_negation_npi_licensor_present": blimp.BlimpSententialNegationNpiLicensorPresent,
+    "blimp_sentential_negation_npi_scope": blimp.BlimpSententialNegationNpiScope,
+    "blimp_sentential_subject_island": blimp.BlimpSententialSubjectIsland,
+    "blimp_superlative_quantifiers_1": blimp.BlimpSuperlativeQuantifiers_1,
+    "blimp_superlative_quantifiers_2": blimp.BlimpSuperlativeQuantifiers_2,
+    "blimp_tough_vs_raising_1": blimp.BlimpToughVsRaising_1,
+    "blimp_tough_vs_raising_2": blimp.BlimpToughVsRaising_2,
+    "blimp_transitive": blimp.BlimpTransitive,
+    "blimp_wh_island": blimp.BlimpWhIsland,
+    "blimp_wh_questions_object_gap": blimp.BlimpWhQuestionsObjectGap,
+    "blimp_wh_questions_subject_gap": blimp.BlimpWhQuestionsSubjectGap,
+    "blimp_wh_questions_subject_gap_long_distance": blimp.BlimpWhQuestionsSubjectGapLongDistance,
+    "blimp_wh_vs_that_no_gap": blimp.BlimpWhVsThatNoGap,
+    "blimp_wh_vs_that_no_gap_long_distance": blimp.BlimpWhVsThatNoGapLongDistance,
+    "blimp_wh_vs_that_with_gap": blimp.BlimpWhVsThatWithGap,
+    "blimp_wh_vs_that_with_gap_long_distance": blimp.BlimpWhVsThatWithGapLongDistance,
 }

--- a/lm_eval/tasks/blimp.py
+++ b/lm_eval/tasks/blimp.py
+"""
+BLiMP: A Benchmark of Linguistic Minimal Pairs for English
+https://arxiv.org/abs/1912.00582
+@article{warstadt2019blimp,
+  title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
+  author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
+  journal={arXiv preprint arXiv:1912.00582},
+  year={2019}
+}
+"""
+from lm_eval.base import rf
+from lm_eval.metrics import mean
+from .common import HFTask
+class BlimpTask(HFTask):
+    VERSION = 0
+    DATASET_PATH = "blimp"
+    def download(self):
+        super().download()
+        # The HF dataset only contains a "train" dataset, but the harness expects a "validation"
+        # dataset. Let's use the training dataset, on the assumption that the model wasn't actually
+        # trained on this data.
+        self.data["validation"] = self.data["train"]
+        del self.data["train"]
+    def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
+        assert num_fewshot == 0
+        assert not provide_description
+        return ""
+    def doc_to_text(self, doc):
+        # this method is invoked by tests only
+        return ""
+    def doc_to_target(self, doc):
+        # this method is invoked by tests only
+        return ""
+    def construct_requests(self, doc, ctx):
+        assert not ctx
+        # Calculate the loglikelihood for the good and the bad sentence.
+        # Note that loglikelihood translates the "" prefix to the "<|endoftext|>" token
+        return [
+            rf.loglikelihood("", doc["sentence_good"]),
+            rf.loglikelihood("", doc["sentence_bad"]),
+        ]
+    def process_results(self, doc, results):
+        likelihood1, likelihood2 = results
+        # the model got this case right iff the good sentence scored higher than the bad sentence
+        acc = 1.0 if likelihood1 > likelihood2 else 0.0
+        return {
+            "acc": acc,
+        }
+    def higher_is_better(self):
+        return {
+            "acc": True,
+        }
+    def aggregation(self):
+        return {
+            "acc": mean,
+        }
+class BlimpAdjunctIsland(BlimpTask):
+    DATASET_NAME = "adjunct_island"
+class BlimpAnaphorGenderAgreement(BlimpTask):
+    DATASET_NAME = "anaphor_gender_agreement"
+class BlimpAnaphorNumberAgreement(BlimpTask):
+    DATASET_NAME = "anaphor_number_agreement"
+class BlimpAnimateSubjectPassive(BlimpTask):
+    DATASET_NAME = "animate_subject_passive"
+class BlimpAnimateSubjectTrans(BlimpTask):
+    DATASET_NAME = "animate_subject_trans"
+class BlimpCausative(BlimpTask):
+    DATASET_NAME = "causative"
+class BlimpComplex_NPIsland(BlimpTask):
+    DATASET_NAME = "complex_NP_island"
+class BlimpCoordinateStructureConstraintComplexLeftBranch(BlimpTask):
+    DATASET_NAME = "coordinate_structure_constraint_complex_left_branch"
+class BlimpCoordinateStructureConstraintObjectExtraction(BlimpTask):
+    DATASET_NAME = "coordinate_structure_constraint_object_extraction"
+class BlimpDeterminerNounAgreement_1(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_1"
+class BlimpDeterminerNounAgreement_2(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_2"
+class BlimpDeterminerNounAgreementIrregular_1(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_irregular_1"
+class BlimpDeterminerNounAgreementIrregular_2(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_irregular_2"
+class BlimpDeterminerNounAgreementWithAdj_2(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_with_adj_2"
+class BlimpDeterminerNounAgreementWithAdjIrregular_1(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_with_adj_irregular_1"
+class BlimpDeterminerNounAgreementWithAdjIrregular_2(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_with_adj_irregular_2"
+class BlimpDeterminerNounAgreementWithAdjective_1(BlimpTask):
+    DATASET_NAME = "determiner_noun_agreement_with_adjective_1"
+class BlimpDistractorAgreementRelationalNoun(BlimpTask):
+    DATASET_NAME = "distractor_agreement_relational_noun"
+class BlimpDistractorAgreementRelativeClause(BlimpTask):
+    DATASET_NAME = "distractor_agreement_relative_clause"
+class BlimpDropArgument(BlimpTask):
+    DATASET_NAME = "drop_argument"
+class BlimpEllipsisNBar_1(BlimpTask):
+    DATASET_NAME = "ellipsis_n_bar_1"
+class BlimpEllipsisNBar_2(BlimpTask):
+    DATASET_NAME = "ellipsis_n_bar_2"
+class BlimpExistentialThereObjectRaising(BlimpTask):
+    DATASET_NAME = "existential_there_object_raising"
+class BlimpExistentialThereQuantifiers_1(BlimpTask):
+    DATASET_NAME = "existential_there_quantifiers_1"
+class BlimpExistentialThereQuantifiers_2(BlimpTask):
+    DATASET_NAME = "existential_there_quantifiers_2"
+class BlimpExistentialThereSubjectRaising(BlimpTask):
+    DATASET_NAME = "existential_there_subject_raising"
+class BlimpExpletiveItObjectRaising(BlimpTask):
+    DATASET_NAME = "expletive_it_object_raising"
+class BlimpInchoative(BlimpTask):
+    DATASET_NAME = "inchoative"
+class BlimpIntransitive(BlimpTask):
+    DATASET_NAME = "intransitive"
+class BlimpIrregularPastParticipleAdjectives(BlimpTask):
+    DATASET_NAME = "irregular_past_participle_adjectives"
+class BlimpIrregularPastParticipleVerbs(BlimpTask):
+    DATASET_NAME = "irregular_past_participle_verbs"
+class BlimpIrregularPluralSubjectVerbAgreement_1(BlimpTask):
+    DATASET_NAME = "irregular_plural_subject_verb_agreement_1"
+class BlimpIrregularPluralSubjectVerbAgreement_2(BlimpTask):
+    DATASET_NAME = "irregular_plural_subject_verb_agreement_2"
+class BlimpLeftBranchIslandEchoQuestion(BlimpTask):
+    DATASET_NAME = "left_branch_island_echo_question"
+class BlimpLeftBranchIslandSimpleQuestion(BlimpTask):
+    DATASET_NAME = "left_branch_island_simple_question"
+class BlimpMatrixQuestionNpiLicensorPresent(BlimpTask):
+    DATASET_NAME = "matrix_question_npi_licensor_present"
+class BlimpNpiPresent_1(BlimpTask):
+    DATASET_NAME = "npi_present_1"
+class BlimpNpiPresent_2(BlimpTask):
+    DATASET_NAME = "npi_present_2"
+class BlimpOnlyNpiLicensorPresent(BlimpTask):
+    DATASET_NAME = "only_npi_licensor_present"
+class BlimpOnlyNpiScope(BlimpTask):
+    DATASET_NAME = "only_npi_scope"
+class BlimpPassive_1(BlimpTask):
+    DATASET_NAME = "passive_1"
+class BlimpPassive_2(BlimpTask):
+    DATASET_NAME = "passive_2"
+class BlimpPrinciple_ACCommand(BlimpTask):
+    DATASET_NAME = "principle_A_c_command"
+class BlimpPrinciple_ACase_1(BlimpTask):
+    DATASET_NAME = "principle_A_case_1"
+class BlimpPrinciple_ACase_2(BlimpTask):
+    DATASET_NAME = "principle_A_case_2"
+class BlimpPrinciple_ADomain_1(BlimpTask):
+    DATASET_NAME = "principle_A_domain_1"
+class BlimpPrinciple_ADomain_2(BlimpTask):
+    DATASET_NAME = "principle_A_domain_2"
+class BlimpPrinciple_ADomain_3(BlimpTask):
+    DATASET_NAME = "principle_A_domain_3"
+class BlimpPrinciple_AReconstruction(BlimpTask):
+    DATASET_NAME = "principle_A_reconstruction"
+class BlimpRegularPluralSubjectVerbAgreement_1(BlimpTask):
+    DATASET_NAME = "regular_plural_subject_verb_agreement_1"
+class BlimpRegularPluralSubjectVerbAgreement_2(BlimpTask):
+    DATASET_NAME = "regular_plural_subject_verb_agreement_2"
+class BlimpSententialNegationNpiLicensorPresent(BlimpTask):
+    DATASET_NAME = "sentential_negation_npi_licensor_present"
+class BlimpSententialNegationNpiScope(BlimpTask):
+    DATASET_NAME = "sentential_negation_npi_scope"
+class BlimpSententialSubjectIsland(BlimpTask):
+    DATASET_NAME = "sentential_subject_island"
+class BlimpSuperlativeQuantifiers_1(BlimpTask):
+    DATASET_NAME = "superlative_quantifiers_1"
+class BlimpSuperlativeQuantifiers_2(BlimpTask):
+    DATASET_NAME = "superlative_quantifiers_2"
+class BlimpToughVsRaising_1(BlimpTask):
+    DATASET_NAME = "tough_vs_raising_1"
+class BlimpToughVsRaising_2(BlimpTask):
+    DATASET_NAME = "tough_vs_raising_2"
+class BlimpTransitive(BlimpTask):
+    DATASET_NAME = "transitive"
+class BlimpWhIsland(BlimpTask):
+    DATASET_NAME = "wh_island"
+class BlimpWhQuestionsObjectGap(BlimpTask):
+    DATASET_NAME = "wh_questions_object_gap"
+class BlimpWhQuestionsSubjectGap(BlimpTask):
+    DATASET_NAME = "wh_questions_subject_gap"
+class BlimpWhQuestionsSubjectGapLongDistance(BlimpTask):
+    DATASET_NAME = "wh_questions_subject_gap_long_distance"
+class BlimpWhVsThatNoGap(BlimpTask):
+    DATASET_NAME = "wh_vs_that_no_gap"
+class BlimpWhVsThatNoGapLongDistance(BlimpTask):
+    DATASET_NAME = "wh_vs_that_no_gap_long_distance"
+class BlimpWhVsThatWithGap(BlimpTask):
+    DATASET_NAME = "wh_vs_that_with_gap"
+class BlimpWhVsThatWithGapLongDistance(BlimpTask):
+    DATASET_NAME = "wh_vs_that_with_gap_long_distance"
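
Review note on `blimp.py`: `process_results` counts a pair as correct only when the good sentence scores strictly higher than the bad one, and `aggregation` averages those 0/1 values. The sketch below reproduces that rule outside the harness. The bigram table and `toy_loglikelihood` are hypothetical stand-ins for a real language model, chosen only so the example runs on its own.

```python
# Hypothetical toy scorer: a tiny bigram log-probability table.
# Any word pair not listed gets the same default score.
BIGRAM_LOGPROBS = {
    ("cat", "sleeps"): -1.0,
    ("cat", "sleep"): -4.0,
    ("cats", "sleep"): -1.0,
    ("cats", "sleeps"): -4.0,
}
DEFAULT_LOGPROB = -2.0


def toy_loglikelihood(sentence):
    """Sum bigram log-probabilities over adjacent word pairs."""
    words = sentence.lower().split()
    return sum(
        BIGRAM_LOGPROBS.get(pair, DEFAULT_LOGPROB)
        for pair in zip(words, words[1:])
    )


def minimal_pair_accuracy(pairs, loglikelihood):
    """Fraction of (good, bad) pairs where the good sentence scores strictly higher."""
    correct = sum(
        1.0 if loglikelihood(good) > loglikelihood(bad) else 0.0
        for good, bad in pairs
    )
    return correct / len(pairs)


pairs = [
    ("The cat sleeps", "The cat sleep"),
    ("The cats sleep", "The cats sleeps"),
    ("The dog barks", "The dog bark"),  # unseen words: both sentences tie
]
accuracy = minimal_pair_accuracy(pairs, toy_loglikelihood)
print(accuracy)
```

The third pair ties under the toy scorer, so it counts as wrong; the strict `>` comparison in the diff has the same effect.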
--- a/lm_eval/tasks/glue.py
+++ b/lm_eval/tasks/glue.py
@@ -227,7 +227,7 @@ class QNLI(HFTask):
 class WNLI(HFTask):
-    VERSION = 0
+    VERSION = 1
    DATASET_PATH = "glue"
    DATASET_NAME = "wnli"
@@ -241,26 +241,25 @@ class WNLI(HFTask):
        return False
    def doc_to_text(self, doc):
-        return "{}\nQuestion: {} True, False or Neither?\nAnswer:".format(
+        return "{}\nQuestion: {} True or False?\nAnswer:".format(
            doc["sentence1"],
            doc["sentence2"],
        )
    def doc_to_target(self, doc):
        # True = entailment
-        # False = contradiction
+        # False = not_entailment
-        # Neither = neutral
+        return " {}".format({0: "False", 1: "True"}[doc["label"]])
-        return " {}".format({0: "True", 1: "Neither", 2: "False"}[doc["label"]])
    def construct_requests(self, doc, ctx):
        ll_true, _ = rf.loglikelihood(ctx, " True")
-        ll_neither, _ = rf.loglikelihood(ctx, " Neither")
        ll_false, _ = rf.loglikelihood(ctx, " False")
-        return ll_true, ll_neither, ll_false
+        return ll_true, ll_false
    def process_results(self, doc, results):
+        ll_true, ll_false = results
+        pred = ll_true > ll_false
        gold = doc["label"]
-        pred = np.argmax(results)
        return {
            "acc": pred == gold
        }
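
Review note on the WNLI change: version 1 compares only two answers, and the boolean prediction is matched against the integer label (1 for "True", 0 for "False", per the new `doc_to_target`). A small sketch of that comparison, with a hypothetical function name and made-up log-likelihood values:

```python
def wnli_is_correct(ll_true, ll_false, gold_label):
    """Return True when the higher-scoring answer matches the gold label.

    gold_label follows the mapping in the diff: 1 -> "True", 0 -> "False".
    """
    pred = ll_true > ll_false  # bool; in Python True == 1 and False == 0
    return pred == gold_label


# Made-up log-likelihoods for illustration.
print(wnli_is_correct(-0.5, -1.2, 1))  # " True" scores higher, gold is 1
print(wnli_is_correct(-0.5, -1.2, 0))  # " True" scores higher, gold is 0
print(wnli_is_correct(-2.0, -1.0, 0))  # " False" scores higher, gold is 0
```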

--- a/lm_eval/tasks/headqa.py
+++ b/lm_eval/tasks/headqa.py
@@ -2,10 +2,9 @@ from . common import HFTask
 from lm_eval.base import MultipleChoiceTask
-class HeadQA(HFTask, MultipleChoiceTask):
+class HeadQABase(HFTask, MultipleChoiceTask):
    VERSION = 0
    DATASET_PATH = "head_qa"
-    DATASET_NAME = None
    def has_training_docs(self):
        return True
@@ -31,3 +30,15 @@ class HeadQA(HFTask, MultipleChoiceTask):
    def doc_to_text(self, doc):
        return doc["query"]
+class HeadQAEn(HeadQABase):
+    DATASET_NAME = "en"
+class HeadQAEs(HeadQABase):
+    DATASET_NAME = "es"
+# for backwards compatibility
+class HeadQAEsDeprecated(HeadQABase):
+    DATASET_NAME = "es"
+    print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
\ No newline at end of file
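
Review note on `HeadQAEsDeprecated`: a statement placed directly in a class body runs when the class is defined, so the `print` above fires whenever the module is imported, not only when the `headqa` task is selected. One possible alternative, which is not what this commit does, is to warn at instantiation. The sketch uses a hypothetical stub base class so it runs without the harness.

```python
import warnings


class HeadQABaseStub:
    """Hypothetical stand-in for HeadQABase so the sketch is self-contained."""

    DATASET_NAME = None


class HeadQAEsDeprecatedSketch(HeadQABaseStub):
    """Illustrative variant that warns only when the task is constructed."""

    DATASET_NAME = "es"

    def __init__(self):
        super().__init__()
        warnings.warn(
            "headqa is deprecated. Please use headqa_es or headqa_en instead.",
            DeprecationWarning,
            stacklevel=2,
        )


# Capture the warning to show that it is raised on construction.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    task = HeadQAEsDeprecatedSketch()

print(len(caught), caught[0].category.__name__)
```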
--- a/lm_eval/tasks/pile.py
+++ b/lm_eval/tasks/pile.py
@@ -18,9 +18,12 @@ class PilePerplexityTask(PerplexityTask, abc.ABC):
    def download(self):
        # TODO: separate pile val/test out by component so we don't have to scan the entire file once per set
+        if not os.path.exists("data/pile/test.jsonl.zst"):
+            # TODO: use the new best_download fallback API
            os.makedirs("data/pile/", exist_ok=True)
-        download_file("https://the-eye.eu/public/AI/pile/val.jsonl.zst", self.VAL_PATH, "264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92")
+            download_file("http://eaidata.bmk.sh/data/pile/val.jsonl.zst", self.VAL_PATH, "264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92")
-        download_file("https://the-eye.eu/public/AI/pile/test.jsonl.zst", self.TEST_PATH, "0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e")
+            download_file("http://eaidata.bmk.sh/data/pile/test.jsonl.zst", self.TEST_PATH, "0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e")
    def validation_docs(self):
        rdr = lm_dataformat.Reader(self.VAL_PATH)
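
Review note on the `pile.py` hunk: the new guard skips both downloads when `data/pile/test.jsonl.zst` exists, without looking at the validation file. A stricter per-file check could compare each file against its SHA-256 digest. The sketch below uses only the standard library; `sha256_of` and `needs_download` are hypothetical helpers, not part of the harness or of `best_download`.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_sha256):
    """True when the file is missing or its checksum does not match."""
    return not os.path.exists(path) or sha256_of(path) != expected_sha256


# Demonstration on a temporary file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "val.jsonl.zst")
    expected = hashlib.sha256(b"example bytes").hexdigest()

    missing = needs_download(path, expected)
    with open(path, "wb") as f:
        f.write(b"example bytes")
    present = needs_download(path, expected)
    with open(path, "wb") as f:
        f.write(b"corrupted")
    corrupted = needs_download(path, expected)

print(missing, present, corrupted)
```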

--- a/lm_eval/tasks/superglue.py
+++ b/lm_eval/tasks/superglue.py
@@ -13,7 +13,7 @@ from ..utils import general_detokenize
 class BoolQ(HFTask):
-    VERSION = 0
+    VERSION = 1
    DATASET_PATH = "super_glue"
    DATASET_NAME = "boolq"
@@ -31,7 +31,7 @@ class BoolQ(HFTask):
        return "Read the following passages and answer each question with a yes or a no."
    def doc_to_text(self, doc):
-        return f"{doc['passage']}\nQuestion: {doc['question']}\nAnswer:"
+        return f"{doc['passage']}\nQuestion: {doc['question']}?\nAnswer:"
    def doc_to_target(self, doc):
        return " " + yesno(doc['label']) 

--- a/tests/test_tasks.py
+++ b/tests/test_tasks.py
@@ -32,7 +32,7 @@ def test_basic_interface(taskname, task_class):
    limit = None
-    if taskname in ["triviaqa"]:
+    if taskname in ["triviaqa"] or taskname.startswith("pile_"):
        limit = 10000
    if task.has_validation_docs():
        arr = list(islice(task.validation_docs(), limit))

--- a/tests/testdata/boolq-v1-loglikelihood
+++ b/tests/testdata/boolq-v1-loglikelihood
+6577e0d88572772ef08e64f624c0e3df0953286ae1f118ccef15623b59ffeabf
\ No newline at end of file
--- a/tests/testdata/boolq-v1-res.json
+++ b/tests/testdata/boolq-v1-res.json
+{"results": {"boolq": {"acc": 0.5048929663608562, "acc_stderr": 0.00874463623355505}}, "versions": {"boolq": 1}}
\ No newline at end of file
--- a/tests/testdata/headqa_en-v0-loglikelihood
+++ b/tests/testdata/headqa_en-v0-loglikelihood
+09da45119b12a0144e3081f8fb790c2a22af7b9c3aac42f54423d348a711fbf5
\ No newline at end of file
--- a/tests/testdata/headqa_en-v0-res.json
+++ b/tests/testdata/headqa_en-v0-res.json
+{"results": {"headqa_en": {"acc": 0.23559445660102116, "acc_norm": 0.2447118891320204, "acc_norm_stderr": 0.008211629406841468, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_en": 0}}
\ No newline at end of file
--- a/tests/testdata/headqa_es-v0-loglikelihood
+++ b/tests/testdata/headqa_es-v0-loglikelihood
+767ca34d9714edd9fb030ddbcc35a64e5180d1e247b0cb557fbb22fdf971ad1f
\ No newline at end of file
--- a/tests/testdata/headqa_es-v0-res.json
+++ b/tests/testdata/headqa_es-v0-res.json
+{"results": {"headqa_es": {"acc": 0.23559445660102116, "acc_norm": 0.25018234865062, "acc_norm_stderr": 0.008272783230806014, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_es": 0}}
\ No newline at end of file
--- a/tests/testdata/wnli-v1-loglikelihood
+++ b/tests/testdata/wnli-v1-loglikelihood
+8a0f81661d2ab2334bbc8031fac31c0c8882f1d9271dd51599d21dfdbb726dea
\ No newline at end of file
--- a/tests/testdata/wnli-v1-res.json
+++ b/tests/testdata/wnli-v1-res.json
+{"results": {"wnli": {"acc": 0.5633802816901409, "acc_stderr": 0.0592793555841297}}, "versions": {"wnli": 1}}
\ No newline at end of file
--- a/tests/tests/testdata/blimp_adjunct_island-v0-loglikelihood
+++ b/tests/tests/testdata/blimp_adjunct_island-v0-loglikelihood
+976a5cac4bdb724632eebd4cb9e522203ce3da8d5525288a597c86e80469f3f2
\ No newline at end of file
--- a/tests/tests/testdata/blimp_adjunct_island-v0-res.json
+++ b/tests/tests/testdata/blimp_adjunct_island-v0-res.json
+{"results": {"blimp_adjunct_island": {"acc": 0.485, "acc_stderr": 0.0158121796418149}}, "versions": {"blimp_adjunct_island": 0}}
\ No newline at end of file
--- a/tests/tests/testdata/blimp_anaphor_gender_agreement-v0-loglikelihood
+++ b/tests/tests/testdata/blimp_anaphor_gender_agreement-v0-loglikelihood
+2d8964e56a17661502ecf3f09c0befba63915360ddf2145b0bd845816950515d
\ No newline at end of file
--- a/tests/tests/testdata/blimp_anaphor_gender_agreement-v0-res.json
+++ b/tests/tests/testdata/blimp_anaphor_gender_agreement-v0-res.json
+{"results": {"blimp_anaphor_gender_agreement": {"acc": 0.485, "acc_stderr": 0.0158121796418149}}, "versions": {"blimp_anaphor_gender_agreement": 0}}
\ No newline at end of file