To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command.

**When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.
```bash
python main.py \
...
...
```
## Implementing new tasks
To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
## Cite as
...
...
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))
Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:
- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.
```python
description_dict = {
    "task_name_1": "description",
    "task_name_2": "description",
    ...
}
```
Note that a task's description will be separated from the few-shot examples and prompt that follow it by a newline, like so:
```python
"""
<description>
<examples>
<prompt>
"""
```
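For example, here is a minimal sketch of supplying such a dict programmatically through `evaluator.simple_evaluate`; the model and task choices below are placeholders, not a recommendation:

```python
from lm_eval import evaluator

# Hypothetical invocation; pick whatever model and tasks you actually want to run.
results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["copa"],
    num_fewshot=2,
    description_dict={
        "copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative",
    },
)
print(results["results"])
print(results["versions"])  # include these task versions when reporting results
```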
## Descriptions in File
One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. for some file at `/your/path/descriptions.json` you may have:
```json
{
"cycle_letters":"Please unscramble the letters into a word, and write that word:",
"copa":"Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```
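You could then point the CLI at that file. A minimal sketch (the model and task flags below are placeholders; only `--description_dict_path` is the relevant piece):

```bash
python main.py \
    --model gpt2 \
    --tasks copa \
    --num_fewshot 2 \
    --description_dict_path /your/path/descriptions.json
```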
These methods return `True`/`False` depending on whether your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not report it as having a test set.
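For example, a sketch for a dataset that ships train and validation splits but keeps its test labels private might look like this (assuming the standard `has_*_docs` overrides on your task class):

```python
def has_training_docs(self):
    return True

def has_validation_docs(self):
    return True

def has_test_docs(self):
    # The test labels are not publicly available, so report no test set.
    return False
```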
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return #...
...
...
```
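As one possible sketch, if your download step left one JSON-lines file per split under `DATASET_PATH`, the overrides could look like this (the file names and the `json`/`os` usage are assumptions about your task, not part of the harness API):

```python
import json
import os

def training_docs(self):
    # Read one dict per line from the hypothetical train split file.
    with open(os.path.join(self.DATASET_PATH, "train.jsonl")) as f:
        return [json.loads(line) for line in f]

def validation_docs(self):
    with open(os.path.join(self.DATASET_PATH, "valid.jsonl")) as f:
        return [json.loads(line) for line in f]
```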
<br>
In the case your task is _not_ multiple-choice, override the following methods for your task class:
Put the natural language task description here as a single-line string (no `\n`s), e.g. `"Translate English to French:"`.
```python
def fewshot_description(self):
    return ""
```
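For the translation example above, this could simply be:

```python
def fewshot_description(self):
    return "Translate English to French:"
```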
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict`, whose keys and values are `str`s. You should concatenate these values into a neatly formatted prompt.
```python
def doc_to_text(self, doc):
    ...
```
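For the question/answer document shown earlier, a sketch might be (the field names are assumptions about your dataset):

```python
def doc_to_text(self, doc):
    # Build the prompt from the doc's fields, leaving the answer out.
    return f"Question: {doc['question']}\nAnswer:"
```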
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired arguments:
```bash
python -m scripts.write_out \
    --output_base_path <path> \
    --tasks <your-task> \
    --sets <train | val | test> \
    --num_fewshot K \
    --num_examples N \
    --description_dict_path <path>
```
Open the file specified at `--output_base_path <path>` and ensure it passes a simple eye test.
For a concrete reference, the HeadQA task follows this pattern and registers per-language variants (abridged):

```python
class HeadQABase(HFTask, MultipleChoiceTask):
    # ... (dataset configuration and the document-processing method, which
    # builds and returns an `out_doc` dict, are abridged here)

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        return doc["query"]


class HeadQAEn(HeadQABase):
    DATASET_NAME = "en"


class HeadQAEs(HeadQABase):
    DATASET_NAME = "es"


# for backwards compatibility
class HeadQAEsDeprecated(HeadQABase):
    DATASET_NAME = "es"
    print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
```