Unverified Commit 170ae096 authored by Leo Gao's avatar Leo Gao Committed by GitHub

Merge pull request #226 from jon-tow/evaluator-description-option

Replace `fewshot_description` API with a `description_dict` based interface
parents 8728710c 02a4def2
......@@ -55,7 +55,7 @@ To evaluate mesh-transformer-jax models that are not available on HF, please inv
## Implementing new tasks
To implement a new task in eval harness, see [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/task-guide.md).
To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
## Cite as
......@@ -364,7 +364,6 @@ To inspect what the LM inputs look like, you can run the following command:
```bash
python write_out.py \
    --tasks all_tasks \
    --provide_description \
    --num_fewshot 5 \
    --num_examples 10 \
    --output_base_path /path/to/output/folder
......
# Description Guide
![fewshot-example](./img/fewshot_example_gpt3.png)
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))
Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:
- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.
```python
description_dict = {
"task_name_1": "description",
"task_name_2": "description",
...
}
```
Note that a task's description is separated from the few-shot examples and prompt that follow it by a newline, like so:
```python
"""
<description>
<examples>
<prompt>
"""
```
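For example, here is a minimal sketch of passing a `description_dict` through `evaluator.simple_evaluate` (which forwards it to `evaluator.evaluate`). The `gpt2` model string, the `copa` task, and the 2-shot setting are illustrative choices, not requirements:
```python
from lm_eval import evaluator

description_dict = {
    "copa": (
        "Given a premise and one alternative with a causal relation to the "
        "premise and another without, choose the more plausible alternative"
    ),
}

# Illustrative values: any registered model name and task names work here.
results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["copa"],
    num_fewshot=2,
    description_dict=description_dict,
)
```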
## Descriptions in File
You can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method at a higher level by passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file should be structured just like the `description_dict`. E.g. for a file at `/your/path/descriptions.json` you may have:
```json
{
"cycle_letters": "Please unscramble the letters into a word, and write that word:",
"copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```
which can then be supplied to the CLI as:
```bash
python main.py \
    --tasks cycle_letters,copa \
    --description_dict_path /your/path/descriptions.json \
    ...
```
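Under the hood, `main.py` can be expected to read this file into the same dictionary structure before calling the evaluator. A rough sketch (not the exact `main.py` code; `args` here is an assumed `argparse` namespace):
```python
import json

description_dict = None
if args.description_dict_path:
    with open(args.description_dict_path, "r") as f:
        description_dict = json.load(f)  # {"task_name": "description", ...}
```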
......@@ -87,8 +87,7 @@ There are 2 standard approaches we follow for downloading data:
```
These methods return `True`/`False` depending on whether your task dataset provides documents for the given split. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.:
`{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return #...
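# For illustration only -- a hypothetical loader, assuming DATASET_PATH holds
# one JSON document per line in train.jsonl:
#
#     def training_docs(self):
#         with open(os.path.join(self.DATASET_PATH, "train.jsonl")) as f:
#             return [json.loads(line) for line in f]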
......@@ -125,17 +124,9 @@ You can now skip ahead to <a href="#Registering-Your-Task">registering your task
<br>
In the case your task is _not_ multiple-choice, override the following methods for your task class:
If your task is not multiple-choice, override the following methods for your task class:
Put the natural language task description here as a single-line string (no `\n`s). E.g. `"Translate English to French:"`
```python
def fewshot_description(self):
return ""
```
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example (in dictionary form). You should concatenate its members into a nicely formatted prompt.
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.
```python
def doc_to_text(self, doc):
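    # For illustration only -- with the hypothetical question/answer doc above,
    # one reasonable prompt format would be:
    #     return "Question: " + doc["question"] + "\nAnswer:"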
......@@ -161,11 +152,12 @@ After registering your task, you can now check on your data downloading and veri
```bash
python -m scripts.write_out \
    --task <your-task> \
    --output_base_path <path> \
    --tasks <your-task> \
    --sets <train | val | test> \
    --num_fewshot K \
    --num_examples N
    --num_examples N \
    --description_dict_path <path>
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
......
import abc
from typing import Iterable
import numpy as np
import random
import re
import os
import json
......@@ -450,11 +451,43 @@ class Task(abc.ABC):
        pass

    def fewshot_description(self):
        import warnings
        warnings.warn(
            "`fewshot_description` will be removed in future versions. Pass "
            "any custom descriptions to the `evaluate` function instead.",
            DeprecationWarning)
        return ""
    def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
        raw_description = self.fewshot_description()
        description = (raw_description + "\n===\n\n") if provide_description and raw_description else ""

    @utils.positional_deprecated
    def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
        """Returns a fewshot context string that is made up of a prepended description
        (if provided), the `num_fewshot` number of examples, and an appended prompt example.

        :param doc: str
            The document as returned from training_docs, validation_docs, or test_docs.
        :param num_fewshot: int
            The number of fewshot examples to provide in the returned context string.
        :param provide_description: bool
            Not implemented. This option is deprecated and will be removed in a future
            version in favor of a different description-providing method.
        :param rnd: random.Random
            The pseudo-random number generator used to randomly sample examples.
            WARNING: This is currently a required arg although it's optionalized with a default `None`.
        :param description: str
            The task's description that will be prepended to the fewshot examples.
        :returns: str
            The fewshot context.
        """
        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
        assert not provide_description, (
            "The `provide_description` arg will be removed in future versions. To prepend "
            "a custom description to the context, supply the corresponding string via the "
            "`description` arg."
        )
        if provide_description is not None:
            # nudge people to not specify it at all
            print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

        description = description + "\n\n" if description else ""

        if num_fewshot == 0:
            labeled_examples = ""
......@@ -523,16 +556,22 @@ class PerplexityTask(Task, abc.ABC):
    def has_training_docs(self):
        return False

    def fewshot_description(self):
        return ""

    def fewshot_examples(self, k, rnd):
        assert k == 0
        return []

    def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
    def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
        assert num_fewshot == 0
        assert not provide_description
        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
        assert not provide_description, (
            "The `provide_description` arg will be removed in future versions. To prepend "
            "a custom description to the context, supply the corresponding string via the "
            "`description` arg."
        )
        if provide_description is not None:
            # nudge people to not specify it at all
            print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")
        return ""

    def higher_is_better(self):
......
......@@ -6,19 +6,23 @@ import lm_eval.models
import lm_eval.tasks
import lm_eval.base
import numpy as np
from lm_eval.utils import positional_deprecated
def simple_evaluate(model, model_args, task_names,
@positional_deprecated
def simple_evaluate(model, model_args=None, tasks=[],
                    num_fewshot=0, batch_size=None, device=None,
                    no_cache=False, limit=None, bootstrap_iters=100000):
                    no_cache=False, limit=None, bootstrap_iters=100000,
                    description_dict=None):
"""Instantiate and evaluate a model on a list of tasks.
:param model: str
Name of model, see lm_eval.models.get_model
:param model_args: str
String arguments for each model class, see LM.create_from_arg_string
:param task_names: list[str]
List of task names
:param model: Union[str, LM]
Name of model or LM object, see lm_eval.models.get_model
:param model_args: Optional[str]
String arguments for each model class, see LM.create_from_arg_string.
Ignored if `model` argument is a LM object.
:param tasks: list[Union[str, Task]]
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int, optional
......@@ -31,23 +35,39 @@ def simple_evaluate(model, model_args, task_names,
        Limit the number of examples per task (only use this for testing)
    :param bootstrap_iters:
        Number of iterations for bootstrap statistics
    :param description_dict: dict[str, str]
        Dictionary of custom task descriptions of the form: `task_name: description`
    :return
        Dictionary of results
    """
    random.seed(1234)
    np.random.seed(1234)

    lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, {
        'batch_size': batch_size, 'device': device
    })
    assert tasks != [], "No tasks specified"

    if isinstance(model, str):
        if model_args is None: model_args = ""
        lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, {
            'batch_size': batch_size, 'device': device
        })
    else:
        assert isinstance(model, lm_eval.base.LM)
        lm = model

    if not no_cache:
        lm = lm_eval.base.CachingLM(
            lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_').replace('/', '-') + '.db'
        )

    task_dict = lm_eval.tasks.get_task_dict(task_names)
    results = evaluate(lm, task_dict, False, num_fewshot, limit)
    task_dict = lm_eval.tasks.get_task_dict(tasks)
    results = evaluate(
        lm=lm,
        task_dict=task_dict,
        num_fewshot=num_fewshot,
        limit=limit,
        description_dict=description_dict
    )

    # add info about the model and few shot config
    results["config"] = {
......@@ -58,19 +78,21 @@ def simple_evaluate(model, model_args, task_names,
"device": device,
"no_cache": no_cache,
"limit": limit,
"bootstrap_iters": bootstrap_iters
"bootstrap_iters": bootstrap_iters,
"description_dict": description_dict
}
return results
def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_iters=100000):
@positional_deprecated
def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None):
    """Instantiate and evaluate a model on a list of tasks.

    :param lm: obj
        Language Model
    :param task_dict: dict[str, Task]
        Dictionary of tasks
        Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
    :param provide_description: bool
        Not implemented. This option is deprecated and will be removed in a future version in favor of a different description-providing method.
    :param num_fewshot: int
......@@ -79,6 +101,8 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
        Limit the number of examples per task (only use this for testing)
    :param bootstrap_iters:
        Number of iterations for bootstrap statistics
    :param description_dict: dict[str, str]
        Dictionary of custom task descriptions of the form: `task_name: description`
    :return
        Dictionary of results
    """
......@@ -86,6 +110,9 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
    # TODO: implement proper description-providing system
    assert not provide_description  # not implemented.
    if provide_description is not None:
        # nudge people to not specify it at all
        print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

    task_dict_items = [
        (name, task)
......@@ -125,16 +152,16 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
        rnd.seed(42)
        rnd.shuffle(task_docs)

        description = description_dict[task_name] if description_dict and task_name in description_dict else ""

        for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
            docs[(task_name, doc_id)] = doc
            ctx = task.fewshot_context(
                doc=doc,
                provide_description=provide_description,
                num_fewshot=num_fewshot,
                rnd=rnd
                rnd=rnd,
                description=description
            )
            reqs = task.construct_requests(doc, ctx)
            if not isinstance(reqs, (list, tuple)):
                reqs = [reqs]
......
from pprint import pprint
from typing import List, Union
import sacrebleu
import lm_eval.base
from . import superglue
from . import glue
......@@ -305,8 +307,23 @@ def get_task(task_name):
        raise KeyError(f"Missing task {task_name}")
def get_task_dict(task_name_list):
    return {

def get_task_name_from_object(task_object):
    for name, class_ in TASK_REGISTRY.items():
        if class_ is task_object:
            return name

    # this gives a mechanism for non-registered tasks to have a custom name anyways when reporting
    return task_object.EVAL_HARNESS_NAME if hasattr(task_object, "EVAL_HARNESS_NAME") else type(task_object).__name__


def get_task_dict(task_name_list: List[Union[str, lm_eval.base.Task]]):
    task_name_dict = {
        task_name: get_task(task_name)()
        for task_name in task_name_list
        for task_name in task_name_list if isinstance(task_name, str)
    }
    task_name_from_object_dict = {
        get_task_name_from_object(task_object): task_object
        for task_object in task_name_list if not isinstance(task_object, str)
    }
    assert set(task_name_dict.keys()).isdisjoint(set(task_name_from_object_dict.keys()))
    return {**task_name_dict, **task_name_from_object_dict}
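# Illustration (hypothetical usage): strings and Task instances may be mixed, e.g.
#     get_task_dict(["copa", MyCustomTask()])
# where MyCustomTask is an unregistered lm_eval.base.Task subclass whose reported
# name falls back to MyCustomTask.EVAL_HARNESS_NAME if defined, else the class name.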
......@@ -33,10 +33,6 @@ class ANLIBase(HFTask):
        if self.has_test_docs():
            return self.data["test_r" + str(self.SPLIT)]

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        # OA does this a bit weirdly: they prepend "anli 1: anli 1: " to the beginning
        # of the prompt (yes, repeating it!). also, " True, False, or Neither?" is directly
......
......@@ -29,10 +29,6 @@ class ARCEasy(HFTask, MultipleChoiceTask):
        }
        return out_doc

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        return doc["query"]
......
......@@ -29,9 +29,18 @@ class BlimpTask(HFTask):
self.data["validation"] = self.data["train"]
del self.data["train"]
def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
assert num_fewshot == 0
assert not provide_description
assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
assert not provide_description, (
"The `provide_description` arg will be removed in future versions. To prepend "
"a custom description to the context, supply the corresponding string via the "
"`description` arg."
)
if provide_description is not None:
# nudge people to not specify it at all
print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")
return ""
def doc_to_text(self, doc):
......
......@@ -17,10 +17,6 @@ class CBTBase(HFTask):
    VERSION = 0

    def fewshot_description(self):
        # TODO: Figure out description.
        return ""

    def detokenize(self, text):
        text = text.replace(" '", "'")
        text = text.replace(" \n", "\n")
......
......@@ -36,10 +36,7 @@ class CoQA(Task):
    def test_docs(self):
        pass

    def fewshot_description(self):
        return "Given a passage and a conversation so far, answer the next question in the conversation."

    def doc_to_text(self, doc):
        # Given a passage p, the conversation history {q1, a1, . . . qi−1, ai−1}
        # and a question qi, the task is to predict the answer ai
......
......@@ -40,10 +40,6 @@ class DROP(Task):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def _load_docs(self, docs):
        for doc in docs:
            for qa in doc["qa_pairs"]:
......
......@@ -21,10 +21,6 @@ class CoLA(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        # TODO
        return ""

    def doc_to_text(self, doc):
        return "{}\nQuestion: Does this sentence make sense?\nAnswer:".format(doc["sentence"])
......@@ -69,9 +65,6 @@ class SST(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        return "Indicate if the sentiment of each sentence is positive or negative."

    def doc_to_text(self, doc):
        return "{}\nQuestion: Is this sentence positive or negative?\nAnswer:".format(
            general_detokenize(doc["sentence"]),
......@@ -341,9 +334,6 @@ class MRPC(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        return "Indicate if both sentences mean the same thing."

    def doc_to_text(self, doc):
        return "Sentence 1: {}\nSentence 2: {}\nQuestion: Do both sentences mean the same thing?\nAnswer:".format(
            general_detokenize(doc["sentence1"]),
......@@ -394,9 +384,6 @@ class QQP(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        return "Indicate if both questions ask the same thing."

    def doc_to_text(self, doc):
        return "Question 1: {}\nQuestion 2: {}\nQuestion: Do both questions ask the same thing?\nAnswer:".format(
            doc["question1"],
......@@ -447,10 +434,6 @@ class STSB(HFTask):
    def has_test_docs(self):
        return True

    def fewshot_description(self):
        return "Indicate if both sentences mean the same thing from a scale of 0-5, " \
               "where 5 means identical and 0 means unrelated."

    def doc_to_text(self, doc):
        return "sentence 1: {}\nsentence 2: {}\nAnswer:".format(
            doc["sentence1"],
......
......@@ -24,10 +24,6 @@ class HeadQABase(HFTask, MultipleChoiceTask):
        }
        return out_doc

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        return doc["query"]
......
......@@ -35,10 +35,5 @@ class HellaSwag(HFTask, MultipleChoiceTask):
        }
        return out_doc

    def fewshot_description(self):
        return "Label for the relevant action: Sentences describing the " \
               "context, with an incomplete sentence trailing\nanswer that " \
               "plausibly completes the situation."

    def doc_to_text(self, doc):
        return doc["query"]
......@@ -237,9 +237,6 @@ class EthicsUtilitarianismOriginal(Ethics):
        for doc in docs:
            yield {"activity": doc[0], "baseline": doc[1], "rating": ""}

    def fewshot_description(self):
        return "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n"

    def fewshot_examples(self, k, rnd):
        # Overwriting fewshot examples as k can be max 5
        assert k <= 5, "There are only 5 possible shots for this task. Refer to the V2 for more."
......@@ -350,9 +347,6 @@ class EthicsVirtue(Ethics):
    def get_prefix(self):
        return "virtue/virtue"

    def fewshot_description(self):
        return "The following is a list of sentences and traits, along with whether the trait is exhibited in that sentence.\n\n"

    def process_doc(self, doc):
        # Append identifiers before shuffling to calculate exact matches later on & skip the first element of headers
        return [x + [i] for i, x in enumerate(doc[1:])]
......
......@@ -55,9 +55,6 @@ class Math(Task):
    def test_docs(self):
        return self._load_docs(self.DATASET_PATH / "test" / self.get_file_info())

    def fewshot_description(self):
        return "Given a mathematics problem, determine the answer. Simplify your answer as much as possible."

    def doc_to_text(self, doc):
        return "Problem: " + doc["problem"] + "\nAnswer:"
......
......@@ -114,9 +114,5 @@ class GeneralHendrycksTest(MultipleChoiceTask):
        return rnd.sample(list(self._fewshot_docs), k)

    def fewshot_description(self):
        subject = self.subject.replace("_", " ")
        return f"The following are multiple choice questions (with answers) about {subject}."

    def doc_to_text(self, doc):
        return doc["query"]
......@@ -47,10 +47,6 @@ class LAMBADA(Task):
    def doc_to_target(self, doc):
        return " " + doc['text'].rsplit(' ', 1)[1]

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def construct_requests(self, doc, ctx):
        ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc))
......