Unverified commit 235f8d3f authored by Stella Biderman, committed by GitHub

Merge pull request #308 from jon-tow/add-templates

Add templates and update docs
parents 23ddba77 a0fa0454
## Creating Your Task File
From the `lm-evaluation-harness` project root, copy over the `new_task.py` template to `lm_eval/tasks/`:
```sh
cp templates/new_task.py lm_eval/tasks/<task-name>.py
```
or, if your task is **multiple-choice**, copy over the `new_multiple_choice_task.py` template:
```sh
cp templates/new_multiple_choice_task.py lm_eval/tasks/<task-name>.py
```
This will set you up with a few `TODO`s to fill in, which we'll now go over in detail.
## Task Heading
Open the file you've just created and add a multiline docstring on the first line with the following contents:
```python
"""
TODO: Add the Paper Title on this line.
TODO: Add the paper's PDF URL (preferably from arXiv) on this line.
TODO: Write a Short Description of the task.
Homepage: TODO: Add the URL to the task's Homepage here.
"""
```
Now let's walk through the actual implementation - from data handling to evaluation.
### Downloading your Data
All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
Now that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:
```python
class TaskName(...):
    DATASET_PATH = "..."
    DATASET_NAME = "..."
```
where `DATASET_PATH` is the name of the dataset as listed by HF in the `datasets` Hub and `DATASET_NAME` is the name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just set `DATASET_NAME = None`.
(If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
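To make the mapping concrete, here's a purely illustrative call - the `super_glue`/`boolq` pair is just an example, not a requirement of the harness:
```python
import datasets

# DATASET_PATH = "super_glue" and DATASET_NAME = "boolq" correspond to:
dataset = datasets.load_dataset(path="super_glue", name="boolq")
print(dataset["validation"][0])  # a single `doc`-style dict
```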
Next up, we have to set some “flags”:
```python
def has_training_docs(self):
    return # True/False

def has_validation_docs(self):
    return # True/False

def has_test_docs(self):
    return # True/False
```
These methods return `True`/`False` depending on whether your task dataset provides documents for each split type. __Note__: if the test set does not have publicly available answer labels, please do not put it down as having a test set - just return `False`.
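For instance, a dataset that ships public train and validation splits but keeps its test labels private might be flagged like so (assumed splits, for illustration only):
```python
def has_training_docs(self):
    return True

def has_validation_docs(self):
    return True

def has_test_docs(self):
    # The test labels are not public, so we do not expose a test split.
    return False
```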
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return #...

def validation_docs(self):
    return #...

def test_docs(self):
    return #...
```
These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples.
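For the same assumed splits, a sketch of the loaders could look like the following, caching the training docs the way the templates below do:
```python
def training_docs(self):
    if self._training_docs is None:
        # Cache the training docs since they are reused as few-shot examples.
        self._training_docs = list(self.dataset["train"])
    return self._training_docs

def validation_docs(self):
    return self.dataset["validation"]
```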
#### Processing Documents

At this point, you can also process each individual document to, for example, strip whitespace or "detokenize" its fields. Put the processing logic into `_process_doc` and map it across the training/validation/test docs inside of the respective functions.
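For example, a hypothetical `_process_doc` that just strips stray whitespace from assumed `question` and `answer` fields might look like:
```python
def _process_doc(self, doc):
    # Light cleanup; the field names here are assumptions for illustration.
    return {
        "question": doc["question"].strip(),
        "answer": doc["answer"].strip(),
    }
```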
🔠 If your task is **multiple-choice**, we require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/6caa0afd96a7a7efb2ec4c1f24ad1756e48f3aa7/lm_eval/tasks/sat.py#L60) for an example. 🔠
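Concretely, a processed multiple-choice `doc` could look like this (all values are made up for illustration):
```python
{
    "query": "The capital of France is",          # prompt text used by `doc_to_text`
    "choices": [" London", " Paris", " Berlin"],  # candidate continuations
    "gold": 1,                                    # index of the correct choice
}
```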
### Formatting your Few-Shot Examples

The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples.

Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.

```python
def doc_to_text(self, doc):
    return ""
```
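As a sketch, assuming the `doc` has a `question` field, this could be:
```python
def doc_to_text(self, doc):
    # Prompt only; the answer is supplied separately by `doc_to_target`.
    return "Question: " + doc["question"] + "\nAnswer:"
```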
️🔠 **Multiple-Choice Formatting**
If your task is multiple-choice, you can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>.

️️🔠 **End Multiple-Choice Formatting**
Format the target answer from the contents of `doc`. Note that the prepended `" "` is required to space out the `doc_to_text` and `doc_to_target` strings.
```python
def doc_to_target(self, doc):
    target = ""
    return " " + target
```
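Continuing the same sketch, with an assumed `answer` field:
```python
def doc_to_target(self, doc):
    # The leading space separates the target from the prompt built in `doc_to_text`.
    return " " + doc["answer"]
```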
Finally, be aware that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
### Registering Your Task
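Registration makes the class visible to the evaluator through the task registry in `lm_eval/tasks/__init__.py`; a minimal sketch, assuming your file is named `my_new_task.py`:
```python
# lm_eval/tasks/__init__.py
from . import my_new_task

TASK_REGISTRY = {
    # ... existing tasks ...
    "my-new-task": my_new_task.NewTask,
}
```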
The two template files referenced above are reproduced below.

`templates/new_multiple_choice_task.py`:
```python
# TODO: Remove all TODO comments once the implementation is complete.
"""
TODO: Add the Paper Title on this line.
TODO: Add the paper's PDF URL (preferably from arXiv) on this line.
TODO: Write a Short Description of the task.
Homepage: TODO: Add the URL to the task's Homepage here.
"""
from lm_eval.base import MultipleChoiceTask


# TODO: Add the BibTeX citation for the task.
_CITATION = """
"""


# TODO: Replace `NewTask` with the name of your Task.
class NewTask(MultipleChoiceTask):
    VERSION = 0
    # TODO: Add the `DATASET_PATH` string. This will be the name of the `Task`
    # dataset as denoted in HuggingFace `datasets`.
    DATASET_PATH = ""
    # TODO: Add the `DATASET_NAME` string. This is the name of a subset within
    # `DATASET_PATH`. If there aren't specific subsets you need, leave this as `None`.
    DATASET_NAME = None

    def has_training_docs(self):
        # TODO: Fill in the return with `True` if the Task has training data; else `False`.
        return False

    def has_validation_docs(self):
        # TODO: Fill in the return with `True` if the Task has validation data; else `False`.
        return False

    def has_test_docs(self):
        # TODO: Fill in the return with `True` if the Task has test data; else `False`.
        return False

    def training_docs(self):
        if self.has_training_docs():
            # We cache training documents in `self._training_docs` for faster
            # few-shot processing. If the data is too large to fit in memory,
            # return the training data as a generator instead of a list.
            if self._training_docs is None:
                # TODO: Return the training document generator from `self.dataset`.
                # In most cases you can leave this as is unless the dataset split is
                # named differently than the default `"train"`.
                self._training_docs = list(
                    map(self._process_doc, self.dataset["train"])
                )
            return self._training_docs

    def validation_docs(self):
        if self.has_validation_docs():
            # TODO: Return the validation document generator from `self.dataset`.
            # In most cases you can leave this as is unless the dataset split is
            # named differently than the default `"validation"`.
            return map(self._process_doc, self.dataset["validation"])

    def test_docs(self):
        if self.has_test_docs():
            # TODO: Return the test document generator from `self.dataset`.
            # In most cases you can leave this as is unless the dataset split is
            # named differently than the default `"test"`.
            return map(self._process_doc, self.dataset["test"])

    def _process_doc(self, doc):
        # TODO: Process the documents into a dictionary with the following keys:
        return {
            "query": "",  # The query prompt.
            "choices": [],  # The list of choices.
            "gold": 0,  # The integer used to index into the correct element of `"choices"`.
        }

    def doc_to_text(self, doc):
        # TODO: Format the query prompt portion of the document example.
        return doc["query"]
```
`templates/new_task.py`:

```python
# TODO: Remove all TODO comments once the implementation is complete.
"""
TODO: Add the Paper Title on this line.
TODO: Add the paper's PDF URL (preferably from arXiv) on this line.
TODO: Write a Short Description of the task.
Homepage: TODO: Add the URL to the task's Homepage here.
"""
from lm_eval.base import Task


# TODO: Add the BibTeX citation for the task.
_CITATION = """
"""


# TODO: Replace `NewTask` with the name of your Task.
class NewTask(Task):
    VERSION = 0
    # TODO: Add the `DATASET_PATH` string. This will be the name of the `Task`
    # dataset as denoted in HuggingFace `datasets`.
    DATASET_PATH = ""
    # TODO: Add the `DATASET_NAME` string. This is the name of a subset within
    # `DATASET_PATH`. If there aren't specific subsets you need, leave this as `None`.
    DATASET_NAME = None

    def has_training_docs(self):
        # TODO: Fill in the return with `True` if the Task has training data; else `False`.
        return False

    def has_validation_docs(self):
        # TODO: Fill in the return with `True` if the Task has validation data; else `False`.
        return False

    def has_test_docs(self):
        # TODO: Fill in the return with `True` if the Task has test data; else `False`.
        return False

    def training_docs(self):
        if self.has_training_docs():
            # We cache training documents in `self._training_docs` for faster
            # few-shot processing. If the data is too large to fit in memory,
            # return the training data as a generator instead of a list.
            if self._training_docs is None:
                # TODO: Return the training document generator from `self.dataset`.
                # If you need to process the data, `map` over the documents with
                # the custom processing function, `self._process_doc`. E.g.
                # `map(self._process_doc, self.dataset["train"])`
                # In most cases you can leave this as is unless the dataset split is
                # named differently than the default `"train"`.
                self._training_docs = list(self.dataset["train"])
            return self._training_docs

    def validation_docs(self):
        if self.has_validation_docs():
            # TODO: Return the validation document generator from `self.dataset`.
            # If you need to process the data, `map` over the documents with the
            # custom processing function, `self._process_doc`. E.g.
            # `map(self._process_doc, self.dataset["validation"])`
            # In most cases you can leave this as is unless the dataset split is
            # named differently than the default `"validation"`.
            return self.dataset["validation"]

    def test_docs(self):
        if self.has_test_docs():
            # TODO: Return the test document generator from `self.dataset`.
            # If you need to process the data, `map` over the documents with the
            # custom processing function, `self._process_doc`. E.g.
            # `map(self._process_doc, self.dataset["test"])`
            # In most cases you can leave this as is unless the dataset split is
            # named differently than the default `"test"`.
            return self.dataset["test"]

    def _process_doc(self, doc):
        # TODO: Process (detokenize, strip, replace, etc.) each individual `doc`
        # with this function. You can map this across the docs in each available
        # dataset split. See the TODOs in `training_docs`, `validation_docs`, and
        # `test_docs` for snippets.
        # NOTE: DELETE THIS FUNCTION IF UNUSED.
        return doc

    def doc_to_text(self, doc):
        # TODO: Format the query prompt portion of the document example.
        return ""

    def doc_to_target(self, doc):
        # TODO: Fill in the `target` ("gold answer") variable.
        # The prepended `" "` is required to space out the `doc_to_text` and
        # `doc_to_target` strings.
        target = ""
        return " " + target

    def construct_requests(self, doc, ctx):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or
            test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
        # TODO: Construct your language model requests with the request factory, `rf`,
        # and return them as an iterable.
        return []

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        # TODO: For each (sub)metric in the task evaluation, add a key-value pair
        # with the metric name as key and the corresponding metric result as value
        # for the current `doc`.
        return {}

    def aggregation(self):
        """
        :returns: {str: [metric_score] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metric scores
        """
        # TODO: For each (sub)metric in the task evaluation, add a key-value pair
        # with the metric name as key and an aggregation function as value which
        # determines how to combine results from each document in the dataset.
        # Check `lm_eval.metrics` to find built-in aggregation functions.
        return {}

    def higher_is_better(self):
        # TODO: For each (sub)metric in the task evaluation, add a key-value pair
        # with the metric name as key and a `bool` value determining whether or
        # not higher values of that metric are deemed better.
        return {}
```
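To see how the evaluation-related `TODO`s might be resolved, here is a sketch of a hypothetical yes/no task built on this template. It uses the request factory `rf` and the built-in `mean` aggregation from `lm_eval.metrics`; the `question` and `label` field names are assumptions for illustration:
```python
from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class HypotheticalYesNoTask(Task):
    # Dataset plumbing (DATASET_PATH, split flags, *_docs) omitted; see the template above.

    def doc_to_text(self, doc):
        return doc["question"] + "\nAnswer:"

    def doc_to_target(self, doc):
        return " yes" if doc["label"] else " no"

    def construct_requests(self, doc, ctx):
        # One log-likelihood request per candidate continuation, conditioned on the context.
        ll_yes, _ = rf.loglikelihood(ctx, " yes")
        ll_no, _ = rf.loglikelihood(ctx, " no")
        return ll_yes, ll_no

    def process_results(self, doc, results):
        ll_yes, ll_no = results
        pred = ll_yes > ll_no
        return {"acc": float(pred == bool(doc["label"]))}

    def aggregation(self):
        # Average per-document accuracy over the evaluation split.
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}
```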