Now let's walk through the actual implementation - from data handling to evaluation.
## Data Handling

### Downloading your Data
There are 2 standard approaches we follow for downloading data:
...
```python
from lm_eval.base import Task
from pathlib import Path


class TaskName(Task):
    DATASET_PATH = Path("data/<task-name>")
```
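For tasks that aren't pulled from the HF hub, the download step in many existing tasks amounts to overriding a `download` method that fetches the raw files into `DATASET_PATH`. Here's a minimal sketch using only the standard library; the URL, filename, and skip-if-cached check are placeholders/assumptions, not harness requirements:

```python
import urllib.request


def download(self):
    data_file = self.DATASET_PATH / "data.json"  # hypothetical filename
    if data_file.exists():
        # Already downloaded on a previous run; nothing to do.
        return
    self.DATASET_PATH.mkdir(parents=True, exist_ok=True)
    # Hypothetical URL; point this at wherever your dataset actually lives.
    urllib.request.urlretrieve("https://example.com/task-name/data.json", data_file)
```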
Next, let the harness know which splits your dataset provides by overriding the following methods:

```python
def has_training_docs(self):
    return # True/False

def has_validation_docs(self):
    return # True/False

def has_test_docs(self):
    return # True/False
```
These methods return `True`/`False` depending on whether your task dataset provides documents for the given split. __Note__: if the test set doesn't have publicly available labels, please do not report it as having a test set.
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`, e.g. `{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return #...

def validation_docs(self):
    return #...

def test_docs(self):
    return #...
```
These should return a Python iterable (`list` or `generator`) of `dict`s that can be queried for individual `doc` examples. __NOTE__: If your task doesn't have a train/validation/test set, remember to raise a `NotImplementedError` for that specific split.
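As an illustration, here's a rough sketch assuming the data was saved as JSON Lines files named `train.jsonl` and `validation.jsonl` under `DATASET_PATH` (the filenames and the helper are assumptions, not part of the harness API) and that there is no labeled test split:

```python
import json


def _load_jsonl(self, path):
    # Hypothetical helper: yield one doc dict per line of a .jsonl file.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def training_docs(self):
    return self._load_jsonl(self.DATASET_PATH / "train.jsonl")

def validation_docs(self):
    return self._load_jsonl(self.DATASET_PATH / "validation.jsonl")

def test_docs(self):
    # This hypothetical task has no publicly labeled test split.
    raise NotImplementedError("This task has no labeled test set.")
```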
### Formatting your Few-Shot Examples
The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples.
<br/>
⚠️ **Multiple-Choice Formatting**
If your task is **multiple-choice**, just inherit from the `MultipleChoiceTask` class we provide.
```python
from lm_eval.base import MultipleChoiceTask


class TaskName(..., MultipleChoiceTask):
```
This will require you to format your documents such that they contain `gold` and `choices` fields. They can also have other fields, but those will be ignored by `MultipleChoiceTask`. `choices` should be a list of possible continuations, and `gold` should be an integer specifying the index of the correct completion.
See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/105fa9741ff660f6a62c2eef0d2facfde36dda41/lm_eval/tasks/sat.py#L56) for an example. When used in combination with `HFTask`, it may be useful to override [`_convert_standard`](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/common.py#L28), which will be applied to every document in the HF dataset. See [this task](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/headqa.py) for an example of this.
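To make the expected format concrete, here's a rough sketch of a `_convert_standard` override for a hypothetical HF dataset whose raw examples carry `question`, `options`, and `label` fields (those field names, and the extra `query` key, are illustrative assumptions; `MultipleChoiceTask` itself only requires `choices` and `gold`):

```python
def _convert_standard(self, doc):
    # Map a raw HF example onto the fields MultipleChoiceTask expects.
    return {
        "query": doc["question"],    # the prompt text your doc_to_text can return
        "choices": doc["options"],   # list of candidate continuations
        "gold": int(doc["label"]),   # index of the correct choice
    }
```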
You can now skip ahead to <a href="#Registering-Your-Task">registering your task</a>.
⚠️ **End Multiple-Choice Formatting**
<br/>
If your task is not multiple-choice, you'll need to override the task's prompt-formatting methods, most importantly `doc_to_text` and `doc_to_target` (see the sketch below).
Give the natural language task description as a single-line string (no `\n`s), e.g. `"Translate English to French:"`.
Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
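As a simplified sketch, assuming the QA-style `doc` from earlier (the `question`/`answer` keys come from that hypothetical example, not from the harness):

```python
def doc_to_text(self, doc):
    # The prompt the model sees for this document.
    return f"Question: {doc['question']}\nAnswer:"

def doc_to_target(self, doc):
    # The target continuation. The leading space matters because
    # doc_to_text(doc) + doc_to_target(doc) is what gets concatenated
    # to form each labeled few-shot example.
    return " " + doc["answer"]
```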
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `lm_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary, with the key being the name of your benchmark task (in the form it'll be referred to on the command line) and the value being the task class. See how it's done for other tasks in [that file](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/__init__.py).
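For example, assuming your task lives in a module `lm_eval/tasks/task_name.py` (the module name and registry key below are placeholders), the entry would look roughly like:

```python
# lm_eval/tasks/__init__.py
from . import task_name  # hypothetical module containing TaskName

TASK_REGISTRY = {
    # ... existing tasks ...
    "task-name": task_name.TaskName,
}
```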
### Checking the Data
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
```sh
python -m scripts.write_out \
    --output_base_path <path> \
    ...
```
Open the file specified at the `--output_base_path <path>` and ensure it passes a simple eye test.
## Evaluation
🛑 If your task is a single-true multiple-choice task and you've correctly inherited from `MultipleChoiceTask`, then your job here is done; <a href="#Checking-the-Task-Performance">go ahead and check on the task performance!</a> 🛑
Now comes evaluation. The methods you'll need to implement are:
...
Good examples of the various ways evaluation can be implemented include [LAMBADA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py), [TriviaQA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/triviaqa.py), and [SQuAD](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/squad.py).
Tip: Feel free to create your own helper methods for your task!
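To make this concrete, here's a rough sketch of the evaluation hooks for a hypothetical free-form QA task, loosely modeled on how generation-style tasks such as TriviaQA handle it. The method names come from the harness `Task` API; the exact-match metric, the newline stop sequence, and the `answer` field are illustrative assumptions:

```python
import numpy as np

from lm_eval.base import rf


def construct_requests(self, doc, ctx):
    # Ask the model to greedily continue the few-shot context, stopping at a newline.
    return rf.greedy_until(ctx, ["\n"])

def process_results(self, doc, results):
    # `results` holds the model outputs for the requests above, in order.
    continuation = results[0]
    return {"acc": float(continuation.strip() == doc["answer"])}

def aggregation(self):
    # Maps each metric name to the function that reduces per-doc values to one number.
    return {"acc": np.mean}

def higher_is_better(self):
    return {"acc": True}
```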
### Checking the Task Performance
```sh
python main.py \
    ...
    --num_fewshot K
```
### Running Unit Tests
To run the entire test suite, use:
...

This is usually overkill; to run only the tests for your task, do:

```sh
pytest -k <task name>
```
## Versioning
Lastly, we need a bit of version control. Tasks in the harness can always evolve: metrics get updated, data sources change, etc. It's important to mark each task with a version attribute so users can document which implementation version was used to obtain their results. Add a `VERSION` attribute to your task right below the class name and set it to `0` (this is the first version/implementation of your task):
```python
class TaskName(...):
    VERSION = 0
```
## Submitting your Task

Although we don't currently follow a specific style guide, we'd appreciate it if you tidied up your file(s) with the `black` formatter (which should've been installed through `requirements.txt`). Keep things clean…ish 🙂.