Now let's walk through the actual implementation - from data handling to evaluation.
### Downloading the Data
There are 2 standard approaches we follow for downloading data:

1. Your task's dataset is in HF's catalog, in which case you can simply set:
```python
DATASET_PATH = "..."
DATASET_NAME = "..."
```
where `DATASET_PATH` is the name of the benchmark/task dataset as listed by HF and `DATASET_NAME` is the name of what HF calls a “data instance” of the benchmark. If your task is not a benchmark containing any data instances, just set `DATASET_NAME = None`.
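For example (hypothetical values; substitute your own benchmark):
```python
DATASET_PATH = "super_glue"  # hypothetical: an HF benchmark name
DATASET_NAME = "boolq"       # hypothetical: one of its "data instances"
```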
2. Your task's dataset is not in HF's catalog, so you'll have to override a few abstract methods of the `Task` base class. First let's define our benchmark/task and inherit from `Task`.
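A minimal skeleton might look like the following (a sketch: the `lm_eval.base` import path and the concrete `True`/`False` returns are assumptions to adapt, while `has_training_docs`, `has_validation_docs`, and `has_test_docs` are the split-check methods described just below):
```python
from lm_eval.base import Task

class TaskName(Task):
    def has_training_docs(self):
        # True only if the dataset ships a training split.
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        # Only claim a test split if its labels are public (see note below).
        return False
```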
These methods return `True`/`False` according to whether your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`, e.g.:
`{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    ...
```
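As an illustration, `training_docs` typically loads its split once, caches it, and returns an iterable of `doc` dicts. A sketch, assuming an HF-hosted dataset (the `_training_docs` caching attribute is a common convention in existing tasks, not a requirement):
```python
import datasets

def training_docs(self):
    # Load and cache the training split; each item is one `doc` dict.
    if self._training_docs is None:
        dataset = datasets.load_dataset(self.DATASET_PATH, self.DATASET_NAME)
        self._training_docs = list(dataset["train"])
    return self._training_docs
```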
...
### Versioning
Tasks in the harness can always evolve: metrics get updated, data sources change, etc. It’s important to mark each task with a version attribute so users can document which implementation version was used to obtain their results. Add a `VERSION` attribute to your task class, set to `0` (the first version of your task):
```python
class TaskName(...):
    VERSION = 0
```
...
### Formatting your Few-Shot Examples
The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples. Override the following methods for your task class:
```python
def fewshot_description(self):
    return ""
```
Put your natural language task description as a single line (no `\n`s) string here. E.g. `"Translate English to French:"`
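Filled in with the example description above, this would be:
```python
def fewshot_description(self):
    return "Translate English to French:"
```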
```python
def doc_to_text(self, doc):
    return ""
```
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example (in dictionary form). You should concatenate its members into a nicely formatted prompt.
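For the toy question-answering `doc` shown earlier, a sketch might be:
```python
def doc_to_text(self, doc):
    # End the prompt without a trailing space; `doc_to_target`
    # supplies the leading space before the answer.
    return "Question: " + doc["question"] + "\nAnswer:"
```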
```python
def doc_to_target(self, doc):
    return ""
```
Put the target answer of the prompt here, in the form: `" " + <answer>`.
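Continuing the same toy example:
```python
def doc_to_target(self, doc):
    # The leading space separates the target from the prompt above.
    return " " + doc["answer"]
```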
Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
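For instance, with the sketches above, a 1-shot context would look roughly like this (the exact joining of examples is handled by the harness; the second question is a made-up evaluation doc):
```
Question: What is the capital of France?
Answer: Paris

Question: What is the capital of Germany?
Answer:
```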
...
```
python -m scripts.write_out \
    ... \
    --num_examples N
```
Open the file specified by `--output_base_path <path>` and ensure it passes a simple eye test.
### The Evaluation
Now comes evaluation. The methods you'll need to implement are:
```python
def construct_requests(self, doc, ctx):
    """Uses RequestFactory to construct Requests and returns an iterable of
    Requests which will be sent to the LM.

    :param doc:
        The document as returned from training_docs, validation_docs, or test_docs.
    :param ctx: str
        The context string, generated by fewshot_context. This includes the natural
        language description, as well as the few shot examples, and the question
        part of the document for `doc`.
    """
```
...
```python
def process_results(self, doc, results):
    """Take a single document and the LM results and evaluates, returning a
    dict where keys are the names of submetrics and values are the values of