https://arxiv.org/abs/1808.07036
"""
```
Now let's walk through the actual implementation - from data handling to evaluation.

### Downloading the Data
There are 2 standard approaches we follow for downloading data:
1. Your task's dataset is hosted on the HuggingFace (HF) `datasets` hub, in which case you just need to set the following class attributes:
```python
DATASET_PATH = "..."
DATASET_NAME = "..."
```
where `DATASET_PATH` is the name of the benchmark/task dataset as listed by HF and `DATASET_NAME` is the name of what HF calls a “data instance” of the benchmark. If your task is not a benchmark containing data instances, just set `DATASET_NAME = None`.
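For example, to point at one instance of a benchmark (the names here are just an illustrative pairing from the HF hub, not something this guide prescribes):
```python
DATASET_PATH = "super_glue"
DATASET_NAME = "boolq"
```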
2. Your task's dataset is not in HF's catalog, so you'll have to override a few abstract methods of the `Task` base class. First, let's define our benchmark/task and inherit from `Task`, as in the sketch below.
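A minimal sketch of such a subclass (assuming `Task` is importable from `lm_eval.base`; the `True`/`False` values are illustrative):
```python
from lm_eval.base import Task

class TaskName(Task):
    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False
```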
These methods return `True`/`False` indicating whether your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.:
`{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return []  # return an iterable of training docs

def validation_docs(self):
    return []  # return an iterable of validation docs

def test_docs(self):
    return []  # return an iterable of test docs
```
After this, go <a href="#Registering-Your-Task">register your task</a>.
### Versioning
Tasks in the harness can always evolve. Metrics get updated, data sources change, etc. It’s important to mark each task with a version attribute so users can document which implementation version was used to obtain their results. Add a `VERSION` attribute to your task class set to 0 (the first version of your task):
```python
class TaskName(...):
    VERSION = 0
```
### Formatting your Few-Shot Examples
The harness is designed to facilitate task evaluations under the few-shot setting. Here we’ll format such examples. Override the following methods for your task class:
Put the natural language task description as a single line (no `\n`s) string here. E.g. `"Translate English to French:"`
```python
def fewshot_description(self):
    return ""
```
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example (in dictionary form). You should concatenate its members into a nicely formatted prompt.
```python
def doc_to_text(self, doc):
    return ""
```
Put the target answer of the prompt here, in the form: `" " + <answer>`.
```python
def doc_to_target(self, doc):
    return ""
```
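To make these concrete, here is a hedged sketch for the QA-style doc shown earlier (the `question`/`answer` fields come from that example; the exact prompt formatting is an assumption, not the harness's prescription):
```python
def doc_to_text(self, doc):
    # Build the unanswered prompt from the doc's fields.
    return "Question: " + doc["question"] + "\nAnswer:"

def doc_to_target(self, doc):
    # Prefix with a single space so the target concatenates
    # cleanly onto the prompt produced by doc_to_text.
    return " " + doc["answer"]
```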
Understand that the strings from `doc_to_text` and `doc_to_target` will be concatenated together to build up labeled examples in the k-shot setting where k > 0. Design with that in mind 👍.
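Roughly speaking, the assembly looks like the following sketch (this is illustrative, not the harness's actual `fewshot_context` implementation; `fewshot_docs` and `query_doc` are hypothetical variables, and the `"\n\n"` delimiter is an assumption):
```python
# Illustrative k-shot prompt assembly: description, k labeled
# examples, then the unlabeled query.
labeled = [self.doc_to_text(d) + self.doc_to_target(d) for d in fewshot_docs]
prompt = "\n\n".join(
    [self.fewshot_description()] + labeled + [self.doc_to_text(query_doc)]
)
```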
You can sanity-check your formatting by writing a few examples out to disk:
```bash
python -m scripts.write_out \
    --output_base_path <path> \
    --num_examples N
```
Open the file specified at the `--output_base_path <path>` and ensure it passes a simple eye test.
### The Evaluation
Now comes evaluation. The methods you'll need to implement are:
```python
def construct_requests(self, doc, ctx):
    """ Uses RequestFactory to construct Requests and returns an iterable of
    Requests which will be sent to the LM.

    :param doc:
        The document as returned from training_docs, validation_docs, or test_docs.
    :param ctx: str
        The context string, generated by fewshot_context. This includes the natural
        language description, as well as the few shot examples, and the question
        part of the document for `doc`.
    """
```
If your task requires generating text you'll need to return a `rf.greedy_until` request.
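For a classification-style task, a hedged sketch might look like the following (assuming `rf` is imported from `lm_eval.base`; the yes/no candidates are illustrative):
```python
def construct_requests(self, doc, ctx):
    # Ask the LM to score each candidate continuation; the second
    # value returned by rf.loglikelihood (a greedy-match flag) is
    # unused here.
    ll_yes, _ = rf.loglikelihood(ctx, " yes")
    ll_no, _ = rf.loglikelihood(ctx, " no")
    return ll_yes, ll_no
```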
```python
def process_results(self, doc, results):
    """Take a single document and the LM results and evaluates, returning a
    dict where keys are the names of submetrics and values are the values of
    the metric for that one document

    :param doc:
        The document as returned from training_docs, validation_docs, or test_docs.
    :param results:
        The results of the requests created in construct_requests.
    """
```
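Continuing the hypothetical yes/no task above, a filled-in version could look like this sketch (the `answer` gold-label field is an assumption):
```python
def process_results(self, doc, results):
    # Unpack the loglikelihoods in the order construct_requests
    # returned them, then score this single document.
    ll_yes, ll_no = results
    gold = doc["answer"]  # hypothetical boolean gold label
    acc = 1.0 if (ll_yes > ll_no) == gold else 0.0
    return {"acc": acc}
```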
```python
def aggregation(self):
    """
    :returns: {str: [float] -> float}
        A dictionary where keys are the names of submetrics and values are
        functions that aggregate a list of metrics
    """
    return {}
```
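For instance, the per-document accuracy from the sketch above would typically be averaged across documents, using the `mean` helper from `lm_eval/metrics.py`:
```python
from lm_eval.metrics import mean

def aggregation(self):
    # Average the per-document "acc" values into one number.
    return {"acc": mean}
```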
See `lm_eval/metrics.py` for a few "built-in" aggregate metrics you can easily import.
```python
def higher_is_better(self):
    """
    :returns: {str: bool}
        A dictionary where keys are the names of submetrics and values are
        whether a higher value of the submetric is better
    """
    return {}
```
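Continuing the hypothetical `acc` submetric from the sketches above, this would simply be `return {"acc": True}`, since higher accuracy is better.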