Unverified Commit 7a038118 authored by Jonathan Tow, committed by GitHub

Merge pull request #301 from StephenHogg/task_doc

Suggested updates for the task guide
parents e00d682f 11f614b0
@@ -199,7 +199,11 @@ def construct_requests(self, doc, ctx):
    """
    return ...
```
#### What's a `Request`? What's a `doc`?

To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question-type information, answers, and anything else you think will be needed in order to assess your model on a given task. Keep in mind that its fields can be basically whatever you want (you can sort this out in `training_docs` / `validation_docs` / `test_docs` if you need to customise things - see above); just remember to use them consistently throughout the rest of the `Task` you write.
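
For instance, a doc for a simple yes/no classification task might look something like the following sketch. The field names (`query`, `label`) are purely illustrative, not anything the harness requires:

```python
# A hypothetical doc for a yes/no classification task. The field names are
# just an example; pick whatever suits your dataset and use the same names
# consistently throughout your Task.
doc = {
    "query": "Is the sky blue on a clear day?",
    "label": 1,  # 1 -> "yes", 0 -> "no"
}
```
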
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. Requests are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text you'll need to return an `rf.greedy_until` request; otherwise an `rf.loglikelihood` across all labels in a classification task will do.

The `construct_requests` function can return a list of `Request`s or any other iterable; it's perfectly fine to `yield` them from a generator. This is particularly handy if you are creating more than one request per `doc` (for example, one request per candidate label, as in the sketch below). The objects this function returns then get consumed one by one and turned into result objects.
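
A minimal sketch of `construct_requests` for the illustrative yes/no doc above, assuming the usual `rf` request factory exported from `lm_eval.base` and the unpacking style used by existing classification tasks in the harness:

```python
from lm_eval.base import rf

def construct_requests(self, doc, ctx):
    # One loglikelihood request per candidate label; `ctx` already holds the
    # few-shot examples plus the question text produced from this doc.
    ll_yes, _ = rf.loglikelihood(ctx, " yes")
    ll_no, _ = rf.loglikelihood(ctx, " no")
    return ll_yes, ll_no
```
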
```python
def process_results(self, doc, results):
@@ -214,6 +218,8 @@ def process_results(self, doc, results):
    """
    return {}
```
This is the next step in the chain after `construct_requests`. In between the two functions, the requests are evaluated, and their results are passed to this function in the `results` argument. "Processing results" means calculating the metric or metrics of interest for your dataset from those results and the associated ground truth given to this function. It's possible to calculate and return multiple metrics here, and the logic can be whatever you want - as long as you've made sure the ground truth was included in the `doc` object. The dict returned from this function should have the format `{'metric_name': value}`. It is not necessary to have the same keys for every doc processed by `process_results`; that sort of thing can be handled in the next function, `aggregation`.
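
Continuing the illustrative yes/no sketch (and assuming the ground truth lives in a hypothetical `doc["label"]` field), `process_results` might look like:

```python
def process_results(self, doc, results):
    # `results` arrives in the order the requests were returned from
    # construct_requests: (loglikelihood of " yes", loglikelihood of " no").
    ll_yes, ll_no = results
    pred = int(ll_yes > ll_no)
    gold = doc["label"]  # hypothetical ground-truth field
    return {"acc": float(pred == gold)}
```
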
```python
def aggregation(self):
@@ -224,8 +230,10 @@ def aggregation(self):
    """
    return {}
```
In `process_results`, model outputs are converted into metrics. These are per-document metrics, however; the `aggregation` function works out how to combine them into a corpus-level metric. Imagine you have a bunch of documents, for each of which you have calculated an F1 score. What should that mean overall? Should the scores be summed, averaged, or should the min/max be taken? This function handles that problem.
The contents of the function itself are pretty straightforward: it should simply return a dict that maps each metric label that could be returned by `process_results` to a function that can be used to aggregate that metric. That is to say, if the metrics that `process_results` could return are given by `{'a', 'b', 'c'}`, then all of these keys should be present in the dict returned by `aggregation`.
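
For the single `acc` metric used in the sketches above, that dict could simply point at the built-in `mean` helper from `lm_eval/metrics.py`:

```python
from lm_eval.metrics import mean

def aggregation(self):
    # Per-document accuracies are averaged into the corpus-level score.
    return {"acc": mean}
```
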
__NOTE__: See `lm_eval/metrics.py` for a few "built-in" aggregate metrics you can easily import. The standard metrics available in this package are generally based on `sklearn` functions, so if you are in any doubt about how to set things up, the documentation over there can be of assistance. If you need to write a custom metric for some reason, start by looking at the existing ones in `lm_eval/metrics.py` for an idea of what the function signature needs to be.
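
If none of the built-ins fit, a custom aggregation function is just a callable that receives the list of per-document values collected for one metric key and returns a single number - for instance, a hypothetical median aggregator:

```python
import statistics

def median(items):
    # `items` is the list of per-document values gathered for one metric key;
    # return whatever single number should represent the whole corpus.
    return statistics.median(items)
```

You would then map the relevant metric key to `median` in the dict returned by `aggregation`.
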
```python
def higher_is_better(self):
@@ -236,6 +244,7 @@ def higher_is_better(self):
    """
    return {}
```
Finally, this function returns a dict with the same keys as `aggregation` and, as the description says, simply tells us whether higher scores are better for each metric.
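
For the running accuracy example this is just:

```python
def higher_is_better(self):
    # Higher accuracy is better.
    return {"acc": True}
```
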
Some tasks that are good examples of various ways evaluation can be implemented can be found here: [LAMBADA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/lambada.py), [TriviaQA](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/triviaqa.py), [SQuAD](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/squad.py).