Unverified Commit be1aea6d, authored by Hailey Schoelkopf, committed by GitHub

Merge pull request #576 from EleutherAI/better-docs

Better docs
parents 966f0686 2d54df7e
# New Task Guide

`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).

This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will be v0.5.0 in the future).
@@ -65,7 +65,7 @@ fewshot_split: <split name to draw fewshot examples from, or `null`>
```
though if this is not set, we will default to train/validation/test sets, in that order.
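For reference, a minimal sketch of how these split fields can be filled in; this assumes the standard `training_split`/`validation_split`/`test_split` keys alongside `fewshot_split`, and the split names themselves depend on your dataset:

```yaml
training_split: train
validation_split: validation
test_split: test
fewshot_split: validation
```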
### Writing a prompt with Jinja 2
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
@@ -81,7 +81,7 @@ Suppose our dataset has a `"question"` field, and an `"answer"` field, which are
```
Question: {document[question]}
Answer:
```
We do this by writing
```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
```
@@ -96,7 +96,38 @@ doc_to_target: "{{answer}}"
Users can also fill out the optional `template_aliases` YAML field, which is added ahead of both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, but only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while the main prompt fields remain easy to parse visually.
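As a rough sketch of how this might look (the column names `input_text` and `label` here are hypothetical, standing in for whatever your dataset actually uses):

```yaml
template_aliases: "{% set question = input_text %}{% set answer = label %}"
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```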
We also support writing prompts with the [Promptsource](https://github.com/bigscience-workshop/promptsource) library created by BigScience; see the section on importing a prompt from Promptsource below, and `docs/advanced_task_guide.md`, for more information.
### Using Python Functions for Prompts
There may be cases where the prompt we want to implement is more easily expressed in Python than in Jinja 2. For this, we can use Python helper functions that are referenced in the YAML config. Note that the Python file containing the function must be in the same directory as the YAML.

A good example is WikiText, which requires a number of regex rules to clean its samples.
```python
import re


def wikitext_detokenizer(doc):
    string = doc["page"]
    # contractions
    string = string.replace("s '", "s'")
    string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string)
    # ... (further cleanup rules omitted here) ...
    string = string.replace(" 's", "'s")
    return string
```
We can point `doc_to_target` at this function by using the `!function` operator, followed by `<file name>.<function name>`. In the file [wikitext.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/6ae376e3a43caa58b95bb8aa73054a94827bf560/lm_eval/tasks/wikitext/wikitext.yaml) we write:
```yaml
doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
```
### Importing a Prompt from Promptsource
[Promptsource](https://github.com/bigscience-workshop/promptsource/tree/main/promptsource) is a great repository of crowdsourced prompts for many datasets. We can load these prompts by setting the `use_prompt` argument to a string of the form `"promptsource:<name of prompt template>"`. When doing so, `doc_to_text` and `doc_to_target` should be left undefined; the prompt template will be fetched for the dataset defined in the YAML file.

For example, for SuperGLUE BoolQ, if we want to use the prompt template `GPT-3 Style`, we can add this to the YAML file:
```yaml
use_prompt: "promptsource:GPT-3 Style"
```
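To give a sense of how this fits into a full config, here is a rough sketch; the surrounding field values are illustrative assumptions, not copied from an actual task file:

```yaml
task: boolq
dataset_path: super_glue
dataset_name: boolq
use_prompt: "promptsource:GPT-3 Style"
training_split: train
validation_split: validation
```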
#### Multiple choice format
@@ -116,8 +147,9 @@ Task implementers are thus able to decide what the answer choices should be for
### Setting metrics

You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice? (the `acc` metric)
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*? (the `exact_match` metric)
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
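A minimal sketch of that format, assuming a `metric_list` field with `metric`, `aggregation`, and `higher_is_better` keys (check the current `big-refactor` docs for the exact schema):

```yaml
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```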
@@ -143,7 +175,7 @@ As a heuristic check:
* Does your task require complex, multi-step post-processing of generated model outputs?
* Does your task require subsetting documents on the fly based on their content?
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?

For more detail on the task system and advanced features, see `docs/advanced_task_guide.md`. If none of the above sound like they apply to your task, it's time to continue on to checking your task performance!
@@ -170,7 +202,7 @@ This will add your task to the `group1` and `group2` groups, enabling people to
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.

You can do this via adding the Python snippet

```python
from lm_eval.tasks import include_task_folder
include_task_folder("/path/to/yaml/")  # path to the directory containing your task's YAML file(s); exact usage may differ on your branch
```
@@ -220,4 +252,5 @@ If other tasks on this dataset are already supported:
## Submitting your task

You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!