## Configurations
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.
### Parameters
...
...
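For orientation, a minimal task config might look something like the sketch below. The field names and values here are illustrative assumptions based on the parameters described above; see the YAML files under `lm_eval/tasks/` for real, complete examples.

```yaml
# Illustrative sketch only -- field names assumed from the parameter list above.
task: my_new_task             # the name your task will be registered under
dataset_path: sciq            # HF Hub path of the source dataset
output_type: multiple_choice  # how model outputs are scored (here, by comparing loglikelihoods)
training_split: train
validation_split: validation
test_split: test
```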
### Multiple Filter Pipelines
Tasks need not be limited to a single filter pipeline: users can run multiple, distinct filter pipelines in parallel on *the same model outputs* generated in one run of a task.
#### Case study: gsm8k-CoT-self-consistency
As a case study, let's look at the implementation of the GSM8k math word problem benchmark in `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`. Here, we emulate the setup used by [Self-Consistency Improves Chain of Thought Prompting](https://arxiv.org/abs/2203.11171), in which evaluation is performed by generating N chain-of-thought outputs from a model via temperature-based sampling, extracting the numeric answer at the end of each chain of thought, and then majority voting across all those answers.
Within our YAML file:
```yaml
...
repeats: 64
filter_list:
  - name: "score-first" # pick only the first response, and report metrics on that
    ...
```
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will become v0.5.0 in the future).
## Setup
...
...
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` should therefore not end with whitespace, and `doc_to_target` should not begin with whitespace.
Users can also fill out the optional `template_aliases` YAML field, which is prepended to both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, only Jinja variable definitions (`{% ... %}` clauses). It can be used to perform more involved string manipulations and renamings of dataset columns, while the main prompt fields remain easy to parse visually.
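For example, a dataset whose passage column is named `passage_text` could be renamed via `template_aliases` so that the prompt fields below it stay readable. The column names here are hypothetical, for illustration only:

```yaml
template_aliases: "{% set passage = passage_text %}"  # only Jinja {% ... %} definitions, no prose
doc_to_text: "{{passage}}\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```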
We also support writing prompts in the [Promptsource](https://github.com/bigscience-workshop/promptsource) library created by BigScience. See `docs/advanced_task_guide.md` for more information.
#### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words per document) and evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
An annotated example in the case of SciQ is as follows:
```yaml
template_aliases: "{% set answer_choices = [distractor1, distractor2, distractor3, correct_answer] %}{% set gold = 3 %}" # `template_aliases` must set the list of possible answer choices to the jinja variable `answer_choices` (List[str]), and must set `gold` to the index within `answer_choices` of this doc's gold label (the correct answer choice).
doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as the target, for each choice in `answer_choices`.
doc_to_target: "{{gold}}" # This must be castable to an integer. It must output only the index within `answer_choices` that is the correct label.
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
### Setting metrics
You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice? (the "acc" metric)
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*? (the "exact_match" metric)
If the answer to either of the above is no, you'll need to specify which scoring metrics to use! Metrics can be listed in the following format:
...
...
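For instance, a generation task scored by exact match might declare something like the following sketch (assuming the format above; confirm the metric and aggregation names available in your version of the harness):

```yaml
metric_list:
  - metric: exact_match    # 1 if the generation exactly matches the gold string, else 0
    aggregation: mean      # average the per-document scores
    higher_is_better: true
```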
Passing `--tasks /path/to/yaml/file` is also accepted.
## Checking validity
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
```bash
python -m scripts.write_out \
    --output_base_path <path> \
    --tasks <your-task-name> \
    --sets <train | val | test> \
    --num_fewshot K \
    --num_examples N
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented.
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Has the task been checked for equivalence with the original paper's methodology?
* [ ] Is the task in Eval-harness v0.3.0 or earlier?
* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
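As a practical first step toward filling out these checklists, you can run your task end-to-end on a small model and compare the scores against any published reference numbers. A sketch of such a run is shown below; the model name is arbitrary, and the exact entry point and flags may differ on your checkout of the `big-refactor` branch:

```bash
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks <your-task-name> \
    --num_fewshot 5 \
    --batch_size 8 \
    --device cuda:0
```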
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!