"app/lifecycle/server.go" did not exist on "f397e0e988272ffd14bdfb6c4070bb3ab5328df2"
Commit 640d9817 authored by haileyschoelkopf

update advanced docs

parent 4e764ca0
@@ -46,9 +46,23 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
Explain: What are filters? What is their place in the pipeline?
Format of the `resps` object, and what needs to happen to yield proper scorable results
TODO: triviaqa is implementable if we don't use `take_first` and implement a multi-alias exact_match_any metric
TODO: Filters might warrant a separate doc.
A key component of the `lm-evaluation-harness` library is the `Filter` object. In a typical evaluation run of the harness, we take the formatted inputs and run them through our LM, with the appropriate output type (greedy or free-form generation, or loglikelihood-based comparative scoring).
After getting scores or output text from our LM on each `Instance` or document in the dataset, we then need to feed these responses into a metric or scoring function to return scores to a user.
However, certain tasks may require more complex behavior than directly turning over model outputs to a metric function. For example, we may want to post-process our output text by truncating it or extracting a model's answer, we may want to ensemble over multiple "takes" on the same document, et cetera.
**Detailed Aside**:
We do such post-processing by operating on *responses*, which are stored in `Instance.resps` after running an LM on each `Instance` from the task.
`resps` is a `List[str]` for each instance, and we pass our filters a `List[List[<expected return type from model>]]` — that is, `[instance.resps for instance in instances]`.
Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, whose elements we unpack and store in `Instance.filtered_resps` for the corresponding instances. Thus, we take as input a list of the model's returns for each doc, and must return, for each doc, a single model return *not wrapped in a list*.
**End Aside**
A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome!
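To make the shape contract above concrete, here is a standalone sketch of what a filter's core operation looks like. This is illustrative only: the class name and `apply` signature below are assumptions for this sketch, not the harness's exact interface, so consult `lm_eval/filters/__init__.py` and the `Filter` base class in the codebase for the real API.

```python
from typing import List


class TakeFirstSketch:
    """Illustrative stand-in for a `take_first`-style filter.

    Input: one List[str] of model responses per document, i.e. a
    List[List[str]] overall. Output: each document's inner list collapsed
    to its first element, so that each document ends up with a single,
    unwrapped response stored in `filtered_resps`.
    """

    def apply(self, resps: List[List[str]]) -> List[str]:
        # Keep only the first response for every document.
        return [doc_resps[0] for doc_resps in resps]


resps = [["42", "41", "42"], ["7", "7", "8"]]  # two documents, three takes each
print(TakeFirstSketch().apply(resps))  # -> ['42', '7']
```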
### Multiple Filter Pipelines
@@ -62,7 +76,7 @@ Within our YAML file:
...
repeats: 64
filter_list:
  - name: "score-first"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
@@ -73,7 +87,7 @@ filter_list:
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
  - name: "maj@8"
    filter:
      - function: "take_first_k"
        k: 8
@@ -87,7 +101,9 @@ We are able to provide multiple different filter pipelines, each with their own
Our first filter pipeline implements:
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first of the 64 model answers

and then scoring this single answer.
```yaml
  - name: "score-first"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "take_first"
```
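In plain Python, the `regex` step of this pipeline behaves roughly as follows. The fallback value for non-matching generations is an assumption made for illustration; the harness's actual regex filter may handle misses differently.

```python
import re

# Same pattern as in the YAML above, written as a raw Python string.
ANSWER_PATTERN = re.compile(r"The answer is (\-?[0-9\.\,]*[0-9]+)")


def extract_answer(generation: str) -> str:
    """Extract the numeric answer from a chain-of-thought generation."""
    match = ANSWER_PATTERN.search(generation)
    # "[invalid]" is an illustrative placeholder for unparseable outputs.
    return match.group(1) if match else "[invalid]"


print(extract_answer("We compute 3 + 4 = 7. The answer is 7."))  # -> "7"
```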
### "Splitting" Pipelines
Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
- applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem
- applying majority voting to all responses, which returns a length-1 `[<majority answer>]` list for each document
- taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document.
```yaml
  - name: "maj@64"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
```
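Conceptually, `majority_vote` collapses each document's list of extracted answers to its most common element, which is why the final `take_first` is still needed to unwrap the length-1 result. A rough standalone equivalent is sketched below; the tie-breaking here follows `Counter.most_common`, which is an assumption for illustration rather than a description of the harness's actual behavior.

```python
from collections import Counter
from typing import List


def majority_vote(doc_resps: List[str]) -> List[str]:
    """Reduce N extracted answers to a length-1 list holding the mode."""
    (winner, _count), = Counter(doc_resps).most_common(1)
    return [winner]


print(majority_vote(["7", "7", "8", "7"]))  # -> ['7']
```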
Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the length-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml
  - name: "maj@8"
    filter:
      - function: "take_first_k"
        k: 8
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
```
Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in 3 different ways, as defined by our filter pipelines.
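For example, with an `exact_match`-style metric one would expect three reported numbers for the task, one per pipeline. The key format sketched below (metric name paired with filter pipeline name) is an assumption for illustration, and the scores are made up:

```python
# Hypothetical shape of reported results for one task with three pipelines:
results = {
    "exact_match,score-first": 0.41,  # single first-sample answer
    "exact_match,maj@64": 0.55,       # majority vote over all 64 samples
    "exact_match,maj@8": 0.49,        # majority vote over the first 8
}
```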
TODO: either allow for pipelines that "split" and report multiple keys, or something different. In particular, we want to support not re-running reward/scoring models for every different filter pipeline when their outputs can be shared.
## Embedded Python Code
@@ -123,7 +166,7 @@ The prior implementation method of new tasks was to subclass `Task`. While we in
You can base a YAML on another YAML file as a template. This can be handy when you need to change only the prompt for `doc_to_text` but keep the rest the same, or to change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory; otherwise, you will need to define the full path.
```
include: <YAML filename or with full path>
...
```
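As a sketch of how this composes, a variant task might include a base config and override only a couple of fields. The filename and overridden values below are hypothetical:

```yaml
# my_variant.yaml -- hypothetical file; includes a base config from the
# same directory and overrides only the task name and prompt.
include: gsm8k-cot.yaml
task: gsm8k_cot_variant
doc_to_text: "Question: {{question}}\nLet's think step by step.\nAnswer:"
```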
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml), which is based on [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).
@@ -153,4 +196,17 @@ metric_list:
## Good Reference Tasks
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:
Multiple choice tasks:
- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)

Corpus perplexity evaluations:
- Wikitext (`lm_eval/tasks/wikitext/wikitext.yaml`)

Generative tasks:
- GSM8k (`lm_eval/tasks/gsm8k/gsm8k.yaml`)

Tasks using complex filtering:
- GSM8k with CoT (and with Self-Consistency): `lm_eval/tasks/gsm8k/gsm8k-cot.yaml`; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`
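When starting from one of these references, it can help to keep the rough shape of a task config in mind. The skeleton below is a hypothetical orientation aid, not a verified working task: the field names follow the `TaskConfig` fields discussed in this guide, but required fields, defaults, and exact values should be checked against the reference YAMLs above.

```yaml
# Hypothetical minimal multiple-choice task config, for orientation only.
task: my_new_task
dataset_path: username/some_hf_dataset   # placeholder Hugging Face path
output_type: multiple_choice
validation_split: validation
doc_to_text: "{{question}}"
doc_to_target: "{{answer}}"
doc_to_choice: "{{choices}}"             # assumed field for the answer options
metric_list:
  - metric: acc
```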