## Configurations
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.
### Parameters
...
...
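For orientation, a minimal task config might look something like the sketch below. The field names and values here are illustrative assumptions based on the parameters described above; see the YAML files under `lm_eval/tasks/` for real, complete examples.

```yaml
# Illustrative sketch only -- field names assumed from the parameter list above.
task: my_new_task             # the name your task will be registered under
dataset_path: sciq            # HF Hub path of the source dataset
output_type: multiple_choice  # how model outputs are scored (here, by comparing loglikelihoods)
training_split: train
validation_split: validation
test_split: test
```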
### Multiple Filter Pipelines
Tasks need not be limited to a single filter pipeline: users can run multiple, distinct filter pipelines in parallel on *the same model outputs* generated in one run of a task.
#### Case study: gsm8k-CoT-self-consistency
As a case study, let's look at the implementation of the GSM8k math word problem benchmark in `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`. Here, we emulate the setup used by [Self-Consistency Improves Chain of Thought Prompting](https://arxiv.org/abs/2203.11171), in which evaluation is performed by generating N chain-of-thought outputs from a model via temperature-based sampling, extracting the numeric answer at the end of each chain of thought, and then majority voting across all those answers.
Within our YAML file:
```yaml
...
repeats: 64
filter_list:
  - name: "score-first" # pick only the first response, and report metrics on that
    ...
```
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will become v0.5.0 in the future).
## Setup
...
...
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` should therefore not end with whitespace, and `doc_to_target` should not begin with whitespace.
Users can also fill out the optional `template_aliases` YAML field, which is prepended to both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, only Jinja variable definitions (`{% ... %}` clauses). It can be used to perform more involved string manipulations and renamings of dataset columns, while the main prompt fields remain easy to parse visually.
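For example, a dataset whose passage column is named `passage_text` could be renamed via `template_aliases` so that the prompt fields below it stay readable. The column names here are hypothetical, for illustration only:

```yaml
template_aliases: "{% set passage = passage_text %}"  # only Jinja {% ... %} definitions, no prose
doc_to_text: "{{passage}}\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```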
We also support writing prompts in the [Promptsource](https://github.com/bigscience-workshop/promptsource) library created by BigScience. See `docs/advanced_task_guide.md` for more information.
#### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words per document) and evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
An annotated example in the case of SciQ is as follows:
```yaml
template_aliases: "{% set answer_choices = [distractor1, distractor2, distractor3, correct_answer] %}{% set gold = 3 %}" # `template_aliases` must set the list of possible answer choices to the jinja variable `answer_choices` (List[str]), and must set `gold` to the index within `answer_choices` of this doc's gold label (the correct answer choice).
doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as the target, for each choice in `answer_choices`.
doc_to_target: "{{gold}}" # This must be castable to an integer. It must output only the index within `answer_choices` that is the correct label.
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
### Setting metrics
You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice? (the "acc" metric)
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*? (the "exact_match" metric)
If the answer to either of the above is no, you'll need to specify which scoring metrics to use! Metrics can be listed in the following format:
...
...
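For instance, a generation task scored by exact match might declare something like the following sketch (assuming the format above; confirm the metric and aggregation names available in your version of the harness):

```yaml
metric_list:
  - metric: exact_match    # 1 if the generation exactly matches the gold string, else 0
    aggregation: mean      # average the per-document scores
    higher_is_better: true
```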
Passing `--tasks /path/to/yaml/file` is also accepted.
## Checking validity
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
```bash
python -m scripts.write_out \
    --output_base_path <path> \
    --tasks <your-task-name> \
    --sets <train | val | test> \
    --num_fewshot K \
    --num_examples N
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented.
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Has the task been checked for equivalence with the original paper's methodology?
* [ ] Is the task in Eval-harness v0.3.0 or earlier?
* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
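As a practical first step toward filling out these checklists, you can run your task end-to-end on a small model and compare the scores against any published reference numbers. A sketch of such a run is shown below; the model name is arbitrary, and the exact entry point and flags may differ on your checkout of the `big-refactor` branch:

```bash
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks <your-task-name> \
    --num_fewshot 5 \
    --batch_size 8 \
    --device cuda:0
```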
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!