Merge branch 'big-refactor' into better-docs
@@ -10,6 +10,7 @@ If your intended task relies on features beyond what are described in this guide
## Configurations
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.
### Parameters
@@ -45,60 +46,130 @@ If your intended task relies on features beyond what are described in this guide
A key component of the `lm-evaluation-harness` library is the `Filter` object. In a typical evaluation run of the harness, we take the formatted inputs and run them through our LM, with the appropriate output type (greedy or free-form generation, or loglikelihood-based comparative scoring).
After getting scores or output text from our LM on each `Instance` or document in the dataset, we then need to feed these responses into a metric or scoring function to return scores to a user.
However, certain tasks may require more complex behavior than directly turning over model outputs to a metric function. For example, we may want to post-process our output text by truncating it or extracting a model's answer, we may want to ensemble over multiple "takes" on the same document, et cetera.
**Detailed Aside**:
We do such post-processing by operating on *responses*, which are stored after running an LM on an `Instance` from the task in `Instance.resps`.
`resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`.
Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack and store element-by-element in `Instance.filtered_resps` for the corresponding instances. Thus, a pipeline takes as input a list of model responses for each doc, and must return a single model response per doc, *without it being wrapped in a list*.
**End Aside**
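To make these shapes concrete, here is a minimal illustrative sketch (toy data and a standalone function, not the library's actual `Filter` classes) of a `take_first`-style step operating on the nested `resps` structure:

```python
from typing import List

# Toy stand-in for responses collected across 3 documents with
# `repeats: 2`, i.e. the List[List[str]] passed into a filter pipeline.
resps: List[List[str]] = [
    ["The answer is 6", "The answer is 7"],
    ["The answer is 12", "The answer is 12"],
    ["The answer is 3", "The answer is 5"],
]

def take_first(nested_resps: List[List[str]]) -> List[str]:
    """Collapse each per-document list of responses to its first element."""
    return [per_doc_resps[0] for per_doc_resps in nested_resps]

# Each document is now associated with a single, unwrapped response, which
# the harness would store as that Instance's `filtered_resps` and hand to
# the metric function.
print(take_first(resps))  # ['The answer is 6', 'The answer is 12', 'The answer is 3']
```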
A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome!
### Multiple Filter Pipelines
Tasks need not be limited to a single filter pipeline. We enable users to run multiple, distinct, filter pipelines on *the same model outputs* generated in one run on a task.
As a case study, let's look at an implementation of solving the GSM8k math word problem benchmark in `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`. Here, we emulate the setup used by [Self-Consistency Improves Chain of Thought Prompting](https://arxiv.org/abs/2203.11171), in which evaluation is performed by generating N chain-of-thought outputs from a model via temperature-based sampling, extracting the numeric answer given at the end of each chain of thought, and then majority voting across all of those answers.
Within our YAML file:
```yaml
...
repeats: 64
filter_list:
  - name: "score-first"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "take_first"
  - name: "maj@64"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
  - name: "maj@8"
    filter:
      - function: "take_first_k"
        k: 8
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
```
We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence.
Our first filter pipeline, "score-first", implements:
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first of the 64 model answers

and then scoring this single answer.
```yaml
- name: "score-first"
  filter:
    - function: "regex"
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
    - function: "take_first"
```
Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
- applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem
- applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each document
- taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document.
```yaml
- name: "maj@64"
  filter:
    - function: "regex"
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
    - function: "majority_vote"
    - function: "take_first"
```
Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the length-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml
- name: "maj@8"
  filter:
    - function: "take_first_k"
      k: 8
    - function: "regex"
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
    - function: "majority_vote"
    - function: "take_first"
```
Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines.
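To make the three pipelines concrete, the following is an illustrative Python sketch (toy data, not the harness's internal code) of what the regex-extraction, `majority_vote`, and `take_first_k` steps do with one document's 64 sampled generations:

```python
import re
from collections import Counter
from typing import List

# Same pattern as the `regex_pattern` above, minus the YAML-level escaping.
ANSWER_RE = re.compile(r"The answer is (\-?[0-9\.\,]*[0-9]+)")

def extract_answer(generation: str) -> str:
    """Pull the final numeric answer out of one chain-of-thought generation."""
    match = ANSWER_RE.search(generation)
    return match.group(1) if match else ""

def majority_vote(answers: List[str]) -> str:
    """Return the most common extracted answer (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for 64 sampled chains of thought on a single GSM8k problem.
generations = (["... so 3 * 6 = 18. The answer is 18"] * 40
               + ["... I must have miscounted. The answer is 17"] * 24)
answers = [extract_answer(g) for g in generations]

score_first = answers[0]               # "score-first": keep only the first answer
maj_at_64 = majority_vote(answers)     # "maj@64": vote over all 64 answers
maj_at_8 = majority_vote(answers[:8])  # "maj@8": subset to 8 responses, then vote
print(score_first, maj_at_64, maj_at_8)  # 18 18 18
```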
## Embedded Python Code
You can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments (an illustrative helper module is sketched after the list):
1. `doc_to_text`
2. `doc_to_target`
3. `gold_alias`
4. `aggregation` for a `metric` in `metric_list`
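As a hypothetical illustration (the module name `utils.py` and the dataset column names below are assumptions, not part of the library), a YAML could contain `doc_to_text: !function utils.doc_to_text` and `aggregation: !function utils.trimmed_mean`, backed by a helper module stored next to the YAML such as:

```python
# utils.py -- hypothetical helper module in the same directory as the task YAML.
from typing import List

def doc_to_text(doc: dict) -> str:
    """Referenced from the YAML as `doc_to_text: !function utils.doc_to_text`."""
    # "question" is an assumed dataset column name for this illustration.
    return f"Question: {doc['question']}\nAnswer:"

def trimmed_mean(scores: List[float]) -> float:
    """Referenced as `aggregation: !function utils.trimmed_mean` for a metric."""
    trimmed = sorted(scores)[1:-1] or list(scores)  # drop best/worst when possible
    return sum(trimmed) / len(trimmed)
```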
## (No Longer Recommended) Direct `Task` Subclassing
The prior implementation method of new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the Task class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`.
## Including a Base YAML
You can base a YAML on another YAML file and use it as a template. This can be handy when you need to change only the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory. Otherwise, you will need to define the full path.
```
include: <YAML filename or with full path>
...
```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml), which is based on [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).
## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, or `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```
metric_list:
  ...
      - "\\$"
```
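To see what these auxiliary arguments control, here is a hedged example that calls the underlying [HuggingFace Evaluate `exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match) directly with the same kwargs (this shows the behavior of the `evaluate` package itself, not harness-specific code):

```python
import evaluate

exact_match = evaluate.load("exact_match")

result = exact_match.compute(
    predictions=["The answer is $18."],
    references=["the answer is 18"],
    ignore_case=True,           # "The" vs. "the" no longer matters
    ignore_punctuation=True,    # the trailing "." is stripped before comparison
    regexes_to_ignore=["\\$"],  # dollar signs are removed from both strings
)
print(result)  # {'exact_match': 1.0}
```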
### Natively Supported Metrics
Here we list all metrics currently supported natively in `lm-eval` (a short sketch of how the perplexity-style aggregations are computed follows the lists below):
Metrics:
* `acc` (accuracy)
* `acc_norm` (length-normalized accuracy)
* `acc_mutual_info` (baseline loglikelihood - normalized accuracy)
* `perplexity`
* `word_perplexity` (perplexity per word)
* `byte_perplexity` (perplexity per byte)
* `bits_per_byte`
* `matthews_corrcoef` (Matthews correlation coefficient)
* `f1` (F1 score)
* `bleu`
* `chrf`
* `ter`
Aggregation functions:
* `mean`
* `median`
* `perplexity`
* `weighted_perplexity`
* `bits_per_byte`
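The perplexity-style aggregations weight each document by its length. As a rough sketch of the arithmetic (a minimal illustration of the standard definitions; see the aggregation implementations in the codebase for the exact code used), given per-document `(loglikelihood, length)` pairs:

```python
import math
from typing import List, Tuple

# Each item pairs a document's total loglikelihood (in nats) with a length
# weight: word count for `word_perplexity`, byte count for `byte_perplexity`
# and `bits_per_byte`.
def weighted_perplexity(items: List[Tuple[float, int]]) -> float:
    lls, weights = zip(*items)
    return math.exp(-sum(lls) / sum(weights))

def bits_per_byte(items: List[Tuple[float, int]]) -> float:
    lls, num_bytes = zip(*items)
    return -sum(lls) / (sum(num_bytes) * math.log(2))

# e.g. two documents with loglikelihoods -120.0 and -80.0 over 300 and 200 bytes:
items = [(-120.0, 300), (-80.0, 200)]
print(weighted_perplexity(items))  # exp(200 / 500) ≈ 1.49
print(bits_per_byte(items))        # 200 / (500 * ln 2) ≈ 0.577
```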
## Good Reference Tasks
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:
Multiple choice tasks:
- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)
Corpus perplexity evaluations:
- Wikitext (`lm_eval/tasks/wikitext/wikitext.yaml`)
Generative tasks:
- GSM8k (`lm_eval/tasks/gsm8k/gsm8k.yaml`)
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
@@ -2,7 +2,7 @@
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will become v0.5.0 in the future).
## Setup
@@ -12,6 +12,7 @@ If you haven't already, go ahead and fork the main repo, clone it, create a branch
```bash
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <task-name>
pip install -e ".[dev]"
```
@@ -93,6 +94,9 @@ doc_to_target: "{{answer}}"
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` should therefore not contain trailing whitespace, and `doc_to_target` should not contain leading whitespace.
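For instance, with a toy document and hypothetical prompt fields, the scored string is assembled like this (illustrative only):

```python
# Toy example (hypothetical dataset fields) of how the full input-output
# string is put together: input, a single space, then target.
doc = {"question": "What is 7 * 6?", "answer": "42"}

doc_to_text = f"Question: {doc['question']}\nAnswer:"  # no trailing whitespace
doc_to_target = doc["answer"]                          # no leading whitespace

full_string = doc_to_text + " " + doc_to_target
print(repr(full_string))  # 'Question: What is 7 * 6?\nAnswer: 42'
```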
Users can also fill out the optional `template_aliases` YAML field, which is added ahead of both the `doc_to_text` and `doc_to_target` fields. This field should not contain any text, but only Jinja variable definitions (`{% ... %}` clauses). This can be used to perform more involved string manipulations and renamings of dataset columns while the main prompt fields remain easy to parse visually.
### Using Python Functions for Prompts
There may be cases where the prompt we want to implement is more easily expressed in Python than in Jinja 2. For this, we can use Python helper functions that are referenced from the YAML config. Note that the function's script must be in the same directory as the YAML.
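As a hypothetical sketch (the script name `prompts.py` and the column names are illustrative assumptions, not fixed conventions), the YAML might set `doc_to_text: !function prompts.doc_to_text`, with the helper script alongside it:

```python
# prompts.py -- hypothetical helper script placed in the same directory as the YAML.

def doc_to_text(doc: dict) -> str:
    """Assemble a prompt that needs light logic (joining a list-valued column),
    which is clumsy to express purely in Jinja."""
    facts = " ".join(doc["supporting_facts"])  # assumed list-valued column
    return f"{facts}\nQuestion: {doc['question']}\nAnswer:"

def doc_to_target(doc: dict) -> str:
    return doc["answer"]
```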
@@ -124,13 +128,21 @@ For example, For Super Glue BoolQ, if we want to use the prompt template `GPT-3
```yaml
use_prompt: "promptsource:GPT-3 Style"
```
#### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words per document) and are evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
An annotated example in the case of SciQ is as follows:
```yaml
template_aliases: "{% set answer_choices = [distractor1, distractor2, distractor3, correct_answer] %}{% set gold = 3 %}" # `template_aliases` must set the jinja variable `answer_choices` (List[str]) to the list of possible answer choices, and set `gold` to the index within `answer_choices` of this doc's gold label (correct answer choice).
doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as target for each choice in answer_choices.
doc_to_target: "{{gold}}" # this must be castable to an integer. It must output only the index within `answer_choices` that is the correct label.
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
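Conceptually, the scoring for this output type works as in the following sketch (an illustration of the convention described above, not the harness's internal code):

```python
from typing import Callable, List

def score_multiple_choice_doc(
    context: str,                                # doc_to_text(doc), ending in "Answer:"
    answer_choices: List[str],                   # set via template_aliases
    gold: int,                                   # index produced by doc_to_target(doc)
    loglikelihood: Callable[[str, str], float],  # assumed LM scoring function
) -> float:
    # Each choice is scored as the continuation " <choice>" following the context.
    scores = [loglikelihood(context, " " + choice) for choice in answer_choices]
    predicted = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if predicted == gold else 0.0  # per-document accuracy
```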
### Setting metrics
@@ -138,6 +150,7 @@ You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice?
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*?
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
```yaml
metric_list:
  ...
    aggregation: ...
    higher_is_better: ...
```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults, if using a natively supported metric.
For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.
### Optional, More Advanced Setup
Some tasks may require more advanced processing logic than is described in this guide.
@@ -201,16 +215,42 @@ Passing `--tasks /path/to/yaml/file` is also accepted.
## Checking validity
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
```bash
python -m scripts.write_out \
--output_base_path <path> \
--tasks <your-task-name> \
--sets <train | val | test> \
--num_fewshot K \
--num_examples N
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented.
### Task impl. checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Has the task been checked for equivalence with the original paper's methodology?
* [ ] Is the task in Eval-harness v0.3.0 or earlier?
* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
This folder is meant to contain instructions and task setups required to evaluate certain papers which may perform non-standard evaluation setups.
Tasks may already be supported in the library under `lm_eval/tasks`, or, if highly paper-specific, may remain as YAMLs in the respective `examples/paper-title` folder.
## Verified Papers:
* [WIP] [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
* Further details can be found in the `chain_of_thought` subfolder.
## Candidates to Support:
* Least-to-Most Prompting
* Algorithmic Prompting
* Other in-scope prompting techniques
* Multi-turn prompting strategies are likely out of scope for the repository.
* Pythia Suite: Term Frequencies over training
* All setups from GPT-3 Paper
* Varying few-shot orderings + selection ; Varying the label choices for multiple-choice tasks
* Your Paper Here!
# Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
## All Tasks in Paper
* ...
* ...
* ...
## Reproduction Scripts
* ...
from . import hf_causal
from . import openai_completions
from . import textsynth
from . import dummy
@@ -9,7 +9,7 @@ generation_kwargs:
    - "\n\n"
  do_sample: true
  temperature: 0.2
repeats: 64
filter_list:
  - name: "score-first" # pick only the first response, and report metrics on that
    filter: