Commit d3fa8470 authored by haileyschoelkopf

address issues raised on task docs

parent be1aea6d
@@ -21,20 +21,10 @@ As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (
 ## Creating a YAML file
-- Tasks in eval harness are largely implemented via YAML files.
-- mention the tasks worth "forking"/building off of
-- Step through the different args all tasks will need
-To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file:
+To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,
 ```sh
-touch lm_eval/tasks/new_mcqa.yaml
-```
-or
-```sh
-touch lm_eval/tasks/new_generative_task.yaml
+touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
 ```
 ### Selecting and configuring a dataset
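For a concrete sense of what ends up in such a file, a starter setup might look like the sketch below, using `gsm8k` as the guide's running example. The YAML keys shown (`task`, `dataset_path`, `doc_to_text`, `doc_to_target`, `metric_list`, and so on) are assumptions about the harness's config schema rather than a definitive listing; existing task YAMLs under `lm_eval/tasks` are the authoritative reference.

```sh
# Illustrative sketch only: create a task subfolder and a starter YAML.
# The key names below are assumptions about the harness's config schema;
# check an existing task YAML in lm_eval/tasks before relying on them.
mkdir -p lm_eval/tasks/gsm8k
cat > lm_eval/tasks/gsm8k/gsm8k.yaml <<'EOF'
task: gsm8k                   # shorthand name used to select the task
dataset_path: gsm8k           # Hugging Face Hub dataset identifier
dataset_name: main            # dataset config/subset, if any
training_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"   # per-document prompt template
doc_to_target: "{{answer}}"                      # per-document gold target
metric_list:
  - metric: exact_match
EOF
```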
@@ -241,15 +231,17 @@ The checklist is the following:
 For adding novel benchmarks/datasets to the library:
 * [ ] Is the task an existing benchmark in the literature?
-* [ ] Has the task been checked for equivalence with the original paper's methodology?
-* [ ] Is the task in Eval-harness v0.3.0 or earlier?
-* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
+* [ ] Have you referenced the original paper that introduced the task?
+* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
 If other tasks on this dataset are already supported:
 * [ ] Is the "Main" variant of this task clearly denoted?
 * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
 * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
 ## Submitting your task
 You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
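Regarding the checklist item on checking against a reference implementation and documenting how to run such a test, a documented comparison command might look roughly like the following; the entry point and flag names are assumptions and may differ across harness versions.

```sh
# Hypothetical comparison run: score the new task on a small model, then
# compare the reported metric against the original paper or reference code.
# The entry point and flag names are assumptions; adjust to your version.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks gsm8k \
    --limit 100 \
    --output_path results/gsm8k_reference_check.json
```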