Commit d3fa8470 authored by haileyschoelkopf

address issues raised on task docs

parent be1aea6d
@@ -21,20 +21,10 @@ As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (
 ## Creating a YAML file
-- Tasks in eval harness are largely implemented via YAML files.
-- mention the tasks worth "forking"/building off of
-- Step through the different args all tasks will need
-To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file:
+To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,
 ```sh
-touch lm_eval/tasks/new_mcqa.yaml
-```
-or
-```sh
-touch lm_eval/tasks/new_generative_task.yaml
+touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
 ```
 ### Selecting and configuring a dataset
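For a concrete sense of what ends up in such a file, a starter setup might look like the sketch below, using `gsm8k` as the guide's running example. The YAML keys shown (`task`, `dataset_path`, `doc_to_text`, `doc_to_target`, `metric_list`, and so on) are assumptions about the harness's config schema rather than a definitive listing; existing task YAMLs under `lm_eval/tasks` are the authoritative reference.

```sh
# Illustrative sketch only: create a task subfolder and a starter YAML.
# The key names below are assumptions about the harness's config schema;
# check an existing task YAML in lm_eval/tasks before relying on them.
mkdir -p lm_eval/tasks/gsm8k
cat > lm_eval/tasks/gsm8k/gsm8k.yaml <<'EOF'
task: gsm8k                   # shorthand name used to select the task
dataset_path: gsm8k           # Hugging Face Hub dataset identifier
dataset_name: main            # dataset config/subset, if any
training_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"   # per-document prompt template
doc_to_target: "{{answer}}"                      # per-document gold target
metric_list:
  - metric: exact_match
EOF
```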
@@ -241,15 +231,17 @@ The checklist is the following:
 For adding novel benchmarks/datasets to the library:
 * [ ] Is the task an existing benchmark in the literature?
-* [ ] Has the task been checked for equivalence with the original paper's methodology?
-* [ ] Is the task in Eval-harness v0.3.0 or earlier?
-* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
+* [ ] Have you referenced the original paper that introduced the task?
+* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
 If other tasks on this dataset are already supported:
 * [ ] Is the "Main" variant of this task clearly denoted?
 * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
 * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
 ## Submitting your task
 You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
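Regarding the checklist item on checking against a reference implementation and documenting how to run such a test, a documented comparison command might look roughly like the following; the entry point and flag names are assumptions and may differ across harness versions.

```sh
# Hypothetical comparison run: score the new task on a small model, then
# compare the reported metric against the original paper or reference code.
# The entry point and flag names are assumptions; adjust to your version.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks gsm8k \
    --limit 100 \
    --output_path results/gsm8k_reference_check.json
```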