Unverified Commit f73c2bc8 authored by Hailey Schoelkopf, committed by GitHub

Update new_task_guide.md

parent f55595c8
@@ -2,9 +2,9 @@
 `lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
-This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will be v0.4.0 in the future.)
+This documentation page provides a walkthrough to get started creating your own task, in `lm-eval` versions v0.4.0 and later.
-A more interactive tutorial is available as a Jupyter notebook [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/examples/lm-eval-overview.ipynb).
+A more interactive tutorial is available as a Jupyter notebook [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/lm-eval-overview.ipynb).
 ## Setup
@@ -14,12 +14,11 @@ If you haven't already, go ahead and fork the main repo, clone it, create a branch…
 # After forking...
 git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
 cd lm-evaluation-harness
-git checkout big-refactor
 git checkout -b <task-name>
 pip install -e ".[dev]"
 ```
-In this document, we'll walk through the basics of implementing a static benchmark evaluation in two formats: a *generative* task which requires sampling text from a model, such as [`gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/gsm8k/gsm8k.yaml), and a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices, such as [`sciq`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/sciq/sciq.yaml).
+In this document, we'll walk through the basics of implementing a static benchmark evaluation in two formats: a *generative* task which requires sampling text from a model, such as [`gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml), and a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices, such as [`sciq`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml).
 ## Creating a YAML file
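For context on the guide text touched by this diff: the multiple-choice format it mentions (e.g. `sciq`) is configured entirely through a task YAML file of the kind the guide goes on to describe. Below is a minimal sketch of what such a config can look like. The key names follow the v0.4-era task config schema, but the task name, dataset identifier, split names, and prompt templates are hypothetical placeholders, not a copy of the real `sciq.yaml`.

```yaml
# Minimal sketch of a multiple-choice task config (illustrative; values are placeholders).
task: my_multiple_choice_task            # hypothetical task name
dataset_path: my_org/my_dataset          # hypothetical Hugging Face dataset identifier
output_type: multiple_choice             # score fixed answer choices by log-likelihood
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"   # prompt template over assumed dataset fields
doc_to_choice: "{{choices}}"                     # list of candidate answer strings per document
doc_to_target: "{{answer_index}}"                # index of the correct choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```

A generative task such as `gsm8k` keeps the same overall YAML shape but uses a generation-style `output_type` and generation/stop-sequence settings instead of `doc_to_choice`.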