To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command.

**When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.
```bash
python main.py \
...
...
```
## Implementing new tasks
To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
## Cite as
...
...
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))
Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:
- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.
```python
description_dict = {
    "task_name_1": "description",
    "task_name_2": "description",
    ...
}
```
Note that a task's description will be separated from the few-shot examples and prompt that follow it by a newline, like so:
```python
"""
<description>
<examples>
<prompt>
"""
```
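For example, here is a minimal sketch of supplying such a dict programmatically through `evaluator.simple_evaluate`; the model and task choices below are placeholders, not a recommendation:

```python
from lm_eval import evaluator

# Hypothetical invocation; pick whatever model and tasks you actually want to run.
results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["copa"],
    num_fewshot=2,
    description_dict={
        "copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative",
    },
)
print(results["results"])
print(results["versions"])  # include these task versions when reporting results
```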
## Descriptions in File
One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. for some file at `/your/path/descriptions.json` you may have:
```json
{
"cycle_letters":"Please unscramble the letters into a word, and write that word:",
"copa":"Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```
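You could then point the CLI at that file. A minimal sketch (the model and task flags below are placeholders; only `--description_dict_path` is the relevant piece):

```bash
python main.py \
    --model gpt2 \
    --tasks copa \
    --num_fewshot 2 \
    --description_dict_path /your/path/descriptions.json
```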
These methods return `True`/`False` depending on whether your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not report it as having a test set.
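For example, a sketch for a dataset that ships train and validation splits but keeps its test labels private might look like this (assuming the standard `has_*_docs` overrides on your task class):

```python
def has_training_docs(self):
    return True

def has_validation_docs(self):
    return True

def has_test_docs(self):
    # The test labels are not publicly available, so report no test set.
    return False
```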
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return #...
...
...
```
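As one possible sketch, if your download step left one JSON-lines file per split under `DATASET_PATH`, the overrides could look like this (the file names and the `json`/`os` usage are assumptions about your task, not part of the harness API):

```python
import json
import os

def training_docs(self):
    # Read one dict per line from the hypothetical train split file.
    with open(os.path.join(self.DATASET_PATH, "train.jsonl")) as f:
        return [json.loads(line) for line in f]

def validation_docs(self):
    with open(os.path.join(self.DATASET_PATH, "valid.jsonl")) as f:
        return [json.loads(line) for line in f]
```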
<br>
In the case your task is _not_ multiple-choice, override the following methods for your task class:
Put the natural language task description here as a single-line string (no `\n`s), e.g. `"Translate English to French:"`.
```python
def fewshot_description(self):
    return ""
```
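For the translation example above, this could simply be:

```python
def fewshot_description(self):
    return "Translate English to French:"
```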
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict`, whose keys and values are `str`s. You should concatenate these values into a neatly formatted prompt.
```python
def doc_to_text(self, doc):
    ...
```
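For the question/answer document shown earlier, a sketch might be (the field names are assumptions about your dataset):

```python
def doc_to_text(self, doc):
    # Build the prompt from the doc's fields, leaving the answer out.
    return f"Question: {doc['question']}\nAnswer:"
```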
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired arguments:
```bash
python -m scripts.write_out \
    --output_base_path <path> \
    --tasks <your-task> \
    --sets <train | val | test> \
    --num_fewshot K \
    --num_examples N \
    --description_dict_path <path>
```
Open the file specified at `--output_base_path <path>` and ensure it passes a simple eye test.
For a concrete reference, the HeadQA task follows this pattern and registers per-language variants (abridged):

```python
class HeadQABase(HFTask, MultipleChoiceTask):
    # ... (dataset configuration and the document-processing method, which
    # builds and returns an `out_doc` dict, are abridged here)

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        return doc["query"]


class HeadQAEn(HeadQABase):
    DATASET_NAME = "en"


class HeadQAEs(HeadQABase):
    DATASET_NAME = "es"


# for backwards compatibility
class HeadQAEsDeprecated(HeadQABase):
    DATASET_NAME = "es"
    print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
```