To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command.

```bash
python main.py \
    --model gpt2 \
    --tasks lambada,hellaswag \
    --device 0
```

**When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.

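If you save the results to disk (e.g. by passing something like `--output_path results.json` to `main.py`), the versions are easy to pull out for reporting. A minimal sketch, assuming the output file is plain JSON with the `versions` key referenced above (the filename is just an example):

```python
import json

# Load the results file written by main.py
# (assumes --output_path results.json was passed)
with open("results.json") as f:
    results = json.load(f)

# "versions" maps each task name to its version number,
# e.g. {"lambada": 0, "hellaswag": 0}
for task, version in results["versions"].items():
    print(f"{task}: version {version}")
```
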
To implement a new task in eval harness, see [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_guide.md).

### Full Task List

| Task Name | Train | Val | Test | Val/Test Docs | Metrics |
|-----------|:-----:|:---:|:----:|:-------------:|---------|
...

For example, [PR #240](https://github.com/EleutherAI/lm-evaluation-harness/pull/240) split the `headqa` task into English (`headqa_en`) and Spanish (`headqa_es`) variants, keeping the old `headqa` name as a deprecated alias. The relevant excerpt from `lm_eval/tasks/headqa.py`:

```python
# lm_eval/tasks/headqa.py (excerpt)
class HeadQABase(HFTask, MultipleChoiceTask):
    ...

    def doc_to_text(self, doc):
        return doc["query"]

class HeadQAEn(HeadQABase):
    DATASET_NAME = "en"

class HeadQAEs(HeadQABase):
    DATASET_NAME = "es"

# for backwards compatibility
class HeadQAEsDeprecated(HeadQABase):
    DATASET_NAME = "es"

    def __init__(self):
        super().__init__()
        # warn anyone still instantiating the old task name
        print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
```