To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command:
```bash
python main.py \
...
...
```

**When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.
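If you want to pull those versions out programmatically, here is a minimal sketch. It assumes the evaluator's output dict was saved to a hypothetical `results.json`, and that per-task scores sit under a `results` key alongside the `versions` key mentioned above; both names other than `versions` are assumptions for illustration.

```python
import json

# Minimal sketch: report per-task scores together with task versions.
# Assumptions: the output dict was dumped to results.json, and scores live
# under a "results" key next to the "versions" key referenced above.
with open("results.json") as f:
    output = json.load(f)

for task, version in output["versions"].items():
    scores = output.get("results", {}).get(task, {})
    print(f"{task} (version {version}): {scores}")
```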
To implement a new task in eval harness, see [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_guide.md). For example, [PR #240](https://github.com/EleutherAI/lm-evaluation-harness/pull/240) split the HeadQA task into English and Spanish variants, keeping the old `headqa` name for backwards compatibility:

```python
class HeadQABase(HFTask, MultipleChoiceTask):
    def doc_to_text(self, doc):
        return doc["query"]

class HeadQAEn(HeadQABase):
    DATASET_NAME = "en"

class HeadQAEs(HeadQABase):
    DATASET_NAME = "es"

# for backwards compatibility
class HeadQAEsDeprecated(HeadQABase):
    DATASET_NAME = "es"

    def __init__(self):
        super().__init__()
        print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.")
```