abstract = "We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.",
}
```
### Subtasks
* `headqa_en` - English variant of HEAD-QA
* `headqa_es` - Spanish (original) variant of HEAD-QA
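
Both variants can be selected by task name through the harness. The snippet below is a minimal sketch using the `lm_eval` Python API (`simple_evaluate`); the model name is only a placeholder, and exact argument names may differ between harness versions.

```python
# Minimal sketch: evaluate both HEAD-QA variants via the lm-eval Python API.
# `pretrained=gpt2` is a placeholder model; swap in the model you want to test.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                        # Hugging Face backend
    model_args="pretrained=gpt2",
    tasks=["headqa_en", "headqa_es"],
    num_fewshot=0,
)

# Per-task metrics (e.g. acc, acc_norm) live under results["results"].
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```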
### Checklist
* [x] Is the task an existing benchmark in the literature?
  * [ ] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
  * [x] Same as LM Evaluation Harness v0.3.0 implementation
The prompt format is defined in the task YAML:

```yaml
template_aliases: "{% set answer_choices = answers|map(attribute='atext')|list %}{% set gold = ra - 1 %}" # set the list of possible answer choices, and set what this doc's gold label idx is
doc_to_text: "Question: {{qtext}}\nAnswer:"
doc_to_target: "{{answer_choices[gold]}}"
gold_alias: "{{gold}}" # this will be cast to an int.
```
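
To make the template concrete, the sketch below shows how those fields would be rendered for a single record. The example record is invented for illustration; only the field names (`qtext`, `answers`/`atext`, `ra`) come from the template above, and this is not the harness's own rendering code.

```python
# Illustrative only: mirrors what the YAML template above computes for one record.
from jinja2 import Template

doc = {  # invented example record; field names follow the schema used in the template above
    "qtext": "Which vitamin deficiency causes scurvy?",
    "answers": [
        {"aid": 1, "atext": "Vitamin A"},
        {"aid": 2, "atext": "Vitamin C"},
        {"aid": 3, "atext": "Vitamin D"},
        {"aid": 4, "atext": "Vitamin K"},
    ],
    "ra": 2,  # 1-indexed id of the right answer
}

# template_aliases: build the choice list and the 0-indexed gold label
answer_choices = [a["atext"] for a in doc["answers"]]
gold = doc["ra"] - 1

# doc_to_text / doc_to_target
prompt = Template("Question: {{qtext}}\nAnswer:").render(**doc)
target = answer_choices[gold]

print(prompt)  # Question: Which vitamin deficiency causes scurvy?\nAnswer:
print(target)  # Vitamin C
```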