abstract = "We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.",
}
```
### Subtasks
* `headqa_en` - English variant of HEAD-QA
* `headqa_es` - Spanish variant of HEAD-QA
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
  * [x] Same as LM Evaluation Harness v0.3.0 implementation
The corresponding YAML task configuration uses Jinja2 templates along the following lines:
template_aliases: "{% set answer_choices = answers|map(attribute='atext')|list %}{% set gold = ra-1 %}" # set the list of possible answer choices, and set what this doc's gold label idx is
doc_to_text: "Question: {{qtext}}\nAnswer:"
doc_to_target: "{{answer_choices[gold]}}"
gold_alias: "{{gold}}" # this will be cast to an int.
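Rendered against a single HEAD-QA document, these templates yield the question prompt and the gold answer string. Below is a minimal, self-contained sketch (using `jinja2` directly rather than the harness's own rendering code) on a toy document that follows the HuggingFace `head_qa` field names (`qtext`, `ra`, `answers`/`atext`); the alias logic is mirrored in plain Python for clarity.

```python
from jinja2 import Template

# Toy document with HEAD-QA-style fields (qtext, ra, answers/atext); contents are illustrative.
doc = {
    "qtext": "Which vitamin is synthesised in the skin on exposure to sunlight?",
    "ra": 3,  # 1-indexed id of the right answer
    "answers": [
        {"aid": 1, "atext": "Vitamin A"},
        {"aid": 2, "atext": "Vitamin B12"},
        {"aid": 3, "atext": "Vitamin D"},
        {"aid": 4, "atext": "Vitamin K"},
    ],
}

# Mirror of `template_aliases`: build the choice list and the 0-indexed gold label.
answer_choices = [a["atext"] for a in doc["answers"]]
gold = doc["ra"] - 1

prompt = Template("Question: {{qtext}}\nAnswer:").render(**doc)  # doc_to_text
target = answer_choices[gold]                                    # doc_to_target

print(prompt)  # Question: Which vitamin is synthesised in the skin on exposure to sunlight?
               # Answer:
print(target)  # Vitamin D
```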
The ETHICS dataset is a benchmark that spans concepts in justice, well-being,
duties, virtues, and commonsense morality. Models predict widespread moral
judgments about diverse text scenarios. This requires connecting physical and
social world knowledge to value judgements, a capability that may enable us
to steer chatbot outputs or eventually regularize open-ended reinforcement
learning agents.
Homepage: https://github.com/hendrycks/ethics
### Citation
```
@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
```
### Subtasks
* `ethics_cm`: Commonsense Morality
* `ethics_deontology`: Deontology
* `ethics_justice`: Justice
* `ethics_utilitarianism`: Utilitarianism
* `ethics_virtue`: Virtue Ethics

Missing:
* `ethics_utilitarianism_original`: the utilitarianism task in its original formulation from the paper
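The subtasks above can be run through the harness's Python API. A hedged sketch, assuming the `lm_eval.simple_evaluate` entry point; the model name, `model_args`, task selection, and batch size below are placeholders for illustration only:

```python
import lm_eval

# Evaluate a small HF model on a subset of the ETHICS tasks (all values illustrative).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",
    tasks=["ethics_cm", "ethics_justice", "ethics_virtue"],
    batch_size=8,
)
print(results["results"])  # per-task metrics keyed by task name
```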
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
https://arxiv.org/pdf/1905.13319.pdf
MathQA is a large-scale dataset of 37k English multiple-choice math word problems
covering multiple math domain categories, built by annotating word problems from
the AQuA dataset (Ling et al., 2017) with operation programs.
Homepage: https://math-qa.github.io/math-QA/
### Citation
```
@misc{amini2019mathqa,
title={MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms},
author={Aida Amini and Saadia Gabriel and Peter Lin and Rik Koncel-Kedziorski and Yejin Choi and Hannaneh Hajishirzi},
year={2019},
eprint={1905.13319},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Subtasks
* `mathqa`: The MathQA dataset as a multiple-choice task, where the answer choices are not provided in the context.
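Because the choices are not shown in the prompt, each option is typically scored as a continuation of the question and the highest-scoring option is taken as the prediction. A minimal sketch of that log-likelihood comparison (not the harness's own scoring code; the model and example problem are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, tok(continuation, return_tensors="pt").input_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # logits at position i predict token i+1, so score only the continuation positions
    return sum(
        logprobs[pos, full_ids[0, pos + 1]].item()
        for pos in range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    )

question = "Question: the average of 10, 20 and 30 is ?\nAnswer:"
choices = [" 15", " 20", " 25", " 30", " 60"]          # MathQA-style numeric options
best = max(range(len(choices)), key=lambda i: choice_logprob(question, choices[i]))
print(choices[best])  # a capable model should prefer " 20"
```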
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
  * The MathQA dataset predates transformer-based prompted LLMs. We should, however, return to this task to ensure equivalence to the non-CoT version of MathQA used in the chain-of-thought paper.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
  * [x] Checked for equivalence with v0.3.0 LM Evaluation Harness