Note: The tasks are formatted to be run with `apply_chat_template` and `fewshot_as_multiturn`.
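For example, an invocation along the following lines should work (the model name here is purely illustrative; substitute any chat model supported by your backend):

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks mmlu_llama \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn
```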
### Citation
```
BibTeX-formatted citation goes here
```
### Groups, Tags, and Tasks
#### Groups
* `group_name`: `Short description`
#### Tags
* `tag_name`: `Short description`
#### Tasks
* `mmlu_llama`: `generation variant of MMLU`
* `arc_challenge_chat`: `generation variant of ARC-Challenge using MMLU format`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
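### Task Configuration

The core of the `mmlu_llama` task configuration is reproduced below, showing the dataset source and the prompt format shared by these tasks: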
```yaml
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
output_type: generate_until
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
doc_to_text: "Given the following question and four candidate answers (A, B, C and D), choose the best answer.\nQuestion: {{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nYour response should end with \"The best answer is [the_answer_letter]\" where the [the_answer_letter] is one of A, B, C or D."
```
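For a given document, the `doc_to_text` template above renders a prompt of the following shape (the angle-bracketed fields stand in for the actual dataset values):

```text
Given the following question and four candidate answers (A, B, C and D), choose the best answer.
Question: <question>
A. <choices[0]>
B. <choices[1]>
C. <choices[2]>
D. <choices[3]>
Your response should end with "The best answer is [the_answer_letter]" where the [the_answer_letter] is one of A, B, C or D.
```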