# v1.0 Tasks

This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.

Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if they have been checked *against the original introducing paper's* implementation or a popularizing implementation. (WIP) denotes that there is already a PR or a person working on this task.

- [ ] Glue (WIP)
- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [x] ~~Lambada (Multilingual)~~
- [x] Wikitext
- [x] PiQA
- [x] PROST
- [ ] MCTACO
- [x] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [x] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA (WIP)
- [x] HellaSwag
- [x] SWAG
- [x] OpenBookQA
- [x] RACE
- [ ] SQuADv2 (WIP)
- [x] HeadQA (WIP)
- [ ] MathQA (WIP)
- [ ] WebQs
- [ ] WSC273
- [x] Winogrande
- [x] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math (WIP)
- [ ] Asdiv (WIP)
- [ ] GSM8k
- [x] Arithmetic
- [ ] MMMLU
- [ ] Translation (WMT) suite
- [x] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [x] ToxiGen
- [ ] StoryCloze
- [ ] NaturalQs
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
- [ ] XStoryCloze
- [ ] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM
- [ ] SCROLLS
- [ ] JSON Task (reference: https://github.com/EleutherAI/lm-evaluation-harness/pull/481)
- [ ] Babi

# Novel Tasks

Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.

# Task Wishlist

- [ ] TheoremQA
- [ ] Theorem Proving evaluations
- [ ] Chain of Thought
- [ ] Self-consistency; Least-to-Most prompting, etc.
- [ ] Summarization Tasks
- [ ] Anthropic Model-Written Evals