# v1.0 Tasks

This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.

Boxes should be checked iff tasks are implemented in v2.0 and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.

- [ ] Glue
- [ ] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST
- [ ] MCTACO
- [ ] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
- [ ] HellaSwag
- [ ] SWAG
- [ ] OpenBookQA
- [ ] SQuADv2
- [ ] RACE
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande
- [ ] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math
- [ ] Asdiv
- [ ] GSM8k
- [ ] Arithmetic
- [ ] MMMLU
- [ ] Translation (WMT) suite
- [ ] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
- [ ] XStoryCloze
- [ ] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM

# Novel Tasks

Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.

# Task Wishlist

- [ ] TheoremQA
- [ ] Theorem Proving evaluations
- [ ] Chain of Thought
- [ ] Self-consistency; Least-to-Most prompting, etc.
- [ ] Summarization Tasks
- [ ] Anthropic Model-Written Evals