push task checklist

Commit 00d33dfe authored Jun 02, 2023 by haileyschoelkopf

Showing with 134 additions and 0 deletions

lm_eval/tasks/CHECKLIST.md +67 -0

lm_eval/tasks/README.md +67 -0

--- a/lm_eval/tasks/CHECKLIST.md
+++ b/lm_eval/tasks/CHECKLIST.md
+# v1.0 Tasks
+This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
+
+Check a task's box iff the task is implemented in v2.0 and tested for regression. Strike through a task once its v2.0 implementation has been checked *against the implementation in the paper that introduced it* or against a popularizing implementation.
+
+- [ ] Glue
+- [ ] SuperGlue
+- [ ] CoQA
+- [ ] DROP
+- [x] ~~Lambada~~
+- [x] Lambada (Cloze variants)
+- [ ] Lambada (Multilingual)
+- [x] Wikitext
+- [x] PiQA
+- [ ] PROST
+- [ ] MCTACO
+- [ ] Pubmed QA
+- [x] SciQ
+- [ ] QASPER
+- [ ] QA4MRE
+- [ ] TriviaQA
+- [x] AI2 ARC
+- [ ] LogiQA
+- [ ] HellaSwag
+- [ ] SWAG
+- [ ] OpenBookQA
+- [ ] SQuADv2
+- [ ] RACE
+- [ ] HeadQA
+- [ ] MathQA
+- [ ] WebQs
+- [ ] WSC273
+- [ ] Winogrande
+- [ ] ANLI
+- [ ] Hendrycks Ethics
+- [ ] TruthfulQA
+- [ ] MuTual
+- [ ] Hendrycks Math
+- [ ] Asdiv
+- [ ] GSM8k
+- [ ] Arithmetic
+- [ ] MMLU
+- [ ] Translation (WMT) suite
+- [ ] Unscramble
+- [x] ~~Pile (perplexity)~~
+- [ ] BLiMP
+- [ ] ToxiGen
+- [ ] CrowS-Pairs
+- [ ] XCopa
+- [ ] BIG-Bench
+- [ ] XStoryCloze
+- [ ] XWinograd
+- [ ] PAWS-X
+- [ ] XNLI
+- [ ] MGSM
+
+# Novel Tasks
+Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
+
+# Task Wishlist
+
+- [ ] TheoremQA
+- [ ] Theorem Proving evaluations
+- [ ] Chain of Thought
+- [ ] Self-consistency, Least-to-Most prompting, etc.
+- [ ] Summarization Tasks
+- [ ] Anthropic Model-Written Evals 
\ No newline at end of file
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
+# v1.0 Tasks
+This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
+
+Check a task's box iff the task is implemented in v2.0 and tested for regression. Strike through a task once its v2.0 implementation has been checked *against the implementation in the paper that introduced it* or against a popularizing implementation.
+
+- [ ] Glue
+- [ ] SuperGlue
+- [ ] CoQA
+- [ ] DROP
+- [x] ~~Lambada~~
+- [x] Lambada (Cloze variants)
+- [ ] Lambada (Multilingual)
+- [x] Wikitext
+- [x] PiQA
+- [ ] PROST
+- [ ] MCTACO
+- [ ] Pubmed QA
+- [x] SciQ
+- [ ] QASPER
+- [ ] QA4MRE
+- [ ] TriviaQA
+- [x] AI2 ARC
+- [ ] LogiQA
+- [ ] HellaSwag
+- [ ] SWAG
+- [ ] OpenBookQA
+- [ ] SQuADv2
+- [ ] RACE
+- [ ] HeadQA
+- [ ] MathQA
+- [ ] WebQs
+- [ ] WSC273
+- [ ] Winogrande
+- [ ] ANLI
+- [ ] Hendrycks Ethics
+- [ ] TruthfulQA
+- [ ] MuTual
+- [ ] Hendrycks Math
+- [ ] Asdiv
+- [ ] GSM8k
+- [ ] Arithmetic
+- [ ] MMLU
+- [ ] Translation (WMT) suite
+- [ ] Unscramble
+- [x] ~~Pile (perplexity)~~
+- [ ] BLiMP
+- [ ] ToxiGen
+- [ ] CrowS-Pairs
+- [ ] XCopa
+- [ ] BIG-Bench
+- [ ] XStoryCloze
+- [ ] XWinograd
+- [ ] PAWS-X
+- [ ] XNLI
+- [ ] MGSM
+
+# Novel Tasks
+Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
+
+# Task Wishlist
+
+- [ ] TheoremQA
+- [ ] Theorem Proving evaluations
+- [ ] Chain of Thought
+- [ ] Self-consistency, Least-to-Most prompting, etc.
+- [ ] Summarization Tasks
+- [ ] Anthropic Model-Written Evals 
\ No newline at end of file
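
The checkbox and strikethrough convention described in both files can be tallied mechanically. The following is a minimal sketch, not part of this commit: the function names `classify` and `summarize` and the status labels `pending`, `ported`, and `validated` are hypothetical, chosen here only for illustration. It assumes each task sits on its own `- [ ]` or `- [x]` line, optionally wrapped in `~~`, and it tolerates the leading `+` of a diff line.

```python
import re
from collections import Counter

# Matches "- [ ] Name", "- [x] Name", and "- [x] ~~Name~~".
# The optional leading "+" lets lines be pasted straight from the diff.
ITEM = re.compile(r"^\+?\s*- \[( |x)\] (~~)?(.+?)(~~)?\s*$")


def classify(line):
    """Return (task_name, status) for a checklist line, or None for other lines."""
    match = ITEM.match(line)
    if match is None:
        return None
    checked = match.group(1) == "x"
    struck = match.group(2) is not None and match.group(4) is not None
    name = match.group(3)
    if checked and struck:
        # Checked and struck through: ported and checked against the original.
        return name, "validated"
    if checked:
        # Checked only: implemented in v2.0 and tested for regression.
        return name, "ported"
    return name, "pending"


def summarize(text):
    """Count tasks per status across all checklist lines in `text`."""
    results = (classify(line) for line in text.splitlines())
    return Counter(status for _, status in filter(None, results))


# A few lines copied from the diff above.
sample = """\
+- [ ] DROP
+- [x] ~~Lambada~~
+- [x] Lambada (Cloze variants)
+- [x] Wikitext
"""
print(summarize(sample))
```

Running `summarize` over the full contents of `CHECKLIST.md` would give a quick progress count for the port; headings and prose lines are ignored because they do not match the checklist pattern.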