# v1.0 Tasks
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.

Boxes should be checked iff a task is implemented in the refactor and tested for regression. Tasks should be struck through if they have been checked *against the implementation from the original introducing paper* or a popularizing implementation. (WIP) denotes that a PR exists or that someone is already working on the task.
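
As a rough illustration of these conventions, the sketch below classifies a single checklist line by its checkbox, strikethrough, and (WIP) marker. The `classify` helper and the status labels (`validated`, `ported`, `wip`, `todo`) are hypothetical names chosen for this example; they are not part of the harness.

```python
import re

# Hypothetical helper for illustration only; not part of the Eval Harness.
# Matches a Markdown task-list item such as "- [x] ~~Lambada~~".
ITEM = re.compile(r"^- \[(?P<box>[ x])\] (?P<name>.+)$")


def classify(line):
    """Return (task_name, status) for a checklist line, or None if it is not one."""
    m = ITEM.match(line.strip())
    if m is None:
        return None
    name = m.group("name")
    checked = m.group("box") == "x"
    struck = len(name) > 4 and name.startswith("~~") and name.endswith("~~")
    wip = name.endswith("(WIP)")
    if struck:
        name = name[2:-2]
    if wip:
        name = name[: -len("(WIP)")].strip()
    if checked and struck:
        status = "validated"  # ported and checked against a reference implementation
    elif checked:
        status = "ported"     # implemented in the refactor and regression-tested
    elif wip:
        status = "wip"        # a PR exists or someone is working on it
    else:
        status = "todo"
    return name, status


sample = [
    "- [ ] Glue (WIP)",
    "- [x] SuperGlue",
    "- [x] ~~Lambada~~",
    "- [ ] CoQA",
]
for line in sample:
    print(classify(line))
```

Running a helper like this over the list gives a quick count of how many tasks remain in each state.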

- [ ] Glue (WIP)
- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST
- [ ] MCTACO
- [ ] Pubmed QA (WIP)
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
- [x] HellaSwag
- [ ] SWAG (WIP)
- [x] OpenBookQA
- [ ] SQuADv2
- [ ] RACE (WIP)
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande (WIP)
- [x] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math
- [ ] Asdiv
- [ ] GSM8k
- [ ] Arithmetic (WIP)
- [ ] MMMLU
- [ ] Translation (WMT) suite
- [ ] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
- [ ] XStoryCloze
- [ ] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM

# Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.

# Task Wishlist

- [ ] TheoremQA
- [ ] Theorem Proving evaluations
- [ ] Chain of Thought
- [ ] Self-consistency, Least-to-Most prompting, etc.
- [ ] Summarization Tasks
- [ ] Anthropic Model-Written Evals