This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against the original introducing paper's* implementation or a popularizing implementation.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against the original introducing paper's* implementation or a popularizing implementation. (WIP) denotes that a PR already exists for the task or that someone is already working on it.
- [ ] Glue (WIP)
- [ ] Glue (WIP)
- [x] SuperGlue
- [x] SuperGlue
...
@@ -14,23 +14,23 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] PiQA
- [x] PiQA
- [ ] PROST
- [ ] PROST
- [ ] MCTACO
- [ ] MCTACO
- [ ] Pubmed QA
- [ ] Pubmed QA (WIP)
- [x] SciQ
- [x] SciQ
- [ ] QASPER
- [ ] QASPER
- [ ] QA4MRE
- [ ] QA4MRE
- [ ] TriviaQA
- [ ] TriviaQA
- [x] AI2 ARC
- [x] AI2 ARC
- [ ] LogiQA
- [ ] LogiQA
- [] HellaSwag
- [x] HellaSwag
- [ ] SWAG
- [ ] SWAG (WIP)
- [x] OpenBookQA
- [x] OpenBookQA
- [ ] SQuADv2
- [ ] SQuADv2
- [ ] RACE
- [ ] RACE (WIP)
- [ ] HeadQA
- [ ] HeadQA
- [ ] MathQA
- [ ] MathQA
- [ ] WebQs
- [ ] WebQs
- [ ] WSC273
- [ ] WSC273
- [ ] Winogrande
- [ ] Winogrande (WIP)
- [x] ANLI
- [x] ANLI
- [ ] Hendrycks Ethics
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] TruthfulQA
...
@@ -38,7 +38,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for