This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in v2.0 and tested for regression. Tasks should be struck through if they have been checked against the *original introducing paper's* implementation or a popularizing implementation.
- [ ] Glue
- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
...
* `wikitext`: measure perplexity on the WikiText dataset via rolling loglikelihoods.
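
To clarify what "rolling loglikelihoods" means for the `wikitext` entry above: a document longer than the model's context window is scored in consecutive windows, each carrying as much preceding text as will fit as context, so that every token is scored exactly once; word-level perplexity is then `exp(-total_logprob / num_words)`. Below is a minimal illustrative Python sketch of that idea, assuming hypothetical `score_fn` and `tokenize` callables and placeholder window sizes; it is not the harness's actual API or implementation.

```python
import math

def rolling_windows(tokens, max_len, context_len):
    """Split a long token sequence into windows so that every token is scored
    exactly once, each window carrying up to `context_len` tokens of prior
    context. Assumes context_len < max_len."""
    assert context_len < max_len
    windows = []
    start = 0
    while start < len(tokens):
        ctx_start = max(0, start - context_len)
        end = min(len(tokens), ctx_start + max_len)
        # (window tokens, index within the window where the "new" tokens begin)
        windows.append((tokens[ctx_start:end], start - ctx_start))
        start = end
    return windows

def rolling_word_perplexity(score_fn, tokenize, text, max_len=2048, context_len=1024):
    """score_fn(window) should return one logprob per token in `window`,
    each conditioned on its predecessors within the window; tokens before
    the window's offset are context only and are never counted twice."""
    tokens = tokenize(text)
    total_logprob = 0.0
    for window, offset in rolling_windows(tokens, max_len, context_len):
        logprobs = score_fn(window)
        total_logprob += sum(logprobs[offset:])
    num_words = len(text.split())  # word-level perplexity normalizes by whitespace-split words
    return math.exp(-total_logprob / num_words)
```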
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?