Add metabench task to LM Evaluation Harness (#2357)
* Add metabench (Kipnis et al. 2024)
* Update metabench tasks for full replication of original benchmarks, using publicly available datasets
* Remove unnecessary import
* Add permute versions of each task, where the answer orders are randomly shuffled.
* Add metabench group for easier evaluations
* Fix mmlu counts after removing duplicate
* Add secondary datasets
* Fix f-string error
* Fix f-string error for permute processing
* Add original hash to outputs for easy matching to original results
* Add line break at end of utils files
* Remove extra line from winogrande
* Reformat for linters
* fix multiple input test
* appease pre-commit
* Add metabench to tasks README
* fix multiple input `test_doc_to_text`
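The "permute" variants above shuffle each question's answer order while keeping track of the correct choice. A minimal sketch of that idea (the helper name, signature, and per-document seed are assumptions for illustration, not the actual metabench processing code):

```python
import random

def permute_choices(choices, gold_index, seed=None):
    """Shuffle answer choices and return (shuffled_choices, new_gold_index).

    Hypothetical helper illustrating the 'permute' task variants,
    where answer orders are randomly shuffled; a fixed per-document
    seed keeps the shuffle reproducible across runs.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The gold answer moves to wherever its original index landed.
    new_gold = order.index(gold_index)
    return shuffled, new_gold
```

For example, shuffling `["A", "B", "C", "D"]` with gold index 2 returns a reordered list whose new gold index still points at `"C"`.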
---------
Co-authored-by: Baber <baber@hey.com>