Commits · 40f5458f57f2acee1a2ba86aed16c8a37d99fffc · gaoqiong / lm-evaluation-harness

07 Jun, 2024 1 commit

Test output table layout consistency (#1916) · 40f5458f

Zafir Stojanovski authored Jun 07, 2024

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

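To illustrate the "sort metrics in output table" item, here is a minimal sketch of why sorting gives a consistent layout. It is not the harness's code: the helper `table_rows` and the four-decimal formatting are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def table_rows(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Hypothetical helper: build (metric, formatted value) rows for a table."""
    # Sorting by metric name makes the row order independent of the
    # order in which the metrics were inserted into the dict.
    return [(name, f"{value:.4f}") for name, value in sorted(metrics.items())]


rows_a = table_rows({"f1": 0.5, "acc": 0.25})
rows_b = table_rows({"acc": 0.25, "f1": 0.5})
```

Both calls produce the same rows, so a test can compare rendered tables without depending on dict ordering.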

30 May, 2024 1 commit

`higher_is_better` tickers in output table (#1893) · 14221c84

Zafir Stojanovski authored May 30, 2024



* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

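As a rough illustration of a `higher_is_better` ticker, the sketch below maps the flag to an arrow and returns an empty string when the flag is `None`. The function name `direction_ticker` and the arrow characters are assumptions for this example, not taken from the commit.

```python
from typing import Optional


def direction_ticker(higher_is_better: Optional[bool]) -> str:
    """Hypothetical helper: arrow showing which direction is better."""
    if higher_is_better is None:
        # No direction recorded for this metric: show nothing.
        return ""
    return "↑" if higher_is_better else "↓"


tickers = [direction_ticker(flag) for flag in (True, False, None)]
```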

24 May, 2024 1 commit

Fix for bootstrap_iters = 0 case (#1715) (#1789) · b043b050

Hailey Schoelkopf authored May 24, 2024

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

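One way to picture the `bootstrap_iters=0` case is a helper that skips resampling and reports no standard error. This is a minimal sketch under stated assumptions, not the harness's implementation: the name `mean_with_stderr`, its signature, and the choice to return `None` for the standard error are all hypothetical.

```python
import random
import statistics
from typing import Optional, Sequence, Tuple


def mean_with_stderr(
    values: Sequence[float], bootstrap_iters: int, seed: int = 0
) -> Tuple[float, Optional[float]]:
    """Hypothetical helper: mean plus a bootstrap estimate of its standard error."""
    mean = statistics.fmean(values)
    # With fewer than two bootstrap iterations (including 0) there is
    # nothing to take a standard deviation over, so skip the resampling.
    if bootstrap_iters < 2:
        return mean, None
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = [
        statistics.fmean(rng.choices(values, k=n)) for _ in range(bootstrap_iters)
    ]
    return mean, statistics.stdev(resampled_means)


mean_only, no_stderr = mean_with_stderr([1.0, 0.0, 1.0, 1.0], bootstrap_iters=0)
mean_full, stderr = mean_with_stderr([1.0, 0.0, 1.0, 1.0], bootstrap_iters=100)
```

Returning `None` rather than `0.0` keeps "not computed" distinct from "computed and equal to zero".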
27 Feb, 2024 1 commit

Refactor `evaluator.evaluate` (#1441) · 5ccd65d4

Baber Abbasi authored Feb 27, 2024



* change `all_gather` to `gather`

* add TaskOutput utility class

* Add FilterResults class and refactor task handling.

* Rename `key` to `filter_key` for clarity

* Add `print_writeout` function in utils.py

* Add function to calculate limit size.

* Add doc_iterator method to Task class

* Refactor `doc_iterator` and cleanup in Task class

* remove superfluous bits

* change `all_gather` to `gather`

* bugfix

* bugfix

* fix `gather`

* Refactor `gather` loop

* Refactor aggregate metrics calculation

* Refactor and simplify aggregate metrics calculation
Removed unused code

* Simplify metrics calculation and remove unused code.

* simplify the metrics calculation in `utils.py` and `evaluator.py`.

* Fix group metric

* change evaluate to hf_evaluate

* change evaluate to hf_evaluate

* add docs

* add docs

* nits

* make isslice keyword only

* nit

* add todo

* nit

* nit

* nit: swap order samples_metrics tuple

* move instance sorting outside loop

* nit

* nit

* Add __repr__ for ConfigurableTask

* nit

* nit

* Revert "nit"

This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.

* fix some logging

* nit

* fix `predict_only` bug. thanks to `@LSinev`!

* change `print_tasks` to `prepare_print_tasks`

* nits

* move eval utils

* move eval utils

* nit

* add comment

* added tqdm descriptions

* Update lm_eval/evaluator_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix mgsm bug

* nit

* fix `build_all_requests`

* pre-commit

* add ceil to limit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

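The "Add function to calculate limit size" and "add ceil to limit" items can be pictured with a small sketch. It assumes, for illustration only, that a limit below 1.0 is a fraction of the documents and that larger values are absolute counts. The name `limit_size` and its signature are hypothetical.

```python
import math
from typing import Optional


def limit_size(num_docs: int, limit: Optional[float]) -> Optional[int]:
    """Hypothetical helper: turn a user-supplied limit into a document count."""
    if limit is None:
        # No limit requested: the caller evaluates every document.
        return None
    if limit < 1.0:
        # Fractional limit: round up so a small fraction never truncates
        # to zero documents.
        return int(math.ceil(num_docs * limit))
    # Absolute limit: use it as a count.
    return int(limit)


quarter = limit_size(10, 0.25)
absolute = limit_size(10, 5)
unlimited = limit_size(10, None)
```

Rounding up means that 25% of 10 documents yields 3 rather than 2, which is the kind of edge that a plain `int()` truncation would change.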