Commits · 40f5458f57f2acee1a2ba86aed16c8a37d99fffc · gaoqiong / lm-evaluation-harness

07 Jun, 2024 1 commit

Test output table layout consistency (#1916) · 40f5458f

Zafir Stojanovski authored Jun 07, 2024

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

40f5458f

31 May, 2024 1 commit

Add dataset card when pushing to HF hub (#1898) · f4f59251

KonradSzafer authored May 31, 2024



* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary args

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

f4f59251

30 May, 2024 1 commit

`higher_is_better` tickers in output table (#1893) · 14221c84

Zafir Stojanovski authored May 30, 2024



* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

14221c84

07 May, 2024 1 commit

Logging Updates (Alphabetize table printouts, fix eval tracker bug) (#1774) (#1791) · d4a913c4

Hailey Schoelkopf authored May 07, 2024

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables; fix eval tracker bug

* fix eval tracker bug

d4a913c4

03 May, 2024 1 commit

evaluation tracker implementation (#1766) · 59cf408a

KonradSzafer authored May 03, 2024

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

59cf408a

27 Feb, 2024 1 commit

Refactor `evaluator.evaluate` (#1441) · 5ccd65d4

Baber Abbasi authored Feb 27, 2024



* change `all_gather` to `gather`

* add TaskOutput utility class

* Add FilterResults class and refactor task handling.

* Rename `key` to `filter_key` for clarity

* Add `print_writeout` function in utils.py

* Add function to calculate limit size.

* Add doc_iterator method to Task class

* Refactor `doc_iterator` and cleanup in Task class

* remove superfluous bits

* change `all_gather` to `gather`

* bugfix

* bugfix

* fix `gather`

* Refactor `gather` loop

* Refactor aggregate metrics calculation

* Refactor and simplify aggregate metrics calculation
Removed unused code

* Simplify metrics calculation and remove unused code.

* simplify the metrics calculation in `utils.py` and `evaluator.py`.

* Fix group metric

* change evaluate to hf_evaluate

* change evaluate to hf_evaluate

* add docs

* add docs

* nits

* make isslice keyword only

* nit

* add todo

* nit

* nit

* nit: swap order samples_metrics tuple

* move instance sorting outside loop

* nit

* nit

* Add __repr__ for ConfigurableTask

* nit

* nit

* Revert "nit"

This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.

* fix some logging

* nit

* fix `predict_only` bug. thanks to `@LSinev`!

* change `print_tasks` to `prepare_print_tasks`

* nits

* move eval utils

* move eval utils

* nit

* add comment

* added tqdm descriptions

* Update lm_eval/evaluator_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix mgsm bug

* nit

* fix `build_all_requests`

* pre-commit

* add ceil to limit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5ccd65d4

26 Feb, 2024 1 commit

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

96d185fa

24 Feb, 2024 1 commit

Add environment and transformers version logging in results dump (#1464) · f78e2da4

LSinev authored Feb 24, 2024

* Save git_hash to results even if git is not available to call as subprocess

* Store more info about environment and transformers version in results to help researchers track inconsistencies

* moved added logging to logging_utils

* moved get_git_commit_hash to logging_utils.py

* moved add_env_info inside evaluator

f78e2da4

14 Feb, 2024 1 commit
- Refactor utilities into a separate model utils file. (#1429) · 2d0a6460
  Baber Abbasi authored Feb 14, 2024
  
  2d0a6460
01 Feb, 2024 1 commit

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* group in a list of group can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-base config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

d714fc95

11 Jan, 2024 1 commit

Fix bug in multi-token Stop Sequences (#1268) · ff739414

Hailey Schoelkopf authored Jan 11, 2024

* fix incorrect lookback protections

* bump generate_until task versions

ff739414

02 Jan, 2024 1 commit

batch_scheduler bug in Collator (#1229) · 4d10ad56

Baber Abbasi authored Jan 02, 2024

* auto-batch requires len of iter

* handle case when batch_size="auto:N"

4d10ad56

27 Dec, 2023 1 commit

nits + fix siqa (#1216) · 6a1c19ed

Baber Abbasi authored Dec 27, 2023

* fix group

* siqa: default.yml -> default.yaml

* max_gen_toks -> self.max_gen_toks

* add ids to task tests

* fix siqa

* fix gen_kwargs for openai-chat

6a1c19ed

24 Dec, 2023 1 commit
- fix: incorrect argument order in `utils.divide` doc (#1208) · e4970d81
  Yuliang Li authored Dec 24, 2023
  
  e4970d81
23 Dec, 2023 1 commit

Consolidate batching (#1197) · 9fb2ebab

Baber Abbasi authored Dec 23, 2023



* refactor dataloader

* cleanup + add docs

* change arg

* renamed Collator and added testing

* parametrized test for Collator

* appease pre-commit

* added edge case batch 0 (no batching)

* fix typos

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

9fb2ebab

22 Dec, 2023 1 commit

Generic decorator for handling rate limit errors (#1109) · 046ea6e2

Zach Schillaci authored Dec 21, 2023



* Add retry error handler

* fixup! Add retry error handler

* Move to utils.py

* Run isort on utils.py

* Catch multiple exceptions

* Update LMs with exception handler

* Fixes to anthropic retry handler

* fix callback kwarg

* Update textsynth.py

* fix python 3.8 incompatibility

* fix indenterror I introduced

* placate linter?

* Update on_exception_callback kwarg name

* fixup! Merge branch 'main' into add-retry-error-handler

* fixup! fixup! Merge branch 'main' into add-retry-error-handler

* Merge conflicts are fun

* Run pre-commit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

046ea6e2

20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

12 Dec, 2023 1 commit
- allow N/A as a possible value for Stderr · d6f8faad
  lintangsutawika authored Dec 12, 2023
  
  d6f8faad
29 Nov, 2023 1 commit
- fixed chunking · da2d160b
  baberabb authored Nov 29, 2023
  
  da2d160b
28 Nov, 2023 1 commit
- use all_headers for headers · 61d0f137
  lintangsutawika authored Nov 28, 2023
  
  61d0f137
21 Nov, 2023 1 commit
- update multi-token stopsequence handling · f7873a49
  haileyschoelkopf authored Nov 21, 2023
  
  f7873a49
20 Nov, 2023 1 commit
- add docs · b22f3440
  baberabb authored Nov 20, 2023
  
  b22f3440
17 Nov, 2023 1 commit
- edits and format · 10cc0a56
  lintangsutawika authored Nov 17, 2023
  
  10cc0a56
14 Nov, 2023 1 commit
- enables tasks of different group but same task_alias (for example, if... · 7760573f
  lintangsutawika authored Nov 14, 2023
```
enables tasks of different group but same task_alias (for example, if evaluating on different versions of MMLU)
```
  7760573f
13 Nov, 2023 1 commit
- num fewshot is printed in table · 81495b0a
  lintangsutawika authored Nov 13, 2023
  
  81495b0a
10 Nov, 2023 1 commit
- added initialize_task and updated where eval_logger is imported from · 056c9d85
  lintangsutawika authored Nov 10, 2023
  
  056c9d85
02 Nov, 2023 2 commits
- precommit format · 73f3029c
  lintangsutawika authored Nov 02, 2023
  
  73f3029c
- eval_logger is not imported from logger.py anymore · f701ba7d
  lintangsutawika authored Nov 02, 2023
  
  f701ba7d
11 Oct, 2023 1 commit
- add _batch_scheduler in greedy_until · 1aa3bc1e
  Zhiwei Zhuang authored Oct 11, 2023
  
  1aa3bc1e
03 Oct, 2023 1 commit
- Allow Generation arguments on greedy_until reqs · ec5846d7
  USVSN Sai Prashanth authored Oct 03, 2023
  
  ec5846d7
26 Sep, 2023 2 commits
- update loading prompts · b2d16321
  lintangsutawika authored Sep 26, 2023
  
  b2d16321
- moved files · 3f090027
  lintangsutawika authored Sep 26, 2023
  
  3f090027
22 Sep, 2023 2 commits
- trigger precommit · e4aa240e
  haileyschoelkopf authored Sep 22, 2023
  
  e4aa240e
- hotfix bugs · 43c91b5c
  haileyschoelkopf authored Sep 22, 2023
  
  43c91b5c
15 Sep, 2023 1 commit
- allow for trailing comma in model_args · 4ff8260d
  haileyschoelkopf authored Sep 15, 2023
  
  4ff8260d
14 Sep, 2023 1 commit
- remove omegaconf usage · 98747638
  haileyschoelkopf authored Sep 14, 2023
  
  98747638
12 Sep, 2023 2 commits
- changed term to groups · 6bba33db
  lintangsutawika authored Sep 12, 2023
  
  6bba33db
- changed term to groups · 4e4f0de2
  lintangsutawika authored Sep 12, 2023
  
  4e4f0de2
11 Sep, 2023 1 commit
- update · c87703f3
  lintangsutawika authored Sep 11, 2023
  
  c87703f3
07 Sep, 2023 1 commit
- normalize path · 4b711aa6
  lintangsutawika authored Sep 07, 2023
  
  4b711aa6