Commits · 84d02f77ced6843dfe7fde525315cb1089d32c19 · gaoqiong / lm-evaluation-harness

06 Jul, 2025 1 commit
- check pil dep (#3114) · ab3acc73
  Baber Abbasi authored Jul 06, 2025
  
  ab3acc73
05 Jul, 2025 1 commit

add image hashing and `LMEVAL_HASHMM` envar (#2973) · e69ca5ed

achervyakov authored Jul 05, 2025



* add image hashing

* remove unused params description

* use `LMEVAL_HASHMM` (default '1') to save raw images

---------
Co-authored-by: Baber <baber@hey.com>

e69ca5ed
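
A minimal sketch of how an environment switch like `LMEVAL_HASHMM` could gate image hashing. Only the variable name and its default of `'1'` come from the commit message. The helper names (`hashing_enabled`, `image_for_log`) and the convention that any value other than `'0'` enables hashing are assumptions for illustration, not the harness's actual code.

```python
import hashlib
import os


def hashing_enabled(environ=os.environ) -> bool:
    # Assumption: an unset variable means "1", and any value other than "0"
    # turns hashing on.
    return environ.get("LMEVAL_HASHMM", "1") != "0"


def image_for_log(image_bytes: bytes, environ=os.environ):
    # Hypothetical helper: log a stable SHA-256 digest instead of the raw
    # image when hashing is on; otherwise keep the raw bytes.
    if hashing_enabled(environ):
        return hashlib.sha256(image_bytes).hexdigest()
    return image_bytes
```

Passing the environment as a parameter keeps the sketch easy to exercise without mutating the real process environment.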

04 Jul, 2025 2 commits
- improve logging · dbe4c391
  Baber authored Jul 04, 2025
  
  dbe4c391
- fix logging · 442ce51a
  Baber authored Jul 04, 2025
  
  442ce51a
22 Apr, 2025 1 commit
- add separate eval_config class · c1e43393
  artemorloff authored Apr 23, 2025
  
  c1e43393
08 Apr, 2025 1 commit
- enable evaluation from yaml config file · b5d16d61
  artemorloff authored Apr 09, 2025
  
  b5d16d61
18 Mar, 2025 1 commit

Add loncxt tasks (#2629) · 80a10075

Baber Abbasi authored Mar 18, 2025

support for long-context (and other synthetic) tasks
* add ruler
* add longbench
* pass `metadata` to TaskConfig

80a10075

11 Mar, 2025 1 commit
- Use yaml.CLoader to load yaml files when available. (#2777) · 8cfa0d74
  Giulio Lovisotto authored Mar 11, 2025
  
  8cfa0d74
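
The usual pattern behind a change like this is to prefer PyYAML's C-accelerated loader and fall back to the pure-Python one. The sketch below shows that selection logic only. `pick_yaml_loader` is a hypothetical name, and the two `SimpleNamespace` objects are stand-ins for a PyYAML build with and without the C extension, so the snippet runs without PyYAML installed.

```python
from types import SimpleNamespace


def pick_yaml_loader(yaml_module):
    # CLoader only exists when PyYAML was built against libyaml;
    # fall back to the pure-Python Loader otherwise.
    return getattr(yaml_module, "CLoader", None) or yaml_module.Loader


# Stand-ins (not real PyYAML) for builds with and without the C extension.
with_c = SimpleNamespace(CLoader="CLoader", Loader="Loader")
without_c = SimpleNamespace(Loader="Loader")
```

With PyYAML installed, `yaml.load(text, Loader=pick_yaml_loader(yaml))` would then use the faster loader whenever it is present.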
21 Feb, 2025 1 commit

Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>

1ba35e62

19 Jan, 2025 1 commit
- update pre-commit (#2632) · f724be69
  Baber Abbasi authored Jan 19, 2025
```
* update pre-commit
```
  f724be69
20 Dec, 2024 1 commit
- Wandb step handling bugfix and feature (#2580) · b86aa213
  Sabrina J. Mielke authored Dec 20, 2024
  
  b86aa213
16 Dec, 2024 1 commit

batch `loglikelihood_rolling` across requests (#2559) · 0bfb0220

Baber Abbasi authored Dec 16, 2024

* batch all rolling token windows

* nit

* copy to vllm

* fix max_length for `get_rolling_token_windows`

* bugfix

* bugfix

* add type hints

0bfb0220

09 Aug, 2024 1 commit

keep new line for task description (#2116) · 8ad598df

Jungwhan Kim authored Aug 10, 2024



* add keep trailing newline

* apply ruff-format

* add prompt unit test

* increment the version of tasks that have description with whitespace

* remove white spaces of leaderboard bbh

* update MMLU expected versions in output

* CI run does display the expected version=1 for mmlu subtasks, fix expected test output again

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

8ad598df

01 Aug, 2024 1 commit

refactor: limit usage of `scipy` and `sklearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

7f15cce4

10 Jul, 2024 1 commit

Update utils.py (#2085) · 058cfd0e

Lintang Sutawika authored Jul 10, 2024

Group configs with no aggregation print an empty space as the score in the results table.
Example:
```
|    Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------------|-------|------|-----:|--------|---|-----:|---|-----:|
|group         |    N/A|      |      |        |   |      |   |      |
| - task 0     |Yaml   |none  |     0|acc     |↑  |0.4000|±  |0.0910|
| - task 1     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
| - task 2     |Yaml   |none  |     0|acc     |↑  |0.2667|±  |0.0821|
| - task 3     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
```

So the `v` variable in `make_table` needs to check whether the value is a float or a string.

058cfd0e
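
The float-or-string check described in this commit can be sketched as follows. `format_value` is a hypothetical helper, not the actual patch to `make_table`. The four-decimal format is taken from the example table above.

```python
def format_value(v) -> str:
    # Floats are rendered with four decimals, as in the table above.
    # Anything else (for example the blank placeholder of a group with no
    # aggregation) is passed through unchanged as a string.
    if isinstance(v, float):
        return "%.4f" % v
    return str(v)
```

Formatting a blank placeholder with `"%.4f"` would raise a `TypeError`, which is why the type is checked first.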

08 Jul, 2024 2 commits

Allow gating EvaluationTracker HF Hub results; customizability (#2051) · 563f7971

Nathan Habib authored Jul 08, 2024

* batch commit

* Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup eval results

* cleanup

* add check for gated repo

* fix jsonline issue

* fix

* add try catch when gating the details repo

* add doc

* adds back hub_repo_name

* readds hub repo name

563f7971

Group agg rework (#1741) · 517aadc4

Lintang Sutawika authored Jul 08, 2024



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

517aadc4

11 Jun, 2024 1 commit

Results filenames handling fix (#1926) · 69952581

KonradSzafer authored Jun 11, 2024

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

69952581

07 Jun, 2024 1 commit

Test output table layout consistency (#1916) · 40f5458f

Zafir Stojanovski authored Jun 07, 2024

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

40f5458f

31 May, 2024 1 commit

Add dataset card when pushing to HF hub (#1898) · f4f59251

KonradSzafer authored May 31, 2024



* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary args

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

f4f59251

30 May, 2024 1 commit

`higher_is_better` tickers in output table (#1893) · 14221c84

Zafir Stojanovski authored May 30, 2024



* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

14221c84
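
A small sketch of how a `higher_is_better` flag could be turned into the arrow column of the output table. `ticker` is a hypothetical name. The down arrow for lower-is-better metrics and the blank for an unknown direction are assumptions, since the commit log only shows `↑` in its example table.

```python
def ticker(higher_is_better):
    # Unknown direction (None) leaves the column blank.
    if higher_is_better is None:
        return ""
    # Assumption: "↑" marks higher-is-better, "↓" marks lower-is-better.
    return "↑" if higher_is_better else "↓"
```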

07 May, 2024 1 commit
- Logging Updates (Alphabetize table printouts, fix eval tracker bug) (#1774) (#1791) · d4a913c4
  Hailey Schoelkopf authored May 07, 2024
```
* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug
```
  d4a913c4
03 May, 2024 1 commit

evaluation tracker implementation (#1766) · 59cf408a

KonradSzafer authored May 03, 2024

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

59cf408a

27 Feb, 2024 1 commit

Refactor `evaluator.evaluate` (#1441) · 5ccd65d4

Baber Abbasi authored Feb 27, 2024



* change `all_gather` to `gather`

* add TaskOutput utility class

* Add FilterResults class and refactor task handling.

* Rename `key` to `filter_key` for clarity

* Add `print_writeout` function in utils.py

* Add function to calculate limit size.

* Add doc_iterator method to Task class

* Refactor `doc_iterator` and cleanup in Task class

* remove superfluous bits

* change `all_gather` to `gather`

* bugfix

* bugfix

* fix `gather`

* Refactor `gather` loop

* Refactor aggregate metrics calculation

* Refactor and simplify aggregate metrics calculation
Removed unused code

* Simplify metrics calculation and remove unused code.

* simplify the metrics calculation in `utils.py` and `evaluator.py`.

* Fix group metric

* change evaluate to hf_evaluate

* change evaluate to hf_evaluate

* add docs

* add docs

* nits

* make isslice keyword only

* nit

* add todo

* nit

* nit

* nit: swap order samples_metrics tuple

* move instance sorting outside loop

* nit

* nit

* Add __repr__ for ConfigurableTask

* nit

* nit

* Revert "nit"

This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.

* fix some logging

* nit

* fix `predict_only` bug. thanks to `@LSinev`!

* change `print_tasks` to `prepare_print_tasks`

* nits

* move eval utils

* move eval utils

* nit

* add comment

* added tqdm descriptions

* Update lm_eval/evaluator_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix mgsm bug

* nit

* fix `build_all_requests`

* pre-commit

* add ceil to limit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5ccd65d4
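
The bullets "Add function to calculate limit size" and "add ceil to limit" suggest a helper along these lines. This is a sketch under assumptions: `sample_size` is a hypothetical name, and reading a limit below 1.0 as a fraction of the documents is an assumption not stated in the commit message.

```python
import math


def sample_size(num_docs, limit):
    # No limit: evaluate every document.
    if limit is None:
        return num_docs
    # Assumption: a limit below 1.0 is a fraction of the documents,
    # rounded up so that a small fraction never yields zero samples.
    if limit < 1.0:
        return int(math.ceil(num_docs * limit))
    # Otherwise the limit is an absolute number of documents.
    return int(limit)
```

Rounding up rather than truncating is the point of the `ceil`: with 10 documents and a limit of 0.25, truncation would give 2 samples and `ceil` gives 3.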

26 Feb, 2024 1 commit

Cont metrics (#1475) · 96d185fa

Lintang Sutawika authored Feb 26, 2024



* add brier_score

* process brier_score

* brier score is working for N-sized class

* fixed brier score

* add TED to BigBench and Brier score to MMLU

* format

* Update metrics.py

* Update task.py

* Update generate_until_template_yaml

* Delete lm_eval/tasks/bigbench/aux_metric.py

* Update generate_until_template_yaml

* Update _default_template_yaml

* Update _generate_configs.py

* Update _generate_configs.py

* Update _generate_configs.py

* fix (format?)

* format?

* format, once more

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

96d185fa
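
For reference, the textbook multi-class Brier score (the mean over samples of the squared distance between the predicted probability vector and the one-hot gold label) can be sketched as below. Whether the harness normalizes the score in exactly this way is not stated in the commit message, so treat `brier_score` here as an illustration of the standard definition only.

```python
def brier_score(probs, labels):
    # probs: one probability vector per sample (any number of classes).
    # labels: the gold class index for each sample.
    total = 0.0
    for p, gold in zip(probs, labels):
        total += sum(
            (p_k - (1.0 if k == gold else 0.0)) ** 2 for k, p_k in enumerate(p)
        )
    return total / len(labels)
```

A perfect prediction scores 0.0 and a fully confident wrong prediction scores 2.0, so lower is better.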

24 Feb, 2024 1 commit

Add environment and transformers version logging in results dump (#1464) · f78e2da4

LSinev authored Feb 24, 2024

* Save git_hash to results even if git is not available to call as subprocess

* Store more info about environment and transformers version in results to help researchers track inconsistencies

* moved added logging to logging_utils

* moved get_git_commit_hash to logging_utils.py

* moved add_env_info inside evaluator

f78e2da4

14 Feb, 2024 1 commit
- Refactor utilities into a separate model utils file. (#1429) · 2d0a6460
  Baber Abbasi authored Feb 14, 2024
  
  2d0a6460
01 Feb, 2024 1 commit

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* group in a list of group can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-based config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

d714fc95
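
The `weight_by_size` option mentioned in this commit is about how subtask scores are combined into a group score. A minimal sketch of the two averaging modes, assuming the option switches between a size-weighted and an unweighted mean (`aggregate` and its signature are hypothetical, not the harness's API):

```python
def aggregate(scores, sizes, weight_by_size=True):
    # Unweighted: every subtask counts equally (macro-average).
    if not weight_by_size:
        return sum(scores) / len(scores)
    # Weighted: each subtask counts in proportion to its number of
    # documents (micro-average).
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
```

For two subtasks scoring 0.5 on 300 documents and 1.0 on 100 documents, the weighted mean is 0.625 and the unweighted mean is 0.75.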

11 Jan, 2024 1 commit
- Fix bug in multi-token Stop Sequences (#1268) · ff739414
  Hailey Schoelkopf authored Jan 11, 2024
```
* fix incorrect lookback protections

* bump generate_until task versions
```
  ff739414
02 Jan, 2024 1 commit
- batch_scheduler bug in Collator (#1229) · 4d10ad56
  Baber Abbasi authored Jan 02, 2024
```
* auto-batch requires len of iter

* handle case when batch_size="auto:N"
```
  4d10ad56
27 Dec, 2023 1 commit

nits + fix siqa (#1216) · 6a1c19ed

Baber Abbasi authored Dec 27, 2023

* fix group

* siqa: default.yml -> default.yaml

* max_gen_toks -> self.max_gen_toks

* add ids to task tests

* fix siqa

* fix gen_kwargs for openai-chat

6a1c19ed

24 Dec, 2023 1 commit
- fix: incorrect argument order in `utils.divide` doc (#1208) · e4970d81
  Yuliang Li authored Dec 24, 2023
  
  e4970d81
23 Dec, 2023 1 commit

Consolidate batching (#1197) · 9fb2ebab

Baber Abbasi authored Dec 23, 2023



* refactor dataloader

* cleanup + add docs

* change arg

* renamed Collator and added testing

* parametrized test for Collator

* appease pre-commit

* added edge case batch 0 (no batching)

* fix typos

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

9fb2ebab

22 Dec, 2023 1 commit

Generic decorator for handling rate limit errors (#1109) · 046ea6e2

Zach Schillaci authored Dec 21, 2023



* Add retry error handler

* fixup! Add retry error handler

* Move to utils.py

* Run isort on utils.py

* Catch multiple exceptions

* Update LMs with exception handler

* Fixes to anthropic retry handler

* fix callback kwarg

* Update textsynth.py

* fix python 3.8 incompatibility

* fix indenterror I introduced

* placate linter?

* Update on_exception_callback kwarg name

* fixup! Merge branch 'main' into add-retry-error-handler

* fixup! fixup! Merge branch 'main' into add-retry-error-handler

* Merge conflicts are fun

* Run pre-commit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

046ea6e2
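
A generic retry decorator of the kind this commit describes (several exception types, a callback on each failure) might look like the sketch below. The name `retry_on_exceptions`, its parameters and the exponential backoff are assumptions for illustration. Only the `on_exception_callback` keyword name appears in the commit message.

```python
import time
from functools import wraps


def retry_on_exceptions(on_exceptions, max_retries=3, backoff_time=0.0,
                        backoff_multiplier=2.0, on_exception_callback=None):
    """Retry the wrapped call when one of `on_exceptions` is raised."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            sleep_time = backoff_time
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except tuple(on_exceptions) as exc:
                    attempt += 1
                    if attempt > max_retries:
                        raise  # give up once max_retries retries are used
                    if on_exception_callback is not None:
                        on_exception_callback(exc, sleep_time)
                    time.sleep(sleep_time)
                    sleep_time *= backoff_multiplier

        return wrapper

    return decorator
```

Wrapping an API call with `@retry_on_exceptions([SomeRateLimitError], backoff_time=1.0)` would then retry after 1, 2 and 4 seconds before re-raising, where `SomeRateLimitError` stands for whatever exception the client library raises.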

20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

65b8761d

12 Dec, 2023 1 commit
- allow N/A as a possible value for Stderr · d6f8faad
  lintangsutawika authored Dec 12, 2023
  
  d6f8faad
29 Nov, 2023 1 commit
- fixed chunking · da2d160b
  baberabb authored Nov 29, 2023
  
  da2d160b
28 Nov, 2023 1 commit
- use all_headers for headers · 61d0f137
  lintangsutawika authored Nov 28, 2023
  
  61d0f137
21 Nov, 2023 1 commit
- update multi-token stopsequence handling · f7873a49
  haileyschoelkopf authored Nov 21, 2023
  
  f7873a49
20 Nov, 2023 1 commit
- add docs · b22f3440
  baberabb authored Nov 20, 2023
  
  b22f3440