Commits · 73202a2ea8d94cb7e58ce85935c0c2e05cd0b140 · gaoqiong / lm-evaluation-harness

25 Sep, 2025 24 commits
- fix process_results · 73202a2e
  Baber authored Sep 24, 2025
  
- update default values; fixes · b89af51e
  Baber authored Jul 10, 2025
  
- add subcommands · 61520ad6
  Baber authored Jul 04, 2025
  
- modularize cli · f9d5d3e7
  Baber authored Jul 03, 2025
  
- move test one doc to method · 7cef4d38
  Baber authored Jul 23, 2025
  
- overload Task methods if callable in yaml dict · ec767666
  Baber authored Jul 23, 2025
  
- remove deps; types · 4ad6cd9f
  Baber authored Jul 22, 2025
  
- make multiple_input explicit · 689e0c91
  Baber authored Jul 22, 2025
  
- `check_gold_index_error` util; fix `process_results`; rm generate_until multiple-choice · d9876b22
  Baber authored Jul 22, 2025
  
- improve metric aggregation default and higher-better checks; add `TaskConfig.from_template` · d19bd889
  Baber authored Jul 21, 2025
  
- cleanup · 69d14fb3
  Baber authored Jul 21, 2025
  
- move multi_target to `exact_match`; `nq_open` · 57b8c0b1
  Baber authored Jul 21, 2025
  
- ruff rules; types · 1768fd3b
  Baber authored Jul 19, 2025
  
- remove prompt-source for now · 70f5e2f0
  Baber authored Jul 18, 2025
  
- refactor: improve dataset and metric handling in TaskConfig · 227f1a74
  Baber authored Jul 08, 2025
  
- refactor: update type hints and improve filter ensemble construction · 3b4d0af1
  Baber authored Jul 08, 2025
  
- cleanup · c81c03ee
  Baber authored Jul 08, 2025
  
- refactor configs to files · 57adbd35
  Baber authored Jul 04, 2025
  
- cleanup · 04e74420
  Baber authored Jul 03, 2025
  
- add temlplateconfigs · b0173d57
  Baber authored Jul 01, 2025
  
- update type hints · bbf79d44
  Baber authored Jun 30, 2025
  
- add `sample_metric` and `is_elementwise` to MetricConfig · 7f7872c1
  Baber authored Jun 30, 2025
  
- add FewshotConfig · 9c647fc1
  Baber authored Jun 30, 2025
  
- add MetricConfig · 28c78d30
  Baber authored Jun 30, 2025
  
23 Jul, 2025 1 commit

Baber Abbasi authored Jul 23, 2025

* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path

904bba12

14 Jul, 2025 1 commit
- Fix for hang due to mp.Pool in bootstrap_stderr (#3135) · cf631de0
  Ankit Gola authored Jul 14, 2025
  
05 Jul, 2025 1 commit
- remove all; reformat table (#3107) · 28001d29
  Baber Abbasi authored Jul 05, 2025
  
04 Jul, 2025 1 commit

- [FIX] Initial code to disable multi-proc for stderr (#3106) · 71d0289d
  Neel Gupta authored Jul 04, 2025

  * [FIX] Initial code to disable multi-proc for stderr
  * add docs; align no-mp bootstrap with mp

  ---------
  Co-authored-by: Baber <baber@hey.com>

25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
03 Jun, 2025 1 commit
- [Fix] acc_mutual_info metric calculation bug (#3035) · 3f792954
  Baber Abbasi authored Jun 03, 2025
```
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests
```
21 May, 2025 1 commit
- Revert "feat: add question suffix (#2876)" (#3007) · 29ea6832
  Baber Abbasi authored May 21, 2025
```
This reverts commit 4dbd5ec9
```
19 May, 2025 1 commit
- [SGLANG] Add the SGLANG generate API (#2997) · 53c65300
  Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
15 May, 2025 1 commit
- feat: add question suffix (#2876) · 4dbd5ec9
  Tingchen Fu authored May 15, 2025
  
16 Apr, 2025 1 commit

- Longbench bugfix (#2895) · 930d8378
  Baber Abbasi authored Apr 17, 2025

  * add warning in for default until
  * fix stop tokens; add vcsum
  * bugfix:fix doc_to_target to string
  * fix lsht, trec
  * add task to readme
  * add debugging logs for multiple input/output

07 Apr, 2025 1 commit

- Add `--samples` Argument for Fine-Grained Task Evaluation in... · d693dcd2
  Felipe Maia Polo authored Apr 07, 2025

  Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)

  * added option --examples
  * specifying examples in dictionary
  * run pre-commit - fix arg type

  Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com

  * fixing bug when examples==None
  * fixing bug when examples==None
  * limit or examples must be None in simple_evaluate.py and in evaluator.py
  * run pre-commit (fix formatting)

  Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com

  * merge main and run pre-commit (fix formatting)

  Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com

  * Update __main__.py

    undefined "limit" and "examples"

  * update branch, fix conflicts, run pre-commit
  * nits
  * nits
  * change 'examples' to 'samples'

  ---------

  Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com
  Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
  Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
  Co-authored-by: Baber <baber@hey.com>

18 Mar, 2025 1 commit

- Add loncxt tasks (#2629) · 80a10075
  Baber Abbasi authored Mar 18, 2025

  suport for longcontext (and other synthetic tasks)

  * add ruler
  * add longbench
  * pass `metadata` to TaskConfig

14 Mar, 2025 1 commit

- add audio modality (qwen2 audio only) (#2689) · 62552d2c
  achervyakov authored Mar 14, 2025

  * Added audio-modality pipeline for qwen2-audio model
  * Beauty imports
  * fix apply_chat_template args
  * update default audio placeholders list
  * add demo task - common_voice subset
  * add audiolm_qwen libs to pyproject.toml
  * pre-commit beautify

  ---------
  Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>

11 Mar, 2025 1 commit

- New healthcare benchmark: careqa (#2714) · 7c9fbcf8
  PabloAgustin authored Mar 11, 2025

  * New healthcare benchmark: careqa
  * LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>
  * Add fixes, READMES, and remove task_list.txt
  * pre-commit passed, add formatting updates; add nanmean agg_metric
  * Fix import error.
  * Wrapped imports in try excepts
  * Wrapped imports in try excepts; also metrics to catch bert_score import error
  * Try except to catch ImportErrors as well
  * use np.nan
  * pre-commit

  ---------
  Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
  Co-authored-by: Baber <baber@hey.com>

04 Mar, 2025 1 commit
- add debug log (#2757) · 74332955
  Baber Abbasi authored Mar 04, 2025
  
21 Feb, 2025 1 commit

- Logging (#2203) · 1ba35e62
  Lintang Sutawika authored Feb 20, 2025

  * changed source of eval_logger
  * allow eval_logger to be set from args
  * removed verbosity arg from non-main methods
  * fix logging
  * pre-commit
  * set verbosity in eval logger
  * replace utils.eval_logger
  * fix logging in main
  * add logging to docs
  * add logging message
  * nit
  * add logging to docs
  * refactor setup_logging to utils

  ---------
  Co-authored-by: Baber <baber@hey.com>