Commits · 3fe4b022e64b93318f302259baadf09f811b9808 · gaoqiong / lm-evaluation-harness

08 Oct, 2025 1 commit
- fixup fewshots · 3fe4b022
  Baber authored Oct 08, 2025
  
04 Oct, 2025 1 commit

Baber Abbasi authored Oct 04, 2025



* overhaul `ContextSampler`

* refactor masakhapos

* move multi_target to `exact_match`

* remove doc_to_choice from `boolq-seq2seq`

* remove doc_to_choice in generation process_results

* Remove unused `doc_to_choice` and fix superglue whitespaces

* require multiple_inputs and multiple_targets to be explicitly set in taskconfig

* fix copa; better logging in task init

* fix doc_to_target to return int rather than str (deprecated)

* fix processing regression; recursively parse lists from template

* remove redundant jinja parsing logic

* remove promptsource

* for multiple_inputs use `doc_to_text: list[str]`

* Refactor `ContextSampler` `fewshot_context`

* fix multiple_input context

* fix `target_delimiter` with `gen_prefix`

* `doc_to_text` is list for multiple_inputs

* Refactor `count_bytes` and `count_words` methods to `@staticmethod`

* convert has_*(train/test/validation) to properties

* remove `multi_target` `generate_until`

* fix `doc_to_target`/`multiple_targets` handling; add tests

* rename `multi_target` to `multiple_targets`

* evaluate list when multiple targets

* allow doc_to_target to return list

* Remove gen_prefix space and add warning (#3239)

* Remove gen_prefix space and add warning

* fix null gen_prefix bug again

* use git tests

---------
Co-authored-by: Boaz Ben-Dov <bendboaz@gmail.com>

003e5852
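Two of the refactors listed in this commit, moving `count_bytes` and `count_words` to `@staticmethod` and turning the `has_*` checks into properties, can be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchTask`, its constructor, the property names, and the method bodies are hypothetical and are not taken from the harness's code.

```python
class SketchTask:
    """Hypothetical stand-in for a task class (not the real lm_eval Task)."""

    def __init__(self, splits):
        # Assumption: `splits` maps a split name ("train", "validation",
        # "test") to a list of documents.
        self._splits = dict(splits)

    # Properties are read as attributes (task.has_test_docs), not called.
    @property
    def has_training_docs(self) -> bool:
        return bool(self._splits.get("train"))

    @property
    def has_validation_docs(self) -> bool:
        return bool(self._splits.get("validation"))

    @property
    def has_test_docs(self) -> bool:
        return bool(self._splits.get("test"))

    # Static methods need neither an instance nor the class itself.
    @staticmethod
    def count_bytes(doc: str) -> int:
        # Number of bytes in the UTF-8 encoding of the document.
        return len(doc.encode("utf-8"))

    @staticmethod
    def count_words(doc: str) -> int:
        # Number of whitespace-separated tokens.
        return len(doc.split())
```

With this shape, callers write `task.has_training_docs` without parentheses and can call `SketchTask.count_words(text)` without constructing a task.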

25 Sep, 2025 21 commits
- refactor registry · 93b2ab37
  Baber authored Jul 27, 2025
  
- fix process_results · 73202a2e
  Baber authored Sep 24, 2025
  
- update default values; fixes · b89af51e
  Baber authored Jul 10, 2025
  
- move test one doc to method · 7cef4d38
  Baber authored Jul 23, 2025
  
- overload Task methods if callable in yaml dict · ec767666
  Baber authored Jul 23, 2025
  
- remove deps; types · 4ad6cd9f
  Baber authored Jul 22, 2025
  
- make multiple_input explicit · 689e0c91
  Baber authored Jul 22, 2025
  
- `check_gold_index_error` util; fix `process_results`; rm generate_until multiple-choice · d9876b22
  Baber authored Jul 22, 2025
  
- cleanup · 69d14fb3
  Baber authored Jul 21, 2025
  
- move multi_target to `exact_match`; `nq_open` · 57b8c0b1
  Baber authored Jul 21, 2025
  
- ruff rules; types · 1768fd3b
  Baber authored Jul 19, 2025
  
- remove prompt-source for now · 70f5e2f0
  Baber authored Jul 18, 2025
  
- refactor: improve dataset and metric handling in TaskConfig · 227f1a74
  Baber authored Jul 08, 2025
  
- refactor: update type hints and improve filter ensemble construction · 3b4d0af1
  Baber authored Jul 08, 2025
  
- refactor configs to files · 57adbd35
  Baber authored Jul 04, 2025
  
- cleanup · 04e74420
  Baber authored Jul 03, 2025
  
- add templateconfigs · b0173d57
  Baber authored Jul 01, 2025
  
- update type hints · bbf79d44
  Baber authored Jun 30, 2025
  
- add `sample_metric` and `is_elementwise` to MetricConfig · 7f7872c1
  Baber authored Jun 30, 2025
  
- add FewshotConfig · 9c647fc1
  Baber authored Jun 30, 2025
  
- add MetricConfig · 28c78d30
  Baber authored Jun 30, 2025
  
23 Jul, 2025 1 commit

Pin datasets < 4.0.0 (#3172) · 904bba12

Baber Abbasi authored Jul 23, 2025

* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path

904bba12

11 Jul, 2025 2 commits
- fix circular · 6fc2ac49
  Baber authored Jul 12, 2025
  
- refactor: migrate utils functions to lm_eval.tasks and update references · a9c16905
  Baber authored Jul 12, 2025
  
25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
03 Jun, 2025 1 commit
- [Fix] acc_mutual_info metric calculation bug (#3035) · 3f792954
  Baber Abbasi authored Jun 03, 2025
```
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests
```
21 May, 2025 1 commit
- Revert "feat: add question suffix (#2876)" (#3007) · 29ea6832
  Baber Abbasi authored May 21, 2025
```
This reverts commit 4dbd5ec9
```
19 May, 2025 1 commit
- [SGLANG] Add the SGLANG generate API (#2997) · 53c65300
  Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
15 May, 2025 1 commit
- feat: add question suffix (#2876) · 4dbd5ec9
  Tingchen Fu authored May 15, 2025
  
16 Apr, 2025 1 commit

Longbench bugfix (#2895) · 930d8378

Baber Abbasi authored Apr 17, 2025

* add warning for default until

* fix stop tokens; add vcsum

* bugfix: fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

930d8378

07 Apr, 2025 1 commit

Add `--samples` Argument for Fine-Grained Task Evaluation in... · d693dcd2

Felipe Maia Polo authored Apr 07, 2025


 Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)

* added option --examples

* specifying examples in dictionary

* run pre-commit - fix arg type

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* fixing bug when examples==None

* fixing bug when examples==None

* limit or examples must be None in simple_evaluate.py and in evaluator.py

* run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* merge main and run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* Update __main__.py

undefined "limit" and "examples"

* update branch, fix conflicts, run pre-commit

* nits

* nits

* change 'examples' to 'samples'

---------

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>

d693dcd2
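The commit message above notes that `limit` or `examples` (later renamed to `samples`) must be `None`. A minimal sketch of that kind of mutual-exclusion check is shown below. The helper name `check_limit_and_samples`, the error message, and the assumed shape of `samples` (a dictionary from task name to a list of document indices) are illustrative assumptions, not the code in `simple_evaluate.py` or `evaluator.py`.

```python
from typing import Dict, List, Optional, Union


def check_limit_and_samples(
    limit: Optional[Union[int, float]],
    samples: Optional[Dict[str, List[int]]],
) -> None:
    """Hypothetical helper: reject calls that set both options at once."""
    # Either option alone is fine, and so is leaving both unset.
    if limit is not None and samples is not None:
        raise ValueError("Set either `limit` or `samples`, not both.")


# Usage: each of these passes, because at most one option is set.
check_limit_and_samples(limit=10, samples=None)
check_limit_and_samples(limit=None, samples={"some_task": [0, 3, 7]})
check_limit_and_samples(limit=None, samples=None)
```

Failing early like this avoids having to define which option wins when both are given.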

18 Mar, 2025 1 commit

Add longcxt tasks (#2629) · 80a10075

Baber Abbasi authored Mar 18, 2025

support for long-context (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig

80a10075

14 Mar, 2025 1 commit

add audio modality (qwen2 audio only) (#2689) · 62552d2c

achervyakov authored Mar 14, 2025



* Added audio-modality pipeline for qwen2-audio model

* Beautify imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>

62552d2c

04 Mar, 2025 1 commit
- add debug log (#2757) · 74332955
  Baber Abbasi authored Mar 04, 2025
  
21 Feb, 2025 1 commit

Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>

1ba35e62

14 Feb, 2025 1 commit
- Update remaining references to assistant_prefill to gen_prefix (#2683) · ef6f5243
  Kiersten Stokes authored Feb 14, 2025
  
06 Feb, 2025 1 commit
- fix early return for multiple dict (#2673) · 144a1e58
  Baber Abbasi authored Feb 06, 2025
  
28 Jan, 2025 1 commit

fix multiple input chat template (#2576) · 96e499ba

Baber Abbasi authored Jan 28, 2025

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* handle chat_template for multiple input

96e499ba