Commits · 73202a2ea8d94cb7e58ce85935c0c2e05cd0b140 · gaoqiong / lm-evaluation-harness

25 Sep, 2025 20 commits
- fix process_results · 73202a2e
  Baber authored Sep 24, 2025
- update default values; fixes · b89af51e
  Baber authored Jul 10, 2025
- move test one doc to method · 7cef4d38
  Baber authored Jul 23, 2025
- overload Task methods if callable in yaml dict · ec767666
  Baber authored Jul 23, 2025
- remove deps; types · 4ad6cd9f
  Baber authored Jul 22, 2025
- make multiple_input explicit · 689e0c91
  Baber authored Jul 22, 2025
- `check_gold_index_error` util; fix `process_results`; rm generate_until multiple-choice · d9876b22
  Baber authored Jul 22, 2025
- cleanup · 69d14fb3
  Baber authored Jul 21, 2025
- move multi_target to `exact_match`; `nq_open` · 57b8c0b1
  Baber authored Jul 21, 2025
- ruff rules; types · 1768fd3b
  Baber authored Jul 19, 2025
- remove prompt-source for now · 70f5e2f0
  Baber authored Jul 18, 2025
- refactor: improve dataset and metric handling in TaskConfig · 227f1a74
  Baber authored Jul 08, 2025
- refactor: update type hints and improve filter ensemble construction · 3b4d0af1
  Baber authored Jul 08, 2025
- refactor configs to files · 57adbd35
  Baber authored Jul 04, 2025
- cleanup · 04e74420
  Baber authored Jul 03, 2025
- add templateconfigs · b0173d57
  Baber authored Jul 01, 2025
- update type hints · bbf79d44
  Baber authored Jun 30, 2025
- add `sample_metric` and `is_elementwise` to MetricConfig · 7f7872c1
  Baber authored Jun 30, 2025
- add FewshotConfig · 9c647fc1
  Baber authored Jun 30, 2025
- add MetricConfig · 28c78d30
  Baber authored Jun 30, 2025
23 Jul, 2025 1 commit

Baber Abbasi authored Jul 23, 2025

* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path

904bba12

25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
03 Jun, 2025 1 commit
- [Fix] acc_mutual_info metric calculation bug (#3035) · 3f792954
  Baber Abbasi authored Jun 03, 2025
```
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests
```
21 May, 2025 1 commit
- Revert "feat: add question suffix (#2876)" (#3007) · 29ea6832
  Baber Abbasi authored May 21, 2025
```
This reverts commit 4dbd5ec9
```
19 May, 2025 1 commit
- [SGLANG] Add the SGLANG generate API (#2997) · 53c65300
  Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
15 May, 2025 1 commit
- feat: add question suffix (#2876) · 4dbd5ec9
  Tingchen Fu authored May 15, 2025
16 Apr, 2025 1 commit

Longbench bugfix (#2895) · 930d8378

Baber Abbasi authored Apr 17, 2025

* add warning for default until

* fix stop tokens; add vcsum

* bugfix: fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

07 Apr, 2025 1 commit

Add `--samples` Argument for Fine-Grained Task Evaluation in... · d693dcd2

Felipe Maia Polo authored Apr 07, 2025


Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)

* added option --examples

* specifying examples in dictionary

* run pre-commit - fix arg type

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* fixing bug when examples==None

* fixing bug when examples==None

* limit or examples must be None in simple_evaluate.py and in evaluator.py

* run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* merge main and run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* Update __main__.py

undefined "limit" and "examples"

* update branch, fix conflicts, run pre-commit

* nits

* nits

* change 'examples' to 'samples'

---------

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>

18 Mar, 2025 1 commit

Add longcxt tasks (#2629) · 80a10075

Baber Abbasi authored Mar 18, 2025

support for longcontext (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig

14 Mar, 2025 1 commit

add audio modality (qwen2 audio only) (#2689) · 62552d2c

achervyakov authored Mar 14, 2025



* Added audio-modality pipeline for qwen2-audio model

* Beautify imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>

04 Mar, 2025 1 commit
- add debug log (#2757) · 74332955
  Baber Abbasi authored Mar 04, 2025
21 Feb, 2025 1 commit

Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>

14 Feb, 2025 1 commit
- Update remaining references to assistant_prefill to gen_prefix (#2683) · ef6f5243
  Kiersten Stokes authored Feb 14, 2025
06 Feb, 2025 1 commit
- fix early return for multiple dict (#2673) · 144a1e58
  Baber Abbasi authored Feb 06, 2025
28 Jan, 2025 1 commit

fix multiple input chat template (#2576) · 96e499ba

Baber Abbasi authored Jan 28, 2025

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* handle chat_template for multiple input

17 Jan, 2025 1 commit
- fix gen_prefix (#2630) · 9dda03d6
  Baber Abbasi authored Jan 17, 2025
```
* switch arg
```
15 Jan, 2025 2 commits

assistant prefill (#2615) · 703fbffd

Baber Abbasi authored Jan 15, 2025

* add assistant prefix

* add arc_challenge from llama

* nit

* nit

* nit

* add assistant prefix

* add mmlu_llama

* nit

* nit

* Revert "nit"

This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.

* fix regex bug

* add assistant_prefix to vllm

* add `Question:`

* add mmlu_pro

* add fewshot assistant_prefix

* use `assistant_prefill`

* typehints

* nits

* nits

* add to docs

* add readme

Add HumanEval (#1992) · 4c11206b

Hojin Lee authored Jan 16, 2025



* add custom filter

* fix type casting of references

* add humaneval

* fix a bug in humaneval

* add greedy version of humaneval

* update tasks README

* test humaneval

* return multiple metrics

* nit

* add confirmation to run code tasks

* nit

* nit

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

04 Jan, 2025 1 commit

some minor logging nits (#2609) · 888ac292

Baber Abbasi authored Jan 04, 2025

* remove yaml extension from phrases_va_common

* remove yaml extension from winogenerated

* remove yaml extension from phrases_es

* no cache debug logging when not used

29 Nov, 2024 1 commit
- skip casting if predict_only (#2524) · 9169899b
  Baber Abbasi authored Nov 29, 2024