Commits · 66736bc183a7aee5dba4179ca2ccf96c3a8ff736 · gaoqiong / lm-evaluation-harness

10 Jul, 2025 1 commit
- fixup · 66736bc1
  Baber authored Jul 10, 2025
  
08 Jul, 2025 2 commits
- refactor: improve dataset and metric handling in TaskConfig · fedaf262
  Baber authored Jul 08, 2025
  
- refactor: update type hints and improve filter ensemble construction · 863ff340
  Baber authored Jul 08, 2025
  
07 Jul, 2025 3 commits
- nit · 5efa7937
  Baber authored Jul 08, 2025
  
- nit · 646dec9e
  Baber authored Jul 08, 2025
  
- add docs · 0967905f
  Baber authored Jul 08, 2025
  
05 Jul, 2025 1 commit
- remove all; reformat table (#3107) · 28001d29
  Baber Abbasi authored Jul 05, 2025
  
04 Jul, 2025 4 commits
- [FIX] Initial code to disable multi-proc for stderr (#3106) · 71d0289d
  Neel Gupta authored Jul 04, 2025
```
* [FIX] Initial code to disable multi-proc for stderr

* add docs; align no-mp bootstrap with mp

---------
Co-authored-by: Baber <baber@hey.com>
```
- nit · b7b0f92e
  Baber authored Jul 04, 2025
  
- refactor configs to files · fb63ac0f
  Baber authored Jul 04, 2025
  
- nit · b0aca59b
  Baber authored Jul 04, 2025
  
03 Jul, 2025 1 commit
- type hints · db5dff9c
  Baber authored Jul 03, 2025
  
01 Jul, 2025 1 commit
- add docs · 49bfaf68
  Baber authored Jul 01, 2025
  
30 Jun, 2025 7 commits
- add templateconfigs · 15d07121
  Baber authored Jul 01, 2025
  
- update type hints · 3ba4e897
  Baber authored Jul 01, 2025
  
- update type hints · 9b192374
  Baber authored Jun 30, 2025
  
- add `sample_metric` and `is_elementwise` to MetricConfig · cb8dfe63
  Baber authored Jun 30, 2025
  
- add FewshotConfig · 108674ed
  Baber authored Jun 30, 2025
  
- nit · c5aa5cf0
  Baber authored Jun 30, 2025
  
- add MetricConfig · 1b5c6f88
  Baber authored Jun 30, 2025
  
25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
03 Jun, 2025 1 commit
- [Fix] acc_mutual_info metric calculation bug (#3035) · 3f792954
  Baber Abbasi authored Jun 03, 2025
```
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests
```
21 May, 2025 1 commit
- Revert "feat: add question suffix (#2876)" (#3007) · 29ea6832
  Baber Abbasi authored May 21, 2025
```
This reverts commit 4dbd5ec9
```
19 May, 2025 1 commit
- [SGLANG] Add the SGLANG generate API (#2997) · 53c65300
  Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
15 May, 2025 1 commit
- feat: add question suffix (#2876) · 4dbd5ec9
  Tingchen Fu authored May 15, 2025
  
16 Apr, 2025 1 commit

Baber Abbasi authored Apr 17, 2025

* add warning in for default until

* fix stop tokens; add vcsum

* bugfix:fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

930d8378

07 Apr, 2025 1 commit
- Add `--samples` Argument for Fine-Grained Task Evaluation in... · d693dcd2
  Felipe Maia Polo authored Apr 07, 2025

 Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)

* added option --examples

* specifying examples in dictionary

* run pre-commit - fix arg type

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* fixing bug when examples==None

* fixing bug when examples==None

* limit or examples must be None in simple_evaluate.py and in evaluator.py

* run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* merge main and run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* Update __main__.py

undefined "limit" and "examples"

* update branch, fix conflicts, run pre-commit

* nits

* nits

* change 'examples' to 'samples'

---------

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>


18 Mar, 2025 1 commit
- Add loncxt tasks (#2629) · 80a10075
  Baber Abbasi authored Mar 18, 2025

support for longcontext (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig


14 Mar, 2025 1 commit
- add audio modality (qwen2 audio only) (#2689) · 62552d2c
  achervyakov authored Mar 14, 2025

* Added audio-modality pipeline for qwen2-audio model

* Beauty imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>


11 Mar, 2025 1 commit
- New healthcare benchmark: careqa (#2714) · 7c9fbcf8
  PabloAgustin authored Mar 11, 2025

* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>


04 Mar, 2025 1 commit
- add debug log (#2757) · 74332955
  Baber Abbasi authored Mar 04, 2025
  
21 Feb, 2025 1 commit
- Logging (#2203) · 1ba35e62
  Lintang Sutawika authored Feb 20, 2025

* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>


14 Feb, 2025 2 commits
- `arithmetic`: set target delimiter to empty string (#2701) · 41b952f3
  Baber Abbasi authored Feb 14, 2025
```
* set target delimiter to empty string

* nit

* add warning
```
- Update remaining references to assistant_prefill to gen_prefix (#2683) · ef6f5243
  Kiersten Stokes authored Feb 14, 2025
  
06 Feb, 2025 1 commit
- fix early return for multiple dict (#2673) · 144a1e58
  Baber Abbasi authored Feb 06, 2025
  
28 Jan, 2025 1 commit
- fix multiple input chat template (#2576) · 96e499ba
  Baber Abbasi authored Jan 28, 2025

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* handle chat_template for multiple input


19 Jan, 2025 1 commit
- update pre-commit (#2632) · f724be69
  Baber Abbasi authored Jan 19, 2025
```
* update pre-commit
```
17 Jan, 2025 1 commit
- fix gen_prefix (#2630) · 9dda03d6
  Baber Abbasi authored Jan 17, 2025
```
* switch arg
```
15 Jan, 2025 2 commits
- assistant prefill (#2615) · 703fbffd
  Baber Abbasi authored Jan 15, 2025

* add assistant prefix

* add arc_challenge from llama

* nit

* nit

* nit

* add assistant prefix

* add mmlu_llama

* nit

* nit

* Revert "nit"

This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.

* fix regex bug

* add assistant_prefix to vllm

* add `Question:`

* add mmlu_pro

* add fewshot assistant_prefix

* use `assistant_prefill`

* typehints

* nits

* nits

* add to docs

* add readme


- Add HumanEval (#1992) · 4c11206b
  Hojin Lee authored Jan 16, 2025

* add custom filter

* fix type casting of references

* add humaneval

* fix a bug in humaneval

* add greedy version of humaneval

* update tasks README

* test humaneval

* return multiple metrics

* nit

* add confirmation to run code tasks

* nit

* nit

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
