Commits · d762e2aa6a63fd32a7f315b7fc62b4acb939198d · gaoqiong / lm-evaluation-harness

24 Jul, 2025 7 commits
- fix · d762e2aa
  Baber authored Jul 25, 2025
  
  d762e2aa
- Merge branch 'feature/eval_from_config' into metrics · 7a8203fa
  Baber authored Jul 25, 2025
```
# Conflicts:
#	lm_eval/__main__.py
#	lm_eval/utils.py
```
  7a8203fa
- Merge branch 'main' into metrics · e6b798f9
  Baber authored Jul 25, 2025
```
# Conflicts:
#	.pre-commit-config.yaml
#	lm_eval/api/task.py
#	lm_eval/models/huggingface.py
#	lm_eval/models/vllm_causallms.py
#	pyproject.toml
```
  e6b798f9
- Merge remote-tracking branch 'origin/metrics' into metrics · 14a29ade
  Baber authored Jul 24, 2025
  
  14a29ade
- vllm: remove device (#3181) · 4f8195f1
  Baber Abbasi authored Jul 24, 2025
  
  4f8195f1
- fix vllm test issue that call pop() from None (#3182) · 5f5f35e5
  weiliang authored Jul 24, 2025
  
  5f5f35e5
- types · 1f97a945
  Baber authored Jul 23, 2025
  
  1f97a945
23 Jul, 2025 11 commits
- remove trust-remote-code in configs; fix escape sequences (#3180) · 314f7176
  Baber Abbasi authored Jul 23, 2025
```
* remove trust-remote-code

* add W605 rule
```
  314f7176
- Remove "device" from vllm_causallms.py (#3176) · 8c6fde08
  Michael Goin authored Jul 23, 2025
```
Device has been a deprecated arg for a few releases of vLLM and is now removed in 0.10.0 https://github.com/vllm-project/vllm/pull/21349
```
  8c6fde08
- Pin datasets < 4.0.0 (#3172) · 904bba12
  Baber Abbasi authored Jul 23, 2025
```
* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path
```
  904bba12
- Added `chat_template_args` to pass additional kwargs to tokenizer.apply_chat_template (#3164) · 2eea3f50
  Avelina Asada Hadji-Kyriacou authored Jul 23, 2025
```
* added support for additional chat template arguments

* use `enable_thinking`

* add wrap logging function

* add `chat_template_args` back to HF

---------
Co-authored-by: Baber <baber@hey.com>
```
  2eea3f50
- types · 6d63c2ce
  Baber authored Jul 23, 2025
  
  6d63c2ce
- move test one doc to method · 0087929e
  Baber authored Jul 23, 2025
  
  0087929e
- overload Task methods if callable in yaml dict · f40e9b58
  Baber authored Jul 23, 2025
  
  f40e9b58
- update scrolls · b3131364
  Baber authored Jul 23, 2025
  
  b3131364
- nit · 0bad3ace
  Baber authored Jul 23, 2025
  
  0bad3ace
- sort pyproject · 43388406
  Baber authored Jul 23, 2025
  
  43388406
- remove deps · 7c853109
  Baber authored Jul 23, 2025
  
  7c853109
22 Jul, 2025 8 commits
- types · 87445e95
  Baber authored Jul 23, 2025
  
  87445e95
- types · 5fdeb436
  Baber authored Jul 22, 2025
  
  5fdeb436
- nit · f21a0b81
  Baber authored Jul 22, 2025
  
  f21a0b81
- type hints · f264f2e2
  Baber authored Jul 22, 2025
  
  f264f2e2
- make multiple_input explicit · 230352ce
  Baber authored Jul 22, 2025
  
  230352ce
- nit · 00048838
  Baber authored Jul 22, 2025
  
  00048838
- feat: Add LIBRA benchmark for long-context evaluation (#2943) · 091aaf6f
  Svetlana Karimova authored Jul 22, 2025
```
* Feat: add LIBRA benchmark

* Feat: add dataset filter to LIBRA

* Fix: formatting through pre-commit and main tasks README

* Fix: resolve conflict

* Fix: dataset name to real

* Fix: delete unnececcary datasets and correct dependency

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
```
  091aaf6f
- Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks (#3124) · 250a04ec
  Geun, Lim authored Jul 22, 2025
```
* Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks

* • Increased max_gen_toks to 2 048 (matches Appendix B of original paper).
• Added Evaluation Settings and Changelog sections.

* add some logs

---------
Co-authored-by: Baber <baber@hey.com>
```
  250a04ec
21 Jul, 2025 14 commits
- feat: implement check_gold_index_error utility and refactor process_results... · 55be51ea
  Baber authored Jul 22, 2025
```
feat: implement check_gold_index_error utility and refactor process_results for improved error handling. remove generate_until multiple-choice
```
  55be51ea
- feat: add TaskConfig.from_template method and enhance TemplateConfig with abstract methods · 16030317
  Baber authored Jul 22, 2025
  
  16030317
- refactor: improve default behavior for metric aggregation and higher-better checks · 897fbb37
  Baber authored Jul 21, 2025
  
  897fbb37
- type hints · 5c3badbe
  Baber authored Jul 21, 2025
  
  5c3badbe
- type hints · 17223113
  Baber authored Jul 21, 2025
  
  17223113
- type hints · 24b7e2d6
  Baber authored Jul 21, 2025
  
  24b7e2d6
- nit · 9f345f33
  Baber authored Jul 21, 2025
  
  9f345f33
- nit · e3fee7ea
  Baber authored Jul 21, 2025
  
  e3fee7ea
- Merge branch 'rm_multiple_target' into metrics · 3e3a0d8f
  Baber authored Jul 21, 2025
```
# Conflicts:
#	lm_eval/api/filter.py
#	lm_eval/api/metrics.py
#	lm_eval/api/task.py
#	lm_eval/filters/extraction.py
```
  3e3a0d8f
- type hints; · 00a77ebd
  Baber authored Jul 21, 2025
  
  00a77ebd
- nq_open: move multi_target to `exact_match` · 08c54c63
  Baber authored Jul 21, 2025
  
  08c54c63
- move multi_target to `exact_match` · 8f924e1c
  Baber authored Jul 21, 2025
  
  8f924e1c
- refactor masakhapos · 66e62de7
  Baber authored Jul 21, 2025
  
  66e62de7
- fix ruff · 2b4cdd41
  Baber authored Jul 21, 2025
  
  2b4cdd41