Commits · 1646596a443ecfb64a08ea5cb86f4e5793ad643f · gaoqiong / lm-evaluation-harness

08 Sep, 2025 1 commit
- update template · 1646596a
  Baber authored Sep 08, 2025
  
  1646596a
25 Aug, 2025 1 commit
- add todo · 29cb9ecb
  Baber authored Aug 25, 2025
  
  29cb9ecb
04 Aug, 2025 7 commits

- fix · 71cddf6b
  Baber authored Aug 04, 2025
  
  71cddf6b
- Merge branch 'main' into metrics · 7d6ec4d9
  Baber authored Aug 04, 2025
```
# Conflicts:
#	lm_eval/__init__.py
#	pyproject.toml
```
  7d6ec4d9
- Bump version to 0.4.9.1 (#3208) · d021bf84
  Baber Abbasi authored Aug 04, 2025
  
  d021bf84

- improve include-path precedence handling (#3068) · 3214d468
  parkhs21 authored Aug 04, 2025

  * improve include-path precedence handling

  * test: add task for test

  * add test for include path precedence handling

  * Refactor `test_include_path.py`

  ---------
  Co-authored-by: Baber <baber@hey.com>

  3214d468

- Update README.md for mlqa (#3117) · 584de690
  Matthias Neumayer authored Aug 04, 2025
```
The tasks are called without .yaml, just the task name
```
  584de690

- Fix humaneval_instruct (#3201) · edf3aa7a
  Idan Tene authored Aug 04, 2025

  * Update humaneval_64_instruct.yaml

    Sync doc_to_text with humaneval_instruct.yaml

  * Update humaneval_instruct.yaml

    Remove redundant (flawed) spaces

  * Update README.md

  * Bump task version

  edf3aa7a

- Fix `mmlu_continuation` subgroup names to fit Readme and other variants (#3137) · 06ba1d28
  Felix Michalak authored Aug 04, 2025

  * Update continuation group names to fit Readme

  * added changelog to readme and switched datasets from hails to cais

  * added missing new line at end of readme

  06ba1d28

02 Aug, 2025 1 commit

- Update vLLM compatibility (#3024) · bc811365
  Cyrus Leung authored Aug 03, 2025

  * Update vLLM compatibility

    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

  * add TokensPrompt to all generate calls

  ---------
  Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
  Co-authored-by: Baber <baber@hey.com>

  bc811365

30 Jul, 2025 1 commit
- nit · 1020c46e
  Baber authored Jul 30, 2025
  
  1020c46e
25 Jul, 2025 1 commit
- fix · e72ec96c
  Baber authored Jul 25, 2025
  
  e72ec96c
24 Jul, 2025 7 commits
- fix · d762e2aa
  Baber authored Jul 25, 2025
  
  d762e2aa
- Merge branch 'feature/eval_from_config' into metrics · 7a8203fa
  Baber authored Jul 25, 2025
```
# Conflicts:
#	lm_eval/__main__.py
#	lm_eval/utils.py
```
  7a8203fa
- Merge branch 'main' into metrics · e6b798f9
  Baber authored Jul 25, 2025
```
# Conflicts:
#	.pre-commit-config.yaml
#	lm_eval/api/task.py
#	lm_eval/models/huggingface.py
#	lm_eval/models/vllm_causallms.py
#	pyproject.toml
```
  e6b798f9
- Merge remote-tracking branch 'origin/metrics' into metrics · 14a29ade
  Baber authored Jul 24, 2025
  
  14a29ade
- vllm: remove device (#3181) · 4f8195f1
  Baber Abbasi authored Jul 24, 2025
  
  4f8195f1
- fix vllm test issue that calls pop() on None (#3182) · 5f5f35e5
  weiliang authored Jul 24, 2025
  
  5f5f35e5
- types · 1f97a945
  Baber authored Jul 23, 2025
  
  1f97a945
23 Jul, 2025 11 commits
- remove trust-remote-code in configs; fix escape sequences (#3180) · 314f7176
  Baber Abbasi authored Jul 23, 2025
```
* remove trust-remote-code

* add W605 rule
```
  314f7176
- Remove "device" from vllm_causallms.py (#3176) · 8c6fde08
  Michael Goin authored Jul 23, 2025
```
Device has been a deprecated arg for a few releases of vLLM and is now removed in 0.10.0 https://github.com/vllm-project/vllm/pull/21349
```
  8c6fde08
- Pin datasets < 4.0.0 (#3172) · 904bba12
  Baber Abbasi authored Jul 23, 2025
```
* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path
```
  904bba12
- Added `chat_template_args` to pass additional kwargs to tokenizer.apply_chat_template (#3164) · 2eea3f50
  Avelina Asada Hadji-Kyriacou authored Jul 23, 2025
```
* added support for additional chat template arguments

* use `enable_thinking`

* add wrap logging function

* add `chat_template_args` back to HF

---------
Co-authored-by: Baber <baber@hey.com>
```
  2eea3f50
- types · 6d63c2ce
  Baber authored Jul 23, 2025
  
  6d63c2ce
- move test one doc to method · 0087929e
  Baber authored Jul 23, 2025
  
  0087929e
- overload Task methods if callable in yaml dict · f40e9b58
  Baber authored Jul 23, 2025
  
  f40e9b58
- update scrolls · b3131364
  Baber authored Jul 23, 2025
  
  b3131364
- nit · 0bad3ace
  Baber authored Jul 23, 2025
  
  0bad3ace
- sort pyproject · 43388406
  Baber authored Jul 23, 2025
  
  43388406
- remove deps · 7c853109
  Baber authored Jul 23, 2025
  
  7c853109
22 Jul, 2025 8 commits

- types · 87445e95
  Baber authored Jul 23, 2025
  
  87445e95
- types · 5fdeb436
  Baber authored Jul 22, 2025
  
  5fdeb436
- nit · f21a0b81
  Baber authored Jul 22, 2025
  
  f21a0b81
- type hints · f264f2e2
  Baber authored Jul 22, 2025
  
  f264f2e2
- make multiple_input explicit · 230352ce
  Baber authored Jul 22, 2025
  
  230352ce
- nit · 00048838
  Baber authored Jul 22, 2025
  
  00048838

- feat: Add LIBRA benchmark for long-context evaluation (#2943) · 091aaf6f
  Svetlana Karimova authored Jul 22, 2025

  * Feat: add LIBRA benchmark

  * Feat: add dataset filter to LIBRA

  * Fix: formatting through pre-commit and main tasks README

  * Fix: resolve conflict

  * Fix: dataset name to real

  * Fix: delete unnecessary datasets and correct dependency

  ---------
  Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

  091aaf6f

- Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks (#3124) · 250a04ec
  Geun, Lim authored Jul 22, 2025

  * Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks

  * - Increased max_gen_toks to 2048 (matches Appendix B of original paper).
    - Added Evaluation Settings and Changelog sections.

  * add some logs

  ---------
  Co-authored-by: Baber <baber@hey.com>

  250a04ec

21 Jul, 2025 2 commits
- feat: implement check_gold_index_error utility and refactor process_results... · 55be51ea
  Baber authored Jul 22, 2025
```
feat: implement check_gold_index_error utility and refactor process_results for improved error handling. remove generate_until multiple-choice
```
  55be51ea
- feat: add TaskConfig.from_template method and enhance TemplateConfig with abstract methods · 16030317
  Baber authored Jul 22, 2025
  
  16030317