- 11 Aug, 2025 13 commits
- 08 Aug, 2025 1 commit
  - Avelina Asada Hadji-Kyriacou authored
    * Update afridiacritics_yaml
    * Update afrisenti
    * Update nollysenti
    * Update ntrex
    * Update salt
- 04 Aug, 2025 5 commits
  - Baber Abbasi authored
  - parkhs21 authored
    * improve include-path precedence handling
    * test: add task for test
    * add test for include path precedence handling
    * Refactor `test_include_path.py`
    Co-authored-by: Baber <baber@hey.com>
  - Matthias Neumayer authored
    Tasks are called with just the task name, without the .yaml extension.
  - Idan Tene authored
    * Update humaneval_64_instruct.yaml: sync doc_to_text with humaneval_instruct.yaml
    * Update humaneval_instruct.yaml: remove redundant (flawed) spaces
    * Update README.md
    * Bump task version
  - Felix Michalak authored
    * Update continuation group names to fit the README
    * added changelog to README and switched datasets from hails to cais
    * added missing newline at end of README
- 02 Aug, 2025 1 commit
  - Cyrus Leung authored
    * Update vLLM compatibility
    * add TokensPrompt to all generate calls
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Co-authored-by: Baber <baber@hey.com>
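    A minimal sketch of what passing pre-tokenized input via `TokensPrompt` looks like in recent vLLM versions; the model name and token ids below are placeholders, not taken from this commit:

    ```python
    # Hedged sketch: wrap already-encoded token ids in TokensPrompt so that
    # generate() accepts them directly instead of a raw prompt string.
    from vllm import LLM, SamplingParams, TokensPrompt

    llm = LLM(model="facebook/opt-125m")  # placeholder model
    params = SamplingParams(temperature=0.0, max_tokens=32)

    prompt = TokensPrompt(prompt_token_ids=[1, 3087, 15, 42])  # placeholder ids
    outputs = llm.generate(prompt, params)
    print(outputs[0].outputs[0].text)
    ```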
- 24 Jul, 2025 2 commits
  - Baber Abbasi authored
  - weiliang authored
- 23 Jul, 2025 4 commits
  - Baber Abbasi authored
    * remove trust-remote-code
    * add W605 rule
  - Michael Goin authored
    `device` has been a deprecated argument for several vLLM releases and is now removed in 0.10.0: https://github.com/vllm-project/vllm/pull/21349
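    As a hedged illustration (not code from this commit), the fix amounts to no longer forwarding a `device` keyword when constructing the engine on vLLM >= 0.10.0:

    ```python
    # Hedged sketch: vLLM >= 0.10.0 removed the long-deprecated `device` argument,
    # so the engine is built without it and device placement is inferred automatically.
    from vllm import LLM

    # Older vLLM accepted LLM(model=..., device="cuda"); passing `device` now fails.
    llm = LLM(model="facebook/opt-125m")  # placeholder model; note: no `device` kwarg
    ```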
  - Baber Abbasi authored
    * Fix: pin datasets < 4.0
    * fix
    * update type hints in HF
    * fix hellaswag path
  - Avelina Asada Hadji-Kyriacou authored
    * added support for additional chat template arguments
    * use `enable_thinking`
    * add wrap logging function
    * add `chat_template_args` back to HF
    Co-authored-by: Baber <baber@hey.com>
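    A rough sketch of the Hugging Face mechanism such extra chat-template arguments feed into, assuming a tokenizer whose template understands `enable_thinking` (e.g. Qwen3); the harness-side plumbing is not shown and the model name is a placeholder:

    ```python
    # Hedged sketch: extra kwargs passed to apply_chat_template are exposed to the
    # chat template, so templates like Qwen3's can toggle thinking mode via enable_thinking.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # placeholder model
    messages = [{"role": "user", "content": "What is 2 + 2?"}]
    prompt = tok.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # template-specific; ignored by templates that don't use it
    )
    print(prompt)
    ```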
- 22 Jul, 2025 2 commits
  - Svetlana Karimova authored
    * Feat: add LIBRA benchmark
    * Feat: add dataset filter to LIBRA
    * Fix: formatting through pre-commit and main tasks README
    * Fix: resolve conflict
    * Fix: use the real dataset name
    * Fix: delete unnecessary datasets and correct dependency
    Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
  - Geun, Lim authored
    * Fix: extended max_gen_toks to 8192 for HRM8K math benchmarks
    * Increased max_gen_toks to 2048 (matches Appendix B of the original paper)
    * Added Evaluation Settings and Changelog sections
    * add some logs
    Co-authored-by: Baber <baber@hey.com>
- 19 Jul, 2025 4 commits
  - Avelina Asada Hadji-Kyriacou authored
    * Added missing fixture in test_unitxt_tasks.py
    * pacify pre-commit
    Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
  - Baber Abbasi authored
  - James A. Michaelov authored
    * add multiblimp
    * run linter
  - Avelina Asada Hadji-Kyriacou authored
    * Update default.yaml
- 18 Jul, 2025 3 commits
  - Ramiro R. C. authored
    * added headers and custom model name | fixed bug with trust_remote_code param
    * linting
    * removed custom model name | changed headers override
    * add `header` to base TemplateAPI
    * nit
    Co-authored-by: Baber <baber@hey.com>
  - mans authored
    * fix request hanging when calling the API
    * pre-commit
    Co-authored-by: qinyidao <qinyidao@moonshot.cn>
  - Idan Tene authored
    * Update utils.py
- 16 Jul, 2025 2 commits
  - philipdoldo authored
    * Removed the "Let's think step by step." text from the start of the target entry in each sample, so the phrase is not repeated twice in the few-shot prompts and to match the behavior of the original BBH repository. This applied to 26 of the 27 subtasks; the only one unaffected is boolean_expressions.yaml.
    * Note on boolean_expressions.yaml: the prompt arguably has an error in that the "Remember that (i) ..." text does not follow the final "A: Let's think step by step.". Models like EleutherAI/gpt-neo-125m tend to begin their answers with this string anyway (copying the few-shot prompts), and it arguably should have been part of the prompt, much like "A: Let's think step by step." is included in the prompt for all of the CoT tasks. However, the original BBH repo has the same issue, so it is kept this way for consistency.
    * feat: remove extra space from answers; add changelog
    Co-authored-by: Baber <baber@hey.com>
  - Baber Abbasi authored
    * feat: add postprocessing for generated text to strip stop sequences and thinking tokens
    * nit
    * fix: trim leading whitespace after stripping thinking tokens from generation
    * feat: add think_end_token to model_args
    * nit
    * add to readme
    * nit
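    As a hedged sketch (not the harness's actual implementation), the described postprocessing boils down to dropping everything up to a configurable think-end token and then trimming the leading whitespace that remains:

    ```python
    # Hedged sketch: remove the reasoning prefix up to and including the
    # think-end token, then strip the leading whitespace left behind.
    def strip_thinking(text: str, think_end_token: str = "</think>") -> str:
        _, sep, rest = text.partition(think_end_token)
        return rest.lstrip() if sep else text


    print(strip_thinking("<think>reasoning...</think>\n\nThe answer is 4."))
    # -> "The answer is 4."
    ```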
- 15 Jul, 2025 1 commit
  - MaYongQing authored
- 14 Jul, 2025 2 commits
  - Ankit Gola authored
  - Avelina Asada Hadji-Kyriacou authored