  1. 19 Jul, 2025 3 commits
  2. 18 Jul, 2025 3 commits
  3. 16 Jul, 2025 2 commits
    • `bbh_cot_fewshot`: Removed repeated "Let's think step by step." text from bbh cot prompts (#3140) · c2be7211
      philipdoldo authored
      
      
      * Removed the "Let's think step by step." text from the start of the target entry in each of the samples, to prevent the phrase from appearing twice in the few-shot prompts and to match the behavior of the original bbh repository (see the sketch after this commit). Note that this applied to only 26 of the 27 subtasks; the one exception is boolean_expressions.yaml. For boolean_expressions.yaml there is arguably a separate error: the prompt does not include the 'Remember that (i) ...' text after the final "A: Let's think step by step.". Models such as EleutherAI/gpt-neo-125m tend to begin their answers with this string anyway (copying the few-shot examples), but it really should have been part of the prompt, much as "A: Let's think step by step." is included in the prompt for all of the cot tasks. However, the original bbh repo has the same issue, so it is kept this way for consistency; it is just worth pointing out.
      
      * feat: remove extra space from answers; add changelog
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
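      A minimal sketch of the cleanup this commit describes, assuming the few-shot examples live in per-subtask YAML files with a samples list of target strings; the glob pattern and field names ("samples", "target") are assumptions for illustration, not the harness's actual schema:
      ```python
      # Hypothetical sketch only: file layout and field names are illustrative,
      # not lm-evaluation-harness's real task schema.
      import glob

      import yaml

      PREFIX = "Let's think step by step."

      for path in glob.glob("bbh_cot_fewshot/*.yaml"):
          with open(path) as f:
              task = yaml.safe_load(f)
          for sample in task.get("samples", []):
              target = sample.get("target", "")
              if target.startswith(PREFIX):
                  # The prompt template already ends each question with
                  # "A: Let's think step by step.", so the few-shot target
                  # must not repeat the phrase.
                  sample["target"] = target[len(PREFIX):].lstrip()
          with open(path, "w") as f:
              yaml.safe_dump(task, f, sort_keys=False)
      ```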
    • truncate thinking tags in generations (#3145) · 51ede33c
      Baber Abbasi authored
      * feat: add postprocessing for generated text to strip stop sequences and thinking tokens (see the sketch after this commit)
      
      * nit
      
      * fix: trim leading whitespace after stripping thinking tokens from generation
      
      * feat: add think_end_token to model_args
      
      * nit
      
      * nit
      
      * nit
      
      * add to readme
      
      * nit
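      A minimal sketch of the post-processing described above, assuming a "</think>"-style end token; the function name, signature, and default token are illustrative, not the harness's actual API:
      ```python
      # Illustrative sketch, not the harness's real implementation.
      def postprocess_generation(text, stop_sequences, think_end_token="</think>"):
          # Drop everything up to and including the last thinking block, then trim
          # the leading whitespace that usually follows it.
          if think_end_token and think_end_token in text:
              text = text.split(think_end_token)[-1].lstrip()
          # Truncate at the earliest stop sequence that appears in what remains.
          for stop in stop_sequences:
              if stop and stop in text:
                  text = text[: text.index(stop)]
          return text

      print(postprocess_generation("<think>scratch work</think>\n 42\n\nQ: next", ["\n\nQ:"]))  # -> "42"
      ```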
  4. 15 Jul, 2025 1 commit
  5. 14 Jul, 2025 3 commits
  6. 10 Jul, 2025 2 commits
  7. 06 Jul, 2025 3 commits
  8. 05 Jul, 2025 4 commits
  9. 04 Jul, 2025 1 commit
  10. 03 Jul, 2025 4 commits
  11. 30 Jun, 2025 2 commits
    • FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435
      jinze authored
      * Fix: Align the Humaneval dataset with official results
      
      Details: (1) Modified "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file to match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".
      
      (2) Changed r.rfind("```") to r.find("```") so extraction locates the first "```" rather than the last one (see the sketch after this commit).
      
      Results: Partially reproduced the official numbers: LLaMA3.1-8B-Instruct scores 66.5 (official result: 72.6) and LLaMA3.1-70B-Instruct scores 80.5 (official result: 80.5).
      
      Ref: PR#2650
      
      * add changelog and version
      
      * add changelog
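      The rfind-to-find change matters because chat-style completions often contain more than one "```" fence; rfind points at the last fence, which can sit after the generated code. A toy illustration, not the harness's actual extraction filter:
      ```python
      # Toy example only; not the harness's actual code-extraction filter.
      r = (
          "```python\n"
          "def add(a, b):\n"
          "    return a + b\n"
          "```\n"
          "Extra commentary that happens to include another ``` fence."
      )

      start = r.find("```")                  # first fence: where the generated code begins
      # r.rfind("```") would return the index of the stray trailing fence instead,
      # so a slice anchored there would miss the code entirely.
      after_open = r.index("\n", start) + 1  # skip the "```python" line
      end = r.find("```", after_open)        # closing fence of the code block
      print(r[after_open:end])               # prints the extracted function
      ```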
    • [HF] fix quantization config (#3039) · fea4d11d
      Baber Abbasi authored
      * Try fixing issue 3026, which is caused by the quantization_config argument introduced in commit 758c5ed8.
      
      The argument is a dict, but for a GPTQ-quantized model this conflicts with the Hugging Face interface, which expects a QuantizationConfigMixin.
      The initial solution removes the quantization_config argument in HFLM._create_model() of lm_eval/models/huggingface.py.
      Further modification is required to restore the functionality provided by the previous commit.
      
      * wrap quantization_config in AutoQuantizationConfig
      
      * handle quantization config not dict
      
      * wrap quantization_config in AutoQuantizationConfig if dict (see the sketch after this commit)
      
      ---------
      Co-authored-by: shanhx2000 <hs359@duke.edu>
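      A sketch of the approach the final commit describes (wrap the dict only when it is a dict); the import path reflects recent transformers releases and should be treated as an assumption:
      ```python
      # Sketch of "wrap quantization_config in AutoQuantizationConfig if dict".
      # Import path is an assumption based on recent transformers releases.
      from transformers.quantizers import AutoQuantizationConfig


      def resolve_quantization_config(quantization_config):
          # GPTQ (and similar) configs may arrive as plain dicts, but the Hugging Face
          # loading path expects a QuantizationConfigMixin instance, so convert if needed.
          if isinstance(quantization_config, dict):
              return AutoQuantizationConfig.from_dict(quantization_config)
          return quantization_config
      ```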
  12. 25 Jun, 2025 3 commits
  13. 23 Jun, 2025 1 commit
    • Fix Anthropic API compatibility issues in chat completions (#3054) · 8bc46207
      NourFahmy authored
      
      
      * Fix Anthropic API compatibility issues in chat completions
      
      Solves two compatibility issues between the LM Eval Harness and Anthropic's API (see the sketch after this commit):
      
      1) The type field issue: Anthropic's Messages API does not accept the type field, previously included, that other APIs may expect.
      2) The stop sequences issue: Anthropic requires stop sequences to contain non-whitespace characters.
      
      Tested with the most recent Anthropic models (claude-sonnet-4-0, claude-opus-4-0); this resolved my local API errors.
      
      * pacify pre-commit
      
      * add type
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
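      An illustrative sketch of the two fixes described above; the function name and exactly where the offending "type" field sits in the payload are assumptions for illustration, not the harness's actual client code:
      ```python
      # Illustrative only: sanitize an OpenAI-style chat request for Anthropic's Messages API.
      def sanitize_for_anthropic(messages, stop_sequences):
          # 1) Drop the "type" key the payload previously carried, which Anthropic rejects.
          cleaned_messages = [{k: v for k, v in m.items() if k != "type"} for m in messages]
          # 2) Keep only stop sequences containing at least one non-whitespace character,
          #    since Anthropic rejects whitespace-only stop sequences.
          cleaned_stops = [s for s in stop_sequences if s.strip()]
          return cleaned_messages, cleaned_stops


      msgs, stops = sanitize_for_anthropic(
          [{"role": "user", "content": "Hello", "type": "text"}],
          ["\n\n", "Q:"],
      )
      print(msgs)   # [{'role': 'user', 'content': 'Hello'}]
      print(stops)  # ['Q:']
      ```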
  14. 20 Jun, 2025 1 commit
  15. 19 Jun, 2025 3 commits
  16. 16 Jun, 2025 2 commits
  17. 12 Jun, 2025 1 commit
  18. 08 Jun, 2025 1 commit
    • [longbench] fix metric calculation (#2983) · 147e9d61
      Baber Abbasi authored
      * use all answers
      
      * use middle truncation (see the sketch after this commit)
      
      * maybe fix classification score
      
      * strip classification preds
      
      * [vllm] remove stop tokens post-hoc
      
      * strip all preds
      
      * pacify pre-commit
      
      * start on truncation utility
      
      * add to readme
      
      * add a footgun doc
      
      * fix newline in yaml templates
      
      * do not strip code_sim preds!
      
      * fix pre-commit config
      
      * fix instruction warning
      
      * add note to longbench readme
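      Two sketches of the fixes named above: "use all answers" scores a prediction against every gold answer and keeps the best match, and "use middle truncation" keeps the head and tail of an over-long prompt while dropping the middle. Names and signatures are illustrative, not the harness's API:
      ```python
      # Illustrative sketches, not the harness's actual implementation.

      def best_score(pred, answers, score_fn):
          # "use all answers": compare against every gold answer and keep the best score.
          return max(score_fn(pred, a) for a in answers)


      def truncate_middle(tokens, max_len):
          # "use middle truncation": if the tokenized prompt is too long, keep the first
          # and last portions of the budget and drop the middle, where the least
          # task-relevant context usually sits.
          if len(tokens) <= max_len:
              return tokens
          half = max_len // 2
          return tokens[:half] + tokens[len(tokens) - (max_len - half):]


      print(truncate_middle(list(range(10)), 6))  # [0, 1, 2, 7, 8, 9]
      ```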