Commits · repeats · gaoqiong / lm-evaluation-harness

04 Aug, 2025 1 commit
- type hints · 312374bc
  Baber authored Jul 07, 2025
  
  312374bc
07 Jul, 2025 5 commits
- rename · 90cf3b89
  Baber authored Jul 07, 2025
  
  90cf3b89
- move agg metrics to after gather · c407be5b
  Baber authored Jul 07, 2025
  
  c407be5b
- nit · b6ae8c4a
  Baber authored Jul 07, 2025
  
  b6ae8c4a
- move metric calculation to task · 9093b1a6
  Baber authored Jul 07, 2025
  
  9093b1a6
- test · 51ab86ff
  Baber authored Jul 07, 2025
  
  51ab86ff
05 Jul, 2025 15 commits
- remove other schemas. work on metrics · 5b8a7506
  Baber authored Jun 23, 2025
  
  5b8a7506
- refactor: streamline metric calculations and enhance logging · ba1d4483
  Baber authored May 25, 2025
  
  ba1d4483
- refactor: enhance metric handling and aggregation logic · 57b91fdb
  Baber authored May 24, 2025
  
  57b91fdb
- TODO! · 911cae22
  Baber authored May 23, 2025
  
  911cae22
- fix filters and metrics · 69e95b87
  Baber authored May 23, 2025
  
  69e95b87
- nit · 72f5a5df
  Baber authored May 22, 2025
  
  72f5a5df
- add todo · c23161f9
  Baber authored May 19, 2025
  
  c23161f9
- rename · ee014926
  Baber authored May 19, 2025
  
  ee014926
- add typehint · 6139a5de
  Baber authored May 19, 2025
  
  6139a5de
- refactor filter hf to use new output classes · 3285f030
  Baber authored May 19, 2025
  
  3285f030
- add classes to inputs/outputs · 451e73f1
  Baber authored May 19, 2025
  
  451e73f1
- add metric calulation method to configurable task · e30978c7
  Baber authored May 19, 2025
  
  e30978c7
- fix metric logging · e9eb451e
  Baber authored May 19, 2025
  
  e9eb451e
- modify evaluator metrics to calcualte each repeat · acf454b7
  Baber authored May 14, 2025
  
  acf454b7
- remove all; reformat table (#3107) · 28001d29
  Baber Abbasi authored Jul 05, 2025
  
  28001d29
04 Jul, 2025 1 commit

[FIX] Initial code to disable multi-proc for stderr (#3106) · 71d0289d

Neel Gupta authored Jul 04, 2025



* [FIX] Initial code to disable multi-proc for stderr

* add docs; align no-mp bootstrap with mp

---------
Co-authored-by: Baber <baber@hey.com>

71d0289d

03 Jul, 2025 4 commits

Bugfix/hf tokenizer gguf override (#3098) · ff41a856

Ankush authored Jul 03, 2025

* fix(hf-gguf): skip gguf_file if external tokenizer is provided

* docs(readme): add instructions for evaluating GGUF models with Hugging Face backend

ff41a856

Humaneval - fix regression (#3102) · 8c1016cb
Baber Abbasi authored Jul 03, 2025
```
* use double quotes
```
8c1016cb

Fix: Reduce CLI loading time from 2.2s to 0.05s (#3099) · 944d32b4

Alex Stachowiak authored Jul 03, 2025



* Lazy-load submodules to reduce import time

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

944d32b4

Truthfulqa multi harness (#3062) · e0dc33ae

Blanca Calvo authored Jul 03, 2025



* truthfulqa-multi task

* truthfulqa-multi with chat few-shot

* few shot chat implementation

* changed until so it outputs lists

* changed dataset location

* added MT task

* Create README.md

* do not include MT

* changes for PR

* tag change

* removed yaml extension

* adding task to the table

* fix task configs

* add import exception

---------
Co-authored-by: Baber <baber@hey.com>

e0dc33ae

30 Jun, 2025 2 commits

FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435

jinze authored Jul 01, 2025

* Fix: Align the Humaneval dataset with official results

Details:(1) modified the "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file to make them the same as the Prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".

(2) Change r.rfind("```") to r.find("```"), so it can locate the first "```", not the last one.

Results: Partially reproduced the official results: The result of LLaMA3.1-8B-Instruct is 66.5 (the official result is 72.6), and the result of LLaMA3.1-70B-Instruct is 80.5 (the official result is 80.5).

Ref: PR#2650

* add changelog and version

* add changelog

a7ca0435

[HF] fix quantization config (#3039) · fea4d11d

Baber Abbasi authored Jun 30, 2025

* Try fixing issue 3026 which is caused by the quantization_config argument introduced in Commit 758c5ed8

.
The argument is in Dict type, but for a GPTQ quantized model, it has a conflict with the huggingface interface which expects QuantizationConfigMixin type.
Current solution is removing quantization_config argument in HFLM._create_model() of lm_eval/models/huggingface.py.
Require further modification to restore the functionality provided by the previous commit.

* wrap quantization_config in AutoQuantizationConfig

* handle quantization config not dict

* wrap quantization_config in AutoQuantizationConfig if dict

---------
Co-authored-by: shanhx2000 <hs359@duke.edu>

fea4d11d

25 Jun, 2025 3 commits
- feat / fix: Properly make use of `subfolder` from HF models (#3072) · 6b3f3f7e
  Younes B authored Jun 25, 2025
```
* add subfolder

* lint

* change it to empty string

* fix typehints

---------
Co-authored-by: Baber <baber@hey.com>
```
  6b3f3f7e
- remove system message if `TemplateError` (#3076) · 0f63d4f5
  Baber Abbasi authored Jun 25, 2025
  
  0f63d4f5
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  532909c0
23 Jun, 2025 1 commit

Fix Anthropic API compatibility issues in chat completions (#3054) · 8bc46207

NourFahmy authored Jun 23, 2025



* Fix Anthropic API compatibility issues in chat completions

solves two important compatibility issues between the LM Eval Harness and Anthropic's API:

1) The type field issue - Anthropic's Messages API doesn't accept the type field that other APIs might expect, that was previously included
2) The stop sequences issue - Anthropic requires stop sequences to contain non-whitespace characters

tested with most recent models from anthopic; claude-sonnet-4-0, claude-opus-4-0, resolved my local api errors

* pacufy pre-commit

* add type

---------
Co-authored-by: Baber <baber@hey.com>

8bc46207

20 Jun, 2025 1 commit

llama3 task: update README.md (#3074) · 68c3a811

Anna Fontana authored Jun 20, 2025

"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).

68c3a811

19 Jun, 2025 3 commits
- bump version to `0.4.9` (#3073) · 45274951
  Baber Abbasi authored Jun 19, 2025
  
  45274951
- Update instructions.py (#3060) · 37357004
  Maxim Evtush authored Jun 19, 2025
  
  37357004
- Update README.md (#3070) · 5a15058e
  Anna Fontana authored Jun 19, 2025
```
Wrong task name: mmlu_generation doesn't non exist -> mmlu_generative is the correct one
```
  5a15058e
16 Jun, 2025 2 commits
- fix longbech citation (#3061) · 9fbe48c2
  Baber Abbasi authored Jun 16, 2025
```
* fix longbech citation
```
  9fbe48c2
- Fix Typo in README and Comment in utils_mcq.py (#3057) · e20ef72e
  fuder.eth authored Jun 16, 2025
```
* Update README.md

* Update utils_mcq.py
```
  e20ef72e
12 Jun, 2025 1 commit
- Fallback to super impl in fewshot_context for Unitxt tasks (#3023) · d09e03dd
  Kiersten Stokes authored Jun 12, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  d09e03dd
08 Jun, 2025 1 commit

[longbench] fix metric calculation (#2983) · 147e9d61

Baber Abbasi authored Jun 08, 2025

* use all answers

* use middle truncation

* maybe fix classification score

* strip classification preds

* [vllm] remove stop tokens post-hoc

* strip all preds

* pacify pre-commit

* start on truncation utility

* add to readme

* add a footgun doc

* fix newline in yaml templates

* do not strip code_sim preds!

* fix pre-commit config

* fix instruction warning

* add not to longbench readme

147e9d61