Commits · reason · gaoqiong / lm-evaluation-harness

27 Jun, 2025 3 commits
- re-arrange imports · 34ae15c6
  Baber authored Jun 28, 2025
  
  34ae15c6
- fix filter type hint · 65425258
  Baber authored Jun 28, 2025
  
  65425258
- strip thinking · 708b160d
  Baber authored Jun 28, 2025
  
  708b160d
26 Jun, 2025 2 commits
- only use for generate_until tasks · 35be7100
  Baber authored Jun 27, 2025
  
  35be7100
- add strip_reasoning param · ea8dfbe8
  Baber authored Jun 27, 2025
  
  ea8dfbe8
25 Jun, 2025 3 commits
- feat / fix: Properly make use of `subfolder` from HF models (#3072) · 6b3f3f7e
  Younes B authored Jun 25, 2025
```
* add subfolder

* lint

* change it to empty string

* fix typehints

---------
Co-authored-by: Baber <baber@hey.com>
```
  6b3f3f7e
- remove system message if `TemplateError` (#3076) · 0f63d4f5
  Baber Abbasi authored Jun 25, 2025
  
  0f63d4f5
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  532909c0
23 Jun, 2025 1 commit

Fix Anthropic API compatibility issues in chat completions (#3054) · 8bc46207

NourFahmy authored Jun 23, 2025



* Fix Anthropic API compatibility issues in chat completions

solves two important compatibility issues between the LM Eval Harness and Anthropic's API:

1) The type field issue - Anthropic's Messages API doesn't accept the type field that other APIs might expect, that was previously included
2) The stop sequences issue - Anthropic requires stop sequences to contain non-whitespace characters

tested with most recent models from anthopic; claude-sonnet-4-0, claude-opus-4-0, resolved my local api errors

* pacufy pre-commit

* add type

---------
Co-authored-by: Baber <baber@hey.com>

8bc46207

20 Jun, 2025 1 commit

llama3 task: update README.md (#3074) · 68c3a811

Anna Fontana authored Jun 20, 2025

"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).

68c3a811

19 Jun, 2025 3 commits
- bump version to `0.4.9` (#3073) · 45274951
  Baber Abbasi authored Jun 19, 2025
  
  45274951
- Update instructions.py (#3060) · 37357004
  Maxim Evtush authored Jun 19, 2025
  
  37357004
- Update README.md (#3070) · 5a15058e
  Anna Fontana authored Jun 19, 2025
```
Wrong task name: mmlu_generation doesn't non exist -> mmlu_generative is the correct one
```
  5a15058e
16 Jun, 2025 2 commits
- fix longbech citation (#3061) · 9fbe48c2
  Baber Abbasi authored Jun 16, 2025
```
* fix longbech citation
```
  9fbe48c2
- Fix Typo in README and Comment in utils_mcq.py (#3057) · e20ef72e
  fuder.eth authored Jun 16, 2025
```
* Update README.md

* Update utils_mcq.py
```
  e20ef72e
12 Jun, 2025 1 commit
- Fallback to super impl in fewshot_context for Unitxt tasks (#3023) · d09e03dd
  Kiersten Stokes authored Jun 12, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  d09e03dd
08 Jun, 2025 1 commit

[longbench] fix metric calculation (#2983) · 147e9d61

Baber Abbasi authored Jun 08, 2025

* use all answers

* use middle truncation

* maybe fix classification score

* strip classification preds

* [vllm] remove stop tokens post-hoc

* strip all preds

* pacify pre-commit

* start on truncation utility

* add to readme

* add a footgun doc

* fix newline in yaml templates

* do not strip code_sim preds!

* fix pre-commit config

* fix instruction warning

* add not to longbench readme

147e9d61

03 Jun, 2025 4 commits

remove prints (#3041) · 9f152e0b
Baber Abbasi authored Jun 03, 2025

9f152e0b

add Mbpp instruct (#2995) · 60e85da5

Baber Abbasi authored Jun 03, 2025

* feat: add mbpp_instruct

* fix: update generation_kwargs to use an empty until list

* fix: correct predictions formatting in pass_at_1 function

* fix: improve code block extraction by checking first without opening backticks

* fix mbpp `pass_at_1`

60e85da5

fix: fix vllm issue with DP>1 (#3025) · d57e3d65
Younes B authored Jun 03, 2025

d57e3d65
[Fix] acc_mutual_info metric calculation bug (#3035) · 3f792954
Baber Abbasi authored Jun 03, 2025
```
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests
```
3f792954

02 Jun, 2025 2 commits
- Enable text-only evals for VLM models (#2999) · 82a99365
  Yury Sulsky authored Jun 02, 2025
  
  82a99365
- chore: clean up and extend .gitignore rules (#3030) · 9d29ef0e
  Ivan Stankevich authored Jun 02, 2025
```
* chore: clean up and extend .gitignore rules

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
```
  9d29ef0e
26 May, 2025 2 commits

add arab_culture task (#3006) · 8bc4afff
Boda Sadallah authored May 26, 2025
```
* add arab_culture tasks

* add target_delimeter and remove debugging code
```
8bc4afff

[vllm] data parallel for V1 (#3011) · 5a481f43

Baber Abbasi authored May 26, 2025

* add data_parallel for V1

* use Process instead of Queue

* ray used if V0 DP

* better error handling

* fix truncation warning comparison

5a481f43

23 May, 2025 2 commits

Fix error due in Collating queries with different continuation lengths (fixes #2984) (#2987) · 7aaceeec

Ameya Godbole authored May 22, 2025



* FIX error due to grouping queries with different continuation length

Make Collator choose query with the longest continuation as the
candidate for generation

* use max for key selection

* added comments explaining variable cont length (identical ctx+cont[:-1])

---------
Co-authored-by: Baber <baber@hey.com>

7aaceeec

[Fix] Update `resolve_hf_chat_template` arguments (#2992) · 357d4eaa
fxmarty-amd authored May 23, 2025
```
* fix arguments

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
```
357d4eaa

22 May, 2025 1 commit
- change multimodal check in evaluate (#3013) · e1a7a39c
  Baber Abbasi authored May 22, 2025
```
changed multimodal check from strict equality
```
  e1a7a39c
21 May, 2025 6 commits

Revert "feat: add question suffix (#2876)" (#3007) · 29ea6832
Baber Abbasi authored May 21, 2025
```
This reverts commit 4dbd5ec9
```
29ea6832

Adding resize images support (#2958) · 143a7fe0

achervyakov authored May 21, 2025



* first version of image resizing

* fixed bug

* clean up `resize_image`

---------
Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: Baber <baber@hey.com>

143a7fe0

use images with api models (#2981) · 2cfdd0a2
Baber Abbasi authored May 21, 2025
```
* use images with apis

* pacify pre-commit
```
2cfdd0a2

Output path fix (#2993) · 178fa84d

Niccolò Ajroldi authored May 21, 2025



* fix(output_path): support direct JSON file paths

* fix linting

* turn off external Lm tests for now

* Update help text for `output_path`

---------
Co-authored-by: Baber <baber@hey.com>

178fa84d

add kbl 2025 (#3000) · 8be417a8
Hongseok Oh authored May 21, 2025

8be417a8
Log tokenized request warning only once (#3002) · 07e5348c
Rob Geada authored May 21, 2025
```
* Log tokenized request warning only once

* Fix logging for concurrent usecase as well
```
07e5348c

19 May, 2025 3 commits
- fix github parse error (#2998) · 81fc0826
  Baber Abbasi authored May 19, 2025
  
  81fc0826
- [SGLANG] Add the SGLANG generate API (#2997) · 53c65300
  Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
  53c65300
- Adding ACPBench Hard tasks (#2980) · 0daf28fd
  Harsha authored May 19, 2025
```
* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper
```
  0daf28fd
17 May, 2025 1 commit

Delete scripts/cost_estimate.py (#2985) · 86c266a1

Stella Biderman authored May 17, 2025

This function was written years ago when the cost of running an OpenAI model was easy to compute. It is no longer viable to support this.

86c266a1

15 May, 2025 2 commits
- fix formatting (#2759) · 0126f6d1
  Baber Abbasi authored May 15, 2025
  
  0126f6d1
- Add device arg to model_args passed to LLM object in VLLM model class (#2879) · 96966f53
  Filippo Momentè authored May 15, 2025
```
* fix: pass device arg in model_ar in vllm_causallms

* casting device arg to str in vLLM model args
```
  96966f53