Commits · gemini_ · gaoqiong / lm-evaluation-harness

08 Oct, 2025 3 commits
- add support for `GeminiOpenAI` API · dba2ee3e
  Baber authored Oct 08, 2025
  
- add `parse_generations` to OpenAIChatCompletion · a7362d8b
  Baber authored Oct 08, 2025
  
- add `max_thinking_tokens` for anthropic · 3e28eed1
  Baber authored Oct 08, 2025
  
03 Oct, 2025 1 commit
- Update CODEOWNERS · c0fc7172
  Stella Biderman authored Oct 03, 2025
  
02 Oct, 2025 3 commits
- unpin datasets; update pre-commit (#3316) · 705bedd0
  Baber Abbasi authored Oct 02, 2025
```
* update pre-commit

* unpin datasets
```
- fix: sp, req order (#3303) · a1404f06
  Vineeth authored Oct 02, 2025
  
- add math and longbench to test dependencies (#3321) · 8616a284
  Janna authored Oct 02, 2025
  
22 Sep, 2025 1 commit

Add eqbench tasks in Spanish and Catalan (#3168) · de496b80

priverabsc authored Sep 22, 2025

* Add eqbench tasks in Spanish and Catalan

* Incremented catalan_bench and spanish_bench versions. Added 'multilingual' folder inside 'eq_bench' and moved the eqbench_ca and eqbench_es .yaml to that folder. Updated the tasks README with eqbench_es and eqbench_ca, stating explicitly in each description both the Hugging Face link and the translation method.

* Fixed tasks table.

* remove test_task.sh and results folder

* Add utils.py to multilingual folder


21 Sep, 2025 6 commits

Add humaneval_infilling task (#3299) · a4752ccd

its-alpesh authored Sep 21, 2025



* Add humaneval_infilling task

* pacify pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>


Add AIME to task description (#3296) · 6b8ec144
Janna authored Sep 20, 2025
```
* register aime

* lint

---------
Co-authored-by: Baber <baber@hey.com>
```

Add BabiLong (#3287) · ccfa4ad1

Janna authored Sep 20, 2025

* create babilong tasks

* lint

* add clarification

* fix typo

* add babilong description


feat: Add mmlu-redux and its Spanish translation as generative task definitions (#2705) · fec9dde7

Luis Cosio authored Sep 20, 2025



* Added benchmark

* Added more testing

* Added task definition for mmlu_redux and mmlu_redux_spanish

* Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs

* Add remaining MMLU Redux YAMLs and updated tasks README

* Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs

* Add MMLU Redux changes from pr-2705

* Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names

* Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes

* Revert Python test changes and comment out one task group to avoid Hugging Face rate limit and task failure

---------
Co-authored-by: CT-6282 <ricardo.godric@hotmail.com>


add xpu support HFLM (#3211) · 368275f3
kaixuanliu authored Sep 21, 2025
```
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
```

Fix LongBench Evaluation (#3273) · 7f698a5a

Timur Aysin authored Sep 21, 2025



* fix: set 'do_sample=False' and use double quotes in 'doc_to_text'

* feat: update versions and README for longbench

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>


12 Sep, 2025 1 commit
- add quote to type hints (#3292) · 0c134ee9
  fxmarty-amd authored Sep 12, 2025
  
08 Sep, 2025 3 commits

Ignore seed when splitting batch in chunks with groupby (#3047) · 44398478

Slim Frikha authored Sep 09, 2025



* feat(vllm_causallms): make collator ignore seed when splitting batch into chunks

* fix(collator): revert PR changes

* fix(vllm-causallm): update collator call with groupby None

* feat(sglang-causallms): make generation accept a list of sampling params

---------
Co-authored-by: Baber <baber@hey.com>


Add the Icelandic WinoGrande benchmark (#3277) · 4f1e9f7c
James A. Michaelov authored Sep 08, 2025
```
* add icelandic_winogrande

* fix spacing for final words in sentence
```
Add support for steering specific attention heads (#3279) · a46180bf
Lucia Quirke authored Sep 08, 2025


02 Sep, 2025 4 commits
- Add EsBBQ and CaBBQ tasks (#3167) · 2d7cb5c3
  Valle Ruiz-Fernández authored Sep 02, 2025
```
* Add EsBBQ and CaBBQ tasks

* Linter fixes

* add esbbq and cabbq to task list

---------
Co-authored-by: Júlia Falcão <juliafsfalcao@hotmail.com>
```
- Add `acc_norm` metric to ZhoBLiMP (#3271) · ecebf1bd
  James A. Michaelov authored Sep 02, 2025
  
- Add `acc_norm` to BLiMP-NL (#3272) · aff14e50
  James A. Michaelov authored Sep 02, 2025
  
- Add BHS benchmark (#3265) · 331288bb
  James A. Michaelov authored Sep 02, 2025
```
* run linter

* add acc_norm
```
27 Aug, 2025 3 commits
- Fix codexglue (#3238) · 84aa9f95
  Gül Sena A authored Aug 27, 2025
```
* Fix codex-glue/code2text group issue

* Added README

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
```
- pacify pre-commit (#3268) · 3a9bcc3f
  Baber Abbasi authored Aug 27, 2025
  
- feat(scrolls): delete chat_template from kwargs (#3267) · a35eb973
  Slim Frikha authored Aug 27, 2025
  
26 Aug, 2025 1 commit

Support for AIME dataset (#3248) · 5ac7cdf8

Janna authored Aug 26, 2025

* add AIME tasks

* standardize the repeats

* fix task naming

* aime25 only has test set

* edit readme

* add utils

* standardize

* fix case sensitivity

* repeat once

* lint

* more linting

* lint huggingface.py


25 Aug, 2025 4 commits

Update MMLU-ProX task (#3174) · 0b45cc71
Weihao XUAN authored Aug 26, 2025
```
* update MMLU_ProX

* update MMLU_ProX

* cleanup code by pre-commit
```
Add support for OpenVINO text2text generation models (#3101) · 05b37f20
Nikita Savelyev authored Aug 25, 2025
```
* Add support for OVModelForSeq2SeqLM

* Add test
```

Adds Anthropic/discrim-eval to lm-evaluation-harness (#3091) · dddfe7ec

William Held authored Aug 25, 2025



* Anthropic Discrim Eval

* Mixed Effects Regression

* Actually wire it all up

* Operator Name Doesn't Exist on GitHub

* Update lm_eval/tasks/discrim_eval/discrim_eval_implicit.yaml
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* Update discrim_eval_implicit.yaml

* Update discrim_eval_explicit.yaml

* pacify pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>


feat: Add CLIcK task (#3173) · bb433af7

Geun, Lim authored Aug 25, 2025

* feat: Add CLIcK task

* Fix formatting issues

* Add Click Task Description

* fix: lint

* fix


23 Aug, 2025 1 commit

update `minerva_math` (#3259) · 18d2face

Baber Abbasi authored Aug 24, 2025

* update math_verify

* remove normalization

* use full solution in `parse`

* update version


22 Aug, 2025 1 commit
- fix unknown group key to tag (#3222) · 358bfa37
  Patrick Haller authored Aug 22, 2025
```
Co-authored-by: Patrick Haller <phmaker@Patricks-MacBook-Pro.local>
```
21 Aug, 2025 8 commits
- Add LM-SynEval Benchmark (#3184) · 938a4fb3
  James A. Michaelov authored Aug 21, 2025
```
* add lm_syneval

* edit readme

* update task readme

* formatting fixes

* run linting

* add descriptions and examples

* clean readme formatting
```
- Add TurBLiMP (#3219) · d355eac0
  James A. Michaelov authored Aug 21, 2025
```
* add turblimp

* update general task readme

* add normalized accuracy
```
- Add BLiMP-NL (#3221) · b0040ba0
  James A. Michaelov authored Aug 21, 2025
```
* add blimp_nl

* add template yaml file
```
- Add ZhoBLiMP benchmark (#3218) · 1bd96448
  James A. Michaelov authored Aug 21, 2025
```
* add zhoblimp files

* correct group name

* fix group

* add normalized accuracy
```
- add xnli_va dataset to catalan_bench (#3194) · 51d8a192
  FranValero97 authored Aug 21, 2025
  
- Adding support for OpenAI GPT-5 model; Models only support hardcoded... · 30885632
  Kurt Yang authored Aug 21, 2025
```
Adding support for OpenAI GPT-5 model; Models only support hardcoded temperature=1 and stop=None (#3247)
```
- Update utils.py (#3246) · a4fd524f
  Anri Lombard authored Aug 21, 2025
  
- remove incomplete compilation instructions (#3242) · 98c1880f
  Jafar Isbarov authored Aug 21, 2025
  