Commits · de496b80d60c267a2d7eea3b3c1dc40f693daee7 · gaoqiong / lm-evaluation-harness

22 Sep, 2025 1 commit

Add eqbench tasks in Spanish and Catalan (#3168) · de496b80

priverabsc authored Sep 22, 2025

* Add eqbench tasks in Spanish and Catalan

* Incremented catalan_bench and spanish_bench versions. Added 'multilingual' folder inside 'eq_bench' and moved the eqbench_ca and eqbench_es .yaml files to that folder. Updated the tasks README with eqbench_es and eqbench_ca, stating in each description both the Hugging Face link and the translation method.

* Fixed tasks table.

* remove test_task.sh and results folder

* Add utils.py to multilingual folder

21 Sep, 2025 5 commits

Add humaneval_infilling task (#3299) · a4752ccd

its-alpesh authored Sep 21, 2025



* Add humaneval_infilling task

* pacify pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

Add AIME to task description (#3296) · 6b8ec144
Janna authored Sep 20, 2025
```
* register aime

* lint

---------
Co-authored-by: Baber <baber@hey.com>
```

Add BabiLong (#3287) · ccfa4ad1

Janna authored Sep 20, 2025

* create babilong tasks

* lint

* add clarification

* fix typo

* add babilong description

feat: Add mmlu-redux and its Spanish translation as generative task definitions (#2705) · fec9dde7

Luis Cosio authored Sep 20, 2025



* Added benchmark

* Added more testing

* Added task definition for mmlu_redux and mmlu_redux_spanish

* Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs

* Add remaining MMLU Redux YAMLs and updated tasks README

* Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs

* Add MMLU Redux changes from pr-2705

* Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names

* Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes

* Revert Python test changes and comment out one task group to avoid Hugging Face rate limit and task failure

---------
Co-authored-by: CT-6282 <ricardo.godric@hotmail.com>

Fix LongBench Evaluation (#3273) · 7f698a5a

Timur Aysin authored Sep 21, 2025



* fix: set 'do_sample=False' and use double quotes in 'doc_to_text'

* feat: update versions and README for longbench

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

08 Sep, 2025 1 commit
- Add the Icelandic WinoGrande benchmark (#3277) · 4f1e9f7c
  James A. Michaelov authored Sep 08, 2025
```
* add icelandic_winogrande

* fix spacing for final words in sentence
```
02 Sep, 2025 4 commits
- Add EsBBQ and CaBBQ tasks (#3167) · 2d7cb5c3
  Valle Ruiz-Fernández authored Sep 02, 2025
```
* Add EsBBQ and CaBBQ tasks

* Linter fixes

* add esbbq and cabbq to task list

---------
Co-authored-by: Júlia Falcão <juliafsfalcao@hotmail.com>
```
- Add `acc_norm` metric to ZhoBLiMP (#3271) · ecebf1bd
  James A. Michaelov authored Sep 02, 2025
  
- Add `acc_norm` to BLiMP-NL (#3272) · aff14e50
  James A. Michaelov authored Sep 02, 2025
  
- Add BHS benchmark (#3265) · 331288bb
  James A. Michaelov authored Sep 02, 2025
```
* run linter

* add acc_norm
```
27 Aug, 2025 3 commits
- Fix codexglue (#3238) · 84aa9f95
  Gül Sena A authored Aug 27, 2025
```
* Fix codex-glue/code2text group issue

* Added README

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
```
- pacify pre-commit (#3268) · 3a9bcc3f
  Baber Abbasi authored Aug 27, 2025
  
- feat(scrolls): delete chat_template from kwargs (#3267) · a35eb973
  Slim Frikha authored Aug 27, 2025
  
26 Aug, 2025 1 commit

Support for AIME dataset (#3248) · 5ac7cdf8

Janna authored Aug 26, 2025

* add AIME tasks

* standardize the repeats

* fix task naming

* aime25 only has test set

* edit readme

* add utils

* standardize

* fix case sensitivity

* repeat once

* lint

* more linting

* lint huggingface.py

25 Aug, 2025 3 commits

Update MMLU-ProX task (#3174) · 0b45cc71
Weihao XUAN authored Aug 26, 2025
```
* update MMLU_ProX

* update MMLU_ProX

* cleanup code by pre-commit
```

Adds Anthropic/discrim-eval to lm-evaluation-harness (#3091) · dddfe7ec

William Held authored Aug 25, 2025



* Anthropic Discrim Eval

* Mixed Effects Regression

* Actually wire it all up

* Operator Name Doesn't Exist on Github

* Update lm_eval/tasks/discrim_eval/discrim_eval_implicit.yaml
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* Update discrim_eval_implicit.yaml

* Update discrim_eval_explicit.yaml

* pacify pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

feat: Add CLIcK task (#3173) · bb433af7

Geun, Lim authored Aug 25, 2025

* feat: Add CLIcK task

* Fix formatting issues

* Add Click Task Description

* fix: lint

* fix

23 Aug, 2025 1 commit

update `minerva_math` (#3259) · 18d2face

Baber Abbasi authored Aug 24, 2025

* update math_verify

* remove normalization

* use full solution in `parse`

* update version

22 Aug, 2025 1 commit
- fix unknown group key to tag (#3222) · 358bfa37
  Patrick Haller authored Aug 22, 2025
```
Co-authored-by: Patrick Haller <phmaker@Patricks-MacBook-Pro.local>
```
21 Aug, 2025 6 commits
- Add LM-SynEval Benchmark (#3184) · 938a4fb3
  James A. Michaelov authored Aug 21, 2025
```
* add lm_syneval

* edit readme

* update task readme

* formatting fixes

* run linting

* add descriptions and examples

* clean readme formatting
```
- Add TurBLiMP (#3219) · d355eac0
  James A. Michaelov authored Aug 21, 2025
```
* add turblimp

* update general task readme

* add normalized accuracy
```
- Add BLiMP-NL (#3221) · b0040ba0
  James A. Michaelov authored Aug 21, 2025
```
* add blimp_nl

* add template yaml file
```
- Add ZhoBLiMP benchmark (#3218) · 1bd96448
  James A. Michaelov authored Aug 21, 2025
```
* add zhoblimp files

* correct group name

* fix group

* add normalized accuracy
```
- add xnli_va dataset to catalan_bench (#3194) · 51d8a192
  FranValero97 authored Aug 21, 2025
  
- Update utils.py (#3246) · a4fd524f
  Anri Lombard authored Aug 21, 2025
  
08 Aug, 2025 1 commit
- Remove `trust_remote_code: True` from updated datasets (#3213) · 7f04db12
  Avelina Asada Hadji-Kyriacou authored Aug 08, 2025
```
* Update afridiacritics_yaml

* Update afrisenti

* Update nollysenti

* Update ntrex

* Update salt
```
04 Aug, 2025 4 commits

improve include-path precedence handling (#3068) · 3214d468

parkhs21 authored Aug 04, 2025



* improve include-path precedence handling

* test: add task for test

* add test for include path precedence handling

* Refactor `test_include_path.py`

---------
Co-authored-by: Baber <baber@hey.com>

Update README.md for mlqa (#3117) · 584de690
Matthias Neumayer authored Aug 04, 2025
```
The tasks are called by task name alone, without the .yaml extension
```

Fix humaneval_instruct (#3201) · edf3aa7a

Idan Tene authored Aug 04, 2025

* Update humaneval_64_instruct.yaml

Sync doc_to_text with humaneval_instruct.yaml

* Update humaneval_instruct.yaml

Remove redundant (flawed) spaces

* Update README.md

* Bump task version

Fix `mmlu_continuation` subgroup names to fit Readme and other variants (#3137) · 06ba1d28

Felix Michalak authored Aug 04, 2025

* Update continuation group names to fit Readme

* added changelog to readme and switched datasets from hails to cais

* added missing new line at end of readme

23 Jul, 2025 2 commits
- remove trust-remote-code in configs; fix escape sequences (#3180) · 314f7176
  Baber Abbasi authored Jul 23, 2025
```
* remove trust-remote-code

* add W605 rule
```
- Pin datasets < 4.0.0 (#3172) · 904bba12
  Baber Abbasi authored Jul 23, 2025
```
* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path
```
22 Jul, 2025 2 commits

feat: Add LIBRA benchmark for long-context evaluation (#2943) · 091aaf6f

Svetlana Karimova authored Jul 22, 2025



* Feat: add LIBRA benchmark

* Feat: add dataset filter to LIBRA

* Fix: formatting through pre-commit and main tasks README

* Fix: resolve conflict

* Fix: dataset name to real

* Fix: delete unnecessary datasets and correct dependency

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks (#3124) · 250a04ec

Geun, Lim authored Jul 22, 2025



* Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks

* Increased max_gen_toks to 2048 (matches Appendix B of original paper). Added Evaluation Settings and Changelog sections.

* add some logs

---------
Co-authored-by: Baber <baber@hey.com>

19 Jul, 2025 3 commits
- multiblimp (#3162) · 4366fc82
  Baber Abbasi authored Jul 20, 2025
  
- Add the MultiBLiMP benchmark (#3155) · de4ce482
  James A. Michaelov authored Jul 19, 2025
```
* add multiblimp

* run linter
```
- Bugfix: update path for GLUE (#3159) · 8c025369
  Avelina Asada Hadji-Kyriacou authored Jul 19, 2025
```
* Update default.yaml
```
18 Jul, 2025 1 commit
- Fix medical benchmarks import (#3151) · 489fbc21
  Idan Tene authored Jul 18, 2025
```
* Update utils.py
```
16 Jul, 2025 1 commit

`bbh_cot_fewshot`: Removed repeated "Let's think step by step." text from bbh cot prompts (#3140) · c2be7211

philipdoldo authored Jul 16, 2025



* Removed the 'Let's think step by step.' text from the start of the target entry in each sample, so that the phrase is no longer repeated in the few-shot prompts and the behavior matches the original bbh repository. This applied to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml. In boolean_expressions.yaml, in my opinion there is an error: the prompt does not include the 'Remember that (i) ...' text after the final 'A: Let's think step by step.' Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying the few-shot prompts), but I think it really should have been part of the prompt, much like 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. The original bbh repo has the same issue, so I think it is fine to keep it this way for consistency, but I thought I'd point it out anyway.

* feat: remove extra space from answers; add changelog

---------
Co-authored-by: Baber <baber@hey.com>
