Commits · 18d2faceca2944ca79746e7396adab013ea96ba1 · gaoqiong / lm-evaluation-harness

23 Aug, 2025 1 commit

update `minerva_math` (#3259) · 18d2face

Baber Abbasi authored Aug 24, 2025

* update math_verify

* remove normalization

* use full solution in `parse`

* update version

18d2face

22 Aug, 2025 1 commit
- fix unknown group key to tag (#3222) · 358bfa37
  Patrick Haller authored Aug 22, 2025
```
Co-authored-by: Patrick Haller <phmaker@Patricks-MacBook-Pro.local>
```
  358bfa37
21 Aug, 2025 6 commits
- Add LM-SynEval Benchmark (#3184) · 938a4fb3
  James A. Michaelov authored Aug 21, 2025
```
* add lm_syneval

* edit readme

* update task readme

* formatting fixes

* run linting

* add descriptions and examples

* clean readme formatting
```
  938a4fb3
- Add TurBLiMP (#3219) · d355eac0
  James A. Michaelov authored Aug 21, 2025
```
* add turblimp

* update general task readme

* add normalized accuracy
```
  d355eac0
- Add BLiMP-NL (#3221) · b0040ba0
  James A. Michaelov authored Aug 21, 2025
```
* add blimp_nl

* add template yaml file
```
  b0040ba0
- Add ZhoBLiMP benchmark (#3218) · 1bd96448
  James A. Michaelov authored Aug 21, 2025
```
* add zhoblimp files

* correct group name

* fix group

* add normalized accuracy
```
  1bd96448
- add xnli_va dataset to catalan_bench (#3194) · 51d8a192
  FranValero97 authored Aug 21, 2025
  
  51d8a192
- Update utils.py (#3246) · a4fd524f
  Anri Lombard authored Aug 21, 2025
  
  a4fd524f
08 Aug, 2025 1 commit
- Remove `trust_remote_code: True` from updated datasets (#3213) · 7f04db12
  Avelina Asada Hadji-Kyriacou authored Aug 08, 2025
```
* Update afridiacritics_yaml

* Update afrisenti

* Update nollysenti

* Update ntrex

* Update salt
```
  7f04db12
04 Aug, 2025 4 commits

improve include-path precedence handling (#3068) · 3214d468

parkhs21 authored Aug 04, 2025



* improve include-path precedence handling

* test: add task for test

* add test for include path precedence handling

* Refactor `test_include_path.py`

---------
Co-authored-by: Baber <baber@hey.com>

3214d468

Update README.md for mlqa (#3117) · 584de690
Matthias Neumayer authored Aug 04, 2025
```
The tasks are called by task name only, without the .yaml extension
```
584de690

Fix humaneval_instruct (#3201) · edf3aa7a

Idan Tene authored Aug 04, 2025

* Update humaneval_64_instruct.yaml

Sync doc_to_text with humaneval_instruct.yaml

* Update humaneval_instruct.yaml

Remove redundant (flawed) spaces

* Update README.md

* Bump task version

edf3aa7a

Fix `mmlu_continuation` subgroup names to fit Readme and other variants (#3137) · 06ba1d28

Felix Michalak authored Aug 04, 2025

* Update continuation group names to fit Readme

* added changelog to readme and switched datasets from hails to cais

* added missing new line at end of readme

06ba1d28

23 Jul, 2025 2 commits
- remove trust-remote-code in configs; fix escape sequences (#3180) · 314f7176
  Baber Abbasi authored Jul 23, 2025
```
* remove trust-remote-code

* add W605 rule
```
  314f7176
- Pin datasets < 4.0.0 (#3172) · 904bba12
  Baber Abbasi authored Jul 23, 2025
```
* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path
```
  904bba12
22 Jul, 2025 2 commits

feat: Add LIBRA benchmark for long-context evaluation (#2943) · 091aaf6f

Svetlana Karimova authored Jul 22, 2025



* Feat: add LIBRA benchmark

* Feat: add dataset filter to LIBRA

* Fix: formatting through pre-commit and main tasks README

* Fix: resolve conflict

* Fix: dataset name to real

* Fix: delete unnecessary datasets and correct dependency

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

091aaf6f

Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks (#3124) · 250a04ec

Geun, Lim authored Jul 22, 2025



* Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks

* Increased max_gen_toks to 2048 (matches Appendix B of original paper); added Evaluation Settings and Changelog sections.

* add some logs

---------
Co-authored-by: Baber <baber@hey.com>

250a04ec

19 Jul, 2025 3 commits
- multiblimp (#3162) · 4366fc82
  Baber Abbasi authored Jul 20, 2025
  
  4366fc82
- Add the MultiBLiMP benchmark (#3155) · de4ce482
  James A. Michaelov authored Jul 19, 2025
```
* add multiblimp

* run linter
```
  de4ce482
- Bugfix: update path for GLUE (#3159) · 8c025369
  Avelina Asada Hadji-Kyriacou authored Jul 19, 2025
```
* Update default.yaml
```
  8c025369
18 Jul, 2025 1 commit
- Fix medical benchmarks import (#3151) · 489fbc21
  Idan Tene authored Jul 18, 2025
```
* Update utils.py
```
  489fbc21
16 Jul, 2025 1 commit

`bbh_cot_fewshot`: Removed repeated "Let's think step by step." text from bbh cot prompts (#3140) · c2be7211

philipdoldo authored Jul 16, 2025



* Removed the 'Let's think step by step.' text from the start of the target entry in each sample, so that the phrase is no longer repeated in the few-shot prompts and the behavior matches the original bbh repository. This applies to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml.

  For boolean_expressions.yaml, I think there is an error: the prompt does not include the 'Remember that (i) ...' text after the final 'A: Let's think step by step.' Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying the few-shot prompts), but I think it should have been part of the prompt, much like 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. The original bbh repo has the same issue, so I think it is fine to keep it this way for consistency; I am just pointing it out.

* feat: remove extra space from answers; add changelog

---------
Co-authored-by: Baber <baber@hey.com>

c2be7211

14 Jul, 2025 1 commit

Adding EgyMMLU and EgyHellaSwag (#3063) · 2ea6114e

Atou Houdaifa authored Jul 14, 2025

* add egy mmlu hellaswag

* add egymmlu egyhellaswag to tasks readme

* fix egymmlu config generation

* fix _generate_configs formatting

2ea6114e

10 Jul, 2025 1 commit

warning for "chat" pretrained; disable buggy evalita configs (#3127) · f3a0b554

Baber Abbasi authored Jul 10, 2025

* check for chat for warning

* add test

* remove yaml extension from some evalita configs

* move unitxt to own test script

* fix CI test

f3a0b554

03 Jul, 2025 2 commits

Humaneval - fix regression (#3102) · 8c1016cb
Baber Abbasi authored Jul 03, 2025
```
* use double quotes
```
8c1016cb

Truthfulqa multi harness (#3062) · e0dc33ae

Blanca Calvo authored Jul 03, 2025



* truthfulqa-multi task

* truthfulqa-multi with chat few-shot

* few shot chat implementation

* changed until so it outputs lists

* changed dataset location

* added MT task

* Create README.md

* do not include MT

* changes for PR

* tag change

* removed yaml extension

* adding task to the table

* fix task configs

* add import exception

---------
Co-authored-by: Baber <baber@hey.com>

e0dc33ae

30 Jun, 2025 1 commit

FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435

jinze authored Jul 01, 2025

* Fix: Align the Humaneval dataset with official results

Details: (1) Modified "doc_to_text" and "gen_prefix" in "humaneval_instruct.yaml" to match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".

(2) Changed r.rfind("```") to r.find("```") so that it locates the first "```" rather than the last.

Results: Partially reproduced the official results. LLaMA3.1-8B-Instruct scores 66.5 (official: 72.6) and LLaMA3.1-70B-Instruct scores 80.5 (official: 80.5).

Ref: PR#2650

* add changelog and version

* add changelog

a7ca0435
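
For readers unfamiliar with the `find` / `rfind` distinction mentioned in this commit, the following is a minimal sketch of why it matters when a model response contains more than one code fence. The helper name `cut_at_fence` and the sample response are hypothetical illustrations, not the harness's actual code.

```python
# Hypothetical illustration; not the harness's actual implementation.
FENCE = "`" * 3  # three backticks, built this way so the example's own fence stays intact


def cut_at_fence(response: str, first: bool = True) -> str:
    """Return the text before a code fence, or the whole response if there is none.

    first=True uses str.find (first fence); first=False uses str.rfind (last fence).
    """
    idx = response.find(FENCE) if first else response.rfind(FENCE)
    return response if idx == -1 else response[:idx]


# A made-up completion: the function body, a closing fence, then extra chatter
# that itself contains another fenced block.
response = (
    "    return a + b\n"
    + FENCE
    + "\nExplanation:\n"
    + FENCE
    + "python\nprint(add(1, 2))\n"
    + FENCE
)

first_cut = cut_at_fence(response, first=True)   # stops at the first fence
last_cut = cut_at_fence(response, first=False)   # stops at the last fence

print(repr(first_cut))
print(repr(last_cut))
```

With `find`, only the completion before the first fence is kept; with `rfind`, the trailing explanation and its embedded fences are kept as well, which would likely break execution of the extracted code.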

25 Jun, 2025 1 commit
- Ensure backwards compatibility in fewshot_context by using kwargs (#3079) · 532909c0
  Kiersten Stokes authored Jun 25, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  532909c0
20 Jun, 2025 1 commit

llama3 task: update README.md (#3074) · 68c3a811

Anna Fontana authored Jun 20, 2025

"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).

68c3a811

19 Jun, 2025 2 commits
- Update instructions.py (#3060) · 37357004
  Maxim Evtush authored Jun 19, 2025
  
  37357004
- Update README.md (#3070) · 5a15058e
  Anna Fontana authored Jun 19, 2025
```
Wrong task name: mmlu_generation doesn't exist -> mmlu_generative is the correct one
```
  5a15058e
16 Jun, 2025 2 commits
- fix longbench citation (#3061) · 9fbe48c2
  Baber Abbasi authored Jun 16, 2025
```
* fix longbench citation
```
  9fbe48c2
- Fix Typo in README and Comment in utils_mcq.py (#3057) · e20ef72e
  fuder.eth authored Jun 16, 2025
```
* Update README.md

* Update utils_mcq.py
```
  e20ef72e
12 Jun, 2025 1 commit
- Fallback to super impl in fewshot_context for Unitxt tasks (#3023) · d09e03dd
  Kiersten Stokes authored Jun 12, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  d09e03dd
08 Jun, 2025 1 commit

[longbench] fix metric calculation (#2983) · 147e9d61

Baber Abbasi authored Jun 08, 2025

* use all answers

* use middle truncation

* maybe fix classification score

* strip classification preds

* [vllm] remove stop tokens post-hoc

* strip all preds

* pacify pre-commit

* start on truncation utility

* add to readme

* add a footgun doc

* fix newline in yaml templates

* do not strip code_sim preds!

* fix pre-commit config

* fix instruction warning

* add note to longbench readme

147e9d61
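
The "use middle truncation" step above can be sketched as follows, assuming it means keeping the beginning and end of an over-long prompt and dropping tokens from the middle. The function name `truncate_middle` and the even head/tail split are assumptions made for illustration; the commit message does not specify the exact scheme used in the harness.

```python
from typing import List, Sequence


def truncate_middle(tokens: Sequence[int], max_len: int) -> List[int]:
    """Keep the head and tail of `tokens`, dropping the middle, so the result has at most max_len items.

    Hypothetical helper: the head gets max_len // 2 tokens and the tail gets the rest.
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 2
    tail = max_len - head
    # Guard tail == 0, because tokens[-0:] would return the whole sequence.
    return list(tokens[:head]) + (list(tokens[-tail:]) if tail else [])


ids = list(range(10))
print(truncate_middle(ids, 4))  # keeps the first two and last two token ids
```

The usual motivation for cutting the middle rather than the end is that long-context prompts tend to put instructions at the start and the question at the end, so both survive truncation.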

03 Jun, 2025 2 commits

remove prints (#3041) · 9f152e0b
Baber Abbasi authored Jun 03, 2025

9f152e0b

add Mbpp instruct (#2995) · 60e85da5

Baber Abbasi authored Jun 03, 2025

* feat: add mbpp_instruct

* fix: update generation_kwargs to use an empty until list

* fix: correct predictions formatting in pass_at_1 function

* fix: improve code block extraction by checking first without opening backticks

* fix mbpp `pass_at_1`

60e85da5

26 May, 2025 1 commit
- add arab_culture task (#3006) · 8bc4afff
  Boda Sadallah authored May 26, 2025
```
* add arab_culture tasks

* add target_delimiter and remove debugging code
```
  8bc4afff
21 May, 2025 1 commit
- add kbl 2025 (#3000) · 8be417a8
  Hongseok Oh authored May 21, 2025
  
  8be417a8
19 May, 2025 1 commit
- [SGLANG] Add the SGLANG generate API (#2997) · 53c65300
  Baber Abbasi authored May 19, 2025
```
* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit
```
  53c65300