Commits · aime24 · gaoqiong / lm-evaluation-harness

11 Mar, 2025 7 commits

Create utils.py · 861c5c27
Stella Biderman authored Mar 11, 2025

Create aime24.yaml · 09228840
Stella Biderman authored Mar 11, 2025


New healthcare benchmark: careqa (#2714) · 7c9fbcf8

PabloAgustin authored Mar 11, 2025



* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>


Fix for mc2 calculation (#2768) · 2c8ffb80

Kajetan Dymkiewicz authored Mar 11, 2025



* fix for mc2 calculation

* increment versions and changelog

---------
Co-authored-by: Baber <baber@hey.com>


Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only (#2773) · c8044f30

Yotam Perlitz authored Mar 11, 2025



* Filter new leaderboard_math_hard dataset to "Level 5" only

* align to linters
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>

---------
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>


Use yaml.CLoader to load yaml files when available. (#2777) · 8cfa0d74
Giulio Lovisotto authored Mar 11, 2025

initialize tokenizer with bos_token (#2781) · 07bd7e23
Baber Abbasi authored Mar 11, 2025


10 Mar, 2025 1 commit
- docs: Fix typos in README.md (#2778) · ebb498e4
  Rui Vieira authored Mar 10, 2025
  
06 Mar, 2025 1 commit
- fix verbosity typo (#2765) · 4890e881
  Baber Abbasi authored Mar 06, 2025
  
05 Mar, 2025 3 commits
- Bugfix (#2762) · dce4bd35
  Baber Abbasi authored Mar 05, 2025
```
* bug fix

* add warning for instruct models

* nit
```
- fix: mmlu (generative) metric aggregation (#2761) · 38a9c530
  Yongkeun Hwang authored Mar 05, 2025
  
- increment version to 0.4.8 (#2760) · 6d2abda4
  Baber Abbasi authored Mar 05, 2025
  
04 Mar, 2025 3 commits

add debug log (#2757) · 74332955
Baber Abbasi authored Mar 04, 2025


Add test for a simple Unitxt task (#2742) · 8b5c5c13

Kiersten Stokes authored Mar 04, 2025

* Add a test for a custom unitxt task

* Update task.py to bring in line with breaking change in v1.17.2

* Fix lint


Enable steering HF models (#2749) · d35008f1

Lucia Quirke authored Mar 04, 2025



* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>


03 Mar, 2025 3 commits

fix doc: generate_until only outputs the generated text! (#2755) · 14b0bd26
Baber Abbasi authored Mar 03, 2025


Groundcocoa (#2724) · ade01428

Harsh Kohli authored Mar 03, 2025



* Fix failing tests

* Resolved merge conflicts

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>


[Readme change for SGLang] fix error in readme and add OOM solutions for sglang (#2738) · 529f4805

Jinwei authored Mar 02, 2025



* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and it passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

* add OOM guideline for sglang and fix readme error

* fix typo

* fix typo

* add readme

---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>


27 Feb, 2025 1 commit
- fix vllm data parallel (#2746) · a87fe425
  Baber Abbasi authored Feb 27, 2025
```
* remove ray.remote resources

* remove kobtest tag (registered as group)
```
26 Feb, 2025 1 commit
- fix log condition (#2737) · af2d2f3e
  Baber Abbasi authored Feb 26, 2025
  
25 Feb, 2025 4 commits

Support SGLang as Potential Backend for Evaluation (#2703) · 29971faa

Jinwei authored Feb 25, 2025



* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and it passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>


add humaneval+ and mbpp+ (#2734) · 86bbf6ac
Minho Ryu authored Feb 25, 2025
```
* add humaneval+ and mbpp+

* add newline at end of file
```

Fix the import source for eval_logger (#2735) · 9b29da00

Kailashbuki authored Feb 25, 2025



* Fix the import source for eval_logger

* fix logging

---------
Co-authored-by: Baber <baber@hey.com>


add cocoteros_es dataset (#2721) · 2b2fa97b
Santiago Galiano Segura authored Feb 25, 2025
```
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>
```

24 Feb, 2025 3 commits

add Basque translation of ARC and PAWS to BasqueBench (#2732) · 2f403fa0

Naiara Perez authored Feb 25, 2025



* add Basque translation of ARC and PAWS to BasqueBench

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>


add o3-mini support (#2697) · 01849b40
Jocelyn authored Feb 25, 2025
```
* add o3-mini support

* fix linter tests
```

Added IberoBench citation info... · a9a0e3ca

Naiara Perez authored Feb 24, 2025

Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in corresponding READMEs (#2729)


23 Feb, 2025 1 commit
- remove unused import (#2728) · 5e0b6f16
  Baber Abbasi authored Feb 23, 2025
  
21 Feb, 2025 3 commits

fix missing dataset repo (#2719) · 0bf9f4ea
Farhan Ahmed authored Feb 21, 2025


Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>


add math_verify to some tasks (#2686) · 358adaf7

Baber Abbasi authored Feb 21, 2025

* add math_verify to minerva math

* add math_verify to benchmark

* fix error

* increment version


17 Feb, 2025 1 commit
- fix vllm (#2708) · 52df63b7
  Baber Abbasi authored Feb 17, 2025
```
* fix vllm

* fix data_parallel

* copy to multimodal
```
14 Feb, 2025 4 commits
- `arithmetic`: set target delimiter to empty string (#2701) · 41b952f3
  Baber Abbasi authored Feb 14, 2025
```
* set target delimiter to empty string

* nit

* add warning
```
- fix `construct_requests` kwargs (#2700) · 5a5acc08
  Baber Abbasi authored Feb 14, 2025
  
- Update README.md (#2694) · 157d8c3c
  Irina Proskurina authored Feb 14, 2025
  
- Update remaining references to assistant_prefill to gen_prefix (#2683) · ef6f5243
  Kiersten Stokes authored Feb 14, 2025
  
13 Feb, 2025 1 commit
- set aggregation and higher_is_better (instead of falling back on defaults) (#2692) · c3c05b06
  James A. Michaelov authored Feb 13, 2025
  
12 Feb, 2025 2 commits
- change ensure_ascii to False for JsonChatStr (#2691) · 96f5e58f
  achervyakov authored Feb 13, 2025
  
- Update unitxt task.py to bring in line with recent repo changes (#2684) · 8751fb35
  Kiersten Stokes authored Feb 12, 2025
  
11 Feb, 2025 1 commit
- Delete lm_eval/tasks/evalita_llm/single_prompt.zip (#2687) · 684fd2dd
  Baber Abbasi authored Feb 11, 2025
  