Commits · a57ffba1be6aab24b1f4ff5ad264ca7fc37cc00f · gaoqiong / lm-evaluation-harness

25 Sep, 2025 6 commits
- fixup! merge · bcd6faaa
  Baber authored Sep 26, 2025
  
- add tests · 2b32f7be
  Baber authored Jul 28, 2025
  
- update default values; fixes · b89af51e
  Baber authored Jul 10, 2025
  
- move test one doc to method · 7cef4d38
  Baber authored Jul 23, 2025
  
- remove deps; types · 4ad6cd9f
  Baber authored Jul 22, 2025
  
- ruff rules; types · 1768fd3b
  Baber authored Jul 19, 2025
  
25 Aug, 2025 1 commit

Adds Anthropic/discrim-eval to lm-evaluation-harness (#3091) · dddfe7ec

William Held authored Aug 25, 2025



* Anthropic Discrim Eval

* Mixed Effects Regression

* Actually wire it all up

* Operator Name Doesn't Exist on Github

* Update lm_eval/tasks/discrim_eval/discrim_eval_implicit.yaml
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* Update discrim_eval_implicit.yaml

* Update discrim_eval_explicit.yaml

* pacify pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>


04 Aug, 2025 1 commit
- Bump version to 0.4.9.1 (#3208) · d021bf84
  Baber Abbasi authored Aug 04, 2025
  
26 Jul, 2025 1 commit
- add task factory · 4254c7bd
  Baber authored Jul 26, 2025
  
23 Jul, 2025 2 commits
- remove trust-remote-code in configs; fix escape sequences (#3180) · 314f7176
  Baber Abbasi authored Jul 23, 2025
```
* remove trust-remote-code

* add W605 rule
```
- Pin datasets < 4.0.0 (#3172) · 904bba12
  Baber Abbasi authored Jul 23, 2025
```
* Fix: pin datasets < 4.0

* fix

* update type hints in HF

* fix hellaswag path
```
22 Jul, 2025 1 commit

feat: Add LIBRA benchmark for long-context evaluation (#2943) · 091aaf6f

Svetlana Karimova authored Jul 22, 2025



* Feat: add LIBRA benchmark

* Feat: add dataset filter to LIBRA

* Fix: formatting through pre-commit and main tasks README

* Fix: resolve conflict

* Fix: dataset name to real

* Fix: delete unnecessary datasets and correct dependency

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>


06 Jul, 2025 2 commits
- Neuralmagic (#3113) · 89654090
  Baber Abbasi authored Jul 06, 2025
```
* remove sparse-ml
```
- delete neuralmagic models (#3112) · f93001db
  Baber Abbasi authored Jul 06, 2025
  
05 Jul, 2025 1 commit
- remove all; reformat table (#3107) · 28001d29
  Baber Abbasi authored Jul 05, 2025
  
19 Jun, 2025 1 commit
- bump version to `0.4.9` (#3073) · 45274951
  Baber Abbasi authored Jun 19, 2025
  
19 May, 2025 1 commit

Adding ACPBench Hard tasks (#2980) · 0daf28fd

Harsha authored May 19, 2025

* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper


13 May, 2025 1 commit
- Pin unitxt to most recent major version to avoid test failures (#2970) · af8b87cc
  Kiersten Stokes authored May 13, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
16 Apr, 2025 1 commit
- mmlu - switch dataset to cais/mmlu; fix tests (#2918) · cb316a18
  Baber Abbasi authored Apr 16, 2025
```
* switch MMLU to cais/mmlu

* switch back to tj-actions/changed-files

* cache HF folder
```
03 Apr, 2025 1 commit
- Fix the deps of longbench from jeiba to jieba (#2873) · 19ba1b16
  Lu Fang authored Apr 02, 2025
```
Signed-off-by: Lu Fang <lufang@fb.com>
```
20 Mar, 2025 1 commit

Add Markdown linter (#2818) · 7158f4f4

Kiersten Stokes authored Mar 19, 2025

* Add markdown linter to pre-commit hooks

* Reformat existing markdown (excluding lm_eval/tasks/*.md)


19 Mar, 2025 1 commit
- Clean up README and pyproject.toml (#2814) · ce9ba47e
  Kiersten Stokes authored Mar 19, 2025
  
18 Mar, 2025 1 commit

Add longcxt tasks (#2629) · 80a10075

Baber Abbasi authored Mar 18, 2025

support for long context (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig


17 Mar, 2025 1 commit

Add support for token-based auth for watsonx models (#2796) · 78d57e0f

Kiersten Stokes authored Mar 17, 2025

* Add support for token-based auth for watsonx models

* Fix lint

* Move dotenv import to inner scope

* Improve readability of _verify_credentials


14 Mar, 2025 1 commit

add audio modality (qwen2 audio only) (#2689) · 62552d2c

achervyakov authored Mar 14, 2025



* Added audio-modality pipeline for qwen2-audio model

* Beauty imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>


05 Mar, 2025 1 commit
- increment version to 0.4.8 (#2760) · 6d2abda4
  Baber Abbasi authored Mar 05, 2025
  
04 Mar, 2025 2 commits

Add test for a simple Unitxt task (#2742) · 8b5c5c13

Kiersten Stokes authored Mar 04, 2025

* Add a test for a custom unitxt task

* Update task.py to bring in line with breaking change in v1.17.2

* Fix lint


Enable steering HF models (#2749) · d35008f1

Lucia Quirke authored Mar 04, 2025



* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>


21 Feb, 2025 1 commit

add math_verify to some tasks (#2686) · 358adaf7

Baber Abbasi authored Feb 21, 2025

* add math_verify to minerva math

* add math_verify to benchmark

* fix error

* increment version


17 Dec, 2024 2 commits
- drop python 3.8 support (#2575) · 8558b8d4
  Baber Abbasi authored Dec 17, 2024
```
* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit
```
- increment version (#2574) · 4c26a9c1
  Baber Abbasi authored Dec 17, 2024
```
forgot to increment 0.4.6!
```
15 Nov, 2024 1 commit

IBM watsonx_llm fixes & refactor (#2464) · 4259a6d4

Nikodem Szwast authored Nov 15, 2024

* refactor code, fix config path bug

* update types to be from typing lib

* add pre-commit formatting

* specify version of ibm_watsonx_ai package

* adjust get_watsonx_credentials() function, add minor refactor to address PR review comments

* change missing installation hint from ibm_watsonx_ai to lm_eval[ibm_watsonx_ai]


05 Nov, 2024 1 commit

Add Japanese Leaderboard (#2439) · 26f607f5

mtkachenko authored Nov 05, 2024

* add jaqket_v2 and jcommonsenseqa

* remove comments

* remove num_beams as it is incompatible with vllm

* add jnli + refactor

* rename jnla -> jnli

* add jsquad + replace colon chars with the Japanese unicode

* ignore whitespaces in generation tasks

* add marc_ja

* add xwinograd + simplify other yamls

* add mgsm and xlsum

* refactor xlsum

* add ja_leaderboard tag

* edit README.md

* update README.md

* add credit + minor changes

* run ruff format

* address review comments + add group

* remove aggregate_metric_list

* remove tags

* update tasks/README.md


31 Oct, 2024 1 commit

Add GPTQModel support for evaluating GPTQ models (#2217) · 4f8e479e

Qubitium-ModelCloud authored Nov 01, 2024



* support gptqmodel

* code opt

* add gptqmodel option

* Update huggingface.py

* Update pyproject.toml

* gptqmodel version upgraded to 1.0.6

* GPTQModel version upgraded to 1.0.8

* Update pyproject.toml

* fix ruff-format error

* add gptqmodel test

* Update gptqmodel test model

* skip cuda

* python3.8 compatible

* Update README.md

* Update README.md

---------
Co-authored-by: CL-ModelCloud <cl@modelcloud.ai>


25 Oct, 2024 1 commit

Fix package extras for watsonx support (#2426) · 7882043b

Kiersten Stokes authored Oct 25, 2024



* Update pyproject.toml with watsonx package extra
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

* Remove unused function
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

---------
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>


23 Oct, 2024 1 commit

Support for IBM watsonx_llm (#2397) · 1185e89a

Nikodem Szwast authored Oct 23, 2024



* add support for IBM watsonx_llm

* add ibm_watsonx_ai package to optional-dependencies

* move global scope imports to inner scope

* change cache to lru_cache

* fix circular import

* use 3.8 typing

* use 3.8 typing

---------
Co-authored-by: Baber <baber@hey.com>


08 Oct, 2024 1 commit
- Bump version to v0.4.5 (#2389) · 0845b588
  Hailey Schoelkopf authored Oct 08, 2024
  
05 Sep, 2024 1 commit
- Bump version to v0.4.4 ; Fixes to TMMLUplus (#2280) · 543617fe
  Hailey Schoelkopf authored Sep 05, 2024
  
28 Aug, 2024 1 commit
- update nltk version to require 3.9.1 (#2259) · 2de3688f
  Hailey Schoelkopf authored Aug 28, 2024
  
01 Aug, 2024 1 commit

refactor: limit usage of `scipy` and `sklearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
