Commits · 84d02f77ced6843dfe7fde525315cb1089d32c19 · gaoqiong / lm-evaluation-harness

06 Jul, 2025 2 commits
- Neuralmagic (#3113) · 89654090
  Baber Abbasi authored Jul 06, 2025
```
* remove sparse-ml
```
  89654090
- delete neuralmagic models (#3112) · f93001db
  Baber Abbasi authored Jul 06, 2025
  
  f93001db
05 Jul, 2025 1 commit
- remove all; reformat table (#3107) · 28001d29
  Baber Abbasi authored Jul 05, 2025
  
  28001d29
23 Jun, 2025 1 commit
- remove pydantic dependency · d0884a96
  Baber authored Jun 23, 2025
  
  d0884a96
19 Jun, 2025 1 commit
- bump version to `0.4.9` (#3073) · 45274951
  Baber Abbasi authored Jun 19, 2025
  
  45274951
19 May, 2025 1 commit

Adding ACPBench Hard tasks (#2980) · 0daf28fd

Harsha authored May 19, 2025

* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper

0daf28fd

13 May, 2025 1 commit
- Pin unitxt to most recent major version to avoid test failures (#2970) · af8b87cc
  Kiersten Stokes authored May 13, 2025
```
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
```
  af8b87cc
23 Apr, 2025 1 commit
- update pyproject with pydantic dep · 9c94fb2e
  artemorloff authored Apr 23, 2025
  
  9c94fb2e
16 Apr, 2025 1 commit
- mmlu - switch dataset to cais/mmlu; fix tests (#2918) · cb316a18
  Baber Abbasi authored Apr 16, 2025
```
* switch MMLU to cais/mmlu

* switch back to tj-actions/changed-files

* cache HF folder
```
  cb316a18
03 Apr, 2025 1 commit
- Fix the deps of longbench from jeiba to jieba (#2873) · 19ba1b16
  Lu Fang authored Apr 02, 2025
```
Signed-off-by: Lu Fang <lufang@fb.com>
```
  19ba1b16
20 Mar, 2025 1 commit

Add Markdown linter (#2818) · 7158f4f4

Kiersten Stokes authored Mar 19, 2025

* Add markdown linter to pre-commit hooks

* Reformat existing markdown (excluding lm_eval/tasks/*.md)

7158f4f4

19 Mar, 2025 1 commit
- Clean up README and pyproject.toml (#2814) · ce9ba47e
  Kiersten Stokes authored Mar 19, 2025
  
  ce9ba47e
18 Mar, 2025 1 commit

Add longcxt tasks (#2629) · 80a10075

Baber Abbasi authored Mar 18, 2025

support for long-context (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig

80a10075

17 Mar, 2025 1 commit

Add support for token-based auth for watsonx models (#2796) · 78d57e0f

Kiersten Stokes authored Mar 17, 2025

* Add support for token-based auth for watsonx models

* Fix lint

* Move dotenv import to inner scope

* Improve readability of _verify_credentials

78d57e0f

14 Mar, 2025 1 commit

add audio modality (qwen2 audio only) (#2689) · 62552d2c

achervyakov authored Mar 14, 2025



* Added audio-modality pipeline for qwen2-audio model

* Beautify imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>

62552d2c

05 Mar, 2025 1 commit
- increment version to 0.4.8 (#2760) · 6d2abda4
  Baber Abbasi authored Mar 05, 2025
  
  6d2abda4
04 Mar, 2025 2 commits

Add test for a simple Unitxt task (#2742) · 8b5c5c13

Kiersten Stokes authored Mar 04, 2025

* Add a test for a custom unitxt task

* Update task.py to bring in line with breaking change in v1.17.2

* Fix lint

8b5c5c13

Enable steering HF models (#2749) · d35008f1

Lucia Quirke authored Mar 04, 2025



* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

d35008f1

21 Feb, 2025 1 commit

add math_verify to some tasks (#2686) · 358adaf7

Baber Abbasi authored Feb 21, 2025

* add math_verify to minerva math

* add math_verify to benchmark

* fix error

* increment version

358adaf7

17 Dec, 2024 2 commits
- drop python 3.8 support (#2575) · 8558b8d4
  Baber Abbasi authored Dec 17, 2024
```
* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit
```
  8558b8d4
- increment version (#2574) · 4c26a9c1
  Baber Abbasi authored Dec 17, 2024
```
forgot to increment 0.4.6!
```
  4c26a9c1
15 Nov, 2024 1 commit

IBM watsonx_llm fixes & refactor (#2464) · 4259a6d4

Nikodem Szwast authored Nov 15, 2024

* refactor code, fix config path bug

* update types to be from typing lib

* add pre-commit formatting

* specify version of ibm_watsonx_ai package

* adjust get_watsonx_credentials() function, add minor refactor to address PR review comments

* change missing installation hint from ibm_watsonx_ai to lm_eval[ibm_watsonx_ai]

4259a6d4

05 Nov, 2024 1 commit

Add Japanese Leaderboard (#2439) · 26f607f5

mtkachenko authored Nov 05, 2024

* add jaqket_v2 and jcommonsenseqa

* remove comments

* remove num_beams as it is incompatible with vllm

* add jnli + refactor

* rename jnla -> jnli

* add jsquad + replace colon chars with the Japanese unicode

* ignore whitespaces in generation tasks

* add marc_ja

* add xwinograd + simplify other yamls

* add mgsm and xlsum

* refactor xlsum

* add ja_leaderboard tag

* edit README.md

* update README.md

* add credit + minor changes

* run ruff format

* address review comments + add group

* remove aggregate_metric_list

* remove tags

* update tasks/README.md

26f607f5

31 Oct, 2024 1 commit

Add GPTQModel support for evaluating GPTQ models (#2217) · 4f8e479e

Qubitium-ModelCloud authored Nov 01, 2024



* support gptqmodel

* code opt

* add gptqmodel option

* Update huggingface.py

* Update pyproject.toml

* gptqmodel version upgraded to 1.0.6

* GPTQModel version upgraded to 1.0.8

* Update pyproject.toml

* fix ruff-format error

* add gptqmodel test

* Update gptqmodel test model

* skip cuda

* python3.8 compatible

* Update README.md

* Update README.md

---------
Co-authored-by: CL-ModelCloud <cl@modelcloud.ai>

4f8e479e

25 Oct, 2024 1 commit

Fix package extras for watsonx support (#2426) · 7882043b

Kiersten Stokes authored Oct 25, 2024



* Update pyproject.toml with watsonx package extra
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

* Remove unused function
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

---------
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

7882043b

23 Oct, 2024 1 commit

Support for IBM watsonx_llm (#2397) · 1185e89a

Nikodem Szwast authored Oct 23, 2024



* add support for IBM watsonx_llm

* add ibm_watsonx_ai package to optional-dependencies

* move global scope imports to inner scope

* change cache to lru_cache

* fix circular import

* use 3.8 typing

* use 3.8 typing

---------
Co-authored-by: Baber <baber@hey.com>

1185e89a
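
The "change cache to lru_cache" and "use 3.8 typing" items above are consistent with keeping Python 3.8 compatibility: `functools.cache` only exists from Python 3.9, while `functools.lru_cache(maxsize=None)` behaves the same way and works on 3.8. A minimal sketch of that pattern follows; `load_setting` and `CALLS` are hypothetical names for illustration and are not taken from the repository.

```python
from functools import lru_cache

# Hypothetical call counter, used only to show that the body runs once per key.
CALLS = {"count": 0}


@lru_cache(maxsize=None)  # unbounded cache; equivalent to functools.cache on 3.9+
def load_setting(name: str) -> str:
    """Pretend to do an expensive lookup and return a normalized value."""
    CALLS["count"] += 1
    return name.upper()


first = load_setting("url")
second = load_setting("url")  # served from the cache, body is not re-run
```

Cached functions of this kind should return immutable values, since every caller receives the same object.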

08 Oct, 2024 1 commit
- Bump version to v0.4.5 (#2389) · 0845b588
  Hailey Schoelkopf authored Oct 08, 2024
  
  0845b588
05 Sep, 2024 1 commit
- Bump version to v0.4.4 ; Fixes to TMMLUplus (#2280) · 543617fe
  Hailey Schoelkopf authored Sep 05, 2024
  
  543617fe
28 Aug, 2024 1 commit
- update nltk version to require 3.9.1 (#2259) · 2de3688f
  Hailey Schoelkopf authored Aug 28, 2024
  
  2de3688f
01 Aug, 2024 1 commit

refactor: limit usage of `scipy` and `sklearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

7f15cce4
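
Moving a module import into the function that needs it, as the first item of this commit describes, means the dependency is only loaded when that function is actually called. A minimal sketch of the pattern, using the standard-library `statistics` module as a stand-in for a heavy dependency such as `scipy` or `sklearn`; `mean_score` is a hypothetical name, not a function from the repository.

```python
from typing import Sequence


def mean_score(values: Sequence[float]) -> float:
    """Average a list of scores, importing the helper module lazily."""
    # Function-level import: the module is loaded on the first call,
    # not when the enclosing module is imported.
    import statistics

    return statistics.fmean(values)


result = mean_score([1.0, 2.0, 3.0])
```

With this layout, importing the enclosing module does not fail when the optional dependency is absent; the error only surfaces if the function is called.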

22 Jul, 2024 1 commit

Refactor API models (#2008) · 42dc2448

Baber Abbasi authored Jul 23, 2024



* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack loglikelihood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

42dc2448
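
The "shared gen_kwargs mutated" and "handle copy properly" items above point at a common pitfall: popping keys from a dict that several requests share changes it for every later request. A minimal sketch of the usual remedy, copying before mutating, is shown below. `prepare_request` and the key names are assumptions chosen for illustration and do not reproduce the repository's code.

```python
import copy
from typing import Any, Dict


def prepare_request(gen_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Build a request payload without modifying the caller's dict."""
    # Deep copy so nested values (e.g. the stop list) are also independent.
    kwargs = copy.deepcopy(gen_kwargs)
    max_tokens = kwargs.pop("max_gen_toks", 256)  # hypothetical key and default
    stop = kwargs.pop("until", [])
    return {"max_tokens": max_tokens, "stop": stop, **kwargs}


shared = {"max_gen_toks": 32, "until": ["\n"], "temperature": 0.0}
payload = prepare_request(shared)
```

After the call, `shared` still holds all three of its original keys, so it can be reused for the next request.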

08 Jul, 2024 1 commit

Easier unitxt tasks loading and removal of unitxt library dependency (#1933) · ad80f555

Elron Bandel authored Jul 08, 2024



* Updated unitxt loading
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Revert change to general Readme
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Adjust fda, squadv2, squad_completion and swde to accept config in the constructor
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Fix scrolls
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update documentation
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Enforce backward compatibility
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Format unitxt class
Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

ad80f555

01 Jul, 2024 1 commit
- update to v0.4.3 (#2046) · 3fa4fd72
  Hailey Schoelkopf authored Jul 01, 2024
  
  3fa4fd72
30 May, 2024 1 commit

[HFLM]Add support for Ascend NPU (#1886) · 8f716817

Huazhong Ji authored May 31, 2024



* [HFLM]Add support for Ascend NPU
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

8f716817

23 May, 2024 1 commit
- Unpin vllm in dependencies (#1874) · 5711ab87
  Edward Gan authored May 23, 2024
  
  5711ab87
07 May, 2024 1 commit

Initial integration of the Unitxt to LM eval harness (#1615) · 885f48d6

Yoav Katz authored May 08, 2024

* Initial support for Unitxt datasets in LM Eval Harness

See https://github.com/IBM/unitxt



The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed and print a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

885f48d6
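
The second review round in this commit mentions checking that unitxt is installed and printing a clear error message if it is not. One way such a check can be written is sketched below, using the standard-library `importlib.util.find_spec`. `require_optional` is a hypothetical helper and the message wording is an assumption; only the `pip install 'lm_eval[unitxt]'` hint comes from the commit message.

```python
import importlib.util


def require_optional(package: str, extra: str) -> None:
    """Raise a descriptive ImportError if an optional package is missing."""
    # find_spec returns None for a top-level package that cannot be found,
    # without actually importing it.
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"'{package}' is not installed. "
            f"Install it with: pip install 'lm_eval[{extra}]'"
        )


# A standard-library module is always present, so this call passes silently.
require_optional("json", "unitxt")
```

Calling the helper at the point of use, rather than at module import time, keeps the base install usable when the extra is absent.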

16 Apr, 2024 1 commit

Add `neuralmagic` models for `sparseml` and `deepsparse` (#1674) · 8b326be7

Michael Goin authored Apr 16, 2024



* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

8b326be7

18 Mar, 2024 1 commit

Cleanup for v0.4.2 release (#1573) · 5627e819

Hailey Schoelkopf authored Mar 18, 2024

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

5627e819

06 Mar, 2024 1 commit

Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1533) · 4ee1b386

LSinev authored Mar 06, 2024



* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided)

* Fix improper import of LM and usage of evaluator in one of scripts

* update type hints in instance and task api

* raising errors in task.py instead of asserts

* Fix warnings from ruff

* raising errors in __main__.py instead of asserts

* raising errors in tasks/__init__.py instead of asserts

* raising errors in evaluator.py instead of asserts

* evaluator: update type hints and remove unused variables in code

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/__main__.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/api/task.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/evaluator.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit induced fixes

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4ee1b386

03 Mar, 2024 1 commit

Setting trust_remote_code to True for HuggingFace datasets compatibility (#1487) · 95167926

Vicki Boykis authored Mar 03, 2024

* setting trust_remote_code

* dataset list no notebooks

* respect trust remote code

* Address changes, move cli options and change datasets

* fix task for tests

* headqa

* remove kobest

* pin datasets and address comments

* clean up space

95167926