Commits · 690ef8ba7c02dbb20e31f1d068eeb2e01091f2b3 · gaoqiong / lm-evaluation-harness

14 Oct, 2025 1 commit

Leverage vllm's `tokenizer_info` endpoint to avoid manual duplication (#3185) · 690ef8ba

Mac Misiura authored Oct 14, 2025

* ✨

 added an approach to use tokenizer_info endpoint from vllm
Signed-off-by: m-misiura <mmisiura@redhat.com>

* 🚧

 removed all auto-detection and tokenization logic from `LocalChatCompletion`

* pacify pre-commit

---------
Signed-off-by: m-misiura <mmisiura@redhat.com>
Co-authored-by: Baber <baber@hey.com>

27 Aug, 2025 1 commit
- pacify pre-commit (#3268) · 3a9bcc3f
  Baber Abbasi authored Aug 27, 2025
  
25 Aug, 2025 1 commit
- Add support for OpenVINO text2text generation models (#3101) · 05b37f20
  Nikita Savelyev authored Aug 25, 2025
  * Add support for OVModelForSeq2SeqLM

  * Add test
06 Jul, 2025 1 commit
- delete neuralmagic models (#3112) · f93001db
  Baber Abbasi authored Jul 06, 2025
  
14 Mar, 2025 1 commit
- use verify_certificate flag in batch requests (#2785) · 3b7dbef9
  daniel-salib authored Mar 14, 2025
  
04 Mar, 2025 1 commit

Enable steering HF models (#2749) · d35008f1

Lucia Quirke authored Mar 04, 2025



* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

25 Feb, 2025 1 commit

Support SGLang as Potential Backend for Evaluation (#2703) · 29971faa

Jinwei authored Feb 25, 2025



* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and it passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

30 Nov, 2024 1 commit
- make utility function to handle `until` (#2518) · 0230356c
  Baber Abbasi authored Nov 30, 2024
  * make utility function to handle `until`

  * fix text
31 Oct, 2024 1 commit

Add GPTQModel support for evaluating GPTQ models (#2217) · 4f8e479e

Qubitium-ModelCloud authored Nov 01, 2024



* support gptqmodel

* code opt

* add gptqmodel option

* Update huggingface.py

* Update pyproject.toml

* gptqmodel version upgraded to 1.0.6

* GPTQModel version upgraded to 1.0.8

* Update pyproject.toml

* fix ruff-format error

* add gptqmodel test

* Update gptqmodel test model

* skip cuda

* python3.8 compatible

* Update README.md

* Update README.md

---------
Co-authored-by: CL-ModelCloud <cl@modelcloud.ai>

04 Oct, 2024 1 commit
- fix tests (#2380) · 5e0bc289
  Baber Abbasi authored Oct 04, 2024
  
18 Sep, 2024 1 commit

Update neuron backend (#2314) · 9a092f37

David Corvoysier authored Sep 18, 2024

* feat(neuron): align with latest optimum-neuron

* feat(neuron): support pre-exported neuron models

* fix(neuron): correctly use max_length

* fix(neuron): adapt loglikelihood

The evaluation of log likelihood was not working for neuron models
using continuous batching, such as all cached neuron LLama models.

* refactor(neuron): remove dead code

01 Aug, 2024 1 commit
- [Bugfix] add temperature=0 to logprobs and seed args to API models (#2149) · 63e76e89
  Baber Abbasi authored Aug 01, 2024
  * add temperature for log probs

  * add seed

  * nit

  * add new args to test

  * added warning for api chat models
22 Jul, 2024 1 commit

Refactor API models (#2008) · 42dc2448

Baber Abbasi authored Jul 23, 2024



* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack loglikelihood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

25 Jun, 2024 2 commits

Factor out LM-specific tests (#1859) · 0366c74f

Hailey Schoelkopf authored Jun 25, 2024

* separate out optimum/neuralmagic tests to separate job

* fix vllm tests

* fix bug in --trust_remote_code

* use datasets.config instead intentionally

* fix remote code issue?

Remove `LM` dependency from `build_all_requests` (#2011) · 9b6179b2

Baber Abbasi authored Jun 25, 2024

* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup

31 May, 2024 1 commit
- Try to make existing tests run little bit faster (#1905) · 1060b68d
  LSinev authored May 31, 2024
  
06 May, 2024 1 commit

Provide ability for custom sampler for ConfigurableTask (#1616) · ae72cebc

LSinev authored May 06, 2024

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

Way to control seed of fewshot sampling
may help with #1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

02 May, 2024 1 commit
- Add option to set OpenVINO config (#1730) · e6394715
  Helena Kloosterman authored May 02, 2024
  * Add option to set OpenVINO config

  * Use utils.eval_logger for logging
16 Apr, 2024 1 commit

Add `neuralmagic` models for `sparseml` and `deepsparse` (#1674) · 8b326be7

Michael Goin authored Apr 16, 2024



* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

20 Mar, 2024 1 commit

Fixes to Loglikelihood prefix token / VLLM (#1611) · c7b03ad4

Hailey Schoelkopf authored Mar 20, 2024

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

22 Feb, 2024 1 commit

Fixed generation args issue affecting OpenAI completion model (#1458) · 75ac1f47

Amine Elhattami authored Feb 23, 2024

* Fixed generation args issue affecting OpenAI completion model

* Fixed hf unit test; removed pop attributes in OpenAI completion.

* fix format

* fix format

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

20 Feb, 2024 1 commit

Group reqs by context (#1425) · 45941c67

Baber Abbasi authored Feb 20, 2024



* add key lookup for same contexts

* nit

* appease pre-commit

* nit

* use `expand` (in-place view) rather than `repeat`

* try mixed grouping

* add docs.

* nit

* nit

* nits

* fix tests

* Move greedy_tokens calculation out of cache loop

* nit

* nits

* add test

* nits

* fix name conflict

* fix name conflict

* chunk tensor

* move Collator

* nits/docstring

* fixup

* fixup

* group contexts only for decoders

* pre-commit

* fix `generate_until` test

* fix `generate_until` test

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add docs

* nit

* add docs

* add docs

* add 'logits_cache' arg

* bugfix

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

05 Feb, 2024 1 commit

Support for Inf2 optimum class [WIP] (#1364) · d17dcea0

Michael Feil authored Feb 05, 2024

* initial commit

* remove overwrite bs

* adding neuronx dependencies

* Update README.md

* update neuronx

01 Feb, 2024 1 commit

Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95

Lintang Sutawika authored Feb 01, 2024



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* group in a list of group can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-base config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

26 Jan, 2024 1 commit

Add causalLM OpenVino models (#1290) · 97a67d27

NoushNabi authored Jan 26, 2024



* added intel optimum

* added intel optimum in readme

* modified intel optimum

* modified intel optimum

* modified intel optimum

* modified install optimum

* modified path of IR file

* added openvino_device

* added openvino_device2

* changed optimum-causal to openvino-causal

* Update README.md

* Update README.md

* remove `lm_eval.base` import

* update openvino-causal -> openvino ; pass device through super().__init__()

* Update README.md

* Add optimum to tests dependencies

* apply pre-commit

* fix so tests pass

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

20 Dec, 2023 1 commit

Switch Linting to `ruff` (#1166) · 65b8761d

Baber Abbasi authored Dec 20, 2023

* add ruff and isort. remove black and flake8

* remove unnecessary dependencies

* remove dependency from table

* change order

* ran ruff

* check 3.9

* exclude evaluator

* update CI workflow

* use ruff config in pyproject.toml

* test

* add isort rules to ruff

* sort imports

* import `make_table`

* try stages for no-commit-to-branch

* turn on mypy for pre-commit

* test

* test

* test

* change no-commit-to-branch to default

* nits

* fixed dependency

27 Nov, 2023 3 commits
- skip vllm test (requires CUDA) · 05326b04
  baberabb authored Nov 27, 2023
  
- lazy load import in tests · 2c20df08
  baberabb authored Nov 27, 2023
  
- Add tests for VLLM · 026005ac
  baberabb authored Nov 27, 2023
  
20 Nov, 2023 1 commit
- add ggml tests · f87b864f
  haileyschoelkopf authored Nov 20, 2023
  
17 Nov, 2023 1 commit
- initialize tasks in test files · 032820dd
  haileyschoelkopf authored Nov 17, 2023
  
16 Nov, 2023 2 commits
- widen HF model test tolerance · 57ce579c
  haileyschoelkopf authored Nov 16, 2023
  
- update gsm8k name · 1eed1cb5
  haileyschoelkopf authored Nov 16, 2023
  
18 Oct, 2023 1 commit
- replace the rest of the 'greedy_until' occurrences · c28d100d
  haileyschoelkopf authored Oct 18, 2023
  
06 Sep, 2023 3 commits
- changed atol · dfc47bd5
  baberabb authored Sep 06, 2023
  
- artifact loglikelihood results · 96c60cf6
  baberabb authored Sep 06, 2023
  
- added torch deterministic · f1e62d36
  baberabb authored Sep 05, 2023
  
20 Aug, 2023 1 commit
- increase atol · 21706139
  baberabb authored Aug 20, 2023
  
19 Aug, 2023 2 commits
- atol=1e-8 · 51882c1e
  baberabb authored Aug 19, 2023
  
- get _RES from CI and change atol to 0.0001 · ac290ce8
  baberabb authored Aug 19, 2023
  