Commits · 003e5852ce1344120d22808a865cc1fea46f45a1 · gaoqiong / lm-evaluation-harness

04 Oct, 2025 1 commit

Baber Abbasi authored Oct 04, 2025



* overhaul `ContextSampler`

* refactor masakhapos

* move multi_target to `exact_match`

* remove doc_to_choice from `boolq-seq2seq`

* remove doc_to_choice in generation process_results

* Remove unused `doc_to_choice` and fix superglue whitespaces

* require multiple_inputs and multiple_targets to be explicitly set in taskconfig

* fix copa; better logging in task init

* fix doc_to_target to return int rather than str (deprecated)

* fix processing regression; recursively parse lists from template

* remove redundant jinja parsing logic

* remove promptsource

* for multiple_inputs use `doc_to_text: list[str]`

* Refactor `ContextSampler` `fewshot_context`

* fix multiple_input context

* fix `target_delimiter` with `gen_prefix`

* `doc_to_text` is list for multiple_inputs

* Refactor `count_bytes` and `count_words` methods to `@staticmethod`

* make has_*(train/test/validation) properties

* remove `multi_target` `generate_until`

* fix doc_to_target/multiple_targets handling; add tests

* rename `multi_target` to `multiple_targets`

* evaluate list when multiple targets

* allow doc_to_target to return list

* Remove gen_prefix space and add warning (#3239)

* Remove gen_prefix space and add warning

* fix null gen_prefix bug again

* use git tests

---------
Co-authored-by: Boaz Ben-Dov <bendboaz@gmail.com>

003e5852

25 Sep, 2025 3 commits
- add tests · fadd26e4
  Baber authored Jul 04, 2025
  
  fadd26e4
- refactor: improve dataset and metric handling in TaskConfig · 227f1a74
  Baber authored Jul 08, 2025
  
  227f1a74
- refactor configs to files · 57adbd35
  Baber authored Jul 04, 2025
  
  57adbd35
27 Aug, 2025 1 commit
- pacify pre-commit (#3268) · 3a9bcc3f
  Baber Abbasi authored Aug 27, 2025
  
  3a9bcc3f
25 Aug, 2025 1 commit
- Add support for OpenVINO text2text generation models (#3101) · 05b37f20
  Nikita Savelyev authored Aug 25, 2025
```
* Add support for OVModelForSeq2SeqLM

* Add test
```
  05b37f20
04 Aug, 2025 1 commit

improve include-path precedence handling (#3068) · 3214d468

parkhs21 authored Aug 04, 2025



* improve include-path precedence handling

* test: add task for test

* add test for include path precedence handling

* Refactor `test_include_path.py`

---------
Co-authored-by: Baber <baber@hey.com>

3214d468

26 Jul, 2025 3 commits
- fix tests · 00afd536
  Baber authored Jul 27, 2025
  
  00afd536
- fix test · 1de882c2
  Baber authored Jul 27, 2025
  
  1de882c2
- add task factory · 4254c7bd
  Baber authored Jul 26, 2025
  
  4254c7bd
25 Jul, 2025 3 commits
- add TaskRegistry · eec9de3e
  Baber authored Jul 25, 2025
  
  eec9de3e
- refactor yaml loading · c8f9991f
  Baber authored Jul 25, 2025
  
  c8f9991f
- fix · 4a0a8bd8
  Baber authored Jul 25, 2025
  
  4a0a8bd8
19 Jul, 2025 1 commit

[tests] Added missing fixture in test_unitxt_tasks.py (#3163) · 8c05cfe0

Avelina Asada Hadji-Kyriacou authored Jul 19, 2025



* Added missing fixture in test_unitxt_tasks.py

* pacify pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

8c05cfe0

13 Jul, 2025 1 commit
- add: create new YAML configurations for task and group setups · 45c11c31
  Baber authored Jul 14, 2025
  
  45c11c31
11 Jul, 2025 3 commits
- fix circular · 6fc2ac49
  Baber authored Jul 12, 2025
  
  6fc2ac49
- refactor: migrate utils functions to lm_eval.tasks and update references · a9c16905
  Baber authored Jul 12, 2025
  
  a9c16905
- refactor: add apply_template function and improve lazy initialization · e11fa05d
  Baber authored Jul 11, 2025
  
  e11fa05d
10 Jul, 2025 1 commit

warning for "chat" pretrained; disable buggy evalita configs (#3127) · f3a0b554

Baber Abbasi authored Jul 10, 2025

* check for chat for warning

* add test

* remove yaml extension from some evalita configs

* move unitxt to own test script

* fix CI test

f3a0b554

06 Jul, 2025 1 commit
- delete neuralmagic models (#3112) · f93001db
  Baber Abbasi authored Jul 06, 2025
  
  f93001db
05 Jul, 2025 1 commit

Fixed #3005: Processes both formats of model_args: string and dictionary (#3097) · 0e96cd18

Debjyoti Ray authored Jul 05, 2025



* git push --force
correctly processes both formats of model_args: string and dictionary

* extract to function for better test

* nit

---------
Co-authored-by: Baber <baber@hey.com>

0e96cd18

04 Jul, 2025 1 commit

[FIX] Initial code to disable multi-proc for stderr (#3106) · 71d0289d

Neel Gupta authored Jul 04, 2025



* [FIX] Initial code to disable multi-proc for stderr

* add docs; align no-mp bootstrap with mp

---------
Co-authored-by: Baber <baber@hey.com>

71d0289d

03 Jun, 2025 1 commit
- [Fix] acc_mutual_info metric calculation bug (#3035) · 3f792954
  Baber Abbasi authored Jun 03, 2025
```
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests
```
  3f792954
16 Apr, 2025 1 commit
- mmlu - switch dataset to cais/mmlu; fix tests (#2918) · cb316a18
  Baber Abbasi authored Apr 16, 2025
```
* switch MMLU to cais/mmlu

* switch back to tj-actions/changed-files

* cache HF folder
```
  cb316a18
17 Mar, 2025 1 commit

Add INCLUDE tasks (#2769) · 6fbebb4b

Angelika Romanou authored Mar 17, 2025



* Add INCLUDE tasks

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

6fbebb4b

14 Mar, 2025 1 commit
- use verify_certificate flag in batch requests (#2785) · 3b7dbef9
  daniel-salib authored Mar 14, 2025
  
  3b7dbef9
04 Mar, 2025 2 commits

Add test for a simple Unitxt task (#2742) · 8b5c5c13

Kiersten Stokes authored Mar 04, 2025

* Add a test for a custom unitxt task

* Update task.py to bring in line with breaking change in v1.17.2

* Fix lint

8b5c5c13

Enable steering HF models (#2749) · d35008f1

Lucia Quirke authored Mar 04, 2025



* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

d35008f1

25 Feb, 2025 1 commit

Support SGLang as Potential Backend for Evaluation (#2703) · 29971faa

Jinwei authored Feb 25, 2025



* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and it passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

29971faa

19 Jan, 2025 1 commit
- update pre-commit (#2632) · f724be69
  Baber Abbasi authored Jan 19, 2025
```
* update pre-commit
```
  f724be69
04 Dec, 2024 1 commit
- add better testing when doc_to_text ends in whitespace and target_delimiter is also whitespace (#2535) · 6824d39d
  Baber Abbasi authored Dec 04, 2024
  
  6824d39d
30 Nov, 2024 1 commit
- make utility function to handle `until` (#2518) · 0230356c
  Baber Abbasi authored Nov 30, 2024
```
* make utility function to handle `until`

* fix text
```
  0230356c
20 Nov, 2024 1 commit
- Nits (#2500) · 867413f8
  Baber Abbasi authored Nov 20, 2024
```
* fix test task

* dont call lm.chat_template each time
```
  867413f8
18 Nov, 2024 1 commit

Add metabench task to LM Evaluation Harness (#2357) · 62b4364d

Kozzy Voudouris authored Nov 18, 2024



* Add metabench (Kipnis et al. 2024)

* Update metabench tasks for full replication of original benchmarks, using publicly available datasets

* Remove unnecessary import

* Add permute versions of each task, where the answer orders are randomly shuffled.

* Add metabench group for easier evaluations

* Fix mmlu counts after removing duplicate

* Add secondary datasets

* Fix f-string error

* Fix f-string error for permute processing

* Add original hash to outputs for easy matching to original results

* Add line break at end of utils files

* Remove extra line from winogrande

* Reformat for linters

* fix multiple input test

* appease pre-commit

* Add metabench to tasks README

* fix multiple input `test_doc_to_text`

---------
Co-authored-by: Baber <baber@hey.com>

62b4364d

09 Nov, 2024 1 commit

OpenAI ChatCompletions: switch `max_tokens` (#2443) · 060e8761

Baber Abbasi authored Nov 09, 2024

* switch `max_tokens` for `max_completion_tokens`. OpenAI ChatCompletions

* remove stop, temp=1 for o1

* add chat assertion

* HF_DATASETS_TRUST_REMOTE_CODE = True for task tests

* move warning

060e8761

31 Oct, 2024 1 commit

Add GPTQModel support for evaluating GPTQ models (#2217) · 4f8e479e

Qubitium-ModelCloud authored Nov 01, 2024



* support gptqmodel

* code opt

* add gptqmodel option

* Update huggingface.py

* Update pyproject.toml

* gptqmodel version upgraded to 1.0.6

* GPTQModel version upgraded to 1.0.8

* Update pyproject.toml

* fix ruff-format error

* add gptqmodel test

* Update gptqmodel test model

* skip cuda

* python3.8 compatible

* Update README.md

* Update README.md

---------
Co-authored-by: CL-ModelCloud <cl@modelcloud.ai>

4f8e479e

04 Oct, 2024 1 commit
- fix tests (#2380) · 5e0bc289
  Baber Abbasi authored Oct 04, 2024
  
  5e0bc289
26 Sep, 2024 3 commits

mmlu-pro: add newlines to task descriptions (not leaderboard) (#2334) · 558d0d71

Baber Abbasi authored Sep 27, 2024



* add newlines to task descriptions; increment versions

* fix task tests (with groups)

* Apply suggestions from code review

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

558d0d71

change glianorex to test split (#2332) · 7d242381

Baber Abbasi authored Sep 26, 2024

* change glianorex to test set

* nit

* fix test; doc_to_target can be str for multiple_choice

* nit

7d242381

Treat tags in python tasks the same as yaml tasks (#2288) · b2bf7bc4

Giulio Lovisotto authored Sep 26, 2024

* Treat python tasks same as yaml tasks.

* Add tests.

* Re-add fixture decorators.

* Fix typing specification error for Python 3.9.

b2bf7bc4