- 09 May, 2025 1 commit
Baber Abbasi authored
- 06 May, 2025 5 commits
Stella Biderman authored
This hasn't been a library for few-shot language model evaluation in quite a while. Let's update the citation to use "the Language Model Evaluation Harness" as the title.
Ihar Hrachyshka authored
This is useful to run unit tests during distro builds.
Anna Fontana authored
* Fix import error for eval_logger in score utils
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
Vladislav Mikhailov authored
* added noreval
* added a checklist for noreval
* run pre-commit
* changed imports and added short noreval description
* fixed norsumm path
* refactored multi-folder tasks
Alexandre Marques authored
- 29 Apr, 2025 1 commit
Baber Abbasi authored
- 18 Apr, 2025 1 commit
Avelina9X authored
* Added softmax_dtype argument to coerce log_softmax computations
* move softmax_dtype
---------
Co-authored-by: Baber <baber@hey.com>
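The addition lets loglikelihood scoring run its log_softmax in a wider dtype than the model weights. A minimal sketch of how the option would be passed through the HF backend's model_args, assuming the key is forwarded to the model wrapper; the checkpoint and task below are placeholders, not taken from this commit:

```python
# Minimal sketch, not from the commit itself: run the model in bfloat16 but ask for
# the log_softmax over logits to be computed in float32 via the new softmax_dtype key.
# Assumes model_args keys are forwarded to the HF model wrapper; checkpoint/task are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,dtype=bfloat16,softmax_dtype=float32",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])
```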
- 16 Apr, 2025 5 commits
Baber Abbasi authored
* add warning for default `until`
* fix stop tokens; add vcsum
* bugfix: fix doc_to_target to string
* fix lsht, trec
* add task to readme
* add debugging logs for multiple input/output
achervyakov authored
Baber Abbasi authored
* switch MMLU to cais/mmlu
* switch back to tj-actions/changed-files
* cache HF folder
Baber Abbasi authored
* fix resolve_hf_chat_template version
* pre-commit
Eldar Kurtic authored
- 15 Apr, 2025 1 commit
Jerry Zhang authored
* Add support for quantization_config
  Summary: Previously quantization_config was ignored, so torchao quantized models were not supported; this PR adds that.
  Test Plan: lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8
* quantization_config is optional
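The test plan above already shows the CLI invocation; a rough Python equivalent through simple_evaluate might look like the sketch below. The call shape is an assumption; the checkpoint name is the one from the test plan.

```python
# Rough Python equivalent of the commit's CLI test plan: with quantization_config no
# longer ignored, a torchao-quantized checkpoint loads through the plain `hf` backend.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=jerryzh168/gemma3-int4wo",  # quantization_config is read from the checkpoint
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
```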
- 14 Apr, 2025 2 commits
Daniele authored
Alexandre Marques authored
* Add support for chat templates defined outside of tokenizer_config.json, as supported by vLLM
* Update template name to avoid conflict with other variable
- 07 Apr, 2025 1 commit
Felipe Maia Polo authored
Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)
* added option --examples
* specifying examples in dictionary
* run pre-commit; fix arg type
* fixing bug when examples==None
* limit or examples must be None in simple_evaluate.py and in evaluator.py
* run pre-commit (fix formatting)
* merge main and run pre-commit (fix formatting)
* Update __main__.py: undefined "limit" and "examples"
* update branch, fix conflicts, run pre-commit
* nits
* change 'examples' to 'samples'
---------
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>
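A minimal sketch of the resulting API, assuming `samples` takes a mapping of task name to document indices (the indices below are arbitrary); per the commit, `limit` must be left unset when it is used:

```python
# Hedged sketch of the new samples argument: score only selected documents of a task
# instead of a contiguous --limit prefix. The task name and indices are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    samples={"hellaswag": [0, 5, 42, 1337]},  # assumed shape: task name -> doc indices
    limit=None,  # limit and samples are mutually exclusive
    batch_size=8,
)
```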
- 04 Apr, 2025 3 commits
Qubitium-ModelCloud authored
* add gsm8k platinum
* only test splits
* wrong dataset
* link to blog
* format
Nikodem Szwast authored
* update authentication methods, add support for deployment_id
* run pre-commit on changed file
Michele Resta authored
* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* fix: fastest eval for summarization
* chore: linted with ruff
---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
- 03 Apr, 2025 1 commit
Lu Fang authored
Signed-off-by: Lu Fang <lufang@fb.com>
- 02 Apr, 2025 2 commits
Baber Abbasi authored
* add subtask scores
* pacify pre-commit
Saibo-creator authored
* Add JSON schema benchmark
* Update lm_eval/tasks/jsonschema_bench/metrics.py (thanks for catching this)
* run pre-commit
* add description to task catalogue readme
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
- 01 Apr, 2025 2 commits
Daniel Holanda authored
Baber Abbasi authored
* sync with leaderboard
* also output old metric
* wrap old extraction in try except
* better log
- 30 Mar, 2025 1 commit
Alexandre Marques authored
* llama-style MMLU CoT
* Refactor MMLU CoT template YAML to simplify 'until' structure
* Add GSM8K task configuration for LLaMA3 with few-shot examples
* Fix missing newline at end of MMLU CoT YAML file
* Add ARC-Challenge task configuration and processing utility
* Add additional MMLU and ARC-Challenge task variants to README
* Update README with notes on arc_challenge_llama dataset preprocessing
- 29 Mar, 2025 1 commit
Harsha authored
- 28 Mar, 2025 3 commits
Baber Abbasi authored
dazipe authored
* Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048 for MMLU Pro tasks
* Update lm_eval/tasks/mmlu_pro/_default_template_yaml
* pre-commit
* nit
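For runs that need different budgets, a hedged sketch of overriding the new defaults follows; it assumes max_length is accepted as a model argument and that gen_kwargs can overwrite the task's max_gen_toks (the checkpoint is a placeholder):

```python
# Hedged sketch: the 8192/2048 values are now the MMLU Pro template defaults, but both
# can still be adjusted per run. Assumes max_length passes through model_args and
# gen_kwargs merges over the task's generation_kwargs; checkpoint is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,max_length=8192",
    tasks=["mmlu_pro"],
    gen_kwargs="max_gen_toks=1024",  # shrink the generation budget for quicker runs
    batch_size=4,
)
```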
Hadi Abdine authored
* add Darija tasks
* fix multiple groups issue in darijammlu
* add MT to the description of the Darija tasks
* Update README.md nit
* fix the recursion error caused by the darija_summarization task
* use a custom filter instead of the decorator for the strip function
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
- 27 Mar, 2025 3 commits
- 26 Mar, 2025 1 commit
Baber Abbasi authored
- 25 Mar, 2025 1 commit
Alexandre Marques authored
* Multilingual MMLU
* Refactor process_docs function calls for clarity and consistency
- 23 Mar, 2025 1 commit
Bruno Carneiro authored
I haven't had time to review the library that's replacing tj-actions or whether this change breaks anything, but the vulnerability is quite severe and I would rather the functionality be broken than risk compromise. **to do:** review this later
- 21 Mar, 2025 2 commits
Alexandre Marques authored
heli-qi authored
* update mmlu_prox configs
* update tasks/README
* correct hyphen to underscore in tasks/README
* update pre-commit codes
- 20 Mar, 2025 2 commits
Alexandre Marques authored
* Update generation_kwargs in default template to include additional end tokens
* Update filter_list in MMLU Pro configuration to use strict_match
* Update _default_template_yaml
Baber Abbasi authored