Commits · b5d16d61ebadea0a0e1d7c57f49c2947b852fe4d · gaoqiong / lm-evaluation-harness

08 Apr, 2025 1 commit
- enable evaluation from yaml config file · b5d16d61
  artemorloff authored Apr 09, 2025
  
07 Apr, 2025 1 commit

Add `--samples` Argument for Fine-Grained Task Evaluation in... · d693dcd2

Felipe Maia Polo authored Apr 07, 2025


 Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)

* added option --examples

* specifying examples in dictionary

* run pre-commit - fix arg type

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* fixing bug when examples==None

* fixing bug when examples==None

* limit or examples must be None in simple_evaluate.py and in evaluator.py

* run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* merge main and run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* Update __main__.py

undefined "limit" and "examples"

* update branch, fix conflicts, run pre-commit

* nits

* nits

* change 'examples' to 'samples'

---------

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>
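
The commit message above says that samples are specified as a dictionary and that either `limit` or `samples` must be `None`. A minimal sketch of that rule follows. The helper names (`parse_samples`, `check_limit_and_samples`, `select_docs`) and the JSON shape (task name mapped to a list of document indices) are assumptions made for illustration, not the harness's actual code.

```python
import json
from typing import Optional


def parse_samples(raw: Optional[str]) -> Optional[dict]:
    """Parse a JSON string such as '{"task_a": [0, 2]}' (assumed format)."""
    if raw is None:
        return None
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("samples must be a JSON object: task name -> list of indices")
    # Normalise indices to integers so they can be used for list indexing.
    return {task: [int(i) for i in indices] for task, indices in parsed.items()}


def check_limit_and_samples(limit, samples) -> None:
    """Reject the case where both options are set (hypothetical helper)."""
    if limit is not None and samples is not None:
        raise ValueError("Specify either limit or samples, not both")


def select_docs(docs: list, task: str, samples: Optional[dict]) -> list:
    """Keep only the requested documents; keep everything if none were requested."""
    if samples is None or task not in samples:
        return docs
    return [docs[i] for i in samples[task]]


if __name__ == "__main__":
    samples = parse_samples('{"task_a": [0, 2]}')
    check_limit_and_samples(limit=None, samples=samples)
    print(select_docs(["doc0", "doc1", "doc2"], "task_a", samples))
```

Rejecting the combination up front keeps the two selection mechanisms from silently overriding each other.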


04 Apr, 2025 3 commits

Add GSM8K Platinum (#2771) · 11ac352d
Qubitium-ModelCloud authored Apr 04, 2025
```
* add gsm8k platinum

* only test splits

* wrong dataset

* link to blog

* format
```
Update authentication methods, add support for deployment_id for IBM watsonx_ai (#2877) · 1da9e4e8
Nikodem Szwast authored Apr 04, 2025
```
* update authentication methods, add support for deployment_id

* run pre-commit on changed file
```

Optimization for evalita-llm rouge computation (#2878) · 22bd2bcb

Michele Resta authored Apr 04, 2025



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* fix: fastest eval for summarization

* chore: linted with ruff

* chore: linted with ruff

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>


02 Apr, 2025 2 commits

leaderboard - add subtask scores (#2867) · ac0bc1df
Baber Abbasi authored Apr 02, 2025
```
* add subtask scores

* pacify pre-commit
```

Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865) · 6cc41d34

Saibo-creator authored Apr 02, 2025



* Add JSON schema benchmark

* Update lm_eval/tasks/jsonschema_bench/metrics.py

Thanks for catching this
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* run pre-commit

* add description to task catalogue readme

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>


01 Apr, 2025 1 commit

[leaderboard] math - sync with repo (#2817) · 61ee8678

Baber Abbasi authored Apr 01, 2025

* sync with leaderboard

* also output old metric

* wrap old extraction in try except

* better log


30 Mar, 2025 1 commit

Adds MMLU CoT, gsm8k and arc_challenge for llama instruct (#2829) · 3816796e

Alexandre Marques authored Mar 30, 2025

* llama-style MMLU CoT

* Refactor MMLU CoT template YAML to simplify 'until' structure

* Add GSM8K task configuration for LLaMA3 with few-shot examples

* Fix missing newline at end of MMLU CoT YAML file

* Add ARC-Challenge task configuration and processing utility

* Add additional MMLU and ARC-Challenge task variants to README

* Update README with notes on arc_challenge_llama dataset preprocessing


29 Mar, 2025 1 commit
- Fix: ACPBench Link (#2860) · 1514ac1e
  Harsha authored Mar 28, 2025
  
28 Mar, 2025 2 commits

Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests (#2824) · 8850ebc0

dazipe authored Mar 28, 2025

* Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048 for MMLU Pro tasks.

* Update lm_eval/tasks/mmlu_pro/_default_template_yaml

* pre-commit

* nit

---------


add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench (#2521) · ebbbb968

Hadi Abdine authored Mar 28, 2025



* add Darija tasks

* fix multiple groups issue in darijammlu

* add MT to the description of the Darija tasks

* Update README.md

nit

* fix the recursion error caused by the darija_summarization task

* use a custom filter instead of the decorator for the strip function

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>


27 Mar, 2025 3 commits
- Adding ACPBench task (#2807) · 5a9d5ba0
  Harsha authored Mar 27, 2025
```
* Adding acpbench task

* adding ACPBench in Tasks readme.

* running precommit
```
- Add kmmlu multiple-choice(accuracy) task (#2849) · ded890f3
  Jinho Heo authored Mar 28, 2025
  
- Fix typo in longbench metrics (#2854) · febd19d8
  wackey authored Mar 28, 2025
  
26 Mar, 2025 1 commit
- changed dataset to parquet version (#2845) · 908ac2b2
  Baber Abbasi authored Mar 26, 2025
  
25 Mar, 2025 1 commit
- Multilingual MMLU for Llama instruct models (#2826) · 1b357a68
  Alexandre Marques authored Mar 25, 2025
```
* Multilingual MMLU

* Refactor process_docs function calls for clarity and consistency
```
21 Mar, 2025 2 commits
- Remove unnecessary nested list in MMLU-Pro default template YAML (#2827) · fd93c6c4
  Alexandre Marques authored Mar 21, 2025
  
- Add MMLU-ProX task (#2811) · 8aeff141
  heli-qi authored Mar 21, 2025
```
* update mmlu_prox configs

* update tasks/README

* correct hyphen to underscore in task/README

* update pre-commit codes
```
20 Mar, 2025 5 commits

Fixes to mmlu_pro_llama (#2816) · 8028a42f

Alexandre Marques authored Mar 20, 2025

* Update generation_kwargs in default template to include additional end tokens

* Update filter_list in MMLU Pro configuration to use strict_match

* Update _default_template_yaml


[VLLM, SLANG] default temp=0.0 (#2819) · c6b9aeeb
Baber Abbasi authored Mar 20, 2025

fix typo (#2820) · 110e65da
Baber Abbasi authored Mar 20, 2025

Configure the pad tokens for Qwen when using vLLM (#2810) · 61b63da7
Yifei Zhang authored Mar 20, 2025


Llama3 mmlu correction (#2797) · c73b43f4

Alexandre Marques authored Mar 19, 2025

* Update continuation template YAML for MMLU task with new generation and filtering options

* Refactor filter_list structure in continuation template YAML for improved readability

* Add 'take_first' function to filter_list in continuation template YAML

* Update filter_list in continuation template YAML to use 'strict_match' and modify filtering functions

* Add 'do_sample' option to generation_kwargs in MMLU template YAML
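
The bullets above mention a `strict_match` filter and a `take_first` step. As a rough illustration of what such a two-stage filter chain does, the sketch below extracts an answer letter from each generated response with a regular expression and then keeps only the first candidate per document. The function names, the regex pattern, and the `[invalid]` fallback are assumptions for this example and are not taken from the YAML in this commit.

```python
import re


def regex_extract(responses, pattern, fallback="[invalid]"):
    """Apply a regex to every candidate response of every document.

    `responses` is a list with one inner list of generated strings per document.
    """
    compiled = re.compile(pattern)
    extracted_per_doc = []
    for candidates in responses:
        extracted = []
        for text in candidates:
            match = compiled.search(text)
            # Use the first capture group, or a fallback marker when nothing matches.
            extracted.append(match.group(1) if match else fallback)
        extracted_per_doc.append(extracted)
    return extracted_per_doc


def take_first(responses):
    """Keep only the first candidate for each document."""
    return [candidates[0] for candidates in responses]


if __name__ == "__main__":
    raw = [["The best answer is B.", "The best answer is C."], ["no idea"]]
    extracted = regex_extract(raw, r"answer is ([A-D])")
    print(take_first(extracted))
```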


18 Mar, 2025 8 commits
- [change] group -> tag (#2813) · 7123c6a5
  Jaedong Hwang authored Mar 18, 2025
  
- Allow writing config to wandb (#2736) · 604b62c4
  Surya Kasturi authored Mar 18, 2025
```
* Allow writing config to wandb

* set defaults

* Update help

* Update help
```
- [MM] Chartqa (#2544) · b8adf3cc
  Baber Abbasi authored Mar 18, 2025
```
* add changelog to readme template

* add readme

* add to task list
```
- [hf-multimodal] pass kwargs to self.processor (#2667) · 1e2428a2
  Baber Abbasi authored Mar 18, 2025
```
* add min_pixels, max_pixels

* fix
```
- Add long-context tasks (#2629) · 80a10075
  Baber Abbasi authored Mar 18, 2025
```
support for long-context (and other synthetic) tasks
* add ruler
* add longbench
* pass `metadata` to TaskConfig
```
- Add MastermindEval (#2788) · f47ddaf8
  Jonas Golde authored Mar 18, 2025
```
* add MastermindEval benchmark

* fill out checklist
```
- Add cocoteros_va dataset (#2787) · 65ef2573
  Santiago Galiano Segura authored Mar 18, 2025
```
* Add cocoteros_va dataset

* Fix format in cocoteros_va.yml

* Undo newline added

* Execute pre-commit to fix format errors

* Update catalan_bench.yaml version and add Changelog section into Readme.md
```
- add __version__ (#2808) · fa1ce2c6
  Baber Abbasi authored Mar 18, 2025
```
* add __version__

* add version consistency check to publish action
```
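
The second bullet above adds a version consistency check to the publish action. One way such a check could work is sketched below: it compares the `__version__` string in a package's `__init__.py` with the `version` field in `pyproject.toml`. The function names and the regex-based parsing are assumptions for illustration; the actual workflow step in the repository may be implemented differently.

```python
import re


def read_init_version(init_source: str) -> str:
    """Extract the value of __version__ from the text of an __init__.py file."""
    match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_source, re.MULTILINE)
    if match is None:
        raise ValueError("__version__ not found")
    return match.group(1)


def read_pyproject_version(pyproject_source: str) -> str:
    """Extract the first top-of-line `version = "..."` entry from pyproject.toml text."""
    match = re.search(r'^version\s*=\s*["\']([^"\']+)["\']', pyproject_source, re.MULTILINE)
    if match is None:
        raise ValueError("version not found")
    return match.group(1)


def versions_match(init_source: str, pyproject_source: str) -> bool:
    """Return True when both files declare the same version string."""
    return read_init_version(init_source) == read_pyproject_version(pyproject_source)


if __name__ == "__main__":
    init_text = '__version__ = "1.2.3"\n'
    pyproject_text = '[project]\nname = "example"\nversion = "1.2.3"\n'
    print(versions_match(init_text, pyproject_text))
```

A CI step could read both files, call `versions_match`, and exit with a non-zero status on a mismatch so that publishing stops.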
17 Mar, 2025 3 commits
- Add support for token-based auth for watsonx models (#2796) · 78d57e0f
  Kiersten Stokes authored Mar 17, 2025
```
* Add support for token-based auth for watsonx models

* Fix lint

* Move dotenv import to inner scope

* Improve readability of _verify_credentials
```
- Add INCLUDE tasks (#2769) · 6fbebb4b
  Angelika Romanou authored Mar 17, 2025
```
* Add INCLUDE tasks

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
```
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot (#2802) · bb4fa95e
  Avelina9X authored Mar 17, 2025
```
* Update openllm.yaml to use train fewshot split for arc
```
14 Mar, 2025 4 commits

Add various social bias tasks (#1185) · 150a1852

Oskar van der Wal authored Mar 14, 2025



* Implementation of Winogender

* Minor fixes README.md

* Add winogender

* Clean winogender utils.py

* Change dataset to one containing All subsets

* Flesh out README for BBQ task

* Add missing tasks for BBQ

* Add simple cooccurrence bias task

* Fix wrong mask for ambiguated context+rename metrics

* Made generate_until evaluation (following PALM paper) default

Also moved separate config files per category to separate metrics using custom function.
Created config file for multiple_choice way of evaluating BBQ.

* Add missing version metadata

* Add missing version metadata for bbq multiple choice

* Fix metrics and address edge cases

* Made BBQ multiple choice the default version

* Added settings following winogrande

* Add num_fewshot to simple_cooccurrence_bias

* Fixes for bbq (multiple choice)

* Fix wrong dataset

* CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.

* Use simplest prompt possible without description

* Merge

* BBQ: Fix np.NaN related bug

* BBQ: Fix wrong aggregation method for disamb accuracy

* BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)

* BBQ: fix showing one target in case of few-shot evals

* BBQ: Fix few-shot example for bbq_generate

* BBQ: simplify subtasks

* BBQ: Minimize number of UNK variations to reduce inference time

* BBQ: Add extra UNK keywords for the generate task

* Add a generate_until version of simple_cooccurrence_bias

* Change system/description prompt to include few-shot examples

* Group agg rework

* Run pre-commit

* add tasks to readme table

* remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`

* fix

* fix

* fix version

---------
Co-authored-by: Baber <baber@hey.com>


add audio modality (qwen2 audio only) (#2689) · 62552d2c

achervyakov authored Mar 14, 2025



* Added audio-modality pipeline for qwen2-audio model

* Beautify imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>


use verify_certificate flag in batch requests (#2785) · 3b7dbef9
daniel-salib authored Mar 14, 2025

change piqa dataset path (uses parquet rather than dataset script) (#2790) · 91264653
Baber Abbasi authored Mar 14, 2025


12 Mar, 2025 1 commit
- Update evaluator.py (#2786) · 0f944779
  Zeyuan Allen-Zhu authored Mar 12, 2025
```
minor bug fix, lm_eval.setup_logging -> setup_logging
```