Commits · b8adf3cc9db34ac005d11e5de1224de244ebb007 · gaoqiong / lm-evaluation-harness

18 Mar, 2025 4 commits

[MM] Chartqa (#2544) · b8adf3cc
Baber Abbasi authored Mar 18, 2025
```
* add changelog to readme template

* add readme

* add to task list
```
b8adf3cc

Baber Abbasi authored Mar 18, 2025

support for longcontext (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig

80a10075

Add MastermindEval (#2788) · f47ddaf8
Jonas Golde authored Mar 18, 2025
```
* add MastermindEval benchmark

* fill out checklist
```
f47ddaf8

Add cocoteros_va dataset (#2787) · 65ef2573

Santiago Galiano Segura authored Mar 18, 2025

* Add cocoteros_va dataset

* Fix format in cocoteros_va.yml

* Undo newline added

* Execute pre-commit to fix format errors

* Update catalan_bench.yaml version and add Changelog section into Readme.md

65ef2573

17 Mar, 2025 2 commits
- Add INCLUDE tasks (#2769) · 6fbebb4b
  Angelika Romanou authored Mar 17, 2025
```
* Add INCLUDE tasks

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
```
  6fbebb4b
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot (#2802) · bb4fa95e
  Avelina9X authored Mar 17, 2025
```
* Update openllm.yaml to use train fewshot split for arc
```
  bb4fa95e
14 Mar, 2025 3 commits

Add various social bias tasks (#1185) · 150a1852

Oskar van der Wal authored Mar 14, 2025



* Implementation of Winogender

* Minor fixes README.md

* Add winogender

* Clean winogender utils.py

* Change dataset to one containing All subsets

* Flesh out README for BBQ task

* Add missing tasks for BBQ

* Add simple cooccurrence bias task

* Fix wrong mask for ambiguated context+rename metrics

* Made generate_until evaluation (following PALM paper) default

Also moved separate config files per category to separate metrics using custom function.
Created config file for multiple_choice way of evaluating BBQ.

* Add missing version metadata

* Add missing version metadata for bbq multiple choice

* Fix metrics and address edge cases

* Made BBQ multiple choice the default version

* Added settings following winogrande

* Add num_fewshot to simple_cooccurrence_bias

* Fixes for bbq (multiple choice)

* Fix wrong dataset

* CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.

* Use simplest prompt possible without description

* Merge

* BBQ: Fix np.NaN related bug

* BBQ: Fix wrong aggregation method for disamb accuracy

* BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)

* BBQ: fix showing one target in case of few-shot evals

* BBQ: Fix few-shot example for bbq_generate

* BBQ: simplify subtasks

* BBQ: Minimize number of UNK variations to reduce inference time

* BBQ: Add extra UNK keywords for the generate task

* Add a generate_until version of simple_cooccurrence_bias

* Change system/description prompt to include few-shot examples

* Group agg rework

* Run pre-commit

* add tasks to readme table

* remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`

* fix

* fix

* fix version

---------
Co-authored-by: Baber <baber@hey.com>

150a1852

add audio modality (qwen2 audio only) (#2689) · 62552d2c

achervyakov authored Mar 14, 2025



* Added audio-modality pipeline for qwen2-audio model

* Beauty imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>

62552d2c

change piqa dataset path (uses parquet rather than dataset script) (#2790) · 91264653
Baber Abbasi authored Mar 14, 2025

91264653

11 Mar, 2025 5 commits

humaneval instruct (#2650) · c8489857
Baber Abbasi authored Mar 11, 2025
```
* add instruct humaneval

* nit

* add to readme

* nit
```
c8489857

Capture gen_kwargs from CLI in squad_completion (#2727) · 7a2ba052

Surya Kasturi authored Mar 11, 2025



* Capture gen_kwargs from CLI in squad_completion

* Update lm_eval/tasks/squad_completion/task.py
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* Update lm_eval/tasks/squad_completion/task.py
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* pre-commit

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

7a2ba052
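
A minimal sketch of what "capturing gen_kwargs from the CLI" can look like, assuming the overrides arrive as a comma-separated `key=value` string and are layered over a task's default generation settings. The helper names `parse_gen_kwargs` and `merge_gen_kwargs` and the example keys are hypothetical and are not taken from the commit.

```python
def _coerce(value):
    """Best-effort conversion of a CLI string to int, float, or bool."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value


def parse_gen_kwargs(arg_string):
    """Parse 'key=value,key=value' into a dict (hypothetical helper)."""
    parsed = {}
    for pair in arg_string.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, _, value = pair.partition("=")
        parsed[key.strip()] = _coerce(value.strip())
    return parsed


def merge_gen_kwargs(task_defaults, cli_overrides):
    """CLI values win over task defaults; neither input is mutated."""
    return {**task_defaults, **cli_overrides}


defaults = {"until": ["\n"], "max_gen_toks": 32}
overrides = parse_gen_kwargs("temperature=0.7,max_gen_toks=64,do_sample=true")
merged = merge_gen_kwargs(defaults, overrides)
```

Building a new dict in the merge step keeps the task's defaults reusable across runs with different overrides.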

New healthcare benchmark: careqa (#2714) · 7c9fbcf8

PabloAgustin authored Mar 11, 2025



* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>

7c9fbcf8
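
The careqa commit above mentions wrapping imports in try/except (including a `bert_score` import) and adding a nanmean aggregation. The following is a sketch of that general pattern only, not the code from the commit: the function names are hypothetical, and the NaN-skipping mean is written in plain Python rather than with NumPy.

```python
import math

try:
    import bert_score  # optional dependency; may not be installed
except ImportError:
    bert_score = None  # checked later so the module still imports cleanly


def bert_score_available():
    """Report whether the optional dependency was importable."""
    return bert_score is not None


def nanmean(values):
    """Mean of the values that are not NaN; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        return float("nan")
    return sum(kept) / len(kept)


# A sample whose metric could not be computed contributes NaN and is skipped.
scores = [1.0, float("nan"), 3.0]
average = nanmean(scores)  # 2.0
```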

Fix for mc2 calculation (#2768) · 2c8ffb80

Kajetan Dymkiewicz authored Mar 11, 2025



* fix for mc2 calculation

* increment versions and changelog

---------
Co-authored-by: Baber <baber@hey.com>

2c8ffb80
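
The commit message does not say what was wrong with the mc2 calculation. As background only, MC2 is commonly defined as the share of probability mass a model assigns to the true answers among all answer options. The sketch below assumes per-option log-likelihoods are already available, and `mc2_score` is a hypothetical name rather than the function changed in the commit.

```python
import math


def mc2_score(ll_true, ll_false):
    """Normalized probability mass on the true answers.

    ll_true / ll_false: log-likelihoods of the true and false options.
    """
    p_true = [math.exp(ll) for ll in ll_true]
    p_false = [math.exp(ll) for ll in ll_false]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Two true options with probabilities 0.2 and 0.1, one false option with 0.3:
score = mc2_score([math.log(0.2), math.log(0.1)], [math.log(0.3)])
```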

Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only (#2773) · c8044f30

Yotam Perlitz authored Mar 11, 2025



* Filter new leaderboard_math_hard dataset to "Level 5" only

* align to linters
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>

---------
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>

c8044f30
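
The "Level 5" filter described in the commit above can be sketched as follows, assuming each document is a dict with a `level` field holding strings such as `"Level 5"`. The field name and the helper `keep_level_5` are assumptions made for illustration.

```python
def keep_level_5(docs):
    """Return only the documents whose difficulty is exactly 'Level 5'."""
    return [doc for doc in docs if doc.get("level") == "Level 5"]


docs = [
    {"problem": "p1", "level": "Level 3"},
    {"problem": "p2", "level": "Level 5"},
    {"problem": "p3"},  # a missing level is treated as not Level 5
]
hard_only = keep_level_5(docs)
```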

05 Mar, 2025 1 commit
- fix: mmlu (generative) metric aggregation (#2761) · 38a9c530
  Yongkeun Hwang authored Mar 05, 2025
  
  38a9c530
04 Mar, 2025 1 commit

Add test for a simple Unitxt task (#2742) · 8b5c5c13

Kiersten Stokes authored Mar 04, 2025

* Add a test for a custom unitxt task

* Update task.py to bring in line with breaking change in v1.17.2

* Fix lint

8b5c5c13

03 Mar, 2025 1 commit

Groundcocoa (#2724) · ade01428

Harsh Kohli authored Mar 03, 2025



* Fix failing tests

* Resolved merge conflicts

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

ade01428

27 Feb, 2025 1 commit
- fix vllm data parallel (#2746) · a87fe425
  Baber Abbasi authored Feb 27, 2025
```
* remove ray.remote resources

* remove kobtest tag (registered as group)
```
  a87fe425
25 Feb, 2025 3 commits
- add humaneval+ and mbpp+ (#2734) · 86bbf6ac
  Minho Ryu authored Feb 25, 2025
```
* add humaneval+ and mbpp+

* add newline at end of file
```
  86bbf6ac
- Fix the import source for eval_logger (#2735) · 9b29da00
  Kailashbuki authored Feb 25, 2025
```
* Fix the import source for eval_logger

* fix logging

---------
Co-authored-by: Baber <baber@hey.com>
```
  9b29da00
- add cocoteros_es dataset (#2721) · 2b2fa97b
  Santiago Galiano Segura authored Feb 25, 2025
```
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>
```
  2b2fa97b
24 Feb, 2025 2 commits

add Basque translation of ARC and PAWS to BasqueBench (#2732) · 2f403fa0

Naiara Perez authored Feb 25, 2025



* add Basque translation of ARC and PAWS to BasqueBench

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

2f403fa0

Added IberoBench citation info... · a9a0e3ca

Naiara Perez authored Feb 24, 2025

Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in corresponding READMEs (#2729)

a9a0e3ca

23 Feb, 2025 1 commit
- remove unused import (#2728) · 5e0b6f16
  Baber Abbasi authored Feb 23, 2025
  
  5e0b6f16
21 Feb, 2025 3 commits

fix missing dataset repo (#2719) · 0bf9f4ea
Farhan Ahmed authored Feb 21, 2025

0bf9f4ea

Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>

1ba35e62
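
The Logging commit above moves logger setup into a shared helper and lets verbosity come from arguments. A rough sketch of that idea using only the standard `logging` module is given here; `configure_eval_logging` and the logger name are hypothetical and do not reproduce the helper added in the commit.

```python
import logging


def configure_eval_logging(verbosity="INFO", name="eval_sketch"):
    """Set up a named logger from a verbosity string such as 'DEBUG'."""
    level = getattr(logging, str(verbosity).upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown verbosity: {verbosity!r}")
    # basicConfig does nothing if the root logger already has handlers.
    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=level,
    )
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


logger = configure_eval_logging("debug")
logger.debug("verbosity was taken from an argument")
```

Modules can then call `logging.getLogger(__name__)` instead of importing one shared logger object, which is one way to read "changed source of eval_logger".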

add math_verify to some tasks (#2686) · 358adaf7

Baber Abbasi authored Feb 21, 2025

* add math_verify to minerva math

* add math_verify to benchmark

* fix error

* increment version

358adaf7

14 Feb, 2025 2 commits
- `arithmetic`: set target delimiter to empty string (#2701) · 41b952f3
  Baber Abbasi authored Feb 14, 2025
```
* set target delimiter to empty string

* nit

* add warning
```
  41b952f3
- fix `construct_requests` kwargs (#2700) · 5a5acc08
  Baber Abbasi authored Feb 14, 2025
  
  5a5acc08
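
For the `arithmetic` change above, here is a small illustration of why a target delimiter matters, assuming the prompt is assembled as context + delimiter + target and that the stored target already starts with a space. The function `join_example` and the sample strings are hypothetical.

```python
def join_example(context, target, target_delimiter=" "):
    """Concatenate a prompt context and its target with a delimiter."""
    return context + target_delimiter + target


context = "Question: What is 2 plus 3?\nAnswer:"
target = " 5"  # assumed to carry its own leading space

with_space = join_example(context, target)      # ends with "Answer:  5" (two spaces)
with_empty = join_example(context, target, "")  # ends with "Answer: 5"
```

With a space delimiter the model is scored on a doubled space; an empty delimiter avoids that when targets carry their own leading whitespace.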
13 Feb, 2025 1 commit
- set aggregation and higher_is_better (instead of falling back on defaults) (#2692) · c3c05b06
  James A. Michaelov authored Feb 13, 2025
  
  c3c05b06
12 Feb, 2025 1 commit
- Update unitxt task.py to bring in line with recent repo changes (#2684) · 8751fb35
  Kiersten Stokes authored Feb 12, 2025
  
  8751fb35
11 Feb, 2025 2 commits

Delete lm_eval/tasks/evalita_llm/single_prompt.zip (#2687) · 684fd2dd
Baber Abbasi authored Feb 11, 2025

684fd2dd

Adding the Evalita-LLM benchmark (#2681) · b7fccef5

Michele Resta authored Feb 11, 2025



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final table results.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

b7fccef5

07 Feb, 2025 1 commit

Turkish mmlu Config Update (#2678) · 53504a96

Arda authored Feb 07, 2025



* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

* Fix Regex Pattern for CoT experiments

* Fixed multiple choice accuracy

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

53504a96

31 Jan, 2025 1 commit

MMLU Pro Plus (#2366) · 0bb8406f

asgsaeid authored Jan 31, 2025



* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>

0bb8406f

29 Jan, 2025 2 commits

Add Histoires Morales task (#2662) · 1208afd3

Irina Proskurina authored Jan 29, 2025

* Add Histoires Morales task

* Histoires Morales task: fix mixed line endings

* Histoires Morales task: fix mixed line endings

* Remove tag for a single task

* Add some MT for Histoires Morales

1208afd3

remove `group` from bigbench task configs (#2663) · fe9c5707
Baber Abbasi authored Jan 29, 2025
```
* remove group from task configs

* add tags

* update readme
```
fe9c5707

28 Jan, 2025 3 commits
- update pre-commit (#2660) · 4b4b0363
  Baber Abbasi authored Jan 28, 2025
```
* nit

* update pre-commit
```
  4b4b0363
- Add Aggregation for Kobest Benchmark (#2446) · 94344a61
  Seungwoo Ryu authored Jan 29, 2025
```
Co-authored-by: Baber <baber@hey.com>
```
  94344a61
- Add Moral Stories (#2653) · a0466f01
  Irina Proskurina authored Jan 28, 2025
```
* Add moral stories task

* Add moral stories task

* Create README.md

* Update README.md

* Update line endings in moral_stories files
```
  a0466f01