Commits · 4366fc822eb4488adf8dcab4e25a75868442bb0c · gaoqiong / lm-evaluation-harness

19 Jul, 2025 1 commit

multiblimp (#3162) · 4366fc82

Baber Abbasi authored Jul 20, 2025

4366fc82

14 Jul, 2025 1 commit

Adding EgyMMLU and EgyHellaSwag (#3063) · 2ea6114e

Atou Houdaifa authored Jul 14, 2025

* add egy mmlu hellaswag

* add egymmlu egyhellaswag to tasks readme

* fix egymmlu config generation

* fix _generate_configs formatting

2ea6114e

03 Jul, 2025 1 commit

Truthfulqa multi harness (#3062) · e0dc33ae

Blanca Calvo authored Jul 03, 2025



* truthfulqa-multi task

* truthfulqa-multi with chat few-shot

* few shot chat implementation

* changed until so it outputs lists

* changed dataset location

* added MT task

* Create README.md

* do not include MT

* changes for PR

* tag change

* removed yaml extension

* adding task to the table

* fix task configs

* add import exception

---------
Co-authored-by: Baber <baber@hey.com>

e0dc33ae

26 May, 2025 1 commit

add arab_culture task (#3006) · 8bc4afff

Boda Sadallah authored May 26, 2025

* add arab_culture tasks

* add target_delimiter and remove debugging code

8bc4afff

19 May, 2025 1 commit

Adding ACPBench Hard tasks (#2980) · 0daf28fd

Harsha authored May 19, 2025

* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper

0daf28fd

15 May, 2025 1 commit

Added C4 Support (#2889) · 86a3b270

Yufeng Xu authored May 15, 2025

* added c4 dataset (working)

* fixed bugs in c4

* fixed loading bugs in c4 dataset; using partial loading

* cleaned the code

* added version number for c4

* removed irrelevant files

86a3b270

06 May, 2025 1 commit

Added NorEval, a novel Norwegian benchmark (#2919) · 71f2954b

Vladislav Mikhailov authored May 06, 2025

* added noreval

* added a checklist for noreval

* run pre-commit

* changed imports and added short noreval description

* fixed norsumm path

* refactored multi-folder tasks

* refactored multi-folder tasks

71f2954b

16 Apr, 2025 2 commits

Longbench bugfix (#2895) · 930d8378

Baber Abbasi authored Apr 17, 2025

* add warning for default until

* fix stop tokens; add vcsum

* bugfix: fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

930d8378

Fix a typo in README for tasks (#2910) · bb90a90c

Eldar Kurtic authored Apr 16, 2025

bb90a90c

14 Apr, 2025 1 commit

tasks README: fix dead link (#2899) · a9582804

Daniele authored Apr 14, 2025

a9582804

02 Apr, 2025 1 commit

Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865) · 6cc41d34

Saibo-creator authored Apr 02, 2025



* Add JSON schema benchmark

* Update lm_eval/tasks/jsonschema_bench/metrics.py

Thanks for catching this
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* run pre-commit

* add description to task catalogue readme

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

6cc41d34

29 Mar, 2025 1 commit

Fix: ACPBench Link (#2860) · 1514ac1e

Harsha authored Mar 28, 2025

1514ac1e

28 Mar, 2025 1 commit

add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench (#2521) · ebbbb968

Hadi Abdine authored Mar 28, 2025



* add Darija tasks

* fix multiple groups issue in darijammlu

* add MT to the description of the Darija tasks

* Update README.md

nit

* fix the recursion error caused by the darija_summarization task

* use a custom filter instead of the decorator for the strip function

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

ebbbb968

27 Mar, 2025 1 commit

Adding ACPBench task (#2807) · 5a9d5ba0

Harsha authored Mar 27, 2025

* Adding acpbench task

* adding ACPBench in Tasks readme.

* running precommit

5a9d5ba0

21 Mar, 2025 1 commit

Add MMLU-ProX task (#2811) · 8aeff141

heli-qi authored Mar 21, 2025

* update mmlu_prox configs

* update tasks/README

* correct hyphen to underscore in task/README

* update pre-commit codes

8aeff141

18 Mar, 2025 3 commits

[MM] Chartqa (#2544) · b8adf3cc

Baber Abbasi authored Mar 18, 2025

* add changelog to readme template

* add readme

* add to task list

b8adf3cc

Add loncxt tasks (#2629) · 80a10075

Baber Abbasi authored Mar 18, 2025

support for long context (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig

80a10075

Add MastermindEval (#2788) · f47ddaf8

Jonas Golde authored Mar 18, 2025

* add MastermindEval benchmark

* fill out checklist

f47ddaf8

14 Mar, 2025 1 commit

Add various social bias tasks (#1185) · 150a1852

Oskar van der Wal authored Mar 14, 2025



* Implementation of Winogender

* Minor fixes README.md

* Add winogender

* Clean winogender utils.py

* Change dataset to one containing All subsets

* Flesh out README for BBQ task

* Add missing tasks for BBQ

* Add simple cooccurrence bias task

* Fix wrong mask for ambiguated context+rename metrics

* Made generate_until evaluation (following PALM paper) default

Also moved separate config files per category to separate metrics using custom function.
Created config file for multiple_choice way of evaluating BBQ.

* Add missing version metadata

* Add missing version metadata for bbq multiple choice

* Fix metrics and address edge cases

* Made BBQ multiple choice the default version

* Added settings following winogrande

* Add num_fewshot to simple_cooccurrence_bias

* Fixes for bbq (multiple choice)

* Fix wrong dataset

* CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.

* Use simplest prompt possible without description

* Merge

* BBQ: Fix np.NaN related bug

* BBQ: Fix wrong aggregation method for disamb accuracy

* BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)

* BBQ: fix showing one target in case of few-shot evals

* BBQ: Fix few-shot example for bbq_generate

* BBQ: simplify subtasks

* BBQ: Minimize number of UNK variations to reduce inference time

* BBQ: Add extra UNK keywords for the generate task

* Add a generate_until version of simple_cooccurrence_bias

* Change system/description prompt to include few-shot examples

* Group agg rework

* Run pre-commit

* add tasks to readme table

* remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`

* fix

* fix

* fix version

---------
Co-authored-by: Baber <baber@hey.com>

150a1852

11 Mar, 2025 1 commit

New healthcare benchmark: careqa (#2714) · 7c9fbcf8

PabloAgustin authored Mar 11, 2025



* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>

7c9fbcf8

03 Mar, 2025 1 commit

Groundcocoa (#2724) · ade01428

Harsh Kohli authored Mar 03, 2025



* Fix failing tests

* Resolved merge conflicts

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

ade01428

11 Feb, 2025 1 commit

Adding the Evalita-LLM benchmark (#2681) · b7fccef5

Michele Resta authored Feb 11, 2025



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final table results.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

b7fccef5

31 Jan, 2025 1 commit

MMLU Pro Plus (#2366) · 0bb8406f

asgsaeid authored Jan 31, 2025



* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>

0bb8406f

29 Jan, 2025 1 commit

Add Histoires Morales task (#2662) · 1208afd3

Irina Proskurina authored Jan 29, 2025

* Add Histoires Morales task

* Histoires Morales task: fix mixed line endings

* Histoires Morales task: fix mixed line endings

* Remove tag for a single task

* Add some MT for Histoires Morales

1208afd3

28 Jan, 2025 1 commit

Add Moral Stories (#2653) · a0466f01

Irina Proskurina authored Jan 28, 2025

* Add moral stories task

* Add moral stories task

* Create README.md

* Update README.md

* Update line endings in moral_stories files

a0466f01

20 Jan, 2025 1 commit

add hrm8k benchmark for both Korean and English (#2627) · a5c344cf

Minho Ryu authored Jan 21, 2025



* add hrm8k benchmark for both Korean and English

* apply precommit

* revise tasks so that models do not answer directly; use zeroshot_cot if possible

* add README

* Add hrm8k on the task-list

---------
Co-authored-by: Baber <baber@hey.com>

a5c344cf

15 Jan, 2025 3 commits

Add MLQA (#2622) · e86cece6

Shivansh Pachnanda authored Jan 16, 2025

* Add MLQA
* add mlqa_common_yaml

* add 49 tests of mlqa family

* update tasks/README.md

---------

* fix: mlqa ast error

* nit: removed .yaml ext from template_yaml

* nit changes: minor modifications generate_tasks.py

* deleted lm_eval/tasks/mlqa/mlqa_common_yaml.yaml

* tests updated

* nit

e86cece6

Add MBPP (#2247) · 5db23e2c

Hojin Lee authored Jan 16, 2025



* add mbpp

* fix some bugs

* add README for mbpp

* update README

* nits

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

5db23e2c

Add HumanEval (#1992) · 4c11206b

Hojin Lee authored Jan 16, 2025



* add custom filter

* fix type casting of references

* add humaneval

* fix a bug in humaneval

* add greedy version of humaneval

* update tasks README

* test humaneval

* return multiple metrics

* nit

* add confirmation to run code tasks

* nit

* nit

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

4c11206b

24 Dec, 2024 1 commit

AraDICE task config file (#2507) · 932e8f9e

Firoj Alam, Scientist, QCRI authored Dec 24, 2024



* added aradice

* Added ArabicMMLU Lev Configs

* added ArabicMMLU egy configs

* Added boolq configs

* Added cultural bench configs

* added openbookqa configs

* Added PiQA configs

* added winogrande configs

* Added truthfulQA configs

* Added aradice group config

* Remove deleted files from repository

* modified arabicmmlu configs

* modified metadata versions

* fixed formatting using ruff

* added aradice tasks information

* pre-commit

* Updated openbookqa utils

* fixed formatting on obqa

---------
Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
Co-authored-by: Baber <baber@hey.com>

932e8f9e

19 Dec, 2024 1 commit

Add Global MMLU Lite (#2567) · 2b75b110

shivalika-singh authored Dec 19, 2024



* add global mmlu lite

* add global mmlu lite

* fix bugs

* add task README.md

* Update README.md

* Update tasks README.md

* Update README.md

* update readme

---------
Co-authored-by: shivi <shivalikasingh95@gmail.com>

2b75b110

26 Nov, 2024 1 commit

Score tasks (#2452) · 0ef7548d

Rima Shahbazyan authored Nov 26, 2024

* score readme added

* generate until task's "until" parameter's default value fixed.

* score mmlu-pro and agieval added

* changed macro accuracy to micro for agieval

* Always E removed from agi eval

* redundancies removed

* MATH added

* minor cosmetic changes for math

* Licenses added Readme updated

* changes for flake8 + license header on math

* Score added to readme and precommit was run.

* Score added to readme and precommit was run.

* Import error fixed

* math task bugfix
postprocess minor fix

* CR for math added

* math CR

* math task bugfix
postprocess minor fix

CR for math added

* Math cr fixed

* reverting the default "until" parameter change and adjusting score task configs

0ef7548d

18 Nov, 2024 1 commit

Add metabench task to LM Evaluation Harness (#2357) · 62b4364d

Kozzy Voudouris authored Nov 18, 2024



* Add metabench (Kipnis et al. 2024)

* Update metabench tasks for full replication of original benchmarks, using publicly available datasets

* Remove unnecessary import

* Add permute versions of each task, where the answer orders are randomly shuffled.

* Add metabench group for easier evaluations

* Fix mmlu counts after removing duplicate

* Add secondary datasets

* Fix f-string error

* Fix f-string error for permute processing

* Add original hash to outputs for easy matching to original results

* Add line break at end of utils files

* Remove extra line from winogrande

* Reformat for linters

* fix multiple input test

* appease pre-commit

* Add metabench to tasks README

* fix multiple input `test_doc_to_text`

---------
Co-authored-by: Baber <baber@hey.com>

62b4364d

16 Nov, 2024 1 commit

kbl-v0.1.1 (#2493) · cbc31eb8

Wonseok Hwang authored Nov 17, 2024

* release kbl-v0.1

* fix linting

* remove rag tasks as doc_to_text functions cause trouble

* remove remaining rag tasks

* remove unnecessary repeat in yaml files and rag dataset in hf-hub

* remove unnecessary newline; introduce cfg files in lbox/kbl in hf

* Make task yaml files consistent to hf-datasets-config

* Make task yaml files consistent to hf-datasets-config

* Remove trailing empty space in doc-to-text

* Remove unnecessary yaml file

* Fix task naming error

* trailing space removed

cbc31eb8

05 Nov, 2024 1 commit

Add Japanese Leaderboard (#2439) · 26f607f5

mtkachenko authored Nov 05, 2024

* add jaqket_v2 and jcommonsenseqa

* remove comments

* remove num_beams as it is incompatible with vllm

* add jnli + refactor

* rename jnla -> jnli

* add jsquad + replace colon chars with the Japanese unicode

* ignore whitespaces in generation tasks

* add marc_ja

* add xwinograd + simplify other yamls

* add mgsm and xlsum

* refactor xlsum

* add ja_leaderboard tag

* edit README.md

* update README.md

* add credit + minor changes

* run ruff format

* address review comments + add group

* remove aggregate_metric_list

* remove tags

* update tasks/README.md

26f607f5

01 Nov, 2024 1 commit

Add missing task links (#2449) · ade1cc4e

Sypherd authored Nov 01, 2024

ade1cc4e

30 Oct, 2024 1 commit

Add xquad task (#2435) · b40a20ae

zxcvuser authored Oct 30, 2024

* Add xquad task

* Update general README

* Run pre-commit

b40a20ae

04 Oct, 2024 2 commits

Add new benchmark: Catalan bench (#2154) · cb069004

zxcvuser authored Oct 04, 2024



* Add catalan_bench

* added flores_ca.yaml

* Updated some task groupings and readme

* Fix create_yamls_flores_ca.py

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cb069004

Add new benchmark: Basque bench (#2153) · c887796d

zxcvuser authored Oct 04, 2024



* Add basque_bench

* Add flores_eu group

* Update _flores_common_yaml

* Run linters, updated flores, mgsm, copa, and readme

* Apply suggestions from code review
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

c887796d

03 Oct, 2024 1 commit

Add new benchmark: Galician bench (#2155) · 0e763862

zxcvuser authored Oct 03, 2024

* Add galician_bench

* Update xnli_gl path

* Add flores_gl group

* Update _flores_common_yaml

* Updated some task groupings and readme

---------

0e763862