- 19 Jul, 2025 1 commit
Baber Abbasi authored
-
- 14 Jul, 2025 1 commit
Atou Houdaifa authored
* add egy mmlu hellaswag * add egymmlu egyhellaswag to tasks readme * fix egymmlu config generation * fix _generate_configs formatting
-
- 03 Jul, 2025 1 commit
Blanca Calvo authored
* truthfulqa-multi task * truthfulqa-multi with chat few-shot * few shot chat implementation * changed until so it outputs lists * changed dataset location * added MT task * Create README.md * do not include MT * changes for PR * tag change * removed yaml extension * adding task to the table * fix task configs * add import exception --------- Co-authored-by: Baber <baber@hey.com>
-
- 26 May, 2025 1 commit
Boda Sadallah authored
* add arab_culture tasks * add target_delimiter and remove debugging code
-
- 19 May, 2025 1 commit
Harsha authored
* adding ACPBench_hard * adding Clingo * changing tarski to tarski[clingo] * denoting the main variants in each paper
-
- 15 May, 2025 1 commit
Yufeng Xu authored
* added c4 dataset (working) * fixed bugs in c4 * fixed loading bugs in c4 dataset; using partial loading * cleaned the code * added version number for c4 * removed irrelevant files
-
- 06 May, 2025 1 commit
Vladislav Mikhailov authored
* added noreval * added a checklist for noreval * run pre-commit * changed imports and added short noreval description * fixed norsumm path * refactored multi-folder tasks * refactored multi-folder tasks
-
- 16 Apr, 2025 2 commits
Baber Abbasi authored
* add warning for default until * fix stop tokens; add vcsum * bugfix: fix doc_to_target to string * fix lsht, trec * add task to readme * add debugging logs for multiple input/output
-
Eldar Kurtic authored
-
- 14 Apr, 2025 1 commit
Daniele authored
-
- 02 Apr, 2025 1 commit
Saibo-creator authored
* Add JSON schema benchmark * Update lm_eval/tasks/jsonschema_bench/metrics.py Thanks for catching this Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> * run pre-commit * add description to task catalogue readme --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 29 Mar, 2025 1 commit
Harsha authored
-
- 28 Mar, 2025 1 commit
Hadi Abdine authored
* add Darija tasks * fix multiple groups issue in darijammlu * add MT to the description of the Darija tasks * Update README.md nit * fix the recursion error caused by the darija_summarization task * use a custom filter instead of the decorator for the strip function --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 27 Mar, 2025 1 commit
Harsha authored
* Adding acpbench task * adding ACPBench in Tasks readme. * running precommit
-
- 21 Mar, 2025 1 commit
heli-qi authored
* update mmlu_prox configs * update tasks/README * correct hyphen to underline in task/README * update pre-commit codes
-
- 18 Mar, 2025 3 commits
Baber Abbasi authored
* add changelog to readme template * add readme * add to task list
-
Baber Abbasi authored
support for longcontext (and other synthetic tasks) * add ruler * add longbench * pass `metadata` to TaskConfig
-
Jonas Golde authored
* add MastermindEval benchmark * fill out checklist
-
- 14 Mar, 2025 1 commit
Oskar van der Wal authored
* Implementation of Winogender * Minor fixes README.md * Add winogender * Clean winogender utils.py * Change dataset to one containing All subsets * Flesh out README for BBQ task * Add missing tasks for BBQ * Add simple cooccurrence bias task * Fix wrong mask for ambiguated context + rename metrics * Made generate_until evaluation (following PALM paper) default. Also moved separate config files per category to separate metrics using custom function. Created config file for multiple_choice way of evaluating BBQ. * Add missing version metadata * Add missing version metadata for bbq multiple choice * Fix metrics and address edge cases * Made BBQ multiple choice the default version * Added settings following winogrande * Add num_fewshot to simple_cooccurrence_bias * Fixes for bbq (multiple choice) * Fix wrong dataset * CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets. * Use simplest prompt possible without description * Merge * BBQ: Fix np.NaN related bug * BBQ: Fix wrong aggregation method for disamb accuracy * BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval) * BBQ: fix showing one target in case of few-shot evals * BBQ: Fix few-shot example for bbq_generate * BBQ: simplify subtasks * BBQ: Minimize number of UNK variations to reduce inference time * BBQ: Add extra UNK keywords for the generate task * Add a generate_until version of simple_cooccurrence_bias * Change system/description prompt to include few-shot examples * Group agg rework * Run pre-commit * add tasks to readme table * remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text` * fix * fix * fix version --------- Co-authored-by: Baber <baber@hey.com>
-
- 11 Mar, 2025 1 commit
PabloAgustin authored
* New healthcare benchmark: careqa * LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0> * Add fixes, READMEs, and remove task_list.txt * pre-commit passed, add formatting updates; add nanmean agg_metric * Fix import error. * Wrapped imports in try excepts * Wrapped imports in try excepts; also metrics to catch bert_score import error * Try except to catch ImportErrors as well * use np.nan * pre-commit --------- Co-authored-by: PabloAgustin <pablo.martin@bsc.es> Co-authored-by: Baber <baber@hey.com>
-
- 03 Mar, 2025 1 commit
Harsh Kohli authored
* Fix failing tests * Resolved merge conflicts * pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
- 11 Feb, 2025 1 commit
Michele Resta authored
* feat: initial commit with templates for evalita evaluation * fix: change rule for generate_until * feat: modified yaml to use reduced version of NER test datasets * feat: added templates to use reduced dataset for summarization (fanpage and ilpost) * Add Six Prompts for Each Multiple-Choice Task * feat: modified fewshot split for textual entailment task * fix: new doc_to_target function for NER tasks * Update prompt * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Update prompt * Add partition for few-shot evaluation * Rename file Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Enhance lexical substitution management - Improve scorer calculation for better accuracy - Update model output postprocessing for clearer results - Add support for few-shot relation extraction task * Add F1 macro measure for the document dating task * Add F1-macro measure to evaluate document dating * Use the whole dataset * Small changes * Add the two prompts for the task of lexical substitution * Add few-shot split configuration * Add few-shot split configuration * Add function for handling few-shot learning setup * Fix prompt * Remove configuration file * Update dataset from test_same to test_cross for evaluations * Remove whitespace at end of prompt * Fix configuration error: corrected parameter name for the dataset used in few-shot * Fix: Check if results is not empty before processing in lexical substitution task * added the prompts and functions for correct NER and RE execution * Add accuracy measure * Add tasks for the EVALITA-LLM benchmark evaluation * Small changes Add the alias of the task name that will be printed in the final table results.
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task * chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks. * fix: add information on Evalita-LLM for PR * fix: rename folders and files * fix: remove unused imports * chore: run pre-commit * chore: add task description --------- Co-authored-by: rzanoli <zanoli@fbk.eu> Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
-
- 31 Jan, 2025 1 commit
asgsaeid authored
* mmlu-pro-plus is implemented * README file is updated * Update README.md with new task: MMLU Pro Plus * Update README.md with new task: MMLU Pro Plus * pre-commit * nit --------- Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local> Co-authored-by: Baber <baber@hey.com>
-
- 29 Jan, 2025 1 commit
Irina Proskurina authored
* Add Histoires Morales task * Histoires Morales task: fix mixed line endings * Histoires Morales task: fix mixed line endings * Remove tag for a single task * Add some MT for Histoires Morales
-
- 28 Jan, 2025 1 commit
Irina Proskurina authored
* Add moral stories task * Add moral stories task * Create README.md * Update README.md * Update line endings in moral_stories files
-
- 20 Jan, 2025 1 commit
Minho Ryu authored
* add hrm8k benchmark for both Korean and English * apply precommit * revise tasks to make models not to directly answer; use zeroshot_cot if possible * add README * Add hrm8k on the task-list --------- Co-authored-by: Baber <baber@hey.com>
-
- 15 Jan, 2025 3 commits
Shivansh Pachnanda authored
* Add MLQA * add mlqa_common_yaml * add 49 tests of mlqa family * update tasks/README.md --------- * fix: mlqa ast error * nit: removed .yaml ext from template_yaml * nit changes: minor modifications generate_tasks.py * deleted lm_eval/tasks/mlqa/mlqa_common_yaml.yaml * tests updated * nit
-
Hojin Lee authored
* add mbpp * fix some bugs * add README for mbpp * update README * nits --------- Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
Hojin Lee authored
* add custom filter * fix type casting of references * add humaneval * fix a bug in humaneval * add greedy version of humaneval * update tasks README * test humaneval * return multiple metrics * nit * add confirmation to run code tasks * nit * nit --------- Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
- 24 Dec, 2024 1 commit
Firoj Alam, Scientist, QCRI authored
* added aradice * Added ArabicMMLU Lev Configs * added ArabicMMLU egy configs * Added boolq configs * Added cultural bench configs * added openbookqa configs * Added PiQA configs * added winogrande configs * Added truthfulQA configs * Added aradice group config * Remove deleted files from repository * modified arabimmlu configs * modified metadata versions * fixed formatting using ruff * added aradice tasks information * pre-commit * Updated openbookqa utils * fixed formatting on obqa --------- Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa> Co-authored-by: Baber <baber@hey.com>
-
- 19 Dec, 2024 1 commit
shivalika-singh authored
* add global mmlu lite * add global mmlu lite * fix bugs * add task README.md * Update README.md * Update tasks README.md * Update README.md * update readme --------- Co-authored-by: shivi <shivalikasingh95@gmail.com>
-
- 26 Nov, 2024 1 commit
Rima Shahbazyan authored
* score readme added * generate until task's "until" parameter's default value fixed. * score mmlu-pro and agieval added * changed macro accuracy to micro for agieval * Always E removed from agi eval * redundancies removed * MATH added * minor cosmetic changes for math * Licenses added Readme updated * changes for flake8 + license header on math * Score added to readme and precommit was run. * Score added to readme and precommit was run. * Import error fixed * math task bugfix postprocess minor fix * CR for math added * math CR * math task bugfix postprocess minor fix CR for math added * Math cr fixed * reverting the default "until" parameter change and adjusting score task configs
-
- 18 Nov, 2024 1 commit
Kozzy Voudouris authored
* Add metabench (Kipnis et al. 2024) * Update metabench tasks for full replication of original benchmarks, using publicly available datasets * Remove unnecessary import * Add permute versions of each task, where the answer orders are randomly shuffled. * Add metabench group for easier evaluations * Fix mmlu counts after removing duplicate * Add secondary datasets * Fix f-string error * Fix f-string error for permute processing * Add original hash to outputs for easy matching to original results * Add line break at end of utils files * Remove extra line from winogrande * Reformat for linters * fix multiple input test * appease pre-commit * Add metabench to tasks README * fix multiple input `test_doc_to_text` --------- Co-authored-by: Baber <baber@hey.com>
-
- 16 Nov, 2024 1 commit
Wonseok Hwang authored
* release kbl-v0.1 * fix linting * remove rag tasks as doc_to_text functions cause trouble * remove remaining rag tasks * remove unnecessary repeat in yaml files and rag dataset in hf-hub * remove unnecessary newline; introduce cfg files in lbox/kbl in hf * Make task yaml files consistent to hf-datasets-config * Make task yaml files consistent to hf-datasets-config * Remove trailing empty space in doc-to-text * Remove unnecessary yaml file * Fix task naming error * trailing space removed
-
- 05 Nov, 2024 1 commit
mtkachenko authored
* add jaqket_v2 and jcommonsenseqa * remove comments * remove num_beams as it is incompatible with vllm * add jnli + refactor * rename jnla -> jnli * add jsquad + replace colon chars with the Japanese unicode * ignore whitespaces in generation tasks * add marc_ja * add xwinograd + simplify other yamls * add mgsm and xlsum * refactor xlsum * add ja_leaderboard tag * edit README.md * update README.md * add credit + minor changes * run ruff format * address review comments + add group * remove aggregate_metric_list * remove tags * update tasks/README.md
-
- 01 Nov, 2024 1 commit
Sypherd authored
-
- 30 Oct, 2024 1 commit
zxcvuser authored
* Add xquad task * Update general README * Run pre-commit
-
- 04 Oct, 2024 2 commits
zxcvuser authored
* Add catalan_bench * added flores_ca.yaml * Updated some task groupings and readme * Fix create_yamls_flores_ca.py --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
zxcvuser authored
* Add basque_bench * Add flores_eu group * Update _flores_common_yaml * Run linters, updated flores, mgsm, copa, and readme * Apply suggestions from code review Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 03 Oct, 2024 1 commit
zxcvuser authored
* Add galician_bench * Update xnli_gl path * Add flores_gl group * Update _flores_common_yaml * Updated some task groupings and readme ---------
-