- 22 Sep, 2025 1 commit
-
-
priverabsc authored
* Add eqbench tasks in Spanish and Catalan * Incremented catalan_bench and spanish_bench versions. Added 'multilingual' folder inside 'eq_bench' and moved the eqbench_ca and eqbench_es .yaml files to that folder. Updated the tasks README with eqbench_es and eqbench_ca, stating explicitly in each description both the Hugging Face link and the translation method. * Fixed tasks table. * remove test_task.sh and results folder * Add utils.py to multilingual folder
-
- 21 Sep, 2025 4 commits
-
-
its-alpesh authored
* Add humaneval_infilling task * pacify pre-commit --------- Co-authored-by:Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
Janna authored
* register aime * lint --------- Co-authored-by:Baber <baber@hey.com>
-
Janna authored
* create babilong tasks * lint * add clarification * fix typo * add babilong description
-
Luis Cosio authored
* Added benchmark * Added more testing * Added task definition for mmlu_redux and mmlu_redux_spanish * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs * Add remaining MMLU Redux YAMLs and updated tasks README * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs * Add MMLU Redux changes from pr-2705 * Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names * Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes * Revert python test changes and comment out one task group to avoid Hugging Face rate limit and task failure --------- Co-authored-by: CT-6282 <ricardo.godric@hotmail.com>
-
- 08 Sep, 2025 1 commit
-
-
James A. Michaelov authored
* add icelandic_winogrande * fix spacing for final words in sentence
-
- 02 Sep, 2025 2 commits
-
-
Valle Ruiz-Fernández authored
* Add EsBBQ and CaBBQ tasks * Linter fixes * add esbbq and cabbq to task list --------- Co-authored-by:Júlia Falcão <juliafsfalcao@hotmail.com>
-
James A. Michaelov authored
* run linter * add acc_norm
-
- 25 Aug, 2025 3 commits
-
-
Weihao XUAN authored
* update MMLU_ProX * update MMLU_ProX * cleanup code by pre-commit
-
William Held authored
* Anthropic Discrim Eval * Mixed Effects Regression * Actually wire it all up * Operator Name Doesn't Exist on Github * Update lm_eval/tasks/discrim_eval/discrim_eval_implicit.yaml Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> * Update discrim_eval_implicit.yaml * Update discrim_eval_explicit.yaml * pacify pre-commit --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
Geun, Lim authored
* feat: Add CLIcK task * Fix formatting issues * Add Click Task Description * fix: lint * fix
-
- 21 Aug, 2025 4 commits
-
-
James A. Michaelov authored
* add lm_syneval * edit readme * update task readme * formatting fixes * run linting * add descriptions and examples * clean readme formatting
-
James A. Michaelov authored
* add turblimp * update general task readme * add normalized accuracy
-
James A. Michaelov authored
* add blimp_nl * add template yaml file
-
James A. Michaelov authored
* add zhoblimp files * correct group name * fix group * add normalized accuracy
-
- 22 Jul, 2025 1 commit
-
-
Svetlana Karimova authored
* Feat: add LIBRA benchmark * Feat: add dataset filter to LIBRA * Fix: formatting through pre-commit and main tasks README * Fix: resolve conflict * Fix: dataset name to real * Fix: delete unnecessary datasets and correct dependency --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 19 Jul, 2025 1 commit
-
-
Baber Abbasi authored
-
- 14 Jul, 2025 1 commit
-
-
Atou Houdaifa authored
* add egy mmlu hellaswag * add egymmlu egyhellaswag to tasks readme * fix egymmlu config generation * fix _generate_configs formatting
-
- 03 Jul, 2025 1 commit
-
-
Blanca Calvo authored
* truthfulqa-multi task * truthfulqa-multi with chat few-shot * few shot chat implementation * changed until so it outputs lists * changed dataset location * added MT task * Create README.md * do not include MT * changes for PR * tag change * removed yaml extension * adding task to the table * fix task configs * add import exception --------- Co-authored-by:Baber <baber@hey.com>
-
- 26 May, 2025 1 commit
-
-
Boda Sadallah authored
* add arab_culture tasks * add target_delimiter and remove debugging code
-
- 19 May, 2025 1 commit
-
-
Harsha authored
* adding ACPBench_hard * adding Clingo * changing tarski to tarski[clingo] * denoting the main variants in each paper
-
- 15 May, 2025 1 commit
-
-
Yufeng Xu authored
* added c4 dataset (working) * fixed bugs in c4 * fixed loading bugs in c4 dataset; using partial loading * cleaned the code * added version number for c4 * removed irrelevant files
-
- 06 May, 2025 1 commit
-
-
Vladislav Mikhailov authored
* added noreval * added a checklist for noreval * run pre-commit * changed imports and added short noreval description * fixed norsumm path * refactored multi-folder tasks * refactored multi-folder tasks
-
- 16 Apr, 2025 2 commits
-
-
Baber Abbasi authored
* add warning for default until * fix stop tokens; add vcsum * bugfix: fix doc_to_target to string * fix lsht, trec * add task to readme * add debugging logs for multiple input/output
-
Eldar Kurtic authored
-
- 14 Apr, 2025 1 commit
-
-
Daniele authored
-
- 02 Apr, 2025 1 commit
-
-
Saibo-creator authored
* Add JSON schema benchmark * Update lm_eval/tasks/jsonschema_bench/metrics.py Thanks for catching this Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> * run pre-commit * add description to task catalogue readme --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 29 Mar, 2025 1 commit
-
-
Harsha authored
-
- 28 Mar, 2025 1 commit
-
-
Hadi Abdine authored
* add Darija tasks * fix multiple groups issue in darijammlu * add MT to the description of the Darija tasks * Update README.md nit * fix the recursion error caused by the darija_summarization task * use a custom filter instead of the decorator for the strip function --------- Co-authored-by:Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 27 Mar, 2025 1 commit
-
-
Harsha authored
* Adding acpbench task * adding ACPBench in Tasks readme. * running precommit
-
- 21 Mar, 2025 1 commit
-
-
heli-qi authored
* update mmlu_prox configs * update tasks/README * correct hyphen to underscore in tasks/README * update pre-commit codes
-
- 18 Mar, 2025 3 commits
-
-
Baber Abbasi authored
* add changelog to readme template * add readme * add to task list
-
Baber Abbasi authored
support for long context (and other synthetic tasks) * add ruler * add longbench * pass `metadata` to TaskConfig
-
Jonas Golde authored
* add MastermindEval benchmark * fill out checklist
-
- 14 Mar, 2025 1 commit
-
-
Oskar van der Wal authored
* Implementation of Winogender * Minor fixes README.md * Add winogender * Clean winogender utils.py * Change dataset to one containing All subsets * Flesh out README for BBQ task * Add missing tasks for BBQ * Add simple cooccurrence bias task * Fix wrong mask for ambiguated context+rename metrics * Made generate_until evaluation (following PALM paper) default Also moved separate config files per category to separate metrics using custom function. Created config file for multiple_choice way of evaluating BBQ. * Add missing version metadata * Add missing versionmetadata for bbq multiple choice * Fix metrics and address edge cases * Made BBQ multiple choice the default version * Added settings following winogrande * Add num_fewshot to simple_cooccurrence_bias * Fixes for bbq (multiple choice) * Fix wrong dataset * CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets. * Use simplest prompt possible without description * Merge * BBQ: Fix np.NaN related bug * BBQ: Fix wrong aggregation method for disamb accuracy * BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval) * BBQ: fix showing one target in case of few-shot evals * BBQ: Fix few-shot example for bbq_generate * BBQ: simplify subtasks * BBQ: Minimize number of UNK variations to reduce inference time * BBQ: Add extra UNK keywords for the generate task * Add a generate_until version of simple_cooccurrence_bias * Change system/description prompt to include few-shot examples * Group agg rework * Run pre-commit * add tasks to readme table * remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text` * fix * fix * fix version --------- Co-authored-by:Baber <baber@hey.com>
-
- 11 Mar, 2025 1 commit
-
-
PabloAgustin authored
* New healthcare benchmark: careqa * LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0> * Add fixes, READMES, and remove task_list.txt * pre-commit passed, add formatting updates; add nanmean agg_metric * Fix import error. * Wrapped imports in try excepts * Wrapped imports in try excepts; also metrics to catch bert_score import error * Try except to catch ImportErrors as well * use np.nan * pre-commit --------- Co-authored-by: PabloAgustin <pablo.martin@bsc.es> Co-authored-by: Baber <baber@hey.com>
-
- 03 Mar, 2025 1 commit
-
-
Harsh Kohli authored
* Fix failing tests * Resolved merge conflicts * pre-commit --------- Co-authored-by:Baber <baber@hey.com>
-
- 11 Feb, 2025 1 commit
-
-
Michele Resta authored
* feat: initial commit with templates for evalita evaluation * fix: change rule for generate_until * feat: modified yaml to use reduced version of NER test datasets * feat: added templates to use reduced dataset for summarization (fanpage and ilpost) * Add Six Prompts for Each Multiple-Choice Task * feat: modified fewshot split for textual entailment task * fix: new doc_to_target function for NER tasks * Update prompt * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluatio * Update prompt * Add partition for few-shot evaluation * Rename file Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Enhance lexical substitution management - Improve scorer calculation for better accuracy - Update model output postprocessing for clearer results - Add support for few-shot relation extraction task * Add F1 macro measure for the document dating task * Add F1-macro measure to evaluate document dating * Use the whole dataset * Small changes * Add the two prompts for the task of lexical substitution * Add few-shot split configuration * Add few-shot split configuration * Add function for handling few-shot learning setup * Fix prompt * Remove configuration file * Update dataset from test_same to test_cross for evaluations * Remove whitespace at end of prompt * Fix configuration error: corrected parameter name for the dataset used in few-shot * Fix: Check if results is not empty before processing in lexical substitution task * added the prompts and functions for correct NER and RE execution * Add accuracy measure * Add tasks for the EVALITA-LLM benchmark evaluation * Small changes Add the alias of the task name that will be printed in the final table results. 
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task * chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks. * fix: add information on Evalita-LLM for PR * fix: rename folders and files * fix: remove unused imports * chore: run pre-commit * chore: add task description --------- Co-authored-by: rzanoli <zanoli@fbk.eu> Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
-
- 31 Jan, 2025 1 commit
-
-
asgsaeid authored
* mmlu-pro-plus is implemented * README file is updated * Update README.md with new task: MMLU Pro Plus * Update README.md with new task: MMLU Pro Plus * pre-commit * nit --------- Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local> Co-authored-by: Baber <baber@hey.com>
-
- 29 Jan, 2025 1 commit
-
-
Irina Proskurina authored
* Add Histoires Morales task * Histoires Morales task: fix mixed line endings * Histoires Morales task: fix mixed line endings * Remove tag for a single task * Add some MT for Histoires Morales
-