- 18 Mar, 2025 4 commits
-
-
Baber Abbasi authored
* add changelog to readme template * add readme * add to task list
-
Baber Abbasi authored
support for longcontext (and other synthetic tasks) * add ruler * add longbench * pass `metadata` to TaskConfig
-
Jonas Golde authored
* add MastermindEval benchmark * fill out checklist
-
Santiago Galiano Segura authored
* Add cocoteros_va dataset * Fix format in cocoteros_va.yml * Undo newline added * Execute pre-commit to fix format errors * Update catalan_bench.yaml version and add Changelog section into Readme.md
-
- 17 Mar, 2025 2 commits
-
-
Angelika Romanou authored
* Add INCLUDE tasks * pacify pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
Avelina9X authored
* Update openllm.yaml to use train fewshot split for arc
-
- 14 Mar, 2025 3 commits
-
-
Oskar van der Wal authored
* Implementation of Winogender * Minor fixes README.md * Add winogender * Clean winogender utils.py * Change dataset to one containing All subsets * Flesh out README for BBQ task * Add missing tasks for BBQ * Add simple cooccurrence bias task * Fix wrong mask for ambiguated context+rename metrics * Made generate_until evaluation (following PALM paper) default. Also moved separate config files per category to separate metrics using custom function. Created config file for multiple_choice way of evaluating BBQ. * Add missing version metadata * Add missing version metadata for bbq multiple choice * Fix metrics and address edge cases * Made BBQ multiple choice the default version * Added settings following winogrande * Add num_fewshot to simple_cooccurrence_bias * Fixes for bbq (multiple choice) * Fix wrong dataset * CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets. * Use simplest prompt possible without description * Merge * BBQ: Fix np.NaN related bug * BBQ: Fix wrong aggregation method for disamb accuracy * BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval) * BBQ: fix showing one target in case of few-shot evals * BBQ: Fix few-shot example for bbq_generate * BBQ: simplify subtasks * BBQ: Minimize number of UNK variations to reduce inference time * BBQ: Add extra UNK keywords for the generate task * Add a generate_until version of simple_cooccurrence_bias * Change system/description prompt to include few-shot examples * Group agg rework * Run pre-commit * add tasks to readme table * remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text` * fix * fix * fix version --------- Co-authored-by: Baber <baber@hey.com>
-
achervyakov authored
* Added audio-modality pipeline for qwen2-audio model * Beauty imports * fix apply_chat_template args * update default audio placeholders list * add demo task - common_voice subset * add audiolm_qwen libs to pyproject.toml * pre-commit beautify --------- Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>
-
Baber Abbasi authored
-
- 11 Mar, 2025 5 commits
-
-
Baber Abbasi authored
* add instruct humaneval * nit * add to readme * nit
-
Surya Kasturi authored
* Capture gen_kwargs from CLI in squad_completion * Update lm_eval/tasks/squad_completion/task.py Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> * Update lm_eval/tasks/squad_completion/task.py Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> * pre-commit --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
PabloAgustin authored
* New healthcare benchmark: careqa * LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0> * Add fixes, READMES, and remove task_list.txt * pre-commit passed, add formatting updates; add nanmean agg_metric * Fix import error. * Wrapped imports in try excepts * Wrapped imports in try excepts; also metrics to catch bert_score import error * Try except to catch ImportErrors as well * use np.nan * pre-commit --------- Co-authored-by: PabloAgustin <pablo.martin@bsc.es> Co-authored-by: Baber <baber@hey.com>
-
Kajetan Dymkiewicz authored
* fix for mc2 calculation * increment versions and changelog --------- Co-authored-by: Baber <baber@hey.com>
-
Yotam Perlitz authored
* Filter new leaderboard_math_hard dataset to "Level 5" only * align to linters Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com> --------- Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
-
- 05 Mar, 2025 1 commit
-
-
Yongkeun Hwang authored
-
- 04 Mar, 2025 1 commit
-
-
Kiersten Stokes authored
* Add a test for a custom unitxt task * Update task.py to bring in line with breaking change in v1.17.2 * Fix lint
-
- 03 Mar, 2025 1 commit
-
-
Harsh Kohli authored
* Fix failing tests * Resolved merge conflicts * pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
- 27 Feb, 2025 1 commit
-
-
Baber Abbasi authored
* remove ray.remote resources * remove kobtest tag (registered as group)
-
- 25 Feb, 2025 3 commits
-
-
Minho Ryu authored
* add humaneval+ and mbpp+ * add newline at end of file
-
Kailashbuki authored
* Fix the import source for eval_logger * fix logging --------- Co-authored-by: Baber <baber@hey.com>
-
Santiago Galiano Segura authored
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>
-
- 24 Feb, 2025 2 commits
-
-
Naiara Perez authored
* add Basque translation of ARC and PAWS to BasqueBench * pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
Naiara Perez authored
Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in corresponding READMEs (#2729)
-
- 23 Feb, 2025 1 commit
-
-
Baber Abbasi authored
-
- 21 Feb, 2025 3 commits
-
-
Farhan Ahmed authored
-
Lintang Sutawika authored
* changed source of eval_logger * allow eval_logger to be set from args * removed verbosity arg from non-main methods * fix logging * pre-commit * set verbosity in eval logger * replace utils.eval_logger * fix logging in main * add logging to docs * add logging message * nit * add logging to docs * refactor setup_logging to utils --------- Co-authored-by: Baber <baber@hey.com>
-
Baber Abbasi authored
* add math_verify to minerva math * add math_verify to benchmark * fix error * increment version
-
- 14 Feb, 2025 2 commits
-
-
Baber Abbasi authored
* set target delimiter to empty string * nit * add warning
-
Baber Abbasi authored
-
- 13 Feb, 2025 1 commit
-
-
James A. Michaelov authored
-
- 12 Feb, 2025 1 commit
-
-
Kiersten Stokes authored
-
- 11 Feb, 2025 2 commits
-
-
Baber Abbasi authored
-
Michele Resta authored
* feat: initial commit with templates for evalita evaluation * fix: change rule for generate_until * feat: modified yaml to use reduced version of NER test datasets * feat: added templates to use reduced dataset for summarization (fanpage and ilpost) * Add Six Prompts for Each Multiple-Choice Task * feat: modified fewshot split for textual entailment task * fix: new doc_to_target function for NER tasks * Update prompt * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Update prompt * Add partition for few-shot evaluation * Rename file Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Enhance lexical substitution management - Improve scorer calculation for better accuracy - Update model output postprocessing for clearer results - Add support for few-shot relation extraction task * Add F1 macro measure for the document dating task * Add F1-macro measure to evaluate document dating * Use the whole dataset * Small changes * Add the two prompts for the task of lexical substitution * Add few-shot split configuration * Add few-shot split configuration * Add function for handling few-shot learning setup * Fix prompt * Remove configuration file * Update dataset from test_same to test_cross for evaluations * Remove whitespace at end of prompt * Fix configuration error: corrected parameter name for the dataset used in few-shot * Fix: Check if results is not empty before processing in lexical substitution task * added the prompts and functions for correct NER and RE execution * Add accuracy measure * Add tasks for the EVALITA-LLM benchmark evaluation * Small changes Add the alias of the task name that will be printed in the final table results. * Updated the prompts to reflect changes made to the extended dataset for the Admission Test task * chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks. * fix: add information on Evalita-LLM for PR * fix: rename folders and files * fix: remove unused imports * chore: run pre-commit * chore: add task description --------- Co-authored-by: rzanoli <zanoli@fbk.eu> Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
-
- 07 Feb, 2025 1 commit
-
-
Arda authored
* Added TurkishMMLU to LM Evaluation Harness * Fixed COT name * Fixed COT name * Updated Readme * Fixed Test issues * Completed Scan for changed tasks * Updated Readme * Update README.md * fixup task naming casing + ensure yaml template stubs aren't registered * Fix Regex Pattern for CoT experiments * Fixed multiple choice accuracy --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 31 Jan, 2025 1 commit
-
-
asgsaeid authored
* mmlu-pro-plus is implemented * README file is updated * Update README.md with new task: MMLU Pro Plus * Update README.md with new task: MMLU Pro Plus * pre-commit * nit --------- Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local> Co-authored-by: Baber <baber@hey.com>
-
- 29 Jan, 2025 2 commits
-
-
Irina Proskurina authored
* Add Histoires Morales task * Histoires Morales task: fix mixed line endings * Histoires Morales task: fix mixed line endings * Remove tag for a single task * Add some MT for Histoires Morales
-
Baber Abbasi authored
* remove group from task configs * add tags * update readme
-
- 28 Jan, 2025 3 commits
-
-
Baber Abbasi authored
* nit * update pre-commit
-
Seungwoo Ryu authored
Co-authored-by: Baber <baber@hey.com>
-
Irina Proskurina authored
* Add moral stories task * Add moral stories task * Create README.md * Update README.md * Update line endings in moral_stories files
-