- 03 Mar, 2025 1 commit
-
-
Harsh Kohli authored
* Fix failing tests * Resolved merge conflicts * pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
- 11 Feb, 2025 1 commit
-
-
Michele Resta authored
* feat: initial commit with templates for evalita evaluation * fix: change rule for generate_until * feat: modified yaml to use reduced version of NER test datasets * feat: added templates to use reduced dataset for summarization (fanpage and ilpost) * Add Six Prompts for Each Multiple-Choice Task * feat: modified fewshot split for textual entailment task * fix: new doc_to_target function for NER tasks * Update prompt * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Add partition for few-shot evaluatio * Update prompt * Add partition for few-shot evaluation * Rename file Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml * Add partition for few-shot evaluation * Add partition for few-shot evaluation * Enhance lexical substitution management - Improve scorer calculation for better accuracy - Update model output postprocessing for clearer results - Add support for few-shot relation extraction task * Add F1 macro measure for the document dating task * Add F1-macro measure to evaluate document dating * Use the whole dataset * Small changes * Add the two prompts for the task of lexical substitution * Add few-shot split configuration * Add few-shot split configuration * Add function for handling few-shot learning setup * Fix prompt * Remove configuration file * Update dataset from test_same to test_cross for evaluations * Remove whitespace at end of prompt * Fix configuration error: corrected parameter name for the dataset used in few-shot * Fix: Check if results is not empty before processing in lexical substitution task * added the prompts and functions for correct NER and RE execution * Add accuracy measure * Add tasks for the EVALITA-LLM benchmark evaluation * Small changes Add the alias of the task name that will be printed in the final table results. 
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task * chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks. * fix: add information on Evalita-LLM for PR * fix: rename folders and files * fix: remove unused imports * chore: run pre-commit * chore: add task description --------- Co-authored-by: rzanoli <zanoli@fbk.eu> Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
-
- 31 Jan, 2025 1 commit
-
-
asgsaeid authored
* mmlu-pro-plus is implemented * README file is updated * Update README.md with new task: MMLU Pro Plus * Update README.md with new task: MMLU Pro Plus * pre-commit * nit --------- Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local> Co-authored-by: Baber <baber@hey.com>
-
- 29 Jan, 2025 1 commit
-
-
Irina Proskurina authored
* Add Histoires Morales task * Histoires Morales task: fix mixed line endings * Histoires Morales task: fix mixed line endings * Remove tag for a single task * Add some MT for Histoires Morales
-
- 28 Jan, 2025 1 commit
-
-
Irina Proskurina authored
* Add moral stories task * Add moral stories task * Create README.md * Update README.md * Update line endings in moral_stories files
-
- 20 Jan, 2025 1 commit
-
-
Minho Ryu authored
* add hrm8k benchmark for both Korean and English * apply precommit * revise tasks to make models not to directly answer; use zeroshot_cot if possible * add README * Add hrm8k on the task-list --------- Co-authored-by: Baber <baber@hey.com>
-
- 15 Jan, 2025 3 commits
-
-
Shivansh Pachnanda authored
* Add MLQA * add mlqa_common_yaml * add 49 tests of mlqa family * update tasks/README.md --------- * fix: mlqa ast error * nit: removed .yaml ext from template_yaml * nit changes: minor modifications generate_tasks.py * deleted lm_eval/tasks/mlqa/mlqa_common_yaml.yaml * tests updated * nit
-
Hojin Lee authored
* add mbpp * fix some bugs * add README for mbpp * update README * nits --------- Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
Hojin Lee authored
* add custom filter * fix type casting of references * add humaneval * fix a bug in humaneval * add greedy version of humaneval * update tasks README * test humaneval * return multiple metrics * nit * add confirmation to run code tasks * nit * nit --------- Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
- 24 Dec, 2024 1 commit
-
-
Firoj Alam, Scientist, QCRI authored
* added aradice * Added ArabicMMLU Lev Configs * added ArabicMMLU egy configs * Added boolq configs * Added cultural bench configs * added openbookqa configs * Added PiQA configs * added winogrande configs * Added truthfulQA configs * Added aradice group config * Remove deleted files from repository * modified arabimmlu configs * modified metadata versions * fixed formatting using ruff * added aradice tasks information * pre-commit * Updated openbookqa utils * fixed formatting on obqa --------- Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa> Co-authored-by: Baber <baber@hey.com>
-
- 19 Dec, 2024 1 commit
-
-
shivalika-singh authored
* add global mmlu lite * add global mmlu lite * fix bugs * add task README.md * Update README.md * Update tasks README.md * Update README.md * update readme --------- Co-authored-by: shivi <shivalikasingh95@gmail.com>
-
- 26 Nov, 2024 1 commit
-
-
Rima Shahbazyan authored
* score readme added * generate until task's "until" parameter's default value fixed. * score mmlu-pro and agieval added * changed macro accuracy to micro for agieval * Always E removed from agi eval * redundancies removed * MATH added * minor cosmetic changes for math * Licenses added Readme updated * changes for flake8 + license header on math * Score added to readme and precommit was run. * Score added to readme and precommit was run. * Import error fixed * math task bugfix postprocess minor fix * CR for math added * math CR * math task bugfix postprocess minor fix CR for math added * Math cr fixed * reverting the default "until" parameter change and adjusting score task configs
-
- 18 Nov, 2024 1 commit
-
-
Kozzy Voudouris authored
* Add metabench (Kipnis et al. 2024) * Update metabench tasks for full replication of original benchmarks, using publicly available datasets * Remove unnecessary import * Add permute versions of each task, where the answer orders are randomly shuffled. * Add metabench group for easier evaluations * Fix mmlu counts after removing duplicate * Add secondary datasets * Fix f-string error * Fix f-string error for permute processing * Add original hash to outputs for easy matching to original results * Add line break at end of utils files * Remove extra line from winogrande * Reformat for linters * fix multiple input test * appease pre-commit * Add metabench to tasks README * fix multiple input `test_doc_to_text` --------- Co-authored-by: Baber <baber@hey.com>
-
- 16 Nov, 2024 1 commit
-
-
Wonseok Hwang authored
* release kbl-v0.1 * fix linting * remove rag tasks as doc_to_text functions cause trouble * remove remaining rag tasks * remove unnecessary repeat in yaml files and rag dataset in hf-hub * remove unnecessary newline; introduce cfg files in lbox/kbl in hf * Make task yaml files consistent to hf-datasets-config * Make task yaml files consistent to hf-datasets-config * Remove trailing empty space in doc-to-text * Remove unnecessary yaml file * Fix task naming error * trailing space removed
-
- 05 Nov, 2024 1 commit
-
-
mtkachenko authored
* add jaqket_v2 and jcommonsenseqa * remove comments * remove num_beams as it is incompatible with vllm * add jnli + refactor * rename jnla -> jnli * add jsquad + replace colon chars with the Japanese unicode * ignore whitespaces in generation tasks * add marc_ja * add xwinograd + simplify other yamls * add mgsm and xlsum * refactor xlsum * add ja_leaderboard tag * edit README.md * update README.md * add credit + minor changes * run ruff format * address review comments + add group * remove aggregate_metric_list * remove tags * update tasks/README.md
-
- 01 Nov, 2024 1 commit
-
-
Sypherd authored
-
- 30 Oct, 2024 1 commit
-
-
zxcvuser authored
* Add xquad task * Update general README * Run pre-commit
-
- 04 Oct, 2024 2 commits
-
-
zxcvuser authored
* Add catalan_bench * added flores_ca.yaml * Updated some task groupings and readme * Fix create_yamls_flores_ca.py --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
zxcvuser authored
* Add basque_bench * Add flores_eu group * Update _flores_common_yaml * Run linters, updated flores, mgsm, copa, and readme * Apply suggestions from code review Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 03 Oct, 2024 2 commits
-
-
zxcvuser authored
* Add galician_bench * Update xnli_gl path * Add flores_gl group * Update _flores_common_yaml * Updated some task groupings and readme ---------
-
zxcvuser authored
* Add spanish_bench * Add flores_es group * Update _flores_common_yaml * Delete lm_eval/tasks/spanish_bench/escola.yaml * Delete escola from spanish_bench.yaml * Delete escola from README.md * pre-commit run --all-files * Updated some task groupings and readme ---------
-
- 30 Sep, 2024 1 commit
-
-
zxcvuser authored
* Add portuguese_bench * Add flores_pt group * Update _flores_common_yaml * Run linters and update flores and readme
-
- 26 Sep, 2024 1 commit
-
-
Arda authored
* Added TurkishMMLU to LM Evaluation Harness * Fixed COT name * Fixed COT name * Updated Readme * Fixed Test issues * Completed Scan for changed tasks * Updated Readme * Update README.md * fixup task naming casing + ensure yaml template stubs aren't registered --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 10 Sep, 2024 1 commit
-
-
Malikeh Ehghaghi authored
* arabic leaderboard yaml file is added * arabic toxigen is implemented * Dataset library is imported * arabic sciq is added * util file of arabic toxigen is updated * arabic race is added * arabic piqa is implemented * arabic open qa is added * arabic copa is implemented * arabic boolq is added * arabic arc easy is added * arabic arc challenge is added * arabic exams benchmark is implemented * arabic hellaswag is added * arabic leaderboard yaml file metrics are updated * arabic mmlu benchmarks are added * arabic mmlu group yaml file is updated * alghafa benchmarks are added * acva benchmarks are added * acva utils.py is updated * light version of arabic leaderboard benchmarks are added * bugs fixed * bug fixed * bug fixed * bug fixed * bug fixed * bug fixed * library import bug is fixed * doc to target updated * bash file is deleted * results folder is deleted * leaderboard groups are added * full arabic leaderboard groups are added, plus some bug fixes to the light version * Create README.md README.md for arabic_leaderboard_complete * Create README.md README.md for arabic_leaderboard_light * Delete lm_eval/tasks/arabic_leaderboard directory * Update README.md * Update README.md adding the Arabic leaderboards to the library * Update README.md 10% of the training set * Update README.md 10% of the training set * revert .gitignore to prev version * Update lm_eval/tasks/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * updated main README.md * Update lm_eval/tasks/README.md * specify machine translated benchmarks (complete) * specify machine translated benchmarks (light version) * add alghafa to the related task names (complete and light) * add 'acva' to the related task names (complete and light) * add 'arabic_leaderboard' to all the groups (complete and light) * all dataset - not a random sample * added more accurate details to the readme file * added mt_mmlu from okapi * Update lm_eval/tasks/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/tasks/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * updated mt_mmlu readme * renaming 'alghafa' full and light * renaming 'arabic_mmlu' light and full * renaming 'acva' full and light * update readme and standardize dir/file names * running pre-commit --------- Co-authored-by: shahrzads <sayehban@ualberta.ca> Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 23 Aug, 2024 1 commit
-
-
LSinev authored
ACLUE bibtex typo reported to ACL Anthology and fixed here as title in pdf is correct.
-
- 19 Aug, 2024 1 commit
-
-
am-bean authored
* Setting up lingoly task * Testing yaml changes to debug * Adding pre-commit hooks * Functional LingOly benchmark * Renaming files and adding grouping * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores. * Adding LingOly to the README file
-
- 05 Aug, 2024 1 commit
-
-
Amir Hossein Kargaran authored
-
- 04 Aug, 2024 2 commits
-
-
zhabuye authored
-
Amir Hossein Kargaran authored
-
- 14 Jul, 2024 1 commit
-
-
Ben Shoham Ofir authored
* Added MedConceptsQA Benchmark * pre-commit factor * update group name * update in naming * changed name * Changed mcqa to med_concepts_qa prefix * Added med_concepts_qa to README.md * Changed config files according the new format * Updated README --------- Co-authored-by: lintangsutawika <lintang@eleuther.ai>
-
- 12 Jul, 2024 1 commit
-
-
SuperCat authored
* add mmlusr tasks * renamed all tasks names in mmlusr * edit format and readme * added mmlu_sr * mmlu_sr -> mmlusr * update --------- Co-authored-by: lintangsutawika <lintang@eleuther.ai>
-
- 03 Jul, 2024 2 commits
-
-
Hanwool Albert Lee authored
* initial_implementation (test has to be proceeded) * minor fix * revised task name and implemented new task * minor fixes * new tasks implement * minor fix * added 'prompt injection' task * delete prompt injection task (will be implemented at next PR) * trust remote code * Update lm_eval/tasks/inverse_scaling/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * added readme * Update lm_eval/tasks/README.md * Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml * Update lm_eval/tasks/inverse_scaling/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update README.md * precommit? * run precommit on readme --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Nathan Habib authored
* adds leaderboard tasks * Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml * add readme * Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml * modify readme * fix bbh task * fix bbh salient task * modify the readme * Delete lm_eval/tasks/leaderboard/ifeval/README.md * Delete lm_eval/tasks/leaderboard/math/README.md * add leaderboard to the tasks repertory * add announcement about new leaderboard tasks * linting * Update README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * installs ifeval dependency in new_task github workflow --------- Co-authored-by: Nathan Habib <nathan.habib@huggingface.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 25 Jun, 2024 1 commit
-
-
Brendan Murphy authored
* Initial configuration * Using the validation set for the test set, because the test set on HF doesn't have labels * Probably just makes more sense to have validation be validation * fix format ; add docs to tasks/README.md * fix format --------- Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 20 Jun, 2024 1 commit
-
-
Julen Etxaniz authored
* add bertaqa tasks * rename basquetrivia-->bertaqa ; make template stub not .yaml * add bertaqa entry to lm_eval/tasks/README.md --------- Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 19 Jun, 2024 2 commits
-
-
Yazeed Alnumay authored
* Added ArabicMMLU * Rename `ammlu` to `arabicmmlu`
-
Zafir Stojanovski authored
* init paloma benchmark * pre-process in utils function * add `task_alias` * updated task aliases * Update paloma_dolma-v1_5.yaml * Update paloma_twitterAAE_HELM_fixed.yaml * Update paloma_dolma_100_programing_languages.yaml --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
- 11 Jun, 2024 1 commit
-
-
Hailey Schoelkopf authored
* Update README.md * Delete lm_eval/tasks/ammlu directory
-
- 06 Jun, 2024 1 commit
-
-
Zafir Stojanovski authored
* added tasks and task family descriptors * configs for the new lambada translations * continue work on task list w/ links; slightly reorganize README * Apply suggestions from code review * Rename file so that it'll preview in Github when viewing lm_eval/tasks folder * Update new_task_guide.md * Update README.md * run linter * Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs * fix typo * update `lm_eval/tasks/README.md` with task description --------- Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com> Co-authored-by: anthony <anthonydipofi@gmail.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 03 Jun, 2024 1 commit
-
-
anthony-dipofi authored
* added tasks and task family descriptors * continue work on task list w/ links; slightly reorganize README * Apply suggestions from code review * Rename file so that it'll preview in Github when viewing lm_eval/tasks folder * Update new_task_guide.md * Update README.md * run linter * Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs * fix typo * Apply suggestions from code review Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * apply format --------- Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-