- 05 Sep, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 28 Aug, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 25 Aug, 2024 1 commit
-
-
Baber Abbasi authored
* chat template hotfix * pre-commit
-
- 23 Aug, 2024 3 commits
-
-
Cameron Witkowski authored
* Created DUP eval code for gsm8k * asdiv * Fixed fewshot=8 issue * added results to .gitignore * reverted unnecessary changes and moved results + gsm8k_dup out of repo to prepare for pull req * fixed whitespace and unintentional hardcoded version change information * created mbpp task * Reverted changes re. mbpp to save for a future Pull req * reverted metrics.py to previous commit * updated asdiv readme to include information about new asdiv_cot_llama task * Apply suggestions from code review --------- Co-authored-by:
Alexander Detkov <alexander.d.detkov@gmail.com> Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
eyuansu62 authored
-
LSinev authored
ACLUE BibTeX typo reported to ACL Anthology and fixed here, since the title in the PDF is correct.
-
- 22 Aug, 2024 2 commits
-
-
Baber Abbasi authored
-
lxning authored
* fix the regex string in yaml file * Update samplers.py --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
- 20 Aug, 2024 3 commits
-
-
Geralt authored
* mela * Update mela_en.yaml * Create _mela.yaml --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
Nathan Habib authored
-
lewtun authored
* Update IFEval dataset to official one This PR updates the IFEval dataset to the one hosted under the Google org: https://huggingface.co/datasets/google/IFEval Note the main change is an updated prompt from this commit in the GitHub repo: https://github.com/google-research/google-research/commit/26d8ccdab6fec61b5c83ad6327ea8bda9e580288 * Update ifeval.yaml --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 19 Aug, 2024 3 commits
-
-
Yen-Ting Lin authored
* add taiwan truthful qa * add tmlu * Add .gitignore entries for evals/ and harness_eval_main_log.txt, and add harness_eval.slurm script * add pega eval and legal eval * add ccp eval * Update .gitignore and harness_eval.slurm * Add trust_remote_code and wandb_args to harness_eval.slurm, and add run_all.sh script * Add Pega MMLU task and configuration files * Add new models and update parameters in run_all.sh * Add UMTCEval tasks and configurations * Update dataset paths and output path * Update .gitignore and harness_eval.slurm, and modify _generate_configs.py * Update SLURM script and add new models * clean for pr * Update lm_eval/tasks/tmlu/default/tmlu.yaml Co-authored-by:
Lintang Sutawika <lintang@sutawika.com> * adjust tag name * removed group alias from tasks * format --------- Co-authored-by:
Lintang Sutawika <lintang@sutawika.com> Co-authored-by:
lintangsutawika <lintang@eleuther.ai> Co-authored-by:
Yen-Ting Adam, Lin <r08944064@csie.ntu.edu.tw>
-
Uminosachi authored
-
am-bean authored
* Setting up lingoly task * Testing yaml changes to debug * Adding pre-commit hooks * Functional LingOly benchmark * Renaming files and adding grouping * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores. * Adding LingOly to the README file
-
- 16 Aug, 2024 1 commit
-
-
Cameron7195 authored
* Created a new task for gsm8k which corresponds to the cot settings and prompt formatting described by Meta to evaluate Llama. Useful for replicating Llama performance on GSM8K benchmark. * fixing formatting * fixing formatting
-
- 15 Aug, 2024 1 commit
-
-
am-bean authored
* Setting up lingoly task * Testing yaml changes to debug * Adding pre-commit hooks * Functional LingOly benchmark * Renaming files and adding grouping * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
-
- 10 Aug, 2024 1 commit
-
-
Yu Shi Jie authored
-
- 09 Aug, 2024 1 commit
-
-
Jungwhan Kim authored
* add keep trailing newline * apply ruff-format * add prompt unit test * increment the version of tasks that have description with whitespace * remove white spaces of leaderboard bbh * update MMLU expected versions in output * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again --------- Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 07 Aug, 2024 1 commit
-
-
Yu Shi Jie authored
* fixed gsm * GSM-Plus: remove dataset_name line
-
- 05 Aug, 2024 4 commits
-
-
Yu Shi Jie authored
* added gsm_plus * formatted dataset to have train-test-splits * README.md for gsm-plus * Update README.md * GSM-Plus: added gsm_plus_mini * GSM-Plus: attribution to original dataset * Update README.md * Update README.md * Update README.md --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
Yu Shi Jie authored
* initialized mmlu_pro task * added generative mmlu-pro * added cot fewshot for mmlu-pro * Initial commit * updated mmlu-pro to take on 3 splits: test, val, dev * mmlu-pro: added continuation and flan_cot_zeroshot * added README.md for mmlu_pro * removed * update files * moved files out, and removed unused versions * updated * mmlu_pro: -changed task 'other' to 'miscellaneous' there is already a group named 'other' task and group with the same alias (e.g. mmlu_pro_other_generative) throws an error -fixed yaml backslash escape for fewshot cot * changed choices -> options in yaml config to fit dataset schema * ONLY FOR DEFAULT: fixed yaml file to use variable number of choices * mmlu-pro: fixed doc_to_text/choice/target configs for all variants * mmlu-pro: minor fixes * mmlu-pro/default: aligned with mmlu updates * mmlu-pro: update yaml content in line with mmlu * mmlu-pro: fixed mislabelling of task (math->chemistry) * mmlu-pro: fixed yaml formatting * add custom fewshot doc_to_text, target, and choice * add process for each subtask * add process for each subtask * pre-commit * pre-commit * format * resolved left out merge * deleted folders + updated readme * Update evaluator.py * Update evaluator.py --------- Co-authored-by:
Yu Shi Jie <shijie@tensorplex.ai> Co-authored-by:
lintangsutawika <lintang@eleuther.ai> Co-authored-by:
root <root@455bdd73-01.cloud.together.ai> Co-authored-by:
Lintang Sutawika <lintang@sutawika.com>
-
Hailey Schoelkopf authored
-
Amir Hossein Kargaran authored
-
- 04 Aug, 2024 2 commits
-
-
zhabuye authored
-
Amir Hossein Kargaran authored
-
- 01 Aug, 2024 1 commit
-
-
Nathan Weinberg authored
* refactor: move scipy and sklearn module imports to func imports Signed-off-by:
Nathan Weinberg <nweinber@redhat.com> * refactor: consolidate weighted_f1_score func into lm_eval utils Signed-off-by:
Nathan Weinberg <nweinber@redhat.com> * lint: allow for utils file to have unused imports this allows for shared functions to be defined only once while allowing for the YAML function importing to continue working Signed-off-by:
Nathan Weinberg <nweinber@redhat.com> --------- Signed-off-by:
Nathan Weinberg <nweinber@redhat.com>
-
- 20 Jul, 2024 1 commit
-
-
Jennifer Cwagenberg authored
-
- 18 Jul, 2024 1 commit
-
-
Jungwhan Kim authored
-
- 17 Jul, 2024 1 commit
-
-
jab13x authored
-
- 15 Jul, 2024 1 commit
-
-
Lintang Sutawika authored
-
- 14 Jul, 2024 1 commit
-
-
Ben Shoham Ofir authored
* Added MedConceptsQA Benchmark * pre-commit factor * update group name * update in naming * changed name * Changed mcqa to med_concepts_qa prefix * Added med_concepts_qa to README.md * Changed config files according to the new format * Updated README --------- Co-authored-by: lintangsutawika <lintang@eleuther.ai>
-
- 12 Jul, 2024 4 commits
-
-
Jess authored
* add afrixnli to task * add chat completion * remove chat completion -untested * afrimmlu added * afrimmlu folder update * afrimmlu folder update * updated prompt * remove print * add afrimgsm -direct * add squad metric * fix bash script * remove direct util, update common yaml * remove print * add few shot. metric fixes * fix direct path, add bash script for gpt models * added translate test * update afrixnli tasks * update afrixnli tasks * update metrics for afrixnli * prompt translations fix * prompt translations fix * filter and metric fix -mgsm * remove squad metric * remove squad metric * add f1 score to mgsm * add f1 score to mgsm * update native-direct with lin * change f1 function * add lin to utils * add utils * remove test limit * remove test configs * add swahili to mmlu * change eng to ewe in ewe yaml mmlu * add squad metric to mgsm, remove whitespace filter * added translate test * added afrixnli_translate * fix exact match valueError * fix exact match valueError * restructure mmlu folder * spacing * remove afrimmlu_translate folder * add utility * format task name, clean ups * modified mgsm * update on afrimgsm * update on afrimgsm * removed utils * other mgsm varieties * other mgsm varieties * adding translate direct * Update translate_direct_yaml * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model * edit for open models * Update translate_direct_yaml * add verbalizer for xnli * change xnli from multiple choice to generate * add manual accuracy scores * revert xnli to multiple choice * change afrimgsm utils * revert xnli to multiple_choice * cleanups and readmes * remove openai fixes and unused regex * pr review changes * revert metrics.py, task.py and extraction.py to main version --------- Co-authored-by:
Israel Abebe Azime <azime@cg.uni-saarland.de> Co-authored-by:
Israel Abebe Azime <se.israel.abebe@gmail.com>
-
SuperCat authored
* add mmlusr tasks * renamed all task names in mmlusr * edit format and readme * added mmlu_sr * mmlu_sr -> mmlusr * update --------- Co-authored-by: lintangsutawika <lintang@eleuther.ai>
-
Wonung Kim authored
-
Hailey Schoelkopf authored
-
- 11 Jul, 2024 1 commit
-
-
anthony-dipofi authored
* add and ; move task list newline logic to new TaskManager.list_all_tasks() method * format table list into markdown table; add config location column * add Output Type column * add logic for printing table of tags separately * merge with main and fix conflicts ; update docstrings --------- Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 08 Jul, 2024 3 commits
-
-
Pankaj Mathur authored
leaderboard README.md missing mmlu-pro group and task
-
Elron Bandel authored
* Updated unitxt loading Signed-off-by:
Elron Bandel <elron.bandel@ibm.com> * Revert change to general Readme Signed-off-by:
Elron Bandel <elron.bandel@ibm.com> * Adjust fda, squadv2, squad_completion and swde to accept config in the constructor Signed-off-by:
Elron Bandel <elron.bandel@ibm.com> * Fix scrolls Signed-off-by:
elronbandel <elron.bandel@ibm.com> * Update documentation Signed-off-by:
elronbandel <elron.bandel@ibm.com> * Enforce backward compatibility Signed-off-by:
elronbandel <elron.bandel@ibm.com> * Format unitxt class Signed-off-by:
elronbandel <elron.bandel@ibm.com> --------- Signed-off-by:
Elron Bandel <elron.bandel@ibm.com> Signed-off-by:
elronbandel <elron.bandel@ibm.com> Co-authored-by:
haileyschoelkopf <hailey@eleuther.ai>
-
Lintang Sutawika authored
* add group_config arg * add a group config that allows disabling table for group score and group aggregate in general * fixed size configuration * adjust config * add group config * adjust mmlu to use group_config * fixed args input in aggregate_subtask_metrics * fixed issues related to printing alias of group and updated yaml * update all mmlu variants to include group_config * edit format * modify mmlu tasks * adjust group to also be a configurable group * add configurable group * simplify get_task_list * adjust group scoring with using ConfigurableGroup * adjust args * update mmlu * update mmlu * update to work with new group and task configuration * readd group_agg * readd files * move prepare_print_tasks to evaluator_utils * sort set to False by default, fix predict_only arg * add version for groups * reversed task list * update additional condition when loading a group in a group yaml * update truthfulqa * add description regarding tags replacing group * replace group to tag * fixed conditional statement * remove warning * update loading of task group and newly added tags * reformat with pre-commit * fixed info log * update * fix bug * fix bug * use task id to differentiate tasks * convert all groups to configurable groups * use task_id * reformat * add task_id for python tasks as well * add task_id for python tasks as well * add task_id for python tasks as well * revert truthfulqa * revert mmlu tasks * new mmlu config * new group config parameter `tag_to_task` * Update truthfulqa_mc2.yaml * reformat * add _process_group_config * adjust task_id * add get_subtask_list function to get proper subtask list * group config to_dict update * remove tag check * update mmlu * fix config passing issues * add test yaml * format fix * add documentation * corner case for single tag being called * fix indentation * formatting * update all mmlu variants * Update docs/task_guide.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * remove group_alias * Update docs/task_guide.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * remove version for metadata * Update docs/task_guide.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * update mmlu/ * removed " " in make_table * change how aggregate_metric is loaded * change how aggregate_metric is loaded * update aggregate_metric arg * update format * update format * some docs fixes * add groups for agieval, aexams, aclue * add more explicit aggregation groups * add more groupings / tags distinctions * add more groupings * more groupings * add many explicit group configs * add many explicit group configs * add more explicit group configs * add more explicit group configs * add more error msgs, agg_metric -> agg_metric_list * some docs updates * update task_id to be updateable and uses group:task format * make KMMLU a tag for now * update docs * don't duplicate task names * fix merge conflicts? * giving this a try * clean up diff * switch mmlu variants over to using * don't use to-be-deprecated group: config field in overview notebook * Python tasks which subclass ConfigurableTask now run * update mmlu * pre-commit format * fixed sorting for multi-level printing * move group api to separate file * fix bbh aggregation filter usage * track api/group.py * adjust group and tags loading * make explicit group configs for leaderboard and other newer tasks * fix arabicmmlu * update * change arabicmmlu template name??? * update group alias * fix printing bugs * check table printing is correct ; update tests * use mmlu_stem to have a group included in print tests --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by:
haileyschoelkopf <hailey@eleuther.ai>
-
- 03 Jul, 2024 1 commit
-
-
Hanwool Albert Lee authored
* initial_implementation (test has to be proceeded) * minor fix * revised task name and implemented new task * minor fixes * new tasks implement * minor fix * added 'prompt injection' task * delete prompt injection task (will be implemented at next PR) * trust remote code * Update lm_eval/tasks/inverse_scaling/README.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * added readme * Update lm_eval/tasks/README.md * Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml * Update lm_eval/tasks/inverse_scaling/README.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update README.md * precommit? * run precommit on readme --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by:
haileyschoelkopf <hailey@eleuther.ai>
-