- 28 Jan, 2025 1 commit
Baber Abbasi authored
* feat: drop Python 3.8 support * feat: drop Python 3.8 tests * pre-commit * handle chat_template for multiple input
-
- 19 Jan, 2025 1 commit
Baber Abbasi authored
* update pre-commit
-
- 17 Jan, 2025 1 commit
Baber Abbasi authored
* switch arg
-
- 15 Jan, 2025 2 commits
Baber Abbasi authored
* add assistant prefix * add arc_challenge from llama * nit * nit * nit * add assistant prefix * add mmlu_llama * nit * nit * Revert "nit" This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc. * fix regex bug * add assistant_prefix to vllm * add `Question:` * add mmlu_pro * add fewshot assistant_prefix * use `assistant_prefill` * typehints * nits * nits * add to docs * add readme
-
Hojin Lee authored
* add custom filter * fix type casting of references * add humaneval * fix a bug in humaneval * add greedy version of humaneval * update tasks README * test humaneval * return multiple metrics * nit * add confirmation to run code tasks * nit * nit
---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
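For context on the code-execution safeguard added here, a minimal sketch of running HumanEval through the Python API; the model is a placeholder, `HF_ALLOW_CODE_EVAL` is required by the underlying `code_eval` metric, and `confirm_run_unsafe_code` is an assumed name for the confirmation switch this commit introduces.

```python
import os

import lm_eval

# The code_eval metric executes model-generated code, so it refuses to run
# unless this environment variable is explicitly set.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["humaneval"],
    confirm_run_unsafe_code=True,  # assumed keyword for the new confirmation step
)
```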
-
- 04 Jan, 2025 1 commit
Baber Abbasi authored
* remove yaml extension from phrases_va_common * remove yaml extension from winogenerated * remove yaml extension from phrases_es * no cache debug logging when not used
-
- 29 Nov, 2024 1 commit
Baber Abbasi authored
-
- 28 Nov, 2024 1 commit
Baber Abbasi authored
* allow !function filters * bugfix * nit * add `filter` to logged samples * add `filter` and `metric` to logged samples for identification * convert `metric` to `metrics`: list
-
- 11 Nov, 2024 1 commit
Baber Abbasi authored
* batch commit * Revert "batch commit" This reverts commit d859d1ca. * batch commit * checkout from main * checkout from main * checkout from main * checkout from main * checkout from main * cleanup * cleanup * cleanup * cleanup * cleanup * cleanup * cleanup * cleanup * cleanup * Chat template fix (#7) * cleanup * cleanup * cleanup * linting * fix tests * add ifeval install to new_task CI * Revert "add ifeval install to new_task CI" This reverts commit 1d19449bb7fbfa05d51e7cd20950475eae533bf1. * adds leaderboard tasks (#1) * adds leaderboard tasks * Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml * add readme * Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml * modify readme * fix bbh task * fix bbh salient task * modify the readme * Delete lm_eval/tasks/leaderboard/ifeval/README.md * Delete lm_eval/tasks/leaderboard/math/README.md * add leaderboard to the tasks repertory * add announcement about new leaderboard tasks * linting * Update README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * installs ifeval dependency in new_task github workflow --------- Co-authored-by: Nathan Habib <nathan.habib@huggingface.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fix math parser * fix math parser * fix version * add warning about chat template
---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.co>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib19@gmail.com>
-
- 07 Nov, 2024 1 commit
Baber Abbasi authored
-
- 08 Oct, 2024 2 commits
Hailey Schoelkopf authored
-
Baber Abbasi authored
* max_images are passed on to vllm's `limit_mm_per_prompt` * replace max image placeholders in string * handle chat_template error * move `fewshot_random_seed` to global
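A hedged sketch of how these options fit together from the Python entry point: `max_images` is forwarded to vLLM's `limit_mm_per_prompt`, and `fewshot_random_seed` is now an evaluator-level setting. The `vllm-vlm` model type and the pretrained checkpoint below are assumptions for illustration.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm-vlm",  # assumed name of the vLLM multimodal backend
    model_args="pretrained=Qwen/Qwen2-VL-2B-Instruct,max_images=2",  # max_images -> limit_mm_per_prompt
    tasks=["mmmu_val"],
    apply_chat_template=True,
    fewshot_random_seed=1234,  # fewshot sampling seed, now set globally by the evaluator
)
```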
-
- 07 Oct, 2024 1 commit
Baber Abbasi authored
* bugfix * pre-commit
-
- 04 Oct, 2024 1 commit
Baber Abbasi authored
-
- 17 Sep, 2024 1 commit
Baber Abbasi authored
-
- 13 Sep, 2024 1 commit
Lintang Sutawika authored
* add WIP hf vlm class * add doc_to_image * add mmmu tasks * fix merge conflicts * add lintang's changes to hf_vlms.py * fix doc_to_image * added yaml_path for config-loading * revert * add line to process str type v * update * modeling cleanup * add aggregation for mmmu * rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP) * implemented doc_to_image * update doc_to_image to accept list of features * update functions * readd image processed * update args process * bugfix for repeated images fed to model * push WIP loglikelihood code * commit most recent code (generative ; qwen2-vl testing) * preliminary image_token_id handling * small mmmu update: some qs have >4 mcqa options * push updated modeling code * use processor.apply_chat_template * add mathvista draft * nit * nit * ensure no footguns in text<>multimodal LM<>task incompatibility * add notification to readme regarding launch of prototype! * fix compatibility check * reorganize mmmu configs * chat_template=None * add interleave chat_template * add condition * add max_images; interleave=true * nit * testmini_mcq * nit * pass image string; convert img * add vllm * add init * vlm add multi attr * fixup * pass max images to vllm model init * nit * encoding to device * fix HFMultimodalLM.chat_template ? * add mmmu readme * remove erroneous prints * use HFMultimodalLM.chat_template ; restore tasks/__init__.py * add docstring for replace_placeholders in utils * fix `replace_placeholders`; set image_string=None * fix typo * cleanup + fix merge conflicts * update MMMU readme * del mathvista * add some sample scores * Update README.md * add log msg for image_string value
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
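As a rough usage sketch for the multimodal prototype described above (`hf-multimodal` is the assumed registered model type, and the checkpoint is purely illustrative):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=llava-hf/llava-1.5-7b-hf,max_images=2",  # placeholder checkpoint
    tasks=["mmmu_val"],
    apply_chat_template=True,  # multimodal models are driven through processor.apply_chat_template
)
```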
-
- 04 Sep, 2024 1 commit
Baber Abbasi authored
* default chat template method fix * move chat_template to TemplateLM * remove hotfix * handle openai `chat_template` * Update lm_eval/api/model.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add 'max_tokens' to gen_kwargs * pre-commit
---------
Co-authored-by: KonradSzafer <szafer.konrad@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 28 Aug, 2024 1 commit
Hailey Schoelkopf authored
* fix revision type * allow for None-input loglikelihood reqs to be cached * handle no remaining cache items * pre-commit * change cache_hook.add_partial(loglikelihood_rolling...) convention
---------
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
-
- 22 Aug, 2024 1 commit
lxning authored
* fix the regex string in yaml file * Update samplers.py
---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
- 15 Aug, 2024 1 commit
am-bean authored
* Setting up lingoly task * Testing yaml changes to debug * Adding pre-commit hooks * Functional LingOly benchmark * Renaming files and adding grouping * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
-
- 05 Aug, 2024 1 commit
Yu Shi Jie authored
* initialized mmlu_pro task * added generative mmlu-pro * added cot fewshot for mmlu-pro * Initial commit * updated mmlu-pro to take on 3 splits: test, val, dev * mmlu-pro: added continuation and flan_cot_zeroshot * added README.md for mmlu_pro * removed * update files * moved files out, and removed unused versions * updated * mmlu_pro: changed task 'other' to 'miscellaneous' (there is already a group named 'other'; a task and group with the same alias, e.g. mmlu_pro_other_generative, throws an error); fixed yaml backslash escape for fewshot cot * changed choices -> options in yaml config to fit dataset schema * ONLY FOR DEFAULT: fixed yaml file to use variable number of choices * mmlu-pro: fixed doc_to_text/choice/target configs for all variants * mmlu-pro: minor fixes * mmlu-pro/default: aligned with mmlu updates * mmlu-pro: update yaml content in line with mmlu * mmlu-pro: fixed mislabelling of task (math->chemistry) * mmlu-pro: fixed yaml formatting * add custom fewshot doc_to_text, target, and choice * add process for each subtask * add process for each subtask * pre-commit * pre-commit * format * resolved left out merge * deleted folders + updated readme * Update evaluator.py * Update evaluator.py
---------
Co-authored-by: Yu Shi Jie <shijie@tensorplex.ai>
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
Co-authored-by: root <root@455bdd73-01.cloud.together.ai>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
-
- 01 Aug, 2024 1 commit
Nathan Weinberg authored
* refactor: move scipy and sklearn module imports to func imports Signed-off-by: Nathan Weinberg <nweinber@redhat.com> * refactor: consolidate weighted_f1_score func into lm_eval utils Signed-off-by: Nathan Weinberg <nweinber@redhat.com> * lint: allow for utils file to have unused imports this allows for shared functions to be defined only once while allowing for the YAML function importing to continue working Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
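A minimal sketch of the pattern these commits apply, assuming the consolidated helper lives in `lm_eval.utils`: the sklearn import moves inside the function so the dependency is only needed when the metric is actually used.

```python
def weighted_f1_score(items):
    # imported lazily so sklearn is only required when this metric runs
    from sklearn.metrics import f1_score

    golds, preds = zip(*items)
    return f1_score(golds, preds, average="weighted")
```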
-
- 22 Jul, 2024 1 commit
Baber Abbasi authored
* refactor pad_token handling to fn * fix docs * add pad_token_handling to vllm * start on API superclass * don't detokenize the returned logits * streamline vllm tokenizer * add type hint * pre-commit * seems to be in working order * add model to init * refactor api models * nit * cleanup * add pbar * fix type hints * change optional dependencies * json encode chat template * add type hints * deal with different prompt input requirements * nits * fix * cache inside async * fix * fix * nits * nits * nits * nit * fixup * fixup * nit * add dummy retry * add dummy retry * handle imports; skip failing test * add type hint * add tests * add dependency to tests * add package names to exception * nit * docs; type hints * handle api key * nit * tokenizer bug * fix tokenizer * nit * nit * add better error messages * nit * remove decorator * CI: install api dep * revert evaluator.py * consolidate * consolidate * nits * nit * fix typealias * nit * nit * nit * Update lm_eval/models/api_models.py typo Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/models/openai_completions.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/models/anthropic_llms.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/models/api_models.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fix typo * add news section * add info for API * pre-commit * typo * fix bug: unpack loglikelihood requests * fix bug: shared gen_kwargs mutated * nit: handle copy properly * Update README.md * Update README.md * Update README.md * Update api_models.py * Update README.md
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
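A hedged example of the API-model path this refactor introduces: `local-completions` targets an OpenAI-compatible endpoint, and `base_url` / `num_concurrent` are the kind of arguments the shared superclass accepts (names and URL below are illustrative).

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=my-served-model,"                              # name as served by the endpoint
        "base_url=http://localhost:8000/v1/completions,"      # illustrative local server
        "num_concurrent=8"                                    # parallel requests
    ),
    tasks=["gsm8k"],
)
```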
-
- 15 Jul, 2024 1 commit
Lintang Sutawika authored
-
- 12 Jul, 2024 1 commit
Jess authored
* add afrixnli to task * add chat completion * remove chat completion -untested * afrimmlu added * afrimmlu folder update * afrimmlu folder update * updated prompt * remove print * add afrimgsm -direct * add squad metric * fix bash script * remove direct util, update common yaml * remove print * add few shot. metric fixes * fix direct path, add bash script for gpt models * added translate test * update afrixnli tasks * update afrixnli tasks * update metrics for afrixnli * prompt translations fix * prompt translations fix * filter and metric fix -mgsm * remove squad metric * remove squad metric * add f1 score to mgsm * add f1 score to mgsm * update native-direct with lin * change f1 function * add lin to utils * add utils * remove test limit * remove test configs * add swahili to mmlu * change eng to ewe in ewe yaml mmlu * add squad metric to mgsm, remove whitespace filter * added translate test * added afrixnli_translate * fix exact match valueError * fix exact match valueError * restructure mmlu folder * spacing * remove afrimmlu_translate folder * add utility * format task name, clean ups * modified mgsm * update on afrimgsm * update on afrimgsm * removed utils * other mgsm varieties * other mgsm varieties * adding translate direct * Update translate_direct_yaml * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model * edit for open models * Update translate_direct_yaml * add verbalizer for xnli * change xnli from multiple choice to generate * add manual accuracy scores * revert xnli to multiple choice * change afrimgsm utils * revert xnli to multiple_choice * cleanups and readmes * remove openai fixes and unused regex * pr review changes * revert metrics.py, task.py and extraction.py to main version
---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
-
- 08 Jul, 2024 2 commits
Lintang Sutawika authored
* add group_config arg * add a group config that allows disabling table for group score and group aggregate in general * fixed size configuration * adjust config * add group config * adjust mmlu to use group_config * fixed args input in aggregate_subtask_metrics * fixed issues related to printing alias of group and updated yaml * update all mmlu variants to include group_config * edit format * modify mmlu tasks * adjust group to also be a configurable group * add configurable group * simplify get_task_list * adjust group scoring with using ConfigurableGroup * adjust args * update mmlu * update mmlu * update to work with new group and task configuration * readd group_agg * readd files * move prepare_print_tasks to evaluator_utils * sort set to False by default, fix predict_only arg * add version for groups * reversed task list * update additional condition when loading a group in a group yaml * update truthfulqa * add description regarding tags replacing group * replace group to tag * fixed conditional statement * remove warning * update loading of task group and newly added tags * reformat with pre-commit * fixed info log * update * fix bug * fix bug * use task id to differentiate tasks * convert all groups to configurable groups * use task_id * reformat * add task_id for python tasks as well * add task_id for python tasks as well * add task_id for python tasks as well * revert truthfulqa * revert mmlu tasks * new mmlu config * new group config parameter `tag_to_task` * Update truthfulqa_mc2.yaml * reformat * add _process_group_config * adjust task_id * add get_subtask_list function to get proper subtask list * group config to_dict update * remove tag check * update mmlu * fix config passing issues * add test yaml * format fix * add documentation * corner case for single tag being called * fix indentation * formatting * update all mmlu variants * Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * remove group_alias * Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * remove version for metadata * Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * update mmlu/ * removed " " in make_table * change how aggregate_metric is loaded * change how aggregate_metric is loaded * update aggregate_metric arg * update format * update format * some docs fixes * add groups for agieval, aexams, aclue * add more explicit aggregation groups * add more groupings / tags distinctions * add more groupings * more groupings * add many explicit group configs * add many explicit group configs * add more explicit group configs * add more explicit group configs * add more error msgs, agg_metric -> agg_metric_list * some docs updates * update task_id to be updateable and uses group:task format * make KMMLU a tag for now * update docs * don't duplicate task names * fix merge conflicts? * giving this a try * clean up diff * switch mmlu variants over to using * don't use to-be-deprecated group: config field in overview notebook * Python tasks which subclass ConfigurableTask now run * update mmlu * pre-commit format * fixed sorting for multi-level printing * move group api to separate file * fix bbh aggregation filter usage * track api/group.py * adjust group and tags loading * make explicit group configs for leaderboard and other newer tasks * fix arabicmmlu * update * change arabicmmlu template name??? * update group alias * fix printing bugs * check table printing is correct ; update tests * use mmlu_stem to have a group included in print tests
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Choyunhui authored
Co-authored-by: yhjo <yhjo@suresofttech.com>
-
- 01 Jul, 2024 1 commit
Hailey Schoelkopf authored
-
- 28 Jun, 2024 1 commit
Baber Abbasi authored
-
- 25 Jun, 2024 1 commit
Baber Abbasi authored
* refactored `lm.apply_chat_template` * nit * fix weird type error * fixed! * skip failing test * pre-commit run all * add type hints * nit * nit * fixup
-
- 13 Jun, 2024 1 commit
Baber Abbasi authored
* `samples` is newline delimited * updated git and pre-commit * appease pre-commit * nit * Revert back for now * Revert for now
---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
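Since the samples files are now newline-delimited JSON, they can be read back line by line; the filename below is illustrative.

```python
import json

# each line of the samples file is one logged document
with open("samples_gsm8k_2024-06-13T00-00-00.000000.jsonl") as f:  # illustrative path
    samples = [json.loads(line) for line in f]
```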
-
- 03 Jun, 2024 1 commit
KonradSzafer authored
* initial chat template * tokenizer attribute check * variable rename * interface update * system instruction * system inst default update * fewshot as multiturn * typing update * indent update * added comments * Adding a fewshot in a more readable way * linting * Moved apply chat template to LM * multiturn alternation fix * cache key update * apply chat template method fix * add system prompt hash to cache_key * tokenizer name property for cache_key * property name fix * linting backward compatibility fix * docs and errors update * add documentation on adding chat template compatibility to model_guide * fewshot as multiturn check fix * saving system inst and chat template in results * eval tracker update * docs update * Apply suggestions from code review Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
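A sketch of the chat-template controls added in this PR, from the Python API; the keyword names mirror the CLI flags (`--apply_chat_template`, `--fewshot_as_multiturn`, `--system_instruction`), and the model is a placeholder.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # placeholder instruct model
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,    # render requests with the tokenizer's chat template
    fewshot_as_multiturn=True,   # present fewshot examples as alternating user/assistant turns
    system_instruction="You are a helpful assistant.",
)
```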
-
- 31 May, 2024 1 commit
Clémentine Fourrier authored
* init test 1 * fix * this format seems to be working - need to update all other tasks with the new format * bbh with few shot format * fix fewshot bbh * add mmlu flan cot * samples of cot * kmmlu * fix gsm8k * update keys for mmlu * minerva math * bbh * fix * fix samples * small fixes to templates * last prompt format change * fixing prompt * fixed minerva math format * rm accidentally committed file * added doc for few shot samples * Update lm_eval/loggers/evaluation_tracker.py * Update lm_eval/loggers/evaluation_tracker.py * Update docs/new_task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * added check in sampler per code review * added the system from a function, plus an example in minerva math * style * Apply suggestions from code review Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fix unit tests 1 * forcing use of test split
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 24 May, 2024 2 commits
Hailey Schoelkopf authored
* add handling for bootstrap_iters=0 case * add more detail to docstring * run precommit
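With the handling added here, a bootstrap_iters of 0 skips bootstrap stderr estimation entirely, which is useful for quick runs; for example:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["arc_easy"],
    bootstrap_iters=0,  # 0 disables bootstrapping of stderr estimates
)
```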
-
Lintang Sutawika authored
`gold_one_hot` needs to follow the dimension of predictions so that it still works when `--limit` is used and the indexes in gold do not cover all gold indexes.
-
- 06 May, 2024 1 commit
LSinev authored
* Added fewshot sampling seeds to evaluator.simple_evaluate signature. Way to control seed of fewshot sampling may help with #1591 * Added ability for custom sampler for ConfigurableTask. May be set in config like:
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
* explicitly set fewshot random generator seed for HFLM generate_until_task test * add backward compatibility for three args seed setup * save seeds info to logs/reports
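A sketch of what a sampler referenced via `!function utils.MyFewshotSampler` might look like; it assumes the `ContextSampler` base class in `lm_eval.api.samplers`, whose `sample(n)` hook picks the fewshot documents.

```python
from lm_eval.api.samplers import ContextSampler


class MyFewshotSampler(ContextSampler):
    def sample(self, n):
        # deterministic choice: always take the first n fewshot docs
        # instead of drawing them with self.rnd
        return self.docs[:n]
```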
-
- 26 Apr, 2024 1 commit
Nikita Lozhnikov authored
* Add register_filter decorator * Add register_filter docs
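A hedged example of the new decorator in use: register a `Filter` subclass under a name that task configs can refer to (the import paths follow the current `lm_eval.api` layout but may differ slightly).

```python
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("lowercase")
class LowercaseFilter(Filter):
    def apply(self, resps, docs):
        # resps holds, for each doc, the list of model responses for that doc
        return [[r.lower() for r in doc_resps] for doc_resps in resps]
```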
-
- 25 Apr, 2024 1 commit
Lintang Sutawika authored
* Update task.py * Update __init__.py
-
- 25 Mar, 2024 1 commit
Lintang Sutawika authored
* fix on --task list * add fixes to tokenization * differentiate encoding for seq2seq and decoder * return token setting * format for pre-commit * Seq2seq fix, pt2 (#1630) * getting model class only when defined * encode_pair handles None, add_special_tokens turned into dict with default value
---------
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
-
- 20 Mar, 2024 1 commit
Hailey Schoelkopf authored
* make vllm use prefix_token_id ; have prefix_token_id be optional method to define * custom_prefix_token_id wasn't set if not passed
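A sketch of the optional hook described here: a model class can override `prefix_token_id` to control which token is prepended when scoring continuations with no context (subclassing `HFLM` purely for illustration; by default the BOS/EOS id is used).

```python
from lm_eval.models.huggingface import HFLM


class MyLM(HFLM):
    @property
    def prefix_token_id(self):
        # used as the stand-in context token for loglikelihood requests
        return self.tokenizer.bos_token_id
```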
-