- 08 Jul, 2024 1 commit
-
-
Lintang Sutawika authored
* add greoup_config arg
* add a group config that allows disabling table for group score and group aggregate in general
* fixed size configuration
* adjust config
* add group config
* adjust mmlu to use group_config
* fixed args input in aggregate_subtask_metrics
* fixed issues related to printing alias of group and updated yaml
* update all mmlu variants to include group_config
* edit format
* modify mmlu tasks
* adjust group to also be a configurable group
* add configurable group
* simplify get_task_list
* adjust group scoring with using ConfigurableGroup
* adjust args
* update mmlu
* update mmlu
* update to work with new group and task configuration
* readd group_agg
* readd files
* move prepare_print_tasks to evaluator_utils
* sort set to False by default, fix predict_only arg
* add version for groups
* reversed task list
* update additional condition when loading a group in a group yaml
* update truthfulqa
* add description regarding tags replacing group
* replace group to tag
* fixed conditional statement
* remove warning
* update loading of task group and newly added tags
* reformat with pre-commit
* fixed info log
* update
* fix bug
* fix bug
* use task id to differentiate tasks
* convert all groups to configurable groups
* use task_id
* reformat
* add task_id for python tasks as well
* add task_id for python tasks as well
* add task_id for python tasks as well
* revert truthfulqa
* revert mmlu tasks
* new mmlu config
* new group config parameter `tag_to_task`
* Update truthfulqa_mc2.yaml
* reformate
* add _process_group_config
* adjust task_id
* add get_subtask_list function to get proper subtask list
* group config to_dict update
* remove tag check
* update mmlu
* fix config passing issues
* add test yaml
* format fix
* add documentation
* corner case for single tag being called
* fix indentation
* formatting
* update all mmlu variants
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove group_alias
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* remove version for metadata
* Update docs/task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* update mmlu/
* removed " " in make_table
* change how aggregate_metric is loaded
* change how aggregate_metric is loaded
* update aggregate_metric arg
* update format
* update format
* some docs fixes
* add groups for agieval, aexams, aclue
* add more explicit aggregation groups
* add more groupings / tags distinctions
* add more groupings
* more groupings
* add many explicit group configs
* add many explicit group configs
* add more explicit group configs
* add more explicit group configs
* add more error msgs, agg_metric -> agg_metric_list
* some docs updates
* update task_id to be updateable and uses group:task format
* make KMMLU a tag for now
* update docs
* don't duplicate task names
* fix merge conflicts?
* giving this a try
* clean up diff
* switch mmlu variants over to using
* don't use to-be-deprecated group: config field in overview notebook
* Python tasks which subclass ConfigurableTask now run
* update mmlu
* pre-commit format
* fixed sorting for multi-level printing
* move group api to separate file
* fix bbh aggregation filter usage
* track api/group.py
* adjust group and tags loading
* make explicit group configs for leaderboard and other newer tasks
* fix arabicmmlu
* update
* change arabicmmlu template name???
* update group alias
* fix printing bugs
* check table printing is correct ; update tests
* use mmlu_stem to have a group included in print tests
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 03 Jul, 2024 3 commits
-
-
Hanwool Albert Lee authored
* initial_implementation (test has to be proceeded)
* minor fix
* revised task name and implemented new task
* minor fixes
* new tasks implement
* minor fix
* added 'prompt injection' task
* delete prompt injection task (will be implemented at next PR)
* trust remote code
* Update lm_eval/tasks/inverse_scaling/README.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* added readme
* Update lm_eval/tasks/README.md
* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml
* Update lm_eval/tasks/inverse_scaling/README.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Update README.md
* precommit?
* run precommit on readme
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Nathan Habib authored
* adds leaderboard tasks
* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml
* add readme
* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml
* modify readme
* fix bbh task
* fix bbh salient task
* modify the readme
* Delete lm_eval/tasks/leaderboard/ifeval/README.md
* Delete lm_eval/tasks/leaderboard/math/README.md
* add leaderboard to the tasks repertory
* add anouncment about new leaderbaord tasks
* linting
* Update README.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* installs ifeval dependency in new_task github workflow
---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
-
- 28 Jun, 2024 2 commits
-
-
Baber Abbasi authored
* add chat template
* refactor token padding
* nit
* nit
* check on failing test
* check transformers version
* remove transformers pin
* add ids to test
* nit
* fixup
* fix bos bug
* nit
* fixup! fix bos bug
* increase tolerance for table test
* don't detokenize vllm logprobs
* Update lm_eval/models/utils.py
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* pre-commit run --all-files
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Steven Basart authored
Bug:
```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```
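The final frame of the traceback above comes down to unpacking `None` with `**`. The snippet below is a minimal sketch of that failure mode and of one possible guard. `DemoConfig`, `build_config`, and `reproduce_error` are hypothetical names made up for illustration; they are not part of `lm_eval`, and the actual change made in this commit is not shown in the log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoConfig:
    # Hypothetical stand-in for a task config; not lm_eval's TaskConfig.
    task: Optional[str] = None


def reproduce_error() -> str:
    """Unpack None with ** and return the resulting TypeError message."""
    config = None
    try:
        DemoConfig(**config)
    except TypeError as exc:
        return str(exc)
    return ""


def build_config(config: Optional[dict] = None) -> DemoConfig:
    """One possible guard: substitute an empty mapping when config is None."""
    return DemoConfig(**(config or {}))


if __name__ == "__main__":
    print(reproduce_error())
    print(build_config(None))
    print(build_config({"task": "scrolls_quality"}))
```

Whether an empty default is appropriate depends on the caller; raising a clearer error early may be preferable when a config is actually required.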
-
- 26 Jun, 2024 1 commit
-
-
Hailey Schoelkopf authored
* make MMLU trust remote code to fix tests
* remove trust remote code
-
- 25 Jun, 2024 3 commits
-
-
Brendan Murphy authored
* Initial configuration
* Using the validation set for the test set, because the test set on HF doesn't have labels
* Probably just makes more sense to have validation be validation
* fix format ; add docs to tasks/README.md
* fix format
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Baber Abbasi authored
* refactored `lm.apply_chat_template`
* nit
* fix weird type error
* fixed!
* skip failing test
* pre-commit run all
* add type hints
* nit
* nit
* fixup
-
jonabur authored
* add arc_challenge_mt
* add README
* add icelandic
-
- 20 Jun, 2024 1 commit
-
-
Julen Etxaniz authored
* add bertaqa tasks
* rename basquetrivia-->bertaqa ; make template stub not .yaml
* add bertaqa entry to lm_eval/tasks/README.md
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 19 Jun, 2024 3 commits
-
-
Yazeed Alnumay authored
* Added ArabicMMLU
* Rename `ammlu` to `arabicmmlu`
-
Hailey Schoelkopf authored
* init paloma benchmark
* pre-process in utils function
* add `task_alias`
* updated task aliases
* Update paloma_dolma-v1_5.yaml
* Update paloma_twitterAAE_HELM_fixed.yaml
* Update paloma_dolma_100_programing_languages.yaml
* update on names
* fix paloma template issue
---------
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
Zafir Stojanovski authored
* init paloma benchmark
* pre-process in utils function
* add `task_alias`
* updated task aliases
* Update paloma_dolma-v1_5.yaml
* Update paloma_twitterAAE_HELM_fixed.yaml
* Update paloma_dolma_100_programing_languages.yaml
---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
- 18 Jun, 2024 1 commit
-
-
Wang, Chang authored
Signed-off-by: changwangss <chang1.wang@intel.com>
-
- 13 Jun, 2024 2 commits
-
-
johnwee1 authored
* fix: add filter to os.walk to ignore 'ipynb_checkpoints
* Update __init__.py
* Update __init__.py
---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
Baber Abbasi authored
* `samples` is newline delimited
* updated git and pre-commit
* appease pre-commit
* nit
* Revert back for now
* Revert for now
---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
- 11 Jun, 2024 2 commits
-
-
Hailey Schoelkopf authored
-
Hailey Schoelkopf authored
* Update README.md
* Delete lm_eval/tasks/ammlu directory
-
- 10 Jun, 2024 1 commit
-
-
khalil authored
-
- 07 Jun, 2024 3 commits
-
-
zhabuye authored
* Update README.md
* Update bec.yaml
* Update bhtc.yaml
* Update coref.yaml
* Update qnli.yaml
* Update vaxx.yaml
* Update wic.yaml
-
Hailey Schoelkopf authored
-
khalil authored
-
- 06 Jun, 2024 3 commits
-
-
Iker García-Ferrero authored
* Noticia
* test
* Final testes implementation
* Fixes
* Fix linters
-
Zafir Stojanovski authored
* added tasks and task family descriptors
* configs for the new lambada translations
* continue work on task list w/ links; slightly reorganize README
* Apply suggestions from code review
* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder
* Update new_task_guide.md
* Update README.md
* run linter
* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs
* fix typo
* update `lm_eval/tasks/README.md` with task description
---------
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
MorishT authored
-
- 05 Jun, 2024 1 commit
-
-
Maxime authored
Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (#1867)
* glianorex tasks
* Create README.md
* Update README.md
* Update README.md
* fix formatting
* fix internal formatting
-
- 03 Jun, 2024 1 commit
-
-
anthony-dipofi authored
* added tasks and task family descriptors
* continue work on task list w/ links; slightly reorganize README
* Apply suggestions from code review
* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder
* Update new_task_guide.md
* Update README.md
* run linter
* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs
* fix typo
* Apply suggestions from code review
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* apply format
---------
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 31 May, 2024 1 commit
-
-
Clémentine Fourrier authored
* init test 1
* fix
* this format seems to be working - need to update all other tasks with the new format
* bbh with few shot format
* fix fewshot bbh
* add mmlu flan cot
* samples of cot
* kmmlu
* fix gsm8k
* update keys for mmlu
* minerva math
* bbh
* fix
* fix samples
* small fixes to templates
* last prompt format change
* fixing prompt
* fixed minerva math format
* rm accidental commited file
* added doc for few shot samples
* Update lm_eval/loggers/evaluation_tracker.py
* Update lm_eval/loggers/evaluation_tracker.py
* Update docs/new_task_guide.md
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* added check in sampler per code review
* added the system from a function, plus an example in minerva math
* style
* Apply suggestions from code review
  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* fix unit tests 1
* forcing use of test split
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 24 May, 2024 2 commits
-
-
Lintang Sutawika authored
* edit process multiple-choice
* split template yaml
* remove
* modified multiple_choice tasks
* udpate
* Update multiple_choice_template_b_yaml
* Update multiple_choice_template_a_yaml
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Lintang Sutawika authored
* add mmlu tasks from pile-t5
* Update _mmlu_flan_cot_fewshot_template_yaml
* Update _mmlu_flan_cot_zeroshot_template_yaml
* Update _mmlu_flan_generative_template_yaml
* Update _mmlu_flan_loglikelihood_template_yaml
* Update _default_template_yaml
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 22 May, 2024 1 commit
-
-
zhabuye authored
-
- 21 May, 2024 1 commit
-
-
Zafir Stojanovski authored
-
- 13 May, 2024 1 commit
-
-
Lucas Weber authored
* Add tinyBenchmarks
* Add acknowledgements
* Add ordering of outputs for data-parallel
* Run pre-commit
* Add few_shot specifications
* Add tinyBenchmarks post-processing
* add conditional import ; fix task names
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 09 May, 2024 1 commit
-
-
Edd authored
* add copal
* change name to copal id for clarity and the task name
* remove `copal_id...` to yaml to make it work
* checkmark on README
* change group name to `copal_id`
-
- 08 May, 2024 1 commit
-
-
jonabur authored
* add mmlu arc style evaluation
* rename arc_style to continuation
---------
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
-
- 07 May, 2024 3 commits
-
-
Yoav Katz authored
* Initial support for Unitxt datasets in LM Eval Harness
  See https://github.com/IBM/unitxt
  The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.
  The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.
* Added dataset loading check to generate_yaml
  Improved error messages.
* Speed up generate_yaml
  Added printouts and improved error message
* Added output printout
* Simplified integration of unitxt datasets
  Store all the common yaml configuration in a yaml include shared by all datasets of the same task.
* Post code review comments - part 1
  1. Made sure include files don't end wth 'yaml' so they won't be marked as tasks
  2. Added more datasets and tasks (NER, GEC)
  3. Added README
* Post code review comments - part 2
  1. Added install unitxt install option in pyproject.toml: pip install 'lm_eval[unitxt]'
  2. Added a check that unitxt is installed and print a clear error message if not
* Commited missing pyproject change
* Added documentation on adding datasets
* More doc changes
* add unitxt extra to readme
* run precommit
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Hailey Schoelkopf authored
* add Hendrycks MATH (no sympy checking) variant
* add readmes for MATH tasks
-
Hailey Schoelkopf authored
-
- 01 May, 2024 1 commit
-
-
Simran Arora authored
* upload new tasks
* add readmes
* run linters
---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-