- 27 Apr, 2024 1 commit
-
-
LucWeber authored
-
- 12 Mar, 2024 1 commit
-
-
LucWeber authored
-
- 11 Mar, 2024 2 commits
- 08 Mar, 2024 2 commits
- 06 Mar, 2024 5 commits
-
-
LSinev authored
* Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided) * Fix improper import of LM and usage of evaluator in one of scripts * update type hints in instance and task api * raising errors in task.py instead of asserts * Fix warnings from ruff * raising errors in __main__.py instead of asserts * raising errors in tasks/__init__.py instead of asserts * raising errors in evaluator.py instead of asserts * evaluator: update type hints and remove unused variables in code * Update lm_eval/__main__.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/__main__.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/api/task.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/api/task.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/api/task.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/evaluator.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * pre-commit induced fixes --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
update printed num-fewshot; prevent fewshots from erroneously being used by cot which hardcodes fewshot prompt (#1502)
-
sean0042 authored
-
Long Phan authored
* init wmdp yaml file * Add WMDP Multiple-choice * fix linter issues * Delete lm_eval/tasks/wmdp/_wmdp.yaml --------- Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
-
Peter Bevan authored
* Start adding eq-bench * Start adding to yaml and utils * Get metric working * Add README * Handle cases where answer is not parseable * Deal with unparseable answers and add percent_parseable metric * Update README
-
- 05 Mar, 2024 2 commits
-
-
Uanu authored
* Add new tasks of GPQA * Add README * Remove unused functions * Remove unused functions * Linters * Add flexible match * update * Remove duplicate function * Linter * update * Update lm_eval/filters/extraction.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * register multi_choice_regex * Update * run precommit --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
Baber Abbasi authored
-
- 04 Mar, 2024 1 commit
-
-
Manuel Faysse authored
* add french-bench * rename arc easy * linting * update datasets for no remote code exec * fix string delimiter * add info to readme * trim trailing whitespace * add detailed groups * add info to readme * remove orangesum title from fbench main * Force PPL tasks to be 0-shot --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 03 Mar, 2024 1 commit
-
-
Vicki Boykis authored
* setting trust_remote_code * dataset list no notebooks * respect trust remote code * Address changes, move cli options and change datasets * fix task for tests * headqa * remove kobest * pin datasets and address comments * clean up space
-
- 01 Mar, 2024 1 commit
-
-
Zehan Li authored
-
- 27 Feb, 2024 2 commits
-
-
Hailey Schoelkopf authored
-
Zehan Li authored
-
- 26 Feb, 2024 6 commits
-
-
Lintang Sutawika authored
* add brier_score * process brier_score * brier score is working for N-sized class * fixed brier score * add TED to BigBench and Brier score to MMLU * format * Update metrics.py * Update task.py * Update generate_until_template_yaml * Delete lm_eval/tasks/bigbench/aux_metric.py * Update generate_until_template_yaml * Update _default_template_yaml * Update _generate_configs.py * Update _generate_configs.py * Update _generate_configs.py * fix (format?) * format? * format, once more --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Aaron V authored
* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate(). * Remove extra S in cache path in caching module Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Rename requests cache args, make model_args polymorphic so that a dict can also be accepted. * Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache. * Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used. * Remove line from gitignore, add to cli for caching datasets. * Add hashing suffix to .pickles. Update test script typo. * Favor isinstance() over type() in evaluator.py * Add tests for caching, gets tests working, remove unneeded arg from build_all_requests(). * Update arg description to simple_evaluate. * Update pyproject.toml * Fix typehint * Remove the use of random() for creating default cache pickle hash. * Check that cache dir exists before clearing it in request cache tests. * Fix linting problems. * Fix additional formatting errors. * Remove trailing whitespace. * Add new line to the end of .gitignore. --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
This reverts commit c1145dfd.
-
khalil authored
* add arabic mmlu * update the description * add readme file
-
Vicki Boykis authored
-
LSinev authored
-
- 23 Feb, 2024 1 commit
-
-
thnkinbtfly authored
-
- 22 Feb, 2024 1 commit
-
-
Lei Chen authored
* fix the issue #1391, wrong contexts in mgsm tasks * fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default) * regenerate all task yaml files - change naming so that file name will match with task name - task|file follows a consistent naming way, mgsm_(mode)_(lang) for three modes, i.e., direct, en_cot, and native_cot * English CoTs should have a space as target_delimiter * Update utils.py * Apply suggestions from code review --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 21 Feb, 2024 1 commit
-
-
Hanwool Albert Lee authored
* update kmmlu default formatting * Update _default_kmmlu_yaml * Delete lm_eval/tasks/kmmlu/utils.py * new tasks implemented * add direct tasks * update direct evaluate * update direct eval * add cot sample * update cot * add cot * Update _cot_kmmlu_yaml * add kmmlu90 * Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml * Create kmmlu90.yaml * Update _cot_kmmlu_yaml * add direct * Update _cot_kmmlu_yaml * Update and rename kmmlu90.yaml to kmmlu90_cot.yaml * Update kmmlu90_direct.yaml * add kmmlu hard * Update _cot_kmmlu_yaml * Update _cot_kmmlu_yaml * update cot * update cot * erase typo * Update _cot_kmmlu_yaml * update cot * Rename dataset to match k-mmlu-hard * removed kmmlu90 * fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README * applied pre-commit before pull requests * rename datasets and add notes * Remove DS_Store cache * Update lm_eval/tasks/kmmlu/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Change citations and reflect reviews on version * Added kmmlu_hard and fixed other errors * fixing minor errors * remove duplicated * Rename files * try ".index" * minor fix * minor fix again * fix revert. * minor fix. thank for hailey --------- Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 20 Feb, 2024 3 commits
-
-
Uanu authored
* add new task GPQA_n_shot * add new task GPQA_zeroshot * correct GPQA_zeroshot filename * Add randomly shuffle choices * Correct missing parentheses * delete wrong tasks * Add README * Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/README.md * placate linter * linter --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Baber Abbasi authored
* add key lookup for same contexts * nit * appease pre-commit * nit * use `expand` (in-place view) rather than `repeat` * try mixed grouping * add docs. * nit * nit * nits * fix tests * Move greedy_tokens calculation out of cache loop * nit * nits * add test * nits * fix name conflict * fix name conflict * chunk tensor * move Collator * nits/docstring * fixup * fixup * group contexts only for decoders * pre-commit * fix `generate_until` test * fix `generate_until` test * Update lm_eval/models/huggingface.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add docs * nit * add docs * add docs * add 'logits_cache' arg * bugfix --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hanwool Albert Lee authored
* haerae_reimplementation * edited Readme and add few_shot settings * edited readme * newlines at end of each files * Modifying the README file * applied pre-commit
-
- 19 Feb, 2024 2 commits
-
-
thnkinbtfly authored
* update bbh, gsm8k, mmlu parsing logic and prompts * remove the formatting prompt (bbh) + minor update (mmlu) * update bbh, gsm8k, mmlu zeroshot, revert fewshots * update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot * remove take_last, update to use docs parameters * add newline * ruff formatting * Update pyproject.toml * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
larekrow authored
-
- 15 Feb, 2024 1 commit
-
-
David Hoffmann authored
-
- 13 Feb, 2024 1 commit
-
-
Hailey Schoelkopf authored
* fix weight_by_size condition * add tests, update stderr formula slightly * apply pre-commit
-
- 12 Feb, 2024 1 commit
-
-
Alessandro Ercolani authored
-
- 11 Feb, 2024 2 commits
- 02 Feb, 2024 1 commit
-
-
Pasquale Minervini authored
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1383. If this is okay, it will need to be propagated to SCROLLS.
-
- 01 Feb, 2024 2 commits
-
-
Lintang Sutawika authored
* add trust_remote_code as default * task for testing recursive * changed source of ALL_TASKS * tasks should only accept TaskObjects * initialize_tasks returns list of tasks and groups * remove trust_remote_code for now * moved constructor process to inside load_yaml_config * more comprehensive way to index tasks and groups * pre-commit format * add exit after error * adjust how task objects are called * no need to use get_task_dict * load_task_or_group works but only for tasks * pre-commit format * half working for nested groups * changed variable names * allow groups and tasks to work * temp save * indexing and loading are part of a task_manager object * adapted initialize_tasks * iron out bugs * fixed typo * fixed typo * simplified code * further tidy up * remove lines for testing * removed test lines * removed unused code * remove unused import * fixed bug * removed comments * group in a list of group can accept parameter changes like `num_fewshot` * check if config is task update * add GroupConfig object * edit test yaml * remove args * testing returning to python task list * add weight_by_size config * describe weight_by_size in docs * fix weight by size potential error * can load individual custom python class task * moved import_function into the config loading file * remove print lines * add squadv2 yaml * temporary scroll implementation * revert back to use load_yaml_config but with modes * fix group being loaded with a None * reformat * can load unregistered tasks from a group * update scrolls * edit scrolls multiplechoice task * adjust class initialization * fix initialization * changed how to identify group and python tasks, fix logger * allow loading "include" that is nested in a group config * reworked flan benchmark * allow duplicate task in the same group to co-exist * process group_alias * removed group_alias * allow parameters set in group_config to apply to all tasks in tasklist * add function, but comment for now * reworked processing dict-base config * fixed how configs in group are processed * update to allow root group to have its alias used * remove unused classes * remove unused classes * revert some parts to original * forgot to change one variable * adapt the new process to use get_task_dict * fix for singular group call * fix variable names * add TaskManager into the evaluator * format * changed how dict tasks are loaded * add docs * Update docs/new_task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update evaluator.py * Update evaluator.py * remove groupconfig for now * changed _config to config * update interface.md to explain TaskManager * added property functions * adjusted logger * update write_out.py * updated tests * added documentation and some modifications * added docstring documentation * precommit format * updated task loading for tests * updates tests * changed arg order for load_yaml_config * update to handle scrolls and edit log message * remove unused lines * return a list of task classes and not a dict * Update __init__.py * Delete lm_eval/tasks/benchmarks/test.yaml * Update task.py * Update lm_eval/utils.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/utils.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update utils.py * re-added old functions with new log message * Update docs/new_task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update new_task_guide.md * added info regarding `get_task_dict` and documentation * add get_config for Task * pre-commit formatting --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
* allow tasks to specify printed fewshot val * fix to belebele * update metadata field's documentation
-