- 26 Feb, 2024 5 commits
-
-
Hailey Schoelkopf authored
This reverts commit c1145dfd.
-
Hailey Schoelkopf authored
* add add_bos_token to HFLM * add BOS token flag to other local model classes --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
khalil authored
* add arabic mmlu * update the description * add readme file
-
Vicki Boykis authored
-
LSinev authored
-
- 24 Feb, 2024 1 commit
-
-
LSinev authored
* Save git_hash to results even if git is not available to call as subprocess * Store more info about environment and transformers version in results to help researchers track inconsistencies * moved added logging to logging_utils * moved get_git_commit_hash to logging_utils.py * moved add_env_info inside evaluator
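The first bullet of the commit above (saving the git hash even when `git` cannot be called as a subprocess) can be illustrated with a minimal sketch. The function names `read_hash_from_git_dir` and `get_git_hash` are hypothetical and are not taken from the repository; the fallback simply reads `.git/HEAD` and the ref file it points to, and does not handle packed refs.

```python
import subprocess
from pathlib import Path
from typing import Optional


def read_hash_from_git_dir(repo_path: str) -> Optional[str]:
    """Hypothetical helper: read the commit hash directly from the .git directory."""
    git_dir = Path(repo_path, ".git")
    head_file = git_dir / "HEAD"
    if not head_file.is_file():
        return None
    head = head_file.read_text(encoding="utf-8").strip()
    if head.startswith("ref: "):
        # HEAD points at a branch; the hash lives in the referenced file.
        ref_file = git_dir / head[len("ref: "):]
        if not ref_file.is_file():
            return None  # e.g. packed refs, which this sketch does not handle
        return ref_file.read_text(encoding="utf-8").strip()
    # Detached HEAD: the file contains the hash itself.
    return head


def get_git_hash(repo_path: str = ".") -> Optional[str]:
    """Hypothetical helper: try the git executable first, then fall back to files."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_path,
            stderr=subprocess.DEVNULL,
        )
        return out.decode("utf-8").strip()
    except (subprocess.CalledProcessError, OSError):
        # git is missing, not executable, or the directory is not a valid repo.
        return read_hash_from_git_dir(repo_path)
```

Reading the files directly keeps the hash available in environments such as minimal containers where the `git` binary is absent but the checkout still contains its `.git` directory.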
-
- 23 Feb, 2024 2 commits
-
-
Vicki Boykis authored
* interface docs * fix link
-
thnkinbtfly authored
-
- 22 Feb, 2024 5 commits
-
-
Amine Elhattami authored
* Fixed generation args issue affecting the OpenAI completion model * Fixed HF unit test; removed pop attributes in OpenAI completion. * fix format * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Ayush Thakur authored
* add wandb as extra dependency * wandb metrics logging * refactor * log samples as tables * fix linter * refactor: put in a class * change dir * add panels * log eval as table * improve tables logging * improve reports logging * precommit run * ruff check * handle importing reports api gracefully * ruff * compare results * minor pre-commit fixes * build comparison report * ruff check * log results as artifacts * remove comparison script * update dependency * type annotate and docstring * add example * update readme * fix typo * teardown * handle outside wandb run * gracefully fail reports creation * precommit checks * add report url to summary * use wandb printer for better url stdout * fix ruff * handle N/A and groups * fix eval table * remove unused var * update wandb version req + disable reports stdout * remove reports feature to TODO * add label to multi-choice question data * log model predictions * lints * loglikelihood_rolling * log eval result for groups * log tables by group for better handling * precommit * choices column for multi-choice * graciously fail wandb * remove reports feature * track system metrics + total eval time + stdout --------- Co-authored-by:Lintang Sutawika <lintang@eleuther.ai>
-
Lei Chen authored
* fix issue #1391, wrong contexts in mgsm tasks * fix yaml issue caused by having two target_delimiter lines. For CoT tasks, keep the one with a space (default) * regenerate all task yaml files - change naming so that each file name matches its task name - task and file names follow a consistent scheme, mgsm_(mode)_(lang), for the three modes: direct, en_cot, and native_cot * English CoTs should have a space as target_delimiter * Update utils.py * Apply suggestions from code review --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
* log group membership * no stray prints * Update evaluator.py
-
Anjor Kanekar authored
* loglikelihood refactor using template lm * linter * fix whitespace in target + prompt for CoT gsm8k (#1275) * Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (#1261) * Make parallelize=True distinction clearer in documentation. * run linter * Allow parameter edits for registered tasks when listed in a benchmark (#1273) * benchmark yamls allow minor edits of already registered tasks * add documentation * removed print * Fix data-parallel evaluation with quantized models (#1270) * add WIP device_map overrides * update handling outside of accelerate launcher * change .to(device) log to debug level * run linter * Rework documentation for explaining local dataset (#1284) * rewor documentation for explaining local dataset * fix typo * Update new_task_guide.md * Re-add citation It looks like Google Scholar has [already noticed](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&authuser=2&q=%22A+framework+for+few-shot+language+model+evaluation%2C+12+2023%22&btnG= ) the updated citation block so let's add it back in. * Update CITATION.bib (#1285) Bumping CITATION.bib to match re-adding the citation in readme. cc @StellaAthena * Update nq_open.yaml (#1289) * Update README.md with custom integration doc (#1298) * Update README.md * punctuation --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update nq_open.yaml (#1305) * Update nq_open.yaml change regex * Bump NQ version --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update task_guide.md (#1306) * Update pyproject.toml (#1312) * Fix polemo2_in.yaml config name (#1313) * Update pyproject.toml (#1314) * Fix group register (#1315) * tuple should be considered as well * set option to keep callable as callable * Update task_guide.md (#1316) * Update polemo2_in.yaml (#1318) * don't pass extra kwargs to mamba any more (#1328) * Fix Issue regarding stderr (#1327) * add fix for deciding if stderr is N/A or not * process N/A * Add `local-completions` support using OpenAI interface (#1277) * Add `local-completions` support using OpenAI interface * Refactor oa_completion * Address tokenizer comments and change request chunks to batch size * Add warning message for tiktoken backend * fix formatting * fix whitespace * Update README.md --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fallback to classname when LM doesnt have config (#1334) * fix a trailing whitespace that breaks a lint job (#1335) * skip "benchmarks" in changed_tasks (#1336) * Update migrated HF dataset paths (#1332) * Update arc_easy.yaml * Update flan_cot.yaml * update HF dataset path * Update freeform.yaml * Update flan_cot.yaml --------- Co-authored-by:
Lintang Sutawika <lintang@eleuther.ai> * Don't use `get_task_dict()` in task registration / initialization (#1331) * don't use get_task_dict() as a helper, it will download the dataset! * pre-commit * Update README.md --------- Co-authored-by:
lintangsutawika <lintang@eleuther.ai> * manage default (greedy) gen_kwargs in vllm (#1341) * manage default (greedy) gen_kwargs in vllm better * mirror HF `do_sample` * just need to set temp=0 for greedy * modified default gen_kwargs to work better with CLI; changed prompt_logprobs=1 (#1345) * update links to task_guide.md (#1348) * `Filter` docs not offset by `doc_id` (#1349) * get `doc` from instance * acceletate bugfix: get ground doc from instance * convert filter to `process_result` * get docs from instances in `FilterEnsemble` * rename * nit * better looping * fix typehint * Add FAQ on `lm_eval.tasks.initialize_tasks()` to README (#1330) * Update README.md * [!Tip] * Refix issue regarding stderr (#1357) * Add causalLM OpenVino models (#1290) * added intel optimum * added intel optimum in readme * modified intel optimum * modified intel optimum * modified intel optimum * modified install optimum * modified path of IR file * added openvino_device * added openvino_device2 * changed optimum-causal to openvino-causal * Update README.md * Update README.md * remove `lm_eval.base` import * update openvino-causal -> openvino ; pass device through super().__init__() * Update README.md * Add optimum to tests dependencies * apply pre-commit * fix so tests pass --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by:
haileyschoelkopf <hailey@eleuther.ai> * Apply some best practices and guideline recommendations to code (#1363) * raise Exception, not a string Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions * Apply PEP8 recommendation to prefer isinstance "Object type comparisons should always use isinstance() instead of comparing types directly" https://peps.python.org/pep-0008/ * Remove dangerous default mutable values in arguments https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html * Format logging messages with fstring (not with format) Additional info https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html There are also discussions about the speed of formatting while logging or some unintended code executions https://github.com/pylint-dev/pylint/issues/2395 https://stackoverflow.com/a/54368109 but at least one format (fstring one) will be used throughout the project * Specify utf-8 encoding for `open` explicitly If not specified, it may be supposed differently in different environments, OSes, and Python versions. See https://peps.python.org/pep-0597/ https://docs.python.org/3.11/library/locale.html#locale.getencoding https://docs.python.org/3.10/library/os.html#utf8-mode https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages. 
* Use inline-ignoring comments to pass pre-commit instead of identity process https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors https://www.flake8rules.com/rules/F841.html flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression * serialize callable functions in config (#1367) * delay filter init; remove `*args` (#1369) * delay filter init; remove `*args` * bugfix * optimize * type hint * Fix unintuitive `--gen_kwargs` behavior (#1329) * don't override do_sample if no value for it is passed * Update gen_kwargs override condition * Update huggingface.py * Update huggingface.py * run linters * silence an erroneous warning * Publish to pypi (#1194) * publish to pypi * lint * Update publish.yml * minor * Make dependencies compatible with PyPI (#1378) * make deps not point to github urls * formatting * try making PyPI only run on tag pushes * Add support for RWKV models with World tokenizer (#1374) * Add support for RWKV models with World tokenizer The RWKV line of model with the World tokenizer, does not allow the padding token to be configured, and has its value preset as 0 This however fails all the "if set" checks, and would cause the tokenizer to crash. A tokenizer class name check was added, in addition to a model type check, as there exists RWKV models which uses the neox tokenizers * Update huggingface.py Genericized so that this supports any RWKVWorld tokenizer, and added a fall-back for if the HF implementation name changes. * Comply with formatting guidelines * fix format --------- Co-authored-by:
Stella Biderman <stellabiderman@gmail.com> Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add bypass metric (#1156) * add bypass metric * fixed `bypass` metric. * add task attributes if predict_only * add `predict_only` checks * add docs * added `override_metric`, `override_config` to `Task` * nits * nit * changed --predict_only to generations; nits * nits * nits * change gen_kwargs warning * add note about `--predict_only` in README.md * added `predict_only` * move table to bottom * nit * change null aggregation to bypass (conflict) * bugfix; default `temp=0.0` * typo * loglikelihood refactor using template lm * lint * code review * neuron optimum * Mention TemplateLM in model_guide.md * Update lm_eval/api/model.py * fix linter * fix format * fix format * fix format --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by:
Lintang Sutawika <lintang@eleuther.ai> Co-authored-by:
Stella Biderman <stellabiderman@gmail.com> Co-authored-by:
Mark Saroufim <marksaroufim@meta.com> Co-authored-by:
Hannibal046 <38466901+Hannibal046@users.noreply.github.com> Co-authored-by:
Danielle Pintz <38207072+daniellepintz@users.noreply.github.com> Co-authored-by:
Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by:
kwrobel.eth <djstrong@gmail.com> Co-authored-by:
Michael Goin <michael@neuralmagic.com> Co-authored-by:
Brian Vaughan <nairbv@users.noreply.github.com> Co-authored-by:
Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by:
thnkinbtfly <70014488+thnkinbtfly@users.noreply.github.com> Co-authored-by:
NoushNabi <33136068+NoushNabi@users.noreply.github.com> Co-authored-by:
haileyschoelkopf <hailey@eleuther.ai> Co-authored-by:
LSinev <LSinev@users.noreply.github.com> Co-authored-by:
Eugene Cheah <PicoCreator@users.noreply.github.com>
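The RWKV World tokenizer note in the commit message above says the padding token is preset to 0, which fails every "if set" check. A minimal sketch of why a truthiness test misreads that value follows. Both function names are hypothetical, and the comparison against `None` is shown only to illustrate the pitfall; the commit itself describes a different remedy (a tokenizer class name check alongside a model type check).

```python
from typing import Optional


def pad_is_set_truthy(pad_token_id: Optional[int]) -> bool:
    """Hypothetical check that treats any falsy value as 'not set'."""
    # A pad token id of 0 is falsy, so this wrongly reports it as unset.
    return bool(pad_token_id)


def pad_is_set_explicit(pad_token_id: Optional[int]) -> bool:
    """Hypothetical check that only treats None as 'not set'."""
    return pad_token_id is not None


# With a padding id of 0, the two checks disagree.
rwkv_world_pad_id = 0
print(pad_is_set_truthy(rwkv_world_pad_id))
print(pad_is_set_explicit(rwkv_world_pad_id))
```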
-
- 21 Feb, 2024 1 commit
-
-
Hanwool Albert Lee authored
* update kmmlu default formatting * Update _default_kmmlu_yaml * Delete lm_eval/tasks/kmmlu/utils.py * new tasks implemented * add direct tasks * update direct evaluate * update direct eval * add cot sample * update cot * add cot * Update _cot_kmmlu_yaml * add kmmlu90 * Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml * Create kmmlu90.yaml * Update _cot_kmmlu_yaml * add direct * Update _cot_kmmlu_yaml * Update and rename kmmlu90.yaml to kmmlu90_cot.yaml * Update kmmlu90_direct.yaml * add kmmlu hard * Update _cot_kmmlu_yaml * Update _cot_kmmlu_yaml * update cot * update cot * erase typo * Update _cot_kmmlu_yaml * update cot * Rename dataset to match k-mmlu-hard * removed kmmlu90 * fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README * applied pre-commit before pull requests * rename datasets and add notes * Remove DS_Store cache * Update lm_eval/tasks/kmmlu/README.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Change citations and reflect reviews on version * Added kmmlu_hard and fixed other errors * fixing minor errors * remove duplicated * Rename files * try ".index" * minor fix * minor fix again * fix revert. * minor fix. thanks to Hailey --------- Co-authored-by:
GUIJIN SON <spthsrbwls123@yonsei.ac.kr> Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 20 Feb, 2024 3 commits
-
-
Uanu authored
* add new task GPQA_n_shot * add new task GPQA_zeroshot * correct GPQA_zeroshot filename * Add randomly shuffle choices * Correct missing parentheses * delete wrong tasks * Add README * Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/README.md * placate linter * linter --------- Co-authored-by:Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Baber Abbasi authored
* add key lookup for same contexts * nit * appease pre-commit * nit * use `expand` (in-place view) rather than `repeat` * try mixed grouping * add docs. * nit * nit * nits * fix tests * Move greedy_tokens calculation out of cache loop * nit * nits * add test * nits * fix name conflict * fix name conflict * chunk tensor * move Collator * nits/docstring * fixup * fixup * group contexts only for decoders * pre-commit * fix `generate_until` test * fix `generate_until` test * Update lm_eval/models/huggingface.py Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add docs * nit * add docs * add docs * add 'logits_cache' arg * bugfix --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hanwool Albert Lee authored
* haerae_reimplementation * edited Readme and add few_shot settings * edited readme * newlines at end of each files * Modifying the README file * applied pre-commit
-
- 19 Feb, 2024 2 commits
-
-
thnkinbtfly authored
* update bbh, gsm8k, mmlu parsing logic and prompts * remove the formatting prompt (bbh) + minor update (mmlu) * update bbh, gsm8k, mmlu zeroshot, revert fewshots * update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot * remove take_last, update to use docs parameters * add newline * ruff formatting * Update pyproject.toml * fix format --------- Co-authored-by:Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
larekrow authored
-
- 18 Feb, 2024 1 commit
-
-
Michael Feil authored
-
- 15 Feb, 2024 1 commit
-
-
David Hoffmann authored
-
- 14 Feb, 2024 1 commit
-
-
Baber Abbasi authored
-
- 13 Feb, 2024 1 commit
-
-
Hailey Schoelkopf authored
* fix weight_by_size condition * add tests, update stderr formula slightly * apply pre-commit
-
- 12 Feb, 2024 2 commits
-
-
Amine Elhattami authored
* Added seeds to `evaluator.simple_evaluate` signature * Added CLI argument * Updated to add arg.
-
Alessandro Ercolani authored
-
- 11 Feb, 2024 3 commits
-
-
Uanu authored
-
Uanu authored
-
Baber Abbasi authored
* un-exclude `evaluate.py` from linting * readability * readability * add task name to build info message * fix link * nit * add functions for var and mean pooling * add functions for var and mean pooling * metadata compatibility with task * rename `override_config` to `set_config` and move to `Task` * add unit test * nit * nit * bugfix * nit * nit * nit * add docstrings * fix metadata-fewshot * revert metric refactor * nit * type checking * type hints * type hints * move `override_metric` to `Task` * change metadata * change name * pre-commit * rename * remove * remove * `override_metric` backwards compatible with `Task` * type hints * use generic * type hint
-
- 10 Feb, 2024 2 commits
-
-
Jeevan authored
* Fix watchdog timeout * Pre-commit fix * Timedelta
-
Pasquale Minervini authored
* Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416 Sets `do_sample = False` if `temperature == 0.0` and `do_sample = None` * Update huggingface.py * Update huggingface.py making linter happy
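The rule stated in the commit above (set `do_sample = False` when `temperature == 0.0` and `do_sample` is unset) can be sketched as follows. The function name `resolve_sampling` is hypothetical and this is not the code in `huggingface.py`; it only implements the condition as the commit message words it.

```python
from typing import Any, Dict


def resolve_sampling(gen_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: default to greedy decoding when temperature is 0."""
    kwargs = dict(gen_kwargs)  # copy so the caller's dict is not mutated
    do_sample = kwargs.get("do_sample")  # None when the user passed nothing
    if kwargs.get("temperature") == 0.0 and do_sample is None:
        kwargs["do_sample"] = False
    return kwargs


# An explicit user choice is left untouched; only the unset case is filled in.
print(resolve_sampling({"temperature": 0.0}))
print(resolve_sampling({"temperature": 0.0, "do_sample": True}))
```

Checking `do_sample is None` rather than `not do_sample` is what keeps an explicit `do_sample=False` or `do_sample=True` from being overridden.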
-
- 09 Feb, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 07 Feb, 2024 1 commit
-
-
- 06 Feb, 2024 4 commits
-
-
Michael Feil authored
* add hf_transfer * update dependencies * Delete stale `[linting]` extra * Update README.md with extras table --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
* update formula for stderr aggregation * hack: see what happens when using stderr_for_metric bootstrapping on a group * undo bootstrap_for_stderr test * factor out variance-aggregation formulas into api.metrics * fix failing tests * remove stray print * update comment * further detail in comment * add back initialize_tasks() call * fix format
-
Hailey Schoelkopf authored
-
Michael Chen authored
Add instructions for non-MacOS users on how to compile janitor_util.cpp so that janitor.py can use it.
-
- 05 Feb, 2024 1 commit
-
-
Michael Feil authored
* initial commit * remove overwrite bs * adding neuronx dependencies * Update README.md * update neuronx
-
- 02 Feb, 2024 2 commits
-
-
Lintang Sutawika authored
-
Pasquale Minervini authored
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 If this is okay, it will need to be propagated to SCROLLS
-
- 01 Feb, 2024 1 commit
-
-
Lintang Sutawika authored
* add trust_remote_code as default * task for testing recursive * changed source of ALL_TASKS * tasks should only accept TaskObjects * initialize_tasks returns list of tasks and groups * remove trust_remote_code for now * moved constructor process to inside load_yaml_config * more comprehensive way to index tasks and groups * pre-commit format * add exit after error * adjust how task objects are called * no need to use get_task_dict * load_task_or_group works but only for tasks * pre-commit format * half working for nested groups * changed variable names * allow groups and tasks to work * temp save * indexing and loading are part of a task_manager object * adapted initialize_tasks * iron out bugs * fixed typo * fixed typo * simplified code * further tidy up * remove lines for testing * removed test lines * removed unused code * remove unused import * fixed bug * removed comments * group in a list of group can accept parameter changes like `num_fewshot` * check if config is task update * add GroupConfig object * edit
test yaml * remove args * testing returning to python task list * add weight_by_size config * describe weight_by_size in docs * fix weight by size potential error * can load individual custom python class task * moved import_function into the config loading file * remove print lines * add squadv2 yaml * temporary scroll implementation * revert back to use load_yaml_config but with modes * fix group being loaded with a None * reformat * can load unregistered tasks from a group * update scrolls * edit scrolls multiplechoice task * adjust class initialization * fix initialization * changed how to identify group and python tasks, fix logger * allow loading "include" that is nested in a group config * reworked flan benchmark * allow duplicate task in the same group to co-exist * process group_alias * removed group_alias * allow parameters set in group_config to apply to all tasks in tasklist * add function, but comment for now * reworked processing dict-base config * fixed how configs in group are processed * update to allow root group to have its alias used * remove unused classes * remove unused classes * revert some parts to original * forgot to change one variable * adapt the new process to use get_task_dict * fix for singular group call * fix variable names * add TaskManager into the evaluator * format * changed how dict tasks are loaded * add docs * Update docs/new_task_guide.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update evaluator.py * Update evaluator.py * remove groupconfig for now * changed _config to config * update interface.md to explain TaskManager * added property functions * adjusted logger * update write_out.py * updated tests * added documentation and some modifications * added docstring documentation * precommit format * updated task loading for tests * updates tests * changed arg order for load_yaml_config * update to handle scrolls and edit log message * remove unused lines * return a list of task classes and not a dict * Update __init__.py * Delete lm_eval/tasks/benchmarks/test.yaml * Update task.py * Update lm_eval/utils.py Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/utils.py Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update utils.py * re-added old functions with new log message * Update docs/new_task_guide.md Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update new_task_guide.md * added info regarding `get_task_dict` and documentation * add get_config for Task * pre-commit formatting --------- Co-authored-by:
Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-