- 26 Feb, 2024 6 commits
-
-
Aaron V authored
* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate(). * Remove extra S in cache path in caching module Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Rename requests cache args, make model_args polymorphic so that a dict can also be accepted. * Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache. * Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used. * Remove line from gitignore, add to cli for caching datasets. * Add hashing suffix to .pickles. Update test script typo. * Favor isinstance() over type() in evaluator.py * Add tests for caching, get tests working, remove unneeded arg from build_all_requests(). * Update arg description to simple_evaluate. * Update pyproject.toml * Fix typehint * Remove the use of random() for creating default cache pickle hash. * Check that cache dir exists before clearing it in request cache tests. * Fix linting problems. * Fix additional formatting errors. * Remove trailing whitespace. * Add new line to the end of .gitignore. --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
This reverts commit c1145dfd.
-
Hailey Schoelkopf authored
* add add_bos_token to HFLM * add BOS token flag to other local model classes --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
khalil authored
* add arabic mmlu * update the description * add readme file
-
Vicki Boykis authored
-
LSinev authored
-
- 24 Feb, 2024 1 commit
-
-
LSinev authored
* Save git_hash to results even if git is not available to call as subprocess * Store more info about environment and transformers version in results to help researchers track inconsistencies * moved added logging to logging_utils * moved get_git_commit_hash to logging_utils.py * moved add_env_info inside evaluator
-
- 23 Feb, 2024 2 commits
-
-
Vicki Boykis authored
* interface docs * fix link
-
thnkinbtfly authored
-
- 22 Feb, 2024 5 commits
-
-
Amine Elhattami authored
* Fixed generation args issue affecting the OpenAI completion model * Fixed hf unit test; removed pop attributes in OpenAI completion. * fix format * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Ayush Thakur authored
* add wandb as extra dependency * wandb metrics logging * refactor * log samples as tables * fix linter * refactor: put in a class * change dir * add panels * log eval as table * improve tables logging * improve reports logging * precommit run * ruff check * handle importing reports api gracefully * ruff * compare results * minor pre-commit fixes * build comparison report * ruff check * log results as artifacts * remove comparison script * update dependency * type annotate and docstring * add example * update readme * fix typo * teardown * handle outside wandb run * gracefully fail reports creation * precommit checks * add report url to summary * use wandb printer for better url stdout * fix ruff * handle N/A and groups * fix eval table * remove unused var * update wandb version req + disable reports stdout * remove reports feature to TODO * add label to multi-choice question data * log model predictions * lints * loglikelihood_rolling * log eval result for groups * log tables by group for better handling * precommit * choices column for multi-choice * graciously fail wandb * remove reports feature * track system metrics + total eval time + stdout --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
-
Lei Chen authored
* fix issue #1391, wrong contexts in mgsm tasks * fix yaml issue of having two target_delimiter lines. For CoT tasks, keep the one with a space (default) * regenerate all task yaml files - change naming so that each file name matches its task name - task and file names follow a consistent scheme, mgsm_(mode)_(lang), for three modes, i.e., direct, en_cot, and native_cot * English CoTs should have a space as target_delimiter * Update utils.py * Apply suggestions from code review --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
* log group membership * no stray prints * Update evaluator.py
-
Anjor Kanekar authored
* loglikelihood refactor using template lm * linter * fix whitespace in target + prompt for CoT gsm8k (#1275) * Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (#1261) * Make parallelize=True distinction clearer in documentation. * run linter * Allow parameter edits for registered tasks when listed in a benchmark (#1273) * benchmark yamls allow minor edits of already registered tasks * add documentation * removed print * Fix data-parallel evaluation with quantized models (#1270) * add WIP device_map overrides * update handling outside of accelerate launcher * change .to(device) log to debug level * run linter * Rework documentation for explaining local dataset (#1284) * rework documentation for explaining local dataset * fix typo * Update new_task_guide.md * Re-add citation It looks like Google Scholar has [already noticed](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9...
-
- 21 Feb, 2024 1 commit
-
-
Hanwool Albert Lee authored
* update kmmlu default formatting * Update _default_kmmlu_yaml * Delete lm_eval/tasks/kmmlu/utils.py * new tasks implemented * add direct tasks * update direct evaluate * update direct eval * add cot sample * update cot * add cot * Update _cot_kmmlu_yaml * add kmmlu90 * Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml * Create kmmlu90.yaml * Update _cot_kmmlu_yaml * add direct * Update _cot_kmmlu_yaml * Update and rename kmmlu90.yaml to kmmlu90_cot.yaml * Update kmmlu90_direct.yaml * add kmmlu hard * Update _cot_kmmlu_yaml * Update _cot_kmmlu_yaml * update cot * update cot * erase typo * Update _cot_kmmlu_yaml * update cot * Rename dataset to match k-mmlu-hard * removed kmmlu90 * fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README * applied pre-commit before pull requests * rename datasets and add notes * Remove DS_Store cache * Update lm_eval/tasks/kmmlu/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Change citations and reflect reviews on version * Added kmmlu_hard and fixed other errors * fixing minor errors * remove duplicated * Rename files * try ".index" * minor fix * minor fix again * fix revert. * minor fix. thanks to Hailey --------- Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 20 Feb, 2024 3 commits
-
-
Uanu authored
* add new task GPQA_n_shot * add new task GPQA_zeroshot * correct GPQA_zeroshot filename * Add randomly shuffle choices * Correct missing parentheses * delete wrong tasks * Add README * Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/README.md * placate linter * linter --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Baber Abbasi authored
* add key lookup for same contexts * nit * appease pre-commit * nit * use `expand` (in-place view) rather than `repeat` * try mixed grouping * add docs. * nit * nit * nits * fix tests * Move greedy_tokens calculation out of cache loop * nit * nits * add test * nits * fix name conflict * fix name conflict * chunk tensor * move Collator * nits/docstring * fixup * fixup * group contexts only for decoders * pre-commit * fix `generate_until` test * fix `generate_until` test * Update lm_eval/models/huggingface.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add docs * nit * add docs * add docs * add 'logits_cache' arg * bugfix --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hanwool Albert Lee authored
* haerae_reimplementation * edited Readme and add few_shot settings * edited readme * newlines at end of each files * Modifying the README file * applied pre-commit
-
- 19 Feb, 2024 2 commits
-
-
thnkinbtfly authored
* update bbh, gsm8k, mmlu parsing logic and prompts * remove the formatting prompt (bbh) + minor update (mmlu) * update bbh, gsm8k, mmlu zeroshot, revert fewshots * update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot * remove take_last, update to use docs parameters * add newline * ruff formatting * Update pyproject.toml * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
larekrow authored
-
- 18 Feb, 2024 1 commit
-
-
Michael Feil authored
-
- 15 Feb, 2024 1 commit
-
-
David Hoffmann authored
-
- 14 Feb, 2024 1 commit
-
-
Baber Abbasi authored
-
- 13 Feb, 2024 1 commit
-
-
Hailey Schoelkopf authored
* fix weight_by_size condition * add tests, update stderr formula slightly * apply pre-commit
-
- 12 Feb, 2024 2 commits
-
-
Amine Elhattami authored
* Added seeds to `evaluator.simple_evaluate` signature * Added CLI argument * Updated to add arg.
-
Alessandro Ercolani authored
-
- 11 Feb, 2024 3 commits
-
-
Uanu authored
-
Uanu authored
-
Baber Abbasi authored
* un-exclude `evaluate.py` from linting * readability * readability * add task name to build info message * fix link * nit * add functions for var and mean pooling * add functions for var and mean pooling * metadata compatibility with task * rename `override_config` to `set_config` and move to `Task` * add unit test * nit * nit * bugfix * nit * nit * nit * add docstrings * fix metadata-fewshot * revert metric refactor * nit * type checking * type hints * type hints * move `override_metric` to `Task` * change metadata * change name * pre-commit * rename * remove * remove * `override_metric` backwards compatible with `Task` * type hints * use generic * type hint
-
- 10 Feb, 2024 2 commits
-
-
Jeevan authored
* Fix watchdog timeout * Pre-commit fix * Timedelta
-
Pasquale Minervini authored
* Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416 Sets `do_sample = False` if `temperature == 0.0` and `do_sample = None` * Update huggingface.py * Update huggingface.py making linter happy
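
The rule stated in this commit message can be sketched as follows. This is a minimal illustration only, not the code in `huggingface.py`: the function name `normalize_generation_kwargs` and the plain-dict interface are assumptions made for the example.

```python
from typing import Any, Dict


def normalize_generation_kwargs(gen_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper (not the harness's actual function).

    If temperature is exactly 0.0 and do_sample was left unset (None),
    fall back to greedy decoding by setting do_sample to False.
    """
    kwargs = dict(gen_kwargs)  # copy so the caller's dict is not mutated
    if kwargs.get("temperature") == 0.0 and kwargs.get("do_sample") is None:
        kwargs["do_sample"] = False
    return kwargs
```

An explicit `do_sample=True` is left untouched in this sketch, since the commit message only covers the case where `do_sample` is `None`.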
-
- 09 Feb, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 07 Feb, 2024 1 commit
-
-
- 06 Feb, 2024 4 commits
-
-
Michael Feil authored
* add hf_transfer * update dependencies * Delete stale `[linting]` extra * Update README.md with extras table --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Hailey Schoelkopf authored
* update formula for stderr aggregation * hack: see what happens when using stderr_for_metric bootstrapping on a group * undo bootstrap_for_stderr test * factor out variance-aggregation formulas into api.metrics * fix failing tests * remove stray print * update comment * further detail in comment * add back initialize_tasks() call * fix format
-
Hailey Schoelkopf authored
-
Michael Chen authored
Add instructions for non-MacOS users on how to compile janitor_util.cpp so that janitor.py can use it.
-
- 05 Feb, 2024 1 commit
-
-
Michael Feil authored
* initial commit * remove overwrite bs * adding neuronx dependencies * Update README.md * update neuronx
-
- 02 Feb, 2024 2 commits
-
-
Lintang Sutawika authored
-
Pasquale Minervini authored
Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1383. If this is okay, it will need to be propagated to SCROLLS.
-