- 10 Jul, 2025 1 commit
-
-
Baber Abbasi authored
* check for chat for warning * add test * remove yaml extension from some evalita configs * move unitxt to own test script * fix CI test
-
- 06 Jul, 2025 1 commit
-
-
Baber Abbasi authored
-
- 05 Jul, 2025 1 commit
-
-
Debjyoti Ray authored
* git push --force correctly processes both formats of model_args: string and dictionary * extract to function for better test * nit --------- Co-authored-by: Baber <baber@hey.com>
-
- 04 Jul, 2025 2 commits
-
-
Neel Gupta authored
* [FIX] Initial code to disable multi-proc for stderr * add docs; align no-mp bootstrap with mp --------- Co-authored-by:Baber <baber@hey.com>
-
Baber authored
-
- 03 Jun, 2025 1 commit
-
-
Baber Abbasi authored
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices * add tests
-
- 16 Apr, 2025 1 commit
-
-
Baber Abbasi authored
* switch MMLU to cais/mmlu * switch back to tj-actions/changed-files * cache HF folder
-
- 17 Mar, 2025 1 commit
-
-
Angelika Romanou authored
* Add INCLUDE tasks * pacify pre-commit --------- Co-authored-by:Baber <baber@hey.com>
-
- 14 Mar, 2025 1 commit
-
-
daniel-salib authored
-
- 04 Mar, 2025 2 commits
-
-
Kiersten Stokes authored
* Add a test for a custom unitxt task * Update task.py to bring in line with breaking change in v1.17.2 * Fix lint
-
Lucia Quirke authored
* Enable steering HF models Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu> * increase HF download timeout * Update readme; improve steering vector device handling * Update latest news * remove HF timeout increase * fix tests * ignore sae lens test * fix accidental force push --------- Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>
-
- 25 Feb, 2025 1 commit
-
-
Jinwei authored
* initial components to support sglang * init of class SGLangLM * draft for generate_until of SGLang model * mock loglikelihood * initial loglikelihood_tokens * todo: fix bug of sglang engine init * implement generation tasks and test * support output type loglikelihood and loglikelihood_rolling (#1) * . * loglikelihood_rolling * / * support dp_size>1 * typo * add tests and clean code * skip tests of sglang for now * fix OOM error of sglang pytest * finish test for sglang * add sglang to readme * fix OOM of tests and clean SGLang model * update readme * clean pyproject and add tests for evaluator * add accuracy tests and it passed locally * add notes for test * Update README.md update readme * pre-commit --------- Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com> Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
- 19 Jan, 2025 1 commit
-
-
Baber Abbasi authored
* update pre-commit
-
- 04 Dec, 2024 1 commit
-
-
Baber Abbasi authored
-
- 30 Nov, 2024 1 commit
-
-
Baber Abbasi authored
* make utility function to handle `until` * fix text
-
- 20 Nov, 2024 1 commit
-
-
Baber Abbasi authored
* fix test task * dont call lm.chat_template each time
-
- 18 Nov, 2024 1 commit
-
-
Kozzy Voudouris authored
* Add metabench (Kipnis et al. 2024) * Update metabench tasks for full replication of original benchmarks, using publicly available datasets * Remove unnecessary import * Add permute versions of each task, where the answer orders are randomly shuffled. * Add metabench group for easier evaluations * Fix mmlu counts after removing duplicate * Add secondary datasets * Fix f-string error * Fix f-string error for permute processing * Add original hash to outputs for easy matching to original results * Add line break at end of utils files * Remove extra line from winogrande * Reformat for linters * fix multiple input test * appease pre-commit * Add metabench to tasks README * fix multiple input `test_doc_to_text` --------- Co-authored-by:Baber <baber@hey.com>
-
- 09 Nov, 2024 1 commit
-
-
Baber Abbasi authored
* switch `max_tokens` for `max_completion_tokens`. OpenAI ChatCompletions * remove stop, temp=1 for o1 * add chat assertion * HF_DATASETS_TRUST_REMOTE_CODE = True for task tests * move warning
-
- 31 Oct, 2024 1 commit
-
-
Qubitium-ModelCloud authored
* support gptqmodel * code opt * add gptqmodel option * Update huggingface.py * Update pyproject.toml * gptqmodel version upgraded to 1.0.6 * GPTQModel version upgraded to 1.0.8 * Update pyproject.toml * fix ruff-format error * add gptqmodel test * Update gptqmodel test model * skip cuda * python3.8 compatible * Update README.md * Update README.md --------- Co-authored-by:CL-ModelCloud <cl@modelcloud.ai>
-
- 04 Oct, 2024 1 commit
-
-
Baber Abbasi authored
-
- 26 Sep, 2024 3 commits
-
-
Baber Abbasi authored
* add newlines to task descriptions; increment versions * fix task tests (with groups) * Apply suggestions from code review --------- Co-authored-by:Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
Baber Abbasi authored
* change glianorex to test set * nit * fix test; doc_to_target can be str for multiple_choice * nit
-
Giulio Lovisotto authored
* Treat python tasks same as yaml tasks. * Add tests. * Re-add fixture decorators. * Fix typing specification error for Python 3.9.
-
- 18 Sep, 2024 1 commit
-
-
David Corvoysier authored
* feat(neuron): align with latest optimum-neuron * feat(neuron): support pre-exported neuron models * fix(neuron): correctly use max_length * fix(neuron): adapt loglikelihood. The evaluation of log likelihood was not working for neuron models using continuous batching, such as all cached neuron Llama models. * refactor(neuron): remove dead code
-
- 09 Aug, 2024 1 commit
-
-
Jungwhan Kim authored
* add keep trailing newline * apply ruff-format * add prompt unit test * increment the version of tasks that have description with whitespace * remove white spaces of leaderboard bbh * update MMLU expected versions in output * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again --------- Co-authored-by:haileyschoelkopf <hailey@eleuther.ai>
-
- 01 Aug, 2024 1 commit
-
-
Baber Abbasi authored
* add temperature for log probs * add seed * nit * add new args to test * added warning for api chat models
-
- 22 Jul, 2024 1 commit
-
-
Baber Abbasi authored
* refactor pad_token handling to fn * fix docs * add pad_token_handling to vllm * start on API superclass * don't detokenize the returned logits * streamline vllm tokenizer * add type hint * pre-commit * seems to be in working order * add model to init * refactor api models * nit * cleanup * add pbar * fix type hints * change optional dependencies * json encode chat template * add type hints * deal with different prompt input requirements * nits * fix * cache inside async * fix * fix * nits * nits * nits * nit * fixup * fixup * nit * add dummy retry * add dummy retry * handle imports; skip failing test * add type hint * add tests * add dependency to tests * add package names to exception * nit * docs; type hints * handle api key * nit * tokenizer bug * fix tokenizer * nit * nit * add better error messages * nit * remove decorator * CI: install api dep * revert evaluator.py * consolidate * consolidate * nits * nit * fix typealias * nit * nit * nit * Update lm_eval/models/api_models.py typo Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/models/openai_completions.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/models/anthropic_llms.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/models/api_models.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fix typo * add news section * add info for API * pre-commit * typo * fix bug: unpack loglikelihood requests * fix bug: shared gen_kwargs mutated * nit: handle copy properly * Update README.md * Update README.md * Update README.md * Update api_models.py * Update README.md --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 08 Jul, 2024 2 commits
-
-
Hailey Schoelkopf authored
-
Lintang Sutawika authored
* add group_config arg * add a group config that allows disabling table for group score and group aggregate in general * fixed size configuration * adjust config * add group config * adjust mmlu to use group_config * fixed args input in aggregate_subtask_metrics * fixed issues related to printing alias of group and updated yaml * update all mmlu variants to include group_config * edit format * modify mmlu tasks * adjust group to also be a configurable group * add configurable group * simplify get_task_list * adjust group scoring with using ConfigurableGroup * adjust args * update mmlu * update mmlu * update to work with new group and task configuration * readd group_agg * readd files * move prepare_print_tasks to evaluator_utils * sort set to False by default, fix predict_only arg * add version for groups * reversed task list * update additional condition when loading a group in a group yaml * update truthfulqa * add description regarding tags replacing group * replace group to tag * fixed conditional statement * remove warning * update loading of task group and newly added tags * reformat with pre-commit * fixed info log * update * fix bug * fix bug * use task id to differentiate tasks * convert all groups to configurable groups * use task_id * reformat * add task_id for python tasks as well * add task_id for python tasks as well * add task_id for python tasks as well * revert truthfulqa * revert mmlu tasks * new mmlu config * new group config parameter `tag_to_task` * Update truthfulqa_mc2.yaml * reformat * add _process_group_config * adjust task_id * add get_subtask_list function to get proper subtask list * group config to_dict update * remove tag check * update mmlu * fix config passing issues * add test yaml * format fix * add documentation * corner case for single tag being called * fix indentation * formatting * update all mmlu variants * Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * remove group_alias * Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * remove version for metadata * Update docs/task_guide.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * update mmlu/ * removed " " in make_table * change how aggregate_metric is loaded * change how aggregate_metric is loaded * update aggregate_metric arg * update format * update format * some docs fixes * add groups for agieval, aexams, aclue * add more explicit aggregation groups * add more groupings / tags distinctions * add more groupings * more groupings * add many explicit group configs * add many explicit group configs * add more explicit group configs * add more explicit group configs * add more error msgs, agg_metric -> agg_metric_list * some docs updates * update task_id to be updateable and uses group:task format * make KMMLU a tag for now * update docs * don't duplicate task names * fix merge conflicts? * giving this a try * clean up diff * switch mmlu variants over to using * don't use to-be-deprecated group: config field in overview notebook * Python tasks which subclass ConfigurableTask now run * update mmlu * pre-commit format * fixed sorting for multi-level printing * move group api to separate file * fix bbh aggregation filter usage * track api/group.py * adjust group and tags loading * make explicit group configs for leaderboard and other newer tasks * fix arabicmmlu * update * change arabicmmlu template name??? * update group alias * fix printing bugs * check table printing is correct ; update tests * use mmlu_stem to have a group included in print tests --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
-
- 28 Jun, 2024 1 commit
-
-
Baber Abbasi authored
* add chat template * refactor token padding * nit * nit * check on failing test * check transformers version * remove transformers pin * add ids to test * nit * fixup * fix bos bug * nit * fixup! fix bos bug * increase tolerance for table test * don't detokenize vllm logprobs * Update lm_eval/models/utils.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * pre-commit run --all-files --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-
- 25 Jun, 2024 2 commits
-
-
Hailey Schoelkopf authored
* separate out optimum/neuralmagic tests to separate job * fix vllm tests * fix bug in --trust_remote_code * use datasets.config instead intentionally * fix remote code issue?
-
Baber Abbasi authored
* refactored `lm.apply_chat_template` * nit * fix weird type error * fixed! * skip failing test * pre-commit run all * add type hints * nit * nit * fixup
-
- 11 Jun, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 07 Jun, 2024 1 commit
-
-
Zafir Stojanovski authored
* sort metrics in output table * update docstring in `consolidate_results` * add tests for verifying consistency of table output * update tests to account for floating point inconsistencies * updated tests based on `pythia-14m`
-
- 31 May, 2024 1 commit
-
-
LSinev authored
-
- 24 May, 2024 1 commit
-
-
Hailey Schoelkopf authored
* add handling for bootstrap_iters=0 case * add more detail to docstring * run precommit
-
- 07 May, 2024 1 commit
-
-
Hailey Schoelkopf authored
-
- 06 May, 2024 1 commit
-
-
LSinev authored
* Added fewshot sampling seeds to evaluator.simple_evaluate signature. Way to control seed of fewshot sampling may help with #1591 * Added ability for custom sampler for ConfigurableTask. May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
* explicitly set fewshot random generator seed for HFLM generate_until_task test * add backward compatibility for three args seed setup * save seeds info to logs/reports
-
- 02 May, 2024 1 commit
-
-
Helena Kloosterman authored
* Add option to set OpenVINO config * Use utils.eval_logger for logging
-
- 16 Apr, 2024 1 commit
-
-
Michael Goin authored
* Add neuralmagic models for SparseML and DeepSparse * Update to latest and add test * Format * Fix list to List * Format * Add deepsparse/sparseml to automated testing * Update pyproject.toml * Update pyproject.toml * Update README * Fixes for dtype and device * Format * Fix test * Apply suggestions from code review Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Address review comments! --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
-