1. 10 Jul, 2025 1 commit
  2. 06 Jul, 2025 1 commit
  3. 05 Jul, 2025 1 commit
  4. 04 Jul, 2025 2 commits
  5. 03 Jun, 2025 1 commit
  6. 16 Apr, 2025 1 commit
  7. 17 Mar, 2025 1 commit
  8. 14 Mar, 2025 1 commit
  9. 04 Mar, 2025 2 commits
  10. 25 Feb, 2025 1 commit
    • Jinwei's avatar
      Support SGLang as Potential Backend for Evaluation (#2703) · 29971faa
      Jinwei authored
      
      
      * initial components to support sglang
      
      * init of class SGLangLM
      
      * draft for generate_until of SGLang model
      
      * mock loglikelihood
      
      * initial loglikelihood_tokens
      
      * todo: fix bug of sglang engine init
      
      * implement generation tasks and test
      
      * support output type loglikelihood and loglikelihood_rolling (#1)
      
      * .
      
      * loglikelihood_rolling
      
      * /
      
      * support dp_size>1
      
      * typo
      
      * add tests and clean code
      
      * skip tests of sglang for now
      
      * fix OOM error of sglang pytest
      
      * finish test for sglang
      
      * add sglang to readme
      
      * fix OOM of tests and clean SGLang model
      
      * update readme
      
      * clean pyproject and add tests for evaluator
      
      * add accuracy tests and it passed locally
      
      * add notes for test
      
      * Update README.md
      
      update readme
      
      * pre-commit
      
      ---------
      Co-authored-by: default avatarXiaotong Jiang <xiaotong.jiang@databricks.com>
      Co-authored-by: default avatarBaber Abbasi <92168766+baberabb@users.noreply.github.com>
      Co-authored-by: default avatarBaber <baber@hey.com>
      29971faa
  11. 19 Jan, 2025 1 commit
  12. 04 Dec, 2024 1 commit
  13. 30 Nov, 2024 1 commit
  14. 20 Nov, 2024 1 commit
    • Baber Abbasi's avatar
      Nits (#2500) · 867413f8
      Baber Abbasi authored
      * fix test task
      
      * dont call lm.chat_template each time
      867413f8
  15. 18 Nov, 2024 1 commit
    • Kozzy Voudouris's avatar
      Add metabench task to LM Evaluation Harness (#2357) · 62b4364d
      Kozzy Voudouris authored
      
      
      * Add metabench (Kipnis et al. 2024)
      
      * Update metabench tasks for full replication of original benchmarks, using publicly available datasets
      
      * Remove unnecessary import
      
      * Add permute versions of each task, where the answer orders are randomly shuffled.
      
      * Add metabench group for easier evaluations
      
      * Fix mmlu counts after removing duplicate
      
      * Add secondary datasets
      
      * Fix f-string error
      
      * Fix f-string error for permute processing
      
      * Add original hash to outputs for easy matching to original results
      
      * Add line break at end of utils files
      
      * Remove extra line from winogrande
      
      * Reformat for linters
      
      * fix multiple input test
      
      * appease pre-commit
      
      * Add metabench to tasks README
      
      * fix multiple input `test_doc_to_text`
      
      ---------
      Co-authored-by: default avatarBaber <baber@hey.com>
      62b4364d
  16. 09 Nov, 2024 1 commit
  17. 31 Oct, 2024 1 commit
    • Qubitium-ModelCloud's avatar
      Add GPTQModel support for evaluating GPTQ models (#2217) · 4f8e479e
      Qubitium-ModelCloud authored
      
      
      * support gptqmodel
      
      * code opt
      
      * add gptqmodel option
      
      * Update huggingface.py
      
      * Update pyproject.toml
      
      * gptqmodel version upgraded to 1.0.6
      
      * GPTQModel version upgraded to 1.0.8
      
      * Update pyproject.toml
      
      * fix ruff-format error
      
      * add gptqmodel test
      
      * Update gptqmodel test model
      
      * skip cuda
      
      * python3.8 compatible
      
      * Update README.md
      
      * Update README.md
      
      ---------
      Co-authored-by: default avatarCL-ModelCloud <cl@modelcloud.ai>
      4f8e479e
  18. 04 Oct, 2024 1 commit
  19. 26 Sep, 2024 3 commits
  20. 18 Sep, 2024 1 commit
    • David Corvoysier's avatar
      Update neuron backend (#2314) · 9a092f37
      David Corvoysier authored
      * feat(neuron): align with latest optimum-neuron
      
      * feat(neuron): support pre-exported neuron models
      
      * fix(neuron): correctly use max_length
      
      * fix(neuron): adapt loglikelihood
      
      The evaluation of log likelihood was not working for neuron models
      using continuous batching, such as all cached neuron LLama models.
      
      * refactor(neuron): remove dead code
      9a092f37
  21. 09 Aug, 2024 1 commit
    • Jungwhan Kim's avatar
      keep new line for task description (#2116) · 8ad598df
      Jungwhan Kim authored
      
      
      * add keep trailing newline
      
      * apply ruff-format
      
      * add prompt unit test
      
      * increment the version of tasks that have description with whitespace
      
      * remove white spaces of leaderboard bbh
      
      * update MMLU expected versions in output
      
      * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again
      
      ---------
      Co-authored-by: default avatarhaileyschoelkopf <hailey@eleuther.ai>
      8ad598df
  22. 01 Aug, 2024 1 commit
  23. 22 Jul, 2024 1 commit
    • Baber Abbasi's avatar
      Refactor API models (#2008) · 42dc2448
      Baber Abbasi authored
      
      
      * refactor pad_token handling to fn
      
      * fix docs
      
      * add pad_token_handling to vllm
      
      * start on API superclass
      
      * don't detokenize the returned logits
      
      * streamline vllm tokenizer
      
      * add type hint
      
      * pre-commit
      
      * seems to be in working order
      
      * add model to init
      
      * refactor api models
      
      * nit
      
      * cleanup
      
      * add pbar
      
      * fix type hints
      
      * change optional dependencies
      
      * json encode chat template
      
      * add type hints
      
      * deal with different prompt input requiremnts
      
      * nits
      
      * fix
      
      * cache inside async
      
      * fix
      
      * fix
      
      * nits
      
      * nits
      
      * nits
      
      * nit
      
      * fixup
      
      * fixup
      
      * nit
      
      * add dummy retry
      
      * add dummy retry
      
      * handle imports; skip failing test
      
      * add type hint
      
      * add tests
      
      * add dependency to tests
      
      * add package names to exception
      
      * nit
      
      * docs; type hints
      
      * handle api key
      
      * nit
      
      * tokenizer bug
      
      * fix tokenizer
      
      * nit
      
      * nit
      
      * add better error messages
      
      * nit
      
      * remove decorator
      
      * CI: install api dep
      
      * revert evaluator.py
      
      * consolidate
      
      * consolidate
      
      * nits
      
      * nit
      
      * fix typealias
      
      * nit
      
      * nit
      
      * nit
      
      * Update lm_eval/models/api_models.py
      
      typo
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/models/openai_completions.py
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/models/anthropic_llms.py
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/models/api_models.py
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * fix typo
      
      * add news section
      
      * add info for API
      
      * pre-commit
      
      * typo
      
      * fix bug: unpack logliklehood requests
      
      * fix bug: shared gen_kwargs mutated
      
      * nit: handle copy properly
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * Update api_models.py
      
      * Update README.md
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      42dc2448
  24. 08 Jul, 2024 2 commits
    • Hailey Schoelkopf's avatar
    • Lintang Sutawika's avatar
      Group agg rework (#1741) · 517aadc4
      Lintang Sutawika authored
      
      
      * add greoup_config arg
      
      * add a group config that allows disabling table for group score and group aggregate in general
      
      * fixed size configuration
      
      * adjust config
      
      * add group config
      
      * adjust mmlu to use group_config
      
      * fixed args input in aggregate_subtask_metrics
      
      * fixed issues related to printing alias of group and updated yaml
      
      * update all mmlu variants to include group_config
      
      * edit format
      
      * modify mmlu tasks
      
      * adjust group to also be a configurable group
      
      * add configurable group
      
      * simplify get_task_list
      
      * adjust group scoring with using ConfigurableGroup
      
      * adjust args
      
      * update mmlu
      
      * update mmlu
      
      * update to work with new group and task configuration
      
      * readd group_agg
      
      * readd files
      
      * move prepare_print_tasks to evaluator_utils
      
      * sort set to False by default, fix predict_only arg
      
      * add version for groups
      
      * reversed task list
      
      * update additional condition when loading a group in a group yaml
      
      * update truthfulqa
      
      * add description regarding tags replacing group
      
      * replace group to tag
      
      * fixed conditional statement
      
      * remove warning
      
      * update loading of task group and newly added tags
      
      * reformat with pre-commit
      
      * fixed info log
      
      * update
      
      * fix bug
      
      * fix bug
      
      * use task id to differentiate tasks
      
      * convert all groups to configurable groups
      
      * use task_id
      
      * reformat
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * revert truthfulqa
      
      * revert mmlu tasks
      
      * new mmlu config
      
      * new group config parameter `tag_to_task`
      
      * Update truthfulqa_mc2.yaml
      
      * reformate
      
      * add _process_group_config
      
      * adjust task_id
      
      * add get_subtask_list function to get proper subtask list
      
      * group config to_dict update
      
      * remove tag check
      
      * update mmlu
      
      * fix config passing issues
      
      * add test yaml
      
      * format fix
      
      * add documentation
      
      * corner case for single tag being called
      
      * fix indentation
      
      * formatting
      
      * update all mmlu variants
      
      * Update docs/task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove group_alias
      
      * Update docs/task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove version for metadata
      
      * Update docs/task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * update mmlu/
      
      * removed " " in make_table
      
      * change how aggregate_metric is loaded
      
      * change how aggregate_metric is loaded
      
      * update aggregate_metric arg
      
      * update format
      
      * update format
      
      * some docs fixes
      
      * add groups for agieval, aexams, aclue
      
      * add more explicit aggregation groups
      
      * add more groupings / tags distinctions
      
      * add more groupings
      
      * more groupings
      
      * add many explicit group configs
      
      * add many explicit group configs
      
      * add more explicit group configs
      
      * add more explicit group configs
      
      * add more error msgs, agg_metric -> agg_metric_list
      
      * some docs updates
      
      * update task_id to be updateable and uses group:task format
      
      * make KMMLU a tag for now
      
      * update docs
      
      * don't duplicate task names
      
      * fix merge conflicts?
      
      * giving this a try
      
      * clean up diff
      
      * switch mmlu variants over to using
      
      * don't use to-be-deprecated group: config field in overview notebook
      
      * Python tasks which subclass ConfigurableTask now run
      
      * update mmlu
      
      * pre-commit format
      
      * fixed sorting for multi-level printing
      
      * move group api to separate file
      
      * fix bbh aggregation filter usage
      
      * track api/group.py
      
      * adjust group and tags loading
      
      * make explicit group configs for leaderboard and other newer tasks
      
      * fix arabicmmlu
      
      * update
      
      * change arabicmmlu template name???
      
      * update group alias
      
      * fix printing bugs
      
      * check table printing is correct ; update tests
      
      * use mmlu_stem to have a group included in print tests
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      Co-authored-by: default avatarhaileyschoelkopf <hailey@eleuther.ai>
      517aadc4
  25. 28 Jun, 2024 1 commit
  26. 25 Jun, 2024 2 commits
  27. 11 Jun, 2024 1 commit
  28. 07 Jun, 2024 1 commit
    • Zafir Stojanovski's avatar
      Test output table layout consistency (#1916) · 40f5458f
      Zafir Stojanovski authored
      * sort metrics in output table
      
      * update docstring in `consolidate_results`
      
      * add tests for verifying consistency of table output
      
      * update tests to account for floating point inconsistencies
      
      * updated tests based on `pythia-14m`
      40f5458f
  29. 31 May, 2024 1 commit
  30. 24 May, 2024 1 commit
  31. 07 May, 2024 1 commit
  32. 06 May, 2024 1 commit
    • LSinev's avatar
      Provide ability for custom sampler for ConfigurableTask (#1616) · ae72cebc
      LSinev authored
      * Added fewshot sampling seeds to evaluator.simple_evaluate signature
      
      Way to control seed of fewshot sampling
      may help with #1591
      
      * Added ability for custom sampler for ConfigurableTask
      
      May be set in config like
      ```
      fewshot_config:
        sampler: !function utils.MyFewshotSampler
      ```
      
      * explicitly set fewshot random generator seed for HFLM generate_until_task test
      
      * add backward compatibility for three args seed setup
      
      * save seeds info to logs/reports
      ae72cebc
  33. 02 May, 2024 1 commit
  34. 16 Apr, 2024 1 commit