1. 05 Sep, 2024 1 commit
  2. 20 Aug, 2024 1 commit
  3. 11 Jul, 2024 1 commit
    • Prettify lm_eval --tasks list (#1929) · a0243d54
      anthony-dipofi authored
      
      
      * add  and ; move task list newline logic to new TaskManager.list_all_tasks() method
      
      * format table list into markdown table; add config location column
      
      * add Output Type column
      
      * add logic for printing table of tags separately
      
      * merge with main and fix conflicts ; update docstrings
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  4. 19 Jun, 2024 1 commit
  5. 09 Jun, 2024 1 commit
  6. 03 Jun, 2024 1 commit
  7. 31 May, 2024 1 commit
    • Add dataset card when pushing to HF hub (#1898) · f4f59251
      KonradSzafer authored
      
      
      * dataset card initial
      
      * few fixes
      
      * adds groups for math, mmlu, gpqa
      
      * added summary args
      
      * moved sanitize_list to utils
      
      * readme update
      
      * recreate metadata moved
      
      * multiple model support
      
      * results latest split fix
      
      * readme update and small refactor
      
      * fix grouping
      
      * add comments
      
      * added pathlib
      
      * corrected pathlib approach
      
      * check whether to create a metadata card
      
      * convert posix paths to str
      
      * default hf org from token
      
      * hf token value error
      
      * Add logs after successful upload
      
      * logging updates
      
      * dataset card example in the readme
      
      ---------
      Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
      Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
  8. 26 May, 2024 1 commit
  9. 06 May, 2024 1 commit
    • Provide ability for custom sampler for ConfigurableTask (#1616) · ae72cebc
      LSinev authored
      * Added fewshot sampling seeds to evaluator.simple_evaluate signature
      
      Way to control seed of fewshot sampling
      may help with #1591
      
      * Added ability for custom sampler for ConfigurableTask
      
      May be set in config like
      ```
      fewshot_config:
        sampler: !function utils.MyFewshotSampler
      ```
      
      * explicitly set fewshot random generator seed for HFLM generate_until_task test
      
      * add backward compatibility for three args seed setup
      
      * save seeds info to logs/reports
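The `fewshot_config` snippet in the commit above points at a sampler class through `!function`. A minimal sketch of what such a class might look like follows. It is standalone and does not import lm_eval: the constructor arguments (`docs`, `rnd`) and the `sample(n)` method are assumptions chosen for illustration, not the library's confirmed interface. It also shows why passing a seeded generator, as the commit's seed arguments suggest, makes few-shot selection reproducible.

```python
import random
from typing import Any, Dict, List, Optional


class MyFewshotSampler:
    """Hypothetical few-shot sampler; names and signature are illustrative only."""

    def __init__(self, docs: List[Dict[str, Any]], rnd: Optional[random.Random] = None) -> None:
        self.docs = list(docs)
        # Fall back to a fixed seed so runs are reproducible by default (assumed value).
        self.rnd = rnd if rnd is not None else random.Random(1234)

    def sample(self, n: int) -> List[Dict[str, Any]]:
        """Return n distinct few-shot documents drawn with the seeded generator."""
        if n < 0 or n > len(self.docs):
            raise ValueError(f"cannot sample {n} docs from {len(self.docs)}")
        return self.rnd.sample(self.docs, n)


if __name__ == "__main__":
    docs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
    first = MyFewshotSampler(docs, rnd=random.Random(0)).sample(3)
    second = MyFewshotSampler(docs, rnd=random.Random(0)).sample(3)
    # Same seed, same documents: the selection is reproducible.
    print(first == second)
```

Passing the generator in, rather than calling the module-level `random` functions, keeps the sampler independent of any other code that touches the global random state.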
  10. 03 May, 2024 2 commits
  11. 07 Apr, 2024 1 commit
  12. 22 Mar, 2024 1 commit
  13. 19 Mar, 2024 1 commit
  14. 18 Mar, 2024 1 commit
    • Cleanup for v0.4.2 release (#1573) · 5627e819
      Hailey Schoelkopf authored
      * Update interface.md
      
      * fix: make caching reqs always work with accelerate launch
      
      * remove stale task migration checklist
      
      * remove deprecation warnings
      
      * make informative TypeErrors for get_task_dict
      
      * bump version metadata
      
      * fix num_fewshot printing bug
      
      * add fewshot value to cache key
  15. 17 Mar, 2024 1 commit
  16. 12 Mar, 2024 1 commit
  17. 06 Mar, 2024 1 commit
  18. 04 Mar, 2024 1 commit
  19. 03 Mar, 2024 1 commit
  20. 01 Mar, 2024 1 commit
  21. 26 Feb, 2024 1 commit
    • Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272
      Aaron V authored
      
      
      * Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().
      
      * Remove extra S in cache path in caching module
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.
      
      * Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.
      
      * Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.
      
      * Remove line from gitignore, add to cli for caching datasets.
      
      * Add hashing suffix to .pickles. Update test script typo.
      
      * Favor isinstance() over type() in evaluator.py
      
      * Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().
      
      * Update arg description to simple_evaluate.
      
      * Update pyproject.toml
      
      * Fix typehint
      
      * Remove the use of random() for creating default cache pickle hash.
      
      * Check that cache dir exists before clearing it in request cache tests.
      
      * Fix linting problems.
      
      * Fix additional formatting errors.
      
      * Remove trailing whitespace.
      
      * Add new line to the end of .gitignore.
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
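The caching commit above mentions adding a hashing suffix to the `.pickle` files and dropping `random()` when building the default hash. A minimal sketch of that idea, assuming a deterministic key derived from the run parameters, is shown below. Every function name, the key fields, and the file naming scheme here are hypothetical and are not taken from the lm_eval caching module.

```python
import hashlib
import os
import pickle
import tempfile
from typing import Any, Optional


def make_cache_key(task_name: str, num_fewshot: int, rank: int, world_size: int) -> str:
    """Build a deterministic key; the same inputs always give the same suffix."""
    raw = f"{task_name}|{num_fewshot}shot|rank{rank}|world_size{world_size}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def cache_file(cache_dir: str, prefix: str, key: str) -> str:
    """Return the path of the pickle file for a given prefix and hash suffix."""
    return os.path.join(cache_dir, f"{prefix}-{key}.pickle")


def save_to_cache(cache_dir: str, prefix: str, key: str, obj: Any) -> None:
    """Pickle obj under the hashed file name, creating the directory if needed."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file(cache_dir, prefix, key), "wb") as fh:
        pickle.dump(obj, fh)


def load_from_cache(cache_dir: str, prefix: str, key: str) -> Optional[Any]:
    """Return the cached object, or None on a cache miss."""
    path = cache_file(cache_dir, prefix, key)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        key = make_cache_key("demo_task", num_fewshot=5, rank=0, world_size=1)
        save_to_cache(tmp, "requests", key, ["request-1", "request-2"])
        print(load_from_cache(tmp, "requests", key))
```

Because the key includes the few-shot count, a run with a different `num_fewshot` misses the cache instead of silently reusing requests built for another setting.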
  22. 22 Feb, 2024 1 commit
    • feat: Add Weights and Biases support (#1339) · 2683fbbb
      Ayush Thakur authored
      
      
      * add wandb as extra dependency
      
      * wandb metrics logging
      
      * refactor
      
      * log samples as tables
      
      * fix linter
      
      * refactor: put in a class
      
      * change dir
      
      * add panels
      
      * log eval as table
      
      * improve tables logging
      
      * improve reports logging
      
      * precommit run
      
      * ruff check
      
      * handle importing reports api gracefully
      
      * ruff
      
      * compare results
      
      * minor pre-commit fixes
      
      * build comparison report
      
      * ruff check
      
      * log results as artifacts
      
      * remove comparison script
      
      * update dependency
      
      * type annotate and docstring
      
      * add example
      
      * update readme
      
      * fix typo
      
      * teardown
      
      * handle outside wandb run
      
      * gracefully fail reports creation
      
      * precommit checks
      
      * add report url to summary
      
      * use wandb printer for better url stdout
      
      * fix ruff
      
      * handle N/A and groups
      
      * fix eval table
      
      * remove unused var
      
      * update wandb version req + disable reports stdout
      
      * remove reports feature to TODO
      
      * add label to multi-choice question data
      
      * log model predictions
      
      * lints
      
      * loglikelihood_rolling
      
      * log eval result for groups
      
      * log tables by group for better handling
      
      * precommit
      
      * choices column for multi-choice
      
      * graciously fail wandb
      
      * remove reports feature
      
      * track system metrics + total eval time + stdout
      
      ---------
      Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
  23. 12 Feb, 2024 1 commit
  24. 02 Feb, 2024 1 commit
  25. 01 Feb, 2024 1 commit
    • Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95
      Lintang Sutawika authored
      
      
      * add trust_remote_code as default
      
      * task for testing recursive
      
      * changed source of ALL_TASKS
      
      * tasks should only accept TaskObjects
      
      * initialize_tasks returns list of tasks and groups
      
      * remove trust_remote_code for now
      
      * moved constructor process to inside load_yaml_config
      
      * more comprehensive way to index tasks and groups
      
      * pre-commit format
      
      * add exit after error
      
      * adjust how task objects are called
      
      * no need to use get_task_dict
      
      * load_task_or_group works but only for tasks
      
      * pre-commit format
      
      * half working for nested groups
      
      * changed variable names
      
      * allow groups and tasks to work
      
      * temp save
      
      * indexing and loading are part of a task_manager object
      
      * adapted initialize_tasks
      
      * iron out bugs
      
      * fixed typo
      
      * fixed typo
      
      * simplified code
      
      * further tidy up
      
      * remove lines for testing
      
      * removed test lines
      
      * removed unused code
      
      * remove unused import
      
      * fixed bug
      
      * removed comments
      
      * group in a list of group can accept parameter changes like `num_fewshot`
      
      * check if config is task update
      
      * add GroupConfig object
      
      * edit test yaml
      
      * remove args
      
      * testing returning to python task list
      
      * add weight_by_size config
      
      * describe weight_by_size in docs
      
      * fix weight by size potential error
      
      * can load individual custom python class task
      
      * moved import_function into the config loading file
      
      * remove print lines
      
      * add squadv2 yaml
      
      * temporary scroll implementation
      
      * revert back to use load_yaml_config but with modes
      
      * fix group being loaded with a None
      
      * reformat
      
      * can load unregistered tasks from a group
      
      * update scrolls
      
      * edit scrolls multiplechoice task
      
      * adjust class initialization
      
      * fix initialization
      
      * changed how to identify group and python tasks, fix logger
      
      * allow loading "include" that is nested in a group config
      
      * reworked flan benchmark
      
      * allow duplicate task in the same group to co-exist
      
      * process group_alias
      
      * removed group_alias
      
      * allow parameters set in group_config to apply to all tasks in tasklist
      
      * add function, but comment for now
      
      * reworked processing dict-base config
      
      * fixed how configs in group are processed
      
      * update to allow root group to have its alias used
      
      * remove unused classes
      
      * remove unused classes
      
      * revert some parts to original
      
      * forgot to change one variable
      
      * adapt the new process to use get_task_dict
      
      * fix for singular group call
      
      * fix variable names
      
      * add TaskManager into the evaluator
      
      * format
      
      * changed how dict tasks are loaded
      
      * add docs
      
      * Update docs/new_task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update evaluator.py
      
      * Update evaluator.py
      
      * remove groupconfig for now
      
      * changed _config to config
      
      * update interface.md to explain TaskManager
      
      * added property functions
      
      * adjusted logger
      
      * update write_out.py
      
      * updated tests
      
      * added documentation and some modifications
      
      * added docstring documentation
      
      * precommit format
      
      * updated task loading for tests
      
      * updates tests
      
      * changed arg order for load_yaml_config
      
      * update to handle scrolls and edit log message
      
      * remove unused lines
      
      * return a list of task classes and not a dict
      
      * Update __init__.py
      
      * Delete lm_eval/tasks/benchmarks/test.yaml
      
      * Update task.py
      
      * Update lm_eval/utils.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/utils.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update utils.py
      
      * re-added old functions with new log message
      
      * Update docs/new_task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update new_task_guide.md
      
      * added info regarding `get_task_dict` and documentation
      
      * add get_config for Task
      
      * pre-commit formatting
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
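The commit above adds a `weight_by_size` config for groups but the log does not show how the aggregation is computed. As a rough illustration of what such an option could mean, the sketch below compares a size-weighted mean of subtask scores with a plain mean. The function name `aggregate_group` and its arguments are hypothetical; only the option name `weight_by_size` comes from the commit message.

```python
from typing import Sequence


def aggregate_group(
    scores: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask scores into one group score.

    With weight_by_size=True each subtask counts in proportion to its number
    of documents; otherwise every subtask counts equally.
    """
    if not scores or len(scores) != len(sizes):
        raise ValueError("scores and sizes must be non-empty and of equal length")
    if not weight_by_size:
        return sum(scores) / len(scores)
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total size must be positive")
    return sum(score * size for score, size in zip(scores, sizes)) / total


if __name__ == "__main__":
    subtask_scores = [0.5, 1.0]
    subtask_sizes = [300, 100]
    # The larger subtask pulls the weighted score toward its own value.
    print(aggregate_group(subtask_scores, subtask_sizes, weight_by_size=True))
    print(aggregate_group(subtask_scores, subtask_sizes, weight_by_size=False))
```

The two modes answer different questions: the weighted mean matches accuracy over the pooled documents, while the plain mean treats each subtask as equally important regardless of size.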
  26. 31 Jan, 2024 1 commit
    • add bypass metric (#1156) · f8203de1
      Baber Abbasi authored
      * add bypass metric
      
      * fixed `bypass` metric.
      
      * add task attributes if predict_only
      
      * add `predict_only` checks
      
      * add docs
      
      * added `override_metric`, `override_config` to `Task`
      
      * nits
      
      * nit
      
      * changed --predict_only to generations; nits
      
      * nits
      
      * nits
      
      * change gen_kwargs warning
      
      * add note about `--predict_only` in README.md
      
      * added `predict_only`
      
      * move table to bottom
      
      * nit
      
      * change null aggregation to bypass (conflict)
      
      * bugfix; default `temp=0.0`
      
      * typo
  27. 28 Jan, 2024 1 commit
  28. 10 Jan, 2024 1 commit
  29. 05 Jan, 2024 1 commit
  30. 20 Dec, 2023 1 commit
    • Switch Linting to `ruff` (#1166) · 65b8761d
      Baber Abbasi authored
      * add ruff and isort. remove black and flake8
      
      * remove unnecessary dependencies
      
      * remove dependency from table
      
      * change order
      
      * ran ruff
      
      * check 3.9
      
      * exclude evaluator
      
      * update CI workflow
      
      * use ruff config in pyproject.toml
      
      * test
      
      * add isort rules to ruff
      
      * sort imports
      
      * import `make_table`
      
      * try stages for no-commit-to-branch
      
      * turn on mypy for pre-commit
      
      * test
      
      * test
      
      * test
      
      * change no-commit-to-branch to default
      
      * nits
      
      * fixed dependency
  31. 18 Dec, 2023 2 commits
  32. 24 Nov, 2023 2 commits
  33. 17 Nov, 2023 3 commits
  34. 10 Nov, 2023 1 commit
  35. 03 Nov, 2023 1 commit