1. 06 Jul, 2025 1 commit
  2. 05 Jul, 2025 1 commit
  3. 04 Jul, 2025 2 commits
  4. 22 Apr, 2025 1 commit
  5. 08 Apr, 2025 1 commit
  6. 18 Mar, 2025 1 commit
    • Baber Abbasi's avatar
      Add loncxt tasks (#2629) · 80a10075
      Baber Abbasi authored
      suport for longcontext (and other synthetic tasks)
      * add ruler
      * add longbench
      * pass `metadata` to TaskConfig
      80a10075
  7. 11 Mar, 2025 1 commit
  8. 21 Feb, 2025 1 commit
    • Lintang Sutawika's avatar
      Logging (#2203) · 1ba35e62
      Lintang Sutawika authored
      
      
      * changed source of eval_logger
      
      * allow eval_logger to be set from args
      
      * removed verbosity arg from non-main methods
      
      * fix logging
      
      * pre-commit
      
      * set verbosity in eval logger
      
      * replace utils.eval_logger
      
      * fix logging in main
      
      * add logging to docs
      
      * add logging message
      
      * nit
      
      * add logging to docs
      
      * refactor setup_logging to utils
      
      ---------
      Co-authored-by: default avatarBaber <baber@hey.com>
      1ba35e62
  9. 19 Jan, 2025 1 commit
  10. 20 Dec, 2024 1 commit
  11. 16 Dec, 2024 1 commit
  12. 09 Aug, 2024 1 commit
    • Jungwhan Kim's avatar
      keep new line for task description (#2116) · 8ad598df
      Jungwhan Kim authored
      
      
      * add keep trailing newline
      
      * apply ruff-format
      
      * add prompt unit test
      
      * increment the version of tasks that have description with whitespace
      
      * remove white spaces of leaderboard bbh
      
      * update MMLU expected versions in output
      
      * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again
      
      ---------
      Co-authored-by: default avatarhaileyschoelkopf <hailey@eleuther.ai>
      8ad598df
  13. 01 Aug, 2024 1 commit
  14. 10 Jul, 2024 1 commit
    • Lintang Sutawika's avatar
      Update utils.py (#2085) · 058cfd0e
      Lintang Sutawika authored
      Group Configs with no aggregation will print a empty space as the score for result table.
      Example
      ```
      |    Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
      |--------------|-------|------|-----:|--------|---|-----:|---|-----:|
      |group         |    N/A|      |      |        |   |      |   |      |
      | - task 0     |Yaml   |none  |     0|acc     |↑  |0.4000|±  |0.0910|
      | - task 1     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
      | - task 2     |Yaml   |none  |     0|acc     |↑  |0.2667|±  |0.0821|
      | - task 3     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
      ```
      
      So the `v` variable in the `make_table` needs to check if the value is a float or a string.
      058cfd0e
  15. 08 Jul, 2024 2 commits
    • Nathan Habib's avatar
      Allow gating EvaluationTracker HF Hub results; customizability (#2051) · 563f7971
      Nathan Habib authored
      * batch commit
      
      * :Revert "batch commit"
      
      This reverts commit d859d1ca.
      
      * batch commit
      
      * checkout from main
      
      * checkout from main
      
      * checkout from main
      
      * checkout from main
      
      * checkout from main
      
      * cleanup
      
      * cleanup
      
      * cleanup
      
      * cleanup
      
      * cleanup
      
      * cleanup eval results
      
      * cleanup
      
      * add check for gated repo
      
      * fix jsonline issue
      
      * fix
      
      * add try catch when gating the details repo
      
      * add doc
      
      * adds back hub_repo_name
      
      * readds hub repo name
      563f7971
    • Lintang Sutawika's avatar
      Group agg rework (#1741) · 517aadc4
      Lintang Sutawika authored
      
      
      * add greoup_config arg
      
      * add a group config that allows disabling table for group score and group aggregate in general
      
      * fixed size configuration
      
      * adjust config
      
      * add group config
      
      * adjust mmlu to use group_config
      
      * fixed args input in aggregate_subtask_metrics
      
      * fixed issues related to printing alias of group and updated yaml
      
      * update all mmlu variants to include group_config
      
      * edit format
      
      * modify mmlu tasks
      
      * adjust group to also be a configurable group
      
      * add configurable group
      
      * simplify get_task_list
      
      * adjust group scoring with using ConfigurableGroup
      
      * adjust args
      
      * update mmlu
      
      * update mmlu
      
      * update to work with new group and task configuration
      
      * readd group_agg
      
      * readd files
      
      * move prepare_print_tasks to evaluator_utils
      
      * sort set to False by default, fix predict_only arg
      
      * add version for groups
      
      * reversed task list
      
      * update additional condition when loading a group in a group yaml
      
      * update truthfulqa
      
      * add description regarding tags replacing group
      
      * replace group to tag
      
      * fixed conditional statement
      
      * remove warning
      
      * update loading of task group and newly added tags
      
      * reformat with pre-commit
      
      * fixed info log
      
      * update
      
      * fix bug
      
      * fix bug
      
      * use task id to differentiate tasks
      
      * convert all groups to configurable groups
      
      * use task_id
      
      * reformat
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * revert truthfulqa
      
      * revert mmlu tasks
      
      * new mmlu config
      
      * new group config parameter `tag_to_task`
      
      * Update truthfulqa_mc2.yaml
      
      * reformate
      
      * add _process_group_config
      
      * adjust task_id
      
      * add get_subtask_list function to get proper subtask list
      
      * group config to_dict update
      
      * remove tag check
      
      * update mmlu
      
      * fix config passing issues
      
      * add test yaml
      
      * format fix
      
      * add documentation
      
      * corner case for single tag being called
      
      * fix indentation
      
      * formatting
      
      * update all mmlu variants
      
      * Update docs/task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove group_alias
      
      * Update docs/task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove version for metadata
      
      * Update docs/task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * update mmlu/
      
      * removed " " in make_table
      
      * change how aggregate_metric is loaded
      
      * change how aggregate_metric is loaded
      
      * update aggregate_metric arg
      
      * update format
      
      * update format
      
      * some docs fixes
      
      * add groups for agieval, aexams, aclue
      
      * add more explicit aggregation groups
      
      * add more groupings / tags distinctions
      
      * add more groupings
      
      * more groupings
      
      * add many explicit group configs
      
      * add many explicit group configs
      
      * add more explicit group configs
      
      * add more explicit group configs
      
      * add more error msgs, agg_metric -> agg_metric_list
      
      * some docs updates
      
      * update task_id to be updateable and uses group:task format
      
      * make KMMLU a tag for now
      
      * update docs
      
      * don't duplicate task names
      
      * fix merge conflicts?
      
      * giving this a try
      
      * clean up diff
      
      * switch mmlu variants over to using
      
      * don't use to-be-deprecated group: config field in overview notebook
      
      * Python tasks which subclass ConfigurableTask now run
      
      * update mmlu
      
      * pre-commit format
      
      * fixed sorting for multi-level printing
      
      * move group api to separate file
      
      * fix bbh aggregation filter usage
      
      * track api/group.py
      
      * adjust group and tags loading
      
      * make explicit group configs for leaderboard and other newer tasks
      
      * fix arabicmmlu
      
      * update
      
      * change arabicmmlu template name???
      
      * update group alias
      
      * fix printing bugs
      
      * check table printing is correct ; update tests
      
      * use mmlu_stem to have a group included in print tests
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      Co-authored-by: default avatarhaileyschoelkopf <hailey@eleuther.ai>
      517aadc4
  16. 11 Jun, 2024 1 commit
    • KonradSzafer's avatar
      Results filenames handling fix (#1926) · 69952581
      KonradSzafer authored
      * results filenames handling moved to utils
      
      * zeno results handling fix
      
      * tasks_for_model backward compatibility
      
      * results files logic moved to tasks_for_model
      
      * moved sanitize_model_name to utils
      69952581
  17. 07 Jun, 2024 1 commit
    • Zafir Stojanovski's avatar
      Test output table layout consistency (#1916) · 40f5458f
      Zafir Stojanovski authored
      * sort metrics in output table
      
      * update docstring in `consolidate_results`
      
      * add tests for verifying consistency of table output
      
      * update tests to account for floating point inconsistencies
      
      * updated tests based on `pythia-14m`
      40f5458f
  18. 31 May, 2024 1 commit
    • KonradSzafer's avatar
      Add dataset card when pushing to HF hub (#1898) · f4f59251
      KonradSzafer authored
      
      
      * dataset card initial
      
      * few fixes
      
      * adds groups for math, mmlu, gpqa
      
      * added summary agrs
      
      * moved sanitize_list to utils
      
      * readme update
      
      * recreate metadata moved
      
      * multiple model support
      
      * results latest split fix
      
      * readme update and small refactor
      
      * fix grouping
      
      * add comments
      
      * added pathlib
      
      * corrected pathlib approach
      
      * check whether to create a metadata card
      
      * convert posix paths to str
      
      * default hf org from token
      
      * hf token value error
      
      * Add logs after successful upload
      
      * logging updates
      
      * dataset card example in the readme
      
      ---------
      Co-authored-by: default avatarNathan Habib <nathan.habib@huggingface.com>
      Co-authored-by: default avatarAlina Lozovskaia <alinailozovskaya@gmail.com>
      f4f59251
  19. 30 May, 2024 1 commit
  20. 07 May, 2024 1 commit
  21. 03 May, 2024 1 commit
    • KonradSzafer's avatar
      evaluation tracker implementation (#1766) · 59cf408a
      KonradSzafer authored
      * evaluation tracker implementation
      
      * OVModelForCausalLM test fix
      
      * typo fix
      
      * moved methods args
      
      * multiple args in one flag
      
      * loggers moved to dedicated dir
      
      * improved filename sanitization
      59cf408a
  22. 27 Feb, 2024 1 commit
    • Baber Abbasi's avatar
      Refactor `evaluater.evaluate` (#1441) · 5ccd65d4
      Baber Abbasi authored
      
      
      * change `all_gather` to `gather`
      
      * add TaskOutput utility class
      
      * Add FilterResults class and refactor task handling.
      
      * Rename `key` to `filter_key` for clarity
      
      * Add `print_writeout` function in utils.py
      
      * Add function to calculate limit size.
      
      * Add doc_iterator method to Task class
      
      * Refactor `doc_iterator` and cleanup in Task class
      
      * remove superfluous bits
      
      * change `all_gather` to `gather`
      
      * bugfix
      
      * bugfix
      
      * fix `gather`
      
      * Refactor `gather` loop
      
      * Refactor aggregate metrics calculation
      
      * Refactor and simplify aggregate metrics calculation
      Removed unused code
      
      * Simplify metrics calculation and remove unused code.
      
      * simplify the metrics calculation in `utils.py` and `evaluator.py`.
      
      * Fix group metric
      
      * change evaluate to hf_evaluate
      
      * change evaluate to hf_evaluate
      
      * add docs
      
      * add docs
      
      * nits
      
      * make isslice keyword only
      
      * nit
      
      * add todo
      
      * nit
      
      * nit
      
      * nit: swap order samples_metrics tuple
      
      * move instance sorting outside loop
      
      * nit
      
      * nit
      
      * Add __repr__ for ConfigurableTask
      
      * nit
      
      * nit
      
      * Revert "nit"
      
      This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.
      
      * fix some logging
      
      * nit
      
      * fix `predict_only` bug. thanks to `@LSinev`!
      
      * change `print_tasks` to `prepare_print_tasks`
      
      * nits
      
      * move eval utils
      
      * move eval utils
      
      * nit
      
      * add comment
      
      * added tqdm descriptions
      
      * Update lm_eval/evaluator_utils.py
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * fix mgsm bug
      
      * nit
      
      * fix `build_all_requests`
      
      * pre-commit
      
      * add ceil to limit
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      5ccd65d4
  23. 26 Feb, 2024 1 commit
    • Lintang Sutawika's avatar
      Cont metrics (#1475) · 96d185fa
      Lintang Sutawika authored
      
      
      * add brier_score
      
      * process brier_score
      
      * brier score is working for N-sized class
      
      * fxied brier score
      
      * add TED to BigBench and Brier score to MMLU
      
      * format
      
      * Update metrics.py
      
      * Update task.py
      
      * Update generate_until_template_yaml
      
      * Delete lm_eval/tasks/bigbench/aux_metric.py
      
      * Update generate_until_template_yaml
      
      * Update _default_template_yaml
      
      * Update _generate_configs.py
      
      * Update _generate_configs.py
      
      * Update _generate_configs.py
      
      * fix (format?)
      
      * format?
      
      * format, once more
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      96d185fa
  24. 24 Feb, 2024 1 commit
    • LSinev's avatar
      Add environment and transformers version logging in results dump (#1464) · f78e2da4
      LSinev authored
      * Save git_hash to results even if git is not available to call as subprocess
      
      * Store more info about environment and transformers version in results to help researchers track inconsistencies
      
      * moved added logging to logging_utils
      
      * moved get_git_commit_hash to logging_utils.py
      
      * moved add_env_info inside evaluator
      f78e2da4
  25. 14 Feb, 2024 1 commit
  26. 01 Feb, 2024 1 commit
    • Lintang Sutawika's avatar
      Faster Task and Group Loading, Allow Recursive Groups (#1321) · d714fc95
      Lintang Sutawika authored
      
      
      * add trust_remote_code as default
      
      * task for testing recursive
      
      * changed source of ALL_TASKS
      
      * tasks should only accept TaskObjects
      
      * initialize_tasks returns list of tasks and groups
      
      * remove trust_remote_code for now
      
      * moved constructor process to inside load_yaml_config
      
      * more comprehensive way to index tasks and groups
      
      * pre-commit format
      
      * add exit after error
      
      * adjust how task objects are called
      
      * no need to use get_task_dict
      
      * load_task_or_group works but only for tasks
      
      * pre-commit format
      
      * half working for nested groups
      
      * changed variable names
      
      * allow groups and tasks to work
      
      * temp save
      
      * indexing and loading are part of a task_manager object
      
      * adapted initialize_tasks
      
      * iron out bugs
      
      * fixed typo
      
      * fixed typo
      
      * simplified code
      
      * further tidy up
      
      * remove lines for testing
      
      * removed test lines
      
      * removed unused code
      
      * remove unused import
      
      * fixed bug
      
      * removed comments
      
      * group in a list of group can accept parameter changes like `num_fewshot`
      
      * add trust_remote_code as default
      
      * task for testing recursive
      
      * changed source of ALL_TASKS
      
      * tasks should only accept TaskObjects
      
      * initialize_tasks returns list of tasks and groups
      
      * remove trust_remote_code for now
      
      * moved constructor process to inside load_yaml_config
      
      * more comprehensive way to index tasks and groups
      
      * pre-commit format
      
      * add exit after error
      
      * adjust how task objects are called
      
      * no need to use get_task_dict
      
      * load_task_or_group works but only for tasks
      
      * pre-commit format
      
      * half working for nested groups
      
      * changed variable names
      
      * allow groups and tasks to work
      
      * temp save
      
      * indexing and loading are part of a task_manager object
      
      * adapted initialize_tasks
      
      * iron out bugs
      
      * fixed typo
      
      * fixed typo
      
      * simplified code
      
      * further tidy up
      
      * remove lines for testing
      
      * removed test lines
      
      * removed unused code
      
      * remove unused import
      
      * fixed bug
      
      * removed comments
      
      * group in a list of group can accept parameter changes like `num_fewshot`
      
      * check if config is task update
      
      * add GroupConfig object
      
      * edit test yaml
      
      * remove args
      
      * testing returning to python task list
      
      * add weight_by_size config
      
      * describe weight_by_size in docs
      
      * fix weight by size potential error
      
      * can load individual custom python class task
      
      * moved import_function into the config loading file
      
      * remove print lines
      
      * add squadv2 yaml
      
      * temporary scroll implementation
      
      * revert back to use load_yaml_config but with modes
      
      * fix group being loaded with a None
      
      * reformat
      
      * can load unregistered tasks from a group
      
      * update scrolls
      
      * edit scrolls multiplechoice task
      
      * adjust class initialization
      
      * fix initialization
      
      * changed how to identify group and python tasks, fix logger
      
      * allow loading "include" that is nested in a group config
      
      * reworked flan benchmark
      
      * allow duplicate task in the same group to co-exist
      
      * process group_alias
      
      * removed group_alias
      
      * allow parameters set in group_config to apply to all tasks in tasklist
      
      * add function, but comment for now
      
      * reworked processing dict-base config
      
      * fixed how configs in group are processed
      
      * update to allow root group to have its alias used
      
      * remove unused classes
      
      * remove unused classes
      
      * revert some parts to original
      
      * forgot to change one variable
      
      * adapt the new process to use get_task_dict
      
      * fix for singular group call
      
      * fix variable names
      
      * add TaskManager into the evaluator
      
      * format
      
      * changed how dict tasks are loaded
      
      * add docs
      
      * Update docs/new_task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update evaluator.py
      
      * Update evaluator.py
      
      * remove groupconfig for now
      
      * changed _config to config
      
      * update interface.md to explain TaskManager
      
      * added property functions
      
      * adjusted logger
      
      * update write_out.py
      
      * updated tests
      
      * added documentation and some modifications
      
      * added docstring documentation
      
      * precommit format
      
      * updated task loading for tests
      
      * updates tests
      
      * changed arg order for load_yaml_config
      
      * update to handle scrolls and edit log message
      
      * remove unused lines
      
      * return a list of task classes and not a dict
      
      * Update __init__.py
      
      * Delete lm_eval/tasks/benchmarks/test.yaml
      
      * Update task.py
      
      * Update lm_eval/utils.py
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/utils.py
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update utils.py
      
      * re-added old functions with new log message
      
      * Update docs/new_task_guide.md
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update new_task_guide.md
      
      * added infor regarding `get_task_dict` and documentation
      
      * add get_config for Task
      
      * pre-commit formatting
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      d714fc95
  27. 11 Jan, 2024 1 commit
  28. 02 Jan, 2024 1 commit
  29. 27 Dec, 2023 1 commit
    • Baber Abbasi's avatar
      nits + fix siqa (#1216) · 6a1c19ed
      Baber Abbasi authored
      * fix group
      
      * siqa: default.yml -> default.yaml
      
      * max_gen_toks -> self.max_gen_toks
      
      * add ids to task tests
      
      * fix siqa
      
      * fix gen_kwargs for openai-chat
      6a1c19ed
  30. 24 Dec, 2023 1 commit
  31. 23 Dec, 2023 1 commit
  32. 22 Dec, 2023 1 commit
    • Zach Schillaci's avatar
      Generic decorator for handling rate limit errors (#1109) · 046ea6e2
      Zach Schillaci authored
      
      
      * Add retry error handler
      
      * fixup! Add retry error handler
      
      * Move to utils.py
      
      * Run isort on utils.py
      
      * Catch multiple exceptions
      
      * Update LMs with exception handler
      
      * Fixes to anthropic retry handler
      
      * fix callback kwarg
      
      * Update textsynth.py
      
      * fix python 3.8 incompatibility
      
      * fix indenterror I introduced
      
      * placate linter?
      
      * Update on_exception_callback kwarg name
      
      * fixup! Merge branch 'main' into add-retry-error-handler
      
      * fixup! fixup! Merge branch 'main' into add-retry-error-handler
      
      * Merge conflicts are fun
      
      * Run pre-commit
      
      ---------
      Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      046ea6e2
  33. 20 Dec, 2023 1 commit
    • Baber Abbasi's avatar
      Switch Linting to `ruff` (#1166) · 65b8761d
      Baber Abbasi authored
      * add ruff and isort. remove black and flake8
      
      * remove unnecessary dependencies
      
      * remove dependency from table
      
      * change order
      
      * ran ruff
      
      * check 3.9
      
      * exclude evaluator
      
      * update CI workflow
      
      * use ruff config in pyproject.toml
      
      * test
      
      * add isort rules to ruff
      
      * sort imports
      
      * import `make_table`
      
      * try stages for no-commit-to-branch
      
      * turn on mypy for pre-commit
      
      * test
      
      * test
      
      * test
      
      * change no-commit-to-branch to default
      
      * nits
      
      * fixed dependency
      65b8761d
  34. 12 Dec, 2023 1 commit
  35. 29 Nov, 2023 1 commit
  36. 28 Nov, 2023 1 commit
  37. 21 Nov, 2023 1 commit
  38. 20 Nov, 2023 1 commit