1. 05 Mar, 2025 1 commit
  2. 21 Feb, 2025 1 commit
    • Logging (#2203) · 1ba35e62
      Lintang Sutawika authored
      
      
      * changed source of eval_logger
      
      * allow eval_logger to be set from args
      
      * removed verbosity arg from non-main methods
      
      * fix logging
      
      * pre-commit
      
      * set verbosity in eval logger
      
      * replace utils.eval_logger
      
      * fix logging in main
      
      * add logging to docs
      
      * add logging message
      
      * nit
      
      * add logging to docs
      
      * refactor setup_logging to utils
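
The bullets above describe setting the evaluation logger's verbosity from arguments through a shared `setup_logging` helper. A minimal sketch of that idea using only the standard library follows. The signature, the logger name `eval_sketch`, and the string-based verbosity are assumptions made for illustration and are not taken from the harness itself.

```python
import logging


def setup_logging(verbosity: str = "INFO") -> logging.Logger:
    """Hypothetical helper: map a verbosity name to a logging level.

    The logger name and the signature are illustrative assumptions,
    not the harness's actual API.
    """
    level = getattr(logging, verbosity.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown verbosity: {verbosity!r}")
    logger = logging.getLogger("eval_sketch")
    logger.setLevel(level)
    return logger


# Usage: configure once in the entry point, then reuse the logger elsewhere.
eval_logger = setup_logging("debug")
eval_logger.debug("verbosity configured")
```

Resolving the level in one place means other functions no longer need their own verbosity argument, which matches the "removed verbosity arg from non-main methods" bullet.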
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  3. 15 Jan, 2025 1 commit
  4. 28 Nov, 2024 1 commit
  5. 20 Nov, 2024 1 commit
    • Nits (#2500) · 867413f8
      Baber Abbasi authored
      * fix test task
      
      * don't call lm.chat_template each time
  6. 11 Nov, 2024 1 commit
  7. 08 Oct, 2024 1 commit
  8. 07 Oct, 2024 1 commit
  9. 13 Sep, 2024 1 commit
    • Multimodal prototyping (#2243) · fb963f0f
      Lintang Sutawika authored
      
      
      * add WIP hf vlm class
      
      * add doc_to_image
      
      * add mmmu tasks
      
      * fix merge conflicts
      
      * add lintang's changes to hf_vlms.py
      
      * fix doc_to_image
      
      * added yaml_path for config-loading
      
      * revert
      
      * add line to process str type v
      
      * update
      
      * modeling cleanup
      
      * add aggregation for mmmu
      
      * rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)
      
      * implemented doc_to_image
      
      * update doc_to_image to accept list of features
      
      * update functions
      
      * readd image processed
      
      * update args process
      
      * bugfix for repeated images fed to model
      
      * push WIP loglikelihood code
      
      * commit most recent code (generative ; qwen2-vl testing)
      
      * preliminary image_token_id handling
      
      * small mmmu update: some qs have >4 mcqa options
      
      * push updated modeling code
      
      * use processor.apply_chat_template
      
      * add mathvista draft
      
      * nit
      
      * nit
      
      * ensure no footguns in text<>multimodal LM<>task incompatibility
      
      * add notification to readme regarding launch of prototype!
      
      * fix compatibility check
      
      * reorganize mmmu configs
      
      * chat_template=None
      
      * add interleave chat_template
      
      * add condition
      
      * add max_images; interleave=true
      
      * nit
      
      * testmini_mcq
      
      * nit
      
      * pass image string; convert img
      
      * add vllm
      
      * add init
      
      * vlm add multi attr
      
      * fixup
      
      * pass max images to vllm model init
      
      * nit
      
      * encoding to device
      
      * fix HFMultimodalLM.chat_template ?
      
      * add mmmu readme
      
      * remove erroneous prints
      
      * use HFMultimodalLM.chat_template ; restore tasks/__init__.py
      
      * add docstring for replace_placeholders in utils
      
      * fix `replace_placeholders`; set image_string=None
      
      * fix typo
      
      * cleanup + fix merge conflicts
      
      * update MMMU readme
      
      * del mathvista
      
      * add some sample scores
      
      * Update README.md
      
      * add log msg for image_string value
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
      Co-authored-by: Baber Abbasi <baber@eleuther.ai>
      Co-authored-by: Baber <baber@hey.com>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  10. 04 Sep, 2024 1 commit
  11. 28 Aug, 2024 1 commit
  12. 25 Aug, 2024 1 commit
  13. 20 Aug, 2024 1 commit
  14. 10 Jul, 2024 1 commit
  15. 08 Jul, 2024 1 commit
    • Group agg rework (#1741) · 517aadc4
      Lintang Sutawika authored
      
      
      * add group_config arg
      
      * add a group config that allows disabling table for group score and group aggregate in general
      
      * fixed size configuration
      
      * adjust config
      
      * add group config
      
      * adjust mmlu to use group_config
      
      * fixed args input in aggregate_subtask_metrics
      
      * fixed issues related to printing alias of group and updated yaml
      
      * update all mmlu variants to include group_config
      
      * edit format
      
      * modify mmlu tasks
      
      * adjust group to also be a configurable group
      
      * add configurable group
      
      * simplify get_task_list
      
      * adjust group scoring with using ConfigurableGroup
      
      * adjust args
      
      * update mmlu
      
      * update mmlu
      
      * update to work with new group and task configuration
      
      * readd group_agg
      
      * readd files
      
      * move prepare_print_tasks to evaluator_utils
      
      * sort set to False by default, fix predict_only arg
      
      * add version for groups
      
      * reversed task list
      
      * update additional condition when loading a group in a group yaml
      
      * update truthfulqa
      
      * add description regarding tags replacing group
      
      * replace group to tag
      
      * fixed conditional statement
      
      * remove warning
      
      * update loading of task group and newly added tags
      
      * reformat with pre-commit
      
      * fixed info log
      
      * update
      
      * fix bug
      
      * fix bug
      
      * use task id to differentiate tasks
      
      * convert all groups to configurable groups
      
      * use task_id
      
      * reformat
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * revert truthfulqa
      
      * revert mmlu tasks
      
      * new mmlu config
      
      * new group config parameter `tag_to_task`
      
      * Update truthfulqa_mc2.yaml
      
      * reformat
      
      * add _process_group_config
      
      * adjust task_id
      
      * add get_subtask_list function to get proper subtask list
      
      * group config to_dict update
      
      * remove tag check
      
      * update mmlu
      
      * fix config passing issues
      
      * add test yaml
      
      * format fix
      
      * add documentation
      
      * corner case for single tag being called
      
      * fix indentation
      
      * formatting
      
      * update all mmlu variants
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove group_alias
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove version for metadata
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * update mmlu/
      
      * removed " " in make_table
      
      * change how aggregate_metric is loaded
      
      * change how aggregate_metric is loaded
      
      * update aggregate_metric arg
      
      * update format
      
      * update format
      
      * some docs fixes
      
      * add groups for agieval, aexams, aclue
      
      * add more explicit aggregation groups
      
      * add more groupings / tags distinctions
      
      * add more groupings
      
      * more groupings
      
      * add many explicit group configs
      
      * add many explicit group configs
      
      * add more explicit group configs
      
      * add more explicit group configs
      
      * add more error msgs, agg_metric -> agg_metric_list
      
      * some docs updates
      
      * update task_id to be updateable and uses group:task format
      
      * make KMMLU a tag for now
      
      * update docs
      
      * don't duplicate task names
      
      * fix merge conflicts?
      
      * giving this a try
      
      * clean up diff
      
      * switch mmlu variants over to using
      
      * don't use to-be-deprecated group: config field in overview notebook
      
      * Python tasks which subclass ConfigurableTask now run
      
      * update mmlu
      
      * pre-commit format
      
      * fixed sorting for multi-level printing
      
      * move group api to separate file
      
      * fix bbh aggregation filter usage
      
      * track api/group.py
      
      * adjust group and tags loading
      
      * make explicit group configs for leaderboard and other newer tasks
      
      * fix arabicmmlu
      
      * update
      
      * change arabicmmlu template name???
      
      * update group alias
      
      * fix printing bugs
      
      * check table printing is correct ; update tests
      
      * use mmlu_stem to have a group included in print tests
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  16. 25 Jun, 2024 1 commit
  17. 24 Jun, 2024 2 commits
  18. 19 Jun, 2024 1 commit
  19. 03 Jun, 2024 2 commits
  20. 30 May, 2024 1 commit
  21. 26 May, 2024 1 commit
  22. 24 May, 2024 1 commit
  23. 07 May, 2024 1 commit
  24. 06 May, 2024 1 commit
    • Provide ability for custom sampler for ConfigurableTask (#1616) · ae72cebc
      LSinev authored
      * Added fewshot sampling seeds to evaluator.simple_evaluate signature
      
      Way to control seed of fewshot sampling
      may help with #1591
      
      * Added ability for custom sampler for ConfigurableTask
      
      May be set in config like
      ```
      fewshot_config:
        sampler: !function utils.MyFewshotSampler
      ```
      
      * explicitly set fewshot random generator seed for HFLM generate_until_task test
      
      * add backward compatibility for three args seed setup
      
      * save seeds info to logs/reports
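
The config fragment above points `sampler` at a class named `utils.MyFewshotSampler`. A minimal standalone sketch of what such a class might look like follows. The constructor arguments and the `sample` method are assumptions made for illustration. The sketch does not subclass anything from the harness, so a real implementation would need to match whatever interface `ConfigurableTask` expects.

```python
import random
from typing import Any, Dict, List, Optional


class MyFewshotSampler:
    """Hypothetical few-shot sampler driven by an explicit random generator.

    The interface (docs, rnd, sample) is an assumption for illustration only.
    """

    def __init__(
        self, docs: List[Dict[str, Any]], rnd: Optional[random.Random] = None
    ) -> None:
        self.docs = list(docs)
        # A dedicated generator keeps few-shot selection reproducible and
        # independent of the global random state.
        self.rnd = rnd if rnd is not None else random.Random(1234)

    def sample(self, n: int) -> List[Dict[str, Any]]:
        """Return n distinct few-shot documents."""
        return self.rnd.sample(self.docs, n)


# Usage: two samplers built with the same seed pick the same examples.
docs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
first = MyFewshotSampler(docs, rnd=random.Random(0)).sample(3)
second = MyFewshotSampler(docs, rnd=random.Random(0)).sample(3)
```

Passing the generator in, rather than seeding inside the class, is one way to let a caller control the few-shot seed, which is the goal the first bullet of this commit describes.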
  25. 05 May, 2024 1 commit
  26. 03 May, 2024 1 commit
    • evaluation tracker implementation (#1766) · 59cf408a
      KonradSzafer authored
      * evaluation tracker implementation
      
      * OVModelForCausalLM test fix
      
      * typo fix
      
      * moved methods args
      
      * multiple args in one flag
      
      * loggers moved to dedicated dir
      
      * improved filename sanitization
  27. 22 Mar, 2024 1 commit
  28. 18 Mar, 2024 1 commit
    • Cleanup for v0.4.2 release (#1573) · 5627e819
      Hailey Schoelkopf authored
      * Update interface.md
      
      * fix: make caching reqs always work with accelerate launch
      
      * remove stale task migration checklist
      
      * remove deprecation warnings
      
      * make informative TypeErrors for get_task_dict
      
      * bump version metadata
      
      * fix num_fewshot printing bug
      
      * add fewshot value to cache key
  29. 17 Mar, 2024 1 commit
  30. 06 Mar, 2024 1 commit
  31. 04 Mar, 2024 1 commit
  32. 27 Feb, 2024 1 commit
    • Refactor `evaluator.evaluate` (#1441) · 5ccd65d4
      Baber Abbasi authored
      
      
      * change `all_gather` to `gather`
      
      * add TaskOutput utility class
      
      * Add FilterResults class and refactor task handling.
      
      * Rename `key` to `filter_key` for clarity
      
      * Add `print_writeout` function in utils.py
      
      * Add function to calculate limit size.
      
      * Add doc_iterator method to Task class
      
      * Refactor `doc_iterator` and cleanup in Task class
      
      * remove superfluous bits
      
      * change `all_gather` to `gather`
      
      * bugfix
      
      * bugfix
      
      * fix `gather`
      
      * Refactor `gather` loop
      
      * Refactor aggregate metrics calculation
      
      * Refactor and simplify aggregate metrics calculation
      Removed unused code
      
      * Simplify metrics calculation and remove unused code.
      
      * simplify the metrics calculation in `utils.py` and `evaluator.py`.
      
      * Fix group metric
      
      * change evaluate to hf_evaluate
      
      * change evaluate to hf_evaluate
      
      * add docs
      
      * add docs
      
      * nits
      
      * make isslice keyword only
      
      * nit
      
      * add todo
      
      * nit
      
      * nit
      
      * nit: swap order samples_metrics tuple
      
      * move instance sorting outside loop
      
      * nit
      
      * nit
      
      * Add __repr__ for ConfigurableTask
      
      * nit
      
      * nit
      
      * Revert "nit"
      
      This reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f.
      
      * fix some logging
      
      * nit
      
      * fix `predict_only` bug. thanks to `@LSinev`!
      
      * change `print_tasks` to `prepare_print_tasks`
      
      * nits
      
      * move eval utils
      
      * move eval utils
      
      * nit
      
      * add comment
      
      * added tqdm descriptions
      
      * Update lm_eval/evaluator_utils.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * fix mgsm bug
      
      * nit
      
      * fix `build_all_requests`
      
      * pre-commit
      
      * add ceil to limit
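
Two bullets in this commit mention a function that calculates the limit size and a later change to round it up. A small sketch of one way such a calculation could work follows. It assumes a `limit` below 1.0 is treated as a fraction of the documents and anything else as an absolute count. The function name and that convention are assumptions, not confirmed by the commit text.

```python
import math
from typing import Optional


def limit_size(num_docs: int, limit: Optional[float]) -> Optional[int]:
    """Hypothetical helper: turn a limit into a number of documents.

    Assumes limit < 1.0 is a fraction of num_docs; otherwise a count.
    """
    if limit is None:
        return None  # no limit requested
    if limit < 1.0:
        # Round up so that a small fraction never yields zero documents.
        return int(math.ceil(num_docs * limit))
    return int(limit)


# Usage: a quarter of ten documents rounds up to three.
quarter = limit_size(10, 0.25)
```

Rounding up rather than truncating avoids an empty evaluation set when the fraction is small relative to the dataset size.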
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  33. 26 Feb, 2024 1 commit
    • Create a means for caching task registration and request building. Ad… (#1372) · 1e6c9272
      Aaron V authored
      
      
      * Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().
      
      * Remove extra S in cache path in caching module
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.
      
      * Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.
      
      * Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.
      
      * Remove line from gitignore, add to cli for caching datasets.
      
      * Add hashing suffix to .pickles. Update test script typo.
      
      * Favor isinstance() over type() in evaluator.py
      
      * Add tests for caching, gets tests working, remove unneeded arg from build_all_requests().
      
      * Update arg description to simple_evaluate.
      
      * Update pyproject.toml
      
      * Fix typehint
      
      * Remove the use of random() for creating default cache pickle hash.
      
      * Check that cache dir exists before clearing it in request cache tests.
      
      * Fix linting problems.
      
      * Fix additional formatting errors.
      
      * Remove trailing whitespace.
      
      * Add new line to the end of .gitignore.
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  34. 24 Feb, 2024 1 commit
    • Add environment and transformers version logging in results dump (#1464) · f78e2da4
      LSinev authored
      * Save git_hash to results even if git is not available to call as subprocess
      
      * Store more info about environment and transformers version in results to help researchers track inconsistencies
      
      * moved added logging to logging_utils
      
      * moved get_git_commit_hash to logging_utils.py
      
      * moved add_env_info inside evaluator
  35. 22 Feb, 2024 1 commit
  36. 12 Feb, 2024 1 commit
  37. 11 Feb, 2024 1 commit
    • Evaluate (#1385) · 1ff84897
      Baber Abbasi authored
      * un-exclude `evaluate.py` from linting
      
      * readability
      
      * readability
      
      * add task name to build info message
      
      * fix link
      
      * nit
      
      * add functions for var and mean pooling
      
      * add functions for var and mean pooling
      
      * metadata compatibility with task
      
      * rename `override_config` to `set_config` and move to `Task`
      
      * add unit test
      
      * nit
      
      * nit
      
      * bugfix
      
      * nit
      
      * nit
      
      * nit
      
      * add docstrings
      
      * fix metadata-fewshot
      
      * revert metric refactor
      
      * nit
      
      * type checking
      
      * type hints
      
      * type hints
      
      * move `override_metric` to `Task`
      
      * change metadata
      
      * change name
      
      * pre-commit
      
      * rename
      
      * remove
      
      * remove
      
      * `override_metric` backwards compatible with `Task`
      
      * type hints
      
      * use generic
      
      * type hint
  38. 09 Feb, 2024 1 commit