1. 26 Feb, 2024 6 commits
  2. 24 Feb, 2024 1 commit
    • Add environment and transformers version logging in results dump (#1464) · f78e2da4
      LSinev authored
      * Save git_hash to results even if git is not available to call as subprocess
      
      * Store more info about environment and transformers version in results to help researchers track inconsistencies
      
      * moved added logging to logging_utils
      
      * moved get_git_commit_hash to logging_utils.py
      
      * moved add_env_info inside evaluator
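The first bullet of #1464 describes recording git_hash even when git cannot be run as a subprocess. Below is a minimal sketch of one way to do that. It assumes the fallback reads `.git/HEAD` (and the branch ref it points to) directly. `get_git_commit_hash` borrows the helper name mentioned in the commit, but this body and the hypothetical `read_commit_from_git_dir` are illustrations, not the code merged into logging_utils.py.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def read_commit_from_git_dir(repo_path: Path) -> Optional[str]:
    """Resolve HEAD by reading files under .git, without invoking the git binary."""
    git_dir = repo_path / ".git"
    head_file = git_dir / "HEAD"
    if not head_file.is_file():
        return None
    head = head_file.read_text().strip()
    if head.startswith("ref: "):
        # Symbolic ref such as "ref: refs/heads/main": read the branch file.
        ref_file = git_dir / head[len("ref: "):]
        if not ref_file.is_file():
            return None  # refs stored only in packed-refs are not handled here
        return ref_file.read_text().strip()
    return head  # a detached HEAD stores the commit hash directly


def get_git_commit_hash(repo_path: Optional[Path] = None) -> Optional[str]:
    """Ask git first; fall back to reading .git/ if the subprocess call fails."""
    repo_path = repo_path or Path.cwd()
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=repo_path, stderr=subprocess.DEVNULL
        )
        return out.decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # FileNotFoundError: git is not installed; CalledProcessError: git failed.
        return read_commit_from_git_dir(repo_path)


if __name__ == "__main__":
    # Build a fake repository layout in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / ".git" / "refs" / "heads").mkdir(parents=True)
        (repo / ".git" / "HEAD").write_text("ref: refs/heads/main\n")
        (repo / ".git" / "refs" / "heads" / "main").write_text("f78e2da4\n")
        print(read_commit_from_git_dir(repo))  # f78e2da4
```

Packed refs and worktrees are ignored to keep the sketch short; a missing ref yields None, so the results dump can still be written without a hash.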
  3. 23 Feb, 2024 2 commits
  4. 22 Feb, 2024 5 commits
    • Fixed generation args issue affecting OpenAI completion model (#1458) · 75ac1f47
      Amine Elhattami authored
      * Fixed generation args issue affecting OpenAI completion model
      
      * Fixed hf unit test; removed pop attributes in OpenAI completion.
      
      * fix format
      
      * fix format
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
    • feat: Add Weights and Biases support (#1339) · 2683fbbb
      Ayush Thakur authored
      * add wandb as extra dependency
      
      * wandb metrics logging
      
      * refactor
      
      * log samples as tables
      
      * fix linter
      
      * refactor: put in a class
      
      * change dir
      
      * add panels
      
      * log eval as table
      
      * improve tables logging
      
      * improve reports logging
      
      * precommit run
      
      * ruff check
      
      * handle importing reports api gracefully
      
      * ruff
      
      * compare results
      
      * minor pre-commit fixes
      
      * build comparison report
      
      * ruff check
      
      * log results as artifacts
      
      * remove comparison script
      
      * update dependency
      
      * type annotate and docstring
      
      * add example
      
      * update readme
      
      * fix typo
      
      * teardown
      
      * handle outside wandb run
      
      * gracefully fail reports creation
      
      * precommit checks
      
      * add report url to summary
      
      * use wandb printer for better URL stdout
      
      * fix ruff
      
      * handle N/A and groups
      
      * fix eval table
      
      * remove unused var
      
      * update wandb version req + disable reports stdout
      
      * remove reports feature to TODO
      
      * add label to multi-choice question data
      
      * log model predictions
      
      * lints
      
      * loglikelihood_rolling
      
      * log eval result for groups
      
      * log tables by group for better handling
      
      * precommit
      
      * choices column for multi-choice
      
      * gracefully fail wandb
      
      * remove reports feature
      
      * track system metrics + total eval time + stdout
      
      ---------
      Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
    • PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440) · a72babbf
      Lei Chen authored
      * fix the issue #1391, wrong contexts in mgsm tasks
      
      * fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default)
      
      * regenerate all task yaml files
      - rename files so that each file name matches its task name
      - task and file names follow a consistent scheme, mgsm_(mode)_(lang), for three modes: direct, en_cot, and native_cot
      
      * English CoTs should have a space as target_delimiter
      
      * Update utils.py
      
      * Apply suggestions from code review
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
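The naming rule in #1440 (each task name and its YAML file name both follow mgsm_(mode)_(lang)) can be checked mechanically. The sketch below assumes the eleven MGSM language codes listed in `LANGS` and one `.yaml` file per task; the helper names are hypothetical, and it only illustrates the convention rather than reproducing the repository's generation script.

```python
from itertools import product
from pathlib import PurePath
from typing import List

# Assumption: the eleven MGSM language codes (English plus ten translations).
LANGS = ["bn", "de", "en", "es", "fr", "ja", "ru", "sw", "te", "th", "zh"]
# The three prompting modes named in the commit message.
MODES = ["direct", "en_cot", "native_cot"]


def task_name(mode: str, lang: str) -> str:
    """Task name following the mgsm_(mode)_(lang) scheme."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if lang not in LANGS:
        raise ValueError(f"unknown language: {lang!r}")
    return f"mgsm_{mode}_{lang}"


def yaml_filename(mode: str, lang: str) -> str:
    """YAML file name whose stem is exactly the task name."""
    return task_name(mode, lang) + ".yaml"


def file_matches_task(filename: str, mode: str, lang: str) -> bool:
    """Check the invariant the commit enforces: file name stem == task name."""
    return PurePath(filename).stem == task_name(mode, lang)


def all_task_names() -> List[str]:
    """Every (mode, lang) combination, grouped by mode."""
    return [task_name(mode, lang) for mode, lang in product(MODES, LANGS)]


if __name__ == "__main__":
    print(len(all_task_names()))  # 33
    print(yaml_filename("en_cot", "de"))  # mgsm_en_cot_de.yaml
```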
    • Log which subtasks were called with which groups (#1456) · 00dc9960
      Hailey Schoelkopf authored
      * log group membership
      
      * no stray prints
      
      * Update evaluator.py
    • Add TemplateLM boilerplate LM class (#1279) · ba5cdf0f
      Anjor Kanekar authored
      * loglikelihood refactor using template lm
      
      * linter
      
      * fix whitespace in target + prompt for CoT gsm8k (#1275)
      
      * Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (#1261)
      
      * Make parallelize=True distinction clearer in documentation.
      
      * run linter
      
      * Allow parameter edits for registered tasks when listed in a benchmark (#1273)
      
      * benchmark yamls allow minor edits of already registered tasks
      
      * add documentation
      
      * removed print
      
      * Fix data-parallel evaluation with quantized models (#1270)
      
      * add WIP device_map overrides
      
      * update handling outside of accelerate launcher
      
      * change .to(device) log to debug level
      
      * run linter
      
      * Rework documentation for explaining local dataset (#1284)
      
      * rework documentation for explaining local dataset
      
      * fix typo
      
      * Update new_task_guide.md
      
      * Re-add citation
      
      It looks like Google Scholar has [already noticed](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9...
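#1279 moves shared loglikelihood plumbing into a TemplateLM base class so that each backend only implements the model-specific pieces. A minimal sketch of that template-method pattern follows, assuming a toy one-token-per-character tokenizer. The names `TemplateLMSketch`, `score_tokens`, and `CharLM` are hypothetical and do not mirror lm-eval's actual TemplateLM interface; moving trailing context whitespace onto the continuation is shown as one common design choice, not something the commit message states.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class TemplateLMSketch(ABC):
    """Base class: shared request handling lives here; backends fill in the hooks."""

    @abstractmethod
    def tok_encode(self, text: str) -> List[int]:
        """Backend-specific tokenizer."""

    @abstractmethod
    def score_tokens(
        self, context: List[int], continuation: List[int]
    ) -> Tuple[float, bool]:
        """Backend-specific scoring: (log-probability of continuation, is_greedy)."""

    def encode_pair(self, context: str, continuation: str) -> Tuple[List[int], List[int]]:
        # Move trailing whitespace from the context onto the continuation so the
        # boundary whitespace is scored as part of the continuation (design assumption).
        n_spaces = len(context) - len(context.rstrip())
        if n_spaces > 0:
            continuation = context[-n_spaces:] + continuation
            context = context[:-n_spaces]
        whole = self.tok_encode(context + continuation)
        context_enc = self.tok_encode(context)
        return context_enc, whole[len(context_enc):]

    def loglikelihood(self, requests: List[Tuple[str, str]]) -> List[Tuple[float, bool]]:
        # The "template" step: identical for every backend.
        return [self.score_tokens(*self.encode_pair(ctx, cont)) for ctx, cont in requests]


class CharLM(TemplateLMSketch):
    """Toy backend: one token per character; each continuation token scores -1.0."""

    def tok_encode(self, text: str) -> List[int]:
        return [ord(c) for c in text]

    def score_tokens(
        self, context: List[int], continuation: List[int]
    ) -> Tuple[float, bool]:
        return -1.0 * len(continuation), True
```

For example, `CharLM().loglikelihood([("Q: hi ", "yes")])` scores the four tokens of " yes" and returns `[(-4.0, True)]`. A real backend would replace only `tok_encode` and `score_tokens`.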
  5. 21 Feb, 2024 1 commit
    • Added KMMLU evaluation method and changed ReadMe (#1447) · c26a6ac7
      Hanwool Albert Lee authored
      * update kmmlu default formatting
      
      * Update _default_kmmlu_yaml
      
      * Delete lm_eval/tasks/kmmlu/utils.py
      
      * new tasks implemented
      
      * add direct tasks
      
      * update direct evaluate
      
      * update direct eval
      
      * add cot sample
      
      * update cot
      
      * add cot
      
      * Update _cot_kmmlu_yaml
      
      * add kmmlu90
      
      * Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml
      
      * Create kmmlu90.yaml
      
      * Update _cot_kmmlu_yaml
      
      * add direct
      
      * Update _cot_kmmlu_yaml
      
      * Update and rename kmmlu90.yaml to kmmlu90_cot.yaml
      
      * Update kmmlu90_direct.yaml
      
      * add kmmlu hard
      
      * Update _cot_kmmlu_yaml
      
      * Update _cot_kmmlu_yaml
      
      * update cot
      
      * update cot
      
      * erase typo
      
      * Update _cot_kmmlu_yaml
      
      * update cot
      
      * Rename dataset to match k-mmlu-hard
      
      * removed kmmlu90
      
      * fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README
      
      * applied pre-commit before pull requests
      
      * rename datasets and add notes
      
      * Remove DS_Store cache
      
      * Update lm_eval/tasks/kmmlu/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Change citations and reflect reviews on version
      
      * Added kmmlu_hard and fixed other errors
      
      * fixing minor errors
      
      * remove duplicated
      
      * Rename files
      
      * try ".index"
      
      * minor fix
      
      * minor fix again
      
      * fix revert.
      
      * minor fix; thanks to Hailey
      
      ---------
      Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  6. 20 Feb, 2024 3 commits
  7. 19 Feb, 2024 2 commits
  8. 18 Feb, 2024 1 commit
  9. 15 Feb, 2024 1 commit
  10. 14 Feb, 2024 1 commit
  11. 13 Feb, 2024 1 commit
  12. 12 Feb, 2024 2 commits
  13. 11 Feb, 2024 3 commits
    • Add multilingual TruthfulQA task (#1420) · 7397b965
      Uanu authored
    • Add multilingual ARC task (#1419) · 0256c682
      Uanu authored
    • Evaluate (#1385) · 1ff84897
      Baber Abbasi authored
      * un-exclude `evaluate.py` from linting
      
      * readability
      
      * readability
      
      * add task name to build info message
      
      * fix link
      
      * nit
      
      * add functions for var and mean pooling
      
      * add functions for var and mean pooling
      
      * metadata compatibility with task
      
      * rename `override_config` to `set_config` and move to `Task`
      
      * add unit test
      
      * nit
      
      * nit
      
      * bugfix
      
      * nit
      
      * nit
      
      * nit
      
      * add docstrings
      
      * fix metadata-fewshot
      
      * revert metric refactor
      
      * nit
      
      * type checking
      
      * type hints
      
      * type hints
      
      * move `override_metric` to `Task`
      
      * change metadata
      
      * change name
      
      * pre-commit
      
      * rename
      
      * remove
      
      * remove
      
      * `override_metric` backwards compatible with `Task`
      
      * type hints
      
      * use generic
      
      * type hint
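The Evaluate refactor (#1385) adds functions for mean and variance pooling, which is what a group score built from several subtasks needs. The commit does not spell out the formulas, so the sketch below assumes the textbook versions: a size-weighted mean, and the pooled sample variance s_p^2 = sum((n_i - 1) * s_i^2) / sum(n_i - 1). All function names are hypothetical.

```python
import math
from typing import Sequence


def _check(values: Sequence[float], sizes: Sequence[int]) -> None:
    if not sizes or len(values) != len(sizes):
        raise ValueError("inputs must be non-empty and of equal length")


def weighted_mean(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Mean of per-subtask scores, weighted by each subtask's sample count."""
    _check(scores, sizes)
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


def pooled_variance(variances: Sequence[float], sizes: Sequence[int]) -> float:
    """Pooled sample variance: sum((n_i - 1) * s_i^2) / sum(n_i - 1).

    Assumes the subtasks share a common underlying variance; differences
    between subtask means are not added in.
    """
    _check(variances, sizes)
    dof = sum(n - 1 for n in sizes)
    if dof <= 0:
        raise ValueError("need more than one sample in total")
    return sum((n - 1) * v for v, n in zip(variances, sizes)) / dof


def pooled_stderr(variances: Sequence[float], sizes: Sequence[int]) -> float:
    """Standard error of the group mean: sqrt(pooled variance / total samples)."""
    return math.sqrt(pooled_variance(variances, sizes) / sum(sizes))
```

For two subtasks of three samples each with variances 1.0 and 4.0, `pooled_variance([1.0, 4.0], [3, 3])` gives (2 * 1.0 + 2 * 4.0) / 4 = 2.5. Weighting by size means a large subtask dominates the group score; an unweighted mean of subtask scores is the other common choice.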
  14. 10 Feb, 2024 2 commits
  15. 09 Feb, 2024 1 commit
  16. 07 Feb, 2024 1 commit
  17. 06 Feb, 2024 4 commits
  18. 05 Feb, 2024 1 commit
  19. 02 Feb, 2024 2 commits