1. 19 Aug, 2024 3 commits
    • Add TMLU Benchmark Dataset (#2093) · ca3d86d6
      Yen-Ting Lin authored
      
      
      * add taiwan truthful qa
      
      * add tmlu
      
      * Add .gitignore entries for evals/ and harness_eval_main_log.txt, and add harness_eval.slurm script
      
      * add pega eval and legal eval
      
      * add ccp eval
      
      * Update .gitignore and harness_eval.slurm
      
      * Add trust_remote_code and wandb_args to harness_eval.slurm, and add run_all.sh script
      
      * Add Pega MMLU task and configuration files
      
      * Add new models and update parameters in run_all.sh
      
      * Add UMTCEval tasks and configurations
      
      * Update dataset paths and output path
      
      * Update .gitignore and harness_eval.slurm, and modify _generate_configs.py
      
      * Update SLURM script and add new models
      
      * clean for pr
      
      * Update lm_eval/tasks/tmlu/default/tmlu.yaml
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
      
      * adjust tag name
      
      * removed group alias from tasks
      
      * format
      
      ---------
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
      Co-authored-by: Yen-Ting Adam, Lin <r08944064@csie.ntu.edu.tw>
    • Uminosachi authored
      86edeffa
    • Lingoly README update (#2228) · f81b62bf
      am-bean authored
      * Setting up lingoly task
      
      * Testing yaml changes to debug
      
      * Adding pre-commit hooks
      
      * Functional LingOly benchmark
      
      * Renaming files and adding grouping
      
      * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
      
      * Adding LingOly to the README file
  2. 16 Aug, 2024 1 commit
  3. 15 Aug, 2024 2 commits
    • New task: Lingoly (#2198) · 8b41f925
      am-bean authored
      * Setting up lingoly task
      
      * Testing yaml changes to debug
      
      * Adding pre-commit hooks
      
      * Functional LingOly benchmark
      
      * Renaming files and adding grouping
      
      * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
    • Update citation in README.md (#2083) · cbdc3539
      Anton Polishko authored
      Bumped citation to v0.4.3
  4. 10 Aug, 2024 1 commit
  5. 09 Aug, 2024 1 commit
    • keep new line for task description (#2116) · 8ad598df
      Jungwhan Kim authored
      
      
      * add keep trailing newline
      
      * apply ruff-format
      
      * add prompt unit test
      
      * increment the version of tasks that have description with whitespace
      
      * remove white spaces of leaderboard bbh
      
      * update MMLU expected versions in output
      
      * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  6. 07 Aug, 2024 1 commit
  7. 05 Aug, 2024 8 commits
  8. 04 Aug, 2024 2 commits
  9. 01 Aug, 2024 3 commits
  10. 29 Jul, 2024 1 commit
    • bugfix and docs for API (#2139) · b70af4f5
      Baber Abbasi authored
      
      
      * encoding bugfix
      
      * encoding bugfix
      
      * overload loglikelihood rather than loglikelihood_tokens
      
      * add custom tokenizer
      
      * add docs
      
      * Update API_guide.md
      
      fix link; add note
      
      * Update API_guide.md
      
      typo
      
      * pre-commit
      
      * add link in readme
      
      * nit
      
      * nit
      
      * nit
      
      * Update API_guide.md
      
      nits
      
      * Update API_guide.md
      
      * Update API_guide.md
      
      * Update API_guide.md
      
      * Update API_guide.md
      
      * Update README.md
      
      * Update docs/API_guide.md
      
      * Update docs/API_guide.md
      
      * Update API_guide.md
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  11. 22 Jul, 2024 1 commit
    • Refactor API models (#2008) · 42dc2448
      Baber Abbasi authored
      
      
      * refactor pad_token handling to fn
      
      * fix docs
      
      * add pad_token_handling to vllm
      
      * start on API superclass
      
      * don't detokenize the returned logits
      
      * streamline vllm tokenizer
      
      * add type hint
      
      * pre-commit
      
      * seems to be in working order
      
      * add model to init
      
      * refactor api models
      
      * nit
      
      * cleanup
      
      * add pbar
      
      * fix type hints
      
      * change optional dependencies
      
      * json encode chat template
      
      * add type hints
      
      * deal with different prompt input requirements
      
      * nits
      
      * fix
      
      * cache inside async
      
      * fix
      
      * fix
      
      * nits
      
      * nits
      
      * nits
      
      * nit
      
      * fixup
      
      * fixup
      
      * nit
      
      * add dummy retry
      
      * add dummy retry
      
      * handle imports; skip failing test
      
      * add type hint
      
      * add tests
      
      * add dependency to tests
      
      * add package names to exception
      
      * nit
      
      * docs; type hints
      
      * handle api key
      
      * nit
      
      * tokenizer bug
      
      * fix tokenizer
      
      * nit
      
      * nit
      
      * add better error messages
      
      * nit
      
      * remove decorator
      
      * CI: install api dep
      
      * revert evaluator.py
      
      * consolidate
      
      * consolidate
      
      * nits
      
      * nit
      
      * fix typealias
      
      * nit
      
      * nit
      
      * nit
      
      * Update lm_eval/models/api_models.py
      
      typo
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/models/openai_completions.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/models/anthropic_llms.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/models/api_models.py
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * fix typo
      
      * add news section
      
      * add info for API
      
      * pre-commit
      
      * typo
      
      * fix bug: unpack loglikelihood requests
      
      * fix bug: shared gen_kwargs mutated
      
      * nit: handle copy properly
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * Update api_models.py
      
      * Update README.md
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  12. 21 Jul, 2024 1 commit
  13. 20 Jul, 2024 1 commit
  14. 18 Jul, 2024 2 commits
  15. 17 Jul, 2024 1 commit
  16. 15 Jul, 2024 3 commits
  17. 14 Jul, 2024 1 commit
  18. 13 Jul, 2024 1 commit
  19. 12 Jul, 2024 4 commits
    • Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54
      Jess authored
      
      
      * add afrixnli to task
      
      * add chat completion
      
      * remove chat completion -untested
      
      * afrimmlu added
      
      * afrimmlu folder update
      
      * afrimmlu folder update
      
      * updated prompt
      
      * remove print
      
      * add afrimgsm -direct
      
      * add squad metric
      
      * fix bash script
      
      * remove direct util, update common yaml
      
      * remove print
      
      * add few shot. metric fixes
      
      * fix direct path, add bash script for gpt models
      
      * added translate test
      
      * update afrixnli tasks
      
      * update afrixnli tasks
      
      * update metrics for afrixnli
      
      * prompt translations fix
      
      * prompt translations fix
      
      * filter and metric fix -mgsm
      
      * remove squad metric
      
      * remove squad metric
      
      * add f1 score to mgsm
      
      * add f1 score to mgsm
      
      * update native-direct with lin
      
      * change f1 function
      
      * add lin to utils
      
      * add utils
      
      * remove test limit
      
      * remove test configs
      
      * add swahili to mmlu
      
      * change eng to ewe in ewe yaml mmlu
      
      * add squad metric to mgsm, remove whitespace filter
      
      * added translate test
      
      * added afrixnli_translate
      
      * fix exact match ValueError
      
      * fix exact match ValueError
      
      * restructure mmlu folder
      
      * spacing
      
      * remove afrimmlu_translate folder
      
      * add utility
      
      * format task name, clean ups
      
      * modified mgsm
      
      * update on afrimgsm
      
      * update on afrimgsm
      
      * removed utils
      
      * other mgsm varieties
      
      * other mgsm varieties
      
      * adding translate direct
      
      * Update translate_direct_yaml
      
      * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
      
      * edit for open models
      
      * Update translate_direct_yaml
      
      * add verbalizer for xnli
      
      * change xnli from multiple choice to generate
      
      * add manual accuracy scores
      
      * revert xnli to multiple choice
      
      * change afrimgsm utils
      
      * revert xnli to multiple_choice
      
      * cleanups and readmes
      
      * remove openai fixes and unused regex
      
      * pr review changes
      
      * revert metrics.py, task.py and extraction.py to main version
      
      ---------
      Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
      Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
    • Add new dataset MMLU-SR tasks (#2032) · d5f39bf8
      SuperCat authored
      
      
      * add mmlusr tasks
      
      * renamed all tasks names in mmlusr
      
      * edit format and readme
      
      * added mmlu_sr
      
      * mmlu_sr -> mmlusr
      
      * update
      
      ---------
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
    • Update default.yaml (#2092) · cdd954f9
      Wonung Kim authored
    • Hailey Schoelkopf authored
      eeec6dae
  20. 11 Jul, 2024 1 commit
    • Prettify lm_eval --tasks list (#1929) · a0243d54
      anthony-dipofi authored
      
      
      * add  and ; move task list newline logic to new TaskManager.list_all_tasks() method
      
      * format table list into markdown table; add config location column
      
      * add Output Type column
      
      * add logic for printing table of tags separately
      
      * merge with main and fix conflicts; update docstrings
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  21. 10 Jul, 2024 1 commit