1. 05 Sep, 2024 1 commit
  2. 28 Aug, 2024 1 commit
  3. 25 Aug, 2024 1 commit
  4. 23 Aug, 2024 3 commits
  5. 22 Aug, 2024 2 commits
  6. 20 Aug, 2024 3 commits
  7. 19 Aug, 2024 3 commits
    • Add TMLU Benchmark Dataset (#2093) · ca3d86d6
      Yen-Ting Lin authored
      
      
      * add taiwan truthful qa
      
      * add tmlu
      
      * Add .gitignore entries for evals/ and harness_eval_main_log.txt, and add harness_eval.slurm script
      
      * add pega eval and legal eval
      
      * add ccp eval
      
      * Update .gitignore and harness_eval.slurm
      
      * Add trust_remote_code and wandb_args to harness_eval.slurm, and add run_all.sh script
      
      * Add Pega MMLU task and configuration files
      
      * Add new models and update parameters in run_all.sh
      
      * Add UMTCEval tasks and configurations
      
      * Update dataset paths and output path
      
      * Update .gitignore and harness_eval.slurm, and modify _generate_configs.py
      
      * Update SLURM script and add new models
      
      * clean for pr
      
      * Update lm_eval/tasks/tmlu/default/tmlu.yaml
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
      
      * adjust tag name
      
      * removed group alias from tasks
      
      * format
      
      ---------
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
      Co-authored-by: Yen-Ting Adam, Lin <r08944064@csie.ntu.edu.tw>
    • 86edeffa
      Uminosachi authored
    • Lingoly README update (#2228) · f81b62bf
      am-bean authored
      * Setting up lingoly task
      
      * Testing yaml changes to debug
      
      * Adding pre-commit hooks
      
      * Functional LingOly benchmark
      
      * Renaming files and adding grouping
      
      * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
      
      * Adding LingOly to the README file
  8. 16 Aug, 2024 1 commit
  9. 15 Aug, 2024 1 commit
    • New task: Lingoly (#2198) · 8b41f925
      am-bean authored
      * Setting up lingoly task
      
      * Testing yaml changes to debug
      
      * Adding pre-commit hooks
      
      * Functional LingOly benchmark
      
      * Renaming files and adding grouping
      
      * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
  10. 10 Aug, 2024 1 commit
  11. 09 Aug, 2024 1 commit
    • keep new line for task description (#2116) · 8ad598df
      Jungwhan Kim authored
      
      
      * add keep trailing newline
      
      * apply ruff-format
      
      * add prompt unit test
      
      * increment the version of tasks that have description with whitespace
      
      * remove white spaces of leaderboard bbh
      
      * update MMLU expected versions in output
      
      * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  12. 07 Aug, 2024 1 commit
  13. 05 Aug, 2024 4 commits
    • added gsm_plus (#2103) · d8506db0
      Yu Shi Jie authored
      
      
      * added gsm_plus
      
      * formatted dataset to have train-test-splits
      
      * README.md for gsm-plus
      
      * Update README.md
      
      * GSM-Plus: added gsm_plus_mini
      
      * GSM-Plus: attribution to original dataset
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      ---------
      Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
    • Mmlu Pro (#1961) · 69d56f45
      Yu Shi Jie authored
      
      
      * initialized mmlu_pro task
      
      * added generative mmlu-pro
      
      * added cot fewshot for mmlu-pro
      
      * Initial commit
      
      * updated mmlu-pro to take on 3 splits: test, val, dev
      
      * mmlu-pro: added continuation and flan_cot_zeroshot
      
      * added README.md for mmlu_pro
      
      * removed
      
      * update files
      
      * moved files out, and removed unused versions
      
      * updated
      
      * mmlu_pro:
      
      -changed task 'other' to 'miscellaneous'
      there is already a group named 'other'
      task and group with the same alias (e.g. mmlu_pro_other_generative) throws an error
      
      -fixed yaml backslash escape for fewshot cot
      
      * changed choices -> options in yaml config to fit dataset schema
      
      * ONLY FOR DEFAULT: fixed yaml file to use variable number of choices
      
      * mmlu-pro: fixed doc_to_text/choice/target configs for all variants
      
      * mmlu-pro: minor fixes
      
      * mmlu-pro/default: aligned with mmlu updates
      
      * mmlu-pro: update yaml content in line with mmlu
      
      * mmlu-pro: fixed mislabelling of task (math->chemistry)
      
      * mmlu-pro: fixed yaml formatting
      
      * add custom fewshot doc_to_text, target, and choice
      
      * add process for each subtask
      
      * add process for each subtask
      
      * pre-commit
      
      * pre-commit
      
      * format
      
      * resolved left out merge
      
      * deleted folders + updated readme
      
      * Update evaluator.py
      
      * Update evaluator.py
      
      ---------
      Co-authored-by: Yu Shi Jie <shijie@tensorplex.ai>
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
      Co-authored-by: root <root@455bdd73-01.cloud.together.ai>
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
    • c2168869
      Hailey Schoelkopf authored
    • 54c9a979
      Amir Hossein Kargaran authored
  14. 04 Aug, 2024 2 commits
  15. 01 Aug, 2024 1 commit
  16. 20 Jul, 2024 1 commit
  17. 18 Jul, 2024 1 commit
  18. 17 Jul, 2024 1 commit
  19. 15 Jul, 2024 1 commit
  20. 14 Jul, 2024 1 commit
  21. 12 Jul, 2024 4 commits
    • Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54
      Jess authored
      
      
      * add afrixnli to task
      
      * add chat completion
      
      * remove chat completion -untested
      
      * afrimmlu added
      
      * afrimmlu folder update
      
      * afrimmlu folder update
      
      * updated prompt
      
      * remove print
      
      * add afrimgsm -direct
      
      * add squad metric
      
      * fix bash script
      
      * remove direct util, update common yaml
      
      * remove print
      
      * add few shot, metric fixes
      
      * fix direct path, add bash script for gpt models
      
      * added translate test
      
      * update afrixnli tasks
      
      * update afrixnli tasks
      
      * update metrics for afrixnli
      
      * prompt translations fix
      
      * prompt translations fix
      
      * filter and metric fix -mgsm
      
      * remove squad metric
      
      * remove squad metric
      
      * add f1 score to mgsm
      
      * add f1 score to mgsm
      
      * update native-direct with lin
      
      * change f1 function
      
      * add lin to utils
      
      * add utils
      
      * remove test limit
      
      * remove test configs
      
      * add swahili to mmlu
      
      * change eng to ewe in ewe yaml mmlu
      
      * add squad metric to mgsm, remove whitespace filter
      
      * added translate test
      
      * added afrixnli_translate
      
      * fix exact match valueError
      
      * fix exact match valueError
      
      * restructure mmlu folder
      
      * spacing
      
      * remove afrimmlu_translate folder
      
      * add utility
      
      * format task name, clean ups
      
      * modified mgsm
      
      * update on afrimgsm
      
      * update on afrimgsm
      
      * removed utils
      
      * other mgsm varieties
      
      * other mgsm varieties
      
      * adding translate direct
      
      * Update translate_direct_yaml
      
      * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
      
      * edit for open models
      
      * Update translate_direct_yaml
      
      * add verbalizer for xnli
      
      * change xnli from multiple choice to generate
      
      * add manual accuracy scores
      
      * revert xnli to multiple choice
      
      * change afrimgsm utils
      
      * revert xnli to multiple_choice
      
      * cleanups and readmes
      
      * remove openai fixes and unused regex
      
      * pr review changes
      
      * revert metrics.py, task.py and extraction.py to main version
      
      ---------
      Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
      Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
    • Add new dataset MMLU-SR tasks (#2032) · d5f39bf8
      SuperCat authored
      
      
      * add mmlusr tasks
      
      * renamed all tasks names in mmlusr
      
      * edit format and readme
      
      * added mmlu_sr
      
      * mmlu_sr -> mmlusr
      
      * update
      
      ---------
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
    • Update default.yaml (#2092) · cdd954f9
      Wonung Kim authored
    • eeec6dae
      Hailey Schoelkopf authored
  22. 11 Jul, 2024 1 commit
    • Prettify lm_eval --tasks list (#1929) · a0243d54
      anthony-dipofi authored
      
      
      * add  and ; move task list newline logic to new TaskManager.list_all_tasks() method
      
      * format table list into markdown table; add config location column
      
      * add Output Type column
      
      * add logic for printing table of tags separately
      
      * merge with main and fix conflicts ; update docstrings
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  23. 08 Jul, 2024 3 commits
    • Minor doc fix: leaderboard README.md missing mmlu-pro group and task (#2075) · be01651c
      Pankaj Mathur authored
      leaderboard README.md missing mmlu-pro group and task
    • Easier unitxt tasks loading and removal of unitxt library dependancy (#1933) · ad80f555
      Elron Bandel authored
      
      
      * Updated unitxt loading
      Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
      
      * Revert change to general Readme
      Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
      
      * Adjust fda, squadv2, squad_completion and swde to accept config in the constructor
      Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
      
      * Fix scrolls
      Signed-off-by: elronbandel <elron.bandel@ibm.com>
      
      * Update documentation
      Signed-off-by: elronbandel <elron.bandel@ibm.com>
      
      * Enforce backward compatibility
      Signed-off-by: elronbandel <elron.bandel@ibm.com>
      
      * Format unitxt class
      Signed-off-by: elronbandel <elron.bandel@ibm.com>
      
      ---------
      Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
      Signed-off-by: elronbandel <elron.bandel@ibm.com>
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
    • Group agg rework (#1741) · 517aadc4
      Lintang Sutawika authored
      
      
      * add group_config arg
      
      * add a group config that allows disabling table for group score and group aggregate in general
      
      * fixed size configuration
      
      * adjust config
      
      * add group config
      
      * adjust mmlu to use group_config
      
      * fixed args input in aggregate_subtask_metrics
      
      * fixed issues related to printing alias of group and updated yaml
      
      * update all mmlu variants to include group_config
      
      * edit format
      
      * modify mmlu tasks
      
      * adjust group to also be a configurable group
      
      * add configurable group
      
      * simplify get_task_list
      
      * adjust group scoring with using ConfigurableGroup
      
      * adjust args
      
      * update mmlu
      
      * update mmlu
      
      * update to work with new group and task configuration
      
      * readd group_agg
      
      * readd files
      
      * move prepare_print_tasks to evaluator_utils
      
      * sort set to False by default, fix predict_only arg
      
      * add version for groups
      
      * reversed task list
      
      * update additional condition when loading a group in a group yaml
      
      * update truthfulqa
      
      * add description regarding tags replacing group
      
      * replace group to tag
      
      * fixed conditional statement
      
      * remove warning
      
      * update loading of task group and newly added tags
      
      * reformat with pre-commit
      
      * fixed info log
      
      * update
      
      * fix bug
      
      * fix bug
      
      * use task id to differentiate tasks
      
      * convert all groups to configurable groups
      
      * use task_id
      
      * reformat
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * revert truthfulqa
      
      * revert mmlu tasks
      
      * new mmlu config
      
      * new group config parameter `tag_to_task`
      
      * Update truthfulqa_mc2.yaml
      
      * reformat
      
      * add _process_group_config
      
      * adjust task_id
      
      * add get_subtask_list function to get proper subtask list
      
      * group config to_dict update
      
      * remove tag check
      
      * update mmlu
      
      * fix config passing issues
      
      * add test yaml
      
      * format fix
      
      * add documentation
      
      * corner case for single tag being called
      
      * fix indentation
      
      * formatting
      
      * update all mmlu variants
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove group_alias
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove version for metadata
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * update mmlu/
      
      * removed " " in make_table
      
      * change how aggregate_metric is loaded
      
      * change how aggregate_metric is loaded
      
      * update aggregate_metric arg
      
      * update format
      
      * update format
      
      * some docs fixes
      
      * add groups for agieval, aexams, aclue
      
      * add more explicit aggregation groups
      
      * add more groupings / tags distinctions
      
      * add more groupings
      
      * more groupings
      
      * add many explicit group configs
      
      * add many explicit group configs
      
      * add more explicit group configs
      
      * add more explicit group configs
      
      * add more error msgs, agg_metric -> agg_metric_list
      
      * some docs updates
      
      * update task_id to be updateable and uses group:task format
      
      * make KMMLU a tag for now
      
      * update docs
      
      * don't duplicate task names
      
      * fix merge conflicts?
      
      * giving this a try
      
      * clean up diff
      
      * switch mmlu variants over to using
      
      * don't use to-be-deprecated group: config field in overview notebook
      
      * Python tasks which subclass ConfigurableTask now run
      
      * update mmlu
      
      * pre-commit format
      
      * fixed sorting for multi-level printing
      
      * move group api to separate file
      
      * fix bbh aggregation filter usage
      
      * track api/group.py
      
      * adjust group and tags loading
      
      * make explicit group configs for leaderboard and other newer tasks
      
      * fix arabicmmlu
      
      * update
      
      * change arabicmmlu template name???
      
      * update group alias
      
      * fix printing bugs
      
      * check table printing is correct ; update tests
      
      * use mmlu_stem to have a group included in print tests
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  24. 03 Jul, 2024 1 commit