1. 25 Aug, 2025 1 commit
  2. 10 Sep, 2024 1 commit
    • Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232) · decc533d
      Malikeh Ehghaghi authored
      
      
      * arabic leaderboard yaml file is added
      
      * arabic toxigen is implemented
      
      * Dataset library is imported
      
      * arabic sciq is added
      
      * util file of arabic toxigen is updated
      
      * arabic race is added
      
      * arabic piqa is implemented
      
      * arabic open qa is added
      
      * arabic copa is implemented
      
      * arabic boolq is added
      
      * arabic arc easy is added
      
      * arabic arc challenge is added
      
      * arabic exams benchmark is implemented
      
      * arabic hellaswag is added
      
      * arabic leaderboard yaml file metrics are updated
      
      * arabic mmlu benchmarks are added
      
      * arabic mmlu group yaml file is updated
      
      * alghafa benchmarks are added
      
      * acva benchmarks are added
      
      * acva utils.py is updated
      
      * light version of arabic leaderboard benchmarks is added
      
      * bugs fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * library import bug is fixed
      
      * doc to target updated
      
      * bash file is deleted
      
      * results folder is deleted
      
      * leaderboard groups are added
      
      * full arabic leaderboard groups are added, plus some bug fixes to the light version
      
      * Create README.md
      
      README.md for arabic_leaderboard_complete
      
      * Create README.md
      
      README.md for arabic_leaderboard_light
      
      * Delete lm_eval/tasks/arabic_leaderboard directory
      
      * Update README.md
      
      * Update README.md
      
      adding the Arabic leaderboards to the library
      
      * Update README.md
      
      10% of the training set
      
      * Update README.md
      
      10% of the training set
      
      * revert .gitignore to prev version
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * updated main README.md
      
      * Update lm_eval/tasks/README.md
      
      * specify machine translated benchmarks (complete)
      
      * specify machine translated benchmarks (light version)
      
      * add alghafa to the related task names (complete and light)
      
      * add 'acva' to the related task names (complete and light)
      
      * add 'arabic_leaderboard' to all the groups (complete and light)
      
      * all dataset - not a random sample
      
      * added more accurate details to the readme file
      
      * added mt_mmlu from okapi
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * updated mt_mmlu readme
      
      * renaming 'alghafa' full and light
      
      * renaming 'arabic_mmlu' light and full
      
      * renaming 'acva' full and light
      
      * update readme and standardize dir/file names
      
      * running pre-commit
      
      ---------
      Co-authored-by: shahrzads <sayehban@ualberta.ca>
      Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  3. 09 Aug, 2024 1 commit
    • keep new line for task description (#2116) · 8ad598df
      Jungwhan Kim authored
      
      
      * add keep trailing newline
      
      * apply ruff-format
      
      * add prompt unit test
      
      * increment the version of tasks that have description with whitespace
      
      * remove white spaces of leaderboard bbh
      
      * update MMLU expected versions in output
      
      * CI run does display the expected version=1 for mmlu subtasks, fix expected test output again
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
  4. 08 Jul, 2024 1 commit
    • Group agg rework (#1741) · 517aadc4
      Lintang Sutawika authored
      
      
      * add group_config arg
      
      * add a group config that allows disabling table for group score and group aggregate in general
      
      * fixed size configuration
      
      * adjust config
      
      * add group config
      
      * adjust mmlu to use group_config
      
      * fixed args input in aggregate_subtask_metrics
      
      * fixed issues related to printing alias of group and updated yaml
      
      * update all mmlu variants to include group_config
      
      * edit format
      
      * modify mmlu tasks
      
      * adjust group to also be a configurable group
      
      * add configurable group
      
      * simplify get_task_list
      
      * adjust group scoring with using ConfigurableGroup
      
      * adjust args
      
      * update mmlu
      
      * update mmlu
      
      * update to work with new group and task configuration
      
      * readd group_agg
      
      * readd files
      
      * move prepare_print_tasks to evaluator_utils
      
      * sort set to False by default, fix predict_only arg
      
      * add version for groups
      
      * reversed task list
      
      * update additional condition when loading a group in a group yaml
      
      * update truthfulqa
      
      * add description regarding tags replacing group
      
      * replace group to tag
      
      * fixed conditional statement
      
      * remove warning
      
      * update loading of task group and newly added tags
      
      * reformat with pre-commit
      
      * fixed info log
      
      * update
      
      * fix bug
      
      * fix bug
      
      * use task id to differentiate tasks
      
      * convert all groups to configurable groups
      
      * use task_id
      
      * reformat
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * add task_id for python tasks as well
      
      * revert truthfulqa
      
      * revert mmlu tasks
      
      * new mmlu config
      
      * new group config parameter `tag_to_task`
      
      * Update truthfulqa_mc2.yaml
      
      * reformat
      
      * add _process_group_config
      
      * adjust task_id
      
      * add get_subtask_list function to get proper subtask list
      
      * group config to_dict update
      
      * remove tag check
      
      * update mmlu
      
      * fix config passing issues
      
      * add test yaml
      
      * format fix
      
      * add documentation
      
      * corner case for single tag being called
      
      * fix indentation
      
      * formatting
      
      * update all mmlu variants
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove group_alias
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * remove version for metadata
      
      * Update docs/task_guide.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * update mmlu/
      
      * removed " " in make_table
      
      * change how aggregate_metric is loaded
      
      * change how aggregate_metric is loaded
      
      * update aggregate_metric arg
      
      * update format
      
      * update format
      
      * some docs fixes
      
      * add groups for agieval, aexams, aclue
      
      * add more explicit aggregation groups
      
      * add more groupings / tags distinctions
      
      * add more groupings
      
      * more groupings
      
      * add many explicit group configs
      
      * add many explicit group configs
      
      * add more explicit group configs
      
      * add more explicit group configs
      
      * add more error msgs, agg_metric -> agg_metric_list
      
      * some docs updates
      
      * update task_id to be updateable and uses group:task format
      
      * make KMMLU a tag for now
      
      * update docs
      
      * don't duplicate task names
      
      * fix merge conflicts?
      
      * giving this a try
      
      * clean up diff
      
      * switch mmlu variants over to using
      
      * don't use to-be-deprecated group: config field in overview notebook
      
      * Python tasks which subclass ConfigurableTask now run
      
      * update mmlu
      
      * pre-commit format
      
      * fixed sorting for multi-level printing
      
      * move group api to separate file
      
      * fix bbh aggregation filter usage
      
      * track api/group.py
      
      * adjust group and tags loading
      
      * make explicit group configs for leaderboard and other newer tasks
      
      * fix arabicmmlu
      
      * update
      
      * change arabicmmlu template name???
      
      * update group alias
      
      * fix printing bugs
      
      * check table printing is correct ; update tests
      
      * use mmlu_stem to have a group included in print tests
      
      ---------
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>