1. 01 Dec, 2024 1 commit
    • Update Unitxt task to use locally installed unitxt and not download Unitxt... · 1170ef9e
      Yoav Katz authored
      
      Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface (#2514)
      
      * Moved to require unitxt installation and not download unitxt from HF hub.
      
      This has performance benefits and simplifies the code.
      Signed-off-by: Yoav Katz <katz@il.ibm.com>
      
      * Updated watsonx documentation
      
      * Updated installation instructions
      
      * Removed redundant comman
      
      * Allowed unitxt tasks to generate chat APIs
      
      Modified WatsonXI model to support chat APIs
      
      * Removed print
      
      * Run precommit formatting
      
      ---------
      Signed-off-by: Yoav Katz <katz@il.ibm.com>
  2. 26 Nov, 2024 1 commit
    • Score tasks (#2452) · 0ef7548d
      Rima Shahbazyan authored
      * score readme added
      
      * generate until task's "until" parameter's default value fixed.
      
      * score mmlu-pro and agieval added
      
      * changed macro accuracy to micro for agieval
      
      * Always E removed from agi eval
      
      * redundancies removed
      
      * MATH added
      
      * minor cosmetic changes for math
      
      * Licenses added Readme updated
      
      * changes for flake8 + license header on math
      
      * Score added to readme and precommit was run.
      
      * Score added to readme and precommit was run.
      
      * Import error fixed
      
      * math task bugfix
      postprocess minor fix
      
      * CR for math added
      
      * math CR
      
      * math task bugfix
      postprocess minor fix
      
      CR for math added
      
      * Math cr fixed
      
      * reverting the default "until" parameter change and adjusting score task configs
  3. 18 Nov, 2024 2 commits
    • Add metabench task to LM Evaluation Harness (#2357) · 62b4364d
      Kozzy Voudouris authored
      
      
      * Add metabench (Kipnis et al. 2024)
      
      * Update metabench tasks for full replication of original benchmarks, using publicly available datasets
      
      * Remove unnecessary import
      
      * Add permute versions of each task, where the answer orders are randomly shuffled.
      
      * Add metabench group for easier evaluations
      
      * Fix mmlu counts after removing duplicate
      
      * Add secondary datasets
      
      * Fix f-string error
      
      * Fix f-string error for permute processing
      
      * Add original hash to outputs for easy matching to original results
      
      * Add line break at end of utils files
      
      * Remove extra line from winogrande
      
      * Reformat for linters
      
      * fix multiple input test
      
      * appease pre-commit
      
      * Add metabench to tasks README
      
      * fix multiple input `test_doc_to_text`
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
    • remove duplicate `arc_ca` (#2499) · 8222ad0a
      Baber Abbasi authored
  4. 16 Nov, 2024 1 commit
    • kbl-v0.1.1 (#2493) · cbc31eb8
      Wonseok Hwang authored
      * release kbl-v0.1
      
      * fix linting
      
      * remove rag tasks as doc_to_text functions cause trouble
      
      * remove remaining rag tasks
      
      * remove unnecessary repeat in yaml files and rag dataset in hf-hub
      
      * remove unnecessary newline; introduce cfg files in lbox/kbl in hf
      
      * Make task yaml files consistent to hf-datasets-config
      
      * Make task yaml files consistent to hf-datasets-config
      
      * Remove trailing empty space in doc-to-text
      
      * Remove unnecessary yaml file
      
      * Fix task naming error
      
      * trailing space removed
  5. 11 Nov, 2024 1 commit
  6. 09 Nov, 2024 1 commit
  7. 07 Nov, 2024 1 commit
  8. 05 Nov, 2024 2 commits
    • Add Japanese Leaderboard (#2439) · 26f607f5
      mtkachenko authored
      * add jaqket_v2 and jcommonsenseqa
      
      * remove comments
      
      * remove num_beams as it is incompatible with vllm
      
      * add jnli + refactor
      
      * rename jnla -> jnli
      
      * add jsquad + replace colon chars with the Japanese unicode
      
      * ignore whitespaces in generation tasks
      
      * add marc_ja
      
      * add xwinograd + simplify other yamls
      
      * add mgsm and xlsum
      
      * refactor xlsum
      
      * add ja_leaderboard tag
      
      * edit README.md
      
      * update README.md
      
      * add credit + minor changes
      
      * run ruff format
      
      * address review comments + add group
      
      * remove aggregate_metric_list
      
      * remove tags
      
      * update tasks/README.md
    • Modify label errors in catcola and paws-x (#2434) · fb2e4b59
      zxcvuser authored
      
      
      * Modify label errors in catcola and paws
      
      * Update version to 1.0 in pawsx_template_yaml
      
      * add changelog
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  9. 01 Nov, 2024 1 commit
  10. 30 Oct, 2024 1 commit
  11. 22 Oct, 2024 1 commit
  12. 20 Oct, 2024 1 commit
  13. 17 Oct, 2024 2 commits
  14. 16 Oct, 2024 1 commit
  15. 14 Oct, 2024 1 commit
  16. 08 Oct, 2024 1 commit
  17. 07 Oct, 2024 3 commits
  18. 04 Oct, 2024 3 commits
  19. 03 Oct, 2024 2 commits
    • Add new benchmark: Galician bench (#2155) · 0e763862
      zxcvuser authored
      * Add galician_bench
      
      * Update xnli_gl path
      
      * Add flores_gl group
      
      * Update _flores_common_yaml
      
      * Updated some task groupings and readme
      
      ---------
    • Add new benchmark: Spanish bench (#2157) · ea17b98e
      zxcvuser authored
      * Add spanish_bench
      
      * Add flores_es group
      
      * Update _flores_common_yaml
      
      * Delete lm_eval/tasks/spanish_bench/escola.yaml
      
      * Delete escola from spanish_bench.yaml
      
      * Delete escola from README.md
      
      * pre-commit run --all-files
      
      * Updated some task groupings and readme
      
      ---------
  20. 30 Sep, 2024 2 commits
  21. 28 Sep, 2024 1 commit
  22. 26 Sep, 2024 7 commits
  23. 24 Sep, 2024 1 commit
  24. 13 Sep, 2024 1 commit
    • Multimodal prototyping (#2243) · fb963f0f
      Lintang Sutawika authored
      
      
      * add WIP hf vlm class
      
      * add doc_to_image
      
      * add mmmu tasks
      
      * fix merge conflicts
      
      * add lintang's changes to hf_vlms.py
      
      * fix doc_to_image
      
      * added yaml_path for config-loading
      
      * revert
      
      * add line to process str type v
      
      * update
      
      * modeling cleanup
      
      * add aggregation for mmmu
      
      * rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)
      
      * implemented doc_to_image
      
      * update doc_to_image to accept list of features
      
      * update functions
      
      * readd image processed
      
      * update args process
      
      * bugfix for repeated images fed to model
      
      * push WIP loglikelihood code
      
      * commit most recent code (generative ; qwen2-vl testing)
      
      * preliminary image_token_id handling
      
      * small mmmu update: some qs have >4 mcqa options
      
      * push updated modeling code
      
      * use processor.apply_chat_template
      
      * add mathvista draft
      
      * nit
      
      * nit
      
      * ensure no footguns in text<>multimodal LM<>task incompatibility
      
      * add notification to readme regarding launch of prototype!
      
      * fix compatibility check
      
      * reorganize mmmu configs
      
      * chat_template=None
      
      * add interleave chat_template
      
      * add condition
      
      * add max_images; interleave=true
      
      * nit
      
      * testmini_mcq
      
      * nit
      
      * pass image string; convert img
      
      * add vllm
      
      * add init
      
      * vlm add multi attr
      
      * fixup
      
      * pass max images to vllm model init
      
      * nit
      
      * encoding to device
      
      * fix HFMultimodalLM.chat_template ?
      
      * add mmmu readme
      
      * remove erroneous prints
      
      * use HFMultimodalLM.chat_template ; restore tasks/__init__.py
      
      * add docstring for replace_placeholders in utils
      
      * fix `replace_placeholders`; set image_string=None
      
      * fix typo
      
      * cleanup + fix merge conflicts
      
      * update MMMU readme
      
      * del mathvista
      
      * add some sample scores
      
      * Update README.md
      
      * add log msg for image_string value
      
      ---------
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
      Co-authored-by: Baber Abbasi <baber@eleuther.ai>
      Co-authored-by: Baber <baber@hey.com>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  25. 10 Sep, 2024 1 commit
    • Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232) · decc533d
      Malikeh Ehghaghi authored
      
      
      * arabic leaderboard yaml file is added
      
      * arabic toxigen is implemented
      
      * Dataset library is imported
      
      * arabic sciq is added
      
      * util file of arabic toxigen is updated
      
      * arabic race is added
      
      * arabic piqa is implemented
      
      * arabic open qa is added
      
      * arabic copa is implemented
      
      * arabic boolq is added
      
      * arabic arc easy is added
      
      * arabic arc challenge is added
      
      * arabic exams benchmark is implemented
      
      * arabic hellaswag is added
      
      * arabic leaderboard yaml file metrics are updated
      
      * arabic mmlu benchmarks are added
      
      * arabic mmlu group yaml file is updated
      
      * alghafa benchmarks are added
      
      * acva benchmarks are added
      
      * acva utils.py is updated
      
      * light version of arabic leaderboard benchmarks is added
      
      * bugs fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * library import bug is fixed
      
      * doc to target updated
      
      * bash file is deleted
      
      * results folder is deleted
      
      * leaderboard groups are added
      
      * full arabic leaderboard groups are added, plus some bug fixes to the light version
      
      * Create README.md
      
      README.md for arabic_leaderboard_complete
      
      * Create README.md
      
      README.md for arabic_leaderboard_light
      
      * Delete lm_eval/tasks/arabic_leaderboard directory
      
      * Update README.md
      
      * Update README.md
      
      adding the Arabic leaderboards to the library
      
      * Update README.md
      
      10% of the training set
      
      * Update README.md
      
      10% of the training set
      
      * revert .gitignore to prev version
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * updated main README.md
      
      * Update lm_eval/tasks/README.md
      
      * specify machine translated benchmarks (complete)
      
      * specify machine translated benchmarks (light version)
      
      * add alghafa to the related task names (complete and light)
      
      * add 'acva' to the related task names (complete and light)
      
      * add 'arabic_leaderboard' to all the groups (complete and light)
      
      * all dataset - not a random sample
      
      * added more accurate details to the readme file
      
      * added mt_mmlu from okapi
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * updated mt_mmlu readme
      
      * renaming 'alghafa' full and light
      
      * renaming 'arabic_mmlu' light and full
      
      * renaming 'acva' full and light
      
      * update readme and standardize dir/file names
      
      * running pre-commit
      
      ---------
      Co-authored-by: shahrzads <sayehban@ualberta.ca>
      Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>