1. 03 Mar, 2025 1 commit
  2. 11 Feb, 2025 1 commit
    • Adding the Evalita-LLM benchmark (#2681) · b7fccef5
      Michele Resta authored
      
      
      * feat: initial commit with templates for evalita evaluation
      
      * fix: change rule for generate_until
      
      * feat: modified yaml to use reduced version of NER test datasets
      
      * feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
      
      * Add Six Prompts for Each Multiple-Choice Task
      
      * feat: modified fewshot split for textual entailment task
      
      * fix: new doc_to_target function for NER tasks
      
      * Update prompt
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Update prompt
      
      * Add partition for few-shot evaluation
      
      * Rename file
      
      Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml
      
      * Add partition for few-shot evaluation
      
      * Add partition for few-shot evaluation
      
      * Enhance lexical substitution management
      
      - Improve scorer calculation for better accuracy
      - Update model output postprocessing for clearer results
      - Add support for few-shot relation extraction task
      
      * Add F1 macro measure for the document dating task
      
      * Add F1-macro measure to evaluate document dating
      
      * Use the whole dataset
      
      * Small changes
      
      * Add the two prompts for the task of lexical substitution
      
      * Add few-shot split configuration
      
      * Add few-shot split configuration
      
      * Add function for handling few-shot learning setup
      
      * Fix prompt
      
      * Remove configuration file
      
      * Update dataset from test_same to test_cross for evaluations
      
      * Remove whitespace at end of prompt
      
      * Fix configuration error: corrected parameter name for the dataset used in few-shot
      
      * Fix: Check if results is not empty before processing in lexical substitution task
      
      * added the prompts and functions for correct NER and RE execution
      
      * Add accuracy measure
      
      * Add tasks for the EVALITA-LLM benchmark evaluation
      
      * Small changes
      
      Add the alias of the task name that will be printed in the final table results.
      
      * Updated the prompts to reflect changes made to the extended dataset for the Admission Test task
      
      * chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.
      
      * fix: add information on Evalita-LLM for PR
      
      * fix: rename folders and files
      
      * fix: remove unused imports
      
      * chore: run pre-commit
      
      * chore: add task description
      
      ---------
      Co-authored-by: rzanoli <zanoli@fbk.eu>
      Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
  3. 31 Jan, 2025 1 commit
  4. 29 Jan, 2025 1 commit
    • Add Histoires Morales task (#2662) · 1208afd3
      Irina Proskurina authored
      * Add Histoires Morales task
      
      * Histoires Morales task: fix mixed line endings
      
      * Histoires Morales task: fix mixed line endings
      
      * Remove tag for a single task
      
      * Add some MT for Histoires Morales
  5. 28 Jan, 2025 1 commit
    • Add Moral Stories (#2653) · a0466f01
      Irina Proskurina authored
      * Add moral stories task
      
      * Add moral stories task
      
      * Create README.md
      
      * Update README.md
      
      * Update line endings in moral_stories files
  6. 20 Jan, 2025 1 commit
  7. 15 Jan, 2025 3 commits
  8. 24 Dec, 2024 1 commit
    • AraDICE task config file (#2507) · 932e8f9e
      Firoj Alam, Scientist, QCRI authored
      
      
      * added aradice
      
      * Added ArabicMMLU Lev Configs
      
      * added ArabicMMLU egy configs
      
      * Added boolq configs
      
      * Added cultural bench configs
      
      * added openbookqa configs
      
      * Added PiQA configs
      
      * added winogrande configs
      
      * Added truthfulQA configs
      
      * Added aradice group config
      
      * Remove deleted files from repository
      
      * modified arabicmmlu configs
      
      * modified metadata versions
      
      * fixed formatting using ruff
      
      * added aradice tasks information
      
      * pre-commit
      
      * Updated openbookqa utils
      
      * fixed formatting on obqa
      
      ---------
      Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
      Co-authored-by: Baber <baber@hey.com>
  9. 19 Dec, 2024 1 commit
  10. 26 Nov, 2024 1 commit
    • Score tasks (#2452) · 0ef7548d
      Rima Shahbazyan authored
      * score readme added
      
      * generate until task's "until" parameter's default value fixed.
      
      * score mmlu-pro and agieval added
      
      * changed macro accuracy to micro for agieval
      
      * Always E removed from agi eval
      
      * redundancies removed
      
      * MATH added
      
      * minor cosmetic changes for math
      
      * Licenses added Readme updated
      
      * changes for flake8 + license header on math
      
      * Score added to readme and precommit was run.
      
      * Score added to readme and precommit was run.
      
      * Import error fixed
      
      * math task bugfix
      postprocess minor fix
      
      * CR for math added
      
      * math CR
      
      * math task bugfix
      postprocess minor fix
      
      CR for math added
      
      * Math cr fixed
      
      * reverting the default "until" parameter change and adjusting score task configs
  11. 18 Nov, 2024 1 commit
    • Add metabench task to LM Evaluation Harness (#2357) · 62b4364d
      Kozzy Voudouris authored
      
      
      * Add metabench (Kipnis et al. 2024)
      
      * Update metabench tasks for full replication of original benchmarks, using publicly available datasets
      
      * Remove unnecessary import
      
      * Add permute versions of each task, where the answer orders are randomly shuffled.
      
      * Add metabench group for easier evaluations
      
      * Fix mmlu counts after removing duplicate
      
      * Add secondary datasets
      
      * Fix f-string error
      
      * Fix f-string error for permute processing
      
      * Add original hash to outputs for easy matching to original results
      
      * Add line break at end of utils files
      
      * Remove extra line from winogrande
      
      * Reformat for linters
      
      * fix multiple input test
      
      * appease pre-commit
      
      * Add metabench to tasks README
      
      * fix multiple input `test_doc_to_text`
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  12. 16 Nov, 2024 1 commit
    • kbl-v0.1.1 (#2493) · cbc31eb8
      Wonseok Hwang authored
      * release kbl-v0.1
      
      * fix linting
      
      * remove rag tasks as doc_to_text functions cause trouble
      
      * remove remaining rag tasks
      
      * remove unnecessary repeat in yaml files and rag dataset in hf-hub
      
      * remove unnecessary newline; introduce cfg files in lbox/kbl in hf
      
      * Make task yaml files consistent to hf-datasets-config
      
      * Make task yaml files consistent to hf-datasets-config
      
      * Remove trailing empty space in doc-to-text
      
      * Remove unnecessary yaml file
      
      * Fix task naming error
      
      * trailing space removed
  13. 05 Nov, 2024 1 commit
    • Add Japanese Leaderboard (#2439) · 26f607f5
      mtkachenko authored
      * add jaqket_v2 and jcommonsenseqa
      
      * remove comments
      
      * remove num_beams as it is incompatible with vllm
      
      * add jnli + refactor
      
      * rename jnla -> jnli
      
      * add jsquad + replace colon chars with the Japanese unicode
      
      * ignore whitespaces in generation tasks
      
      * add marc_ja
      
      * add xwinograd + simplify other yamls
      
      * add mgsm and xlsum
      
      * refactor xlsum
      
      * add ja_leaderboard tag
      
      * edit README.md
      
      * update README.md
      
      * add credit + minor changes
      
      * run ruff format
      
      * address review comments + add group
      
      * remove aggregate_metric_list
      
      * remove tags
      
      * update tasks/README.md
  14. 01 Nov, 2024 1 commit
  15. 30 Oct, 2024 1 commit
  16. 04 Oct, 2024 2 commits
  17. 03 Oct, 2024 2 commits
    • Add new benchmark: Galician bench (#2155) · 0e763862
      zxcvuser authored
      * Add galician_bench
      
      * Update xnli_gl path
      
      * Add flores_gl group
      
      * Update _flores_common_yaml
      
      * Updated some task groupings and readme
      
    • Add new benchmark: Spanish bench (#2157) · ea17b98e
      zxcvuser authored
      * Add spanish_bench
      
      * Add flores_es group
      
      * Update _flores_common_yaml
      
      * Delete lm_eval/tasks/spanish_bench/escola.yaml
      
      * Delete escola from spanish_bench.yaml
      
      * Delete escola from README.md
      
      * pre-commit run --all-files
      
      * Updated some task groupings and readme
      
  18. 30 Sep, 2024 1 commit
  19. 26 Sep, 2024 1 commit
  20. 10 Sep, 2024 1 commit
    • Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232) · decc533d
      Malikeh Ehghaghi authored
      
      
      * arabic leaderboard yaml file is added
      
      * arabic toxigen is implemented
      
      * Dataset library is imported
      
      * arabic sciq is added
      
      * util file of arabic toxigen is updated
      
      * arabic race is added
      
      * arabic piqa is implemented
      
      * arabic open qa is added
      
      * arabic copa is implemented
      
      * arabic boolq is added
      
      * arabic arc easy is added
      
      * arabic arc challenge is added
      
      * arabic exams benchmark is implemented
      
      * arabic hellaswag is added
      
      * arabic leaderboard yaml file metrics are updated
      
      * arabic mmlu benchmarks are added
      
      * arabic mmlu group yaml file is updated
      
      * alghafa benchmarks are added
      
      * acva benchmarks are added
      
      * acva utils.py is updated
      
      * light version of arabic leaderboard benchmarks are added
      
      * bugs fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * bug fixed
      
      * library import bug is fixed
      
      * doc to target updated
      
      * bash file is deleted
      
      * results folder is deleted
      
      * leaderboard groups are added
      
      * full arabic leaderboard groups are added, plus some bug fixes to the light version
      
      * Create README.md
      
      README.md for arabic_leaderboard_complete
      
      * Create README.md
      
      README.md for arabic_leaderboard_light
      
      * Delete lm_eval/tasks/arabic_leaderboard directory
      
      * Update README.md
      
      * Update README.md
      
      adding the Arabic leaderboards to the library
      
      * Update README.md
      
      10% of the training set
      
      * Update README.md
      
      10% of the training set
      
      * revert .gitignore to prev version
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * updated main README.md
      
      * Update lm_eval/tasks/README.md
      
      * specify machine translated benchmarks (complete)
      
      * specify machine translated benchmarks (light version)
      
      * add alghafa to the related task names (complete and light)
      
      * add 'acva' to the related task names (complete and light)
      
      * add 'arabic_leaderboard' to all the groups (complete and light)
      
      * all dataset - not a random sample
      
      * added more accurate details to the readme file
      
      * added mt_mmlu from okapi
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * Update lm_eval/tasks/README.md
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
      
      * updated mt_mmlu readme
      
      * renaming 'alghafa' full and light
      
      * renaming 'arabic_mmlu' light and full
      
      * renaming 'acva' full and light
      
      * update readme and standardize dir/file names
      
      * running pre-commit
      
      ---------
      Co-authored-by: shahrzads <sayehban@ualberta.ca>
      Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
      Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  21. 23 Aug, 2024 1 commit
  22. 19 Aug, 2024 1 commit
    • Lingoly README update (#2228) · f81b62bf
      am-bean authored
      * Setting up lingoly task
      
      * Testing yaml changes to debug
      
      * Adding pre-commit hooks
      
      * Functional LingOly benchmark
      
      * Renaming files and adding grouping
      
      * Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.
      
      * Adding LingOly to the README file
  23. 05 Aug, 2024 1 commit
  24. 04 Aug, 2024 2 commits
  25. 14 Jul, 2024 1 commit
  26. 12 Jul, 2024 1 commit
  27. 03 Jul, 2024 2 commits
  28. 25 Jun, 2024 1 commit
  29. 20 Jun, 2024 1 commit
  30. 19 Jun, 2024 2 commits
  31. 11 Jun, 2024 1 commit
  32. 06 Jun, 2024 1 commit
  33. 03 Jun, 2024 1 commit