1. 11 Jul, 2025 2 commits
  2. 10 Jul, 2025 3 commits
  3. 03 Jul, 2025 2 commits
    • Humaneval - fix regression (#3102) · 8c1016cb
      Baber Abbasi authored
      * use double quotes
    • Truthfulqa multi harness (#3062) · e0dc33ae
      Blanca Calvo authored
      
      
      * truthfulqa-multi task
      
      * truthfulqa-multi with chat few-shot
      
      * few shot chat implementation
      
      * changed until so it outputs lists
      
      * changed dataset location
      
      * added MT task
      
      * Create README.md
      
      * do not include MT
      
      * changes for PR
      
      * tag change
      
      * removed yaml extension
      
      * adding task to the table
      
      * fix task configs
      
      * add import exception
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  4. 30 Jun, 2025 1 commit
    • FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435
      jinze authored
      * Fix: Align the Humaneval dataset with official results
      
      Details: (1) Modified "doc_to_text" and "gen_prefix" in "humaneval_instruct.yaml" to match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".

      (2) Changed r.rfind("```") to r.find("```") so that it locates the first "```" rather than the last one.

      Results: The official results are partially reproduced. LLaMA3.1-8B-Instruct scores 66.5 (official: 72.6), and LLaMA3.1-70B-Instruct scores 80.5 (official: 80.5).
      
      Ref: PR#2650
      
      * add changelog and version
      
      * add changelog
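The difference between `find` and `rfind` described in item (2) of the commit above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the helper names and the sample response are hypothetical, and it assumes the model's response starts inside a code block that the prompt prefix already opened, so the first fence marks the end of the generated code.

```python
# Hypothetical helpers illustrating first-fence vs last-fence truncation.
FENCE = "`" * 3  # the Markdown code-fence marker


def cut_at_first_fence(r: str) -> str:
    """Keep the text before the first fence; return r unchanged if there is none."""
    idx = r.find(FENCE)
    return r[:idx] if idx != -1 else r


def cut_at_last_fence(r: str) -> str:
    """Keep the text before the last fence; return r unchanged if there is none."""
    idx = r.rfind(FENCE)
    return r[:idx] if idx != -1 else r


# Assumed response shape: code first, then a closing fence, then extra chatter
# that itself contains another fenced block.
response = (
    "    return a + b\n"
    + FENCE
    + "\nExplanation:\n"
    + FENCE
    + "python\nprint(add(1, 2))\n"
    + FENCE
)

first = cut_at_first_fence(response)  # only the function body
last = cut_at_last_fence(response)    # function body plus the trailing chatter
```

Under these assumptions, cutting at the last fence leaves the explanation and the second code block in the extracted text, which would then be executed as part of the solution. Cutting at the first fence keeps only the completion itself.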
  5. 25 Jun, 2025 1 commit
  6. 20 Jun, 2025 1 commit
  7. 19 Jun, 2025 2 commits
  8. 16 Jun, 2025 2 commits
  9. 12 Jun, 2025 1 commit
  10. 08 Jun, 2025 1 commit
    • [longbench] fix metric calculation (#2983) · 147e9d61
      Baber Abbasi authored
      * use all answers
      
      * use middle truncation
      
      * maybe fix classification score
      
      * strip classification preds
      
      * [vllm] remove stop tokens post-hoc
      
      * strip all preds
      
      * pacify pre-commit
      
      * start on truncation utility
      
      * add to readme
      
      * add a footgun doc
      
      * fix newline in yaml templates
      
      * do not strip code_sim preds!
      
      * fix pre-commit config
      
      * fix instruction warning
      
      * add note to longbench readme
  11. 03 Jun, 2025 2 commits
    • remove prints (#3041) · 9f152e0b
      Baber Abbasi authored
    • add Mbpp instruct (#2995) · 60e85da5
      Baber Abbasi authored
      * feat: add mbpp_instruct
      
      * fix: update generation_kwargs to use an empty until list
      
      * fix: correct predictions formatting in pass_at_1 function
      
      * fix: improve code block extraction by checking first without opening backticks
      
      * fix mbpp `pass_at_1`
  12. 26 May, 2025 1 commit
  13. 21 May, 2025 1 commit
  14. 19 May, 2025 2 commits
  15. 15 May, 2025 4 commits
    • fix formatting (#2759) · 0126f6d1
      Baber Abbasi authored
    • Update utils.py (#2870) · 2bde99e4
      tawsif authored
    • Added C4 Support (#2889) · 86a3b270
      Yufeng Xu authored
      * added c4 dataset (working)
      
      * fixed bugs in c4
      
      * fixed loading bugs in c4 dataset; using partial loading
      
      * cleaned the code
      
      * added version number for c4
      
      * removed irrelevant files
    • AfroBench: How Good are Large Language Models on African Languages? (#2825) · 18297993
      Jess authored
      
      
      * add afrixnli to task
      
      * add chat completion
      
      * remove chat completion -untested
      
      * afrimmlu added
      
      * afrimmlu folder update
      
      * afrimmlu folder update
      
      * updated prompt
      
      * remove print
      
      * add afrimgsm -direct
      
      * add squad metric
      
      * fix bash script
      
      * remove direct util, update common yaml
      
      * remove print
      
      * add few shot, metric fixes
      
      * fix direct path, add bash script for gpt models
      
      * added translate test
      
      * update afrixnli tasks
      
      * update afrixnli tasks
      
      * update metrics for afrixnli
      
      * prompt translations fix
      
      * prompt translations fix
      
      * filter and metric fix -mgsm
      
      * remove squad metric
      
      * remove squad metric
      
      * add f1 score to mgsm
      
      * add f1 score to mgsm
      
      * update native-direct with lin
      
      * change f1 function
      
      * add lin to utils
      
      * add utils
      
      * remove test limit
      
      * remove test configs
      
      * add swahili to mmlu
      
      * change eng to ewe in ewe yaml mmlu
      
      * add squad metric to mgsm, remove whitespace filter
      
      * added translate test
      
      * added afrixnli_translate
      
      * fix exact match valueError
      
      * fix exact match valueError
      
      * restructure mmlu folder
      
      * spacing
      
      * remove afrimmlu_translate folder
      
      * add utility
      
      * format task name, clean ups
      
      * modified mgsm
      
      * update on afrimgsm
      
      * update on afrimgsm
      
      * removed utils
      
      * other mgsm varieties
      
      * other mgsm varieties
      
      * adding translate direct
      
      * Update translate_direct_yaml
      
      * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
      
      * edit for open models
      
      * Update translate_direct_yaml
      
      * add verbalizer for xnli
      
      * change xnli from multiple choice to generate
      
      * add manual accuracy scores
      
      * revert xnli to multiple choice
      
      * change afrimgsm utils
      
      * revert xnli to multiple_choice
      
      * cleanups and readmes
      
      * remove openai fixes and unused regex
      
      * pr review changes
      
      * revert metrics.py, task.py and extraction.py to main version
      
      * add afrisenti
      
      * utilities
      
      * pulled from main
      
      * add afrixnli
      
      * add afrimmlu
      
      * update afrixnli prompts
      
      * missing senti language
      
      * fix afrisenti prompt 2
      
      * fix afrisenti prompts
      
      * fix afrisenti prompts
      
      * configure task grouping
      
      * add multiple prompts to afrixnli for irokobench
      
      * add multiple prompts to afrimmlu for irokobench
      
      * Update afrixnli_yaml
      
      * fixes and moves
      
      * fixes and moves
      
      * afrimmlu multiple prompts configs
      
      * remove validation set from afrimmlu
      
      * remove eng from afrimmlu translate test
      
      * correct dataset path
      
      * multiple prompts for mgsm
      
      * file restructure
      
      * afribench grouping
      
      * repo restructuring
      
      * repo restructuring
      
      * update exact match to hugging face exact match and add new mgsm language
      
      * remove decontamination
      
      * update generation kwargs
      
      * update generation kwargs for all mgsm prompts
      
      * remove lang
      
      * update generation kwargs for afrimgsm translatetest
      
      * add afrimgsm cot for direct and translate
      
      * remove eng from translate-cot
      
      * add masakhaPOS tasks
      
      * remove changes from task script
      
      * add masakhanews tasks
      
      * add uhura arc easy
      
      * add afriqa and belebele files
      
      * add tags for easier run. add naija rc
      
      * add new metrics and transformation scripts
      
      * fix afriqa swa fewshot split
      
      * add naijarc
      
      * add afrobench lite tasks
      
      * update afrobench
      
      * update afrobench
      
      * remove unverified files to avoid bugs
      
      * remove files not needed
      
      * add afrobench tasks
      
      * add afrobench tasks
      
      * change to version 1
      
      * change to version 1
      
      * update afrobench
      
      * update afrobench
      
      * restore metric to original script
      
      * update readme instructions
      
      * add individual dataset readmes
      
      * add link to collections
      
      * correct run script
      
      * align with main
      
      * align with main
      
      * align with main
      
      * align with main
      
      * align with main
      
      * align with main
      
      * align with main
      
      * align with main
      
      * failed run fixes
      
      * failed run fixes
      
      * add afrimgsm cot
      
      * Apply precommit fixes
      
      * update mafand dataset name
      
      * pull request fixes
      
      * remove afrihate due to availability
      
      ---------
      Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
      Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
      Co-authored-by: David Adelani <davlanade@gmail.com>
      Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
  16. 13 May, 2025 2 commits
  17. 06 May, 2025 2 commits
  18. 29 Apr, 2025 1 commit
  19. 16 Apr, 2025 3 commits
  20. 14 Apr, 2025 1 commit
  21. 04 Apr, 2025 2 commits
    • Add GSM8K Platinum (#2771) · 11ac352d
      Qubitium-ModelCloud authored
      * add gsm8k platinum
      
      * only test splits
      
      * wrong dataset
      
      * link to blog
      
      * format
    • Optimization for evalita-llm rouge computation (#2878) · 22bd2bcb
      Michele Resta authored
      
      
      * feat: initial commit with templates for evalita evaluation
      
      * fix: change rule for generate_until
      
      * feat: modified yaml to use reduced version of NER test datasets
      
      * feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
      
      * Add Six Prompts for Each Multiple-Choice Task
      
      * fix: fastest eval for summarization
      
      * chore: linted with ruff
      
      * chore: linted with ruff
      
      ---------
      Co-authored-by: rzanoli <zanoli@fbk.eu>
  22. 02 Apr, 2025 2 commits
  23. 01 Apr, 2025 1 commit