1. 21 Sep, 2025 2 commits
    • feat: Add mmlu-redux and its Spanish translation as generative task definitions (#2705) · fec9dde7
      Luis Cosio authored
      
      
      * Added benchmark
      
      * Added more testing
      
      * Added task definition for mmlu_redux and mmlu_redux_spanish
      
      * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs
      
      * Add remaining MMLU Redux YAMLs and updated tasks README
      
      * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs
      
      * Add MMLU Redux changes from pr-2705
      
      * Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names
      
      * Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes
      
      * Revert Python test changes and comment out one task group to avoid Hugging Face rate limits and task failures
      
      ---------
      Co-authored-by: CT-6282 <ricardo.godric@hotmail.com>
    • Fix LongBench Evaluation (#3273) · 7f698a5a
      Timur Aysin authored
      
      
      * fix: set 'do_sample=False' and use double quotes in 'doc_to_text'
      
      * feat: update versions and README for longbench
      
      * pacify pre-commit
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  2. 08 Sep, 2025 1 commit
  3. 02 Sep, 2025 4 commits
  4. 27 Aug, 2025 3 commits
  5. 26 Aug, 2025 1 commit
    • Support for AIME dataset (#3248) · 5ac7cdf8
      Janna authored
      * add AIME tasks
      
      * standardize the repeats
      
      * fix task naming
      
      * aime25 only has test set
      
      * edit readme
      
      * add utils
      
      * standardize
      
      * fix case sensitivity
      
      * repeat once
      
      * lint
      
      * more linting
      
      * lint huggingface.py
  6. 25 Aug, 2025 3 commits
  7. 23 Aug, 2025 1 commit
  8. 22 Aug, 2025 1 commit
  9. 21 Aug, 2025 6 commits
  10. 08 Aug, 2025 1 commit
  11. 04 Aug, 2025 4 commits
  12. 23 Jul, 2025 2 commits
  13. 22 Jul, 2025 2 commits
  14. 19 Jul, 2025 3 commits
  15. 18 Jul, 2025 1 commit
  16. 16 Jul, 2025 1 commit
    • `bbh_cot_fewshot`: Removed repeated "Let's think step by step." text from bbh cot prompts (#3140) · c2be7211
      philipdoldo authored
      
      
      * Removed the "Let's think step by step." text from the start of the target entry in each sample, so the phrase is no longer repeated twice in the few-shot prompts and the behavior matches the original bbh repository. This applied to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml. In boolean_expressions.yaml, in my opinion, there is an error: the prompt does not include the "Remember that (i) ..." text after the final "A: Let's think step by step." Models like EleutherAI/gpt-neo-125m seem to always begin their answers with this string anyway (copying the few-shot prompts), but I think it should have been part of the prompt, much like "A: Let's think step by step." is included in the prompt for all of the cot tasks. The original bbh repo has the same issue, so I think it is fine to keep it this way for consistency, but I thought I'd point it out.
      
      * feat: remove extra space from answers; add changelog
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  17. 14 Jul, 2025 1 commit
  18. 10 Jul, 2025 1 commit
  19. 03 Jul, 2025 2 commits
    • Humaneval - fix regression (#3102) · 8c1016cb
      Baber Abbasi authored
      * use double quotes
    • Truthfulqa multi harness (#3062) · e0dc33ae
      Blanca Calvo authored
      
      
      * truthfulqa-multi task
      
      * truthfulqa-multi with chat few-shot
      
      * few shot chat implementation
      
      * changed until so it outputs lists
      
      * changed dataset location
      
      * added MT task
      
      * Create README.md
      
      * do not include MT
      
      * changes for PR
      
      * tag change
      
      * removed yaml extension
      
      * adding task to the table
      
      * fix task configs
      
      * add import exception
      
      ---------
      Co-authored-by: Baber <baber@hey.com>