1. 21 Sep, 2025 3 commits
    • Add BabiLong (#3287) · ccfa4ad1
      Janna authored
      * create babilong tasks
      
      * lint
      
      * add clarification
      
      * fix typo
      
      * add babilong description
    • feat: Add mmlu-redux and its Spanish translation as generative task definitions (#2705) · fec9dde7
      Luis Cosio authored
      
      
      * Added benchmark
      
      * Added more testing
      
      * Added task definition for mmlu_redux and mmlu_redux_spanish
      
      * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs
      
      * Add remaining MMLU Redux YAMLs and updated tasks README
      
      * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs
      
      * Add MMLU Redux changes from pr-2705
      
      * Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names
      
      * Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes
      
      * Revert Python test changes and comment out one task group to avoid the Hugging Face rate limit and task failure
      
      ---------
      Co-authored-by: CT-6282 <ricardo.godric@hotmail.com>
    • Fix LongBench Evaluation (#3273) · 7f698a5a
      Timur Aysin authored
      
      
      * fix: set 'do_sample=False' and use double quotes in 'doc_to_text'
      
      * feat: update versions and README for longbench
      
      * pacify pre-commit
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  2. 08 Sep, 2025 1 commit
  3. 02 Sep, 2025 4 commits
  4. 27 Aug, 2025 3 commits
  5. 26 Aug, 2025 1 commit
    • Support for AIME dataset (#3248) · 5ac7cdf8
      Janna authored
      * add AIME tasks
      
      * standardize the repeats
      
      * fix task naming
      
      * aime25 only has test set
      
      * edit readme
      
      * add utils
      
      * standardize
      
      * fix case sensitivity
      
      * repeat once
      
      * lint
      
      * more linting
      
      * lint huggingface.py
  6. 25 Aug, 2025 3 commits
  7. 23 Aug, 2025 1 commit
  8. 22 Aug, 2025 1 commit
  9. 21 Aug, 2025 6 commits
  10. 08 Aug, 2025 1 commit
  11. 04 Aug, 2025 4 commits
  12. 23 Jul, 2025 2 commits
  13. 22 Jul, 2025 2 commits
  14. 19 Jul, 2025 3 commits
  15. 18 Jul, 2025 1 commit
  16. 16 Jul, 2025 1 commit
    • `bbh_cot_fewshot`: Removed repeated "Let's think step by step." text from bbh cot prompts (#3140) · c2be7211
      philipdoldo authored
      
      
      * Removed the 'Let's think step by step.' text from the start of the target entry in each sample, so that the phrase is no longer repeated in the few-shot prompts and the behavior matches the original bbh repository. This applied to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml. In boolean_expressions.yaml there is, in my opinion, an error: the prompt does not include the 'Remember that (i) ...' text after the final 'A: Let's think step by step.'. Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying the few-shot examples), but I think it should have been part of the prompt, much like 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. The original bbh repo has the same issue, so I think it is fine to keep it this way for consistency; I am only pointing it out.
      
      * feat: remove extra space from answers; add changelog
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  17. 14 Jul, 2025 1 commit
  18. 10 Jul, 2025 1 commit
  19. 03 Jul, 2025 1 commit
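
The duplication described in the `bbh_cot_fewshot` commit (c2be7211) can be illustrated with a minimal Python sketch. It assumes a simplified prompt layout: the helper `build_fewshot_prompt`, the `Q:`/`A:` template, and the arithmetic sample are hypothetical and are not taken from the harness or the bbh YAML files. Only the phrase "Let's think step by step." comes from the commit message.

```python
# Hypothetical illustration of the repeated chain-of-thought prefix.
COT_PREFIX = "Let's think step by step."


def build_fewshot_prompt(shots, question):
    """Join (question, target) pairs; every answer slot ends with the CoT prefix."""
    parts = [f"Q: {q}\nA: {COT_PREFIX} {target}" for q, target in shots]
    # The final, unanswered question also ends with the prefix.
    parts.append(f"Q: {question}\nA: {COT_PREFIX}")
    return "\n\n".join(parts)


# After the fix: the target no longer starts with the prefix.
new_target = "2 + 3 = 5. So the answer is 5."
# Before the fix: the target itself began with the prefix.
old_target = f"{COT_PREFIX} {new_target}"

old_prompt = build_fewshot_prompt([("What is 2 + 3?", old_target)], "What is 4 + 1?")
new_prompt = build_fewshot_prompt([("What is 2 + 3?", new_target)], "What is 4 + 1?")

doubled = f"{COT_PREFIX} {COT_PREFIX}"
print(doubled in old_prompt)  # the phrase is repeated back to back
print(doubled in new_prompt)  # the repetition is gone
```

Under these assumptions, the old target makes the phrase appear twice in a row inside each few-shot example, while the stripped target leaves one occurrence per answer slot.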