1. 25 Jul, 2025 5 commits
  2. 14 Jul, 2025 2 commits
  3. 12 Jul, 2025 4 commits
  4. 11 Jul, 2025 7 commits
  5. 10 Jul, 2025 3 commits
  6. 03 Jul, 2025 2 commits
    • Humaneval - fix regression (#3102) · 8c1016cb
      Baber Abbasi authored
      * use double quotes
    • Truthfulqa multi harness (#3062) · e0dc33ae
      Blanca Calvo authored
      
      
      * truthfulqa-multi task
      
      * truthfulqa-multi with chat few-shot
      
      * few shot chat implementation
      
      * changed until so it outputs lists
      
      * changed dataset location
      
      * added MT task
      
      * Create README.md
      
      * do not include MT
      
      * changes for PR
      
      * tag change
      
      * removed yaml extension
      
      * adding task to the table
      
      * fix task configs
      
      * add import exception
      
      ---------
      Co-authored-by: Baber <baber@hey.com>
  7. 30 Jun, 2025 1 commit
    • FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092) · a7ca0435
      jinze authored
      * Fix: Align the Humaneval dataset with official results
      
      Details: (1) Modified "doc_to_text" and "gen_prefix" in "humaneval_instruct.yaml" so that they match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".
      
      (2) Changed r.rfind("```") to r.find("```") so that it locates the first "```" rather than the last one.
      
      Results: The official results are partially reproduced. LLaMA3.1-8B-Instruct scores 66.5 (official: 72.6) and LLaMA3.1-70B-Instruct scores 80.5 (official: 80.5).
      
      Ref: PR#2650
      
      * add changelog and version
      
      * add changelog
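The difference between locating the first and the last "```" in change (2) above can be illustrated with a minimal sketch. It assumes the prompt already opens the code fence, so the model response continues the function body and then closes the fence, possibly followed by further fenced snippets. The helper names `cut_at_first_fence` and `cut_at_last_fence` and the sample response are hypothetical and are not taken from the harness.

```python
FENCE = "`" * 3  # three backticks, built programmatically for readability


def cut_at_first_fence(r: str) -> str:
    """Keep everything before the FIRST fence (str.find). Hypothetical helper."""
    idx = r.find(FENCE)
    return r[:idx] if idx != -1 else r


def cut_at_last_fence(r: str) -> str:
    """Keep everything before the LAST fence (str.rfind). Hypothetical helper."""
    idx = r.rfind(FENCE)
    return r[:idx] if idx != -1 else r


# Made-up response: the completion, a closing fence, then an extra fenced snippet.
response = "\n".join(
    [
        "    return a + b",
        FENCE,
        "Here is a usage example:",
        FENCE + "python",
        "print(add(1, 2))",
        FENCE,
        "",
    ]
)

first = cut_at_first_fence(response)
last = cut_at_last_fence(response)

print(repr(first))
print("usage example" in last)
```

With `find`, only the completion survives. With `rfind`, the explanatory sentence and the second snippet stay in the extracted text, which would typically not be valid code to execute.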
  8. 25 Jun, 2025 1 commit
  9. 20 Jun, 2025 1 commit
  10. 19 Jun, 2025 2 commits
  11. 16 Jun, 2025 2 commits
  12. 12 Jun, 2025 1 commit
  13. 08 Jun, 2025 1 commit
    • [longbench] fix metric calculation (#2983) · 147e9d61
      Baber Abbasi authored
      * use all answers
      
      * use middle truncation
      
      * maybe fix classification score
      
      * strip classification preds
      
      * [vllm] remove stop tokens post-hoc
      
      * strip all preds
      
      * pacify pre-commit
      
      * start on truncation utility
      
      * add to readme
      
      * add a footgun doc
      
      * fix newline in yaml templates
      
      * do not strip code_sim preds!
      
      * fix pre-commit config
      
      * fix instruction warning
      
      * add note to longbench readme
  14. 03 Jun, 2025 2 commits
    • remove prints (#3041) · 9f152e0b
      Baber Abbasi authored
    • add Mbpp instruct (#2995) · 60e85da5
      Baber Abbasi authored
      * feat: add mbpp_instruct
      
      * fix: update generation_kwargs to use an empty until list
      
      * fix: correct predictions formatting in pass_at_1 function
      
      * fix: improve code block extraction by checking first without opening backticks
      
      * fix mbpp `pass_at_1`
  15. 26 May, 2025 1 commit
  16. 21 May, 2025 1 commit
  17. 19 May, 2025 2 commits
  18. 15 May, 2025 2 commits