- 22 Sep, 2025 1 commit
-
-
priverabsc authored
* Add eqbench tasks in Spanish and Catalan * Incremented catalan_bench and spanish_bench versions. Added 'multilingual' folder inside 'eq_bench' and moved the eqbench_ca and eqbench_es .yaml files to that folder. Updated the tasks README with eqbench_es and eqbench_ca, stating in each description both the Hugging Face link and the translation method. * Fixed tasks table. * remove test_task.sh and results folder * Add utils.py to multilingual folder
-
- 21 Sep, 2025 5 commits
-
-
its-alpesh authored
* Add humaneval_infilling task * pacify pre-commit --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
Janna authored
* register aime * lint --------- Co-authored-by: Baber <baber@hey.com>
-
Janna authored
* create babilong tasks * lint * add clarification * fix typo * add babilong description
-
Luis Cosio authored
* Added benchmark * Added more testing * Added task definition for mmlu_redux and mmlu_redux_spanish * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs * Add remaining MMLU Redux YAMLs and updated tasks README * Add MMLU Redux English and Spanish tasks with YAML fixes and READMEs * Add MMLU Redux changes from pr-2705 * Resolve pre-commit hook and pytest overlapping group issues by adding mmlu_redux_spanish task entries and unique subgroup names * Enhance retry logic to prevent 429 error when using Hugging Face API for tests, apply pre-commit fixes * Revert python test changes and comment out one task group to avoid Hugging Face rate limit and task failure --------- Co-authored-by: CT-6282 <ricardo.godric@hotmail.com>
-
Timur Aysin authored
* fix: set 'do_sample=False' and use double quotes in 'doc_to_text' * feat: update versions and README for longbench * pacify pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
- 08 Sep, 2025 1 commit
-
-
James A. Michaelov authored
* add icelandic_winogrande * fix spacing for final words in sentence
-
- 02 Sep, 2025 4 commits
-
-
Valle Ruiz-Fernández authored
* Add EsBBQ and CaBBQ tasks * Linter fixes * add esbbq and cabbq to task list --------- Co-authored-by: Júlia Falcão <juliafsfalcao@hotmail.com>
-
James A. Michaelov authored
-
James A. Michaelov authored
-
James A. Michaelov authored
* run linter * add acc_norm
-
- 27 Aug, 2025 3 commits
-
-
Gül Sena A authored
* Fix codex-glue/code2text group issue * Added README * pacify pre-commit --------- Co-authored-by: Baber <baber@hey.com>
-
Baber Abbasi authored
-
Slim Frikha authored
-
- 26 Aug, 2025 1 commit
-
-
Janna authored
* add AIME tasks * standardize the repeats * fix task naming * aime25 only has test set * edit readme * add utils * standardize * fix case sensitivity * repeat once * lint * more linting * lint huggingface.py
-
- 25 Aug, 2025 3 commits
-
-
Weihao XUAN authored
* update MMLU_ProX * update MMLU_ProX * cleanup code by pre-commit
-
William Held authored
* Anthropic Discrim Eval * Mixed Effects Regression * Actually wire it all up * Operator Name Doesn't Exist on Github * Update lm_eval/tasks/discrim_eval/discrim_eval_implicit.yaml Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> * Update discrim_eval_implicit.yaml * Update discrim_eval_explicit.yaml * pacify pre-commit --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: Baber <baber@hey.com>
-
Geun, Lim authored
* feat: Add CLIcK task * Fix formatting issues * Add Click Task Description * fix: lint * fix
-
- 23 Aug, 2025 1 commit
-
-
Baber Abbasi authored
* update math_verify * remove normalization * use full solution in `parse` * update version
-
- 22 Aug, 2025 1 commit
-
-
Patrick Haller authored
Co-authored-by: Patrick Haller <phmaker@Patricks-MacBook-Pro.local>
-
- 21 Aug, 2025 6 commits
-
-
James A. Michaelov authored
* add lm_syneval * edit readme * update task readme * formatting fixes * run linting * add descriptions and examples * clean readme formatting
-
James A. Michaelov authored
* add turblimp * update general task readme * add normalized accuracy
-
James A. Michaelov authored
* add blimp_nl * add template yaml file
-
James A. Michaelov authored
* add zhoblimp files * correct group name * fix group * add normalized accuracy
-
FranValero97 authored
-
Anri Lombard authored
-
- 08 Aug, 2025 1 commit
-
-
Avelina Asada Hadji-Kyriacou authored
* Update afridiacritics_yaml * Update afrisenti * Update nollysenti * Update ntrex * Update salt
-
- 04 Aug, 2025 4 commits
-
-
parkhs21 authored
* improve include-path precedence handling * test: add task for test * add test for include path precedence handling * Refactor `test_include_path.py` --------- Co-authored-by: Baber <baber@hey.com>
-
Matthias Neumayer authored
The tasks are called by task name only, without the .yaml extension
-
Idan Tene authored
* Update humaneval_64_instruct.yaml: sync doc_to_text with humaneval_instruct.yaml * Update humaneval_instruct.yaml: remove redundant (flawed) spaces * Update README.md * Bump task version
-
Felix Michalak authored
* Update continuation group names to fit README * added changelog to README and switched datasets from hails to cais * added missing new line at end of README
-
- 23 Jul, 2025 2 commits
-
-
Baber Abbasi authored
* remove trust-remote-code * add W605 rule
-
Baber Abbasi authored
* Fix: pin datasets < 4.0 * fix * update type hints in HF * fix hellaswag path
-
- 22 Jul, 2025 2 commits
-
-
Svetlana Karimova authored
* Feat: add LIBRA benchmark * Feat: add dataset filter to LIBRA * Fix: formatting through pre-commit and main tasks README * Fix: resolve conflict * Fix: dataset name to real * Fix: delete unnecessary datasets and correct dependency --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
Geun, Lim authored
* Fix: extended max_gen_toks to 8192 for HRM8K math benchmarks * Increased max_gen_toks to 2048 (matches Appendix B of original paper). * Added Evaluation Settings and Changelog sections. * add some logs --------- Co-authored-by: Baber <baber@hey.com>
-
- 19 Jul, 2025 3 commits
-
-
Baber Abbasi authored
-
James A. Michaelov authored
* add multiblimp * run linter
-
Avelina Asada Hadji-Kyriacou authored
* Update default.yaml
-
- 18 Jul, 2025 1 commit
-
-
Idan Tene authored
* Update utils.py
-
- 16 Jul, 2025 1 commit
-
-
philipdoldo authored
* Removed the "Let's think step by step." text from the start of the target entry in each sample, so that the phrase is not repeated twice in the few-shot prompts and the behavior matches the original bbh repository. This applied to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml. For boolean_expressions.yaml, in my opinion there is an error: the prompt does not include the 'Remember that (i) ...' text after the final 'A: Let's think step by step.' Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying the few-shot prompts), but I think it really should have been part of the prompt, much like 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. However, the original bbh repo also has this issue, so I think it is fine to keep it this way for consistency; I just wanted to point it out. * feat: remove extra space from answers; add changelog --------- Co-authored-by: Baber <baber@hey.com>
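The duplicated-phrase problem described in this commit can be illustrated with a minimal sketch. It assumes a prompt template that already appends the cue after "A:", so a target that also starts with the cue would repeat it. The helper names `strip_cot_prefix` and `build_fewshot_prompt` and the sample layout are hypothetical, chosen for illustration; they are not taken from the bbh task files.

```python
# Hypothetical illustration: the template adds the cue itself, so targets
# must not start with it, otherwise the cue appears twice per example.
PREFIX = "Let's think step by step."


def strip_cot_prefix(target: str, prefix: str = PREFIX) -> str:
    """Remove a leading chain-of-thought cue from a few-shot target, if present."""
    stripped = target.lstrip()
    if stripped.startswith(prefix):
        return stripped[len(prefix):].lstrip()
    return target


def build_fewshot_prompt(samples: list[dict], question: str) -> str:
    """Build a prompt in which the cue appears exactly once per answer slot."""
    parts = [
        f"Q: {s['input']}\nA: {PREFIX} {strip_cot_prefix(s['target'])}"
        for s in samples
    ]
    # The final answer slot ends with the cue, leaving the model to continue.
    parts.append(f"Q: {question}\nA: {PREFIX}")
    return "\n\n".join(parts)


samples = [{"input": "not True is", "target": "Let's think step by step. not True is False."}]
prompt = build_fewshot_prompt(samples, "not False is")
```

With one few-shot sample the cue occurs twice in `prompt`: once in the worked example and once in the final answer slot. Without the stripping step it would occur three times.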
-