- 23 Aug, 2025 1 commit
-
-
Baber Abbasi authored
* update math_verify * remove normalization * use full solution in `parse` * update version
-
- 22 Aug, 2025 1 commit
-
-
Patrick Haller authored
Co-authored-by: Patrick Haller <phmaker@Patricks-MacBook-Pro.local>
-
- 21 Aug, 2025 6 commits
-
-
James A. Michaelov authored
* add lm_syneval * edit readme * update task readme * formatting fixes * run linting * add descriptions and examples * clean readme formatting
-
James A. Michaelov authored
* add turblimp * update general task readme * add normalized accuracy
-
James A. Michaelov authored
* add blimp_nl * add template yaml file
-
James A. Michaelov authored
* add zhoblimp files * correct group name * fix group * add normalized accuracy
-
FranValero97 authored
-
Anri Lombard authored
-
- 08 Aug, 2025 1 commit
-
-
Avelina Asada Hadji-Kyriacou authored
* Update afridiacritics_yaml * Update afrisenti * Update nollysenti * Update ntrex * Update salt
-
- 04 Aug, 2025 4 commits
-
-
parkhs21 authored
* improve include-path precedence handling * test: add task for test * add test for include path precedence handling * Refactor `test_include_path.py` --------- Co-authored-by: Baber <baber@hey.com>
-
Matthias Neumayer authored
The tasks are called without .yaml, using just the task name
-
Idan Tene authored
* Update humaneval_64_instruct.yaml: sync doc_to_text with humaneval_instruct.yaml * Update humaneval_instruct.yaml: remove redundant (flawed) spaces * Update README.md * Bump task version
-
Felix Michalak authored
* Update continuation group names to fit Readme * added changelog to readme and switched datasets from hails to cais * added missing new line at end of readme
-
- 23 Jul, 2025 2 commits
-
-
Baber Abbasi authored
* remove trust-remote-code * add W605 rule
-
Baber Abbasi authored
* Fix: pin datasets < 4.0 * fix * update type hints in HF * fix hellaswag path
-
- 22 Jul, 2025 2 commits
-
-
Svetlana Karimova authored
* Feat: add LIBRA benchmark * Feat: add dataset filter to LIBRA * Fix: formatting through pre-commit and main tasks README * Fix: resolve conflict * Fix: dataset name to real * Fix: delete unnecessary datasets and correct dependency --------- Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
Geun, Lim authored
* Fix: extended to max_gen_toks 8192 for HRM8K math benchmarks * Increased max_gen_toks to 2048 (matches Appendix B of original paper). * Added Evaluation Settings and Changelog sections. * add some logs --------- Co-authored-by: Baber <baber@hey.com>
-
- 19 Jul, 2025 3 commits
-
-
Baber Abbasi authored
-
James A. Michaelov authored
* add multiblimp * run linter
-
Avelina Asada Hadji-Kyriacou authored
* Update default.yaml
-
- 18 Jul, 2025 1 commit
-
-
Idan Tene authored
* Update utils.py
-
- 16 Jul, 2025 1 commit
-
-
philipdoldo authored
* Removed the 'Let's think step by step.' text from the start of the target entry in each sample, so that the phrase is not repeated twice in the few-shot prompts and the behavior matches the original bbh repository. This applied to 26 of the 27 subtasks; the only exception is boolean_expressions.yaml. In boolean_expressions.yaml, in my opinion, there is an error: the prompt does not include the 'Remember that (i) ...' text after the final 'A: Let's think step by step.'. Models like EleutherAI/gpt-neo-125m seem to always begin answers with this string anyway (copying the few-shot prompts), but I think it should have been part of the prompt, much like how 'A: Let's think step by step.' is included in the prompt for all of the cot tasks. However, the original bbh repo also has this issue, so I think it is fine to keep it this way for consistency; I just thought I'd point it out. * feat: remove extra space from answers; add changelog --------- Co-authored-by: Baber <baber@hey.com>
-
- 14 Jul, 2025 1 commit
-
-
Atou Houdaifa authored
* add egy mmlu hellaswag * add egymmlu egyhellaswag to tasks readme * fix egymmlu config generation * fix _generate_configs formatting
-
- 10 Jul, 2025 1 commit
-
-
Baber Abbasi authored
* check for chat for warning * add test * remove yaml extension from some evalita configs * move unitxt to own test script * fix CI test
-
- 03 Jul, 2025 2 commits
-
-
Baber Abbasi authored
* use double quotes
-
Blanca Calvo authored
* truthfulqa-multi task * truthfulqa-multi with chat few-shot * few shot chat implementation * changed until so it outputs lists * changed dataset location * added MT task * Create README.md * do not include MT * changes for PR * tag change * removed yaml extension * adding task to the table * fix task configs * add import exception --------- Co-authored-by: Baber <baber@hey.com>
-
- 30 Jun, 2025 1 commit
-
-
jinze authored
* Fix: Align the Humaneval dataset with official results. Details: (1) Modified "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file to match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals". (2) Changed r.rfind("```") to r.find("```") so that it locates the first "```" rather than the last one. Results: partially reproduced the official results. LLaMA3.1-8B-Instruct scores 66.5 (the official result is 72.6), and LLaMA3.1-70B-Instruct scores 80.5 (the official result is 80.5). Ref: PR#2650 * add changelog and version * add changelog
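The difference between `r.find` and `r.rfind` mentioned in this commit can be illustrated with a minimal sketch. The helper name `cut_before_fence` and the sample response are hypothetical and are not taken from the repository; the sketch only assumes that the completion is truncated at a code fence, and shows how the two standard `str` methods pick different cut points when a response contains more than one fence.

```python
# Hypothetical illustration; names and sample text are assumptions,
# not the repository's actual code.
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def cut_before_fence(text: str, first: bool = True) -> str:
    """Return the text before the first (or last) fence; whole text if none."""
    idx = text.find(FENCE) if first else text.rfind(FENCE)
    return text if idx == -1 else text[:idx]


# A made-up model response: the solution, a closing fence, then extra chatter
# that contains a second fenced block.
response = (
    "def add(a, b):\n"
    "    return a + b\n"
    + FENCE + "\n"
    "Example:\n"
    + FENCE + "python\n"
    "print(add(1, 2))\n"
    + FENCE
)

# str.find stops at the first fence, keeping only the solution body.
print(repr(cut_before_fence(response, first=True)))

# str.rfind stops at the last fence, so the trailing chatter is kept as well.
print(repr(cut_before_fence(response, first=False)))
```

With `first=True` only the function body survives, whereas `first=False` also keeps the "Example:" text and the second block, which would likely break execution of the extracted code.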
-
- 25 Jun, 2025 1 commit
-
-
Kiersten Stokes authored
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
-
- 20 Jun, 2025 1 commit
-
-
Anna Fontana authored
"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).
-
- 19 Jun, 2025 2 commits
-
-
Maxim Evtush authored
-
Anna Fontana authored
Wrong task name: mmlu_generation doesn't exist -> mmlu_generative is the correct one
-
- 16 Jun, 2025 2 commits
-
-
Baber Abbasi authored
* fix longbench citation
-
fuder.eth authored
* Update README.md * Update utils_mcq.py
-
- 12 Jun, 2025 1 commit
-
-
Kiersten Stokes authored
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
-
- 08 Jun, 2025 1 commit
-
-
Baber Abbasi authored
* use all answers * use middle truncation * maybe fix classification score * strip classification preds * [vllm] remove stop tokens post-hoc * strip all preds * pacify pre-commit * start on truncation utility * add to readme * add a footgun doc * fix newline in yaml templates * do not strip code_sim preds! * fix pre-commit config * fix instruction warning * add note to longbench readme
-
- 03 Jun, 2025 2 commits
-
-
Baber Abbasi authored
-
Baber Abbasi authored
* feat: add mbpp_instruct * fix: update generation_kwargs to use an empty until list * fix: correct predictions formatting in pass_at_1 function * fix: improve code block extraction by checking first without opening backticks * fix mbpp `pass_at_1`
-
- 26 May, 2025 1 commit
-
-
Boda Sadallah authored
* add arab_culture tasks * add target_delimeter and remove debugging code
-
- 21 May, 2025 1 commit
-
-
Hongseok Oh authored
-
- 19 May, 2025 1 commit
-
-
Baber Abbasi authored
* add `sglang-generate` * nit * nit * nit * pacify pre-commit
-