- 04 Apr, 2025 1 commit
-
-
Michele Resta authored
* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* fix: fastest eval for summarization
* chore: linted with ruff
* chore: linted with ruff
---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
-
- 03 Apr, 2025 1 commit
-
-
Lu Fang authored
Signed-off-by: Lu Fang <lufang@fb.com>
-
- 02 Apr, 2025 2 commits
-
-
Baber Abbasi authored
* add subtask scores
* pacify pre-commit
-
Saibo-creator authored
* Add JSON schema benchmark
* Update lm_eval/tasks/jsonschema_bench/metrics.py. Thanks for catching this. Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* run pre-commit
* add description to task catalogue readme
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 01 Apr, 2025 2 commits
-
-
Daniel Holanda authored
-
Baber Abbasi authored
* sync with leaderboard
* also output old metric
* wrap old extraction in try except
* better log
-
- 30 Mar, 2025 1 commit
-
-
Alexandre Marques authored
* llama-style MMLU CoT
* Refactor MMLU CoT template YAML to simplify 'until' structure
* Add GSM8K task configuration for LLaMA3 with few-shot examples
* Fix missing newline at end of MMLU CoT YAML file
* Add ARC-Challenge task configuration and processing utility
* Add additional MMLU and ARC-Challenge task variants to README
* Update README with notes on arc_challenge_llama dataset preprocessing
-
- 29 Mar, 2025 1 commit
-
-
Harsha authored
-
- 28 Mar, 2025 3 commits
-
-
Baber Abbasi authored
-
dazipe authored
* Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048 for MMLU Pro tasks.
* Update lm_eval/tasks/mmlu_pro/_default_template_yaml
* pre-commit
* nit
---------
-
Hadi Abdine authored
* add Darija tasks
* fix multiple groups issue in darijammlu
* add MT to the description of the Darija tasks
* Update README.md nit
* fix the recursion error caused by the darija_summarization task
* use a custom filter instead of the decorator for the strip function
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
-
- 27 Mar, 2025 3 commits
- 26 Mar, 2025 1 commit
-
-
Baber Abbasi authored
-
- 25 Mar, 2025 1 commit
-
-
Alexandre Marques authored
* Multilingual MMLU
* Refactor process_docs function calls for clarity and consistency
-
- 23 Mar, 2025 1 commit
-
-
Bruno Carneiro authored
I haven't had time to review the library that's replacing tj-actions or whether this change breaks anything, but the vulnerability is quite severe and I would rather the functionality be broken than risk compromise. **to do:** review this later
-
- 21 Mar, 2025 2 commits
-
-
Alexandre Marques authored
-
heli-qi authored
* update mmlu_prox configs
* update tasks/README
* correct hyphen to underline in task/README
* update pre-commit codes
-
- 20 Mar, 2025 6 commits
-
-
Alexandre Marques authored
* Update generation_kwargs in default template to include additional end tokens
* Update filter_list in MMLU Pro configuration to use strict_match
* Update _default_template_yaml
-
Baber Abbasi authored
-
Baber Abbasi authored
-
Yifei Zhang authored
-
Kiersten Stokes authored
* Add markdown linter to pre-commit hooks
* Reformat existing markdown (excluding lm_eval/tasks/*.md)
-
Alexandre Marques authored
* Update continuation template YAML for MMLU task with new generation and filtering options
* Refactor filter_list structure in continuation template YAML for improved readability
* Add 'take_first' function to filter_list in continuation template YAML
* Update filter_list in continuation template YAML to use 'strict_match' and modify filtering functions
* Add 'do_sample' option to generation_kwargs in MMLU template YAML
-
- 19 Mar, 2025 2 commits
-
-
Stella Biderman authored
-
Kiersten Stokes authored
-
- 18 Mar, 2025 8 commits
-
-
Jaedong Hwang authored
-
Surya Kasturi authored
* Allow writing config to wandb
* set defaults
* Update help
* Update help
-
Baber Abbasi authored
* add changelog to readme template
* add readme
* add to task list
-
Baber Abbasi authored
* add min_pixels, max_pixels
* fix
-
Baber Abbasi authored
support for long context (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig
-
Jonas Golde authored
* add MastermindEval benchmark
* fill out checklist
-
Santiago Galiano Segura authored
* Add cocoteros_va dataset
* Fix format in cocoteros_va.yml
* Undo newline added
* Execute pre-commit to fix format errors
* Update catalan_bench.yaml version and add Changelog section into Readme.md
-
Baber Abbasi authored
* add __version__
* add version consistency check to publish action
-
- 17 Mar, 2025 3 commits
-
-
Kiersten Stokes authored
* Add support for token-based auth for watsonx models
* Fix lint
* Move dotenv import to inner scope
* Improve readability of _verify_credentials
-
Angelika Romanou authored
* Add INCLUDE tasks
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
-
Avelina9X authored
* Update openllm.yaml to use train fewshot split for arc
-
- 16 Mar, 2025 1 commit
-
-
Baber Abbasi authored
-
- 14 Mar, 2025 1 commit
-
-
Oskar van der Wal authored
* Implementation of Winogender
* Minor fixes README.md
* Add winogender
* Clean winogender utils.py
* Change dataset to one containing all subsets
* Flesh out README for BBQ task
* Add missing tasks for BBQ
* Add simple cooccurrence bias task
* Fix wrong mask for ambiguated context + rename metrics
* Made generate_until evaluation (following PALM paper) default. Also moved separate config files per category to separate metrics using custom function. Created config file for multiple_choice way of evaluating BBQ.
* Add missing version metadata
* Add missing version metadata for bbq multiple choice
* Fix metrics and address edge cases
* Made BBQ multiple choice the default version
* Added settings following winogrande
* Add num_fewshot to simple_cooccurrence_bias
* Fixes for bbq (multiple choice)
* Fix wrong dataset
* CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.
* Use simplest prompt possible without description
* Merge
* BBQ: Fix np.NaN related bug
* BBQ: Fix wrong aggregation method for disamb accuracy
* BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)
* BBQ: fix showing one target in case of few-shot evals
* BBQ: Fix few-shot example for bbq_generate
* BBQ: simplify subtasks
* BBQ: Minimize number of UNK variations to reduce inference time
* BBQ: Add extra UNK keywords for the generate task
* Add a generate_until version of simple_cooccurrence_bias
* Change system/description prompt to include few-shot examples
* Group agg rework
* Run pre-commit
* add tasks to readme table
* remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`
* fix
* fix
* fix version
---------
Co-authored-by: Baber <baber@hey.com>
-