- 09 May, 2025 1 commit
Baber Abbasi authored
- 06 May, 2025 5 commits
Stella Biderman authored
This hasn't been a library for few-shot language model evaluation in quite a while. Let's update the citation to use "the Language Model Evaluation Harness" as the title.
Ihar Hrachyshka authored
This is useful to run unit tests during distro builds.
Anna Fontana authored
* Fix import error for eval_logger in score utils
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
Vladislav Mikhailov authored
* added noreval
* added a checklist for noreval
* run pre-commit
* changed imports and added short noreval description
* fixed norsumm path
* refactored multi-folder tasks
Alexandre Marques authored
- 29 Apr, 2025 1 commit
Baber Abbasi authored
- 18 Apr, 2025 1 commit
Avelina9X authored
* Added softmax_dtype argument to coerce log_softmax computations
* move softmax_dtype
---------
Co-authored-by: Baber <baber@hey.com>
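The addition lets loglikelihood scoring run its log_softmax in a wider dtype than the model weights. A minimal sketch of how the option would be passed through the HF backend's model_args, assuming the key is forwarded to the model wrapper; the checkpoint and task below are placeholders, not taken from this commit:

```python
# Minimal sketch, not from the commit itself: run the model in bfloat16 but ask for
# the log_softmax over logits to be computed in float32 via the new softmax_dtype key.
# Assumes model_args keys are forwarded to the HF model wrapper; checkpoint/task are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,dtype=bfloat16,softmax_dtype=float32",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])
```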
- 16 Apr, 2025 5 commits
Baber Abbasi authored
* add warning for default `until`
* fix stop tokens; add vcsum
* bugfix: fix doc_to_target to string
* fix lsht, trec
* add task to readme
* add debugging logs for multiple input/output
achervyakov authored
Baber Abbasi authored
* switch MMLU to cais/mmlu
* switch back to tj-actions/changed-files
* cache HF folder
Baber Abbasi authored
* fix resolve_hf_chat_template version
* pre-commit
Eldar Kurtic authored
- 15 Apr, 2025 1 commit
Jerry Zhang authored
* Add support for quantization_config
  Summary: Previously quantization_config was ignored, so torchao quantized models were not supported; this PR adds that.
  Test Plan: lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8
* quantization_config is optional
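The test plan above already shows the CLI invocation; a rough Python equivalent through simple_evaluate might look like the sketch below. The call shape is an assumption; the checkpoint name is the one from the test plan.

```python
# Rough Python equivalent of the commit's CLI test plan: with quantization_config no
# longer ignored, a torchao-quantized checkpoint loads through the plain `hf` backend.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=jerryzh168/gemma3-int4wo",  # quantization_config is read from the checkpoint
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
```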
- 14 Apr, 2025 2 commits
Daniele authored
Alexandre Marques authored
* Add support for chat templates defined outside of tokenizer_config.json, as supported by vLLM
* Update template name to avoid conflict with other variable
- 07 Apr, 2025 1 commit
Felipe Maia Polo authored
Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)
* added option --examples
* specifying examples in dictionary
* run pre-commit; fix arg type
* fixing bug when examples==None
* limit or examples must be None in simple_evaluate.py and in evaluator.py
* run pre-commit (fix formatting)
* merge main and run pre-commit (fix formatting)
* Update __main__.py: undefined "limit" and "examples"
* update branch, fix conflicts, run pre-commit
* nits
* change 'examples' to 'samples'
---------
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>
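A minimal sketch of the resulting API, assuming `samples` takes a mapping of task name to document indices (the indices below are arbitrary); per the commit, `limit` must be left unset when it is used:

```python
# Hedged sketch of the new samples argument: score only selected documents of a task
# instead of a contiguous --limit prefix. The task name and indices are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    samples={"hellaswag": [0, 5, 42, 1337]},  # assumed shape: task name -> doc indices
    limit=None,  # limit and samples are mutually exclusive
    batch_size=8,
)
```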
- 04 Apr, 2025 3 commits
Qubitium-ModelCloud authored
* add gsm8k platinum
* only test splits
* wrong dataset
* link to blog
* format
Nikodem Szwast authored
* update authentication methods, add support for deployment_id
* run pre-commit on changed file
Michele Resta authored
* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* fix: fastest eval for summarization
* chore: linted with ruff
---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
- 03 Apr, 2025 1 commit
Lu Fang authored
Signed-off-by: Lu Fang <lufang@fb.com>
- 02 Apr, 2025 2 commits
Baber Abbasi authored
* add subtask scores
* pacify pre-commit
Saibo-creator authored
* Add JSON schema benchmark
* Update lm_eval/tasks/jsonschema_bench/metrics.py (thanks for catching this)
* run pre-commit
* add description to task catalogue readme
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
- 01 Apr, 2025 2 commits
Daniel Holanda authored
Baber Abbasi authored
* sync with leaderboard
* also output old metric
* wrap old extraction in try except
* better log
- 30 Mar, 2025 1 commit
Alexandre Marques authored
* llama-style MMLU CoT
* Refactor MMLU CoT template YAML to simplify 'until' structure
* Add GSM8K task configuration for LLaMA3 with few-shot examples
* Fix missing newline at end of MMLU CoT YAML file
* Add ARC-Challenge task configuration and processing utility
* Add additional MMLU and ARC-Challenge task variants to README
* Update README with notes on arc_challenge_llama dataset preprocessing
- 29 Mar, 2025 1 commit
Harsha authored
- 28 Mar, 2025 3 commits
Baber Abbasi authored
dazipe authored
* Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048 for MMLU Pro tasks
* Update lm_eval/tasks/mmlu_pro/_default_template_yaml
* pre-commit
* nit
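For runs that need different budgets, a hedged sketch of overriding the new defaults follows; it assumes max_length is accepted as a model argument and that gen_kwargs can overwrite the task's max_gen_toks (the checkpoint is a placeholder):

```python
# Hedged sketch: the 8192/2048 values are now the MMLU Pro template defaults, but both
# can still be adjusted per run. Assumes max_length passes through model_args and
# gen_kwargs merges over the task's generation_kwargs; checkpoint is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,max_length=8192",
    tasks=["mmlu_pro"],
    gen_kwargs="max_gen_toks=1024",  # shrink the generation budget for quicker runs
    batch_size=4,
)
```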
Hadi Abdine authored
* add Darija tasks
* fix multiple groups issue in darijammlu
* add MT to the description of the Darija tasks
* Update README.md nit
* fix the recursion error caused by the darija_summarization task
* use a custom filter instead of the decorator for the strip function
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
- 27 Mar, 2025 3 commits
- 26 Mar, 2025 1 commit
Baber Abbasi authored
- 25 Mar, 2025 1 commit
Alexandre Marques authored
* Multilingual MMLU
* Refactor process_docs function calls for clarity and consistency
- 23 Mar, 2025 1 commit
Bruno Carneiro authored
I haven't had time to review the library that's replacing tj-actions or whether this change breaks anything, but the vulnerability is quite severe and I would rather the functionality be broken than risk compromise. **to do:** review this later
- 21 Mar, 2025 2 commits
Alexandre Marques authored
heli-qi authored
* update mmlu_prox configs
* update tasks/README
* correct hyphen to underscore in tasks/README
* update pre-commit codes
- 20 Mar, 2025 2 commits
Alexandre Marques authored
* Update generation_kwargs in default template to include additional end tokens
* Update filter_list in MMLU Pro configuration to use strict_match
* Update _default_template_yaml
Baber Abbasi authored