- 03 Mar, 2025 3 commits

Baber authored

Harsh Kohli authored
* Fix failing tests
* Resolved merge conflicts
* pre-commit
Co-authored-by: Baber <baber@hey.com>

Jinwei authored
* initial components to support sglang
* init of class SGLangLM
* draft for generate_until of SGLang model
* mock loglikelihood
* initial loglikelihood_tokens
* todo: fix bug of sglang engine init
* implement generation tasks and test
* support output type loglikelihood and loglikelihood_rolling (#1)
* loglikelihood_rolling
* support dp_size>1
* fix typo
* add tests and clean code
* skip tests of sglang for now
* fix OOM error of sglang pytest
* finish test for sglang
* add sglang to readme
* fix OOM of tests and clean SGLang model
* clean pyproject and add tests for evaluator
* add accuracy tests and it passed locally
* add notes for test
* Update README.md
* pre-commit
* add OOM guideline for sglang and fix readme error
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

- 27 Feb, 2025 1 commit

Baber Abbasi authored
* remove ray.remote resources
* remove kobest tag (registered as group)

- 26 Feb, 2025 1 commit

Baber Abbasi authored

- 25 Feb, 2025 4 commits

Jinwei authored
* initial components to support sglang
* init of class SGLangLM
* draft for generate_until of SGLang model
* mock loglikelihood
* initial loglikelihood_tokens
* todo: fix bug of sglang engine init
* implement generation tasks and test
* support output type loglikelihood and loglikelihood_rolling (#1)
* loglikelihood_rolling
* support dp_size>1
* fix typo
* add tests and clean code
* skip tests of sglang for now
* fix OOM error of sglang pytest
* finish test for sglang
* add sglang to readme
* fix OOM of tests and clean SGLang model
* clean pyproject and add tests for evaluator
* add accuracy tests and it passed locally
* add notes for test
* Update README.md
* pre-commit
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

Minho Ryu authored
* add humaneval+ and mbpp+
* add newline at end of file

Kailashbuki authored
* Fix the import source for eval_logger
* fix logging
Co-authored-by: Baber <baber@hey.com>

Santiago Galiano Segura authored
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>

- 24 Feb, 2025 3 commits

Naiara Perez authored
* add Basque translation of ARC and PAWS to BasqueBench
* pre-commit
Co-authored-by: Baber <baber@hey.com>

Jocelyn authored
* add o3-mini support
* fix linter tests

Naiara Perez authored
Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in corresponding READMEs (#2729)

- 23 Feb, 2025 1 commit

Baber Abbasi authored

- 21 Feb, 2025 3 commits

Farhan Ahmed authored

Lintang Sutawika authored
* changed source of eval_logger
* allow eval_logger to be set from args
* removed verbosity arg from non-main methods
* fix logging
* pre-commit
* set verbosity in eval logger
* replace utils.eval_logger
* fix logging in main
* add logging to docs
* add logging message
* nit
* refactor setup_logging to utils
Co-authored-by: Baber <baber@hey.com>

Baber Abbasi authored
* add math_verify to minerva math
* add math_verify to benchmark
* fix error
* increment version

- 17 Feb, 2025 1 commit

Baber Abbasi authored
* fix vllm
* fix data_parallel
* copy to multimodal

- 14 Feb, 2025 4 commits

Baber Abbasi authored
* set target delimiter to empty string
* nit
* add warning

Baber Abbasi authored

Irina Proskurina authored

Kiersten Stokes authored

- 13 Feb, 2025 1 commit

James A. Michaelov authored

- 12 Feb, 2025 2 commits

achervyakov authored

Kiersten Stokes authored

- 11 Feb, 2025 2 commits

Baber Abbasi authored

Michele Resta authored
* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add six prompts for each multiple-choice task
* feat: modified fewshot split for textual entailment task
* fix: new doc_to_target function for NER tasks
* Update prompt
* Add partitions for few-shot evaluation
* Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml
* Enhance lexical substitution management: improve scorer calculation, update model output postprocessing, add support for few-shot relation extraction task
* Add F1-macro measure to evaluate document dating
* Use the whole dataset
* Add the two prompts for the task of lexical substitution
* Add few-shot split configuration
* Add function for handling few-shot learning setup
* Fix prompt
* Remove configuration file
* Update dataset from test_same to test_cross for evaluations
* Remove whitespace at end of prompt
* Fix configuration error: corrected parameter name for the dataset used in few-shot
* Fix: check if results is not empty before processing in lexical substitution task
* Added the prompts and functions for correct NER and RE execution
* Add accuracy measure
* Add tasks for the EVALITA-LLM benchmark evaluation
* Add the alias of the task name that will be printed in the final results table
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task
* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks
* fix: add information on Evalita-LLM for PR
* fix: rename folders and files
* fix: remove unused imports
* chore: run pre-commit
* chore: add task description
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

- 07 Feb, 2025 3 commits

Baber Abbasi authored

omahs authored
* fix typos

Arda authored
* Added TurkishMMLU to LM Evaluation Harness
* Fixed CoT name
* Updated README
* Fixed test issues
* Completed scan for changed tasks
* fixup task naming casing + ensure yaml template stubs aren't registered
* Fix regex pattern for CoT experiments
* Fixed multiple choice accuracy
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

- 06 Feb, 2025 1 commit

Baber Abbasi authored

- 31 Jan, 2025 1 commit

asgsaeid authored
* mmlu-pro-plus is implemented
* Update README.md with new task: MMLU Pro Plus
* pre-commit
* nit
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>

- 29 Jan, 2025 4 commits

Irina Proskurina authored
* Add Histoires Morales task
* Histoires Morales task: fix mixed line endings
* Remove tag for a single task
* Add some MT for Histoires Morales

Baber Abbasi authored
* remove group from task configs
* add tags
* update readme

Baber authored

Baber authored

- 28 Jan, 2025 5 commits

Baber Abbasi authored
* nit
* update pre-commit

Seungwoo Ryu authored
Co-authored-by: Baber <baber@hey.com>

Baber Abbasi authored
* feat: drop Python 3.8 support
* feat: drop Python 3.8 tests
* pre-commit
* handle chat_template for multiple inputs

Nicky Pochinkov authored
* add TransformerLens example. Many people use TransformerLens to do interpretability and interventions on models, and then need to evaluate the model. This is a simple script that lets one pass in a TransformerLens model and run evaluations on it.
* Ran pre-commit checks

Irina Proskurina authored
* Add moral stories task
* Create README.md
* Update README.md
* Update line endings in moral_stories files