Commits · 2f403fa0b74bbe6bc78efff88d2d439896ef2bb1 · gaoqiong / lm-evaluation-harness

24 Feb, 2025 2 commits

add Basque translation of ARC and PAWS to BasqueBench (#2732) · 2f403fa0

Naiara Perez authored Feb 25, 2025



* add Basque translation of ARC and PAWS to BasqueBench

* pre-commit

---------
Co-authored-by: Baber <baber@hey.com>

2f403fa0

Added IberoBench citation info... · a9a0e3ca

Naiara Perez authored Feb 24, 2025

Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in corresponding READMEs (#2729)

a9a0e3ca

23 Feb, 2025 1 commit
- remove unused import (#2728) · 5e0b6f16
  Baber Abbasi authored Feb 23, 2025
  
  5e0b6f16
21 Feb, 2025 3 commits

fix missing dataset repo (#2719) · 0bf9f4ea
Farhan Ahmed authored Feb 21, 2025

0bf9f4ea

Logging (#2203) · 1ba35e62

Lintang Sutawika authored Feb 20, 2025



* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------
Co-authored-by: Baber <baber@hey.com>

1ba35e62

add math_verify to some tasks (#2686) · 358adaf7

Baber Abbasi authored Feb 21, 2025

* add math_verify to minerva math

* add math_verify to benchmark

* fix error

* increment version

358adaf7

14 Feb, 2025 2 commits
- `arithmetic`: set target delimiter to empty string (#2701) · 41b952f3
  Baber Abbasi authored Feb 14, 2025

  * set target delimiter to empty string

  * nit

  * add warning

  41b952f3
- fix `construct_requests` kwargs (#2700) · 5a5acc08
  Baber Abbasi authored Feb 14, 2025
  
  5a5acc08
13 Feb, 2025 1 commit
- set aggregation and higher_is_better (instead of falling back on defaults) (#2692) · c3c05b06
  James A. Michaelov authored Feb 13, 2025
  
  c3c05b06
12 Feb, 2025 1 commit
- Update unitxt task.py to bring in line with recent repo changes (#2684) · 8751fb35
  Kiersten Stokes authored Feb 12, 2025
  
  8751fb35
11 Feb, 2025 2 commits

Delete lm_eval/tasks/evalita_llm/single_prompt.zip (#2687) · 684fd2dd
Baber Abbasi authored Feb 11, 2025

684fd2dd

Adding the Evalita-LLM benchmark (#2681) · b7fccef5

Michele Resta authored Feb 11, 2025



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final table results.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

b7fccef5

07 Feb, 2025 1 commit

Turkish mmlu Config Update (#2678) · 53504a96

Arda authored Feb 07, 2025



* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

* Fix Regex Pattern for CoT experiments

* Fixed multiple choice accuracy

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

53504a96

31 Jan, 2025 1 commit

MMLU Pro Plus (#2366) · 0bb8406f

asgsaeid authored Jan 31, 2025



* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>

0bb8406f

29 Jan, 2025 2 commits

Add Histoires Morales task (#2662) · 1208afd3

Irina Proskurina authored Jan 29, 2025

* Add Histoires Morales task

* Histoires Morales task: fix mixed line endings

* Histoires Morales task: fix mixed line endings

* Remove tag for a single task

* Add some MT for Histoires Morales

1208afd3

remove `group` from bigbench task configs (#2663) · fe9c5707
Baber Abbasi authored Jan 29, 2025

* remove group from task configs

* add tags

* update readme

fe9c5707

28 Jan, 2025 3 commits
- update pre-commit (#2660) · 4b4b0363
  Baber Abbasi authored Jan 28, 2025

  * nit

  * update pre-commit

  4b4b0363
- Add Aggregation for Kobest Benchmark (#2446) · 94344a61
  Seungwoo Ryu authored Jan 29, 2025

  Co-authored-by: Baber <baber@hey.com>

  94344a61
- Add Moral Stories (#2653) · a0466f01
  Irina Proskurina authored Jan 28, 2025

  * Add moral stories task

  * Add moral stories task

  * Create README.md

  * Update README.md

  * Update line endings in moral_stories files

  a0466f01
24 Jan, 2025 1 commit
- separate category for `global_mmlu` (#2652) · 5c006ed4
  Minho Ryu authored Jan 25, 2025

  * separate category

  * set version 0.0

  * apply precommit

  5c006ed4
21 Jan, 2025 2 commits
- aggregate by group (total and categories) (#2643) · b2c090cc
  Minho Ryu authored Jan 22, 2025
  
  b2c090cc
- revise mbpp prompt (#2645) · ed9c6fc8
  Minho Ryu authored Jan 22, 2025
  
  ed9c6fc8
20 Jan, 2025 6 commits

fixed mmlu generative response extraction (#2503) · 12b6eeb5

Ramiro R. C. authored Jan 20, 2025



* fixed mmlu generative response extraction

* updated file version | added args to exact_match

* fix

* fix

* pre-commit

* fix groups

---------
Co-authored-by: Baber <baber@hey.com>

12b6eeb5

fix tmlu tmlu_taiwan_specific_tasks tag (#2420) · 88144079
nike00811 authored Jan 21, 2025

88144079

Update KorMedMCQA: ver 2.0 (#2540) · ff2c49ff

Gyouk Chu authored Jan 21, 2025

* Update KorMedMCQA: ver 2.0

* Fix pre-commit formatting issues

* Update KorMedMCQA v2.0

* pre-commit

ff2c49ff

apply precommit (#2636) · 3a4e4674
Minho Ryu authored Jan 21, 2025

3a4e4674

New arabicmmlu (#2541) · 6dac8c69

Boda Sadallah authored Jan 21, 2025

* point to the original ArabicMMLU dataset

* create the new subtasks files

* fix bug when the context field is empty

6dac8c69

add hrm8k benchmark for both Korean and English (#2627) · a5c344cf

Minho Ryu authored Jan 21, 2025



* add hrm8k benchmark for both Korean and English

* apply precommit

* revise tasks so that models do not answer directly; use zeroshot_cot if possible

* add README

* Add hrm8k on the task-list

---------
Co-authored-by: Baber <baber@hey.com>

a5c344cf

19 Jan, 2025 1 commit
- update pre-commit (#2632) · f724be69
  Baber Abbasi authored Jan 19, 2025

  * update pre-commit

  f724be69
17 Jan, 2025 1 commit
- fix gen_prefix (#2630) · 9dda03d6
  Baber Abbasi authored Jan 17, 2025

  * switch arg

  9dda03d6
15 Jan, 2025 4 commits

assistant prefill (#2615) · 703fbffd

Baber Abbasi authored Jan 15, 2025

* add assistant prefix

* add arc_challenge from llama

* nit

* nit

* nit

* add assistant prefix

* add mmlu_llama

* nit

* nit

* Revert "nit"

This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.

* fix regex bug

* add assistant_prefix to vllm

* add `Question:`

* add mmlu_pro

* add fewshot assistant_prefix

* use `assistant_prefill`

* typehints

* nits

* nits

* add to docs

* add readme

703fbffd

Add MLQA (#2622) · e86cece6

Shivansh Pachnanda authored Jan 16, 2025

* Add MLQA
* add mlqa_common_yaml

* add 49 tests of mlqa family

* update tasks/README.md

---------

* fix: mlqa ast error

* nit: removed .yaml ext from template_yaml

* nit changes: minor modifications generate_tasks.py

* deleted lm_eval/tasks/mlqa/mlqa_common_yaml.yaml

* tests updated

* nit

e86cece6

Add MBPP (#2247) · 5db23e2c

Hojin Lee authored Jan 16, 2025



* add mbpp

* fix some bugs

* add README for mbpp

* update README

* nits

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

5db23e2c

Add HumanEval (#1992) · 4c11206b

Hojin Lee authored Jan 16, 2025



* add custom filter

* fix type casting of references

* add humaneval

* fix a bug in humaneval

* add greedy version of humaneval

* update tasks README

* test humaneval

* return multiple metrics

* nit

* add confirmation to run code tasks

* nit

* nit

---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

4c11206b

07 Jan, 2025 1 commit

Fix the format of mgsm zh and ja. (#2587) · bb098f13

Wenyang LUO authored Jan 07, 2025

* Fix the format of mgsm zh and ja.

* Add change log to mgsm.

* Add newline after changelog.

bb098f13

04 Jan, 2025 1 commit

some minor logging nits (#2609) · 888ac292

Baber Abbasi authored Jan 04, 2025

* remove yaml extension from phrases_va_common

* remove yaml extension from winogenerated

* remove yaml extension from phrases_es

* no cache debug logging when not used

888ac292

02 Jan, 2025 1 commit

update scrolls (#2602) · 1044db95

Baber Abbasi authored Jan 02, 2025

* update evaluate; update construct requests

* update construct requests to handle `apply_chat_template` kwarg

1044db95

24 Dec, 2024 1 commit

AraDICE task config file (#2507) · 932e8f9e

Firoj Alam, Scientist, QCRI authored Dec 24, 2024



* added aradice

* Added ArabicMMLU Lev Configs

* added ArabicMMLU egy configs

* Added boolq configs

* Added cultural bench configs

* added openbookqa configs

* Added PiQA configs

* added winogrande configs

* Added truthfulQA configs

* Added aradice group config

* Remove deleted files from repository

* modified arabicmmlu configs

* modified metadata versions

* fixed formatting using ruff

* added aradice tasks information

* pre-commit

* Updated openbookqa utils

* fixed formatting on obqa

---------
Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
Co-authored-by: Baber <baber@hey.com>

932e8f9e

19 Dec, 2024 1 commit

Add Global MMLU Lite (#2567) · 2b75b110

shivalika-singh authored Dec 19, 2024



* add global mmlu lite

* add global mmlu lite

* fix bugs

* add task README.md

* Update README.md

* Update tasks README.md

* Update README.md

* update readme

---------
Co-authored-by: shivi <shivalikasingh95@gmail.com>

2b75b110

16 Dec, 2024 1 commit

Adding new subtask to SCORE tasks: non greedy robustness (#2558) · 976d8a0b

Rima Shahbazyan authored Dec 16, 2024

* score readme added

* generate until task's "until" parameter's default value fixed.

* score mmlu-pro and agieval added

* changed macro accuracy to micro for agieval

* Always E removed from agi eval

* redundancies removed

* MATH added

* minor cosmetic changes for math

* Licenses added Readme updated

* changes for flake8 + license header on math

* Score added to readme and precommit was run.

* Score added to readme and precommit was run.

* Import error fixed

* math task bugfix
postprocess minor fix

* CR for math added

* math CR

* math task bugfix
postprocess minor fix

CR for math added

* Math cr fixed

* mmlu_pro non_greedy task added

* non greedy summarizer added

* Non greedy for all score tasks

* Bugfixes for non-greedy

* fixing the until argument

* undoing the change to "until" arguments default behaviour

* minor fix in summarizer

* log naming changes for better readability

* math subtasks naming fix

* agieval subtask naming fix

* logging added for debugging

* path issue fixed

* minor fix

* path fix

* path fix

* non_greedy_math minor fix

* final changes

* changed readme for non-greedy
added Nvidia header
added example script for non_greedy
changed prompts to match that of trt runs

* non greedy summarizer bugfix

* non_greedy summarizer fixed

976d8a0b