Commits · 1170ef9eeac321bae0a69b8a1e6c3d1087d40ff6 · gaoqiong / lm-evaluation-harness

01 Dec, 2024 1 commit

Update Unitxt task to use locally installed unitxt and not download Unitxt... · 1170ef9e

Yoav Katz authored Dec 01, 2024


Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface (#2514)

* Moved to require unitxt installation and not download unitxt from HF hub.

This has performance benefits and simplifies the code.
Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated watsonx documentation

* Updated installation instructions

* Removed redundant comman

* Allowed unitxt tasks to generate chat APIs

Modified WatsonXI model to support chat apis

* Removed print

* Run precommit formatting

---------
Signed-off-by: Yoav Katz <katz@il.ibm.com>

1170ef9e

26 Nov, 2024 1 commit

Score tasks (#2452) · 0ef7548d

Rima Shahbazyan authored Nov 26, 2024

* score readme added

* generate until task's "until" parameter's default value fixed.

* score mmlu-pro and agieval added

* changed macro accuracy to micro for agieval

* Always E removed from agi eval

* redundancies removed

* MATH added

* minor cosmetic changes for math

* Licenses added Readme updated

* changes for flake8 + license header on math

* Score added to readme and precommit was run.

* Score added to readme and precommit was run.

* Import error fixed

* math task bugfix
postprocess minor fix

* CR for math added

* math CR

* math task bugfix
postprocess minor fix

CR for math added

* Math cr fixed

* reverting the default "until" parameter change and adjusting score task configs

0ef7548d
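The switch from macro to micro accuracy mentioned in this commit changes how per-subtask results are combined. The following is a minimal sketch of the standard definitions, not the harness's own aggregation code; the subtask names and scores are hypothetical.

```python
from statistics import mean


def micro_accuracy(results):
    """Pool every example across subtasks, then take one overall accuracy."""
    correct = sum(sum(scores) for scores in results.values())
    total = sum(len(scores) for scores in results.values())
    return correct / total


def macro_accuracy(results):
    """Compute accuracy per subtask, then average those accuracies equally."""
    return mean(sum(scores) / len(scores) for scores in results.values())


# Hypothetical per-example correctness (1 = correct, 0 = wrong) for two subtasks
# of very different sizes.
results = {
    "subtask_a": [1, 1, 1, 1, 1, 1, 1, 1],
    "subtask_b": [0, 1],
}

micro = micro_accuracy(results)  # 9 correct out of 10 examples
macro = macro_accuracy(results)  # average of 1.0 and 0.5
```

With unevenly sized subtasks the two numbers diverge: micro accuracy weights every example equally, while macro accuracy weights every subtask equally.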

18 Nov, 2024 2 commits

Add metabench task to LM Evaluation Harness (#2357) · 62b4364d

Kozzy Voudouris authored Nov 18, 2024



* Add metabench (Kipnis et al. 2024)

* Update metabench tasks for full replication of original benchmarks, using publicly available datasets

* Remove unnecessary import

* Add permute versions of each task, where the answer orders are randomly shuffled.

* Add metabench group for easier evaluations

* Fix mmlu counts after removing duplicate

* Add secondary datasets

* Fix f-string error

* Fix f-string error for permute processing

* Add original hash to outputs for easy matching to original results

* Add line break at end of utils files

* Remove extra line from winogrande

* Reformat for linters

* fix multiple input test

* appease pre-commit

* Add metabench to tasks README

* fix multiple input `test_doc_to_text`

---------
Co-authored-by: Baber <baber@hey.com>

62b4364d
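The "permute" task variants described in this commit shuffle the answer order. The sketch below illustrates the general idea, assuming a multiple-choice item with a list of choices and a gold index; the function name `permute_choices` and the seeding scheme are hypothetical and are not taken from the metabench task code.

```python
import random


def permute_choices(choices, answer_index, seed):
    """Shuffle answer options reproducibly and return the gold answer's new index."""
    rng = random.Random(seed)  # local RNG so global random state is untouched
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The gold answer moves to wherever its original index landed in `order`.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


choices = ["Paris", "Rome", "Oslo", "Bern"]
shuffled, gold = permute_choices(choices, answer_index=0, seed=1234)
```

Using a fixed seed per document keeps the shuffled order stable across runs, which matters when results need to be matched back to the original items.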

remove duplicate `arc_ca` (#2499) · 8222ad0a
Baber Abbasi authored Nov 18, 2024

8222ad0a

16 Nov, 2024 1 commit

kbl-v0.1.1 (#2493) · cbc31eb8

Wonseok Hwang authored Nov 17, 2024

* release kbl-v0.1

* fix linting

* remove rag tasks as doc_to_text functions cause trouble

* remove remaining rag tasks

* remove unnecessary repeat in yaml files and rag dataset in hf-hub

* remove unnecessary newline; introduce cfg files in lbox/kbl in hf

* Make task yaml files consistent to hf-datasets-config

* Make task yaml files consistent to hf-datasets-config

* Remove trailing empty space in doc-to-text

* Remove unnecessary yaml file

* Fix task naming error

* trailing space removed

cbc31eb8

11 Nov, 2024 1 commit

Fix chat template; fix leaderboard math (#2475) · 77c811ea

Baber Abbasi authored Nov 11, 2024

* batch commit

* Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* Chat template fix (#7)

* cleanup

* cleanup

* cleanup

* linting

* fix tests

* add ifeval install to new_task CI

* Revert "add ifeval install to new_task CI"

This reverts commit 1d19449bb7fbfa05d51e7cd20950475eae533bf1.

* adds leaderboard tasks (#1)

* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix math parser

* fix math parser

* fix version

* add warning about chat template

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.co>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib19@gmail.com>

77c811ea

09 Nov, 2024 1 commit
- Ifeval: Download `punkt_tab` on rank 0 (#2267) · bd80a6c0
  Baber Abbasi authored Nov 09, 2024
```
* download nltk `punkt_tab` on LOCAL_RANK=0

* remove print

* remove `time`

* nit
```
  bd80a6c0
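This commit restricts the NLTK `punkt_tab` download to the process with `LOCAL_RANK=0`. A minimal sketch of that kind of guard follows; the helper names `run_on_local_rank_zero` and `download_punkt_tab` are hypothetical, and the sketch does not show how the other ranks wait for the download to finish, which the harness may handle differently.

```python
import os


def run_on_local_rank_zero(action):
    """Call action() only when LOCAL_RANK is 0 or unset; return whether it ran."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank == 0:
        action()
        return True
    return False


def download_punkt_tab():
    """Fetch the NLTK tokenizer data (assumes an nltk release that ships punkt_tab)."""
    import nltk  # imported lazily so the guard itself needs only the standard library

    nltk.download("punkt_tab")
```

A launcher such as `torchrun` sets `LOCAL_RANK` for each worker, so calling `run_on_local_rank_zero(download_punkt_tab)` at startup would trigger one download per node rather than one per process.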
07 Nov, 2024 1 commit
- use global filter (#2461) · cd18cb3b
  Baber Abbasi authored Nov 07, 2024
  
  cd18cb3b
05 Nov, 2024 2 commits

Add Japanese Leaderboard (#2439) · 26f607f5

mtkachenko authored Nov 05, 2024

* add jaqket_v2 and jcommonsenseqa

* remove comments

* remove num_beams as it is incompatible with vllm

* add jnli + refactor

* rename jnla -> jnli

* add jsquad + replace colon chars with the Japanese unicode

* ignore whitespaces in generation tasks

* add marc_ja

* add xwinograd + simplify other yamls

* add mgsm and xlsum

* refactor xlsum

* add ja_leaderboard tag

* edit README.md

* update README.md

* add credit + minor changes

* run ruff format

* address review comments + add group

* remove aggregate_metric_list

* remove tags

* update tasks/README.md

26f607f5

Modify label errors in catcola and paws-x (#2434) · fb2e4b59

zxcvuser authored Nov 05, 2024



* Modify label errors in catcola and paws

* Update version to 1.0 in pawsx_template_yaml

* add changelog

---------
Co-authored-by: Baber <baber@hey.com>

fb2e4b59

01 Nov, 2024 1 commit
- Add missing task links (#2449) · ade1cc4e
  Sypherd authored Nov 01, 2024
  
  ade1cc4e
30 Oct, 2024 1 commit
- Add xquad task (#2435) · b40a20ae
  zxcvuser authored Oct 30, 2024
```
* Add xquad task

* Update general README

* Run pre-commit
```
  b40a20ae
22 Oct, 2024 1 commit

Update prompt (#2421) · 389347ee

Iker García-Ferrero authored Oct 22, 2024

Update prompt according to: 
https://github.com/ikergarcia1996/NoticIA/blob/main/prompts.py

389347ee

20 Oct, 2024 1 commit
- fix storycloze datanames (#2409) · 9b052fdc
  Yuxian Gu authored Oct 20, 2024
  
  9b052fdc
17 Oct, 2024 2 commits
- Fix: Turkish MMLU Regex Pattern (#2393) · c1d8795d
  Arda authored Oct 17, 2024
```
* Fix Regex Pattern for CoT experiments

---------
```
  c1d8795d
- group to tag for minerva_math (#2404) · 624017b7
  Ranger authored Oct 17, 2024
```
I found this bug by comparing the code of hendrycks_math and minerva_math.
```
  624017b7
16 Oct, 2024 1 commit

Add new tasks to spanish_bench and fix duplicates (#2390) · 7ecee2bc

zxcvuser authored Oct 17, 2024



* added tasks to spanish_bench

* fixed capitalization in escola and run pre-commit

* Update _flores_common_yaml

* Update _flores_common_yaml

* Update direct_yaml

* Update cot_yaml

* Update cot_yaml

* Update _flores_common_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

7ecee2bc

14 Oct, 2024 1 commit

Add Unitxt Multimodality Support (#2364) · 7785577c

Elron Bandel authored Oct 14, 2024



* Add Unitxt Multimodality Support
Signed-off-by: elronbandel <elronbandel@gmail.com>

* Update
Signed-off-by: elronbandel <elronbandel@gmail.com>

* Fix formatting
Signed-off-by: elronbandel <elronbandel@gmail.com>

---------
Signed-off-by: elronbandel <elronbandel@gmail.com>

7785577c

08 Oct, 2024 1 commit
- Fix Llava-1.5-hf ; Update to version 0.4.5 (#2388) · 2576a8cb
  Hailey Schoelkopf authored Oct 08, 2024
  
  2576a8cb
07 Oct, 2024 3 commits
- LingOly - Fixing scoring bugs for smaller models (#2376) · fe3040f1
  am-bean authored Oct 07, 2024
```
* Fixing scoring bugs for smaller models

* Catching another error type in parsing
```
  fe3040f1
- Solution for CSAT-QA tasks evaluation (#2385) · 8f619361
  kyujinHan authored Oct 07, 2024
  
  8f619361
- Hotfix! (#2383) · bfdcdbe0
  Baber Abbasi authored Oct 07, 2024
```
* bugfix

* pre-commit
```
  bfdcdbe0
04 Oct, 2024 3 commits

fix tests (#2380) · 5e0bc289
Baber Abbasi authored Oct 04, 2024

5e0bc289

Add new benchmark: Catalan bench (#2154) · cb069004

zxcvuser authored Oct 04, 2024



* Add catalan_bench

* added flores_ca.yaml

* Updated some task groupings and readme

* Fix create_yamls_flores_ca.py

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cb069004

Add new benchmark: Basque bench (#2153) · c887796d

zxcvuser authored Oct 04, 2024



* Add basque_bench

* Add flores_eu group

* Update _flores_common_yaml

* Run linters, updated flores, mgsm, copa, and readme

* Apply suggestions from code review
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

c887796d

03 Oct, 2024 2 commits

Add new benchmark: Galician bench (#2155) · 0e763862

zxcvuser authored Oct 03, 2024

* Add galician_bench

* Update xnli_gl path

* Add flores_gl group

* Update _flores_common_yaml

* Updated some task groupings and readme

---------

0e763862

Add new benchmark: Spanish bench (#2157) · ea17b98e

zxcvuser authored Oct 03, 2024

* Add spanish_bench

* Add flores_es group

* Update _flores_common_yaml

* Delete lm_eval/tasks/spanish_bench/escola.yaml

* Delete escola from spanish_bench.yaml

* Delete escola from README.md

* pre-commit run --all-files

* Updated some task groupings and readme

---------

ea17b98e

30 Sep, 2024 2 commits
- Fix missing key in custom task loading. (#2304) · 15ffb0da
  Giulio Lovisotto authored Sep 30, 2024
  
  15ffb0da
- Add new benchmark: Portuguese bench (#2156) · caa7c409
  zxcvuser authored Sep 30, 2024
```
* Add portuguese_bench

* Add flores_pt group

* Update _flores_common_yaml

* Run linters and update flores and readme
```
  caa7c409
28 Sep, 2024 1 commit

fix some bugs of mmlu (#2299) · 5a48ca27

eyuansu62 authored Sep 28, 2024



* fix some bugs of mmlu

* Fix end of file newline issue

---------
Co-authored-by: eyuansu62 <772468951@qq.com>

5a48ca27

26 Sep, 2024 7 commits

add mmlu readme (#2282) · 00f5537a
Baber Abbasi authored Sep 27, 2024

00f5537a

Added TurkishMMLU to LM Evaluation Harness (#2283) · deb43287

Arda authored Sep 26, 2024



* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

deb43287

mmlu-pro: add newlines to task descriptions (not leaderboard) (#2334) · 558d0d71

Baber Abbasi authored Sep 27, 2024



* add newlines to task descriptions; increment versions

* fix task tests (with groups)

* Apply suggestions from code review

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

558d0d71

change glianorex to test split (#2332) · 7d242381

Baber Abbasi authored Sep 26, 2024

* change glianorex to test set

* nit

* fix test; doc_to_target can be str for multiple_choice

* nit

7d242381

change group to tags in task `eus_exams` task configs (#2320) · af92448e
Baber Abbasi authored Sep 26, 2024

af92448e

Treat tags in python tasks the same as yaml tasks (#2288) · b2bf7bc4

Giulio Lovisotto authored Sep 26, 2024

* Treat python tasks same as yaml tasks.

* Add tests.

* Re-add fixture decorators.

* Fix typing specification error for Python 3.9.

b2bf7bc4

load metric with `evaluate` (#2351) · f378f306
Baber Abbasi authored Sep 26, 2024

f378f306

24 Sep, 2024 1 commit
- add a note for missing dependencies (#2336) · bc50a9aa
  Eldar Kurtic authored Sep 24, 2024
  
  bc50a9aa
13 Sep, 2024 1 commit

Multimodal prototyping (#2243) · fb963f0f

Lintang Sutawika authored Sep 13, 2024



* add WIP hf vlm class

* add doc_to_image

* add mmmu tasks

* fix merge conflicts

* add lintang's changes to hf_vlms.py

* fix doc_to_image

* added yaml_path for config-loading

* revert

* add line to process str type v

* update

* modeling cleanup

* add aggregation for mmmu

* rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)

* implemented doc_to_image

* update doc_to_image to accept list of features

* update functions

* readd image processed

* update args process

* bugfix for repeated images fed to model

* push WIP loglikelihood code

* commit most recent code (generative ; qwen2-vl testing)

* preliminary image_token_id handling

* small mmmu update: some qs have >4 mcqa options

* push updated modeling code

* use processor.apply_chat_template

* add mathvista draft

* nit

* nit

* ensure no footguns in text<>multimodal LM<>task incompatibility

* add notification to readme regarding launch of prototype!

* fix compatibility check

* reorganize mmmu configs

* chat_template=None

* add interleave chat_template

* add condition

* add max_images; interleave=true

* nit

* testmini_mcq

* nit

* pass image string; convert img

* add vllm

* add init

* vlm add multi attr

* fixup

* pass max images to vllm model init

* nit

* encoding to device

* fix HFMultimodalLM.chat_template ?

* add mmmu readme

* remove erroneous prints

* use HFMultimodalLM.chat_template ; restore tasks/__init__.py

* add docstring for replace_placeholders in utils

* fix `replace_placeholders`; set image_string=None

* fix typo

* cleanup + fix merge conflicts

* update MMMU readme

* del mathvista

* add some sample scores

* Update README.md

* add log msg for image_string value

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

fb963f0f

10 Sep, 2024 1 commit

Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232) · decc533d

Malikeh Ehghaghi authored Sep 10, 2024



* arabic leaderboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq is added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light versions of arabic leaderboard benchmarks are added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------
Co-authored-by: shahrzads <sayehban@ualberta.ca>
Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

decc533d