Commits · v0.4.5 · gaoqiong / lm-evaluation-harness

08 Oct, 2024 4 commits

Bump version to v0.4.5 (#2389) · 0845b588
Hailey Schoelkopf authored Oct 08, 2024

0845b588
Fix Llava-1.5-hf ; Update to version 0.4.5 (#2388) · 2576a8cb
Hailey Schoelkopf authored Oct 08, 2024

2576a8cb

max_images are passed on to vllms `limit_mm_per_prompt` (#2387) · 1ed1f9ed

Baber Abbasi authored Oct 09, 2024

* max_images are passed on to vllms `limit_mm_per_prompt`

* replace max image placeholders in string

* handle chat_template error

* move `fewshot_random_seed` to global

1ed1f9ed

HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` (#2353) · ab2c46c3

Baber Abbasi authored Oct 09, 2024



* switch conditional checks to `self.backend`

* nit

* nit

* commit feedback

* fix test; update precommit hooks

* add escape hatch for custom self.AUTO_MODEL_CLASS

* add escape hatch for custom self.AUTO_MODEL_CLASS

* fix

* move assertion

* add logging messages

* update AUTO_MODEL_CLASS behavior in _get_backend

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

ab2c46c3

07 Oct, 2024 5 commits

[API] tokenizer: add trust-remote-code (#2372) · 4cec66e4

Baber Abbasi authored Oct 07, 2024



* tokenizer: trust-remote-code

* pre-commit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4cec66e4

Fix float limit override (#2325) · aa457edc

Chenjie Luo authored Oct 07, 2024

* Fix float limit override

See: https://github.com/EleutherAI/lm-evaluation-harness/issues/2324

The float limit is overridden by the previous task's int limit if multiple tasks are triggered together.

This PR fixes this issue.

* Update evaluator.py

* Update evaluator.py

aa457edc
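
The override described in the commit above can be illustrated with a minimal sketch. It assumes that a limit below 1.0 is read as a fraction of a task's documents and that the bug came from reassigning the shared `limit` variable inside the per-task loop. The names `resolve_limit`, `buggy_limits`, and `fixed_limits` are hypothetical and are not taken from `evaluator.py`.

```python
import math
from typing import List, Optional, Union

Limit = Optional[Union[int, float]]


def resolve_limit(num_docs: int, limit: Limit) -> Optional[int]:
    """Hypothetical helper: turn a user-supplied limit into a document count.

    Assumption: a value below 1.0 is a fraction of the task's documents,
    anything else is an absolute count.
    """
    if limit is None:
        return None
    if limit < 1.0:
        return int(math.ceil(num_docs * limit))
    return int(limit)


def buggy_limits(task_sizes: List[int], limit: Limit) -> List[Optional[int]]:
    """Reassigns `limit` in the loop, so later tasks see the first task's int."""
    resolved = []
    for num_docs in task_sizes:
        limit = resolve_limit(num_docs, limit)  # overwrites the original float
        resolved.append(limit)
    return resolved


def fixed_limits(task_sizes: List[int], limit: Limit) -> List[Optional[int]]:
    """Keeps the user-supplied limit untouched and resolves it per task."""
    resolved = []
    for num_docs in task_sizes:
        task_limit = resolve_limit(num_docs, limit)  # separate per-task variable
        resolved.append(task_limit)
    return resolved


sizes = [100, 1000]  # two tasks with different numbers of documents
print(buggy_limits(sizes, 0.5))  # [50, 50]
print(fixed_limits(sizes, 0.5))  # [50, 500]
```

In the buggy variant the second task is capped at 50 documents instead of half of its 1000, which matches the symptom reported in the linked issue.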

LingOly - Fixing scoring bugs for smaller models (#2376) · fe3040f1
am-bean authored Oct 07, 2024

* Fixing scoring bugs for smaller models

* Catching another error type in parsing

fe3040f1
Solution for CSAT-QA tasks evaluation (#2385) · 8f619361
kyujinHan authored Oct 07, 2024

8f619361
Hotfix! (#2383) · bfdcdbe0
Baber Abbasi authored Oct 07, 2024

* bugfix

* pre-commit

bfdcdbe0

04 Oct, 2024 3 commits

fix tests (#2380) · 5e0bc289
Baber Abbasi authored Oct 04, 2024

5e0bc289

Add new benchmark: Catalan bench (#2154) · cb069004

zxcvuser authored Oct 04, 2024



* Add catalan_bench

* added flores_ca.yaml

* Updated some task groupings and readme

* Fix create_yamls_flores_ca.py

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cb069004

Add new benchmark: Basque bench (#2153) · c887796d

zxcvuser authored Oct 04, 2024



* Add basque_bench

* Add flores_eu group

* Update _flores_common_yaml

* Run linters, updated flores, mgsm, copa, and readme

* Apply suggestions from code review
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

c887796d

03 Oct, 2024 2 commits

Add new benchmark: Galician bench (#2155) · 0e763862

zxcvuser authored Oct 03, 2024

* Add galician_bench

* Update xnli_gl path

* Add flores_gl group

* Update _flores_common_yaml

* Updated some task groupings and readme

---------

0e763862

Add new benchmark: Spanish bench (#2157) · ea17b98e

zxcvuser authored Oct 03, 2024

* Add spanish_bench

* Add flores_es group

* Update _flores_common_yaml

* Delete lm_eval/tasks/spanish_bench/escola.yaml

* Delete escola from spanish_bench.yaml

* Delete escola from README.md

* pre-commit run --all-files

* Updated some task groupings and readme

---------

ea17b98e

30 Sep, 2024 2 commits

Fix missing key in custom task loading. (#2304) · 15ffb0da
Giulio Lovisotto authored Sep 30, 2024

15ffb0da

Add new benchmark: Portuguese bench (#2156) · caa7c409

zxcvuser authored Sep 30, 2024

* Add portuguese_bench

* Add flores_pt group

* Update _flores_common_yaml

* Run linters and update flores and readme

caa7c409

28 Sep, 2024 1 commit

fix some bugs of mmlu (#2299) · 5a48ca27

eyuansu62 authored Sep 28, 2024



* fix some bugs of mmlu

* Fix end of file newline issue

---------
Co-authored-by: eyuansu62 <772468951@qq.com>

5a48ca27

26 Sep, 2024 9 commits

openai: better error messages; fix greedy matching (#2327) · 1bc6c933

Baber Abbasi authored Sep 27, 2024



* better error message; fix greedy matching

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

1bc6c933

add mmlu readme (#2282) · 00f5537a
Baber Abbasi authored Sep 27, 2024

00f5537a

Added TurkishMMLU to LM Evaluation Harness (#2283) · deb43287

Arda authored Sep 26, 2024



* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

deb43287

mmlu-pro: add newlines to task descriptions (not leaderboard) (#2334) · 558d0d71

Baber Abbasi authored Sep 27, 2024



* add newlines to task descriptions; increment versions

* fix task tests (with groups)

* Apply suggestions from code review

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

558d0d71

change glianorex to test split (#2332) · 7d242381

Baber Abbasi authored Sep 26, 2024

* change glianorex to test set

* nit

* fix test; doc_to_target can be str for multiple_choice

* nit

7d242381

change group to tags in task `eus_exams` task configs (#2320) · af92448e
Baber Abbasi authored Sep 26, 2024

af92448e

Treat tags in python tasks the same as yaml tasks (#2288) · b2bf7bc4

Giulio Lovisotto authored Sep 26, 2024

* Treat python tasks same as yaml tasks.

* Add tests.

* Re-add fixture decorators.

* Fix typing specification error for Python 3.9.

b2bf7bc4

fix writeout script (#2350) · 72d619ff
Baber Abbasi authored Sep 26, 2024

72d619ff
load metric with `evaluate` (#2351) · f378f306
Baber Abbasi authored Sep 26, 2024

f378f306

24 Sep, 2024 2 commits

add a note for missing dependencies (#2336) · bc50a9aa
Eldar Kurtic authored Sep 24, 2024

bc50a9aa

Fixed dummy model (#2339) · d7734d19
Amine Elhattami authored Sep 24, 2024

d7734d19

18 Sep, 2024 1 commit

Update neuron backend (#2314) · 9a092f37

David Corvoysier authored Sep 18, 2024

* feat(neuron): align with latest optimum-neuron

* feat(neuron): support pre-exported neuron models

* fix(neuron): correctly use max_length

* fix(neuron): adapt loglikelihood

The evaluation of log likelihood was not working for neuron models
using continuous batching, such as all cached neuron LLama models.

* refactor(neuron): remove dead code

9a092f37

17 Sep, 2024 2 commits

repr bug (#2315) · 88ea85b4
Baber Abbasi authored Sep 18, 2024

88ea85b4

Update README.md (#2297) · a5e0adcb

SYusupov authored Sep 17, 2024

* Update README.md

I encountered Git buffer size limits when trying to download the full commit history of the repository, such as:
```
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
error: 5815 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
```

Therefore, installation is faster and there are no errors when I download only the latest version of the repository.

* Fix linting issue

a5e0adcb

13 Sep, 2024 1 commit

Multimodal prototyping (#2243) · fb963f0f

Lintang Sutawika authored Sep 13, 2024



* add WIP hf vlm class

* add doc_to_image

* add mmmu tasks

* fix merge conflicts

* add lintang's changes to hf_vlms.py

* fix doc_to_image

* added yaml_path for config-loading

* revert

* add line to process str type v

* update

* modeling cleanup

* add aggregation for mmmu

* rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)

* implemented doc_to_image

* update doc_to_image to accept list of features

* update functions

* readd image processed

* update args process

* bugfix for repeated images fed to model

* push WIP loglikelihood code

* commit most recent code (generative ; qwen2-vl testing)

* preliminary image_token_id handling

* small mmmu update: some qs have >4 mcqa options

* push updated modeling code

* use processor.apply_chat_template

* add mathvista draft

* nit

* nit

* ensure no footguns in text<>multimodal LM<>task incompatibility

* add notification to readme regarding launch of prototype!

* fix compatibility check

* reorganize mmmu configs

* chat_template=None

* add interleave chat_template

* add condition

* add max_images; interleave=true

* nit

* testmini_mcq

* nit

* pass image string; convert img

* add vllm

* add init

* vlm add multi attr

* fixup

* pass max images to vllm model init

* nit

* encoding to device

* fix HFMultimodalLM.chat_template ?

* add mmmu readme

* remove erroneous prints

* use HFMultimodalLM.chat_template ; restore tasks/__init__.py

* add docstring for replace_placeholders in utils

* fix `replace_placeholders`; set image_string=None

* fix typo

* cleanup + fix merge conflicts

* update MMMU readme

* del mathvista

* add some sample scores

* Update README.md

* add log msg for image_string value

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

fb963f0f

10 Sep, 2024 1 commit

Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232) · decc533d

Malikeh Ehghaghi authored Sep 10, 2024



* arabic leaderboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq is added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks are added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------
Co-authored-by: shahrzads <sayehban@ualberta.ca>
Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

decc533d

05 Sep, 2024 1 commit

Bump version to v0.4.4 ; Fixes to TMMLUplus (#2280) · 543617fe
Hailey Schoelkopf authored Sep 05, 2024

543617fe

04 Sep, 2024 1 commit

Chat Template fix (cont. #2235) (#2269) · 7a1614eb

Baber Abbasi authored Sep 04, 2024



* default chat template method fix

* move chat_template to TemplateLM

* remove hotfix

* handle openai `chat_template`

* Update lm_eval/api/model.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add 'max_tokens' to gen_kwargs

* pre-commit

---------
Co-authored-by: KonradSzafer <szafer.konrad@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

7a1614eb

30 Aug, 2024 2 commits

hotfix #2262 (#2264) · 928e8bb6

Baber Abbasi authored Aug 30, 2024

* max_length - 1 (generation always >= 1)

* vllm: fix rolling prefix_token

* nit: add comment

* fixup! max_length should be handled for logliklihoods

* Revert "fixup! max_length should be handled for logliklihoods"

This reverts commit 432d1a3b754c117c3a54ea2fe792ab3a1bd09ed3.

928e8bb6

API: fix maxlen; vllm: prefix_token_id bug (#2262) · b31f92e8

Baber Abbasi authored Aug 30, 2024

* max_length - 1 (generation always >= 1)

* vllm: fix rolling prefix_token

* nit: add comment

* fixup! max_length should be handled for logliklihoods

b31f92e8

28 Aug, 2024 3 commits

Fix `loglikelihood_rolling` caching ( #1821 ) (#2187) · 8138fd52

Hailey Schoelkopf authored Aug 28, 2024

* fix revision type

* allow for None-input loglikelihood reqs to be cached

* handle no remaining cache items

* pre-commit

* change cache_hook.add_partial(loglikelihood_rolling...) convention

---------
Co-authored-by: Baber Abbasi <baber@eleuther.ai>

8138fd52

update nltk version to require 3.9.1 (#2259) · 2de3688f
Hailey Schoelkopf authored Aug 28, 2024

2de3688f

[Draft] More descriptive `simple_evaluate()` LM TypeError (#2258) · 40010ec1

Hailey Schoelkopf authored Aug 28, 2024

* Update evaluator.py

* update error msg

40010ec1