Commits · parentheses · gaoqiong / lm-evaluation-harness

18 Sep, 2024 2 commits
- Update _default_template_yaml · f2a9b4c4
  Stella Biderman authored Sep 18, 2024
  
  f2a9b4c4
- Update _default_template_yaml · cc723f2b
  Stella Biderman authored Sep 18, 2024
  
  cc723f2b
17 Sep, 2024 2 commits

repr bug (#2315) · 88ea85b4
Baber Abbasi authored Sep 18, 2024

88ea85b4

SYusupov authored Sep 17, 2024

* Update README.md

I encountered Git buffer size limits when trying to download the full commit history of the repository, such as:
```
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
error: 5815 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
```

Downloading only the latest version of the repository makes the installation faster and avoids these errors.

* Fix linting issue

a5e0adcb
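
The README change above recommends downloading only the latest version of the repository instead of its full history. A minimal sketch of that idea, assuming it corresponds to a shallow clone via Git's `--depth` option; the helper names `shallow_clone_cmd` and `shallow_clone` are hypothetical, and the URL in the usage comment is an assumption, not taken from this page:

```python
import subprocess
from typing import List, Optional


def shallow_clone_cmd(repo_url: str, dest: Optional[str] = None, depth: int = 1) -> List[str]:
    """Build the argument list for a shallow `git clone` (hypothetical helper)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # --depth N asks Git to fetch only the most recent N commits.
    cmd = ["git", "clone", "--depth", str(depth), repo_url]
    if dest is not None:
        cmd.append(dest)
    return cmd


def shallow_clone(repo_url: str, dest: Optional[str] = None, depth: int = 1) -> None:
    """Run the shallow clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(shallow_clone_cmd(repo_url, dest, depth), check=True)


# Usage (assumed upstream URL; requires git and network access):
# shallow_clone("https://github.com/EleutherAI/lm-evaluation-harness")
```

Because only the latest commit is transferred, the download is smaller, which is presumably why the large-transfer errors quoted in the commit message no longer occur.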

13 Sep, 2024 1 commit

Multimodal prototyping (#2243) · fb963f0f

Lintang Sutawika authored Sep 13, 2024



* add WIP hf vlm class

* add doc_to_image

* add mmmu tasks

* fix merge conflicts

* add lintang's changes to hf_vlms.py

* fix doc_to_image

* added yaml_path for config-loading

* revert

* add line to process str type v

* update

* modeling cleanup

* add aggregation for mmmu

* rewrite MMMU processing code based on only MMMU authors' repo (doc_to_image still WIP)

* implemented doc_to_image

* update doc_to_image to accept list of features

* update functions

* readd image processed

* update args process

* bugfix for repeated images fed to model

* push WIP loglikelihood code

* commit most recent code (generative ; qwen2-vl testing)

* preliminary image_token_id handling

* small mmmu update: some qs have >4 mcqa options

* push updated modeling code

* use processor.apply_chat_template

* add mathvista draft

* nit

* nit

* ensure no footguns in text<>multimodal LM<>task incompatibility

* add notification to readme regarding launch of prototype!

* fix compatibility check

* reorganize mmmu configs

* chat_template=None

* add interleave chat_template

* add condition

* add max_images; interleave=true

* nit

* testmini_mcq

* nit

* pass image string; convert img

* add vllm

* add init

* vlm add multi attr

* fixup

* pass max images to vllm model init

* nit

* encoding to device

* fix HFMultimodalLM.chat_template ?

* add mmmu readme

* remove erroneous prints

* use HFMultimodalLM.chat_template ; restore tasks/__init__.py

* add docstring for replace_placeholders in utils

* fix `replace_placeholders`; set image_string=None

* fix typo

* cleanup + fix merge conflicts

* update MMMU readme

* del mathvista

* add some sample scores

* Update README.md

* add log msg for image_string value

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

fb963f0f

10 Sep, 2024 1 commit

Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232) · decc533d

Malikeh Ehghaghi authored Sep 10, 2024



* arabic leaderboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq is added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks is added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------
Co-authored-by: shahrzads <sayehban@ualberta.ca>
Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

decc533d

05 Sep, 2024 1 commit
- Bump version to v0.4.4 ; Fixes to TMMLUplus (#2280) · 543617fe
  Hailey Schoelkopf authored Sep 05, 2024
  
  543617fe
04 Sep, 2024 1 commit

Chat Template fix (cont. #2235) (#2269) · 7a1614eb

Baber Abbasi authored Sep 04, 2024



* default chat template method fix

* move chat_template to TemplateLM

* remove hotfix

* handle openai `chat_template`

* Update lm_eval/api/model.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add 'max_tokens' to gen_kwargs

* pre-commit

---------
Co-authored-by: KonradSzafer <szafer.konrad@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

7a1614eb

30 Aug, 2024 2 commits

hotfix #2262 (#2264) · 928e8bb6

Baber Abbasi authored Aug 30, 2024

* max_length - 1 (generation always >= 1)

* vllm: fix rolling prefix_token

* nit: add comment

* fixup! max_length should be handled for logliklihoods

* Revert "fixup! max_length should be handled for logliklihoods"

This reverts commit 432d1a3b754c117c3a54ea2fe792ab3a1bd09ed3.

928e8bb6

API: fix maxlen; vllm: prefix_token_id bug (#2262) · b31f92e8

Baber Abbasi authored Aug 30, 2024

* max_length - 1 (generation always >= 1)

* vllm: fix rolling prefix_token

* nit: add comment

* fixup! max_length should be handled for logliklihoods

b31f92e8
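
The "max_length - 1 (generation always >= 1)" item in the two commits above can be read as: when a prompt is truncated to fit the model's window, at least one position must stay free for a generated token. The sketch below illustrates that reading only; `truncate_context` and its parameters are hypothetical names, not the harness's actual code.

```python
from typing import List


def truncate_context(context_tokens: List[int], max_length: int, min_gen_toks: int = 1) -> List[int]:
    """Left-truncate a prompt so that room remains for generated tokens.

    Hypothetical helper: keeps at most `max_length - min_gen_toks` of the
    most recent context tokens, since generation always needs >= 1 token.
    """
    if max_length <= min_gen_toks:
        raise ValueError("max_length must exceed min_gen_toks")
    budget = max_length - min_gen_toks
    # Keep the tail of the prompt; the oldest tokens are dropped first.
    return context_tokens[-budget:]


tokens = list(range(10))
print(truncate_context(tokens, max_length=8))  # keeps the last 7 tokens
```

Truncating from the left is a design choice assumed here because the end of the prompt is usually what the model must continue from.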

28 Aug, 2024 3 commits
- Fix `loglikelihood_rolling` caching ( #1821 ) (#2187) · 8138fd52
  Hailey Schoelkopf authored Aug 28, 2024
```
* fix revision type

* allow for None-input loglikelihood reqs to be cached

* handle no remaining cache items

* pre-commit

* change cache_hook.add_partial(loglikelihood_rolling...) convention

---------
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
```
  8138fd52
- update nltk version to require 3.9.1 (#2259) · 2de3688f
  Hailey Schoelkopf authored Aug 28, 2024
  
  2de3688f
- [Draft] More descriptive `simple_evaluate()` LM TypeError (#2258) · 40010ec1
  Hailey Schoelkopf authored Aug 28, 2024
```
* Update evaluator.py

* update error msg
```
  40010ec1
25 Aug, 2024 1 commit
- chat template hotfix (#2250) · ebe7226e
  Baber Abbasi authored Aug 26, 2024
```
* chat template hotfix

* pre-commit
```
  ebe7226e
23 Aug, 2024 3 commits

Created new task for testing Llama on Asdiv (#2236) · aab42ba8

Cameron Witkowski authored Aug 23, 2024



* Created DUP eval code for gsm8k

* asdiv

* Fixed fewshot=8 issue

* added results to .gitignore

* reverted unnecessary changes and moved results + gsm8k_dup out of repo to prepare for pull req

* fixed whitespace and unintentional hardcoded version change information

* created mbpp task

* Reverted changes re. mbpp to save for a future Pull req

* reverted metrics.py to previous commit

* updated asdiv readme to include information about new asdiv_cot_llama task

* Apply suggestions from code review

---------
Co-authored-by: Alexander Detkov <alexander.d.detkov@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

aab42ba8

fix group args of mmlu and mmlu_pro (#2245) · 5ad23ec2
eyuansu62 authored Aug 23, 2024

5ad23ec2

Fix typos in multiple places (#2244) · fa837646

LSinev authored Aug 23, 2024

ACLUE bibtex typo reported to ACL Anthology and fixed here, as the title in the PDF is correct.

fa837646

22 Aug, 2024 3 commits
- computer_science --> "computer science" (#2241) · 259b756a
  Baber Abbasi authored Aug 23, 2024
  
  259b756a
- Fix logging when resizing embedding layer in peft mode (#2239) · e9287fce
  Wessel Poelman authored Aug 22, 2024
  
  e9287fce
- fix the regex string in mmlu_pro template (#2238) · 325f168c
  lxning authored Aug 22, 2024
```
* fix the regex string in yaml file

* Update samplers.py

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
```
  325f168c
20 Aug, 2024 6 commits

mela (#1970) · a4987bba

Geralt authored Aug 21, 2024



* mela

* Update mela_en.yaml

* Create _mela.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

a4987bba

Fix Zeno Visualizer (#2227) · b536f067

Nam D. Tran authored Aug 20, 2024



* fix: arguments data

* fix based on comment

* Update zeno_visualize.py

updated all output types

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

b536f067

Update CODEOWNERS (#2229) · e050f3c9
Hailey Schoelkopf authored Aug 20, 2024

e050f3c9

Add multiple chat template (#2129) · 3740a5d2

KonradSzafer authored Aug 20, 2024



* multiple chat template support

* help doc update

* add transformers link to docstring

* model args update

* comment update

* statement simplification

* simplified chat_template property

* docs update

* removed template arg from HFLM class

* interface doc update

* model guide update

* interface doc update

* reuse apply_chat_template variable

* model guide refactor

* interface doc update

* removed old definition

* last nits

* last nits

* last nits

* better wording

* last nits

* Remove unnecessary Optional

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* return variable rename

---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

3740a5d2

fix the leaderboard doc to reflect the tasks (#2219) · 221c7d7d
Nathan Habib authored Aug 20, 2024

221c7d7d

Update IFEval dataset to official one (#2218) · 97327e43

lewtun authored Aug 20, 2024

* Update IFEval dataset to official one

This PR updates the IFEval dataset to the one hosted under the Google org: https://huggingface.co/datasets/google/IFEval

Note the main change is an updated prompt from this commit in the GitHub repo: https://github.com/google-research/google-research/commit/26d8ccdab6fec61b5c83ad6327ea8bda9e580288



* Update ifeval.yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

97327e43

19 Aug, 2024 3 commits

Add TMLU Benchmark Dataset (#2093) · ca3d86d6

Yen-Ting Lin authored Aug 19, 2024



* add taiwan truthful qa

* add tmlu

* Add .gitignore entries for evals/ and harness_eval_main_log.txt, and add harness_eval.slurm script

* add pega eval and legal eval

* add ccp eval

* Update .gitignore and harness_eval.slurm

* Add trust_remote_code and wandb_args to harness_eval.slurm, and add run_all.sh script

* Add Pega MMLU task and configuration files

* Add new models and update parameters in run_all.sh

* Add UMTCEval tasks and configurations

* Update dataset paths and output path

* Update .gitignore and harness_eval.slurm, and modify _generate_configs.py

* Update SLURM script and add new models

* clean for pr

* Update lm_eval/tasks/tmlu/default/tmlu.yaml
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* adjust tag name

* removed group alias from tasks

* format

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
Co-authored-by: Yen-Ting Adam, Lin <r08944064@csie.ntu.edu.tw>

ca3d86d6

Update yaml to adapt to belebele dataset changes (#2216) · 86edeffa
Uminosachi authored Aug 20, 2024

86edeffa

Lingoly README update (#2228) · f81b62bf

am-bean authored Aug 19, 2024

* Setting up lingoly task

* Testing yaml changes to debug

* Adding pre-commit hooks

* Functional LingOly benchmark

* Renaming files and adding grouping

* Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.

* Adding LingOly to the README file

f81b62bf

16 Aug, 2024 1 commit

Created a new task for gsm8k which corresponds to the Llama cot settings… (#2215) · cd35aecb

Cameron7195 authored Aug 16, 2024

* Created a new task for gsm8k which corresponds to the cot settings and prompt formatting described by Meta to evaluate Llama. Useful for replicating Llama performance on GSM8K benchmark.

* fixing formatting

* fixing formatting

cd35aecb

15 Aug, 2024 2 commits

New task: Lingoly (#2198) · 8b41f925

am-bean authored Aug 15, 2024

* Setting up lingoly task

* Testing yaml changes to debug

* Adding pre-commit hooks

* Functional LingOly benchmark

* Renaming files and adding grouping

* Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.

8b41f925

Update citation in README.md (#2083) · cbdc3539
Anton Polishko authored Aug 15, 2024
```
Bumped citation to the v0.4.3
```
cbdc3539

10 Aug, 2024 1 commit
- Update README.md (#2206) · 3823cfec
  Yu Shi Jie authored Aug 10, 2024
  
  3823cfec
09 Aug, 2024 1 commit

keep new line for task description (#2116) · 8ad598df

Jungwhan Kim authored Aug 10, 2024



* add keep trailing newline

* apply ruff-format

* add prompt unit test

* increment the version of tasks that have description with whitespace

* remove white spaces of leaderboard bbh

* update MMLU expected versions in output

* CI run does display the expected version=1 for mmlu subtasks, fix expected test output again

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

8ad598df

07 Aug, 2024 1 commit
- gsm_plus minor fix (#2191) · 0571eeb1
  Yu Shi Jie authored Aug 07, 2024
```
* fixed gsm

* GSM-Plus: remove dataset_name line
```
  0571eeb1
05 Aug, 2024 5 commits

Update README.md (#2186) · cddce0a1
Hailey Schoelkopf authored Aug 05, 2024

cddce0a1

fix revision type (#2184) · 7ff13e9e
Hailey Schoelkopf authored Aug 05, 2024

7ff13e9e

added gsm_plus (#2103) · d8506db0

Yu Shi Jie authored Aug 06, 2024



* added gsm_plus

* formatted dataset to have train-test-splits

* README.md for gsm-plus

* Update README.md

* GSM-Plus: added gsm_plus_mini

* GSM-Plus: attribution to original dataset

* Update README.md

* Update README.md

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

d8506db0

Mmlu Pro (#1961) · 69d56f45

Yu Shi Jie authored Aug 06, 2024



* initialized mmlu_pro task

* added generative mmlu-pro

* added cot fewshot for mmlu-pro

* Initial commit

* updated mmlu-pro to take on 3 splits: test, val, dev

* mmlu-pro: added continuation and flan_cot_zeroshot

* added README.md for mmlu_pro

* removed

* update files

* moved files out, and removed unused versions

* updated

* mmlu_pro:

- changed task 'other' to 'miscellaneous': there is already a group named 'other', and a task and group with the same alias (e.g. mmlu_pro_other_generative) throw an error

- fixed yaml backslash escape for fewshot cot

* changed choices -> options in yaml config to fit dataset schema

* ONLY FOR DEFAULT: fixed yaml file to use variable number of choices

* mmlu-pro: fixed doc_to_text/choice/target configs for all variants

* mmlu-pro: minor fixes

* mmlu-pro/default: aligned with mmlu updates

* mmlu-pro: update yaml content in line with mmlu

* mmlu-pro: fixed mislabelling of task (math->chemistry)

* mmlu-pro: fixed yaml formatting

* add custom fewshot doc_to_text, target, and choice

* add process for each subtask

* add process for each subtask

* pre-commit

* pre-commit

* format

* resolved left out merge

* deleted folders + updated readme

* Update evaluator.py

* Update evaluator.py

---------
Co-authored-by: Yu Shi Jie <shijie@tensorplex.ai>
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
Co-authored-by: root <root@455bdd73-01.cloud.together.ai>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

69d56f45

remove incorrectly inherited group names (#2181) · c2168869
Hailey Schoelkopf authored Aug 05, 2024

c2168869