Commits · ca3d86d6b8dea86211ec60b93c3c026ce73c9d60 · gaoqiong / lm-evaluation-harness

19 Aug, 2024 3 commits

Add TMLU Benchmark Dataset (#2093) · ca3d86d6

Yen-Ting Lin authored Aug 19, 2024



* add taiwan truthful qa

* add tmlu

* Add .gitignore entries for evals/ and harness_eval_main_log.txt, and add harness_eval.slurm script

* add pega eval and legal eval

* add ccp eval

* Update .gitignore and harness_eval.slurm

* Add trust_remote_code and wandb_args to harness_eval.slurm, and add run_all.sh script

* Add Pega MMLU task and configuration files

* Add new models and update parameters in run_all.sh

* Add UMTCEval tasks and configurations

* Update dataset paths and output path

* Update .gitignore and harness_eval.slurm, and modify _generate_configs.py

* Update SLURM script and add new models

* clean for pr

* Update lm_eval/tasks/tmlu/default/tmlu.yaml
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* adjust tag name

* removed group alias from tasks

* format

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
Co-authored-by: Yen-Ting Adam, Lin <r08944064@csie.ntu.edu.tw>

ca3d86d6

Update yaml to adapt to belebele dataset changes (#2216) · 86edeffa
Uminosachi authored Aug 20, 2024

86edeffa

Lingoly README update (#2228) · f81b62bf

am-bean authored Aug 19, 2024

* Setting up lingoly task

* Testing yaml changes to debug

* Adding pre-commit hooks

* Functional LingOly benchmark

* Renaming files and adding grouping

* Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.

* Adding LingOly to the README file

f81b62bf

16 Aug, 2024 1 commit

Created a new task for gsm8k which corresponds to the Llama cot settings… (#2215) · cd35aecb

Cameron7195 authored Aug 16, 2024

* Created a new task for gsm8k which corresponds to the cot settings and prompt formatting described by Meta to evaluate Llama. Useful for replicating Llama performance on GSM8K benchmark.

* fixing formatting

* fixing formatting

cd35aecb

15 Aug, 2024 2 commits

New task: Lingoly (#2198) · 8b41f925

am-bean authored Aug 15, 2024

* Setting up lingoly task

* Testing yaml changes to debug

* Adding pre-commit hooks

* Functional LingOly benchmark

* Renaming files and adding grouping

* Extending group aggregations to allow custom functions. Setting up custom lingoly aggregation using difference in scores.

8b41f925

Update citation in README.md (#2083) · cbdc3539
Anton Polishko authored Aug 15, 2024
```
Bumped citation to the v0.4.3
```
cbdc3539

10 Aug, 2024 1 commit
- Update README.md (#2206) · 3823cfec
  Yu Shi Jie authored Aug 10, 2024
  
  3823cfec
09 Aug, 2024 1 commit

keep new line for task description (#2116) · 8ad598df

Jungwhan Kim authored Aug 10, 2024



* add keep trailing newline

* apply ruff-format

* add prompt unit test

* increment the version of tasks that have description with whitespace

* remove white spaces of leaderboard bbh

* update MMLU expected versions in output

* CI run does display the expected version=1 for mmlu subtasks, fix expected test output again

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

8ad598df

07 Aug, 2024 1 commit
- gsm_plus minor fix (#2191) · 0571eeb1
  Yu Shi Jie authored Aug 07, 2024
```
* fixed gsm

* GSM-Plus: remove dataset_name line
```
  0571eeb1
05 Aug, 2024 8 commits

Update README.md (#2186) · cddce0a1
Hailey Schoelkopf authored Aug 05, 2024

cddce0a1
fix revision type (#2184) · 7ff13e9e
Hailey Schoelkopf authored Aug 05, 2024

7ff13e9e

added gsm_plus (#2103) · d8506db0

Yu Shi Jie authored Aug 06, 2024



* added gsm_plus

* formatted dataset to have train-test-splits

* README.md for gsm-plus

* Update README.md

* GSM-Plus: added gsm_plus_mini

* GSM-Plus: attribution to original dataset

* Update README.md

* Update README.md

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

d8506db0

Mmlu Pro (#1961) · 69d56f45

Yu Shi Jie authored Aug 06, 2024



* initialized mmlu_pro task

* added generative mmlu-pro

* added cot fewshot for mmlu-pro

* Initial commit

* updated mmlu-pro to take on 3 splits: test, val, dev

* mmlu-pro: added continuation and flan_cot_zeroshot

* added README.md for mmlu_pro

* removed

* update files

* moved files out, and removed unused versions

* updated

* mmlu_pro:

  - changed task 'other' to 'miscellaneous':
    there is already a group named 'other', and a
    task and group with the same alias (e.g. mmlu_pro_other_generative) throws an error

  - fixed yaml backslash escape for fewshot cot

* changed choices -> options in yaml config to fit dataset schema

* ONLY FOR DEFAULT: fixed yaml file to use variable number of choices

* mmlu-pro: fixed doc_to_text/choice/target configs for all variants

* mmlu-pro: minor fixes

* mmlu-pro/default: aligned with mmlu updates

* mmlu-pro: update yaml content in line with mmlu

* mmlu-pro: fixed mislabelling of task (math->chemistry)

* mmlu-pro: fixed yaml formatting

* add custom fewshot doc_to_text, target, and choice

* add process for each subtask

* add process for each subtask

* pre-commit

* pre-commit

* format

* resolved left out merge

* deleted folders + updated readme

* Update evaluator.py

* Update evaluator.py

---------
Co-authored-by: Yu Shi Jie <shijie@tensorplex.ai>
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
Co-authored-by: root <root@455bdd73-01.cloud.together.ai>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

69d56f45

remove incorrectly inherited group names (#2181) · c2168869
Hailey Schoelkopf authored Aug 05, 2024

c2168869
add okapi machine translated notice. (#2168) · 54c9a979
Amir Hossein Kargaran authored Aug 05, 2024

54c9a979
[hotfix] API: messages were created twice (#2174) · 8cffa29b
Baber Abbasi authored Aug 05, 2024

8cffa29b

Dp and mp support (#2056) · 0ce7734d

Nathan Habib authored Aug 05, 2024

* batch commit

* Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* linting

* add doc

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/huggingface.py

* linter

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* style

* remove prepare

* fix

* style

* last check

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: clementine@huggingface.co <clementine@huggingface.co>

0ce7734d

04 Aug, 2024 2 commits
- Update README.md (#2125) · 05e6505b
  zhabuye authored Aug 04, 2024
  
  05e6505b
- fix typo. (#2169) · 836eba52
  Amir Hossein Kargaran authored Aug 04, 2024
  
  836eba52
01 Aug, 2024 3 commits

Update lm-eval-overview.ipynb (#2118) · 7ad7c5b9
Hailey Schoelkopf authored Aug 01, 2024

7ad7c5b9

refactor: limit usage of `scipy` and `skilearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

7f15cce4
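
The "move module imports to func imports" change above follows a common pattern: a heavy optional dependency is imported inside the function that needs it, so importing the module itself stays cheap. The sketch below is illustrative only; `mean_score` is a hypothetical helper, and the standard-library `statistics` module stands in for a heavier package such as `scipy` or `sklearn`. It is not the harness's actual code.

```python
def mean_score(values):
    """Return the arithmetic mean of `values`.

    Hypothetical helper: `statistics` stands in for a heavy optional
    dependency (e.g. scipy or sklearn in the commit above).
    """
    # Function-level import: the module is loaded on the first call,
    # not when the file that defines this function is imported.
    import statistics

    return statistics.mean(values)


result = mean_score([1, 2, 3])
```

Python caches imported modules in `sys.modules`, so later calls pay only for a dictionary lookup, not for a fresh import.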

[Bugfix] add temperature=0 to logprobs and seed args to API models (#2149) · 63e76e89
Baber Abbasi authored Aug 01, 2024
```
* add temperature for log probs

* add seed

* nit

* add new args to test

* added warning for api chat models
```
63e76e89

29 Jul, 2024 1 commit

bugfix and docs for API (#2139) · b70af4f5

Baber Abbasi authored Jul 29, 2024



* encoding bugfix

* encoding bugfix

* overload logliklehood rather than loglikehood_tokens

* add custom tokenizer

* add docs

* Update API_guide.md

fix link; add note

* Update API_guide.md

typo

* pre-commit

* add link in readme

* nit

* nit

* nit

* Update API_guide.md

nits

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update README.md

* Update docs/API_guide.md

* Update docs/API_guide.md

* Update API_guide.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

b70af4f5

22 Jul, 2024 1 commit

Refactor API models (#2008) · 42dc2448

Baber Abbasi authored Jul 23, 2024



* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requiremnts

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack logliklehood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

42dc2448
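
The "shared gen_kwargs mutated" and "handle copy properly" bullets above point to a familiar pitfall: popping keys from a dictionary that several requests share. A minimal sketch of the usual remedy follows, assuming generation arguments are passed as a plain `dict`. `build_request` and the payload layout are hypothetical and not taken from `api_models.py`.

```python
import copy


def build_request(prompt, gen_kwargs):
    """Build a request payload without modifying the caller's dict.

    Hypothetical helper; the key names are used for illustration only.
    """
    # Work on a private deep copy so pop() cannot affect other requests.
    kwargs = copy.deepcopy(gen_kwargs)
    max_tokens = kwargs.pop("max_gen_toks", 256)
    stop = kwargs.pop("until", [])
    return {"prompt": prompt, "max_tokens": max_tokens, "stop": stop, **kwargs}


shared = {"max_gen_toks": 32, "until": ["\n"], "temperature": 0.0}
first = build_request("a", shared)
second = build_request("b", shared)
```

Without the copy, the first call would remove `max_gen_toks` and `until` from `shared`, and the second request would silently fall back to the defaults.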

21 Jul, 2024 1 commit
- fix caching module (hotfix for now) (#2124) · 4a62757d
  Hailey Schoelkopf authored Jul 21, 2024
  
  4a62757d
20 Jul, 2024 1 commit
- docs: update truthfulqa tasks (#2119) · feff1b55
  Jennifer Cwagenberg authored Jul 19, 2024
  
  feff1b55
18 Jul, 2024 2 commits
- fix: broken discord link in CONTRIBUTING.md (#2114) · 8f8e7f6e
  Nathan Weinberg authored Jul 18, 2024
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
  8f8e7f6e
- [python] fix haerae tasks (#2112) · 9d4a04a0
  Jungwhan Kim authored Jul 18, 2024
  
  9d4a04a0
17 Jul, 2024 1 commit
- Fixed colon in Belebele _default_template_yaml (#2111) · 69502c06
  jab13x authored Jul 17, 2024
  
  69502c06
15 Jul, 2024 3 commits
- docs: align local test command to match CI (#2100) · 1adab703
  Nathan Weinberg authored Jul 15, 2024
```
Also add 'test_logs/' to .gitignore
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
  1adab703
- formatting (#2104) · 56a4e794
  Lintang Sutawika authored Jul 15, 2024
  
  56a4e794
- make recurrent_gemma model types included in the force-BOS case (#2105) · 9884ad6e
  Hailey Schoelkopf authored Jul 15, 2024
  
  9884ad6e
14 Jul, 2024 1 commit

Added MedConceptsQA Benchmark (#2010) · 2b26690f

Ben Shoham Ofir authored Jul 14, 2024



* Added MedConceptsQA Benchmark

* pre-commit factor

* update group name

* update in naming

* changed name

* Changed mcqa to med_concepts_qa prefix

* Added med_concepts_qa to README.md

* Changed config files according the new format

* Updated README

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

2b26690f

13 Jul, 2024 1 commit
- docs: remove trailing sentence from contribution doc (#2098) · a7a2923f
  Nathan Weinberg authored Jul 13, 2024
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
  a7a2923f
12 Jul, 2024 4 commits

Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54

Jess authored Jul 12, 2024



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few show. metric fixes

* fix direct path, add bash script for gpt models

* added transate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modefied mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding trasnslate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>

383bbd54

Add new dataset MMLU-SR tasks (#2032) · d5f39bf8

SuperCat authored Jul 12, 2024



* add mmlusr tasks

* renamed all tasks names in mmlusr

* edit format and readme

* added mmlu_sr

* mmlu_sr -> mmlusr

* update

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

d5f39bf8

Update default.yaml (#2092) · cdd954f9
Wonung Kim authored Jul 12, 2024

cdd954f9
make RougeScorer only initialized once (#2090) · eeec6dae
Hailey Schoelkopf authored Jul 12, 2024

eeec6dae
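
One common way to make sure an expensive object such as a scorer is "only initialized once" is to cache its constructor. The sketch below assumes nothing about the actual change: `ExpensiveScorer`, `get_scorer`, and `overlap` are hypothetical names, and the scoring rule is a toy word-overlap ratio, not ROUGE.

```python
from functools import lru_cache


class ExpensiveScorer:
    """Hypothetical stand-in for a scorer that is slow to construct."""

    instances = 0  # counts how many times the constructor has run

    def __init__(self):
        ExpensiveScorer.instances += 1

    def score(self, reference, prediction):
        # Toy metric: fraction of reference words found in the prediction.
        ref, pred = set(reference.split()), set(prediction.split())
        return len(ref & pred) / max(len(ref), 1)


@lru_cache(maxsize=None)
def get_scorer():
    # Built on the first call, then reused on every later call.
    return ExpensiveScorer()


def overlap(reference, prediction):
    return get_scorer().score(reference, prediction)


scores = [overlap("a b c", "a b"), overlap("x y", "x y")]
```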

11 Jul, 2024 1 commit

Prettify lm_eval --tasks list (#1929) · a0243d54

anthony-dipofi authored Jul 11, 2024



* add  and ; move task list newline logic to new TaskManager.list_all_tasks() method

* format table list into markdown table; add config location column

* add Output Type column

* add logic for printing table of tags separately

* merge with main and fix conflicts ; update docstrings

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

a0243d54

10 Jul, 2024 1 commit
- batch_size may be str if 'auto' is specified (#2084) · 30273b47
  meg authored Jul 10, 2024
  
  30273b47
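
The last entry notes that `batch_size` may arrive as a string when 'auto' is specified, so code that unconditionally calls `int()` on it would fail. A minimal sketch of tolerant parsing follows; `parse_batch_size` is a hypothetical helper and not the function changed in the commit.

```python
def parse_batch_size(value):
    """Return an int batch size, or pass an 'auto...' string through unchanged.

    Hypothetical helper for illustration.
    """
    # Keep 'auto'-style settings as strings for the caller to interpret.
    if isinstance(value, str) and value.startswith("auto"):
        return value
    # Everything else (e.g. "8" from the command line, or 8) becomes an int.
    return int(value)


parsed = [parse_batch_size(v) for v in ("8", 4, "auto")]
```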