Commits · d8506db0710643756f5ec7bb21ddd85eb2dcd73e · gaoqiong / lm-evaluation-harness

05 Aug, 2024 6 commits

Yu Shi Jie authored Aug 06, 2024



* added gsm_plus

* formatted dataset to have train-test-splits

* README.md for gsm-plus

* Update README.md

* GSM-Plus: added gsm_plus_mini

* GSM-Plus: attribution to original dataset

* Update README.md

* Update README.md

* Update README.md

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

d8506db0

Mmlu Pro (#1961) · 69d56f45

Yu Shi Jie authored Aug 06, 2024



* initialized mmlu_pro task

* added generative mmlu-pro

* added cot fewshot for mmlu-pro

* Initial commit

* updated mmlu-pro to take on 3 splits: test, val, dev

* mmlu-pro: added continuation and flan_cot_zeroshot

* added README.md for mmlu_pro

* removed

* update files

* moved files out, and removed unused versions

* updated

* mmlu_pro:

- changed task 'other' to 'miscellaneous':
  there is already a group named 'other', and a task and group with the same alias (e.g. mmlu_pro_other_generative) throw an error

- fixed yaml backslash escape for fewshot cot

* changed choices -> options in yaml config to fit dataset schema

* ONLY FOR DEFAULT: fixed yaml file to use variable number of choices

* mmlu-pro: fixed doc_to_text/choice/target configs for all variants

* mmlu-pro: minor fixes

* mmlu-pro/default: aligned with mmlu updates

* mmlu-pro: update yaml content in line with mmlu

* mmlu-pro: fixed mislabelling of task (math->chemistry)

* mmlu-pro: fixed yaml formatting

* add custom fewshot doc_to_text, target, and choice

* add process for each subtask

* add process for each subtask

* pre-commit

* pre-commit

* format

* resolved left out merge

* deleted folders + updated readme

* Update evaluator.py

* Update evaluator.py

---------
Co-authored-by: Yu Shi Jie <shijie@tensorplex.ai>
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
Co-authored-by: root <root@455bdd73-01.cloud.together.ai>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

69d56f45

remove incorrectly inherited group names (#2181) · c2168869
Hailey Schoelkopf authored Aug 05, 2024

c2168869
add okapi machine translated notice. (#2168) · 54c9a979
Amir Hossein Kargaran authored Aug 05, 2024

54c9a979
[hotfix] API: messages were created twice (#2174) · 8cffa29b
Baber Abbasi authored Aug 05, 2024

8cffa29b

Dp and mp support (#2056) · 0ce7734d

Nathan Habib authored Aug 05, 2024

* batch commit

* :Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* linting

* add doc

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/huggingface.py

* linter

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* style

* remove prepare

* fix

* style

* last check

* Update lm_eval/models/huggingface.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: clementine@huggingface.co <clementine@huggingface.co>

0ce7734d

04 Aug, 2024 2 commits
- Update README.md (#2125) · 05e6505b
  zhabuye authored Aug 04, 2024
  
  05e6505b
- fix typo. (#2169) · 836eba52
  Amir Hossein Kargaran authored Aug 04, 2024
  
  836eba52
01 Aug, 2024 3 commits

Update lm-eval-overview.ipynb (#2118) · 7ad7c5b9
Hailey Schoelkopf authored Aug 01, 2024

7ad7c5b9

refactor: limit usage of `scipy` and `sklearn` dependencies (#2097) · 7f15cce4

Nathan Weinberg authored Aug 01, 2024



* refactor: move scipy and sklearn module imports to func imports
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows for shared functions to be defined only
once while allowing for the YAML function importing
to continue working
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

7f15cce4
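
The function-level import refactor described in #2097 above can be illustrated with a minimal sketch. It assumes `weighted_f1_score` receives a list of `(gold, prediction)` pairs; that input shape and the body below are illustrative assumptions, not a copy of the code in the commit. The point is only that `sklearn` is imported when the function is called, so importing the surrounding module does not require it.

```python
def weighted_f1_score(items):
    """Weighted F1 over (gold, prediction) pairs.

    The sklearn import lives inside the function, so the dependency is
    only needed when this metric is actually computed.
    """
    # Deferred import: importing this module does not pull in sklearn.
    from sklearn.metrics import f1_score

    golds, preds = zip(*items)  # assumed shape: [(gold, pred), ...]
    return f1_score(golds, preds, average="weighted")


if __name__ == "__main__":
    sample = [(1, 1), (0, 0), (1, 0)]
    try:
        print(weighted_f1_score(sample))
    except ImportError:
        # sklearn is optional here; the module itself still imports fine.
        print("scikit-learn is not installed")
```

The trade-off is that a missing dependency surfaces at call time rather than at import time, which suits optional metrics that most runs never touch.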

[Bugfix] add temperature=0 to logprobs and seed args to API models (#2149) · 63e76e89
Baber Abbasi authored Aug 01, 2024
```
* add temperature for log probs

* add seed

* nit

* add new args to test

* added warning for api chat models
```
63e76e89

29 Jul, 2024 1 commit

bugfix and docs for API (#2139) · b70af4f5

Baber Abbasi authored Jul 29, 2024



* encoding bugfix

* encoding bugfix

* overload loglikelihood rather than loglikelihood_tokens

* add custom tokenizer

* add docs

* Update API_guide.md

fix link; add note

* Update API_guide.md

typo

* pre-commit

* add link in readme

* nit

* nit

* nit

* Update API_guide.md

nits

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update README.md

* Update docs/API_guide.md

* Update docs/API_guide.md

* Update API_guide.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

b70af4f5

22 Jul, 2024 1 commit

Refactor API models (#2008) · 42dc2448

Baber Abbasi authored Jul 23, 2024



* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack loglikelihood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

42dc2448

21 Jul, 2024 1 commit
- fix caching module (hotfix for now) (#2124) · 4a62757d
  Hailey Schoelkopf authored Jul 21, 2024
  
  4a62757d
20 Jul, 2024 1 commit
- docs: update truthfulqa tasks (#2119) · feff1b55
  Jennifer Cwagenberg authored Jul 19, 2024
  
  feff1b55
18 Jul, 2024 2 commits
- fix: broken discord link in CONTRIBUTING.md (#2114) · 8f8e7f6e
  Nathan Weinberg authored Jul 18, 2024
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
  8f8e7f6e
- [python] fix haerae tasks (#2112) · 9d4a04a0
  Jungwhan Kim authored Jul 18, 2024
  
  9d4a04a0
17 Jul, 2024 1 commit
- Fixed colon in Belebele _default_template_yaml (#2111) · 69502c06
  jab13x authored Jul 17, 2024
  
  69502c06
15 Jul, 2024 3 commits
- docs: align local test command to match CI (#2100) · 1adab703
  Nathan Weinberg authored Jul 15, 2024
```
Also add 'test_logs/' to .gitignore
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
  1adab703
- formatting (#2104) · 56a4e794
  Lintang Sutawika authored Jul 15, 2024
  
  56a4e794
- make recurrent_gemma model types included in the force-BOS case (#2105) · 9884ad6e
  Hailey Schoelkopf authored Jul 15, 2024
  
  9884ad6e
14 Jul, 2024 1 commit

Added MedConceptsQA Benchmark (#2010) · 2b26690f

Ben Shoham Ofir authored Jul 14, 2024



* Added MedConceptsQA Benchmark

* pre-commit factor

* update group name

* update in naming

* changed name

* Changed mcqa to med_concepts_qa prefix

* Added med_concepts_qa to README.md

* Changed config files according to the new format

* Updated README

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

2b26690f

13 Jul, 2024 1 commit
- docs: remove trailing sentence from contribution doc (#2098) · a7a2923f
  Nathan Weinberg authored Jul 13, 2024
```
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
```
  a7a2923f
12 Jul, 2024 4 commits

Irokobench: Benchmark Dataset for African languages (#2042) · 383bbd54

Jess authored Jul 12, 2024



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few shot. metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>

383bbd54

Add new dataset MMLU-SR tasks (#2032) · d5f39bf8

SuperCat authored Jul 12, 2024



* add mmlusr tasks

* renamed all tasks names in mmlusr

* edit format and readme

* added mmlu_sr

* mmlu_sr -> mmlusr

* update

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

d5f39bf8

Update default.yaml (#2092) · cdd954f9
Wonung Kim authored Jul 12, 2024

cdd954f9
make RougeScorer only initialized once (#2090) · eeec6dae
Hailey Schoelkopf authored Jul 12, 2024

eeec6dae

11 Jul, 2024 1 commit

Prettify lm_eval --tasks list (#1929) · a0243d54

anthony-dipofi authored Jul 11, 2024



* add  and ; move task list newline logic to new TaskManager.list_all_tasks() method

* format table list into markdown table; add config location column

* add Output Type column

* add logic for printing table of tags separately

* merge with main and fix conflicts ; update docstrings

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

a0243d54

10 Jul, 2024 2 commits

batch_size may be str if 'auto' is specified (#2084) · 30273b47
meg authored Jul 10, 2024

30273b47

Update utils.py (#2085) · 058cfd0e

Lintang Sutawika authored Jul 10, 2024

Group configs with no aggregation print an empty space as the score in the results table.
Example:
```
|    Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------------|-------|------|-----:|--------|---|-----:|---|-----:|
|group         |    N/A|      |      |        |   |      |   |      |
| - task 0     |Yaml   |none  |     0|acc     |↑  |0.4000|±  |0.0910|
| - task 1     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
| - task 2     |Yaml   |none  |     0|acc     |↑  |0.2667|±  |0.0821|
| - task 3     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
```

So the `v` variable in `make_table` needs to check whether the value is a float or a string.
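
A minimal sketch of that check, assuming a hypothetical helper named `format_score` (the real `make_table` may inline the logic differently): floats are rendered with four decimals, while placeholder strings such as the empty group score are passed through unchanged.

```python
def format_score(v):
    """Format one score cell for the results table.

    Floats get four decimal places; anything else (for example the " "
    placeholder used for a group without aggregation) is returned as text.
    """
    if isinstance(v, float):
        return "%.4f" % v
    return str(v)


if __name__ == "__main__":
    # Hypothetical rows: a group with no aggregate score, then two tasks.
    rows = [("group", " "), (" - task 0", 0.4), (" - task 1", 0.3333)]
    for name, value in rows:
        print(f"|{name:<14}|{format_score(value):>6}|")
```

Applying `"%.4f" % v` directly to a string would raise a `TypeError`, which is why the type check comes first.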

058cfd0e

09 Jul, 2024 1 commit
- fix: utf-8 encoding for logged sample files was missing (#2082) · c12e5bec
  Hailey Schoelkopf authored Jul 08, 2024
  
  c12e5bec
08 Jul, 2024 6 commits

Minor doc fix: leaderboard README.md missing mmlu-pro group and task (#2075) · be01651c
Pankaj Mathur authored Jul 08, 2024
```
leaderboard README.md missing mmlu-pro group and task
```
be01651c

Allow gating EvaluationTracker HF Hub results; customizability (#2051) · 563f7971

Nathan Habib authored Jul 08, 2024

* batch commit

* :Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup eval results

* cleanup

* add check for gated repo

* fix jsonline issue

* fix

* add try catch when gating the details repo

* add doc

* adds back hub_repo_name

* readds hub repo name

563f7971

Easier unitxt tasks loading and removal of unitxt library dependency (#1933) · ad80f555

Elron Bandel authored Jul 08, 2024



* Updated unitxt loading
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Revert change to general Readme
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Adjust fda, squadv2, squad_completion and swde to accept config in the constructor
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Fix scrolls
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update documentation
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Enforce backward compatibility
Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Format unitxt class
Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

ad80f555

we run with bootstrap_iters=0 for printing tests (#2080) · cb43ad4e
Hailey Schoelkopf authored Jul 08, 2024

cb43ad4e

Group agg rework (#1741) · 517aadc4

Lintang Sutawika authored Jul 08, 2024



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

517aadc4

Fix TypeError in samplers.py by converting int to str (#2074) · 5a7ed3ee
Choyunhui authored Jul 08, 2024
```
Co-authored-by: yhjo <yhjo@suresofttech.com>
```
5a7ed3ee

03 Jul, 2024 3 commits

#1442 inverse scaling tasks implementation (#1589) · d855d0ba

Hanwool Albert Lee authored Jul 03, 2024



* initial_implementation (test has to be proceeded)

* minor fix

* revised task name and implemented new task

* minor fixes

* new tasks implement

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md

* precommit?

* run precommit on readme

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

d855d0ba

Adds Open LLM Leaderboard Tasks (#2047) · 3c8db1bb

Nathan Habib authored Jul 03, 2024



* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

3c8db1bb

Update hellaswag.yaml (#2029) · 1870ee4e
Hailey Schoelkopf authored Jul 03, 2024

1870ee4e