Commits · cc2d3463c2d5aa28f2b26c40d0ff20c878cc56b8 · gaoqiong / lm-evaluation-harness

28 Jun, 2024 2 commits

Add chat template to `vllm` (#2034) · cc2d3463

Baber Abbasi authored Jun 28, 2024



* add chat template

* refactor token padding

* nit

* nit

* check on failing test

* check transformers version

* remove transformers pin

* add ids to test

* nit

* fixup

* fix bos bug

* nit

* fixup! fix bos bug

* increase tolerance for table test

* don't detokenize vllm logprobs

* Update lm_eval/models/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit run --all-files

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cc2d3463

Fixes scrolls task bug with few_shot examples (#2003) · 801322e0

Steven Basart authored Jun 28, 2024

Bug:

```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```

801322e0
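
The last frame of the traceback shows the failure mode: `TaskConfig(**config)` was called with `config` set to `None`, and `**` unpacking needs a mapping. A minimal sketch of that failure and one defensive pattern follows. The `TaskConfig` dataclass and the `build_config` and `unpack_error_message` helpers are hypothetical stand-ins for illustration, not the harness's real code, and this is not necessarily how #2003 resolved the bug.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskConfig:
    """Hypothetical stand-in for lm_eval.api.task.TaskConfig (illustration only)."""

    task: Optional[str] = None
    num_fewshot: int = 0


def build_config(config: Optional[dict]) -> TaskConfig:
    """Unpack a mapping into TaskConfig, treating None as 'no overrides'."""
    # `TaskConfig(**None)` raises TypeError, so fall back to an empty dict.
    return TaskConfig(**(config or {}))


def unpack_error_message(config) -> str:
    """Return the TypeError text raised when `config` is not a mapping."""
    try:
        TaskConfig(**config)
    except TypeError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    print(unpack_error_message(None))
    print(build_config(None))
    print(build_config({"task": "scrolls_quality"}))
```

The guard `config or {}` is only one option; raising a clearer error when the config is missing may be preferable if `None` signals a real misconfiguration.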

26 Jun, 2024 1 commit

Fix `trust_remote_code`-related test failures (#2024) · e5e5ee0c

Hailey Schoelkopf authored Jun 26, 2024

* make MMLU trust remote code to fix tests

* remove trust remote code

e5e5ee0c

25 Jun, 2024 3 commits

Added CommonsenseQA task (#1721) · b62b9bd0

Brendan Murphy authored Jun 25, 2024



* Initial configuration

* Using the validation set for the test set, because the test set on HF doesn't have labels

* Probably just makes more sense to have validation be validation

* fix format ; add docs to tasks/README.md

* fix format

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

b62b9bd0

Remove `LM` dependency from `build_all_requests` (#2011) · 9b6179b2

Baber Abbasi authored Jun 25, 2024

* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup

9b6179b2

add arc_challenge_mt (#1900) · 9b6b0f5e

jonabur authored Jun 25, 2024

* add arc_challenge_mt

* add README

* add icelandic

9b6b0f5e

20 Jun, 2024 1 commit

Add BertaQA dataset tasks (#1964) · 6f7b4a05

Julen Etxaniz authored Jun 20, 2024



* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

6f7b4a05

19 Jun, 2024 3 commits

Added ArabicMMLU (#1987) · a08bc3c8

Yazeed Alnumay authored Jun 19, 2024

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

a08bc3c8

Fix Paloma Template yaml (#1993) · ead2964e

Hailey Schoelkopf authored Jun 19, 2024



* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

ead2964e

[New Task] Add Paloma benchmark (#1928) · f257d38b

Zafir Stojanovski authored Jun 19, 2024



* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

f257d38b

18 Jun, 2024 1 commit

add trust_remote_code for piqa (#1983) · 72bb6241

Wang, Chang authored Jun 19, 2024

Signed-off-by: changwangss <chang1.wang@intel.com>

72bb6241

13 Jun, 2024 2 commits

fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (#1956) · 568af943

johnwee1 authored Jun 13, 2024



* fix: add filter to os.walk to ignore 'ipynb_checkpoints'

* Update __init__.py

* Update __init__.py

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

568af943
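
The directory filter described in this commit relies on a documented property of `os.walk`: in top-down mode, editing the `dirnames` list in place prunes the traversal. A minimal sketch under assumptions follows. The helper name `find_yaml_files` and the substring match on `ipynb_checkpoints` are hypothetical and may differ from what the harness actually does.

```python
import os
from typing import List

# Assumed marker for Jupyter checkpoint folders (normally ".ipynb_checkpoints").
IGNORED_MARKER = "ipynb_checkpoints"


def find_yaml_files(root: str) -> List[str]:
    """Collect .yaml paths under `root`, skipping checkpoint directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to the slice mutates the list os.walk holds, so the
        # filtered-out directories are never descended into.
        dirnames[:] = [d for d in dirnames if IGNORED_MARKER not in d]
        for name in filenames:
            if name.endswith(".yaml"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Rebinding the name (`dirnames = [...]`) would not prune anything, because `os.walk` keeps its own reference to the original list.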

`samples` is newline delimited (#1930) · 3850e21a

Baber Abbasi authored Jun 13, 2024



* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

3850e21a

11 Jun, 2024 2 commits

add include_defaults kwarg to taskmanager, add tests for include_path (#1856) · 4bb77e82

Hailey Schoelkopf authored Jun 11, 2024

4bb77e82

Remove AMMLU Due to Translation (#1948) · d0f6e011

Hailey Schoelkopf authored Jun 11, 2024

* Update README.md

* Delete lm_eval/tasks/ammlu directory

d0f6e011

10 Jun, 2024 1 commit

Add the Arabic version with refactor to Arabic pica to be in alghafa folder (#1940) · 305fb636

khalil authored Jun 10, 2024

305fb636

07 Jun, 2024 3 commits

Update basque-glue (#1913) · 59418aac

zhabuye authored Jun 07, 2024

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

59418aac

Update siqa.yaml (#1909) · 3f0ef80b

Hailey Schoelkopf authored Jun 07, 2024

3f0ef80b

Add The Arabic version of the PICA benchmark (#1917) · 923852b0

khalil authored Jun 07, 2024

923852b0

06 Jun, 2024 3 commits

Implement NoticIA (#1912) · f2843b2f

Iker García-Ferrero authored Jun 06, 2024

* Noticia

* test

* Final testes implementation

* Fixes

* Fix linters

f2843b2f

Add new Lambada translations (#1897) · b9d96b50

Zafir Stojanovski authored Jun 06, 2024



* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

b9d96b50

[add] fld logical formula task (#1931) · 33eef48f
MorishT authored Jun 06, 2024

33eef48f

05 Jun, 2024 1 commit

Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (#1867) · 7257aa2e

Maxime authored Jun 05, 2024

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

7257aa2e

03 Jun, 2024 1 commit

Complete task list from pr 1727 (#1901) · 3e500e9d

anthony-dipofi authored Jun 03, 2024



* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

3e500e9d

31 May, 2024 1 commit

Making hardcoded few shots compatible with the chat template mechanism (#1895) · 4902aaaf

Clémentine Fourrier authored May 31, 2024



* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental committed file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

4902aaaf

24 May, 2024 2 commits

Bigbench fix (#1686) · 78a215e0

Lintang Sutawika authored May 25, 2024



* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

78a215e0

add mmlu tasks from pile-t5 (#1710) · f2ea37e3

Lintang Sutawika authored May 25, 2024



* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

f2ea37e3

22 May, 2024 1 commit
- Update polemo2_out.yaml (#1871) · 70e1de09
  zhabuye authored May 22, 2024
  
  70e1de09
21 May, 2024 1 commit
- fixed incorrect check for task type (replace `~` with `not`) (#1865) · 00b7a61c
  Zafir Stojanovski authored May 21, 2024
  
  00b7a61c
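
The `~` versus `not` distinction behind this fix is easy to miss: on a Python `bool`, `~` is bitwise inversion and returns an integer, so the result is truthy for both `True` and `False`. A minimal sketch follows. The helper names are hypothetical and unrelated to the harness's own code.

```python
import warnings


def tilde_invert(flag: bool) -> int:
    """Apply bitwise NOT to a bool; the result is an int, not a bool."""
    with warnings.catch_warnings():
        # Recent Python versions warn about `~` on bool; silence it for the demo.
        warnings.simplefilter("ignore", DeprecationWarning)
        return ~flag


def logical_invert(flag: bool) -> bool:
    """Apply logical negation, which is what a type check usually wants."""
    return not flag


if __name__ == "__main__":
    for flag in (True, False):
        inverted = tilde_invert(flag)
        # ~True is -2 and ~False is -1: both are non-zero, hence truthy.
        print(flag, inverted, bool(inverted), logical_invert(flag))
```

Because both `-2` and `-1` are truthy, a condition written as `if ~check(...)` always takes the branch, whereas `if not check(...)` behaves as intended.
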
13 May, 2024 1 commit

Adding tinyBenchmarks datasets (#1545) · fe9fef4e

Lucas Weber authored May 13, 2024



* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

fe9fef4e

09 May, 2024 1 commit

Copal task (#1803) · 1980a13c

Edd authored May 10, 2024

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` to yaml to make it work

* checkmark on README

* change group name to `copal_id`

1980a13c

08 May, 2024 1 commit

add task for mmlu evaluation in arc multiple choice format (#1745) · 9097ad3e

jonabur authored May 08, 2024



* add mmlu arc style evaluation

* rename arc_style to continuation

---------
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

9097ad3e

07 May, 2024 3 commits

Initial integration of the Unitxt to LM eval harness (#1615) · 885f48d6

Yoav Katz authored May 08, 2024

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt



The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed and print a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

885f48d6
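
Part 2 of the review comments in this commit describes checking that unitxt is installed and printing a clear error if it is not. One common way to write such a check is sketched below using `importlib.util.find_spec`. The function name `require_extra` and the message wording are hypothetical, and the harness's actual check may look different.

```python
import importlib.util


def require_extra(module_name: str, extra: str) -> None:
    """Raise a descriptive ImportError if an optional dependency is missing.

    `module_name` must be a top-level module name such as "unitxt".
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install it with: pip install 'lm_eval[{extra}]'"
        )


if __name__ == "__main__":
    try:
        require_extra("unitxt", "unitxt")
        print("unitxt is available")
    except ImportError as exc:
        print(exc)
```

Checking with `find_spec` avoids importing the package just to test for its presence, which keeps the check cheap when the optional task is never used.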

Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (#1793) · d42a3e44

Hailey Schoelkopf authored May 07, 2024

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

d42a3e44

Fix Caching Tests ; Remove `pretrained=gpt2` default (#1775) · 7fe2b93c

Hailey Schoelkopf authored May 07, 2024

7fe2b93c

01 May, 2024 4 commits

upload new tasks (#1728) · caaf9ab6

Simran Arora authored May 01, 2024



* upload new tasks

* add readmes

* run linters

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

caaf9ab6

Fix m_arc choices (#1760) · f27c4050

Zehan Li authored May 02, 2024



* Update utils.py

This is a 4-choice task, option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

f27c4050

Pile 10k new task (#1758) · b898bdaa

Gabriel Mukobi authored May 01, 2024

* Add Pile-10k readme

* Add Pile-10k task configuration file

b898bdaa

remove duplicated `num_fewshot: 0` (#1769) · 552eeae7

Chujie Zheng authored May 01, 2024

552eeae7

26 Apr, 2024 1 commit

Support individual scrolls datasets (#1740) · 9b49556a

giorgossideris authored Apr 26, 2024

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

9b49556a