Commits · v0.4.3 · gaoqiong / lm-evaluation-harness

01 Jul, 2024 2 commits
- update to v0.4.3 (#2046) · 3fa4fd72
  Hailey Schoelkopf authored Jul 01, 2024
  
  3fa4fd72
- ship with exact_match function already used ; don't call evaluate.load() on import (#2045) · a8ac0446
  Hailey Schoelkopf authored Jul 01, 2024
  
  a8ac0446
29 Jun, 2024 1 commit
- fail gracefully upon tokenizer logging failure (#2038) · 2a6acc88
  Hailey Schoelkopf authored Jun 29, 2024
  
  2a6acc88
28 Jun, 2024 3 commits

Add chat template to `vllm` (#2034) · cc2d3463

Baber Abbasi authored Jun 28, 2024



* add chat template

* refactor token padding

* nit

* nit

* check on failing test

* check transformers version

* remove transformers pin

* add ids to test

* nit

* fixup

* fix bos bug

* nit

* fixup! fix bos bug

* increase tolerance for table test

* don't detokenize vllm logprobs

* Update lm_eval/models/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit run --all-files

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cc2d3463

fix cache (#2037) · e922cceb
Baber Abbasi authored Jun 28, 2024

e922cceb

Fixes scrolls task bug with few_shot examples (#2003) · 801322e0

Steven Basart authored Jun 28, 2024

Bug:

```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```

801322e0
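
The final frame of the traceback above comes down to unpacking `None` with `**`. The following is a minimal sketch of that failure mode and one possible guard. `DemoTaskConfig`, `build_config`, and `reproduce_error` are hypothetical names used only for illustration; they are not the harness's `TaskConfig` and not necessarily the fix applied in #2003.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoTaskConfig:
    """Hypothetical stand-in for a task config class (not lm_eval's TaskConfig)."""

    task: Optional[str] = None
    num_fewshot: int = 0


def reproduce_error() -> str:
    """Unpack None with ** and return the resulting TypeError message."""
    config = None
    try:
        DemoTaskConfig(**config)
    except TypeError as exc:
        return str(exc)
    return ""


def build_config(config: Optional[dict]) -> DemoTaskConfig:
    """Fall back to an empty mapping so a missing config uses the defaults."""
    return DemoTaskConfig(**(config or {}))
```

Calling `build_config(None)` returns a config with default values, whereas `reproduce_error()` returns a message of the same form as the one in the traceback.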

26 Jun, 2024 1 commit
- Fix `trust_remote_code`-related test failures (#2024) · e5e5ee0c
  Hailey Schoelkopf authored Jun 26, 2024
```
* make MMLU trust remote code to fix tests

* remove trust remote code
```
  e5e5ee0c
25 Jun, 2024 5 commits

Update interface.md (#1982) · 6e49b1f6

johnwee1 authored Jun 26, 2024



* Update interface.md

update interface to remove link to really outdated commit of evaluator.py

* switch to relative referencing?

* Update interface.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

6e49b1f6

Factor out LM-specific tests (#1859) · 0366c74f

Hailey Schoelkopf authored Jun 25, 2024

* separate out optimum/neuralmagic tests to separate job

* fix vllm tests

* fix bug in --trust_remote_code

* use datasets.config instead intentionally

* fix remote code issue?

0366c74f

Added CommonsenseQA task (#1721) · b62b9bd0

Brendan Murphy authored Jun 25, 2024



* Initial configuration

* Using the validation set for the test set, because the test set on HF doesn't have labels

* Probably just makes more sense to have validation be validation

* fix format ; add docs to tasks/README.md

* fix format

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

b62b9bd0

Remove `LM` dependency from `build_all_requests` (#2011) · 9b6179b2

Baber Abbasi authored Jun 25, 2024

* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup

9b6179b2

add arc_challenge_mt (#1900) · 9b6b0f5e
jonabur authored Jun 25, 2024
```
* add arc_challenge_mt

* add README

* add icelandic
```
9b6b0f5e

24 Jun, 2024 2 commits

Hotfix breaking import (#2015) · 0ae3d3eb
Stella Biderman authored Jun 24, 2024

0ae3d3eb

add tokenizer logs info (#1731) · 536691da

achervyakov authored Jun 24, 2024



* add tokenizer logs info

* add no tokenizer case

* Update lm_eval/logging_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/logging_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add updates

* fix conflict

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

536691da

20 Jun, 2024 1 commit

Add BertaQA dataset tasks (#1964) · 6f7b4a05

Julen Etxaniz authored Jun 20, 2024



* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

6f7b4a05

19 Jun, 2024 5 commits

Fix Datasets `--trust_remote_code` (#1998) · d14b36e8
Hailey Schoelkopf authored Jun 19, 2024

d14b36e8
Added ArabicMMLU (#1987) · a08bc3c8
Yazeed Alnumay authored Jun 19, 2024
```
* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`
```
a08bc3c8

Log `fewshot_as_multiturn` in results files (#1995) · 78a54e14

Hailey Schoelkopf authored Jun 19, 2024



* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

78a54e14

Fix Paloma Template yaml (#1993) · ead2964e

Hailey Schoelkopf authored Jun 19, 2024



* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

ead2964e

[New Task] Add Paloma benchmark (#1928) · f257d38b

Zafir Stojanovski authored Jun 19, 2024



* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

f257d38b

18 Jun, 2024 2 commits
- Fix self assignment in neuron_optimum.py (#1990) · bdb78d22
  LSinev authored Jun 18, 2024
  
  bdb78d22
- add trust_remote_code for piqa (#1983) · 72bb6241
  Wang, Chang authored Jun 19, 2024
```
Signed-off-by: changwangss <chang1.wang@intel.com>
```
  72bb6241
13 Jun, 2024 4 commits

fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (#1956) · 568af943

johnwee1 authored Jun 13, 2024



* fix: add filter to os.walk to ignore 'ipynb_checkpoints'

* Update __init__.py

* Update __init__.py

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

568af943
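
Filtering directories during `os.walk`, as the commit above describes, is usually done by editing the `dirnames` list in place so the walk never descends into the unwanted folders. The sketch below illustrates that technique under assumptions: `IGNORED_DIRS`, `find_yaml_files`, and `demo` are hypothetical names, and the dotted directory name `.ipynb_checkpoints` is assumed from Jupyter's convention rather than taken from the commit.

```python
import os
import tempfile

# Assumed set of directory names to skip; not copied from the harness.
IGNORED_DIRS = {".ipynb_checkpoints", "__pycache__"}


def find_yaml_files(root):
    """Return sorted paths of .yaml files under root, skipping ignored dirs."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to the slice mutates the list os.walk holds, which
        # prunes the traversal (this only works with the default topdown=True).
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in filenames:
            if name.endswith(".yaml"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def demo():
    """Build a temporary tree with a checkpoint copy and list what is found."""
    with tempfile.TemporaryDirectory() as root:
        for sub in ("arc", os.path.join("arc", ".ipynb_checkpoints")):
            os.makedirs(os.path.join(root, sub))
            with open(os.path.join(root, sub, "task.yaml"), "w") as handle:
                handle.write("task: demo\n")
        return [os.path.relpath(path, root) for path in find_yaml_files(root)]
```

In `demo()`, the copy of `task.yaml` inside `.ipynb_checkpoints` is never visited, so only the file directly under `arc` is returned.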

make write_out.py explicitly error if no splits match (#1796) · ed72238f
Hailey Schoelkopf authored Jun 13, 2024
```
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
```
ed72238f
Fix `--gen_kwargs` and VLLM (`temperature` not respected) (#1800) · 5c7cba23
Hailey Schoelkopf authored Jun 13, 2024
```
* Update vllm_causallms.py

* adjust

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
```
5c7cba23

`samples` is newline delimited (#1930) · 3850e21a

Baber Abbasi authored Jun 13, 2024



* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

3850e21a

12 Jun, 2024 2 commits
- Fix self.max_tokens in anthropic_llms.py (#1848) · 793469e0
  Nikita Lozhnikov authored Jun 12, 2024
```
Fix bug where `self.max_tokens` was not set
```
  793469e0
- Update interface.md (#1955) · 6f434934
  Sadra Barikbin authored Jun 12, 2024
  
  6f434934
11 Jun, 2024 4 commits
- add hacky add_bos_token forcing for Gemma to VLLM too (#1857) · b3e4c49a
  Hailey Schoelkopf authored Jun 11, 2024
  
  b3e4c49a
- add include_defaults kwarg to taskmanager, add tests for include_path (#1856) · 4bb77e82
  Hailey Schoelkopf authored Jun 11, 2024
  
  4bb77e82
- Remove AMMLU Due to Translation (#1948) · d0f6e011
  Hailey Schoelkopf authored Jun 11, 2024
```
* Update README.md

* Delete lm_eval/tasks/ammlu directory
```
  d0f6e011
- Results filenames handling fix (#1926) · 69952581
  KonradSzafer authored Jun 11, 2024
```
* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils
```
  69952581
10 Jun, 2024 1 commit
- Add the Arabic version with refactor to Arabic pica to be in alghafa folder (#1940) · 305fb636
  khalil authored Jun 10, 2024
  
  305fb636
09 Jun, 2024 1 commit
- Update __main__.py (#1939) · bea1a859
  Sadra Barikbin authored Jun 09, 2024
  
  bea1a859
07 Jun, 2024 4 commits

Test output table layout consistency (#1916) · 40f5458f

Zafir Stojanovski authored Jun 07, 2024

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

40f5458f

Update basque-glue (#1913) · 59418aac

zhabuye authored Jun 07, 2024

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

59418aac

Update siqa.yaml (#1909) · 3f0ef80b
Hailey Schoelkopf authored Jun 07, 2024

3f0ef80b
Add The Arabic version of the PICA benchmark (#1917) · 923852b0
khalil authored Jun 07, 2024

923852b0

06 Jun, 2024 2 commits

Implement NoticIA (#1912) · f2843b2f
Iker García-Ferrero authored Jun 06, 2024
```
* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters
```
f2843b2f

Add new Lambada translations (#1897) · b9d96b50

Zafir Stojanovski authored Jun 06, 2024



* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

b9d96b50