Commits · ba73d13103f742de7afc00a6ec5ce288983fdf9e · gaoqiong / lm-evaluation-harness

03 Jul, 2024 13 commits

update · 6e2dbe76
lintangsutawika authored Jul 04, 2024

6e2dbe76
fix arabicmmlu · 269b66e9
haileyschoelkopf authored Jul 03, 2024

269b66e9
make explicit group configs for leaderboard and other newer tasks · 9fc24ab4
haileyschoelkopf authored Jul 03, 2024

9fc24ab4
adjust group and tags loading · f0ea518f
lintangsutawika authored Jul 03, 2024

f0ea518f
track api/group.py · b03c7636
haileyschoelkopf authored Jul 03, 2024

b03c7636
fix bbh aggregation filter usage · 6348b947
haileyschoelkopf authored Jul 03, 2024

6348b947
move group api to separate file · 94673d40
haileyschoelkopf authored Jul 03, 2024

94673d40
fixed sorting for multi-level printing · c6839d72
haileyschoelkopf authored Jul 03, 2024

c6839d72
pre-commit format · 96dfe976
lintangsutawika authored Jul 03, 2024

96dfe976
update mmlu · e200c24e
lintangsutawika authored Jul 03, 2024

e200c24e

#1442 inverse scaling tasks implementation (#1589) · d855d0ba

Hanwool Albert Lee authored Jul 03, 2024



* initial_implementation (test has to be proceeded)

* minor fix

* revised task name and implemented new task

* minor fixes

* new tasks implement

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md

* precommit?

* run precommit on readme

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

d855d0ba

Adds Open LLM Leaderboard Taks (#2047) · 3c8db1bb

Nathan Habib authored Jul 03, 2024



* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add anouncment about new leaderbaord tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

3c8db1bb

Update hellaswag.yaml (#2029) · 1870ee4e
Hailey Schoelkopf authored Jul 03, 2024

1870ee4e

02 Jul, 2024 8 commits
- Python tasks which subclass ConfigurableTask now run · e5811879
  haileyschoelkopf authored Jul 02, 2024
  
  e5811879
- switch mmlu variants over to using · 22b95671
  haileyschoelkopf authored Jul 02, 2024
  
  22b95671
- clean up diff · c1d9e625
  haileyschoelkopf authored Jul 02, 2024
  
  c1d9e625
- giving this a try · 3b7e6cc6
  haileyschoelkopf authored Jul 02, 2024
  
  3b7e6cc6
- fix merge conflicts? · d13b1f56
  haileyschoelkopf authored Jul 02, 2024
  
  d13b1f56
- don't duplicate task names · fe5254ac
  haileyschoelkopf authored Jul 02, 2024
  
  fe5254ac
- make KMMLU a tag for now · 30fbcfc9
  haileyschoelkopf authored Jul 02, 2024
  
  30fbcfc9
- update gemma-2 default BOS behavior (#2049) · 67a990e7
  Hailey Schoelkopf authored Jul 01, 2024
  
  67a990e7
01 Jul, 2024 2 commits

Fix strip whitespace filter (#2048) · 9088a68f

Nathan Habib authored Jul 01, 2024

* batch commit

* :Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

9088a68f

ship with exact_match function already used ; don't call evaluate.load() on import (#2045) · a8ac0446
Hailey Schoelkopf authored Jul 01, 2024

a8ac0446

29 Jun, 2024 1 commit
- fail gracefully upon tokenizer logging failure (#2038) · 2a6acc88
  Hailey Schoelkopf authored Jun 29, 2024
  
  2a6acc88
28 Jun, 2024 3 commits

Add chat template to `vllm` (#2034) · cc2d3463

Baber Abbasi authored Jun 28, 2024



* add chat template

* refactor token padding

* nit

* nit

* check on failing test

* check transformers version

* remove transformers pin

* add ids to test

* nit

* fixup

* fix bos bug

* nit

* fixup! fix bos bug

* increase tolerance for table test

* don't detokenize vllm logprobs

* Update lm_eval/models/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit run --all-files

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cc2d3463

fix cache (#2037) · e922cceb
Baber Abbasi authored Jun 28, 2024

e922cceb

Fixes scrolls task bug with few_shot examples (#2003) · 801322e0

Steven Basart authored Jun 28, 2024

Bug:

```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```

801322e0

27 Jun, 2024 1 commit
- update task_id to be updateable and uses group:task format · 43765669
  lintangsutawika authored Jun 27, 2024
  
  43765669
26 Jun, 2024 1 commit
- Fix `trust_remote_code`-related test failures (#2024) · e5e5ee0c
  Hailey Schoelkopf authored Jun 26, 2024
```
* make MMLU trust remote code to fix tests

* remove trust remote code
```
  e5e5ee0c
25 Jun, 2024 8 commits
- Added CommonsenseQA task (#1721) · b62b9bd0
  Brendan Murphy authored Jun 25, 2024
```
* Initial configuration

* Using the validation set for the test set, because the test set on HF doesn't have labels

* Probably just makes more sense to have validation be validation

* fix format ; add docs to tasks/README.md

* fix format

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
```
  b62b9bd0
- Remove `LM` dependency from `build_all_requests` (#2011) · 9b6179b2
  Baber Abbasi authored Jun 25, 2024
```
* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup
```
  9b6179b2
- add more error msgs, agg_metric -> agg_metric_list · 93c17c57
  haileyschoelkopf authored Jun 25, 2024
  
  93c17c57
- add more explicit group configs · 09dd7f6c
  haileyschoelkopf authored Jun 25, 2024
  
  09dd7f6c
- add more explicit group configs · 8fdcbc13
  haileyschoelkopf authored Jun 25, 2024
  
  8fdcbc13
- add many explicit group configs · 51519e40
  haileyschoelkopf authored Jun 25, 2024
  
  51519e40
- add many explicit group configs · 44a602ab
  haileyschoelkopf authored Jun 25, 2024
  
  44a602ab
- add arc_challenge_mt (#1900) · 9b6b0f5e
  jonabur authored Jun 25, 2024
```
* add arc_challenge_mt

* add README

* add icelandic
```
  9b6b0f5e
24 Jun, 2024 2 commits

Hotfix breaking import (#2015) · 0ae3d3eb
Stella Biderman authored Jun 24, 2024

0ae3d3eb

add tokenizer logs info (#1731) · 536691da

achervyakov authored Jun 24, 2024



* add tokenizer logs info

* add no tokenizer case

* Update lm_eval/logging_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/logging_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add updates

* fix conflict

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

536691da

21 Jun, 2024 1 commit
- more groupings · c9801daf
  haileyschoelkopf authored Jun 21, 2024
  
  c9801daf