Commits · scipy-optional · gaoqiong / lm-evaluation-harness

08 Jul, 2024 3 commits

make scipy an extra · a6034d0c
haileyschoelkopf authored Jul 08, 2024

a6034d0c

Lintang Sutawika authored Jul 08, 2024



* add group_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

517aadc4

Fix TypeError in samplers.py by converting int to str (#2074) · 5a7ed3ee
Choyunhui authored Jul 08, 2024
```
Co-authored-by: yhjo <yhjo@suresofttech.com>
```
5a7ed3ee
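
The commit title above does not show the failing expression. As a general illustration of this class of error, here is a minimal sketch, assuming a hypothetical helper `join_example` (not a function from `samplers.py`): concatenating a `str` with an `int` raises `TypeError`, and converting the value with `str()` first avoids it.

```python
def join_example(question: str, target) -> str:
    """Hypothetical helper: append a target (possibly an int label) to a prompt."""
    # `question + " " + target` would raise TypeError when target is an int,
    # so coerce the target to str before concatenating.
    return question + " " + str(target)


if __name__ == "__main__":
    print(join_example("2 + 2 =", 4))
```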

03 Jul, 2024 3 commits

#1442 inverse scaling tasks implementation (#1589) · d855d0ba

Hanwool Albert Lee authored Jul 03, 2024



* initial implementation (tests still need to be run)

* minor fix

* revised task name and implemented new task

* minor fixes

* implement new tasks

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md

* precommit?

* run precommit on readme

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

d855d0ba

Adds Open LLM Leaderboard Tasks (#2047) · 3c8db1bb

Nathan Habib authored Jul 03, 2024



* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

3c8db1bb

Update hellaswag.yaml (#2029) · 1870ee4e
Hailey Schoelkopf authored Jul 03, 2024

1870ee4e

02 Jul, 2024 1 commit
- update gemma-2 default BOS behavior (#2049) · 67a990e7
  Hailey Schoelkopf authored Jul 01, 2024
  
  67a990e7

01 Jul, 2024 4 commits
- Fix strip whitespace filter (#2048) · 9088a68f
  Nathan Habib authored Jul 01, 2024
```
* batch commit

* Revert "batch commit"

This reverts commit d859d1ca.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup
```
  9088a68f
- fix wandb logger module import in example (#2041) · e83e891d
  Ogundepo Odunayo authored Jul 01, 2024
  
  e83e891d
- update to v0.4.3 (#2046) · 3fa4fd72
  Hailey Schoelkopf authored Jul 01, 2024
  
  3fa4fd72
- ship with exact_match function already used ; don't call evaluate.load() on import (#2045) · a8ac0446
  Hailey Schoelkopf authored Jul 01, 2024
  
  a8ac0446

29 Jun, 2024 1 commit
- fail gracefully upon tokenizer logging failure (#2038) · 2a6acc88
  Hailey Schoelkopf authored Jun 29, 2024
  
  2a6acc88

28 Jun, 2024 3 commits

Add chat template to `vllm` (#2034) · cc2d3463

Baber Abbasi authored Jun 28, 2024



* add chat template

* refactor token padding

* nit

* nit

* check on failing test

* check transformers version

* remove transformers pin

* add ids to test

* nit

* fixup

* fix bos bug

* nit

* fixup! fix bos bug

* increase tolerance for table test

* don't detokenize vllm logprobs

* Update lm_eval/models/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit run --all-files

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

cc2d3463

fix cache (#2037) · e922cceb
Baber Abbasi authored Jun 28, 2024

e922cceb

Fixes scrolls task bug with few_shot examples (#2003) · 801322e0

Steven Basart authored Jun 28, 2024

Bug:

```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```

801322e0
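
The traceback above ends in `TypeError: ... argument after ** must be a mapping, not NoneType`, which is what Python raises when `None` is unpacked with `**`. The following is a minimal sketch of that failure mode and one defensive pattern. `DemoConfig`, `build_config_unsafe`, and `build_config_safe` are hypothetical names used only for illustration; this is not the change made in #2003 and `DemoConfig` is not `lm_eval.api.task.TaskConfig`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in for a task config (not lm_eval's TaskConfig)."""
    task: Optional[str] = None
    num_fewshot: int = 0


def build_config_unsafe(config):
    # Raises TypeError when config is None, because `**None` is not a mapping.
    return DemoConfig(**config)


def build_config_safe(config=None):
    # Treat a missing config as an empty mapping before unpacking it.
    return DemoConfig(**(config or {}))


if __name__ == "__main__":
    print(build_config_safe(None))
    print(build_config_safe({"task": "scrolls_quality"}))
```

Falling back to an empty mapping is only one option; passing an explicit config from the subclass is another, and which one fits depends on how the task class is meant to be constructed.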

26 Jun, 2024 1 commit
- Fix `trust_remote_code`-related test failures (#2024) · e5e5ee0c
  Hailey Schoelkopf authored Jun 26, 2024
```
* make MMLU trust remote code to fix tests

* remove trust remote code
```
  e5e5ee0c

25 Jun, 2024 5 commits

Update interface.md (#1982) · 6e49b1f6

johnwee1 authored Jun 26, 2024



* Update interface.md

update interface to remove link to really outdated commit of evaluator.py

* switch to relative referencing?

* Update interface.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

6e49b1f6

Factor out LM-specific tests (#1859) · 0366c74f

Hailey Schoelkopf authored Jun 25, 2024

* separate out optimum/neuralmagic tests to separate job

* fix vllm tests

* fix bug in --trust_remote_code

* use datasets.config instead intentionally

* fix remote code issue?

0366c74f

Added CommonsenseQA task (#1721) · b62b9bd0

Brendan Murphy authored Jun 25, 2024



* Initial configuration

* Using the validation set for the test set, because the test set on HF doesn't have labels

* Probably just makes more sense to have validation be validation

* fix format ; add docs to tasks/README.md

* fix format

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

b62b9bd0

Remove `LM` dependency from `build_all_requests` (#2011) · 9b6179b2

Baber Abbasi authored Jun 25, 2024

* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup

9b6179b2

add arc_challenge_mt (#1900) · 9b6b0f5e
jonabur authored Jun 25, 2024
```
* add arc_challenge_mt

* add README

* add icelandic
```
9b6b0f5e

24 Jun, 2024 2 commits

Hotfix breaking import (#2015) · 0ae3d3eb
Stella Biderman authored Jun 24, 2024

0ae3d3eb

add tokenizer logs info (#1731) · 536691da

achervyakov authored Jun 24, 2024



* add tokenizer logs info

* add no tokenizer case

* Update lm_eval/logging_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/logging_utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add updates

* fix conflict

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

536691da

20 Jun, 2024 1 commit

Add BertaQA dataset tasks (#1964) · 6f7b4a05

Julen Etxaniz authored Jun 20, 2024



* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

6f7b4a05

19 Jun, 2024 5 commits

Fix Datasets `--trust_remote_code` (#1998) · d14b36e8
Hailey Schoelkopf authored Jun 19, 2024

d14b36e8

Added ArabicMMLU (#1987) · a08bc3c8
Yazeed Alnumay authored Jun 19, 2024
```
* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`
```
a08bc3c8

Log `fewshot_as_multiturn` in results files (#1995) · 78a54e14

Hailey Schoelkopf authored Jun 19, 2024



* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

78a54e14

Fix Paloma Template yaml (#1993) · ead2964e

Hailey Schoelkopf authored Jun 19, 2024



* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

ead2964e

[New Task] Add Paloma benchmark (#1928) · f257d38b

Zafir Stojanovski authored Jun 19, 2024



* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

f257d38b

18 Jun, 2024 2 commits
- Fix self assignment in neuron_optimum.py (#1990) · bdb78d22
  LSinev authored Jun 18, 2024
  
  bdb78d22
- add trust_remote_code for piqa (#1983) · 72bb6241
  Wang, Chang authored Jun 19, 2024
```
Signed-off-by: changwangss <chang1.wang@intel.com>
```
  72bb6241

13 Jun, 2024 4 commits

fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (#1956) · 568af943

johnwee1 authored Jun 13, 2024



* fix: add filter to os.walk to ignore 'ipynb_checkpoints'

* Update __init__.py

* Update __init__.py

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

568af943
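
The commit above describes filtering directories during an `os.walk` traversal. A minimal sketch of the general technique follows, assuming a hypothetical helper `iter_yaml_files` and an illustrative ignore set; the names and the exact set of ignored directories are assumptions, not taken from the commit. With the default top-down traversal, editing `dirnames` in place stops `os.walk` from descending into the removed directories. Jupyter names its checkpoint directory `.ipynb_checkpoints`.

```python
import os

# Illustrative ignore set (assumption): directories that should never be scanned.
IGNORED_DIRS = {".ipynb_checkpoints", "__pycache__"}


def iter_yaml_files(root):
    """Yield paths of .yaml files under root, skipping ignored directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place slice assignment prunes the walk; rebinding `dirnames`
        # to a new list would have no effect on the traversal.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in sorted(filenames):
            if name.endswith(".yaml"):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in iter_yaml_files("."):
        print(path)
```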

make write_out.py explicitly error if no splits match (#1796) · ed72238f
Hailey Schoelkopf authored Jun 13, 2024
```
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
```
ed72238f

Fix `--gen_kwargs` and VLLM (`temperature` not respected) (#1800) · 5c7cba23
Hailey Schoelkopf authored Jun 13, 2024
```
* Update vllm_causallms.py

* adjust

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
```
5c7cba23

`samples` is newline delimited (#1930) · 3850e21a

Baber Abbasi authored Jun 13, 2024



* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

3850e21a

12 Jun, 2024 2 commits
- Fix self.max_tokens in anthropic_llms.py (#1848) · 793469e0
  Nikita Lozhnikov authored Jun 12, 2024
```
Fix bug where `self.max_tokens` was not set
```
  793469e0
- Update interface.md (#1955) · 6f434934
  Sadra Barikbin authored Jun 12, 2024
  
  6f434934

11 Jun, 2024 3 commits
- add hacky add_bos_token forcing for Gemma to VLLM too (#1857) · b3e4c49a
  Hailey Schoelkopf authored Jun 11, 2024
  
  b3e4c49a
- add include_defaults kwarg to taskmanager, add tests for include_path (#1856) · 4bb77e82
  Hailey Schoelkopf authored Jun 11, 2024
  
  4bb77e82
- Remove AMMLU Due to Translation (#1948) · d0f6e011
  Hailey Schoelkopf authored Jun 11, 2024
```
* Update README.md

* Delete lm_eval/tasks/ammlu directory
```
  d0f6e011