Commits · bd7d265af6a8a01ca1c1c2ca34da4598a2344c2e · gaoqiong / lm-evaluation-harness

31 Jan, 2024 1 commit

Fix unintuitive `--gen_kwargs` behavior (#1329) · bd7d265a

Hailey Schoelkopf authored Jan 31, 2024

* don't override do_sample if no value for it is passed

* Update gen_kwargs override condition

* Update huggingface.py

* Update huggingface.py

* run linters

* silence an erroneous warning

bd7d265a
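
The override rule in the bullets above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the helper name `resolve_sampling_kwargs` and the default temperature of 0.0 are assumptions made for the example.

```python
from typing import Any, Dict


def resolve_sampling_kwargs(user_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill in greedy defaults without clobbering user input."""
    kwargs = dict(user_kwargs)  # copy so the caller's dict is not mutated
    # Assumed default: temperature 0.0 when the user gave no temperature.
    kwargs.setdefault("temperature", 0.0)
    # Infer do_sample only when the user did not pass a value for it.
    if kwargs.get("do_sample") is None and kwargs["temperature"] == 0.0:
        kwargs["do_sample"] = False
    # Greedy decoding does not need a temperature, so drop the 0.0 placeholder.
    if kwargs.get("do_sample") is False and kwargs["temperature"] == 0.0:
        kwargs.pop("temperature")
    return kwargs
```

In this sketch an explicit `do_sample=True` from the user is never replaced, which is the behavior the first bullet asks for.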

30 Jan, 2024 1 commit
- delay filter init; remove `*args` (#1369) · 1554066c
  Baber Abbasi authored Jan 30, 2024
```
* delay filter init; remove `*args`

* bugfix

* optimize

* type hint
```
  1554066c
29 Jan, 2024 1 commit
- serialize callable functions in config (#1367) · 7fc43656
  Baber Abbasi authored Jan 29, 2024
  
  7fc43656
28 Jan, 2024 1 commit

Apply some best practices and guideline recommendations to code (#1363) · 488759d2

LSinev authored Jan 28, 2024

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with f-strings (not with `format`)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are ongoing discussions about formatting speed during logging and about unintended code execution:
https://github.com/pylint-dev/pylint/issues/2395
https://stackoverflow.com/a/54368109
Either way, a single format (f-strings) will now be used throughout the project.

* Specify utf-8 encoding for `open` explicitly

If not specified, the default encoding may differ across environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

This also helps when code from English-language tasks is used as a starting point for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression

488759d2
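
Several of the guidelines listed in this commit can be shown together in a short sketch. The function names and the sample text are made up for illustration and do not come from the repository.

```python
import logging
import tempfile
from pathlib import Path
from typing import List, Optional

logger = logging.getLogger(__name__)


def append_item(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    """Use None as the default instead of a shared mutable list."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def describe(value: object) -> str:
    """Prefer isinstance() over comparing type() results directly."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, (list, tuple)):
        return "sequence"
    # Raise an exception instance, never a bare string.
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def save_and_reload(text: str, path: Path) -> str:
    """Pass the encoding explicitly so results do not depend on the locale."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    logger.info(f"wrote {len(text)} characters to {path}")
    with open(path, encoding="utf-8") as f:
        return f.read()


# Round-trip some non-ASCII text through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    roundtrip = save_and_reload("zażółć gęślą jaźń", Path(tmp) / "sample.txt")
```

With a mutable default such as `bucket=[]`, every call without an explicit `bucket` would share one list. The `None` sentinel gives each call a fresh list.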

26 Jan, 2024 2 commits

Add causalLM OpenVino models (#1290) · 97a67d27

NoushNabi authored Jan 26, 2024



* added intel optimum

* added intel optimum in readme

* modified intel optimum

* modified intel optimum

* modified intel optimum

* modified install optimum

* modified path of IR file

* added openvino_device

* added openvino_device2

* changed optimum-causal to openvino-causal

* Update README.md

* Update README.md

* remove `lm_eval.base` import

* update openvino-causal -> openvino ; pass device through super().__init__()

* Update README.md

* Add optimum to tests dependencies

* apply pre-commit

* fix so tests pass

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

97a67d27

Refix issue regarding stderr (#1357) · 5b0b8a56
thnkinbtfly authored Jan 26, 2024

5b0b8a56

25 Jan, 2024 1 commit

`Filter` docs not offset by `doc_id` (#1349) · a0f1cacd

Baber Abbasi authored Jan 26, 2024

* get `doc` from instance

* accelerate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint

a0f1cacd

24 Jan, 2024 1 commit
- modified default gen_kwargs to work better with CLI; changed prompt_logprobs=1 (#1345) · 38c8d02f
  Baber Abbasi authored Jan 24, 2024
  
  38c8d02f
23 Jan, 2024 3 commits

manage default (greedy) gen_kwargs in vllm (#1341) · 081deb8b

Baber Abbasi authored Jan 24, 2024

* manage default (greedy) gen_kwargs in vllm better

* mirror HF `do_sample`

* just need to set temp=0 for greedy

081deb8b
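
The idea in the last two bullets, that greedy decoding only needs a temperature of 0, can be sketched without importing vLLM. The helper `to_temperature_only_kwargs` is hypothetical. It assumes the target backend has no `do_sample` flag and treats temperature 0 as greedy.

```python
from typing import Any, Dict


def to_temperature_only_kwargs(gen_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical translation of HF-style kwargs into a temperature-only form."""
    kwargs = dict(gen_kwargs)  # leave the caller's dict untouched
    do_sample = kwargs.pop("do_sample", None)  # HF-style flag, removed here
    # Greedy when sampling is explicitly disabled or no temperature was given.
    if do_sample is False or "temperature" not in kwargs:
        kwargs["temperature"] = 0.0
    return kwargs
```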

Don't use `get_task_dict()` in task registration / initialization (#1331) · 969b48bf

Hailey Schoelkopf authored Jan 23, 2024



* don't use get_task_dict() as a helper, it will download the dataset!

* pre-commit

* Update README.md

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>

969b48bf

Update migrated HF dataset paths (#1332) · 45a8f709

Hailey Schoelkopf authored Jan 22, 2024



* Update arc_easy.yaml

* Update flan_cot.yaml

* update HF dataset path

* Update freeform.yaml

* Update flan_cot.yaml

---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

45a8f709

22 Jan, 2024 4 commits

fallback to classname when LM doesn't have config (#1334) · 607d7da5
Brian Vaughan authored Jan 22, 2024

607d7da5

Add `local-completions` support using OpenAI interface (#1277) · 5c25dd55

Michael Goin authored Jan 22, 2024



* Add `local-completions` support using OpenAI interface

* Refactor oa_completion

* Address tokenizer comments and change request chunks to batch size

* Add warning message for tiktoken backend

* fix formatting

* fix whitespace

* Update README.md

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

5c25dd55

Fix Issue regarding stderr (#1327) · 076372ee
Lintang Sutawika authored Jan 23, 2024
```
* add fix for deciding if stderr is N/A or not

* process N/A
```
076372ee
don't pass extra kwargs to mamba any more (#1328) · 181ccf43
Hailey Schoelkopf authored Jan 22, 2024

181ccf43

19 Jan, 2024 1 commit
- Update polemo2_in.yaml (#1318) · 4477d572
  Lintang Sutawika authored Jan 20, 2024
  
  4477d572
18 Jan, 2024 3 commits
- Fix group register (#1315) · 72ea626e
  Lintang Sutawika authored Jan 19, 2024
```
* tuple should be considered as well

* set option to keep callable as callable
```
  72ea626e
- Fix polemo2_in.yaml config name (#1313) · b8cbc425
  Quentin Lhoest authored Jan 18, 2024
  
  b8cbc425
- Update nq_open.yaml (#1305) · 10488d0d
  Hannibal046 authored Jan 18, 2024
```
* Update nq_open.yaml

change regex

* Bump NQ version

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
```
  10488d0d
16 Jan, 2024 1 commit
- Update nq_open.yaml (#1289) · 032e879b
  Hailey Schoelkopf authored Jan 15, 2024
  
  032e879b
15 Jan, 2024 3 commits
- Fix data-parallel evaluation with quantized models (#1270) · ef665088
  Hailey Schoelkopf authored Jan 15, 2024
```
* add WIP device_map overrides

* update handling outside of accelerate launcher

* change .to(device) log to debug level

* run linter
```
  ef665088
- Allow parameter edits for registered tasks when listed in a benchmark (#1273) · 03e7df51
  Lintang Sutawika authored Jan 15, 2024
```
* benchmark yamls allow minor edits of already registered tasks

* add documentation

* removed print
```
  03e7df51
- fix whitespace in target + prompt for CoT gsm8k (#1275) · ace4393e
  Hailey Schoelkopf authored Jan 15, 2024
  
  ace4393e
12 Jan, 2024 3 commits
- apply process_docs() to fewshot_split too (#1276) · 89618bf8
  Hailey Schoelkopf authored Jan 12, 2024
  
  89618bf8
- add Kobest (#1263) · 653217a7
  jp authored Jan 12, 2024
```
* Add: kobest config file

* Add: kobest utils

* Add: README

* Update utils.py
```
  653217a7
- update versioning logging (#1271) · 75dc2b87
  Hailey Schoelkopf authored Jan 11, 2024
  
  75dc2b87
11 Jan, 2024 2 commits

Fix bug in multi-token Stop Sequences (#1268) · ff739414
Hailey Schoelkopf authored Jan 11, 2024
```
* fix incorrect lookback protections

* bump generate_until task versions
```
ff739414
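
To see why the lookback window matters for multi-token stop sequences, here is a toy sketch. It is not the harness's stopping criterion: tokens are modeled as already-decoded strings, and `hit_stop_sequence` is a hypothetical name.

```python
from typing import Sequence


def hit_stop_sequence(generated: Sequence[str], stop: str, lookback_tokens: int) -> bool:
    """Check whether the stop string appears in the last `lookback_tokens` tokens."""
    if lookback_tokens < 1:
        raise ValueError("lookback_tokens must be at least 1")
    tail = "".join(generated[-lookback_tokens:])
    return stop in tail


# The stop string "\n\n" is spread over two tokens here.
tokens = ["Answer", ":", " 42", "\n", "\n"]
too_short = hit_stop_sequence(tokens, "\n\n", lookback_tokens=1)
long_enough = hit_stop_sequence(tokens, "\n\n", lookback_tokens=2)
```

A window shorter than the number of tokens the stop string spans can never match, so generation would run past the intended stopping point.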

MultiMedQA (#1198) · 818c056b

Tanishq Abraham authored Jan 10, 2024



* multimedqa

* Update medqa.yaml

* move to benchmarks folder

* add README.md

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

818c056b

10 Jan, 2024 3 commits
- Call "exact_match" once for each multiple-target sample (#1266) · 692e0f83
  Baber Abbasi authored Jan 10, 2024
```
* Refine scoring logic for multiple_target "exact_match" metric

* skip old tests from master

* skip old tests from master

* delete tests from master
```
  692e0f83
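
One plausible reading of "once for each multiple-target sample" is that every sample contributes a single score, however many reference targets it has. The sketch below illustrates that reading only. The function names are hypothetical and the comparison is a plain string equality, with none of the normalization a real metric might apply.

```python
from typing import List, Sequence


def sample_exact_match(prediction: str, targets: Sequence[str]) -> float:
    """Return 1.0 if the prediction equals any target, else 0.0 (one score per sample)."""
    return 1.0 if any(prediction == target for target in targets) else 0.0


def mean_exact_match(predictions: Sequence[str], targets: Sequence[Sequence[str]]) -> float:
    """Average the per-sample scores so each sample counts exactly once."""
    scores: List[float] = [sample_exact_match(p, t) for p, t in zip(predictions, targets)]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

For example, `mean_exact_match(["Paris", "4"], [["Paris", "paris"], ["four", "five", "six"]])` averages one hit and one miss, even though the second sample has three targets.
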
- fixed belebele (#1267) · 9b0b15b1
  James A. Michaelov authored Jan 10, 2024
  
  9b0b15b1
- specify utf-8 encoding to save samples to file. (#1265) · 7264a2e0
  Baber Abbasi authored Jan 10, 2024
  
  7264a2e0
08 Jan, 2024 1 commit
- fixed fewshot loading for multiple input tasks (#1255) · cf6a8321
  Lintang Sutawika authored Jan 08, 2024
  
  cf6a8321
05 Jan, 2024 2 commits

Do not escape ascii in logging outputs (#1246) · 28ec7fa9

Sam Passaglia authored Jan 05, 2024



* do not ensure ascii

* Update __main__.py

---------
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

28ec7fa9
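
The effect of "do not ensure ascii" can be seen with the standard library's `json.dumps`. The record below is made-up sample data, not output from the harness.

```python
import json

record = {"prompt": "日本の首都は?", "answer": "東京"}

# Default behavior: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are kept as-is, which is easier to read in logs.
readable = json.dumps(record, ensure_ascii=False)
```

Both strings decode to the same object, so only readability changes. When writing the readable form to a file, the file should be opened with an encoding that can represent the characters, such as UTF-8.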

Add multilingual HellaSwag task (#1228) · 28bb45fb

JorgeDeCorte authored Jan 05, 2024



* add hellaswag_nl

* add other languages and update readme to hellaswag

* refactor as new task

* update readme

* add endline to yaml files and readme.md

* add group, change folder location and update yaml file

* rename default hellaswag yaml file

* fix whitespace error in some labels

* downgrade log level of whitespace checking

---------
Co-authored-by: JorgeDeCorte <jorge.decorte@ravago.be>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

28bb45fb

04 Jan, 2024 2 commits

Remove self.dataset_path post_init process (#1243) · e7c03d0c
Lintang Sutawika authored Jan 04, 2024
```
* Remove self.dataset_path post_init process

* Update task.py

* Update task.py
```
e7c03d0c

vllm: handle max_length better and substitute Collator (#1241) · eca6926b

Baber Abbasi authored Jan 04, 2024



* copies max_length from huggingface

* handle max_length properly

* get tokens from inputs

* substitute Collator for Reorderer

* `batch=auto` if using data_parallel

* nit

* cleanup

* update code comments

* `ray.shutdown()` after calling method if data_parallel_size > 1

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

eca6926b

02 Jan, 2024 3 commits
- Update openai_completions.py (#1238) · 25a15379
  Stella Biderman authored Jan 02, 2024
  
  25a15379
- batch_schedular bug in Collator (#1229) · 4d10ad56
  Baber Abbasi authored Jan 02, 2024
```
* auto-batch requires len of iter

* handle case when batch_size="auto:N"
```
  4d10ad56
- Update README.md (#1230) · a12ef445
  Pasquale Minervini authored Jan 02, 2024
  
  a12ef445
29 Dec, 2023 1 commit

Don't silence errors when loading tasks (#1148) · 34b563b1

Paul McCann authored Dec 30, 2023



* Add example failing task

This task includes an invalid import. This will cause an exception and
the task will not be loaded. But this just results in a DEBUG level log
message, so in normal usage you'll see no error, and will be told the
task doesn't exist.

Here's an example command line to run the task:

    python -m lm_eval --model hf --model_args pretrained=rinna/japanese-gpt-1b --tasks fail

This task is based on a Japanese Winograd task, but that's not
important, and was just used due to familiarity.

* Do not ignore errors when loading tasks

* Change how task errors are logged

This makes the proposed changes from PR discussion.

1. Exceptions not related to missing modules/imports are logged as
   warnings.

2. module/import related exceptions are still logged at debug level, but
   if any of them happen there is a warning about it with instructions
   on how to show logs.

* Remove intentionally failing task

---------
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

34b563b1
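
The two-level logging policy described in "Change how task errors are logged" can be sketched as follows. This is an illustration under assumptions, not the harness's loader: `load_modules` and the logger name are hypothetical, and plain module imports stand in for task loading.

```python
import importlib
import logging
from types import ModuleType
from typing import Dict, Iterable

logger = logging.getLogger("task_loader_sketch")


def load_modules(names: Iterable[str]) -> Dict[str, ModuleType]:
    """Import each module, logging failures instead of silently dropping them."""
    loaded: Dict[str, ModuleType] = {}
    import_failures = 0
    for name in names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as err:
            # Missing modules/imports stay at debug level but are counted.
            import_failures += 1
            logger.debug(f"skipping {name}: {err}")
        except Exception as err:
            # Anything else is surfaced as a warning right away.
            logger.warning(f"failed to load {name}: {err}")
    if import_failures:
        # One summary warning tells the user that details exist and how to see them.
        logger.warning(
            f"{import_failures} module(s) could not be imported; "
            "enable debug logging to see the details"
        )
    return loaded
```

The summary warning is the part that keeps users from being told a task does not exist when it actually failed to import.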