- 04 Sep, 2024 1 commit
Baber Abbasi authored
* default chat template method fix
* move chat_template to TemplateLM
* remove hotfix
* handle openai `chat_template`
* Update lm_eval/api/model.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* add 'max_tokens' to gen_kwargs
* pre-commit
Co-authored-by: KonradSzafer <szafer.konrad@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
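For the `max_tokens` change above, a minimal hedged sketch of how generation settings reach an API-backed model through `gen_kwargs` via the Python entry point; the model identifier, task, and values are illustrative placeholders, not taken from this commit:

```python
# Hedged sketch: forwarding generation kwargs (including max_tokens) to an
# API-backed model. Model/task names and values are illustrative placeholders;
# an OPENAI_API_KEY is expected in the environment for this backend.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4o-mini",               # assumed served-model identifier
    tasks=["gsm8k"],
    gen_kwargs="max_tokens=256,temperature=0",    # parsed and merged into each generation request
    limit=8,                                      # small smoke-test run
)
print(results["results"])
```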
- 28 Aug, 2024 1 commit
Hailey Schoelkopf authored
* fix revision type
* allow for None-input loglikelihood reqs to be cached
* handle no remaining cache items
* pre-commit
* change cache_hook.add_partial(loglikelihood_rolling...) convention
Co-authored-by: Baber Abbasi <baber@eleuther.ai>
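For context on the `cache_hook` convention touched above, a hedged sketch of how an LM implementation typically registers computed answers so the request cache can replay them on later runs; the class, helper, and request shapes here are illustrative, only the `add_partial` call pattern mirrors `lm_eval.api.model`:

```python
# Hedged sketch of the cache_hook convention referenced above. The class and
# the _score_rolling helper are hypothetical; only the add_partial call
# pattern follows lm_eval.api.model / lm_eval.models.huggingface.
class MyLM:
    def loglikelihood_rolling(self, requests):
        results = []
        for req in requests:
            (string,) = req.args                      # loglikelihood_rolling requests carry a single string
            loglikelihood = self._score_rolling(string)  # hypothetical scoring helper
            # register the answer under the method name + request args so a
            # CachingLM wrapper can look it up on the next run
            self.cache_hook.add_partial(
                "loglikelihood_rolling", (string,), loglikelihood
            )
            results.append(loglikelihood)
        return results
```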
- 22 Jul, 2024 1 commit
Baber Abbasi authored
* refactor pad_token handling to fn
* fix docs
* add pad_token_handling to vllm
* start on API superclass
* don't detokenize the returned logits
* streamline vllm tokenizer
* add type hints; pre-commit
* add model to init
* refactor api models; cleanup
* add pbar
* change optional dependencies
* json encode chat template
* deal with different prompt input requirements
* cache inside async
* add dummy retry
* handle imports; skip failing test
* add tests; add dependency to tests
* add package names to exception
* docs; type hints
* handle api key
* fix tokenizer bug
* add better error messages
* remove decorator
* CI: install api dep
* revert evaluator.py
* consolidate
* fix typealias
* Update lm_eval/models/api_models.py, lm_eval/models/openai_completions.py and lm_eval/models/anthropic_llms.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* add news section; add info for API
* fix bug: unpack loglikelihood requests
* fix bug: shared gen_kwargs mutated; handle copy properly
* Update README.md and api_models.py
* assorted nits, typo fixes, and fixups
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
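The refactor above introduces a shared API-model superclass (`lm_eval/models/api_models.py`) behind backends such as `local-completions`. A minimal hedged sketch of invoking one such backend; the server URL and model name are placeholders, and the exact set of accepted model_args may differ by version:

```python
# Hedged sketch: evaluating against an OpenAI-compatible completions server
# via the `local-completions` backend. base_url and model are placeholders.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="local-completions",
    model_args=(
        "model=my-model,"                                   # served model name (placeholder)
        "base_url=http://localhost:8000/v1/completions,"    # OpenAI-compatible endpoint (placeholder)
        "num_concurrent=1,"                                 # request concurrency
        "max_retries=3"                                     # retry transient failures
    ),
    tasks=["lambada_openai"],
    limit=16,                                               # small smoke-test run
)
print(results["results"])
```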
- 28 Jun, 2024 1 commit
Baber Abbasi authored
- 03 Jun, 2024 1 commit
KonradSzafer authored
* initial chat template
* tokenizer attribute check
* variable rename
* interface update
* system instruction
* system inst default update
* fewshot as multiturn
* typing update
* indent update
* added comments
* Adding a fewshot in a more readable way
* linting
* Moved apply chat template to LM
* multiturn alternation fix
* cache key update
* apply chat template method fix
* add system prompt hash to cache_key
* tokenizer name property for cache_key
* property name fix
* linting backward compatibility fix
* docs and errors update
* add documentation on adding chat template compatibility to model_guide
* fewshot as multiturn check fix
* saving system inst and chat template in results
* eval tracker update
* docs update
* Apply suggestions from code review (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
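The chat-template and fewshot-as-multiturn behavior above is exposed through evaluation-time options (on the CLI, --apply_chat_template, --fewshot_as_multiturn, and --system_instruction). A minimal hedged sketch of the same options from the Python API; the checkpoint and task are illustrative placeholders:

```python
# Hedged sketch: running an instruction-tuned model with its chat template,
# few-shot examples rendered as multi-turn conversation, and a system prompt.
# The checkpoint, task, and limit are illustrative placeholders.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,       # wrap requests in the tokenizer's chat template
    fewshot_as_multiturn=True,      # each shot becomes a user/assistant turn pair
    system_instruction="You are a careful math assistant.",
    limit=8,
)
```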
- 25 Mar, 2024 1 commit
Lintang Sutawika authored
* fix on --task list
* add fixes to tokenization
* differentiate encoding for seq2seq and decoder
* return token setting
* format for pre-commit
* Seq2seq fix, pt2 (#1630)
* getting model class only when defined
* encode_pair handles None, add_special_tokens turned into dict with default value
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
- 20 Mar, 2024 1 commit
Hailey Schoelkopf authored
* make vllm use prefix_token_id; have prefix_token_id be an optional method to define
* custom_prefix_token_id wasn't set if not passed
- 19 Mar, 2024 1 commit
Hailey Schoelkopf authored
This reverts commit b7923a84.
- 18 Mar, 2024 1 commit
kwrobel.eth authored
* use BOS token in loglikelihood
* improve comments
* add model arg
* log prefix token id
* Update lm_eval/api/model.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* change name to prefix_token_id
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
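The BOS and prefix-token options above are surfaced through model_args. A hedged sketch follows; the checkpoint is a placeholder, and `prefix_token_id` is only needed when overriding the default BOS/EOS choice (argument names are as described in these commits and may vary by version):

```python
# Hedged sketch: prepend a BOS token when scoring loglikelihoods, and
# optionally override the prefix token id explicitly. Checkpoint is a placeholder.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=EleutherAI/pythia-160m,"
        "add_bos_token=True,"     # prepend BOS for loglikelihood scoring
        "prefix_token_id=0"       # explicit override; otherwise BOS/EOS is used
    ),
    tasks=["lambada_openai"],
    limit=16,
)
```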
- 17 Mar, 2024 1 commit
Lintang Sutawika authored
* Differentiate _encode_pair setting for decoder and enc-dec models
* tok_decode to not skip special tokens so that eos doesn't become an empty string
* Update model.py and huggingface.py
* Update lm_eval/models/huggingface.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
- 13 Mar, 2024 1 commit
achervyakov authored
* add manual tqdm disabling management
* add typing to all new args
* apply precommit changes
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
- 06 Mar, 2024 1 commit
Hailey Schoelkopf authored
- 27 Feb, 2024 1 commit
Baber Abbasi authored
* change `all_gather` to `gather`
* add TaskOutput utility class
* Add FilterResults class and refactor task handling
* Rename `key` to `filter_key` for clarity
* Add `print_writeout` function in utils.py
* Add function to calculate limit size
* Add doc_iterator method to Task class
* Refactor `doc_iterator` and cleanup in Task class; remove superfluous bits
* fix and refactor the `gather` loop
* Refactor and simplify aggregate metrics calculation; remove unused code
* simplify the metrics calculation in `utils.py` and `evaluator.py`
* Fix group metric
* change evaluate to hf_evaluate
* add docs
* make isslice keyword-only
* nit: swap order of samples_metrics tuple
* move instance sorting outside loop
* Add __repr__ for ConfigurableTask
* Revert "nit" (reverts commit dab8d9977a643752a17f840fd8cf7e4b107df28f)
* fix some logging
* fix `predict_only` bug (thanks to `@LSinev`!)
* change `print_tasks` to `prepare_print_tasks`
* move eval utils
* added tqdm descriptions
* Update lm_eval/evaluator_utils.py (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* fix mgsm bug
* fix `build_all_requests`
* add ceil to limit
* pre-commit; assorted nits
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
- 26 Feb, 2024 2 commits
Aaron V authored
* Create a means for caching task registration and request building; add the ability to specify an args dict for simple_evaluate()
* Remove extra S in cache path in caching module (Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>)
* Rename requests cache args; make model_args polymorphic so that a dict can also be accepted
* Update docs to reflect new caching behavior; add CLI args for requests caching; create a function for deleting items in the cache
* Fix minor bug with arg parsing for requests caching where an undefined variable was used
* Remove line from gitignore; add to CLI for caching datasets
* Add hashing suffix to .pickles; fix test script typo
* Favor isinstance() over type() in evaluator.py
* Add tests for caching; remove unneeded arg from build_all_requests()
* Update arg description for simple_evaluate
* Update pyproject.toml; fix typehint
* Remove the use of random() for creating the default cache pickle hash
* Check that the cache dir exists before clearing it in request cache tests
* Fix linting and formatting issues; remove trailing whitespace; add newline to end of .gitignore
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
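The request caching added above can also be driven from the Python entry point (on the CLI it is the --cache_requests flag). A minimal hedged sketch; the cache directory environment variable name is an assumption worth checking against the docs, and the model/task are placeholders:

```python
# Hedged sketch: enabling the request cache so built requests are reloaded
# from disk on subsequent runs. Env var name and exact behavior are assumptions
# based on the PR description and may differ by version.
import os
from lm_eval import simple_evaluate

os.environ.setdefault("LM_HARNESS_CACHE_PATH", "/tmp/lm_eval_cache")  # assumed env var for the cache dir

results = simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["arc_easy"],
    cache_requests=True,   # reuse cached task requests when available
    limit=32,
)
```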
Hailey Schoelkopf authored
* add add_bos_token to HFLM
* add BOS token flag to other local model classes
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
- 22 Feb, 2024 1 commit
Anjor Kanekar authored
* loglikelihood refactor using TemplateLM
* linter
* fix whitespace in target + prompt for CoT gsm8k (#1275)
* Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (#1261)
* Allow parameter edits for registered tasks when listed in a benchmark (#1273)
  - benchmark yamls allow minor edits of already registered tasks
  - add documentation; remove print
* Fix data-parallel evaluation with quantized models (#1270)
  - add WIP device_map overrides
  - update handling outside of accelerate launcher
  - change .to(device) log to debug level
* Rework documentation for explaining local dataset (#1284)
* Re-add citation: Google Scholar has already noticed the updated citation block (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&authuser=2&q=%22A+framework+for+few-shot+language+model+evaluation%2C+12+2023%22&btnG=), so add it back in
* Update CITATION.bib (#1285): bump CITATION.bib to match re-adding the citation in the README (cc @StellaAthena)
* Update nq_open.yaml (#1289)
* Update README.md with custom integration doc (#1298)
* Update nq_open.yaml (#1305): change regex; bump NQ version
* Update task_guide.md (#1306)
* Update pyproject.toml (#1312)
* Fix polemo2_in.yaml config name (#1313)
* Update pyproject.toml (#1314)
* Fix group register (#1315)
  - tuple should be considered as well
  - set option to keep callable as callable
* Update task_guide.md (#1316)
* Update polemo2_in.yaml (#1318)
* don't pass extra kwargs to mamba any more (#1328)
* Fix issue regarding stderr (#1327): add fix for deciding if stderr is N/A or not; process N/A
* Add `local-completions` support using OpenAI interface (#1277)
  - refactor oa_completion
  - address tokenizer comments and change request chunks to batch size
  - add warning message for tiktoken backend
* fallback to classname when LM doesn't have config (#1334)
* fix a trailing whitespace that breaks a lint job (#1335)
* skip "benchmarks" in changed_tasks (#1336)
* Update migrated HF dataset paths (#1332): arc_easy.yaml, flan_cot.yaml, freeform.yaml
* Don't use `get_task_dict()` in task registration / initialization (#1331): using it as a helper will download the dataset!
* manage default (greedy) gen_kwargs in vllm (#1341): mirror HF `do_sample`; just need to set temp=0 for greedy
* modified default gen_kwargs to work better with CLI; changed prompt_logprobs=1 (#1345)
* update links to task_guide.md (#1348)
* `Filter` docs not offset by `doc_id` (#1349)
  - get `doc` from instance; accelerate bugfix: get ground doc from instance
  - convert filter to `process_result`
  - get docs from instances in `FilterEnsemble`
* Add FAQ on `lm_eval.tasks.initialize_tasks()` to README (#1330)
* Refix issue regarding stderr (#1357)
* Add causalLM OpenVINO models (#1290)
  - add intel optimum and document it in the README
  - rename optimum-causal to openvino-causal, then to openvino; pass device through super().__init__()
  - add openvino_device; modify path of IR file
  - remove `lm_eval.base` import; add optimum to tests dependencies
* Apply some best practices and guideline recommendations to code (#1363)
  - raise Exception, not a string (https://peps.python.org/pep-0352/#exception-hierarchy-changes, https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions)
  - apply the PEP 8 recommendation to prefer isinstance() over comparing types directly (https://peps.python.org/pep-0008/)
  - remove dangerous default mutable values in arguments (https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html)
  - format logging messages with f-strings rather than .format() (https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html, https://github.com/pylint-dev/pylint/issues/2395, https://stackoverflow.com/a/54368109)
  - specify utf-8 encoding for `open` explicitly, since the default varies by environment, OS, and Python version (https://peps.python.org/pep-0597/, https://docs.python.org/3.11/library/locale.html#locale.getencoding, https://docs.python.org/3.10/library/os.html#utf8-mode, https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html); this also helps when English-language task code is used as a template for non-English tasks
  - use inline-ignore comments to pass pre-commit instead of identity workarounds (https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors, https://www.flake8rules.com/rules/F841.html); flake8 comments are supported by ruff (https://docs.astral.sh/ruff/linter/#error-suppression)
* serialize callable functions in config (#1367)
* delay filter init; remove `*args` (#1369)
* Fix unintuitive `--gen_kwargs` behavior (#1329)
  - don't override do_sample if no value for it is passed
  - update gen_kwargs override condition; silence an erroneous warning
* Publish to PyPI (#1194)
* Make dependencies compatible with PyPI (#1378): make deps not point to GitHub URLs; only run PyPI publishing on tag pushes
* Add support for RWKV models with World tokenizer (#1374)
  - RWKV models with the World tokenizer do not allow the padding token to be configured and preset its value to 0, which fails the "if set" checks and crashes the tokenizer; a tokenizer class name check was added in addition to a model type check, since some RWKV models use the NeoX tokenizer
  - genericized so that any RWKVWorld tokenizer is supported, with a fallback in case the HF implementation name changes
* add bypass metric (#1156)
  - add task attributes if predict_only; add `predict_only` checks
  - added `override_metric`, `override_config` to `Task`
  - changed --predict_only to generations; change gen_kwargs warning
  - add note about `--predict_only` in README.md; move table to bottom
  - change null aggregation to bypass (conflict); bugfix; default `temp=0.0`
* lint; code review
* neuron optimum
* Mention TemplateLM in model_guide.md
* Update lm_eval/api/model.py
* fix linter and formatting
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Co-authored-by: Hannibal046 <38466901+Hannibal046@users.noreply.github.com>
Co-authored-by: Danielle Pintz <38207072+daniellepintz@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: thnkinbtfly <70014488+thnkinbtfly@users.noreply.github.com>
Co-authored-by: NoushNabi <33136068+NoushNabi@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: Eugene Cheah <PicoCreator@users.noreply.github.com>
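Since this merge centers on the TemplateLM/loglikelihood refactor, a minimal hedged sketch of what a custom model class registered with the harness looks like; method names follow `lm_eval.api.model.LM` as documented in model_guide.md, while the bodies are placeholders rather than a real backend:

```python
# Hedged sketch of a custom LM registered with the harness. The three request
# types mirror lm_eval.api.model.LM; the return values here are placeholders.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-backend")
class MyBackendLM(LM):
    def loglikelihood(self, requests):
        # one (logprob, is_greedy) pair per (context, continuation) request
        return [(0.0, True) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # one full-sequence logprob per request
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        # one generated string per (context, gen_kwargs) request
        return ["" for _ in requests]
```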
- 20 Dec, 2023 1 commit
Baber Abbasi authored
* add ruff and isort; remove black and flake8
* remove unnecessary dependencies; remove dependency from table
* change order; run ruff
* check 3.9
* exclude evaluator
* update CI workflow
* use ruff config in pyproject.toml
* add isort rules to ruff; sort imports
* import `make_table`
* try stages for no-commit-to-branch; change no-commit-to-branch to default
* turn on mypy for pre-commit
* tests, nits, and dependency fix
- 29 Nov, 2023 1 commit
baberabb authored
- 02 Nov, 2023 2 commits
lintangsutawika authored
lintangsutawika authored
- 18 Oct, 2023 2 commits
haileyschoelkopf authored
haileyschoelkopf authored
- 13 Sep, 2023 1 commit
baberabb authored
- 25 Aug, 2023 1 commit
Ethan Smith authored
This adds a bunch of simple annotations suggested by https://github.com/JelleZijlstra/autotyping.
- 07 Aug, 2023 1 commit
baberabb authored
- 16 Jul, 2023 2 commits
- 28 Jun, 2023 4 commits
haileyschoelkopf authored
haileyschoelkopf authored
Hailey Schoelkopf authored
Hailey Schoelkopf authored
- 23 Jun, 2023 1 commit
haileyschoelkopf authored
- 19 Jun, 2023 1 commit
haileyschoelkopf authored
- 07 Jun, 2023 1 commit
lintangsutawika authored
- 19 May, 2023 1 commit
lintangsutawika authored
- 02 May, 2023 1 commit
haileyschoelkopf authored
- 24 Apr, 2023 1 commit
haileyschoelkopf authored
- 19 Apr, 2023 1 commit
haileyschoelkopf authored