Commits · bd5091df8db7cea1a9f94f797fc11487f840ade1 · chenpangpang / transformers

05 Jun, 2024 1 commit

Reduce by 2 the memory requirement in `generate()`

(#30536) · bd5091df

Cyril Vallez authored Jun 05, 2024

* Fix contrastive_search for new cache structure, and improve performance by removing inefficient torch.stack(torch.split(x, top_k, dim=0))

* Fix _contrastive_search for non-standard cache using ellipsis slicing

* Fix all outputs.logits memory leaks for all decoding strategies!

* Fix small error in _contrastive_search()

* Make all necessary change and revert for the new class

* Apply coding style

* Remove pipes in type hints for compatibility

* correct type hint

* apply style

* Use DynamicCache by default and solve conflicts

* Fix rebase issues

* Add `_supports_dynamic_cache_class` in models for models that support DynamicCache but not other caches to make DynamicCache the default for more models

* Create generation config to return legacy format by default, or to choose not to

* style

* Fix case when use_cache is False

* Remove default DynamicCache in assisted_decoding if assistant_model does not support it + fix _seen_tokens when cropping cache

* Update prepare_inputs_for_generation() for case with empty DynamicCache

* Correct return of args in _assisted_decoding

* Remove EfficientDynamicCache as it is no longer needed

* Correct mistake in generation config

* Move cache logic of assisted decoding to AssistedCandidateGenerator.__init__

* change DynamicCache function names from "split" to "batch_split" for readability + apply coding style

* Remove `_supports_dynamic_cache_class` attribute after rebase

* Correct missing line lost in conflict resolution during rebasing

* Add special case for Jamba

* Fix jamba test

* Coding style

* coding style

* Correct missing import in rebasing

* Simplify _validate_model_kwargs based on removal of _supports_dynamic_cache attribute

* Simplify code paths in _contrastive_search

* coding style

* Update docstrings of cache methods

* Update prepare_inputs_for_generation() -> past_key_values are always Cache objects
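
The first bullet above mentions dropping `torch.stack(torch.split(x, top_k, dim=0))`. As a rough illustration (not necessarily the replacement used in this commit), a plain reshape gives the same tensor when `x` is contiguous and its leading dimension is `batch_size * top_k`. The shapes and variable names below are made up for the example.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, top_k, hidden = 2, 3, 5

# x holds top_k candidate rows per batch item, flattened along dim 0.
x = torch.arange(batch_size * top_k * hidden, dtype=torch.float32)
x = x.reshape(batch_size * top_k, hidden)

# Pattern named in the commit message: split into chunks of top_k rows,
# then stack them, which allocates a new tensor.
stacked = torch.stack(torch.split(x, top_k, dim=0))

# Equivalent result through a reshape, which is a view on contiguous input.
reshaped = x.reshape(batch_size, top_k, hidden)

same_values = torch.equal(stacked, reshaped)
reshape_is_view = reshaped.data_ptr() == x.data_ptr()
stack_is_copy = stacked.data_ptr() != x.data_ptr()
```

Both results have shape `(batch_size, top_k, hidden)`; only the stacked version copies the data.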


03 Jun, 2024 1 commit
- [`GemmaModel`] fix small typo (#31202) · 1749841a
  Arthur authored Jun 03, 2024
```
* fixes

* fix-copies
```
31 May, 2024 1 commit

Diff converter v2 (#30868) · 96eb0628

Arthur authored May 31, 2024

* current working example!

* commit regex and result file

* update

* nit

* push the conversion file

* oups

* roadmap and nits

* attempt diffs for 3 files

* persimmon

* nit

* add diff file that is the same as the modeling_llama.py

* fix rope nits

* updates

* updates with converted versions

* give some breathing space to the code

* delete

* update

* update

* push the actual result

* update regex patterns

* update regex patterns

* fix some issues

* fix some issues

* fix some issues

* updates

* updates

* updates

* updates

* updates

* revert changes done to llama

* updates

* update gemma

* updates

* oups

* current state

* current state

* update

* ouiiii

* nit

* clear diffs

* nit

* fixup

* update

* doc 🚀

* 🔥

* for now use gemma

* deal with comments

* style

* handle functions

* deal with assigns

* todos

* process inheritance

* keep decorators?

* 🤗

* deal with duplicates

* fixup

* correctly remove duplicate code

* run ruff post script

* ruff deals pretty well with imports, let's leave it to him

* ah maybe not lol

* for now remove all imports from child.

* nit

* conversion of llama

* okay

* convert starcoder2

* synch with main

* update llama diff

* updates

* https://docs.astral.sh/ruff/rules/redefined-while-unused/ fixes the imports, but needs a later version of ruff

* updates

* okay actual state

* non zero exit

* update!

* revert unrelated

* remove other diff files

* updates

* cleanup

* update

* less diff!

* stash

* current updates

* updates

* No need for call

* finished finding deps

* update

* current changes

* current state

* current state

* new status

* nit

* finally

* fixes

* nits

* order is now expected

* use logger info instead of prints

* fixup

* up

* nit

* update

* nits

* update

* correct merge

* update

* update

* update

* add warning

* update caution message

* update

* better merging strategy

* copy class statements :wink

* fixups

* nits

* update

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* nits

* smaller header

* do cleanup some stuff

* even simpler header?

* fixup

* updates

* ruff

* update examples

* nit

* TODO

* state

* OUUUUUUF

* current state

* nits

* final state

* add a readme

* fixup

* remove diff llama

* fix

* nit

* dummy noy funny

* ruff format tests src utils --check

* everless diffs

* less diffs and fix test

* fixes

* naming nit?

* update converter and add super example

* nits

* updated for function signatures

* update

* update

* add converted dummies

* autoformat

* single target assign fix

* fixup

* fix some imports

* fixes

* don't push them

* `# noqa: F841`

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


23 May, 2024 1 commit

Quantized KV Cache (#30483) · d583f131

Raushan Turganbay authored May 23, 2024



* clean-up

* Update src/transformers/cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fixup

* Update tests/quantization/quanto_integration/test_quanto.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update src/transformers/generation/configuration_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* more suggestions

* mapping if torch available

* run tests & add 'support_quantized' flag

* fix jamba test

* revert, will be fixed by another PR

* codestyle

* HQQ and versatile cache classes

* final update

* typo

* make tests happy

---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>


22 May, 2024 1 commit

update ruff version (#30932) · 673440d0

Arthur authored May 22, 2024



* update ruff version

* fix research projects

* Empty

* Fix errors

---------
Co-authored-by: Lysandre <lysandre@huggingface.co>


20 May, 2024 3 commits

Add torch.compile for Mistral (#30642) · 616bb11d

Longjie Zheng authored May 20, 2024

* first version

* fix sliding window

* fix style

* add sliding window cache

* fix style

* address comments

* fix test

* fix style

* move sliding window check inside cache init

* revert changes on irrelevant files & add comment on SlidingWindowCache

* address comments & fix style

fix style

* update causal mask

* [run-slow] mistral

* [run-slow] mistral

* [run-slow] mistral

* [run-slow] mistral

* [run-slow] mistral

* [run-slow] llama

* [run-slow] mistral

* [run-slow] mistral

* [run-slow] mistral

* revert CI from a10 to t4

* wrap up


Add support for torch.compile dynamic shapes (#30560) · cd6bd0af

Benjamin Warner authored May 20, 2024

* add torch.compile dynamic support

* Add SDPA dynamic shapes compile test & improve SDPA comment

* comment consistency


Add TokenClassification for Mistral, Mixtral and Qwen2 (#29878) · 07bf2dff

Joseph Enguehard authored May 20, 2024



* Add MistralForTokenClassification

* Add tests and docs

* Add token classification for Mixtral and Qwen2

* Save llama for token classification draft

* Add token classification support for Llama, Gemma, Persimmon, StableLm and StarCoder2

* Formatting

* Add token classification support for Qwen2Moe model

* Add dropout layer to each ForTokenClassification model

* Add copied from in tests

* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Propagate suggested changes

* Style

---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>


17 May, 2024 1 commit

Remove deprecated logic and warnings (#30743) · 57c965a8

amyeroberts authored May 17, 2024

* Remove deprecated logic and warnings

* Add back some code that seems to be important...

* Let's just add all the nllb stuff back; removing it is a bit more involved

* Remove kwargs

* Remove more kwargs


16 May, 2024 2 commits

Make `Gemma` work with `torch.compile` (#30775) · 1b3dba94

Yih-Dar authored May 16, 2024



* fix

* [run-slow] gemma

* add test

* add `test_compile_static_cache`

* fix

* style

* remove subprocess

* use attribute

* fix

* style

* update

* [run-slow] dbrx,gemma,jetmoe,phi3,recurrent_gemma

---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>


Cache: add new flag to distinguish models that support `Cache` but not static cache (#30800) · 9d889f87
Joao Gante authored May 16, 2024
```
* jamba cache

* new flag

* generate exception
```

15 May, 2024 1 commit

Fix llama model sdpa attention forward function masking bug when output_attentions=True (#30652) · 4b3eb19f

Edoardo Cetin authored May 15, 2024



* Fix llama model forward function with attention=True, same-length encoded sequence.

* Fix style

* propagate fix to modeling_cohere, gemma, dbrx, and olmo (which copy the same sdpa masking logic from llama)

* Fix style

* ignore unnecessary sdpa mask converter when output_attentions=True

* add tests checking sdpa and eager outputs match when output_attentions=True

* Split if statements in two lines
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Fix formatting

* Add fix to new jetmoe model

* Add missing output_attentions argument to jetmoe mask creation

---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>


14 May, 2024 1 commit

Add PaliGemma (#30814) · 1360801a

Pablo Montalvo authored May 14, 2024



* add new model like

* add state dict slicing + new model config

* update palma config and weights, passes vision activations

* fix

* update

* reorder loading/unpacking

* clean up

* add debug statements

* change device

* fix

* debugging

* fix noncausal mask

* fixup sdpa + causal mask

* fix activation function

* remove debug before changing modeling file

* add variants

* debug attention mask in generate

* revert to non-debug sdpa

* revert gemma modifications

* add custom language modeling

* use Processor

* add language modeling file to init

* try thin wrapper around generate

* Update

* update mask

* breakpoints galore

* remove conflict

* switch to left-padding

* add incomplete model doc

* add paligemma global files

* batch rename paligemma

* make generation match outputs and captioning

* style

* style

* remove copied from + doc

* remove more copied from

* remove copy from projector

* minor fix

* update config and style

* add readme - dummy

* CORRECT image captioning

* moving to args

* add siglip proper + fix merging image + text features

* take update_causal_mask from upstream

* remove breakpoint

* leverage AutoModel

* fix input_ids slicing

* make siglip head conditional

* remove encoder_decoder value

* remove unneeded modeling file

* add commented 4d attention mask

* FIXED generation with 4D mask

* Update src/transformers/models/siglip/modeling_siglip.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix left padding detection

* shuffle order of verifications

* fix missing labels for training

* fix

* vectorize merging of features, improve slicing

* improve testing before conversion

* handle merging in processor

* image token index depends on checkpoint

* add variants, save processor too

* save processors, base tokenizer off spm file

* expand model embeddings due to additional image token

* pass image processing args

* add convert rgb to siglip processor

* add \n token separately

* fix tokenizer and prompts

* fix docstrings

* change to camel

* fix casing

* debug pos_ids and sdpa

* pass and use cache_position

* add flag for newline tokenization

* Update src/transformers/models/paligemma/processing_paligemma.py
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>

* simplify conversion script

* add copied from

* add precision to conversion script

* Update src/transformers/models/paligemma/modeling_paligemma.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* clean up

* Shift attention mask from `1:`

After discussion with @molbap

* add docs, fix quality

* quality, tied weights inheritance, and logits/label alignment

* fix more tests

* pass attn_implementation to language model correctly

* add SiglipVisionTransformer to no split modules

* skip paligemma test for sdpa dispatch to flash

* skip incompatible tests

* quality

* [broken archive maps]

* Apply suggestions

- remove archive lists
- style
- take shape of inputs_embeds for batch
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/utils/dummy_pt_objects.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* simplify conversion script

* add suggestions

* add suggestions

* add copied from

* fix

* move labels out

* revert

* fix

* remove placeholder labels if None

* use cache_position

* fix quality + docstrings

* fix quality

* fix paligemma 4d gemma mask incompatibility

* fix config docstring

* fix query and attn_mask dtype

---------
Co-authored-by: ArthurZucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>


13 May, 2024 1 commit

Llama: fix custom 4D masks, v2 (#30348) · a0779b9e

Poedator authored May 13, 2024



* 4d mask fixes

* Update custom 4D mask logic

* test moved to mixin

* extra tests 4d mask

* upd 4d mask and StaticCache handling

* added Mask4DTestHard to mistral tests

* post-rebase fixes

* test fixes for StaticCache

* make fix-copies

* upd 1 after #30476

* fix common tests

* rm elif attention_mask.dim() == 4:

* tests combined, fixed, mixtral supported

* bigbird style chg reverted

* rm if attention_mask.dim() == 2

* modeling_llama formatting chg

---------
Co-authored-by: Joao Gante <joao@huggingface.co>


09 May, 2024 1 commit
- KV cache is no longer a model attribute (#30730) · 5413b898
  Raushan Turganbay authored May 09, 2024
```
kv_cache is no longer a model attribute
```
08 May, 2024 1 commit
- Cache: models return input cache type (#30716) · f26e4073
  Joao Gante authored May 08, 2024
  
02 May, 2024 1 commit
- Fix for Neuron (#30259) · fbabd674
  Michael Benayoun authored May 02, 2024
  
01 May, 2024 1 commit

Gemma: update activation warning (#29995) · f4f18afd

Pedro Cuenca authored May 01, 2024

* Gemma: only display act. warning when necessary

This is a nit PR, but I was confused. I got the warning even after I
had changed `hidden_act` to `gelu_pytorch_tanh`, telling me that I
was using the "legacy" `gelu_pytorch_tanh`.

Another option is to keep the warning but change the message to say
something like "`hidden_act` is ignored, please use `hidden_activation`
instead. Setting Gemma's activation function to `gelu_pytorch_tanh`".

* Change message, and set `config.hidden_activation`
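
The behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, assuming the warning should fire only when `hidden_activation` was left unset; `resolve_activation` and the `SimpleNamespace` configs are hypothetical stand-ins, not the library's actual code.

```python
import warnings
from types import SimpleNamespace

DEFAULT_ACTIVATION = "gelu_pytorch_tanh"


def resolve_activation(config):
    """Hypothetical helper: return the activation name, warning only if unset."""
    if getattr(config, "hidden_activation", None) is None:
        warnings.warn(
            "`hidden_act` is ignored, please use `hidden_activation` instead. "
            f"Setting the activation function to `{DEFAULT_ACTIVATION}`."
        )
        # Store the choice on the config so the warning is not repeated.
        config.hidden_activation = DEFAULT_ACTIVATION
    return config.hidden_activation


# A config that only sets the legacy field, and one that sets the new field.
legacy = SimpleNamespace(hidden_act="gelu", hidden_activation=None)
explicit = SimpleNamespace(hidden_act="gelu", hidden_activation="gelu_pytorch_tanh")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_choice = resolve_activation(legacy)
    warnings_after_legacy = len(caught)
    explicit_choice = resolve_activation(explicit)
    warnings_after_explicit = len(caught)
```

In this sketch only the legacy config triggers a warning; a config that already sets `hidden_activation` passes through silently.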


30 Apr, 2024 1 commit
- Cache: Static cache as a standalone object (#30476) · 75bbfd5b
  Joao Gante authored Apr 30, 2024
  
29 Apr, 2024 1 commit

Reenable SDPA's FA2 During Training with torch.compile (#30442) · 9df8b301

Benjamin Warner authored Apr 29, 2024

* Reenable SDPA's FA2 during training with torch.compile

* fix Olmo's SDPA FA2 dispatching too

* update formatting

* improved SDPA comment

* formatting and explanatory comment

* is_causal if statement to one-liner


22 Apr, 2024 1 commit
- `Llama` family, fix `use_cache=False` generation (#30380) · 2d92db84
  Arthur authored Apr 22, 2024
```
* nit to make sure cache positions are not sliced

* fix other models

* nit

* style
```
18 Apr, 2024 2 commits

FIX: Fixes unexpected behaviour for Llava / LLama & AWQ Fused modules + revert... · 5728b5ad

Younes Belkada authored Apr 18, 2024

FIX: Fixes unexpected behaviour for Llava / LLama & AWQ Fused modules + revert #30070 at the same time (#30317)

* Update awq.py

* style

* revert felix PR

* fix

* add felix comments


Revert "Re-enable SDPA's FA2 path (#30070)" (#30314) · acab997b

Arthur authored Apr 18, 2024

* Revert "Re-enable SDPA's FA2 path (#30070)"

This reverts commit 05bdef16.

* Revert "Fix quality Olmo + SDPA (#30302)"

This reverts commit ec92f983.


17 Apr, 2024 1 commit

Re-enable SDPA's FA2 path (#30070) · 05bdef16

fxmarty authored Apr 17, 2024



* tentatively re-enable FA2 + SDPA

* better comment

* _ignore_causal_mask_sdpa as staticmethod

* type hints

* use past_seen_tokens instead

* enable copied from for sdpa

* ruff

* llama simplifications on review

* remove unnecessary self.is_causal check

* fix copies

* cleaning

* precise message

* better doc

* add test

* simplify

* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* style

---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>


05 Apr, 2024 1 commit

Fix `torch.fx` symbolic tracing for LLama (#30047) · 17cd7a9d

Michael Benayoun authored Apr 05, 2024

* [WIP] fix fx

* [WIP] fix fx

* [WIP] fix fx

* [WIP] fix fx

* [WIP] fix fx

* Apply changes to other models


30 Mar, 2024 1 commit
- [`BC`] Fix BC for AWQ quant (#29965) · 6e584070
  TechxGenus authored Mar 31, 2024
```
fix awq quant
```
28 Mar, 2024 1 commit
- [`BC`] Fix BC for other libraries (#29934) · 2bbbf1be
  Arthur authored Mar 28, 2024
```
* fix bc?

* nit
```
22 Mar, 2024 1 commit
- [`cleanup`] vestiges of causal mask (#29806) · 2e7cb46f
  Arthur authored Mar 22, 2024
```
nit
```
21 Mar, 2024 1 commit
- Llama: always convert the causal mask in the SDPA code path (#29663) · ee38fc31
  Joao Gante authored Mar 21, 2024
```
* always convert the mask

* rebase and fix copies
```
20 Mar, 2024 1 commit

[`BC 4.37 -> 4.38`] for Llama family, memory and speed (#29753) · ff841900

Arthur authored Mar 21, 2024

* attempt to fix

* the actual fix that works with compilation!

* this?

* temporary update

* nit?

* dispatch to memory efficient?

* update both models that have static cache support

* fix copies fix compile

* make sure fix

* fix cohere and gemma

* fix beams?

* nit

* slipped through the cracks

* nit

* nits

* update

* fix-copies

* skip failing tests

* nits


19 Mar, 2024 2 commits

Llama: partial 4d masks (#29731) · 4294f0c3

Joao Gante authored Mar 19, 2024



* partial 4d masks

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>


[`Gemma`] final fixes to the modeling (#29729) · 8e2fc52e

Arthur authored Mar 20, 2024



* gelu_pytorch_tanh

* Force config.hidden_act to be approx gelu

* Gemma bug fixes

* force_use_exact_gelu

* Update configuration_gemma.py

* Update modeling_gemma.py

* update

* update for simpler handling

* nit

* nit

* fixup

* update

* also update the jax modeling!

* add `"gelu_pytorch_tanh": partial(nn.gelu, approximate=True),`

* fixup

* fix order

* act vs act_fn

---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>


14 Mar, 2024 1 commit
- Generate: handle `cache_position` update in `generate` (#29467) · 23db187d
  Joao Gante authored Mar 14, 2024
  
13 Mar, 2024 1 commit
- Llama: allow custom 4d masks (#29618) · 1e21c4fb
  Joao Gante authored Mar 13, 2024
  
08 Mar, 2024 1 commit

StableLM: Fix dropout argument type error (#29236) · f386c51a

liangjs authored Mar 08, 2024



* fix stablelm dropout argument type error

* fix docs of _flash_attention_forward

* fix all docs of _flash_attention_forward

* fix docs of _flash_attention_forward in starcoder2

---------
Co-authored-by: oliang <oliang@tencent.com>


06 Mar, 2024 2 commits

Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS device (#29439) · d45f47ab

Park Jun authored Mar 07, 2024



* Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS devices

* Update src/transformers/models/gemma/modeling_gemma.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update llama and gemma rope use cpu in mps device

---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
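
A minimal sketch of the idea in this commit, assuming the approach is to compute the rotary angles in float32 with autocast disabled and to pass `"cpu"` as the autocast device type when the tensors live on MPS. The function name `rotary_cos_sin` and the sizes are hypothetical, and the device type is passed in explicitly so the example runs on any machine.

```python
import torch


def rotary_cos_sin(position_ids, inv_freq, device_type):
    """Hypothetical helper: cos/sin tables for rotary embeddings in float32."""
    # Assumption: "mps" is not used as an autocast device type, so use "cpu".
    autocast_device = device_type if device_type != "mps" else "cpu"
    with torch.autocast(device_type=autocast_device, enabled=False):
        # Outer product of positions and inverse frequencies, forced to float32.
        freqs = position_ids[:, :, None].float() * inv_freq[None, None, :].float()
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()


dim = 8  # illustrative head dimension
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
position_ids = torch.arange(6)[None, :]  # shape (batch=1, seq_len=6)

# Pretend the model runs on MPS; the tensors here are ordinary CPU tensors.
cos, sin = rotary_cos_sin(position_ids, inv_freq, device_type="mps")
```

Keeping this computation out of autocast avoids reduced-precision angles, which matters more as position indices grow.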


Substantially reduce memory usage in _update_causal_mask for large batches by... · 2a939f20

Glen Taggart authored Mar 06, 2024


Substantially reduce memory usage in _update_causal_mask for large batches by using .expand instead of .repeat [needs tests+sanity check] (#29413)

* try to fix gemma mem use

* fix: handle attention mask dim==2 case

* remove logits=logits.float()

* clean up + add llama

* apply formatting

* readability edit: swap order of items being multiplied

* revert change unrelated to PR

* revert black autoformat

* switch to one .to

* Accept style edits
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
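
The memory saving named in the commit title comes from the difference between `.repeat`, which copies data for every batch row, and `.expand`, which returns a broadcast view over the same storage. The snippet below is only an illustration with made-up sizes and a stand-in causal mask, not the code from `_update_causal_mask`.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len = 4, 8

# Stand-in causal mask: 0 where attention is allowed, a large negative
# value strictly above the diagonal.
min_value = torch.finfo(torch.float32).min
causal = torch.triu(torch.full((seq_len, seq_len), min_value), diagonal=1)

# .repeat materialises batch_size copies of the mask.
repeated = causal[None, None, :, :].repeat(batch_size, 1, 1, 1)

# .expand only changes the strides; no new memory is allocated.
expanded = causal[None, None, :, :].expand(batch_size, 1, -1, -1)

same_values = torch.equal(repeated, expanded)
batch_stride_expanded = expanded.stride(0)   # 0: every batch row aliases one mask
batch_stride_repeated = repeated.stride(0)   # seq_len * seq_len: separate rows
shares_storage = expanded.data_ptr() == causal.data_ptr()
```

Because all batch rows of the expanded tensor alias the same memory, an in-place write (for example, merging a per-sample padding mask) needs a `.clone()` first.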


01 Mar, 2024 1 commit
- [`Llama + AWQ`] fix `prepare_inputs_for_generation` 🫠 (#29381) · e7b98370
  Arthur authored Mar 01, 2024
```
* use the generation config 🫠

* fixup
```
28 Feb, 2024 2 commits
- Better SDPA unmasking implementation (#29318) · 49204c1d
  fxmarty authored Feb 28, 2024
```
* better unmask imple

* comment

* typo

* bug report pytorch

* cleanup

* fix import

* add back example

* retrigger ci

* come on
```
- check if position_ids exists before using it (#29306) · 554e7ada
  jiqing-feng authored Feb 28, 2024
```
Co-authored-by: Joao Gante <joao@huggingface.co>
```