- 23 Jul, 2024 1 commit
Joao Gante authored
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

- 14 Jul, 2024 1 commit
Joao Gante authored
* tmp commit
* shorter
* nit
* explicit kwargs
* propagate changes
* mass propagation with a few manual touches (let's see how CI behaves)
* fix cacheless case
* Update src/transformers/generation/utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* make fixup
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

- 11 Jul, 2024 1 commit
Arthur authored
* dumb commit
* nit
* update
* something like this
* unpack in modeling utils
* safe import
* oops
* update
* nits
* diff convert gemma
* update
* start propagating
* update other modeling code as well
* update for sliding window models
* nits
* more init cleanups
* styling
* fixup
* noice
* pass fixup
* typo typing_extension -> typing_extensions
* torch.nn.functionnal -> torch.nn.functional
* add to import structure
* unpack
* simplify a bit more for this first version
* nit
* update
* update
* nit
* ease the import of `Unpack`
* remove useless `use_sliding_window`
* no qua please
* protect import?
* style
* [run-slow]
* [run slow] llama,gemma,mistral,mixtral
* remove extra kwargs
* fix llama
* address review comments
* apply diff_model_converter to modeling_gemma.py
* remove cache_position 1
* remove cache_position 2
* some cleaning
* refactor gemma2 as well
* apply review comments
* rename file to modeling_flash_attention_utils.py
* siglip refactor
* remove dead code
* is the hub down?
* still down?
* fix siglip
* fix gemma2
* fatal: Could not read from remote repository.
* fix typo in softcap implem
* flaky
* Failed: Timeout >120.0s
---------
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>

- 07 Jun, 2024 1 commit
Cyril Vallez authored
* Fix jetmoe model
* Remove skip-tests

- 23 May, 2024 1 commit
Benjamin Warner authored
add torch.compile dynamic support

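As background, a minimal sketch of what dynamic-shape compilation looks like in user code (an illustration under our own assumptions, not this commit's diff; `attention_scores` is a made-up stand-in):

```python
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for a transformer op whose sequence length varies per call.
    return torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)

# dynamic=True asks the compiler for shape-polymorphic graphs, so the changing
# sequence lengths below reuse one compiled artifact instead of recompiling.
compiled = torch.compile(attention_scores, dynamic=True)

for seq_len in (8, 16, 32):
    q = torch.randn(1, seq_len, 64)
    print(compiled(q, q).shape)
```
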
- 16 May, 2024 1 commit
Yih-Dar authored
* fix
* [run-slow] gemma
* add test
* add `test_compile_static_cache`
* fix
* style
* remove subprocess
* use attribute
* fix
* style
* update
* [run-slow] dbrx,gemma,jetmoe,phi3,recurrent_gemma
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

- 15 May, 2024 1 commit
Edoardo Cetin authored
* Fix llama model forward function with attention=True, same-length encoded sequence.
* Fix style
* propagate fix to modeling_cohere, gemma, dbrx, and olmo (which copy the same sdpa masking logic from llama)
* Fix style
* ignore unnecessary sdpa mask converter when output_attentions=True
* add tests checking sdpa and eager outputs match when output_attentions=True
* Split if statements in two lines Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Fix formatting
* Add fix to new jetmoe model
* Add missing output_attentions argument to jetmoe mask creation
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

- 14 May, 2024 1 commit
Yikang Shen authored
* init jetmoe code
* update archive maps
* remove flax import
* fix import error
* update README
* ruff fix
* update readme
* fix
* update config
* fix issue
* merge files
* fix model bug
* fix test
* auto fix
* model size
* add comments
* fix form
* add flash attention support
* fix attention head number
* fix init
* fix support list
* sort auto mapping
* fix test
* fix docs
* update test
* fix test
* fix test
* change variable name
* fix config
* fix init
* update format
* clean code
* fix config
* fix config
* change default config
* update config
* fix issues
* update format
* update config argument
* update format
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* change to mixtral aux loss
* change to cache_position
* debug
* fix bugs
* debug
* fix format
* fix format
* fix copy
* fix format
* fix format
* fix sort
* fix sort
* fix sort
* add copy comment
* add copy from
* remove debug code
* revert readme update
* add copy
* debug
* remove debug code
* fix flash attention
* add comments
* clean code
* clean format
* fix format
* fix format
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* change variable name
* add copied from
* fix variable name
* remove deprecated functions
* sync to llama implementation
* fix format
* fix copy
* fix format
* update format
* remove repr
* add comment for moe weight
* fix copy
* Update src/transformers/models/jetmoe/configuration_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* add comments and reformat config
* fix format
* fix format
* fix format
* update test
* update doc string in config
* Update src/transformers/models/jetmoe/modeling_jetmoe.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* update config doc
* update attention cache
* fix format
* fix copy
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

- 13 May, 2024 1 commit
Poedator authored
* 4d mask fixes
* Update custom 4D mask logic
* test moved to mixin
* extra tests 4d mask
* upd 4d mask and StaticCache handling
* added Mask4DTestHard to mistral tests
* post-rebase fixes
* test fixes for StaticCache
* make fix-copies
* upd 1 after #30476
* fix common tests
* rm elif attention_mask.dim() == 4:
* tests combined, fixed, mixtral supported
* bigbird style chg reverted
* rm if attention_mask.dim() == 2
* modeling_llama formatting chg
---------
Co-authored-by: Joao Gante <joao@huggingface.co>

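For orientation, a minimal sketch of passing a custom 4D attention mask to a causal LM (our illustration, not this commit's test code; the checkpoint name is an assumption, and in the versions following this change the 4D mask is expected in additive form, 0.0 where attention is allowed and dtype-min where blocked):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM-135M"  # assumed checkpoint; any Llama-style model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
n = ids.shape[1]

# Shape (batch, 1, query_len, kv_len). Here we hand-build an ordinary causal
# mask, but any pattern (e.g. packed sequences that must not attend across
# document boundaries) can be encoded the same way.
mask = torch.full((1, 1, n, n), torch.finfo(model.dtype).min)
mask = mask.triu(diagonal=1)  # 0.0 on and below the diagonal, -inf-like above

out = model(input_ids=ids, attention_mask=mask)
print(out.logits.shape)
```
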
- 09 May, 2024 1 commit
Raushan Turganbay authored
kv_cache is no longer a model attribute

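For context, a hedged sketch of the general direction this points in (our illustration, not this commit's diff; the checkpoint name is an assumption): the key-value cache is created by the caller and threaded through the forward call rather than stored on the model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

name = "HuggingFaceTB/SmolLM-135M"  # assumed small Llama-style checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Hello", return_tensors="pt")
cache = DynamicCache()  # caller-owned cache object, not a model attribute
with torch.no_grad():
    out = model(**inputs, past_key_values=cache, use_cache=True)
print(type(out.past_key_values).__name__)  # the cache comes back in the output
```
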
- 08 May, 2024 1 commit
Joao Gante authored

- 03 May, 2024 1 commit
Mayank Mishra authored
* add bias
* fix quality

- 02 May, 2024 1 commit
Michael Benayoun authored

- 30 Apr, 2024 1 commit
Joao Gante authored

- 29 Apr, 2024 1 commit
Benjamin Warner authored
* Reenable SDPA's FA2 during training with torch.compile
* fix Olmo's SDPA FA2 dispatching too
* update formatting
* improved SDPA comment
* formatting and explanatory comment
* is_causal if statement to one-liner

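For readers unfamiliar with the dispatch in question, a rough sketch under our own assumptions (not the repo's code; `sdpa_attention` is illustrative) of how an SDPA layer keeps flash-attention eligibility by passing `is_causal` as a plain boolean instead of materializing a mask:

```python
import torch
import torch.nn.functional as F

def sdpa_attention(q, k, v, attention_mask=None):
    # When no explicit mask is needed, pass attn_mask=None with is_causal=True:
    # this leaves SDPA free to pick its flash-attention kernel. Computing the
    # flag as a one-liner (rather than data-dependent branching) also stays
    # friendly to torch.compile graph capture.
    is_causal = attention_mask is None and q.shape[2] > 1
    return F.scaled_dot_product_attention(
        q, k, v, attn_mask=attention_mask, is_causal=is_causal
    )

q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, seq, head_dim)
print(sdpa_attention(q, k, v).shape)
```
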
- 22 Apr, 2024 1 commit
Arthur authored
* nit to make sure cache positions are not sliced
* fix other models
* nit
* style

- 18 Apr, 2024 2 commits
Younes Belkada authored
FIX: Fixes unexpected behaviour for Llava / Llama & AWQ Fused modules + revert #30070 at the same time (#30317)
* Update awq.py
* style
* revert felix PR
* fix
* add felix comments

- 17 Apr, 2024 1 commit
fxmarty authored
* tentatively re-enable FA2 + SDPA
* better comment
* _ignore_causal_mask_sdpa as staticmethod
* type hints
* use past_seen_tokens instead
* enable copied from for sdpa
* ruff
* llama simplifications on review
* remove unnecessary self.is_causal check
* fix copies
* cleaning
* precise message
* better doc
* add test
* simplify
* Update src/transformers/models/llama/modeling_llama.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/llama/modeling_llama.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/llama/modeling_llama.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* style
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

- 05 Apr, 2024 1 commit
Michael Benayoun authored
* [WIP] fix fx
* [WIP] fix fx
* [WIP] fix fx
* [WIP] fix fx
* [WIP] fix fx
* Apply changes to other models

- 30 Mar, 2024 1 commit
TechxGenus authored
fix awq quant

- 28 Mar, 2024 1 commit
Arthur authored
* fix bc?
* nit

- 21 Mar, 2024 1 commit
Joao Gante authored
* always convert the mask
* rebase and fix copies

- 20 Mar, 2024 1 commit
Arthur authored
* attempt to fix
* the actual fix that works with compilation!
* this?
* temporary update
* nit?
* dispatch to memory efficient?
* update both models that have static cache support
* fix copies fix compile
* make sure fix
* fix cohere and gemma
* fix beams?
* nit
* slipped through the cracks
* nit
* nits
* update
* fix-copies
* skip failing tests
* nits

- 19 Mar, 2024 1 commit
Joao Gante authored
* partial 4d masks
* Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

- 14 Mar, 2024 1 commit
Joao Gante authored

- 13 Mar, 2024 1 commit
Joao Gante authored

- 08 Mar, 2024 1 commit
liangjs authored
* fix stablelm dropout argument type error
* fix docs of _flash_attention_forward
* fix all docs of _flash_attention_forward
* fix docs of _flash_attention_forward in starcoder2
---------
Co-authored-by: oliang <oliang@tencent.com>

- 07 Mar, 2024 1 commit
Joao Gante authored

- 06 Mar, 2024 2 commits
Park Jun authored
* Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS devices
* Update src/transformers/models/gemma/modeling_gemma.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update llama and gemma rope use cpu in mps device
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

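As a rough illustration of the pattern (a sketch under our own assumptions, not the committed code; `rope_cos_sin` is a made-up helper), the rotary tables are computed with autocast disabled so they stay in float32 even inside a mixed-precision region, falling back to the cpu autocast context on MPS:

```python
import torch

def rope_cos_sin(position_ids: torch.Tensor, inv_freq: torch.Tensor):
    device_type = position_ids.device.type
    # autocast support on "mps" is limited, so fall back to the cpu context
    device_type = "cpu" if device_type == "mps" else device_type
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = torch.outer(position_ids.float().flatten(), inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

inv_freq = 1.0 / (10000 ** (torch.arange(0, 64, 2).float() / 64))
cos, sin = rope_cos_sin(torch.arange(16)[None, :], inv_freq)
print(cos.shape, cos.dtype)  # torch.Size([16, 64]) torch.float32
```
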
Glen Taggart authored
Substantially reduce memory usage in _update_causal_mask for large batches by using .expand instead of .repeat [needs tests+sanity check] (#29413)
* try to fix gemma mem use
* fix: handle attention mask dim==2 case
* remove logits=logits.float()
* clean up + add llama
* apply formatting
* readability edit: swap order of items being multiplied
* revert change unrelated to PR
* revert black autoformat
* switch to one .to
* Accept style edits Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

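To make the memory argument concrete, a small standalone sketch (ours, not the PR's code): `.expand` returns a broadcasted view that shares the single mask's storage, while `.repeat` materializes one copy per batch element.

```python
import torch

batch, seq = 8, 1024
# Additive causal mask: -inf above the diagonal, 0 elsewhere.
causal = torch.full((seq, seq), float("-inf")).triu(diagonal=1)

# .repeat allocates a full copy of the mask for every batch element.
repeated = causal[None, None, :, :].repeat(batch, 1, 1, 1)
print(repeated.nelement() * repeated.element_size())  # 33554432 bytes (~32 MiB)

# .expand returns a view: no per-batch allocation at all.
expanded = causal[None, None, :, :].expand(batch, 1, seq, seq)
print(expanded.data_ptr() == causal.data_ptr())  # True: shared storage
```

The catch the PR title's "needs tests+sanity check" hints at: an expanded view aliases the same memory across the batch, so downstream code must not write into the mask in place without copying first.
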
- 01 Mar, 2024 2 commits
Arthur authored
* use the generation config
* fixup

Leon Engländer authored
* LlamaForQuestionAnswering self.transformer->self.model
* fix "Copied from" string
* Llama QA model: set base_model_prefix = "transformer"

- 28 Feb, 2024 4 commits
fxmarty authored
* better unmask imple
* comment
* typo
* bug report pytorch
* cleanup
* fix import
* add back example
* retrigger ci
* come on

jiqing-feng authored
Co-authored-by: Joao Gante <joao@huggingface.co>

Daniel Han authored
* Update modeling_llama.py Llama - Force float32 since bfloat16 loses precision on long contexts
* Update modeling_llama.py
* Update modeling_gemma.py Fix RoPE and logits.float()
* @torch.no_grad()
* @torch.no_grad()
* Cos, Sin to float32
* cos, sin to float32
* Update src/transformers/models/gemma/modeling_gemma.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/llama/modeling_llama.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Resolve PR conflicts
* Fix RoPE for llama
* Revert "Fix RoPE for llama" This reverts commit b860a22dab9bb01cd15cb9a3220abeaefad3e458.
* Fix RoPE for llama
* RoPE device
* Autocast device type
* RoPE
* RoPE isinstance
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

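A quick numerical illustration (ours, not the PR's) of why the cos/sin tables are kept in float32: bfloat16 has only 8 significant bits, so nearby long-context positions collapse onto the same representable value and the rotary angles drift.

```python
import torch

positions = torch.arange(4090, 4096, dtype=torch.float32)
print(positions)                     # six distinct consecutive positions
print(positions.to(torch.bfloat16))  # several collapse to the same bf16 value

inv_freq = 1.0 / (10000 ** (torch.arange(0, 8, 2, dtype=torch.float32) / 8))
angles32 = torch.outer(positions, inv_freq)
angles16 = torch.outer(positions.to(torch.bfloat16), inv_freq.to(torch.bfloat16))
# The resulting cos table error is visible, not just rounding noise:
print((angles32.cos() - angles16.float().cos()).abs().max())
```
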
Arthur authored
* remove control flow
* update gptneox
* update ....
* nits
* Actually let's just break. Otherwise we are silently failing which imo is not optimal
* version BC
* fix tests
* fix eager causal
* nit
* add a test
* style
* nits
* nits
* more nits for the test
* update and fix
* make sure cuda graphs are not skipped
* read token is needed for meta llama
* update!
* fixup
* compile test should be slow
* fix the fix copies
* style

- 27 Feb, 2024 1 commit
Andrei Panferov authored
Cleaner Cache `dtype` and `device` extraction for CUDA graph generation for quantizers compatibility (#29079)
* input_layernorm as the beacon of hope
* cleaner dtype extraction
* AQLM + CUDA graph test
* is available check
* shorter text test

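For context on "input_layernorm as the beacon of hope", a hedged sketch of the idea (all names here are illustrative, not the repo's code): when allocating a static cache for a quantized model, a layer that quantizers never touch, such as an input layernorm, is a safer source of `dtype`/`device` than the possibly integer-packed linear weights.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(64)  # stays floating point after quantization
        self.q_proj = nn.Linear(64, 64)          # may be swapped for packed int weights

block = Block().half()
# Simulate a quantizer replacing the projection with integer-packed weights:
block.q_proj.weight = nn.Parameter(
    torch.zeros(64, 16, dtype=torch.int32), requires_grad=False
)

# Deriving the cache dtype/device from the layernorm avoids picking up int32:
dtype = block.input_layernorm.weight.dtype
device = block.input_layernorm.weight.device
cache = torch.zeros(1, 8, 128, 64, dtype=dtype, device=device)
print(cache.dtype)  # torch.float16
```
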
- 26 Feb, 2024 1 commit
fxmarty authored
use torch.bool instead of torch.int64

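As an aside on why the dtype matters, a small sketch (our illustration): a boolean mask costs 1 byte per element versus 8 for int64, and SDPA accepts boolean masks directly.

```python
import torch

seq = 2048
mask_i64 = torch.ones(1, 1, seq, seq, dtype=torch.int64)
mask_bool = mask_i64.to(torch.bool)

print(mask_i64.element_size())   # 8 bytes per element
print(mask_bool.element_size())  # 1 byte per element

# scaled_dot_product_attention takes a bool attn_mask as-is
# (True = attend, False = masked), no float conversion needed.
q = k = v = torch.randn(1, 4, seq, 32)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask_bool)
print(out.shape)
```
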
- 23 Feb, 2024 1 commit
Alessandro Palla authored
* Fix issue 29206
* Fix style