    Refactor flash attention implementation in transformers (#31446) · e3143952
    Arthur authored
    
    
    * dumb commit
    
    * nit
    
    * update
    
    * something like this
    
    * unpack in modeling utils
    
    * safe import (see the guarded-import sketch after this list)
    
    * oops
    
    * update
    
    * nits
    
    * diff convert gemma
    
    * update
    
    * start propagating
    
    * update other modeling code as well
    
    * update for sliding window models (see the sliding-window sketch after this list)
    
    * nits
    
    * more init cleanups
    
    * styling
    
    * fixup
    
    * noice
    
    * pass fixup
    
    * typo typing_extension -> typing_extensions
    
    * torch.nn.functionnal -> torch.nn.functional
    
    * add to import structure
    
    * unpack
    
    * simplify a bit more for this first version
    
    * nit
    
    * update
    
    * update
    
    * nit
    
    * ease the import of `Unpack` (see the `Unpack` sketch after this list)
    
    * remove useless `use_sliding_window`
    
    * no qua please
    
    * protect import?
    
    * style
    
    * [run-slow]
    
    * [run slow] llama,gemma,mistral,mixtral
    
    * remove extra kwargs
    
    * fix llama
    
    * address review comments
    
    * apply diff_model_converter to modeling_gemma.py
    
    * remove cache_position 1
    
    * remove cache_position 2
    
    * some cleaning
    
    * refactor gemma2 as well
    
    * apply review comments
    
    * rename file to modeling_flash_attention_utils.py (see the import sketch after this list)
    
    * siglip refactor
    
    * remove dead code
    
    * is the hub down?
    
    * still down?
    
    * fix siglip
    
    * fix gemma2
    
    * fatal: Could not read from remote repository.
    
    * fix typo in softcap implementation (see the softcap sketch after this list)
    
    * flaky
    
    * Failed: Timeout >120.0s
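
    The "safe import" / "protect import?" steps above are about touching the optional flash-attn dependency only when it is actually installed. A minimal sketch of such a guard, assuming the availability helper `is_flash_attn_2_available` from `transformers.utils`; the symbols pulled in below are illustrative rather than the PR's exact list:

    ```python
    # Guarded flash-attn import: only load the optional dependency when the
    # environment provides it. The imported symbols are illustrative.
    from transformers.utils import is_flash_attn_2_available

    if is_flash_attn_2_available():
        from flash_attn import flash_attn_func, flash_attn_varlen_func
        from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
    ```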
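
    For the "update for sliding window models" step, here is a hedged sketch of how a sliding-window model such as Mistral can hand its window to flash-attn: `flash_attn_func` accepts a `window_size=(left, right)` pair, with `(-1, -1)` meaning no windowing. The wrapper name and defaults are assumptions, not the PR's code:

    ```python
    # Illustrative wrapper, not the PR's helper: forwards an optional sliding
    # window to flash-attn. Needs flash-attn >= 2.3 and a CUDA GPU; q/k/v are
    # (batch, seq_len, num_heads, head_dim) tensors in fp16/bf16.
    from typing import Optional

    import torch
    from flash_attn import flash_attn_func


    def flash_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        sliding_window: Optional[int] = None) -> torch.Tensor:
        window = (sliding_window, sliding_window) if sliding_window is not None else (-1, -1)
        return flash_attn_func(q, k, v, dropout_p=0.0, causal=True, window_size=window)
    ```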
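
    The "unpack" and "ease the import of `Unpack`" steps refer to typing `**kwargs` with `Unpack[...]` over a `TypedDict`, so extra attention kwargs can be forwarded with a precise type instead of a loose `Any`. A sketch of the pattern, assuming the import must also work on Pythons older than 3.11 (where `Unpack` only lives in `typing_extensions`); the field and function names are illustrative, not the PR's exact ones:

    ```python
    import sys

    # `Unpack` joined the standard `typing` module in Python 3.11; earlier
    # versions need `typing_extensions`.
    if sys.version_info >= (3, 11):
        from typing import TypedDict, Unpack
    else:
        from typing_extensions import TypedDict, Unpack

    import torch


    class FlashAttentionKwargs(TypedDict, total=False):
        # Optional extras an attention layer may receive and forward untouched;
        # the field names here are illustrative.
        cu_seq_lens_q: torch.LongTensor
        cu_seq_lens_k: torch.LongTensor
        max_length_q: int
        max_length_k: int


    def attention_forward(hidden_states: torch.Tensor,
                          **kwargs: Unpack[FlashAttentionKwargs]) -> torch.Tensor:
        # A type checker now knows exactly which keyword arguments may flow through.
        return hidden_states
    ```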
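
    The "rename file to modeling_flash_attention_utils.py" step gives the shared flash-attention forward a single home, so individual model files import it instead of carrying private copies. A usage sketch; the helper name below is assumed from the library rather than verified against this PR:

    ```python
    # One shared helper instead of per-model duplicates; `_flash_attention_forward`
    # is the name assumed for the symbol in `modeling_flash_attention_utils.py`.
    from transformers.modeling_flash_attention_utils import _flash_attention_forward
    ```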
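
    The "fix typo in softcap implementation" step concerns attention-logit soft-capping of the kind Gemma 2 uses: scores are squashed into (-softcap, softcap) with tanh before the softmax. A minimal sketch of the operation itself (the helper name is made up for illustration):

    ```python
    # Soft-capping of attention scores; illustrative helper, not the PR's code.
    import torch


    def soft_cap_logits(attn_scores: torch.Tensor, softcap: float) -> torch.Tensor:
        return softcap * torch.tanh(attn_scores / softcap)
    ```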
    
    ---------
    Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>