- 23 Jul, 2024 5 commits
-
-
Tri Dao authored
-
Driss Guessous authored
-
janEbert authored
The library name to import is `flash_attn_interface`, which matches the test.
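A minimal sketch of what that import looks like in practice; the `flash_attn_func` entry point and its return value are assumptions based on the FA2 API and may differ in the FA3 package:
```
# Hedged sketch: the FA3 (hopper) build installs as `flash_attn_interface`,
# not `flash_attn`.
import torch
import flash_attn_interface

q = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")

# `flash_attn_func` is assumed to mirror the FA2 entry point; depending on the
# release it may return only the output or a tuple that also carries the LSE.
result = flash_attn_interface.flash_attn_func(q, k, v, causal=True)
```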
-
rocking authored
* Support ck in fmha
* Add ck submodule
* Do not return lse if return_softmax == false
* Use receipt to speed up ck compile time
* Integrate new version of ck_tile
* Support dropout for mha_fwd()
* Add dropout to mha_varlen_fwd()
* Update ck to develop
* Extract padding function for dropout randval
* Extract randval transformation function
* Sync the code structure and coding style with FA
* Remove this line; the C++ API will handle this. Sync with test_flash_attn.py
* Fix compile error
* Add mha_bwd
* Generate dropout seed and offset from the user generator
* Update CK
* Add mha_varlen_bwd
* Use the same Python that builds flash-attn to generate the ck kernels
* Fix bug in group-mode fwd when returning softmax lse
* Loosen the test tolerance
* Add test_flash_attn_output() and test_flash_attn_varlen_output()
* Always fill softmax_lse
* Remove duplicate benchmark script, since mha_bwd is already implemented
* Refine getting values from a tuple
* Use default parameter for stream_config
* Unblock all platforms
* Add comment
* Refine the test code
* Refine naming
* Add unpack to namespace
* Do not hardcode the warp size 64
* Add more targets
* Add README
* Optimize mha_fwd if seqlen_q == 1
* Support get_wheel_url for ROCm
* Detect the ROCm environment via PyTorch's IS_HIP_EXTENSION (sketched below)
* Update to the latest CK
* Add the necessary compile flag
* Sync the API with upstream FA

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Yichen Yan <wenji.yyc@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yichen Yan <oraluben@outlook.com>
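One of the bullets above mentions detecting the ROCm environment via PyTorch's `IS_HIP_EXTENSION`. A minimal sketch of how a build script might branch on it; the source paths and compile flag below are illustrative assumptions, not the repository's actual setup code:
```
# Hedged sketch: select the CUDA or ROCm/CK build path at setup time.
# PyTorch exposes IS_HIP_EXTENSION from torch.utils.cpp_extension; it is True
# when torch was built for ROCm.
from torch.utils.cpp_extension import IS_HIP_EXTENSION

if IS_HIP_EXTENSION:
    # Compile the Composable Kernel (ck_tile) backend for AMD GPUs.
    backend_sources = ["csrc/ck/mha_fwd.cpp"]      # hypothetical path
    extra_compile_args = ["-DFLASH_ATTN_USE_CK"]   # hypothetical flag
else:
    # Default CUTLASS-based CUDA backend.
    backend_sources = ["csrc/flash_attn/flash_api.cpp"]
    extra_compile_args = []
```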
-
Ying Zhang authored
* fwd var-seq-len (sketched below)
* fixes
* benchmark
* fixes

---------

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
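The commit adds a variable-sequence-length forward path. For orientation, a minimal sketch of how a varlen call is typically driven from the FA2-style Python API; whether the FA3 entry point shares this exact signature is an assumption:
```
# Hedged sketch: variable-length sequences are packed into one tensor and
# described by cumulative sequence-length offsets (cu_seqlens).
import torch
from flash_attn import flash_attn_varlen_func

seqlens = [5, 3, 7]                      # three sequences of different lengths
total = sum(seqlens)
nheads, headdim = 8, 64

q = torch.randn(total, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(total, nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn(total, nheads, headdim, dtype=torch.float16, device="cuda")

cu_seqlens = torch.tensor([0, 5, 8, 15], dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
```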
-
- 22 Jul, 2024 4 commits
-
-
Cameron Shinn authored
-
Phil Wang authored
* Check in the two ways of approaching backwards for softcapping, both functional
* Prepare the softcap switch for backwards
* Temporary
* Cleanup to the way Tri prefers
* Calculate dtanh when copying from scores -> dtanh Tensor (see the sketch below)
* No ternary operators allowed for constexpr, so just use some hack found online
* Fix maybe_dtanh, restore some files
* Restore another file
* Move calculate_dtanh to utils and colocate with apply_softcap
* Cleanup
* Maybe last cleanup
* Save for another PR
* Remove a stray line
* Fix spacing
* Fix an issue, and make test_flash_attn.py ready to test softcapping backwards
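For context on the `dtanh` term these bullets refer to: soft-capping squashes the attention scores through a scaled tanh, so its backward pass needs the tanh derivative. A minimal reference sketch of that math in plain PyTorch, as an assumption about the intent rather than a copy of the kernel code:
```
# Hedged sketch: softcap forward and the dtanh factor used in its backward.
import torch

def apply_softcap(scores: torch.Tensor, softcap: float) -> torch.Tensor:
    # Forward: cap the raw scores smoothly at +/- softcap.
    return softcap * torch.tanh(scores / softcap)

def softcap_dtanh(scores: torch.Tensor, softcap: float) -> torch.Tensor:
    # Backward factor: d/ds [softcap * tanh(s / softcap)] = 1 - tanh(s / softcap)^2.
    t = torch.tanh(scores / softcap)
    return 1.0 - t * t

# Gradient check against autograd on random scores.
s = torch.randn(4, 4, dtype=torch.float64, requires_grad=True)
apply_softcap(s, 30.0).sum().backward()
assert torch.allclose(s.grad, softcap_dtanh(s.detach(), 30.0))
```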
-
youkaichao authored
-
Jorge António authored
-
- 15 Jul, 2024 1 commit
-
-
Tri Dao authored
-
- 13 Jul, 2024 1 commit
-
-
Tri Dao authored
-
- 11 Jul, 2024 8 commits
- 10 Jul, 2024 8 commits
- 09 Jul, 2024 1 commit
-
-
Phil Wang authored
* missing commas
* another fix
-
- 08 Jul, 2024 2 commits
-
-
Nicolas Patry authored
* Softcap v2 (fwd only); usage sketched below.
* Some missing interface + remove overrides in tests.
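A minimal sketch of what forward-only soft-capping looks like from the Python side; the keyword name `softcap` follows the FA2 API, and the shapes are illustrative assumptions:
```
# Hedged sketch: pass a soft cap so attention scores are squashed to
# softcap * tanh(scores / softcap) before the softmax (forward pass only here).
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 512, 8, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 512, 8, 64, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 512, 8, 64, dtype=torch.bfloat16, device="cuda")

out = flash_attn_func(q, k, v, causal=True, softcap=30.0)
```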
-
Jianwei Dong authored
Add the return_softmax_lse parameter to the flash_attn_with_kvcache function to allow returning the logsumexp of the attention scores. (#989)
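A minimal usage sketch of the new flag, assuming the rest of the call follows the existing `flash_attn_with_kvcache` API; the shapes and cache lengths here are illustrative:
```
# Hedged sketch: ask flash_attn_with_kvcache for the logsumexp alongside the output.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim = 2, 8, 64
q = torch.randn(batch, 1, nheads, headdim, dtype=torch.float16, device="cuda")
k_cache = torch.zeros(batch, 1024, nheads, headdim, dtype=torch.float16, device="cuda")
v_cache = torch.zeros(batch, 1024, nheads, headdim, dtype=torch.float16, device="cuda")
cache_seqlens = torch.full((batch,), 128, dtype=torch.int32, device="cuda")

out, lse = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    return_softmax_lse=True,  # new flag from this commit
)
```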
-
- 03 Jul, 2024 1 commit
-
-
muoshuosha authored
Co-authored-by: moshuosha <moshuosha@qq.com>
-
- 01 Jul, 2024 5 commits
-
-
66RING authored
-
JDKWangGuan authored
Update handling of KeyError in state_dict.pop() for non-existing keys. Changed `state_dict.pop(f"h.{d}.attn.bias")` to `state_dict.pop(f"h.{d}.attn.bias", None)` to prevent KeyError exceptions. The following code reproduces the issue:
```
from transformers import AutoTokenizer, GPT2Model, GPT2Config
from flash_attn.models.gpt import GPTLMHeadModel, GPTModel

# >>> transformers.__version__
# '4.38.2'

model_path = 'gpt2'
output_model_path = 'gpt2_model'

config = GPT2Config.from_pretrained(model_path, output_hidden_states=True)
model = GPT2Model.from_pretrained(model_path, from_tf=False, config=config)

'''
model fine-tuning here
'''

# dump the fine-tuned model
model.save_pretrained(output_model_path)

# load the fine-tuned model
config = GPT2Config.from_pretrained(output_model_path, output_hidden_states=True)
model = GPTModel.from_pretrained(output_model_path, config=config, strict=True)  # failed due to KeyError: 'h.0.attn.bias'
model = GPTLMHeadModel.from_pretrained(output_model_path, config=config, strict=True)  # failed due to KeyError: 'h.0.attn.bias'
```
-
cao lei authored
-
Nicolas Patry authored
When the user passes `out` as a parameter and the inputs trigger `seqlenq_ngroups_swapped`, the `CHECK_SHAPE` is incorrect (since the q shape has been modified).
-
Liang authored
Co-authored-by:zl <zl@deepseek.com>
-
- 27 Jun, 2024 1 commit
-
-
Grigory Sizov authored
* Support unpadded LSE layout (see the layout sketch below)
* Cleanup
* Fix unpadded LSE on split-kv path
* Fix formatting and comments
* Fix inline vs forceinline

---------

Co-authored-by: Xinfeng Xie <xfxie.ceca@gmail.com>
Co-authored-by: Jianyu Huang <hjyahead@gmail.com>
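For orientation on what an "unpadded LSE layout" means here: in varlen attention the per-query logsumexp can either be stored padded per batch entry or packed over the total token count. A minimal sketch of the two layouts and the packing between them; the exact shapes used by the kernels are an assumption:
```
# Hedged sketch: padded vs. unpadded (packed) logsumexp layouts for varlen attention.
import torch

nheads = 8
seqlens = [5, 3, 7]
total_q = sum(seqlens)
max_seqlen = max(seqlens)
cu_seqlens = torch.tensor([0, 5, 8, 15], dtype=torch.int32)

# Padded layout: one slot per (batch, head, position); space past each seqlen is wasted.
lse_padded = torch.zeros(len(seqlens), nheads, max_seqlen)

# Unpadded layout: all real query positions packed together, no padding.
lse_unpadded = torch.zeros(nheads, total_q)

# Packing the padded layout into the unpadded one.
for b, seqlen in enumerate(seqlens):
    start, end = cu_seqlens[b].item(), cu_seqlens[b + 1].item()
    lse_unpadded[:, start:end] = lse_padded[b, :, :seqlen]
```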
-
- 26 May, 2024 3 commits