Commits · 26f4b5fb9bb308d0477235e2b66f64034b48db47 · gaoqiong / flash-attention

01 Aug, 2024 2 commits
- Merge branch 'main' into Dao-AILab/main · 26f4b5fb
  Woosuk Kwon authored Jul 31, 2024
  
  26f4b5fb
- Adds Python 3.12 to publish.yml (#10) · 12375706
  Michael Goin authored Jul 31, 2024
  
  12375706
30 Jul, 2024 1 commit

- Fp8 kernel with "in-kernel" transpose of V in producer (#1100) · 5018ac6a
  jayhshah authored Jul 30, 2024

* base version

* restructure pipelines, add special fp8 epilogue

* add variants

* add fp8 causal and modify dynamic tile scheduler

* better causal schedule

* maintain two schedules for non causal and causal

* removing macros

* fix regression

* clean up unneeded methods and variants

* fix mistake with NumProducerThreads

* use seqlen traits

* add fp8 .cu files and benchmark script

* fix merge issue

* fix merge issue

* fix merge issue

* remove duplicate code

* fix regression with varseqlen

* move varseqlen init in constexpr

* fix test script

* more constexpr on varseqlen and add max offset

* add back test cases

5018ac6a

29 Jul, 2024 3 commits
- Bump up to 2.6.0 · f424d25a
  Woosuk Kwon authored Jul 29, 2024
  
  f424d25a
- Add CUDA 11.8 (#9) · e23f4582
  Woosuk Kwon authored Jul 29, 2024
  
  e23f4582
- Update torch to 2.4.0 (#8) · 5a3e6ebf
  Sage Moore authored Jul 29, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  5a3e6ebf
27 Jul, 2024 1 commit
- Add benchmark_gemm.py · c4b9015d
  Tri Dao authored Jul 27, 2024
  
  c4b9015d
25 Jul, 2024 3 commits
- Bump to v2.6.3 · 418d6771
  Tri Dao authored Jul 25, 2024
  
  418d6771
- [CI] Compile for pytorch 2.4.0 · 65205d35
  Tri Dao authored Jul 25, 2024
  
  65205d35
- Revert "Changes For FP8 (#1075)" · 3aae9c18
  Tri Dao authored Jul 25, 2024
```
This reverts commit 1899c970.
```
  3aae9c18
24 Jul, 2024 1 commit
- use global function rather than lambda (#7) · 8f48a546
  youkaichao authored Jul 24, 2024
  
  8f48a546
23 Jul, 2024 11 commits

- Changes For FP8 (#1075) · 1899c970
  ganeshcolfax authored Jul 23, 2024

* adding files for fp8 changes.

* removed contiguous check.

* enable all tests except odd-seq-lengths, where it crashes now.

* undid clang formatting.

* change to correct tile size for headdim=128.

* fixed odd-seq-len-k.

* minor formatting.

* minor reformatting.

---------
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>

1899c970

- Bump to v2.6.2 · 59594f2a
  Tri Dao authored Jul 23, 2024

  59594f2a
- Fix test with alibi and cache_leftpad · 29956362
  Tri Dao authored Jul 23, 2024

  29956362
- [CI] Compile with torch 2.4.0.dev20240527 · 4488acee
  Tri Dao authored Jul 23, 2024

  4488acee
- Split bwd into more .cu files to speed up compilation · 65f723bb
  Tri Dao authored Jul 23, 2024

  65f723bb
- Clean up softcapping bwd a bit · 5ca83a9c
  Tri Dao authored Jul 22, 2024

  5ca83a9c
- Don't specialize for hdim 224 to speed up compilation · 751c762c
  Tri Dao authored Jul 22, 2024

  751c762c
- Fix ima for split-kv kernel (#1085) · 1c275eb0
  Driss Guessous authored Jul 22, 2024

  1c275eb0
- Make FA3 externally importable (#1053) · 3c4053b7
  janEbert authored Jul 23, 2024
```
Library name to import is `flash_attn_interface`, which matches the
test.
```
  3c4053b7
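
The import name mentioned in the commit above can be checked without risking an `ImportError`. The following is only a minimal sketch: the helper `fa3_available` is a hypothetical name introduced here for illustration, and the only detail taken from the log is the module name `flash_attn_interface`.

```python
import importlib.util


def fa3_available(module_name: str = "flash_attn_interface") -> bool:
    """Return True if a top-level module with this name is on the import path.

    Hypothetical helper for illustration; only the default module name
    comes from the commit message.
    """
    return importlib.util.find_spec(module_name) is not None


if fa3_available():
    # The module is installed, so the import should succeed.
    import flash_attn_interface
    print("FA3 interface found at", flash_attn_interface.__file__)
else:
    print("flash_attn_interface is not installed in this environment")
```

Probing with `find_spec` first keeps the check cheap, which may be useful in code that falls back to another attention implementation when FA3 is absent.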

- Support AMD ROCm on FlashAttention 2 (#1010) · d8f104e9
  rocking authored Jul 23, 2024

* Support ck in fmha

* Add ck submodule

* Do not return lse if return_softmax == false

* Use receipt to speed up ck compile time

* Integrate new version of ck_tile

* Support dropout for mha_fwd()

* Add dropout to mha_varlen_fwd()

* Update ck to develop

* Extract padding function for dropout randval

* Extract randval transformation function

* Sync the code structure and coding style with FA

* Remove this line, the C++ API will handle this. Sync with test_flash_attn.py

* fix compile error

* Add mha_bwd

* Generate dropout seed and offset from user generator

* update CK

* Add mha_varlen_bwd

* Use same python as build flash-attn to generate ck kernel

* Fix group-mode fwd bug when returning softmax lse

* Increase the test tolerance

* Add test_flash_attn_output() and test_flash_attn_varlen_output()

* Always fill softmax_lse

* Remove duplicate benchmark script, since we already implement mha_bwd

* Refine get value from tuple

* Use default parameter for stream_config

* unblock all platforms

* Add comment

* refine the test code

* Refine naming

* Add unpack to namespace

* Do not hardcode the warp size 64

* Add more targets

* Add README

* Optimize mha_fwd if seqlen_q == 1

* Support get_wheel_url for rocm

* Detect rocm environment by pytorch's IS_HIP_EXTENSION

* update to latest ck

* Add necessary compile flag

* Sync the api with upstream FA

---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Yichen Yan <wenji.yyc@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yichen Yan <oraluben@outlook.com>

d8f104e9

- Add var-seq-len to FA3 fp16 / bf16 fwd (#1072) · dfe1a59e
  Ying Zhang authored Jul 22, 2024

* fwd var-seq-len

* fixes

* benchmark

* fixes

---------
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>

dfe1a59e

22 Jul, 2024 4 commits

- Remove torchlib dependency from cpp files (#1083) · cb516f85
  Cameron Shinn authored Jul 22, 2024

  cb516f85

- backwards for softcapping (#1033) · 5f1ae4a3
  Phil Wang authored Jul 21, 2024

* check in the two ways of approaching backwards for softcapping, both functional

* prepare the softcap switch for backwards

* temporary

* cleanup to the way Tri prefers

* calculate dtanh when copying from scores -> dtanh Tensor

* no ternary operators allowed for constexpr, so just use some hack found online

* fix maybe_dtanh, restore some files

* restore another file

* move calculate_dtanh to utils and colocate with apply_softcap

* cleanup

* maybe last cleanup

* save for another pr

* remove a stray line

* fix spacing

* fix an issue, and make test_flash_attn.py ready to test softcapping backwards

5f1ae4a3
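
The softcapping backward pass described in the commit above rests on the derivative of the tanh soft cap. The sketch below is a plain-Python illustration, not the CUDA code from the commit. It assumes the usual soft cap `softcap * tanh(score / softcap)`. The names `apply_softcap` and `calculate_dtanh` are borrowed from the commit message, but their signatures here are assumptions, and `softcap_backward` is a hypothetical helper.

```python
import math


def apply_softcap(score: float, softcap: float) -> float:
    """Forward: squash a raw score into the open interval (-softcap, softcap)."""
    return softcap * math.tanh(score / softcap)


def calculate_dtanh(score: float, softcap: float) -> float:
    """Derivative of apply_softcap with respect to the raw score.

    d/ds [c * tanh(s / c)] = 1 - tanh(s / c) ** 2
    """
    t = math.tanh(score / softcap)
    return 1.0 - t * t


def softcap_backward(grad_capped: float, score: float, softcap: float) -> float:
    """Chain rule: gradient w.r.t. the raw score, given the gradient
    w.r.t. the capped score. Hypothetical helper for illustration."""
    return grad_capped * calculate_dtanh(score, softcap)


# Compare the analytic derivative against a central finite difference.
s, c, h = 3.0, 30.0, 1e-5
numeric = (apply_softcap(s + h, c) - apply_softcap(s - h, c)) / (2 * h)
print(abs(numeric - calculate_dtanh(s, c)) < 1e-6)
```

Because the derivative depends only on `tanh(score / softcap)`, it can be computed in the same pass that reads the scores, which appears to be what the "calculate dtanh when copying from scores" step refers to.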

- remove lambda (#1056) · ef3e358a
  youkaichao authored Jul 21, 2024

  ef3e358a
- catch typo (#1058) · 4df62e14
  Jorge António authored Jul 22, 2024

  4df62e14

15 Jul, 2024 1 commit
- [FA3] BF16 forward · 74b0761f
  Tri Dao authored Jul 14, 2024
  
  74b0761f
13 Jul, 2024 1 commit
- Pass seqused_k to _flash_attn_varlen_forward · 898dd4bb
  Tri Dao authored Jul 13, 2024
  
  898dd4bb
11 Jul, 2024 8 commits
- Add FA3 image · 7ef24848
  Tri Dao authored Jul 11, 2024
  
  7ef24848
- FA3 initial code release · 7f67966c
  Tri Dao authored Jul 11, 2024
  
  7f67966c
- Temporarily switch to cutlass fork for more shapes · b4a9dd6c
  Tri Dao authored Jul 11, 2024
  
  b4a9dd6c
- Bump to v2.6.1 · 7551202c
  Tri Dao authored Jul 11, 2024
  
  7551202c
- [CI] Switch from CUDA 12.2 to 12.3 · 844912dc
  Tri Dao authored Jul 11, 2024
  
  844912dc
- Implement cache_leftpad · 40e534a7
  Tri Dao authored Jul 11, 2024
  
  40e534a7
- [CI] Compile with pytorch 2.4.0.dev20240514 · 116b05f9
  Tri Dao authored Jul 11, 2024
  
  116b05f9
- Bump v2.6.0 · da11d1b8
  Tri Dao authored Jul 10, 2024
  
  da11d1b8
10 Jul, 2024 4 commits
- Relax dropout_fraction test · d0787acc
  Tri Dao authored Jul 10, 2024
  
  d0787acc
- Don't support softcap and dropout at the same time · dca6d89d
  Tri Dao authored Jul 10, 2024
```
These tests are failing so I'm just disabling this case for now
```
  dca6d89d
- More typo fixes · 81e01efd
  Tri Dao authored Jul 10, 2024
  
  81e01efd
- Fix typo with softcapping · 72e27c63
  Tri Dao authored Jul 10, 2024
  
  72e27c63