- 31 Oct, 2022 8 commits
- 25 Oct, 2022 1 commit
  - Tri Dao authored
- 24 Oct, 2022 4 commits
- 23 Oct, 2022 4 commits
- 22 Oct, 2022 1 commit
  - Tri Dao authored
- 21 Oct, 2022 3 commits
- 17 Oct, 2022 4 commits
  - Tri Dao authored: fix typo in function mha_fwd
  - robotcator authored
  - robotcator authored
  - YangShu authored: as title.
- 16 Oct, 2022 1 commit
  - Tri Dao authored
- 14 Oct, 2022 2 commits
- 10 Oct, 2022 1 commit
  - Tri Dao authored: build wheel workflow
- 06 Oct, 2022 2 commits
  - Tri Dao authored: remove numpy dependency
  - Antoine Adam authored: According to the `setup.py` file, the only dependencies are torch and einops, but `bert_padding.py` requires `numpy` only to multiply the elements of a `torch.Size` object. This change allows FlashAttention to be used without numpy.
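A minimal sketch of the kind of substitution this describes: computing the product of a shape's elements with the standard library instead of numpy. The shape value below is hypothetical, and note that `torch.Size` is a subclass of `tuple`, so the same code works on a real shape; `math.prod` requires Python 3.8+.

```python
import math

# Hypothetical shape (batch, seqlen, nheads, headdim); torch.Size is a
# tuple subclass, so this stands in for a real tensor shape.
shape = (4, 512, 16, 64)

# Instead of np.prod(shape), multiply the elements with the stdlib:
total_elements = math.prod(shape)
print(total_elements)  # 2097152
```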
- 05 Oct, 2022 5 commits
  - Tri Dao authored: Make flash attention compile on Windows.
  - Eric Engelhart authored
  - Eric Engelhart authored
  - Eric Engelhart authored
  - Tri Dao authored
- 26 Sep, 2022 2 commits
  - robotcator authored
  - robotcator authored
- 12 Sep, 2022 1 commit
  - Tri Dao authored
- 11 Sep, 2022 1 commit
  - Tri Dao authored