Commits · a5a8806d1a4405d4521380e9a8a56f0d02a5ece3 · gaoqiong / flash-attention

23 Oct, 2022 1 commit
- Split bwd on the seqlen_q dimension · a5a8806d
  Tri Dao authored Oct 23, 2022
  
  a5a8806d
22 Oct, 2022 1 commit
- Don't need to run configure for the forward pass · 871db479
  Tri Dao authored Oct 21, 2022
  
  871db479
21 Oct, 2022 3 commits
- Use block_size=128 for headdim=128 on SM80 · 7fc39832
  Tri Dao authored Oct 21, 2022
```
Previously we were using block_size=256.
```
  7fc39832
- Split fwd on the seqlen_q dimension · a44f48df
  Tri Dao authored Oct 21, 2022
  
  a44f48df
- Rework dropout to decouple forward and backward · 1aa6d7d9
  Tri Dao authored Oct 18, 2022
```
They don't have to have the same block size, number of threads, etc.
```
  1aa6d7d9
17 Oct, 2022 2 commits
- Merge pull request #60 from 201419/patch-1 · 1d0b41be
  Tri Dao authored Oct 17, 2022
```
fix typo in function mha_fwd
```
  1d0b41be
- fix typo in function mha_fwd · ff07250e
  YangShu authored Oct 17, 2022
```
as title.
```
  ff07250e
16 Oct, 2022 1 commit
- Fix #54: set device for multi-GPU case · 52fb4b72
  Tri Dao authored Oct 16, 2022
  
  52fb4b72
14 Oct, 2022 2 commits
- Fix QKV interface to allocate output in Python · 1b9facac
  Tri Dao authored Oct 14, 2022
  
  1b9facac
- Implement attention kernel that splits the batch into two · 5badfb78
  Tri Dao authored Oct 13, 2022
  
  5badfb78
10 Oct, 2022 1 commit
- Merge pull request #53 from robotcator/workflow · f515c77f
  Tri Dao authored Oct 09, 2022
```
build wheel workflow
```
  f515c77f
06 Oct, 2022 2 commits
- Merge pull request #55 from ajfadam/main · 8dd52b07
  Tri Dao authored Oct 06, 2022
```
remove numpy dependency
```
  8dd52b07

Antoine Adam authored Oct 06, 2022

According to the `setup.py` file, the only dependencies are torch and einops. However, `bert_padding.py` requires `numpy` solely to multiply the elements of a `torch.Size` object. This change allows FlashAttention to be used without numpy.

4e38df05
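
The point of the commit note above, that multiplying the elements of a shape does not need numpy, can be illustrated with a minimal sketch. This is one numpy-free way to do it and is not necessarily what the commit itself does; `num_elements` and `other_shape` are hypothetical names, and a plain tuple stands in for `torch.Size` (which is a tuple subclass) so that the snippet runs without torch installed.

```python
import math


def num_elements(shape):
    """Return the product of all dimensions in a shape (1 for an empty shape)."""
    # math.prod is in the standard library (Python 3.8+), so numpy is not needed.
    return math.prod(shape)


# Assumption: a plain tuple stands in for a torch.Size object here.
other_shape = (4, 16, 64)
total = num_elements(other_shape)
```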

05 Oct, 2022 5 commits
- Merge pull request #52 from bob80333/main · 88dc2040
  Tri Dao authored Oct 04, 2022
```
Make flash attention compile on Windows.
```
  88dc2040
- Fixed switch statement, thanks @yocabon · 2211db5f
  Eric Engelhart authored Oct 04, 2022
  
  2211db5f
- Add C++17 arg to compiler, since C++17 features are used, fixes windows build · 9b1b011b
  Eric Engelhart authored Oct 03, 2022
  
  9b1b011b
- Replace BOOL_SWITCH with FP16_SWITCH to work around MSVC bug with constexpr variables and templates · 9d7fd5b6
  Eric Engelhart authored Oct 03, 2022
  
  9d7fd5b6
- Only run backward test for d=128 on A100 · 0c01568d
  Tri Dao authored Oct 04, 2022
  
  0c01568d
26 Sep, 2022 2 commits
- add publish · 2c853fe8
  robotcator authored Sep 26, 2022
  
  2c853fe8
- add publish · f7e7e912
  robotcator authored Sep 26, 2022
  
  f7e7e912
12 Sep, 2022 1 commit
- Use block_size=128 for d=128 on SM86 to avoid exceeding smem limit · 8166063a
  Tri Dao authored Sep 12, 2022
  
  8166063a
11 Sep, 2022 1 commit
- Relax assert to allow both bf16 and fp16 · 13403e81
  Tri Dao authored Sep 11, 2022
  
  13403e81
09 Sep, 2022 1 commit
- Change license from Apache 2.0 to BSD · 64f42cd0
  Tri Dao authored Sep 09, 2022
  
  64f42cd0
06 Sep, 2022 2 commits
- Merge pull request #43 from eric-tc-wong/patch-1 · 04fb1985
  Tri Dao authored Sep 06, 2022
```
Update flash_attention.py
```
  04fb1985
- Update flash_attention.py · b410d14f
  eric-tc-wong authored Sep 06, 2022
```
Recasting query and key after rotary_emb()
```
  b410d14f
09 Aug, 2022 1 commit
- Add back need_weights in FlashMHA · 19d12610
  Tri Dao authored Aug 09, 2022
  
  19d12610
05 Aug, 2022 2 commits
- Support index_first_axis with more than 2 dimensions · 6cc73425
  Tri Dao authored Aug 05, 2022
  
  6cc73425
- Allow headdim 128 in FlashMHA interface · 713ea302
  Tri Dao authored Aug 05, 2022
  
  713ea302
22 Jul, 2022 1 commit
- Add tests for numerical error · 2ed471ec
  Tri Dao authored Jul 22, 2022
  
  2ed471ec
12 Jul, 2022 1 commit
- Edit mention of Triton implementation · 42f54d88
  Tri Dao authored Jul 11, 2022
```
Phil Tillet suggests calling it "experimental".
```
  42f54d88
11 Jul, 2022 2 commits
- Link to Triton implementation · 4577151f
  Tri Dao authored Jul 11, 2022
  
  4577151f
- Don't nest BOOL_SWITCH to work around gcc 7 bug · bc2c2102
  Tri Dao authored Jul 11, 2022
  
  bc2c2102
10 Jul, 2022 6 commits
- Link to IEEE Spectrum article on MLPerf · d1fc80a3
  Tri Dao authored Jul 10, 2022
  
  d1fc80a3
- Edit README to mention bf16 support · 1bbebccc
  Tri Dao authored Jul 09, 2022
  
  1bbebccc
- Implement for bf16 · de19de7a
  Tri Dao authored Jul 09, 2022
  
  de19de7a
- Refactor gemm_cl to template on either __half or __nv_bfloat16 · 6a77a6da
  Tri Dao authored Jul 08, 2022
  
  6a77a6da
- Refactor to template on __half, implement bf16 util functions · e518a4b3
  Tri Dao authored Jul 08, 2022
  
  e518a4b3
- Fix Illegal Memory Access bug in fwd when d=16 · 2dc1b205
  Tri Dao authored Jul 09, 2022
  
  2dc1b205
04 Jul, 2022 2 commits
- Apply dropout scaling to dQ and dK instead of to V (in bwd) · 5b838a8b
  Tri Dao authored Jun 29, 2022
  
  Theoretically this might have lower numerical error since the scaling is in
  fp32 instead of fp16 (not sure, I haven't thought too carefully about it).
  However, in practice, the numerical errors seem about the same.
  
  5b838a8b
- Do P * dP (pointwise) in the bwd in fp32 instead of fp16 · a5559a0e
  Tri Dao authored Jul 03, 2022
  
  a5559a0e
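
As a rough illustration of why the precision of a pointwise product can matter, the sketch below rounds a single product to fp16 and to fp32 using only the standard library. It is not taken from the kernel: `round_to`, `p`, and `dp` are hypothetical names, and the two input values are arbitrary. It shows only the rounding of one product, not the end-to-end error of the backward pass, which the commit note above says was about the same in practice.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to a struct format's precision ('e' = fp16, 'f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


# Hypothetical fp16 inputs standing in for one entry of P and one entry of dP.
p = round_to("e", 0.1)
dp = round_to("e", 0.3)

# Python floats are fp64, which holds the product of two fp16 values exactly.
exact = p * dp

# Error introduced by storing the product in fp16 versus fp32.
err_fp16 = abs(round_to("e", exact) - exact)
err_fp32 = abs(round_to("f", exact) - exact)
```

For these inputs the fp32 result is exact, because the product of two 11-bit significands fits in fp32's 24-bit significand, while the fp16 result picks up a rounding error.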