- 14 Nov, 2022 4 commits
- 13 Nov, 2022 1 commit
  - Tri Dao authored
- 10 Nov, 2022 1 commit
  - Tri Dao authored: To avoid an import error when `rotary_emb` is not installed
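The guarded-import pattern that commit describes can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `rotary_available` and `apply_rotary` are hypothetical, and only the import guard itself comes from the commit message.

```python
# Minimal sketch of the guarded-import pattern: treat the optional
# rotary_emb extension as absent-safe, and fail only when the rotary
# code path is actually exercised.
try:
    import rotary_emb  # optional fused CUDA extension; may not be installed
except ImportError:
    rotary_emb = None


def rotary_available() -> bool:
    """Report whether the optional rotary_emb extension was importable."""
    return rotary_emb is not None


def apply_rotary(qkv):
    """Placeholder entry point for the rotary code path (hypothetical name)."""
    if rotary_emb is None:
        raise RuntimeError("rotary_emb is not installed")
    # The real implementation would invoke the fused kernel here.
    return qkv
```

With this guard, importing the module succeeds everywhere; only code that actually calls into the rotary path raises when the extension is missing.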
- 07 Nov, 2022 1 commit
  - Tri Dao authored
- 06 Nov, 2022 1 commit
  - Tri Dao authored
- 05 Nov, 2022 2 commits
- 04 Nov, 2022 3 commits
- 03 Nov, 2022 1 commit
  - Tri Dao authored
- 02 Nov, 2022 1 commit
  - Tri Dao authored
- 01 Nov, 2022 2 commits
- 31 Oct, 2022 10 commits
- 24 Oct, 2022 3 commits
- 23 Oct, 2022 2 commits
- 21 Oct, 2022 2 commits
- 14 Oct, 2022 2 commits
- 06 Oct, 2022 1 commit
  - Antoine Adam authored: According to `setup.py`, the only dependencies are torch and einops, yet `bert_padding.py` required numpy solely to multiply the elements of a `torch.Size` object. This change allows FlashAttention to be used without numpy.
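The replacement that commit describes can be sketched with the standard library alone. This is a hedged illustration of the technique, not the repository's actual diff; a plain tuple stands in for a `torch.Size` (which is a tuple subclass, so the same calls apply to it).

```python
import math
import operator
from functools import reduce

# A plain tuple stands in for a torch.Size here (torch.Size subclasses tuple).
shape = (2, 3, 4)

# numpy-free replacements for np.prod(shape):
n_elems = math.prod(shape)                       # Python 3.8+
n_elems_reduce = reduce(operator.mul, shape, 1)  # works on older Pythons too
```

Either form multiplies the dimensions to the total element count without pulling numpy into the dependency set.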
- 11 Sep, 2022 1 commit
  - Tri Dao authored
- 06 Sep, 2022 1 commit
  - eric-tc-wong authored: Recast query and key after rotary_emb()
- 09 Aug, 2022 1 commit
  - Tri Dao authored