Commits · de19de7ab13301e840a8bc483e77be6e424e7b32 · gaoqiong / flash-attention

10 Jul, 2022 4 commits
- Implement for bf16 · de19de7a
  Tri Dao authored Jul 09, 2022
- Refactor gemm_cl to template on either __half or __nv_bfloat16 · 6a77a6da
  Tri Dao authored Jul 08, 2022
- Refactor to template on __half, implement bf16 util functions · e518a4b3
  Tri Dao authored Jul 08, 2022
- Fix Illegal Memory Access bug in fwd when d=16 · 2dc1b205
  Tri Dao authored Jul 09, 2022
04 Jul, 2022 3 commits
- Apply dropout scaling to dQ and dK instead of to V (in bwd) · 5b838a8b
  Tri Dao authored Jun 29, 2022
```
Theoretically this might have lower numerical error since the scaling is in
fp32 instead of fp16 (not sure, I haven't thought too carefully about it).
However, in practice, the numerical errors seem about the same.
```
- Do P * dP (pointwise) in the bwd in fp32 instead of fp16 · a5559a0e
  Tri Dao authored Jul 03, 2022
- Implement cross attention · 6c3a8c65
  Tri Dao authored Jun 30, 2022
30 Jun, 2022 1 commit
- Support batch size > 64K by swapping grid.x and grid.y · f66603cb
  Tri Dao authored Jun 29, 2022
26 Jun, 2022 1 commit
- Fix race condition in backward pass (smem_dq) · ea38d3d2
  Tri Dao authored Jun 25, 2022
25 Jun, 2022 1 commit
- Bug fix: wrong smem_o write pointer for d=16 · eeca63a7
  Tri Dao authored Jun 25, 2022
12 Jun, 2022 3 commits
- Refactor Gmem code to store q, k, v pointers separately · 5d07483b
  Tri Dao authored Jun 12, 2022
- Implement bwd for head dim 128 · d3e64409
  Tri Dao authored Jun 11, 2022
- Implement fwd for head dim 128 · 0d854692
  Tri Dao authored Jun 05, 2022
04 Jun, 2022 2 commits
- Set block size of SM75 fwd to 256 if there's no dropout · 321c57d0
  Tri Dao authored Jun 04, 2022
```
This speeds up the fwd by 1.5x.
```
- Don't use Smem_dp_sum in backward pass · d380e87f
  Tri Dao authored Jun 04, 2022
```
To reduce smem usage for SM75
```
03 Jun, 2022 2 commits
- Reduce smem usage for Q and dO in the backward pass · b17c6fe2
  Tri Dao authored Jun 03, 2022
```
From 4KB per buffer to 2KB per buffer. This saves 8KB of smem (Q and dO each
have 2 buffers).
```
- Support Turing mma instructions · 2712aa4c
  Tri Dao authored Jun 02, 2022
02 Jun, 2022 4 commits
- Remove softmax fp16 max · 05087332
  Tri Dao authored Jun 02, 2022
- Use Cutlass gemm as WarpMma · 14dc326e
  Tri Dao authored Jun 02, 2022
- Remove old backward · e78e7c95
  Tri Dao authored Jun 02, 2022
- Support SM86 GPUs · c41479d6
  Tri Dao authored Jun 01, 2022
26 May, 2022 1 commit
- Rename, add benchmarking script · 9dbc491a
  Tri Dao authored May 26, 2022