Commits · 5badfb78485adf1333f04c46510d99ac56a17622 · gaoqiong / flash-attention


14 Oct, 2022 · 1 commit
- Implement attention kernel that splits the batch into two · 5badfb78
  Tri Dao authored Oct 13, 2022
  
10 Jul, 2022 · 1 commit
- Implement for bf16 · de19de7a
  Tri Dao authored Jul 09, 2022
  
04 Jul, 2022 · 3 commits
- Apply dropout scaling to dQ and dK instead of to V (in bwd) · 5b838a8b
  Tri Dao authored Jun 29, 2022
```
Theoretically this might have lower numerical error since the scaling is in
fp32 instead of fp16 (not sure, I haven't thought too carefully about it).
However, in practice, the numerical errors seem about the same.
```
- Do P * dP (pointwise) in the bwd in fp32 instead of fp16 · a5559a0e
  Tri Dao authored Jul 03, 2022
  
- Implement cross attention · 6c3a8c65
  Tri Dao authored Jun 30, 2022
  
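The dropout-scaling commit in this group (5b838a8b) moves the 1/(1-p) dropout factor off V and onto dQ and dK in the backward pass. Below is a minimal pure-Python sketch of why that placement gives the same gradients in exact arithmetic. It assumes the standard attention backward pass dS = P ∘ (dP - rowsum(dO ∘ O)), with the softmax scale omitted for brevity. The function `attention_bwd`, its `defer_scale` flag, and the step that multiplies the row-sum term by (1 - p) on the deferred path are hypothetical illustrations, not code from the repository. The sketch runs in float64, so it says nothing about the fp32 vs fp16 precision question raised in the commit message.

```python
import math


def matmul(a, b):
    # a is n x k, b is k x m, both stored as lists of rows
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(s):
    out = []
    for row in s:
        m = max(row)  # subtract the row max for numerical stability
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out


def attention_bwd(q, k, v, d_o, mask, p_drop, defer_scale):
    """Hypothetical helper: (dQ, dK) for O = dropout(softmax(Q K^T)) V.

    defer_scale=False: scale V by 1/(1-p) before computing dP = dO V^T.
    defer_scale=True: use unscaled V, multiply the row-sum term by (1-p)
    (an assumption of this sketch), then scale dQ and dK by 1/(1-p).
    """
    c = 1.0 / (1.0 - p_drop)
    p = softmax_rows(matmul(q, transpose(k)))
    # Forward output with inverted dropout: O = (P * M * c) V
    p_dropped = [[p[i][j] * mask[i][j] * c for j in range(len(p[0]))]
                 for i in range(len(p))]
    o = matmul(p_dropped, v)
    # D_i = rowsum(dO * O), the softmax-backward correction term
    d_row = [sum(x * y for x, y in zip(d_o[i], o[i])) for i in range(len(o))]
    v_bwd = v if defer_scale else [[x * c for x in row] for row in v]
    dp = matmul(d_o, transpose(v_bwd))
    shrink = (1.0 - p_drop) if defer_scale else 1.0
    ds = [[p[i][j] * (dp[i][j] * mask[i][j] - shrink * d_row[i])
           for j in range(len(p[0]))] for i in range(len(p))]
    dq = matmul(ds, k)
    dk = matmul(transpose(ds), q)
    if defer_scale:
        # Apply the dropout scale once, at the end, to dQ and dK
        dq = [[x * c for x in row] for row in dq]
        dk = [[x * c for x in row] for row in dk]
    return dq, dk


# Usage: 2 queries, 3 keys, head dim 2, fixed dropout mask, p = 0.25
q = [[0.1, 0.2], [0.3, -0.1]]
k = [[0.5, -0.2], [0.0, 0.4], [-0.3, 0.1]]
v = [[1.0, 0.0], [0.5, -1.0], [0.2, 0.3]]
d_o = [[0.2, -0.1], [0.4, 0.3]]
mask = [[1, 0, 1], [1, 1, 0]]
ref = attention_bwd(q, k, v, d_o, mask, 0.25, defer_scale=False)
alt = attention_bwd(q, k, v, d_o, mask, 0.25, defer_scale=True)
```

On the deferred path, every term inside the parentheses of dS is smaller by exactly (1 - p), so dS, and therefore dQ and dK, shrink by the same factor, which the final multiply by 1/(1-p) undoes. The two paths agree up to float rounding.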
12 Jun, 2022 · 1 commit
- Refactor Gmem code to store q, k, v pointers separately · 5d07483b
  Tri Dao authored Jun 12, 2022
  
26 May, 2022 · 1 commit
- Rename, add benchmarking script · 9dbc491a
  Tri Dao authored May 26, 2022
  
20 May, 2022 · 1 commit
- First release · 1fcbe6f0
  Tri Dao authored May 20, 2022
  