Commits · 751c762c9cc98f13e375f582974110d781fcf3ca · gaoqiong / flash-attention

23 Jul, 2024 2 commits

Don't specialize for hdim 224 to speed up compilation · 751c762c
Tri Dao authored Jul 22, 2024

751c762c

Support AMD ROCm on FlashAttention 2 (#1010) · d8f104e9

rocking authored Jul 23, 2024



* Support ck in fmha

* Add ck submodule

* Do not return lse if return_softmax == false

* Use receipt to speed up ck compile time

* Integrate new version of ck_tile

* Support dropout for mha_fwd()

* Add dropout to mha_varlen_fwd()

* Update ck to develop

* Extract padding function for dropout randval

* Extract randval transformation function

* Sync the code structure and coding style with FA

* Remove this line, c++ api will handle this.
Sync with test_flash_attn.py

* fix compile error

* Add mha_bwd

* Generate dropout seed and offset from user generator

* update CK

* Add mha_varlen_bwd

* Use same python as build flash-attn to generate ck kernel

* Fix bug of group mode fwd about returning softmax lse

* Increase the test tolerance

* Add test_flash_attn_output() and test_flash_attn_varlen_output()

* Always fill softmax_lse

* Remove duplicate benchmark script, since we already implement mha_bwd

* Refine get value from tuple

* Use default parameter for stream_config

* Unblock all platforms

* Add comment

* refine the test code

* Refine naming

* Add unpack to namespace

* Do not hardcode the warp size 64

* Add more targets

* Add README

* Optimize mha_fwd if seqlen_q == 1

* Support get_wheel_url for rocm

* Detect rocm environment by pytorch's IS_HIP_EXTENSION

* Update to latest ck

* Add necessary compile flag

* Sync the api with upstream FA

---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Yichen Yan <wenji.yyc@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yichen Yan <oraluben@outlook.com>

d8f104e9

11 Jul, 2024 1 commit
- [CI] Switch from CUDA 12.2 to 12.3 · 844912dc
  Tri Dao authored Jul 11, 2024
  
  844912dc
10 Jul, 2024 2 commits
- Split into more .cu files to speed up compilation · 908511b2
  Tri Dao authored Jul 10, 2024
  
  908511b2
- Drop support for pytorch 1.12, 1.13, and python 3.7 · beb2bf2a
  Tri Dao authored Jul 09, 2024
  
  beb2bf2a
08 Jul, 2024 1 commit
- Implement softcapping. (#1025) · 8f873cc6
  Nicolas Patry authored Jul 08, 2024
```
* Softcap v2 (fwd only).

* Some missing interface + remove overrides in tests.
```
  8f873cc6
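The commit message above does not spell out the softcapping formula. Softcapping is commonly defined as `softcap * tanh(score / softcap)`, which squashes attention scores smoothly into the range `(-softcap, softcap)`. The following is a minimal sketch under that assumption; the function name `softcap_scores` and the convention that a non-positive cap disables the feature are illustrative, not taken from the kernel code.

```python
import math


def softcap_scores(scores, softcap):
    """Apply tanh-based softcapping to a list of raw attention scores.

    Assumption: a non-positive softcap means the feature is disabled.
    """
    if softcap <= 0:
        return list(scores)
    # tanh maps any real input into (-1, 1); scaling by softcap bounds the result.
    return [softcap * math.tanh(s / softcap) for s in scores]
```

Small scores pass through almost unchanged because `tanh(x)` is close to `x` near zero, while large scores saturate near the cap.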
26 May, 2024 1 commit

add exception to Timeout Error (#963) · beb8b8ba

Corey James Levinson authored May 26, 2024

When the connection times out, you get `URLError: <urlopen error timed out>`. In that case, build from source.

beb8b8ba
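The fallback described in the commit above can be sketched as follows. This is a minimal illustration, not the project's actual `setup.py`: the function name `fetch_wheel_or_build`, the `download` and `build` parameters, and the return strings are hypothetical. It only shows the idea of treating a `URLError` (which includes connection timeouts) the same way as an HTTP error and falling back to a source build.

```python
import urllib.error
import urllib.request


def fetch_wheel_or_build(url, download=urllib.request.urlretrieve,
                         build=lambda: "built-from-source"):
    """Try to download a prebuilt wheel; fall back to a source build on failure."""
    try:
        download(url, "wheel.whl")
        return "downloaded"
    except (urllib.error.HTTPError, urllib.error.URLError):
        # HTTPError: no matching wheel exists at the guessed URL.
        # URLError: network problems such as "<urlopen error timed out>".
        return build()
```

Passing `download` and `build` as parameters keeps the sketch testable without network access.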

06 May, 2024 1 commit
- Move packaging and ninja from install_requires to setup_requires (#937) · 9c0e9ee8
  Wei Ji authored May 07, 2024
```
Set `packaging` and `ninja` as build time dependencies rather than runtime dependencies.
```
  9c0e9ee8
08 Apr, 2024 1 commit
- [CI] Compile with torch 2.3.0.dev20240207 · 2aea958f
  Tri Dao authored Apr 07, 2024
  
  2aea958f
14 Mar, 2024 2 commits
- Support ARM builds (#757) · 26c9e827
  Arvind Sundararajan authored Mar 13, 2024
  
  26c9e827
- Make nvcc threads configurable via environment variable (#885) · 50896ec5
  Chirag Jain authored Mar 14, 2024
  
  50896ec5
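A configurable nvcc thread count, as in the last commit above, might look like the sketch below. The environment variable name `NVCC_THREADS` and the default of `"4"` are assumptions made for illustration (this log does not state them); `--threads` is an existing nvcc option that controls parallel compilation across target architectures.

```python
import os


def append_nvcc_threads(nvcc_args, env=os.environ, default="4"):
    """Return nvcc arguments with a --threads option appended.

    Assumption: the thread count is read from NVCC_THREADS, with a
    hypothetical default of "4" when the variable is unset or empty.
    """
    threads = env.get("NVCC_THREADS") or default
    return list(nvcc_args) + ["--threads", threads]
```

Under these assumptions, running `NVCC_THREADS=2 pip install .` would lower per-file parallelism on memory-constrained machines.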
18 Feb, 2024 1 commit

Optimize compile to 1: avoid oom 2: minimize swap usage 3: avoid threads... · f45bbb4c

Qubitium authored Feb 18, 2024

Optimize compilation to (1) avoid OOM, (2) minimize swap usage, and (3) avoid thread starvation when letting ninja decide how many workers to spawn or when relying on manual MAX_JOBS "guesses". The logic takes the minimum of two auto-calculated MAX_JOBS values: one based on CPU cores and one based on free memory. This should let flash-attn compile close to the most efficient manner in any consumer or server environment. (#832)

f45bbb4c
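The "minimum of two metrics" logic from the commit above can be sketched as a small pure function. The name `pick_max_jobs` and the per-job memory budget `gb_per_job` are hypothetical; the commit message does not state how much memory each compile job is assumed to need.

```python
def pick_max_jobs(cpu_cores, free_memory_gb, gb_per_job=9.0):
    """Return a parallel job count bounded by both CPU cores and free memory.

    Assumption: each compile job needs roughly gb_per_job GB of RAM
    (an illustrative figure, not taken from the commit).
    """
    jobs_by_cpu = max(1, cpu_cores)
    jobs_by_memory = max(1, int(free_memory_gb / gb_per_job))
    # Taking the minimum avoids both thread starvation and running out of memory.
    return min(jobs_by_cpu, jobs_by_memory)
```

In a build script, `cpu_cores` could come from `os.cpu_count()` and `free_memory_gb` from a library such as `psutil`; the result would then be exported as `MAX_JOBS` before invoking ninja.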

28 Nov, 2023 1 commit
- [CI] Only compile for CUDA 11.8 & 12.2, MAX_JOBS=2, add torch-nightly · d4a7c8ff
  Tri Dao authored Nov 27, 2023
  
  d4a7c8ff
04 Oct, 2023 1 commit
- [CI] Use official Pytorch 2.1, add CUDA 11.8 for Pytorch 2.1 · 5e525a8d
  Tri Dao authored Oct 03, 2023
  
  5e525a8d
24 Sep, 2023 1 commit
- Reduce number of templates for headdim > 128 · 1879e089
  Tri Dao authored Sep 23, 2023
  
  1879e089
22 Sep, 2023 1 commit
- Re-enable compilation for Hopper · bff31471
  Tri Dao authored Sep 21, 2023
  
  bff31471
18 Sep, 2023 3 commits
- [Gen] Don't use ft_attention, use flash_attn_with_kvcache instead · dfe29f5e
  Tri Dao authored Sep 18, 2023
  
  dfe29f5e
- [Minor] add nvcc note on bare_metal_version `RuntimeError` (#552) · fa3ddcba
  Federico Berto authored Sep 19, 2023
```
* Add nvcc note on bare_metal_version `RuntimeError`

* Run Black formatting
```
  fa3ddcba
- Don't compile for Pytorch 2.1 on CUDA 12.1 due to nvcc segfaults · 799f56fa
  Tri Dao authored Sep 17, 2023
  
  799f56fa
12 Sep, 2023 1 commit
- Remove some unused headers · bb9beb36
  Tri Dao authored Sep 12, 2023
  
  bb9beb36
04 Sep, 2023 1 commit
- Require CUDA 11.6+, clean up setup.py · 0c04943f
  Tri Dao authored Sep 03, 2023
  
  0c04943f
29 Aug, 2023 1 commit
- Implement splitKV attention · b1fbbd83
  Tri Dao authored Aug 29, 2023
  
  b1fbbd83
18 Aug, 2023 1 commit
- Don't need to set TORCH_CUDA_ARCH_LIST in setup.py · cbb4cf5f
  Tri Dao authored Aug 18, 2023
  
  cbb4cf5f
14 Aug, 2023 2 commits
- fix binary wheel installation when nvcc is not available (#448) · aab603af
  Aman Gupta Karmani authored Aug 14, 2023
  
  aab603af
- Use single thread compilation for cuda12.1, torch2.1 to avoid OOM CI · 9c531bdc
  Tri Dao authored Aug 14, 2023
  
  9c531bdc
13 Aug, 2023 1 commit
- Fix wheel building · 2ddeaa40
  Tri Dao authored Aug 13, 2023
  
  2ddeaa40
01 Aug, 2023 1 commit
- Fix race condition in bwd (overwriting sK) · 1c41d2b0
  Tri Dao authored Aug 01, 2023
  
  1c41d2b0
17 Jul, 2023 1 commit
- FlashAttention-2 release · 4f285b35
  Tri Dao authored Jul 17, 2023
  
  4f285b35
08 Jun, 2023 2 commits
- Clean setup.py imports · 9af165c3
  Pierce Freeman authored Jun 07, 2023
  
  9af165c3
- Add notes to github action workflow · 494b2aa4
  Pierce Freeman authored Jun 04, 2023
  
  494b2aa4
03 Jun, 2023 6 commits
- Refactor and clean up setup.py · ea2ed886
  Pierce Freeman authored Jun 02, 2023
  
  ea2ed886
- Strip cuda name from torch version · 9fc9820a
  Pierce Freeman authored Jun 02, 2023
  
  9fc9820a
- Allow fallback install · 5e469978
  Pierce Freeman authored Jun 02, 2023
  
  5e469978
- Guessing wheel URL · 0e7769c8
  Pierce Freeman authored Jun 02, 2023
  
  0e7769c8
- Raise cuda error on build · e1faefce
  Pierce Freeman authored Jun 02, 2023
  
  e1faefce
- Scaffolding for wheel prototype · add4f0bc
  Pierce Freeman authored May 30, 2023
  
  add4f0bc
19 May, 2023 1 commit
- Allow adding an optional local version to the package version · 31f78a98
  Max H. Gerlach authored May 19, 2023
  
  31f78a98
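An optional local version, as in the commit above, is typically appended with a `+` separator following PEP 440 local version identifiers. The sketch below assumes that convention; the function name `with_local_version` is hypothetical, and how the local part is supplied (for example through an environment variable) is not stated in this log.

```python
def with_local_version(public_version, local_version=None):
    """Append an optional PEP 440 local version label to a public version."""
    if local_version:
        # PEP 440: <public version>+<local version label>
        return f"{public_version}+{local_version}"
    return public_version
```

This lets a custom build be distinguished from the official release without changing the public version number.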
12 May, 2023 1 commit
- Add ninja to pyproject.toml build-system, bump to v1.0.5 · eff9fe6b
  Tri Dao authored May 12, 2023
  
  eff9fe6b
26 Apr, 2023 1 commit
- [Docs] Clearer error message for bwd d > 64, bump to v1.0.4 · ad113948
  Tri Dao authored Apr 26, 2023
  
  ad113948
21 Apr, 2023 1 commit
- Bump version to v1.0.3.post0 · fbbb1078
  Tri Dao authored Apr 21, 2023
  
  fbbb1078