1. 29 Jul, 2024 1 commit
  2. 23 Jul, 2024 3 commits
    • Tri Dao authored · 65f723bb
    • Tri Dao authored · 751c762c
    •
      Support AMD ROCm on FlashAttention 2 (#1010) · d8f104e9
      rocking authored
      
      
      * Support ck in fmha
      
      * Add ck submodule
      
      * Do not return lse if return_softmax == false
      
      * Use receipt to speed up ck compile time
      
      * Integrate new version of ck_tile
      
      * Support dropout for mha_fwd()
      
      * Add dropout to mha_varlen_fwd()
      
      * Update ck to develop
      
      * Extract padding function for dropout randval
      
      * Extract randval transformation function
      
      * Sync the code structure and coding style with FA
      
* Remove this line; the C++ API will handle it.
Sync with test_flash_attn.py
      
* Fix compile error
      
      * Add mha_bwd
      
      * Generate dropout seed and offset from user generator
      
* Update CK
      
      * Add mha_varlen_bwd
      
* Use the same Python that builds flash-attn to generate the ck kernels
      
* Fix bug in group-mode fwd when returning softmax lse
      
* Increase the test tolerance
      
      * Add test_flash_attn_output() and test_flash_attn_varlen_output()
      
      * Always fill softmax_lse
      
      * Remove duplicate benchmark script, since we already implement mha_bwd
      
* Refine getting values from the tuple
      
      * Use default parameter for stream_config
      
* Unblock all platforms
      
      * Add comment
      
* Refine the test code
      
      * Refine naming
      
      * Add unpack to namespace
      
      * Do not hardcode the warp size 64
      
      * Add more targets
      
      * Add README
      
      * Optimize mha_fwd if seqlen_q == 1
      
      * Support get_wheel_url for rocm
      
      * Detect rocm environment by pytorch's IS_HIP_EXTENSION
      
* Update to the latest ck
      
      * Add necessary compile flag
      
      * Sync the api with upstream FA
      
      ---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Yichen Yan <wenji.yyc@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yichen Yan <oraluben@outlook.com>
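One bullet above mentions detecting the ROCm environment via PyTorch's IS_HIP_EXTENSION. A minimal sketch of how a build script might branch on that flag; the ImportError fallback and the BUILD_TARGET name are assumptions for illustration, not the actual setup.py code:

```python
# Sketch: branch a build script on PyTorch's IS_HIP_EXTENSION flag,
# which is True when PyTorch was built against ROCm/HIP.
try:
    from torch.utils.cpp_extension import IS_HIP_EXTENSION
except ImportError:
    # Assumed fallback for environments without PyTorch installed.
    IS_HIP_EXTENSION = False

# Hypothetical name; the real build script may organize this differently.
BUILD_TARGET = "rocm" if IS_HIP_EXTENSION else "cuda"
```

Branching on the PyTorch flag (rather than probing for hipcc or ROCM_PATH) keeps the extension build consistent with whatever backend the installed PyTorch was compiled for.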
  3. 11 Jul, 2024 1 commit
  4. 10 Jul, 2024 2 commits
  5. 08 Jul, 2024 1 commit
  6. 07 Jun, 2024 1 commit
  7. 26 May, 2024 1 commit
  8. 19 May, 2024 1 commit
  9. 06 May, 2024 2 commits
  10. 22 Apr, 2024 2 commits
  11. 08 Apr, 2024 1 commit
  12. 28 Mar, 2024 2 commits
  13. 14 Mar, 2024 2 commits
  14. 18 Feb, 2024 1 commit
    •
      Optimize compile to 1: avoid oom 2: minimize swap usage 3: avoid threads... · f45bbb4c
      Qubitium authored
Optimize compilation to (1) avoid OOM, (2) minimize swap usage, and (3) avoid thread starvation, whether ninja decides how many workers to spawn or MAX_JOBS is guessed manually. The logic takes the minimum of two auto-calculated MAX_JOBS values, one derived from CPU cores and one from free memory. This should let flash-attn compile close to peak efficiency in any consumer or server environment. (#832)
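The heuristic described above can be sketched as the minimum of a CPU-based cap and a memory-based cap. The function and variable names, and the assumed per-job memory footprint, are illustrative; the actual setup.py logic may differ:

```python
import os

# Assumed memory footprint of one compiler job; compiling flash-attn
# kernels is memory-hungry, so each parallel job is budgeted several GB.
GB_PER_JOB = 4


def estimate_max_jobs(free_mem_gb):
    """Return a MAX_JOBS value bounded by both CPU cores and free memory."""
    cpu_jobs = os.cpu_count() or 1
    mem_jobs = max(1, int(free_mem_gb // GB_PER_JOB))
    # Taking the minimum avoids both OOM (memory bound) and
    # oversubscription/thread starvation (CPU bound).
    return min(cpu_jobs, mem_jobs)
```

For example, a 32-core machine with 16 GB free would be capped at 4 jobs by the memory bound, while a 4-core machine with 64 GB free would be capped at 4 jobs by the CPU bound.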
      
  15. 28 Nov, 2023 1 commit
  16. 04 Oct, 2023 1 commit
  17. 24 Sep, 2023 1 commit
  18. 22 Sep, 2023 1 commit
  19. 18 Sep, 2023 3 commits
  20. 12 Sep, 2023 1 commit
  21. 04 Sep, 2023 1 commit
  22. 29 Aug, 2023 1 commit
  23. 18 Aug, 2023 1 commit
  24. 14 Aug, 2023 2 commits
  25. 13 Aug, 2023 1 commit
  26. 01 Aug, 2023 1 commit
  27. 17 Jul, 2023 1 commit
  28. 08 Jun, 2023 2 commits
  29. 03 Jun, 2023 1 commit