1. 16 Jul, 2025 2 commits
    • Forgot to add cudamacro.h. · 9689109f
      Mauro Bisson authored
    • Optimized the BWD kernel with the same changes applied to the FWD kernel in commit 8cb399ee: · 9a463332
      Mauro Bisson authored
      * Replaced PyTorch's slow permutation.
      * Split the kernel into general and specialized versions (for num_channel <= 8192).
      * Enabled float4-based vectorized memory access when possible.
      * Added runtime dispatch logic to select the specialized kernel at launch time (see the sketch below).
      
      Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.
      
      Extracted shared host/device functions and declarations into a separate module:
      * attention_utils.cuh
      * attention_utils.cu
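
      As an illustration of the last three bullets above, here is a minimal sketch of what the
      float4 vectorization and runtime dispatch could look like. The kernel names, the
      placeholder arithmetic, and the exact preconditions are assumptions made for this sketch,
      not the actual code in attention_bwd_cuda.cu:

        #include <cuda_runtime.h>
        #include <cstddef>
        #include <cstdint>

        // General path: one thread per element, valid for any channel count.
        __global__ void attn_bwd_general(const float* __restrict__ in,
                                         float* __restrict__ out,
                                         int num_channel, int num_points) {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx < num_channel * num_points)
                out[idx] = 2.0f * in[idx];               // placeholder math
        }

        // Specialized path: assumes num_channel is a multiple of 4 so each
        // thread can move 16 bytes at a time through float4 loads/stores.
        __global__ void attn_bwd_specialized(const float4* __restrict__ in,
                                             float4* __restrict__ out,
                                             int num_channel, int num_points) {
            int point = blockIdx.x;                      // one block per point
            int vec_channels = num_channel / 4;          // float4 elements per point
            const float4* src = in  + (size_t)point * vec_channels;
            float4*       dst = out + (size_t)point * vec_channels;
            for (int c = threadIdx.x; c < vec_channels; c += blockDim.x) {
                float4 v = src[c];                       // single 16-byte load
                v.x *= 2.0f; v.y *= 2.0f; v.z *= 2.0f; v.w *= 2.0f;  // placeholder math
                dst[c] = v;                              // single 16-byte store
            }
        }

        // Host-side runtime dispatch: use the specialized kernel only when its
        // preconditions (channel bound, vectorizability, alignment) hold.
        void launch_attn_bwd(const float* in, float* out,
                             int num_channel, int num_points, cudaStream_t stream) {
            bool vec_ok = (num_channel % 4 == 0) &&
                          (reinterpret_cast<std::uintptr_t>(in)  % 16 == 0) &&
                          (reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
            if (num_channel <= 8192 && vec_ok) {
                attn_bwd_specialized<<<num_points, 256, 0, stream>>>(
                    reinterpret_cast<const float4*>(in),
                    reinterpret_cast<float4*>(out), num_channel, num_points);
            } else {
                int total = num_channel * num_points;
                int block = 256, grid = (total + block - 1) / block;
                attn_bwd_general<<<grid, block, 0, stream>>>(in, out, num_channel, num_points);
            }
        }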
  2. 07 Jul, 2025 1 commit
  3. 04 Jul, 2025 1 commit
  4. 03 Jul, 2025 3 commits
  5. 02 Jul, 2025 3 commits
  6. 18 Jun, 2025 1 commit
  7. 17 Jun, 2025 2 commits
  8. 16 Jun, 2025 2 commits
  9. 13 Jun, 2025 2 commits
  10. 11 Jun, 2025 3 commits
  11. 06 Jun, 2025 1 commit
  12. 04 Jun, 2025 1 commit
  13. 02 Jun, 2025 1 commit
    • Optimized CUDA kernels for improved backward gradient computation · 5f051c97
      Max Rietmann authored
      
      
      Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and
      `s2_attention_kernel_mbT`, for more efficient computation of backward gradients
      and forward attention, respectively. These changes optimize memory access
      patterns and employ coalesced operations by leveraging tensor transpositions.
      
      Forward kernel written by Mauro Bisson.
      Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.
      
      The parallelization strategy computes one output per warp, with the warp's threads
      computing the dot product in parallel. Because the inputs are transposed so that the
      channel dimension is last, the dot-product memory accesses are perfectly coalesced,
      which yields excellent performance in both the forward and backward kernels.
      Co-authored-by: Mauro Bisson <maurob@nvidia.com>
      Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
      Co-authored-by: Andrea Paris <aparis@ethz.ch>
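
      A minimal sketch of the one-output-per-warp strategy described above, assuming a flat
      [num_pairs, channels] layout with the channel dimension last. The kernel name and the
      shapes are illustrative assumptions, not the actual s2_attention_kernel_mbT implementation:

        #include <cuda_runtime.h>
        #include <cstddef>

        // One warp produces one output: the 32 lanes stride over the channel
        // dimension (stored last, so adjacent lanes read adjacent addresses,
        // i.e. coalesced), then the partial sums are reduced with warp shuffles.
        __global__ void warp_dot_kernel(const float* __restrict__ a,   // [num_pairs, channels]
                                        const float* __restrict__ b,   // [num_pairs, channels]
                                        float* __restrict__ out,       // [num_pairs]
                                        int channels, int num_pairs) {
            int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
            int lane    = threadIdx.x % 32;
            if (warp_id >= num_pairs) return;

            const float* av = a + (size_t)warp_id * channels;
            const float* bv = b + (size_t)warp_id * channels;

            float partial = 0.0f;
            for (int c = lane; c < channels; c += 32)    // coalesced channel reads
                partial += av[c] * bv[c];

            // Warp-level tree reduction of the 32 partial sums.
            for (int offset = 16; offset > 0; offset >>= 1)
                partial += __shfl_down_sync(0xffffffff, partial, offset);

            if (lane == 0) out[warp_id] = partial;       // lane 0 holds the full dot product
        }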
  14. 24 May, 2025 4 commits