1. 03 Jul, 2025 1 commit
  2. 02 Jul, 2025 4 commits
  3. 01 Jul, 2025 3 commits
  4. 18 Jun, 2025 3 commits
  5. 17 Jun, 2025 6 commits
  6. 16 Jun, 2025 4 commits
  7. 13 Jun, 2025 10 commits
  8. 11 Jun, 2025 3 commits
  9. 06 Jun, 2025 1 commit
  10. 04 Jun, 2025 1 commit
  11. 02 Jun, 2025 1 commit
      Optimized CUDA kernels for improved backward gradient computation · 5f051c97
      Max Rietmann authored
      
      
      Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and
      `s2_attention_kernel_mbT`, for more efficient computation of the backward
      gradients and the forward attention, respectively. These kernels optimize
      memory access patterns and achieve coalesced loads by operating on
      transposed (channel-last) tensors.
      
      Forward kernel written by Mauro Bisson.
      Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.
      
      The parallelization strategy computes one output per warp, with the warp's
      threads computing the dot product in parallel. Because the inputs are
      transposed so that the channel dimension is last, the dot-product memory
      access pattern is perfectly coalesced, which yields excellent performance
      in both the forward and backward kernels (a minimal sketch of this pattern
      follows the commit entry below).
      Co-authored-by: Mauro Bisson <maurob@nvidia.com>
      Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
      Co-authored-by: Andrea Paris <aparis@ethz.ch>
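      The sketch below illustrates the warp-per-output, channel-last pattern the
      commit message describes. It is only an assumption-laden illustration, not
      the actual `s2_attention_kernel_mbT` or `s2_attention_bwd_dkvq_kernel_mbT`
      code; the kernel name, array shapes, and launch parameters are hypothetical.

      ```cuda
      // Hypothetical sketch of the pattern described in the commit message:
      // one warp per output element, channel-last layout for coalesced loads,
      // warp shuffle reduction for the dot product. Not the real s2_attention kernels.
      #include <cuda_runtime.h>

      constexpr int WARP_SIZE = 32;

      // a, b: [num_outputs, channels], channels is the fastest-varying (last) dim.
      // out:  [num_outputs], one dot product per warp.
      __global__ void warp_dot_channel_last(const float* __restrict__ a,
                                            const float* __restrict__ b,
                                            float* __restrict__ out,
                                            int num_outputs, int channels) {
          const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
          const int lane    = threadIdx.x % WARP_SIZE;
          if (warp_id >= num_outputs) return;

          // Consecutive lanes read consecutive channels, so each 32-wide load
          // is fully coalesced thanks to the channel-last layout.
          float partial = 0.0f;
          for (int c = lane; c < channels; c += WARP_SIZE) {
              partial += a[warp_id * channels + c] * b[warp_id * channels + c];
          }

          // Warp-level tree reduction of the per-lane partial sums.
          for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
              partial += __shfl_down_sync(0xffffffffu, partial, offset);
          }

          if (lane == 0) out[warp_id] = partial;
      }
      ```

      Launching with, for example, 128 threads per block gives 4 warps (4 outputs)
      per block; the commit message indicates the same warp-per-output mapping is
      used in both the forward and backward kernels.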
  12. 26 May, 2025 1 commit
  13. 24 May, 2025 2 commits