1. 02 Jul, 2025 1 commit
      Optimized FWD kernel: custom permutations, gmem accesses reduction, vectorized access · 8cb399ee
      Mauro Bisson authored
      * Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
      * Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses.
      * Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
      * Added runtime dispatch logic for kernel specialization.
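The dispatch described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual code from the commit: the function and constant names (`can_use_float4`, `SPECIALIZED_MAX_CHANNELS`, `launch_fwd`) are hypothetical, and only the selection criteria (pointer alignment, channel divisibility, and the `num_channel <= 16384` specialization threshold) come from the commit message.

```cuda
#include <cstdint>

// Threshold from the commit message: a specialized kernel handles
// num_channel <= 16384 with fewer global-memory accesses.
constexpr int SPECIALIZED_MAX_CHANNELS = 16384;

// float4 loads/stores require 16-byte-aligned pointers and a channel
// count divisible by 4 (one float4 covers 4 consecutive floats).
static bool can_use_float4(const float* ptr, int num_channel) {
    return reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0
        && num_channel % 4 == 0;
}

// Hypothetical runtime dispatch: pick the kernel variant per launch.
void launch_fwd(const float* in, float* out, int num_channel,
                cudaStream_t stream) {
    const bool vectorized = can_use_float4(in, num_channel)
                         && can_use_float4(out, num_channel);
    if (num_channel <= SPECIALIZED_MAX_CHANNELS) {
        if (vectorized) { /* launch specialized float4 kernel */ }
        else            { /* launch specialized scalar kernel */ }
    } else {
        /* launch general kernel */
    }
}
```

Checking alignment at runtime rather than assuming it keeps the vectorized path an opportunistic optimization: tensors produced by allocators are usually 16-byte aligned, but sliced views may not be.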
  2. 01 Jul, 2025 3 commits
  3. 18 Jun, 2025 3 commits
  4. 17 Jun, 2025 6 commits
  5. 16 Jun, 2025 4 commits
  6. 13 Jun, 2025 10 commits
  7. 11 Jun, 2025 3 commits
  8. 06 Jun, 2025 1 commit
  9. 04 Jun, 2025 1 commit
  10. 02 Jun, 2025 1 commit
      Optimized CUDA kernels for improved backward gradient computation · 5f051c97
      Max Rietmann authored
      
      
      Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and
      `s2_attention_kernel_mbT`, for more efficient computation of backward gradients
      and forward attention respectively. These changes optimize memory access
      patterns and employ coalesced operations by leveraging tensor transpositions.
      
      Forward kernel written by Mauro Bisson
Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann
      
The parallelization strategy computes one output per warp, with the warp's
threads computing the dot product in parallel. Because the inputs are
transposed so that the channel dimension is last, the dot product's memory
access pattern is perfectly coalesced, yielding excellent performance in
both the forward and backward kernels.
Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
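The warp-per-output strategy with channel-last (transposed) inputs can be sketched as below. This is a minimal illustration of the access pattern, not the actual `s2_attention_*_mbT` kernels: the kernel name, helper, and the assumption of one (query, key) pair per warp are hypothetical.

```cuda
// Each warp reduces one dot product over the channel dimension.
// With channels last, lane i reads element i, i+32, i+64, ... so the
// 32 lanes touch consecutive addresses each iteration: fully coalesced.
__device__ float warp_dot(const float* __restrict__ a,
                          const float* __restrict__ b,
                          int num_channel) {
    const int lane = threadIdx.x % 32;
    float acc = 0.f;
    for (int c = lane; c < num_channel; c += 32)
        acc += a[c] * b[c];
    // Warp-level tree reduction of the 32 partial sums.
    for (int off = 16; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);
    return acc;  // full sum is valid in lane 0
}

// Hypothetical kernel: one output element per warp.
__global__ void dot_per_warp(const float* __restrict__ q,
                             const float* __restrict__ k,
                             float* __restrict__ out,
                             int num_pairs, int num_channel) {
    const int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (warp >= num_pairs) return;
    const float d = warp_dot(q + (size_t)warp * num_channel,
                             k + (size_t)warp * num_channel,
                             num_channel);
    if (threadIdx.x % 32 == 0)
        out[warp] = d;
}
```

Had the channel dimension been first instead, consecutive lanes would read addresses a full row apart, and each 32-lane load would span 32 separate memory transactions instead of a handful, which is why the transposition matters.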
  11. 26 May, 2025 1 commit
  12. 24 May, 2025 6 commits