  1. 17 Jun, 2025 4 commits
  2. 13 Jun, 2025 7 commits
  3. 11 Jun, 2025 3 commits
  4. 06 Jun, 2025 1 commit
  5. 04 Jun, 2025 1 commit
  6. 02 Jun, 2025 1 commit
    • Optimized CUDA kernels for improved backward gradient computation · 5f051c97
      Max Rietmann authored
      
      
      Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and
      `s2_attention_kernel_mbT`, for more efficient computation of backward
      gradients and forward attention, respectively. These changes optimize memory
      access patterns and use coalesced memory operations by transposing the input
      tensors to a channel-last layout.
      
      Forward kernel written by Mauro Bisson.
      Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.
      
      The parallelization strategy computes one output per warp, with the threads
      of the warp computing the dot product in parallel. Because the inputs are
      transposed to have the channel dimension last, the dot-product memory access
      pattern is perfectly coalesced, leading to excellent performance in both the
      forward and backward kernels. A minimal sketch of this access pattern follows
      this entry.
      Co-authored-by: Mauro Bisson <maurob@nvidia.com>
      Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
      Co-authored-by: Andrea Paris <aparis@ethz.ch>
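      The strategy described above can be illustrated with a minimal, self-contained
      sketch. This is not the actual `s2_attention_kernel_mbT` /
      `s2_attention_bwd_dkvq_kernel_mbT` code: the kernel name, the flat
      `[num_rows, channels]` shapes, and the plain per-row dot product below are
      illustrative assumptions. It only shows the access pattern the commit message
      describes: one output per warp, lanes striding over a channel-last dimension
      (coalesced loads), and a warp-shuffle reduction.

      ```cuda
      // Illustrative sketch, not the library's kernel: each warp computes one
      // output value out[row] = dot(a[row, :], b[row, :]) over a channel-last
      // (row-major [num_rows, channels]) layout, so the 32 lanes of a warp read
      // consecutive addresses on every iteration (coalesced loads).
      #include <cstdio>
      #include <cuda_runtime.h>

      __global__ void warp_dot_channels_last(const float* __restrict__ a,
                                             const float* __restrict__ b,
                                             float* __restrict__ out,
                                             int num_rows, int channels)
      {
          const int lane = threadIdx.x % 32;                              // lane id within the warp
          const int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp id = output row
          if (warp >= num_rows) return;

          const float* arow = a + (size_t)warp * channels;
          const float* brow = b + (size_t)warp * channels;

          // Each lane accumulates a strided slice of the dot product.
          float partial = 0.0f;
          for (int c = lane; c < channels; c += 32) {
              partial += arow[c] * brow[c];
          }

          // Warp-shuffle tree reduction combines the 32 partial sums.
          for (int offset = 16; offset > 0; offset >>= 1) {
              partial += __shfl_down_sync(0xffffffff, partial, offset);
          }

          if (lane == 0) {
              out[warp] = partial;  // one output per warp
          }
      }

      int main()
      {
          const int num_rows = 1024, channels = 256;
          const int threads = 128;                                   // 4 warps per block
          const int blocks = (num_rows * 32 + threads - 1) / threads;

          float *a, *b, *out;
          cudaMallocManaged(&a, (size_t)num_rows * channels * sizeof(float));
          cudaMallocManaged(&b, (size_t)num_rows * channels * sizeof(float));
          cudaMallocManaged(&out, num_rows * sizeof(float));
          for (int i = 0; i < num_rows * channels; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

          warp_dot_channels_last<<<blocks, threads>>>(a, b, out, num_rows, channels);
          cudaDeviceSynchronize();
          printf("out[0] = %.1f (expected %.1f)\n", out[0], 2.0f * channels);

          cudaFree(a); cudaFree(b); cudaFree(out);
          return 0;
      }
      ```

      Launched with a block size that is a multiple of 32, consecutive lanes touch
      consecutive channel elements on every load, which is what makes the
      channel-last transposition pay off.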
  7. 26 May, 2025 1 commit
  8. 24 May, 2025 7 commits
  9. 08 May, 2025 1 commit
  10. 29 Apr, 2025 2 commits
  11. 26 Feb, 2025 1 commit
  12. 21 Feb, 2025 1 commit
  13. 21 Jan, 2025 1 commit
  14. 17 Jan, 2025 1 commit
    • Update README.md · 9eea871c
      Mike McCann authored
      Without moving `signal` onto `device`, calling `sht` raises `RuntimeError: Expected all tensors to be on the same device`.
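      A minimal usage sketch of the fix (assuming the `torch_harmonics` README
      example this commit touches; the `RealSHT` constructor arguments and grid
      sizes below are illustrative, not taken from this commit):

      ```python
      import torch
      import torch_harmonics as th

      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

      nlat, nlon = 512, 1024
      sht = th.RealSHT(nlat, nlon, grid="equiangular").to(device)

      # Creating `signal` directly on `device` (or moving it there with .to(device))
      # keeps it on the same device as `sht` and avoids the
      # "Expected all tensors to be on the same device" RuntimeError.
      signal = torch.randn(1, nlat, nlon, device=device)
      coeffs = sht(signal)
      ```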
  15. 14 Jan, 2025 8 commits