- 16 Jun, 2025 2 commits
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This avoids one loop and improves performance by about 20%.
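The commit does not show the kernel itself, but the qdotk_max "trick" is presumably the standard max-subtraction used for numerically stable softmax over the q·k scores; a minimal NumPy sketch under that assumption (the function name is hypothetical):

```python
import numpy as np

def softmax_with_qdotk_max(scores):
    # scores holds the q.k dot products for one output location.
    # Subtracting their maximum (the "qdotk_max") before exponentiating
    # keeps exp() from overflowing; reusing a max already computed in an
    # earlier pass is what lets the backward kernel drop one loop.
    qdotk_max = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - qdotk_max)
    return e / e.sum(axis=-1, keepdims=True)
```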
-
- 13 Jun, 2025 2 commits
-
Thorsten Kurth authored
-
Max Rietmann authored
-
- 11 Jun, 2025 3 commits
-
Max Rietmann authored
-
Max Rietmann authored
Also: make the forward kernel use the modified memory layout while keeping the standard tensor shape.
-
Max Rietmann authored
Also match the gradient output's memory layout to that of the input.
-
- 06 Jun, 2025 1 commit
-
Max Rietmann authored
Detect the (B, C, H, W) memory layout: the stride along C should be 1; if it is not, repack the tensor. This ensures that the backward kernel is fast.
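A rough sketch of the layout check described above, in NumPy (the function name is hypothetical; the actual code operates on tensors passed to the CUDA extension):

```python
import numpy as np

def ensure_channel_stride_one(x):
    # x has logical shape (B, C, H, W). The backward kernel is fast only
    # when the stride along C is 1, i.e. the data is physically
    # channels-last in memory.
    if x.strides[1] != x.itemsize:
        # Repack to contiguous (B, H, W, C), then view back as (B, C, H, W).
        x = np.ascontiguousarray(x.transpose(0, 2, 3, 1)).transpose(0, 3, 1, 2)
    return x
```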
-
- 04 Jun, 2025 1 commit
-
Max Rietmann authored
Putting qy in shared memory is a little faster. Changing the internal memory layout means the code can stay in the standard shape, with the layout changed only outside the kernel.
-
- 02 Jun, 2025 1 commit
-
Max Rietmann authored
Introduce new CUDA kernels, `s2_attention_kernel_mbT` and `s2_attention_bwd_dkvq_kernel_mbT`, for more efficient computation of forward attention and backward gradients, respectively. These changes optimize memory access patterns and use coalesced operations by leveraging tensor transpositions.

Forward kernel written by Mauro Bisson. Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.

The parallelization strategy computes one output per warp, with the warp's threads computing the dot product in parallel. Because the inputs are transposed so that the channel dimension is last, the dot product's memory access pattern is perfectly coalesced, leading to excellent performance in both the forward and backward kernels.

Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
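The warp-per-output strategy can be mimicked on the CPU; this NumPy sketch (hypothetical names, a stand-in for the CUDA code) shows why the channels-last transposition matters: each lane strides over the channel dimension, so with C stored last the warp's lanes touch adjacent addresses on every iteration.

```python
import numpy as np

WARP_SIZE = 32

def warp_dot(q_row, k_row):
    # One warp computes one output score. Lane l handles channels
    # l, l + 32, l + 64, ...; with the channel dimension stored last, the
    # 32 lanes read consecutive addresses each step (coalesced loads).
    partials = [np.dot(q_row[lane::WARP_SIZE], k_row[lane::WARP_SIZE])
                for lane in range(WARP_SIZE)]
    # On the GPU this final sum is a warp shuffle reduction.
    return float(sum(partials))
```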
-
- 24 May, 2025 4 commits
-
Boris Bonev authored
Fixing a bug in the quadrature weights for full attention. Adding better unit tests for attention. Cleanup in the CUDA code.
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-