Optimized CUDA kernels for backward gradient and forward attention computation
Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and `s2_attention_kernel_mbT`, for more efficient computation of backward gradients and forward attention, respectively. These changes optimize memory access patterns and employ coalesced operations by leveraging tensor transpositions.

Forward kernel written by Mauro Bisson. Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.

The parallelization strategy computes one output per warp, with the threads of the warp computing the dot product in parallel. Because the inputs are transposed so that the channel dimension is last, the dot-product memory access pattern is perfectly coalesced, leading to excellent performance. This holds for both the forward and backward kernels (see the sketch below).

Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
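
A minimal sketch of the warp-per-output, channel-last dot-product pattern described above, assuming a simple row-wise dot product as a stand-in for the attention score computation; the kernel name, shapes, and signature here are illustrative only and do not reflect the actual `s2_attention_kernel_mbT` / `s2_attention_bwd_dkvq_kernel_mbT` implementations:

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Hypothetical example: out[i] = dot(a[i, :], b[i, :]) for nrows rows of
// length nchan, with a and b stored channel-last (row-major, channels
// contiguous). One warp produces one output; lanes stride over the channel
// dimension, so consecutive lanes touch consecutive addresses (coalesced).
__global__ void rowwise_dot_kernel(const float* __restrict__ a,
                                   const float* __restrict__ b,
                                   float* __restrict__ out,
                                   int nrows, int nchan)
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    const int lane    = threadIdx.x % WARP_SIZE;
    if (warp_id >= nrows) return;

    const float* arow = a + (size_t)warp_id * nchan;
    const float* brow = b + (size_t)warp_id * nchan;

    // Each lane accumulates a partial dot product over a strided slice
    // of the channel dimension.
    float partial = 0.0f;
    for (int c = lane; c < nchan; c += WARP_SIZE) {
        partial += arow[c] * brow[c];
    }

    // Warp-level tree reduction via shuffles; lane 0 ends up with the full sum.
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        partial += __shfl_down_sync(0xffffffff, partial, offset);
    }
    if (lane == 0) {
        out[warp_id] = partial;
    }
}
```

The same access pattern applies to the backward pass: gradients with respect to `k`, `v`, and `q` reduce over the (transposed, channel-last) inner dimension, so each warp again performs a fully coalesced strided read followed by a shuffle reduction.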