- 16 Jun, 2025 2 commits
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This removes one loop and improves performance by about 20%.
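Below is a minimal sketch (not the library's actual `s2_attention_bwd_dkvq_kernel_mbT`) of how a cached per-query maximum logit enables this; the cached denominator `denom`, the flat indexing, and the one-block-per-query launch are assumptions made for brevity.

```cuda
#include <cuda_runtime.h>

// Sketch only: with qdotk_max[i] (and the softmax denominator denom[i]) cached
// from the forward pass, the backward loop over keys can rebuild the attention
// weight alpha_ij on the fly, so no separate max-finding pass is needed.
// Quadrature weights, batching and heads are omitted; the per-key dot products
// are recomputed serially here, whereas a real kernel computes them cooperatively.
__global__ void attn_bwd_dq_sketch(const float* __restrict__ q,          // [nq][C]
                                   const float* __restrict__ k,          // [nk][C]
                                   const float* __restrict__ v,          // [nk][C]
                                   const float* __restrict__ y,          // [nq][C] forward output
                                   const float* __restrict__ dout,       // [nq][C] upstream gradient
                                   const float* __restrict__ qdotk_max,  // [nq] cached max of q.k
                                   const float* __restrict__ denom,      // [nq] cached softmax denominator
                                   float* __restrict__ dq,               // [nq][C]
                                   int nq, int nk, int C)
{
    int i = blockIdx.x;    // one query point per block
    int c = threadIdx.x;   // one channel per thread (assumes C <= blockDim.x)
    if (i >= nq || c >= C) return;

    float acc = 0.0f;
    for (int j = 0; j < nk; ++j) {
        float s = 0.0f, g = 0.0f;
        for (int l = 0; l < C; ++l) {
            s += q[i * C + l] * k[j * C + l];                      // logit q.k_j
            g += dout[i * C + l] * (v[j * C + l] - y[i * C + l]);  // dL/d(logit_j)
        }
        float alpha = __expf(s - qdotk_max[i]) / denom[i];  // softmax weight, no extra max loop
        acc += alpha * g * k[j * C + c];                    // d(logit_j)/dq = k_j
    }
    dq[i * C + c] = acc;
}
```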
-
- 13 Jun, 2025 2 commits
-
Thorsten Kurth authored
-
Max Rietmann authored
-
- 11 Jun, 2025 3 commits
-
Max Rietmann authored
-
Max Rietmann authored
Also: made the forward kernel use the modified memory layout while keeping the standard tensor shape.
-
Max Rietmann authored
Also: match the memory layout of the gradient output to that of the input.
-
- 06 Jun, 2025 1 commit
-
Max Rietmann authored
Detect the memory layout of (B, C, H, W) inputs: the stride of the channel dimension C should be 1; if it is not, fix it. This ensures that the backward kernel stays fast.
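A hedged sketch of that kind of host-side check, written against the PyTorch C++ extension API; the function name and where it gets called are assumptions, not the library's actual interface.

```cuda
#include <torch/extension.h>

// Sketch: ensure a (B, C, H, W) tensor has stride 1 along the channel
// dimension before it is handed to the CUDA kernels; convert only if needed.
static at::Tensor ensure_channel_stride_one(at::Tensor x)
{
    TORCH_CHECK(x.dim() == 4, "expected a (B, C, H, W) tensor");
    if (x.stride(1) != 1) {
        // Permute to (B, H, W, C), materialize contiguously, then view back as
        // (B, C, H, W); the data now has stride 1 along C. Equivalent to
        // x.contiguous(at::MemoryFormat::ChannelsLast) for 4D tensors.
        x = x.permute({0, 2, 3, 1}).contiguous().permute({0, 3, 1, 2});
    }
    return x;
}
```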
-
- 04 Jun, 2025 1 commit
-
Max Rietmann authored
Putting qy in shared memory is a little faster. Changing the internal memory layout means the kernel code can stay in the standard shape, with the layout change handled outside the kernel.
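A rough sketch of the shared-memory staging, assuming one output point per block, a single warp per block, and channel-last inputs; the name `qy` and the score-only computation are illustrative.

```cuda
#include <cuda_runtime.h>

// Sketch only: the query channels of this block's output point are loaded into
// shared memory once and reused for every key, instead of being re-read from
// global memory inside the key loop. Launch with blockDim.x == 32 (one warp)
// and C * sizeof(float) bytes of dynamic shared memory.
__global__ void attn_scores_shared_q_sketch(const float* __restrict__ q, // [nq][C], channel stride 1
                                            const float* __restrict__ k, // [nk][C]
                                            float* __restrict__ scores,  // [nq][nk]
                                            int nq, int nk, int C)
{
    extern __shared__ float qy[];   // C floats holding this block's query point
    int i = blockIdx.x;
    if (i >= nq) return;

    // stage q_i in shared memory once (coalesced thanks to channel stride 1)
    for (int c = threadIdx.x; c < C; c += blockDim.x)
        qy[c] = q[(size_t)i * C + c];
    __syncthreads();

    // all lanes cooperate on one q.k_j at a time, reading q from shared memory
    for (int j = 0; j < nk; ++j) {
        float part = 0.0f;
        for (int c = threadIdx.x; c < C; c += blockDim.x)
            part += qy[c] * k[(size_t)j * C + c];
        for (int off = 16; off > 0; off >>= 1)   // warp reduction of the partial sums
            part += __shfl_down_sync(0xffffffffu, part, off);
        if (threadIdx.x == 0)
            scores[(size_t)i * nk + j] = part;
    }
}
```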
-
- 02 Jun, 2025 1 commit
-
Max Rietmann authored
Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and `s2_attention_kernel_mbT`, for more efficient computation of backward gradients and forward attention, respectively. These changes optimize memory access patterns and employ coalesced operations by leveraging tensor transpositions.

The forward kernel was written by Mauro Bisson; the backward kernel by Andrea Paris (aparis@ethz.ch) and Max Rietmann.

The parallelization strategy computes one output per warp, with the warp's threads computing the dot product in parallel. Because the inputs are transposed so that the channel dimension is last, the dot-product memory access pattern is perfectly coalesced, which leads to excellent performance in both the forward and backward kernels (a simplified sketch follows below).

Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
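A simplified sketch of this warp-per-output strategy, not the actual `s2_attention_kernel_mbT`: batching, multiple heads, and quadrature weights are omitted, and C <= 32 is assumed so the value accumulator fits in one register per lane.

```cuda
#include <cuda_runtime.h>

// Sketch: one warp produces one output point. For each key, the 32 lanes
// stride over the trailing channel dimension, so the q and k reads are
// coalesced; a shuffle reduction yields the logit, and the value accumulation
// uses the usual running-max rescaling for a numerically stable softmax.
__global__ void s2_attention_fwd_sketch(const float* __restrict__ q, // [nq][C], channel stride 1
                                        const float* __restrict__ k, // [nk][C]
                                        const float* __restrict__ v, // [nk][C]
                                        float* __restrict__ y,       // [nq][C]
                                        int nq, int nk, int C)
{
    int i    = blockIdx.x;   // one output (query) point per warp
    int lane = threadIdx.x;  // blockDim.x == 32 assumed
    if (i >= nq) return;

    float qdotk_max = -1e30f;  // running maximum of the logits
    float denom     = 0.0f;    // running softmax denominator
    float acc       = 0.0f;    // running numerator for channel `lane` (needs C <= 32)

    for (int j = 0; j < nk; ++j) {
        // coalesced, lane-strided dot product q_i . k_j
        float part = 0.0f;
        for (int c = lane; c < C; c += 32)
            part += q[(size_t)i * C + c] * k[(size_t)j * C + c];
        for (int off = 16; off > 0; off >>= 1)
            part += __shfl_down_sync(0xffffffffu, part, off);
        float logit = __shfl_sync(0xffffffffu, part, 0);  // broadcast lane 0's sum

        // online softmax: rescale the running sums when a new maximum appears
        float new_max = fmaxf(qdotk_max, logit);
        float scale   = __expf(qdotk_max - new_max);
        float w       = __expf(logit - new_max);
        denom = denom * scale + w;
        if (lane < C)
            acc = acc * scale + w * v[(size_t)j * C + lane];
        qdotk_max = new_max;
    }
    if (lane < C)
        y[(size_t)i * C + lane] = acc / denom;
}
```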
-
- 24 May, 2025 4 commits
-
Boris Bonev authored
Fixing a bug in the quadrature weights for full attention. Adding better unit tests for attention. Cleanup in the CUDA code.
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-
- 19 Aug, 2024 1 commit
-
Boris Bonev authored
* adding cuda kernels for disco conv
* making psi_idx an attribute
* adding license headers
* adding author files
* reorganizing files
* draft implementation
* added conditional installation to setup.py
* formatting changes
* removing triton kernel in DISCO convolution
* updated github actions
* updated Readme and changelog
* adding another guard for the cuda installation
* renaming the cuda extension
* simplifying setup.py
* minor bugfix
* Bbonev/cuda disco cleanup (#32)
* cleanup of disco convolutions based on CUDA extension
* fixing unittest
* changing version to experimental 0.7.0a
* initial rewrite of the distributed convolution with CUDA
* fixing streams
* need to fix install options
* fixing streams
* undid setup.py changes
* reset setup.py
* including CUDAStream
* adjusted the precomputation of theta_cutoff. If you rely on this, your models will not be backwards-compatible.
* adjusting theta_cutoff in the unittest
* adding newly refactored kernels for faster compile
* Tkurth/cuda disco distributed fix (#34)
* attempt to make disco distributed
* working distributed convolutions
* fixing distributed conv
* working distributed disco
* removing irrelevant extra argument
* using stream functions from at instead of c10
* using stream functions from at instead of c10, small fix
* Bbonev/disc even filters (#35)
* initial working commit with new convention of counting collocation points across the diameter instead of across the radius
* fixed a bug in the computation of the even kernels
* changing heuristic for computing theta_cutoff
* Fixing unittest
* Readability improvements
* reworked normalization of filter basis functions
* implemented discrete normalization of disco filters
* relaxing tolerances in convolution unit test
* bugfix to correctly support unequal scale factors in latitudes and longitudes
* hotfix to a bug in the imports
* Bbonev/distributed disco refactor (#37)
* cleaned up normalization code in convolution
* formatting changes in distributed convolution
* Fixing default theta_cutoff to be the same in distributed and local case
* fixed distributed convolution to support the same normalization as non-distributed one
* readability improvements
* fixed initial scale of convolution parameter weights and fixed naming of the normalization routine
* Updated Readme.md
* added comment in Dockerfile regarding older architectures

---------

Co-authored-by: Thorsten Kurth <tkurth@nvidia.com>
Co-authored-by: Boris Bonev <bbonev@nvidia.com>
-