- 16 Jul, 2025 4 commits
-
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation.
* Split kernel into general and specialized versions (for num_channel <= 8192).
* Enabled float4-based vectorized memory access, when possible.
* Added runtime dispatch logic for kernel specialization (see the sketch after this list).

Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters. Extracted shared host/device functions and declarations into a separate module:
* attention_utils.cuh
* attention_utils.cu
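The vectorization and dispatch pattern described above can be sketched roughly as follows; the kernel names, the copy-style bodies, and the launch parameters are illustrative placeholders, not the actual attention_fwd_cuda.cu code:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// General kernel: one block per row, threads stride over channels with
// scalar float loads. Works for any channel count and alignment.
__global__ void rows_copy_general(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int64_t num_channel) {
    int64_t base = (int64_t)blockIdx.x * num_channel;
    for (int64_t c = threadIdx.x; c < num_channel; c += blockDim.x)
        out[base + c] = in[base + c];
}

// Specialized kernel: float4 loads/stores move 4 channels per instruction.
__global__ void rows_copy_vec4(const float4* __restrict__ in,
                               float4* __restrict__ out,
                               int64_t num_vec) {
    int64_t base = (int64_t)blockIdx.x * num_vec;
    for (int64_t v = threadIdx.x; v < num_vec; v += blockDim.x)
        out[base + v] = in[base + v];
}

// Runtime dispatch: take the vectorized path only when the pointers are
// 16-byte aligned and the channel count is divisible by 4 and small
// enough for the specialized kernel.
void launch_rows_copy(const float* in, float* out, int64_t num_rows,
                      int64_t num_channel, cudaStream_t stream) {
    dim3 grid(static_cast<unsigned int>(num_rows));
    bool aligned = reinterpret_cast<uintptr_t>(in)  % sizeof(float4) == 0 &&
                   reinterpret_cast<uintptr_t>(out) % sizeof(float4) == 0;
    if (aligned && num_channel % 4 == 0 && num_channel <= 8192) {
        rows_copy_vec4<<<grid, 256, 0, stream>>>(
            reinterpret_cast<const float4*>(in),
            reinterpret_cast<float4*>(out), num_channel / 4);
    } else {
        rows_copy_general<<<grid, 256, 0, stream>>>(in, out, num_channel);
    }
}
```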
-
- 14 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
* removing duplicate code from distributed convolution
* replacing from_numpy with as_tensor
* making preprocess_psi_tensor GPU ready
-
- 08 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
* refactoring disco backend code
* removed get_psi as member function and instead put it in _disco_convolution
* setting seeds in tests more consistently
* parametrized test classes to ensure that tests are always run on both CPU and GPU (if available)
* cleaning up
-
- 07 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
-
- 04 Jul, 2025 1 commit
-
-
Mauro Bisson authored
Updated pointer offset calculations to use 64-bit integers to prevent overflow with large batch or image sizes.
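A minimal sketch of the overflow issue this addresses, with an illustrative kernel rather than the actual code: the offset has to be widened to 64 bits before the multiplications, otherwise the 32-bit intermediate product wraps around for large batch or image sizes.

```cuda
#include <cstdint>

// Illustrative kernel: for large batch or image sizes the product
// batch * channels * height * width exceeds INT_MAX, so the index math
// is promoted to 64-bit *before* multiplying instead of after.
__global__ void scale_inplace(float* __restrict__ data, float alpha,
                              int batch, int channels, int height, int width) {
    int64_t total = (int64_t)batch * channels * height * width;

    // Grid-stride loop with a 64-bit loop counter.
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total;
         i += (int64_t)gridDim.x * blockDim.x) {
        data[i] *= alpha;
    }
}
```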
-
- 03 Jul, 2025 3 commits
-
-
Max Rietmann authored
-
Max Rietmann authored
-
Thorsten Kurth authored
-
- 02 Jul, 2025 4 commits
-
-
Mauro Bisson authored
-
Mauro Bisson authored
* Added a new CSR array, psi_row_index, containing "ho" values sorted in descending order of CSR row length; this is used to process (ho, wo) points corresponding to longer rows before shorter ones, improving overlap and reducing the tail effect.
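A hedged sketch of how such an indirection array can be used; apart from psi_row_index, the CSR field names and the one-warp-per-block launch are assumptions for illustration:

```cuda
#include <cstdint>

// Illustrative CSR kernel launched with one warp (32 threads) per block.
// Rows are visited through psi_row_index, which lists the "ho" values in
// descending order of row length, so the longest rows are scheduled first
// and the short ones fill in at the end, shrinking the tail effect.
__global__ void csr_weighted_sum(const int64_t* __restrict__ row_ptr,
                                 const int64_t* __restrict__ col_idx,
                                 const float*   __restrict__ vals,
                                 const int64_t* __restrict__ psi_row_index,
                                 const float*   __restrict__ x,
                                 float*         __restrict__ y,
                                 int64_t num_rows) {
    for (int64_t r = blockIdx.x; r < num_rows; r += gridDim.x) {
        int64_t row = psi_row_index[r];              // longest rows first
        int64_t beg = row_ptr[row];
        int64_t end = row_ptr[row + 1];

        float acc = 0.0f;
        for (int64_t j = beg + threadIdx.x; j < end; j += 32)
            acc += vals[j] * x[col_idx[j]];

        // Warp shuffle reduction of the per-lane partial sums.
        for (int offset = 16; offset > 0; offset >>= 1)
            acc += __shfl_down_sync(0xffffffff, acc, offset);

        if (threadIdx.x == 0) y[row] = acc;
    }
}
```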
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200); see the sketch after this list.
* Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses.
* Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
* Added runtime dispatch logic for kernel specialization.
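A rough sketch of a custom permutation kernel of the kind mentioned in the first bullet, here for a (B, C, N) -> (B, N, C) permutation staged through a shared-memory tile; the shapes and names are illustrative, not the actual kernels:

```cuda
#include <cstdint>

#define TILE 32

// Permute a (B, C, N) tensor to (B, N, C) with coalesced reads and writes.
// A shared-memory tile stages the data so that both the global load and
// the global store are contiguous along threadIdx.x.
__global__ void permute_bcn_to_bnc(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int64_t C, int64_t N) {
    __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

    int64_t b  = blockIdx.z;
    int64_t c0 = (int64_t)blockIdx.y * TILE;
    int64_t n0 = (int64_t)blockIdx.x * TILE;

    int64_t c = c0 + threadIdx.y;
    int64_t n = n0 + threadIdx.x;
    if (c < C && n < N)
        tile[threadIdx.y][threadIdx.x] = in[(b * C + c) * N + n];
    __syncthreads();

    // Write transposed: threadIdx.x now runs along the channel dimension.
    c = c0 + threadIdx.x;
    n = n0 + threadIdx.y;
    if (c < C && n < N)
        out[(b * N + n) * C + c] = tile[threadIdx.x][threadIdx.y];
}
// Launch: block = dim3(TILE, TILE); grid = dim3((N+TILE-1)/TILE, (C+TILE-1)/TILE, B);
```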
- 18 Jun, 2025 1 commit
-
-
Max Rietmann authored
-
- 17 Jun, 2025 2 commits
-
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
- 16 Jun, 2025 2 commits
-
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This avoids one loop and improves performance by about 20%.
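One way this can look, assuming (for illustration only) that the forward pass stores, per output point, the maximum of q·k over the keys (qdotk_max) together with the softmax denominator, so the backward kernel does not need a separate max-finding loop; the function and parameter names below are hypothetical:

```cuda
#include <math.h>

// Assumption for illustration: qdotk_max and the softmax denominator were
// saved by the forward pass. Reusing them lets the attention weight be
// rebuilt in a single pass over the channels, without an extra loop to
// find the maximum again.
__device__ float attention_weight(const float* __restrict__ q,
                                  const float* __restrict__ k,
                                  int channels,
                                  float qdotk_max,   // saved by the forward pass
                                  float denom) {     // saved softmax sum
    float qdotk = 0.0f;
    for (int c = 0; c < channels; ++c)
        qdotk += q[c] * k[c];
    // Subtracting the saved maximum keeps expf() numerically safe.
    return expf(qdotk - qdotk_max) / denom;
}
```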
-
- 13 Jun, 2025 2 commits
-
-
Thorsten Kurth authored
-
Max Rietmann authored
-
- 11 Jun, 2025 3 commits
-
-
Max Rietmann authored
-
Max Rietmann authored
Also: Made fwd kernel use modified memory layout with standard shape
-
Max Rietmann authored
Also match the memory layout of the gradient output to that of the input.
-
- 06 Jun, 2025 1 commit
-
-
Max Rietmann authored
Detect the memory layout of (B,C,H,W) tensors (the stride of C should be 1; if not, fix it). This ensures that the backward kernel is fast.
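A hedged host-side sketch of such a check using the libtorch C++ API; the helper name is hypothetical and the actual logic in the repository may differ:

```cpp
#include <torch/torch.h>

// The fast kernels expect the channel dimension of a (B, C, H, W) tensor
// to have stride 1 (channels fastest-varying in memory). If it does not,
// permute to (B, H, W, C), materialize that layout, and view it back so
// the kernel sees coalesced channel reads.
torch::Tensor ensure_channel_stride_one(torch::Tensor x) {
    TORCH_CHECK(x.dim() == 4, "expected a (B, C, H, W) tensor");
    if (x.stride(1) == 1) return x;  // channels already contiguous
    return x.permute({0, 2, 3, 1}).contiguous().permute({0, 3, 1, 2});
}
```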
-
- 04 Jun, 2025 1 commit
-
-
Max Rietmann authored
Putting qy in shared memory is a little faster. Changing the internal memory layout means we can leave the code in standard shape and only change the layout external to the kernel.
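A minimal sketch of the shared-memory idea, assuming one block per output point; the kernel, names, and layout are illustrative rather than the actual attention kernel:

```cuda
#include <cstdint>

// Illustrative kernel: one block per output point. The output's query
// vector ("qy") is staged in shared memory once and then reused for every
// key, instead of being re-read from global memory in the inner loop.
__global__ void scores_one_output(const float* __restrict__ q,   // (num_out, channels)
                                  const float* __restrict__ k,   // (num_keys, channels)
                                  float* __restrict__ scores,    // (num_out, num_keys)
                                  int channels, int num_keys) {
    extern __shared__ float qy_s[];                  // `channels` floats
    const int64_t out = blockIdx.x;

    // Cooperative load of this block's query vector into shared memory.
    for (int c = threadIdx.x; c < channels; c += blockDim.x)
        qy_s[c] = q[out * channels + c];
    __syncthreads();

    // Each thread handles a subset of keys, reusing the shared query.
    for (int key = threadIdx.x; key < num_keys; key += blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < channels; ++c)
            acc += qy_s[c] * k[(int64_t)key * channels + c];
        scores[out * (int64_t)num_keys + key] = acc;
    }
}
// Launch: scores_one_output<<<num_out, 128, channels * sizeof(float)>>>(q, k, scores, channels, num_keys);
```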
-
- 02 Jun, 2025 1 commit
-
-
Max Rietmann authored
Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and `s2_attention_kernel_mbT`, for more efficient computation of backward gradients and forward attention, respectively. These changes optimize memory access patterns and employ coalesced operations by leveraging tensor transpositions.

Forward kernel written by Mauro Bisson. Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.

The parallelization strategy computes one output per warp, with the threads of the warp computing the dot product in parallel. Because the inputs are transposed to have the channel dimension last, the dot-product memory access pattern is perfectly coalesced, leading to excellent performance in both the forward and backward kernels.

Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
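A hedged sketch of the warp-per-output parallelization with channels-last (stride-1) inputs; this is a toy dot-product kernel, not the actual s2_attention kernels, but it shows why the access pattern is coalesced:

```cuda
#include <cstdint>

// One warp computes one output value: the 32 lanes stride over the
// channel dimension, and because channels are last (stride 1) the loads
// from q and k are coalesced. A shuffle reduction combines the partials.
__global__ void warp_dot_products(const float* __restrict__ q,   // (num_out, channels)
                                  const float* __restrict__ k,   // (num_out, channels)
                                  float* __restrict__ out,       // (num_out)
                                  int64_t num_out, int channels) {
    int warps_per_block = blockDim.x / 32;
    int64_t warp = (int64_t)blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (warp >= num_out) return;

    // Consecutive lanes read consecutive channels -> fully coalesced.
    float acc = 0.0f;
    for (int c = lane; c < channels; c += 32)
        acc += q[warp * channels + c] * k[warp * channels + c];

    // Tree reduction across the warp using register shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);

    if (lane == 0) out[warp] = acc;
}
// Launch with e.g. 256 threads/block: warp_dot_products<<<(num_out * 32 + 255) / 256, 256>>>(...)
```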
-
- 26 May, 2025 1 commit
-
-
Thorsten Kurth authored
-
- 24 May, 2025 5 commits
-
-
Boris Bonev authored
Fixing a bug in the quadrature weights for full attention. Adding better unit tests for attention. Cleanup in the CUDA code.
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-
- 08 May, 2025 1 commit
-
-
Thorsten Kurth authored
* setting imaginary parts of DCT and nyquist frequency to zero in IRSHT variants
-
- 29 Apr, 2025 2 commits
-
-
Boris Bonev authored
This reverts commit 82881276.
-
Thorsten Kurth authored
* setting imaginary parts of DCT and nyquist frequency to zero in IRSHT variants
* small fix
* making einsum result contiguous
* adding zero frequency to distributed sht
-
- 26 Feb, 2025 1 commit
-
-
Thorsten Kurth authored
* small hotfix for lobatto grid precomputation routine
* adding lobatto grid to tests
-
- 21 Feb, 2025 1 commit
-
-
Thorsten Kurth authored
* adding caching
* replacing many numpy calls with torch calls
* bumping up version number to 0.7.6
-
- 21 Jan, 2025 1 commit
-
-
Boris Bonev authored
* Improved computation of Morlet filter basis and switched to a Hann window.
* addresses #064 and some cleanup
-