- 07 Jul, 2025 1 commit
-
Thorsten Kurth authored
Use 64-bit for pointer offsets
-
- 04 Jul, 2025 2 commits
-
Mauro Bisson authored
Updated pointer offset calculations to use 64-bit integers to prevent overflow with large batch or image sizes.
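This is the classic 32-bit indexing problem: for a [batch, chans, height, width] tensor, the flat offset is a product of all four extents and wraps past 2^31 - 1 once the tensor is large enough. A minimal sketch of the fix, with illustrative names rather than the actual torch-harmonics kernel code:

```cuda
// Illustrative sketch, not the actual torch-harmonics code: flat element
// offset for a [batch, chans, height, width] tensor. With plain int
// arithmetic the intermediate products wrap past 2^31 - 1; promoting to
// 64 bits before the first multiplication avoids the overflow.
__device__ __forceinline__
long long flat_offset(int b, int c, int y, int x,
                      int chans, int height, int width) {
    return (((long long)b * chans + c) * height + y) * width + x;
}

// Grid-stride loop over a 64-bit element count, so huge tensors stay safe.
__global__ void scale_kernel(float* __restrict__ data, long long nelem,
                             float s) {
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < nelem; i += stride)
        data[i] *= s;
}
```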
-
Thorsten Kurth authored
Using torch tools to change the memory layout in the backward pass
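In other words, the layout change is expressed with torch's own ops rather than a custom kernel. A rough sketch of what this might look like with the torch C++ API (function names and the channels-last target layout are assumptions, not the actual implementation):

```cuda
#include <torch/torch.h>

// Illustrative sketch (not the actual torch-harmonics code): change the
// memory layout with torch's own permute/contiguous instead of a custom
// kernel, moving channels to the innermost dimension for the backward pass.
torch::Tensor to_channels_last(const torch::Tensor& x) {
    // x: [batch, chans, h, w] -> same logical data, [batch, h, w, chans] in memory
    return x.permute({0, 2, 3, 1}).contiguous();
}

torch::Tensor from_channels_last(const torch::Tensor& x) {
    // inverse: [batch, h, w, chans] -> [batch, chans, h, w]
    return x.permute({0, 3, 1, 2}).contiguous();
}
```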
-
- 03 Jul, 2025 4 commits
-
Max Rietmann authored
-
Max Rietmann authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Optimized forward kernel for attention
-
- 02 Jul, 2025 4 commits
-
Mauro Bisson authored
-
Mauro Bisson authored
* Added a new CSR array, psi_row_index, containing "ho" values sorted in descending order of CSR row length; this is used to process (ho, wo) points corresponding to longer rows before shorter ones, improving overlap and reducing the tail effect.
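A host-side sketch of how such an index can be built from standard CSR row pointers (illustrative; `build_psi_row_index` and its signature are assumptions, not the actual implementation):

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative host-side sketch: given CSR row pointers, produce row ids
// ("ho" values) ordered by descending row length, so the longest rows are
// scheduled first and the tail of short rows is absorbed by the remaining
// occupancy instead of running last.
std::vector<int> build_psi_row_index(const std::vector<int>& row_ptr) {
    const int nrows = (int)row_ptr.size() - 1;
    std::vector<int> idx(nrows);
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](int a, int b) {
        // row length = row_ptr[r + 1] - row_ptr[r]
        return row_ptr[a + 1] - row_ptr[a] > row_ptr[b + 1] - row_ptr[b];
    });
    return idx;  // idx[0] is the longest row, idx[nrows - 1] the shortest
}
```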
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
* Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses.
* Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
* Added runtime dispatch logic for kernel specialization.
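The vectorization and dispatch pattern from the last two bullets, sketched in isolation (kernel and function names are assumptions; the real permutation kernels do more than a straight copy):

```cuda
#include <cstdint>

// Illustrative sketch of the float4 fast path and its runtime dispatch.
__global__ void copy_vec4(const float4* __restrict__ src,
                          float4* __restrict__ dst, long long n4) {
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) dst[i] = src[i];  // one 128-bit load + store per thread
}

__global__ void copy_scalar(const float* __restrict__ src,
                            float* __restrict__ dst, long long n) {
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

static bool aligned16(const void* p) { return ((uintptr_t)p & 15) == 0; }

void dispatch_copy(const float* src, float* dst, long long n,
                   cudaStream_t stream) {
    const int threads = 256;
    // Vectorize only when both pointers are 16-byte aligned and the element
    // count splits evenly into float4 chunks; otherwise fall back to scalar.
    if (aligned16(src) && aligned16(dst) && n % 4 == 0) {
        const long long n4 = n / 4;
        copy_vec4<<<(unsigned)((n4 + threads - 1) / threads), threads, 0,
                    stream>>>((const float4*)src, (float4*)dst, n4);
    } else {
        copy_scalar<<<(unsigned)((n + threads - 1) / threads), threads, 0,
                      stream>>>(src, dst, n);
    }
}
```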
-
- 01 Jul, 2025 3 commits
-
Thorsten Kurth authored
Small fix in metric computation
-
Andrea Paris authored
-
Andrea Paris authored
-
- 18 Jun, 2025 3 commits
-
Thorsten Kurth authored
Optimize backward kernel: incremental updates of qdot_max, alpha, integral, etc.
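The "incremental" part is the online-softmax recurrence: instead of one loop to find max(q·k) and a second to accumulate the exponentials, a single loop maintains the running maximum and rescales the running sums whenever a new maximum appears. A sketch of the per-element update (illustrative; the actual kernel fuses this with the gradient computation):

```cuda
// Illustrative device-side sketch of the recurrence. m = running max of
// q·k, alpha = running sum of exp(q·k - m), integral = running weighted sum.
__device__ void online_softmax_step(float qdotk, float value,
                                    float& m, float& alpha, float& integral) {
    if (qdotk > m) {
        const float scale = expf(m - qdotk);  // rescale old accumulators
        alpha    *= scale;
        integral *= scale;
        m = qdotk;
    }
    const float w = expf(qdotk - m);
    alpha    += w;             // normalizer
    integral += w * value;     // weighted sum
}
// Initialize with m = -INFINITY, alpha = 0, integral = 0; after the loop,
// integral / alpha is the softmax-weighted result.
```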
-
Max Rietmann authored
-
Max Rietmann authored
-
- 17 Jun, 2025 6 commits
-
Thorsten Kurth authored
Adding -lineinfo to the optional debug flags
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
- 16 Jun, 2025 4 commits
-
Max Rietmann authored
-
Max Rietmann authored
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This avoids one loop and improves performance by about 20%.
-
- 13 Jun, 2025 10 commits
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Fixing attention perf test (attempt 1)
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Optimized CUDA kernels for S2 Attention (forward and backward)
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Merge branch 'mr/bwd-channel-permute-experiments' of https://github.com/rietmann-nv/torch-harmonics into mr/bwd-channel-permute-experiments
-
Max Rietmann authored
-
Thorsten Kurth authored
-
- 11 Jun, 2025 3 commits
-
Max Rietmann authored
-
Max Rietmann authored
Also: made the forward kernel use the modified memory layout with the standard shape
-
Max Rietmann authored
Also: match the memory layout of the gradient output to that of the input
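Concretely, matching layouts means the backward pass hands back a gradient with the same strides as the forward input, so callers never see a hidden permute. A sketch with the torch C++ API (illustrative; the actual code works at the kernel level):

```cuda
#include <torch/torch.h>

// Illustrative sketch: allocate the input gradient with the same sizes
// *and* strides as the forward input, so what backward returns is laid out
// exactly like what the caller passed in, with no extra copy afterwards.
torch::Tensor alloc_grad_like(const torch::Tensor& input) {
    return torch::empty_strided(input.sizes(), input.strides(),
                                input.options());
}
```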
-