- 16 Jul, 2025 4 commits
-
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation.
* Split kernel into general and specialized versions (for num_channel <= 8192).
* Enabled float4-based vectorized memory access, when possible.
* Added runtime dispatch logic for kernel specialization (see the sketch after this list).

Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters. Extracted shared host/device functions and declarations into a separate module:
* attention_utils.cuh
* attention_utils.cu
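The vectorization and dispatch pattern described above can be sketched roughly as follows; the kernel names, the copy-style bodies, and the launch parameters are illustrative placeholders, not the actual attention_fwd_cuda.cu code:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// General kernel: one block per row, threads stride over channels with
// scalar float loads. Works for any channel count and alignment.
__global__ void rows_copy_general(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int64_t num_channel) {
    int64_t base = (int64_t)blockIdx.x * num_channel;
    for (int64_t c = threadIdx.x; c < num_channel; c += blockDim.x)
        out[base + c] = in[base + c];
}

// Specialized kernel: float4 loads/stores move 4 channels per instruction.
__global__ void rows_copy_vec4(const float4* __restrict__ in,
                               float4* __restrict__ out,
                               int64_t num_vec) {
    int64_t base = (int64_t)blockIdx.x * num_vec;
    for (int64_t v = threadIdx.x; v < num_vec; v += blockDim.x)
        out[base + v] = in[base + v];
}

// Runtime dispatch: take the vectorized path only when the pointers are
// 16-byte aligned and the channel count is divisible by 4 and small
// enough for the specialized kernel.
void launch_rows_copy(const float* in, float* out, int64_t num_rows,
                      int64_t num_channel, cudaStream_t stream) {
    dim3 grid(static_cast<unsigned int>(num_rows));
    bool aligned = reinterpret_cast<uintptr_t>(in)  % sizeof(float4) == 0 &&
                   reinterpret_cast<uintptr_t>(out) % sizeof(float4) == 0;
    if (aligned && num_channel % 4 == 0 && num_channel <= 8192) {
        rows_copy_vec4<<<grid, 256, 0, stream>>>(
            reinterpret_cast<const float4*>(in),
            reinterpret_cast<float4*>(out), num_channel / 4);
    } else {
        rows_copy_general<<<grid, 256, 0, stream>>>(in, out, num_channel);
    }
}
```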
-
- 14 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
* removing duplicate code from distributed convolution
* replacing from_numpy with as_tensor
* making preprocess_psi_tensor GPU ready
-
- 08 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
* refactoring disco backend code
* removed get_psi as member function and instead put it in _disco_convolution
* setting seeds in tests more consistently
* parametrized test classes to ensure that tests are always run on both CPU and GPU (if available)
* cleaning up
-
- 07 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
-
- 04 Jul, 2025 1 commit
-
-
Mauro Bisson authored
Updated pointer offset calculations to use 64-bit integers to prevent overflow with large batch or image sizes.
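A minimal sketch of the overflow issue this addresses, with an illustrative kernel rather than the actual code: the offset has to be widened to 64 bits before the multiplications, otherwise the 32-bit intermediate product wraps around for large batch or image sizes.

```cuda
#include <cstdint>

// Illustrative kernel: for large batch or image sizes the product
// batch * channels * height * width exceeds INT_MAX, so the index math
// is promoted to 64-bit *before* multiplying instead of after.
__global__ void scale_inplace(float* __restrict__ data, float alpha,
                              int batch, int channels, int height, int width) {
    int64_t total = (int64_t)batch * channels * height * width;

    // Grid-stride loop with a 64-bit loop counter.
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total;
         i += (int64_t)gridDim.x * blockDim.x) {
        data[i] *= alpha;
    }
}
```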
-
- 03 Jul, 2025 3 commits
-
-
Max Rietmann authored
-
Max Rietmann authored
-
Thorsten Kurth authored
-
- 02 Jul, 2025 4 commits
-
-
Mauro Bisson authored
-
Mauro Bisson authored
* Added a new CSR array, psi_row_index, containing "ho" values sorted in descending order of CSR row length; this is used to process (ho, wo) points corresponding to longer rows before shorter ones, improving overlap and reducing the tail effect.
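A hedged sketch of how such an indirection array can be used; apart from psi_row_index, the CSR field names and the one-warp-per-block launch are assumptions for illustration:

```cuda
#include <cstdint>

// Illustrative CSR kernel launched with one warp (32 threads) per block.
// Rows are visited through psi_row_index, which lists the "ho" values in
// descending order of row length, so the longest rows are scheduled first
// and the short ones fill in at the end, shrinking the tail effect.
__global__ void csr_weighted_sum(const int64_t* __restrict__ row_ptr,
                                 const int64_t* __restrict__ col_idx,
                                 const float*   __restrict__ vals,
                                 const int64_t* __restrict__ psi_row_index,
                                 const float*   __restrict__ x,
                                 float*         __restrict__ y,
                                 int64_t num_rows) {
    for (int64_t r = blockIdx.x; r < num_rows; r += gridDim.x) {
        int64_t row = psi_row_index[r];              // longest rows first
        int64_t beg = row_ptr[row];
        int64_t end = row_ptr[row + 1];

        float acc = 0.0f;
        for (int64_t j = beg + threadIdx.x; j < end; j += 32)
            acc += vals[j] * x[col_idx[j]];

        // Warp shuffle reduction of the per-lane partial sums.
        for (int offset = 16; offset > 0; offset >>= 1)
            acc += __shfl_down_sync(0xffffffff, acc, offset);

        if (threadIdx.x == 0) y[row] = acc;
    }
}
```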
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200); see the sketch after this list.
* Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses.
* Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
* Added runtime dispatch logic for kernel specialization.
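A rough sketch of a custom permutation kernel of the kind mentioned in the first bullet, here for a (B, C, N) -> (B, N, C) permutation staged through a shared-memory tile; the shapes and names are illustrative, not the actual kernels:

```cuda
#include <cstdint>

#define TILE 32

// Permute a (B, C, N) tensor to (B, N, C) with coalesced reads and writes.
// A shared-memory tile stages the data so that both the global load and
// the global store are contiguous along threadIdx.x.
__global__ void permute_bcn_to_bnc(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int64_t C, int64_t N) {
    __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

    int64_t b  = blockIdx.z;
    int64_t c0 = (int64_t)blockIdx.y * TILE;
    int64_t n0 = (int64_t)blockIdx.x * TILE;

    int64_t c = c0 + threadIdx.y;
    int64_t n = n0 + threadIdx.x;
    if (c < C && n < N)
        tile[threadIdx.y][threadIdx.x] = in[(b * C + c) * N + n];
    __syncthreads();

    // Write transposed: threadIdx.x now runs along the channel dimension.
    c = c0 + threadIdx.x;
    n = n0 + threadIdx.y;
    if (c < C && n < N)
        out[(b * N + n) * C + c] = tile[threadIdx.x][threadIdx.y];
}
// Launch: block = dim3(TILE, TILE); grid = dim3((N+TILE-1)/TILE, (C+TILE-1)/TILE, B);
```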
- 18 Jun, 2025 1 commit
-
-
Max Rietmann authored
-
- 17 Jun, 2025 2 commits
-
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
- 16 Jun, 2025 2 commits
-
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This avoids one loop and improves performance by about 20%.
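One way this can look, assuming (for illustration only) that the forward pass stores, per output point, the maximum of q·k over the keys (qdotk_max) together with the softmax denominator, so the backward kernel does not need a separate max-finding loop; the function and parameter names below are hypothetical:

```cuda
#include <math.h>

// Assumption for illustration: qdotk_max and the softmax denominator were
// saved by the forward pass. Reusing them lets the attention weight be
// rebuilt in a single pass over the channels, without an extra loop to
// find the maximum again.
__device__ float attention_weight(const float* __restrict__ q,
                                  const float* __restrict__ k,
                                  int channels,
                                  float qdotk_max,   // saved by the forward pass
                                  float denom) {     // saved softmax sum
    float qdotk = 0.0f;
    for (int c = 0; c < channels; ++c)
        qdotk += q[c] * k[c];
    // Subtracting the saved maximum keeps expf() numerically safe.
    return expf(qdotk - qdotk_max) / denom;
}
```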
-
- 13 Jun, 2025 2 commits
-
-
Thorsten Kurth authored
-
Max Rietmann authored
-
- 11 Jun, 2025 3 commits
-
-
Max Rietmann authored
-
Max Rietmann authored
Also: Made fwd kernel use modified memory layout with standard shape
-
Max Rietmann authored
Also match the memory layout of the gradient output to that of the input.
-
- 06 Jun, 2025 1 commit
-
-
Max Rietmann authored
Detect the memory layout of (B,C,H,W) tensors (the stride of C should be 1; if not, fix it). This ensures that the backward kernel is fast.
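A hedged host-side sketch of such a check using the libtorch C++ API; the helper name is hypothetical and the actual logic in the repository may differ:

```cpp
#include <torch/torch.h>

// The fast kernels expect the channel dimension of a (B, C, H, W) tensor
// to have stride 1 (channels fastest-varying in memory). If it does not,
// permute to (B, H, W, C), materialize that layout, and view it back so
// the kernel sees coalesced channel reads.
torch::Tensor ensure_channel_stride_one(torch::Tensor x) {
    TORCH_CHECK(x.dim() == 4, "expected a (B, C, H, W) tensor");
    if (x.stride(1) == 1) return x;  // channels already contiguous
    return x.permute({0, 2, 3, 1}).contiguous().permute({0, 3, 1, 2});
}
```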
-
- 04 Jun, 2025 1 commit
-
-
Max Rietmann authored
Putting qy in shared memory is a little faster. Changing the internal memory layout means we can leave the code in standard shape and only change the layout external to the kernel.
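A minimal sketch of the shared-memory idea, assuming one block per output point; the kernel, names, and layout are illustrative rather than the actual attention kernel:

```cuda
#include <cstdint>

// Illustrative kernel: one block per output point. The output's query
// vector ("qy") is staged in shared memory once and then reused for every
// key, instead of being re-read from global memory in the inner loop.
__global__ void scores_one_output(const float* __restrict__ q,   // (num_out, channels)
                                  const float* __restrict__ k,   // (num_keys, channels)
                                  float* __restrict__ scores,    // (num_out, num_keys)
                                  int channels, int num_keys) {
    extern __shared__ float qy_s[];                  // `channels` floats
    const int64_t out = blockIdx.x;

    // Cooperative load of this block's query vector into shared memory.
    for (int c = threadIdx.x; c < channels; c += blockDim.x)
        qy_s[c] = q[out * channels + c];
    __syncthreads();

    // Each thread handles a subset of keys, reusing the shared query.
    for (int key = threadIdx.x; key < num_keys; key += blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < channels; ++c)
            acc += qy_s[c] * k[(int64_t)key * channels + c];
        scores[out * (int64_t)num_keys + key] = acc;
    }
}
// Launch: scores_one_output<<<num_out, 128, channels * sizeof(float)>>>(q, k, scores, channels, num_keys);
```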
-
- 02 Jun, 2025 1 commit
-
-
Max Rietmann authored
Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and `s2_attention_kernel_mbT`, for more efficient computation of backward gradients and forward attention, respectively. These changes optimize memory access patterns and employ coalesced operations by leveraging tensor transpositions.

Forward kernel written by Mauro Bisson. Backward kernel written by Andrea Paris (aparis@ethz.ch) and Max Rietmann.

The parallelization strategy computes one output per warp, with the threads of the warp computing the dot product in parallel. Because the inputs are transposed to have the channel dimension last, the dot-product memory access pattern is perfectly coalesced, leading to excellent performance in both the forward and backward kernels.

Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
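A hedged sketch of the warp-per-output parallelization with channels-last (stride-1) inputs; this is a toy dot-product kernel, not the actual s2_attention kernels, but it shows why the access pattern is coalesced:

```cuda
#include <cstdint>

// One warp computes one output value: the 32 lanes stride over the
// channel dimension, and because channels are last (stride 1) the loads
// from q and k are coalesced. A shuffle reduction combines the partials.
__global__ void warp_dot_products(const float* __restrict__ q,   // (num_out, channels)
                                  const float* __restrict__ k,   // (num_out, channels)
                                  float* __restrict__ out,       // (num_out)
                                  int64_t num_out, int channels) {
    int warps_per_block = blockDim.x / 32;
    int64_t warp = (int64_t)blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (warp >= num_out) return;

    // Consecutive lanes read consecutive channels -> fully coalesced.
    float acc = 0.0f;
    for (int c = lane; c < channels; c += 32)
        acc += q[warp * channels + c] * k[warp * channels + c];

    // Tree reduction across the warp using register shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);

    if (lane == 0) out[warp] = acc;
}
// Launch with e.g. 256 threads/block: warp_dot_products<<<(num_out * 32 + 255) / 256, 256>>>(...)
```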
-
- 26 May, 2025 1 commit
-
-
Thorsten Kurth authored
-
- 24 May, 2025 5 commits
-
-
Boris Bonev authored
Fixing a bug in the quadrature weights for full attention. Adding better unit tests for attention. Cleanup in the CUDA code.
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-
- 08 May, 2025 1 commit
-
-
Thorsten Kurth authored
* setting imaginary parts of DCT and nyquist frequency to zero in IRSHT variants
-
- 29 Apr, 2025 2 commits
-
-
Boris Bonev authored
This reverts commit 82881276.
-
Thorsten Kurth authored
* setting imaginary parts of DCT and nyquist frequency to zero in IRSHT variants
* small fix
* making einsum result contiguous
* adding zero frequency to distributed sht
-
- 26 Feb, 2025 1 commit
-
-
Thorsten Kurth authored
* small hotfix for lobatto grid precomputation routine
* adding lobatto grid to tests
-
- 21 Feb, 2025 1 commit
-
-
Thorsten Kurth authored
* adding caching
* replacing many numpy calls with torch calls
* bumping up version number to 0.7.6
-
- 21 Jan, 2025 1 commit
-
-
Boris Bonev authored
* Improved computation of Morlet filter basis and switched to a Hann window.
* addresses #064 and some cleanup
-