- 16 Jun, 2025 2 commits
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This removes one loop and improves performance by about 20%.
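Below is a minimal sketch (not the library's actual `s2_attention_bwd_dkvq_kernel_mbT`) of how a cached per-query maximum logit enables this; the cached denominator `denom`, the flat indexing, and the one-block-per-query launch are assumptions made for brevity.

```cuda
#include <cuda_runtime.h>

// Sketch only: with qdotk_max[i] (and the softmax denominator denom[i]) cached
// from the forward pass, the backward loop over keys can rebuild the attention
// weight alpha_ij on the fly, so no separate max-finding pass is needed.
// Quadrature weights, batching and heads are omitted; the per-key dot products
// are recomputed serially here, whereas a real kernel computes them cooperatively.
__global__ void attn_bwd_dq_sketch(const float* __restrict__ q,          // [nq][C]
                                   const float* __restrict__ k,          // [nk][C]
                                   const float* __restrict__ v,          // [nk][C]
                                   const float* __restrict__ y,          // [nq][C] forward output
                                   const float* __restrict__ dout,       // [nq][C] upstream gradient
                                   const float* __restrict__ qdotk_max,  // [nq] cached max of q.k
                                   const float* __restrict__ denom,      // [nq] cached softmax denominator
                                   float* __restrict__ dq,               // [nq][C]
                                   int nq, int nk, int C)
{
    int i = blockIdx.x;    // one query point per block
    int c = threadIdx.x;   // one channel per thread (assumes C <= blockDim.x)
    if (i >= nq || c >= C) return;

    float acc = 0.0f;
    for (int j = 0; j < nk; ++j) {
        float s = 0.0f, g = 0.0f;
        for (int l = 0; l < C; ++l) {
            s += q[i * C + l] * k[j * C + l];                      // logit q.k_j
            g += dout[i * C + l] * (v[j * C + l] - y[i * C + l]);  // dL/d(logit_j)
        }
        float alpha = __expf(s - qdotk_max[i]) / denom[i];  // softmax weight, no extra max loop
        acc += alpha * g * k[j * C + c];                    // d(logit_j)/dq = k_j
    }
    dq[i * C + c] = acc;
}
```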
-
- 13 Jun, 2025 2 commits
-
Thorsten Kurth authored
-
Max Rietmann authored
-
- 11 Jun, 2025 3 commits
-
Max Rietmann authored
-
Max Rietmann authored
Also: made the forward kernel use the modified memory layout while keeping the standard tensor shape.
-
Max Rietmann authored
Also: match the memory layout of the gradient output to that of the input.
-
- 06 Jun, 2025 1 commit
-
Max Rietmann authored
Detect the memory layout of (B, C, H, W) inputs: the stride of the channel dimension C should be 1; if it is not, fix it. This ensures that the backward kernel stays fast.
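A hedged sketch of that kind of host-side check, written against the PyTorch C++ extension API; the function name and where it gets called are assumptions, not the library's actual interface.

```cuda
#include <torch/extension.h>

// Sketch: ensure a (B, C, H, W) tensor has stride 1 along the channel
// dimension before it is handed to the CUDA kernels; convert only if needed.
static at::Tensor ensure_channel_stride_one(at::Tensor x)
{
    TORCH_CHECK(x.dim() == 4, "expected a (B, C, H, W) tensor");
    if (x.stride(1) != 1) {
        // Permute to (B, H, W, C), materialize contiguously, then view back as
        // (B, C, H, W); the data now has stride 1 along C. Equivalent to
        // x.contiguous(at::MemoryFormat::ChannelsLast) for 4D tensors.
        x = x.permute({0, 2, 3, 1}).contiguous().permute({0, 3, 1, 2});
    }
    return x;
}
```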
-
- 04 Jun, 2025 1 commit
-
Max Rietmann authored
Putting qy in shared memory is a little faster. Changing the internal memory layout means the kernel code can stay in the standard shape, with the layout change handled outside the kernel.
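A rough sketch of the shared-memory staging, assuming one output point per block, a single warp per block, and channel-last inputs; the name `qy` and the score-only computation are illustrative.

```cuda
#include <cuda_runtime.h>

// Sketch only: the query channels of this block's output point are loaded into
// shared memory once and reused for every key, instead of being re-read from
// global memory inside the key loop. Launch with blockDim.x == 32 (one warp)
// and C * sizeof(float) bytes of dynamic shared memory.
__global__ void attn_scores_shared_q_sketch(const float* __restrict__ q, // [nq][C], channel stride 1
                                            const float* __restrict__ k, // [nk][C]
                                            float* __restrict__ scores,  // [nq][nk]
                                            int nq, int nk, int C)
{
    extern __shared__ float qy[];   // C floats holding this block's query point
    int i = blockIdx.x;
    if (i >= nq) return;

    // stage q_i in shared memory once (coalesced thanks to channel stride 1)
    for (int c = threadIdx.x; c < C; c += blockDim.x)
        qy[c] = q[(size_t)i * C + c];
    __syncthreads();

    // all lanes cooperate on one q.k_j at a time, reading q from shared memory
    for (int j = 0; j < nk; ++j) {
        float part = 0.0f;
        for (int c = threadIdx.x; c < C; c += blockDim.x)
            part += qy[c] * k[(size_t)j * C + c];
        for (int off = 16; off > 0; off >>= 1)   // warp reduction of the partial sums
            part += __shfl_down_sync(0xffffffffu, part, off);
        if (threadIdx.x == 0)
            scores[(size_t)i * nk + j] = part;
    }
}
```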
-
- 02 Jun, 2025 1 commit
-
Max Rietmann authored
Introduce new CUDA kernels, `s2_attention_bwd_dkvq_kernel_mbT` and `s2_attention_kernel_mbT`, for more efficient computation of backward gradients and forward attention, respectively. These changes optimize memory access patterns and employ coalesced operations by leveraging tensor transpositions.

The forward kernel was written by Mauro Bisson; the backward kernel by Andrea Paris (aparis@ethz.ch) and Max Rietmann.

The parallelization strategy computes one output per warp, with the warp's threads computing the dot product in parallel. Because the inputs are transposed so that the channel dimension is last, the dot-product memory access pattern is perfectly coalesced, which leads to excellent performance in both the forward and backward kernels (a simplified sketch follows below).

Co-authored-by: Mauro Bisson <maurob@nvidia.com>
Co-authored-by: Max Rietmann <mrietmann@nvidia.com>
Co-authored-by: Andrea Paris <aparis@ethz.ch>
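A simplified sketch of this warp-per-output strategy, not the actual `s2_attention_kernel_mbT`: batching, multiple heads, and quadrature weights are omitted, and C <= 32 is assumed so the value accumulator fits in one register per lane.

```cuda
#include <cuda_runtime.h>

// Sketch: one warp produces one output point. For each key, the 32 lanes
// stride over the trailing channel dimension, so the q and k reads are
// coalesced; a shuffle reduction yields the logit, and the value accumulation
// uses the usual running-max rescaling for a numerically stable softmax.
__global__ void s2_attention_fwd_sketch(const float* __restrict__ q, // [nq][C], channel stride 1
                                        const float* __restrict__ k, // [nk][C]
                                        const float* __restrict__ v, // [nk][C]
                                        float* __restrict__ y,       // [nq][C]
                                        int nq, int nk, int C)
{
    int i    = blockIdx.x;   // one output (query) point per warp
    int lane = threadIdx.x;  // blockDim.x == 32 assumed
    if (i >= nq) return;

    float qdotk_max = -1e30f;  // running maximum of the logits
    float denom     = 0.0f;    // running softmax denominator
    float acc       = 0.0f;    // running numerator for channel `lane` (needs C <= 32)

    for (int j = 0; j < nk; ++j) {
        // coalesced, lane-strided dot product q_i . k_j
        float part = 0.0f;
        for (int c = lane; c < C; c += 32)
            part += q[(size_t)i * C + c] * k[(size_t)j * C + c];
        for (int off = 16; off > 0; off >>= 1)
            part += __shfl_down_sync(0xffffffffu, part, off);
        float logit = __shfl_sync(0xffffffffu, part, 0);  // broadcast lane 0's sum

        // online softmax: rescale the running sums when a new maximum appears
        float new_max = fmaxf(qdotk_max, logit);
        float scale   = __expf(qdotk_max - new_max);
        float w       = __expf(logit - new_max);
        denom = denom * scale + w;
        if (lane < C)
            acc = acc * scale + w * v[(size_t)j * C + lane];
        qdotk_max = new_max;
    }
    if (lane < C)
        y[(size_t)i * C + lane] = acc / denom;
}
```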
-
- 24 May, 2025 4 commits
-
Boris Bonev authored
Fixing a bug in the quadrature weights for full attention. Adding better unit tests for attention. Cleanup in the CUDA code.
-
Boris Bonev authored
-
Boris Bonev authored
-
Boris Bonev authored
-
- 19 Aug, 2024 1 commit
-
Boris Bonev authored
* adding cuda kernels for disco conv
* making psi_idx an attribute
* adding license headers
* adding author files
* reorganizing files
* draft implementation
* added conditional installation to setup.py
* formatting changes
* removing triton kernel in DISCO convolution
* updated github actions
* updated Readme and changelog
* adding another guard for the cuda installation
* renaming the cuda extension
* simplifying setup.py
* minor bugfix
* Bbonev/cuda disco cleanup (#32)
* cleanup of disco convolutions based on CUDA extension
* fixing unittest
* changing version to experimental 0.7.0a
* initial rewrite of the distributed convolution with CUDA
* fixing streams
* need to fix install options
* fixing streams
* undid setup.py changes
* reset setup.py
* including CUDAStream
* adjusted the precomputation of theta_cutoff. If you rely on this, your models will not be backwards-compatible.
* adjusting theta_cutoff in the unittest
* adding newly refactored kernels for faster compile
* Tkurth/cuda disco distributed fix (#34)
* attempt to make disco distributed
* working distributed convolutions
* fixing distributed conv
* working distributed disco
* removing irrelevant extra argument
* using stream functions from at instead of c10
* using stream functions from at instead of c10, small fix
* Bbonev/disc even filters (#35)
* initial working commit with new convention of counting collocation points across the diameter instead of across the radius
* fixed a bug in the computation of the even kernels
* changing heuristic for computing theta_cutoff
* Fixing unittest
* Readability improvements
* reworked normalization of filter basis functions
* implemented discrete normalization of disco filters
* relaxing tolerances in convolution unit test
* bugfix to correctly support unequal scale factors in latitudes and longitudes
* hotfix to a bug in the imports
* Bbonev/distributed disco refactor (#37)
* cleaned up normalization code in convolution
* formatting changes in distributed convolution
* Fixing default theta_cutoff to be the same in distributed and local case
* fixed distributed convolution to support the same normalization as non-distributed one
* readability improvements
* fixed initial scale of convolution parameter weights and fixed naming of the normalization routine
* Updated Readme.md
* added comment in Dockerfile regarding older architectures

---------

Co-authored-by: Thorsten Kurth <tkurth@nvidia.com>
Co-authored-by: Boris Bonev <bbonev@nvidia.com>
-