- 21 Jul, 2025 16 commits
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - Thorsten Kurth authored: Tkurth/device fixes
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
- 17 Jul, 2025 1 commit
  - Thorsten Kurth authored: Attention Backward improvement
- 16 Jul, 2025 9 commits
  - Mauro Bisson authored: Renamed the template parameter to a simpler name (it's the number of warps per tile used in the permutation).
  - Andrea Paris authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Mauro Bisson authored
  - Mauro Bisson authored
  - Mauro Bisson authored
  - Mauro Bisson authored
  - Mauro Bisson authored:
    * Replaced PyTorch's slow permutation.
    * Split kernel into general and specialized versions (for num_channel <= 8192).
    * Enabled float4-based vectorized memory access, when possible (see the sketch below).
    * Added runtime dispatch logic for kernel specialization.
    * Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.
    * Extracted shared host/device functions and declarations into a separate module:
      * attention_utils.cuh
      * attention_utils.cu
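The float4 vectorization and runtime dispatch mentioned in the commit above follow a common CUDA pattern. The sketch below is illustrative only, not the attention_fwd_cuda.cu code: the (B, H, N, C) to (B, N, H, C) permutation, the kernel names, and the launch shape are assumptions, and the point is only that the vectorized path is taken when the channel count is divisible by 4 and the base pointers are 16-byte aligned.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Scalar fallback: permute a contiguous (B, H, N, C) tensor to (B, N, H, C).
// Each block copies one row of C contiguous floats.
__global__ void permute_bhnc_to_bnhc_scalar(const float* __restrict__ in,
                                            float* __restrict__ out,
                                            int H, int N, int64_t C) {
    const int b = blockIdx.z, h = blockIdx.y, n = blockIdx.x;
    const float* src = in  + (((int64_t)b * H + h) * N + n) * C;
    float*       dst = out + (((int64_t)b * N + n) * H + h) * C;
    for (int64_t c = threadIdx.x; c < C; c += blockDim.x)
        dst[c] = src[c];
}

// Vectorized variant: requires C % 4 == 0 and 16-byte-aligned base pointers
// (cudaMalloc guarantees far stricter alignment), so every row starts on a
// float4 boundary and each thread moves 16 bytes per transaction.
__global__ void permute_bhnc_to_bnhc_vec4(const float4* __restrict__ in,
                                          float4* __restrict__ out,
                                          int H, int N, int64_t C4) {
    const int b = blockIdx.z, h = blockIdx.y, n = blockIdx.x;
    const float4* src = in  + (((int64_t)b * H + h) * N + n) * C4;
    float4*       dst = out + (((int64_t)b * N + n) * H + h) * C4;
    for (int64_t c = threadIdx.x; c < C4; c += blockDim.x)
        dst[c] = src[c];
}

// Host-side runtime dispatch: use the vectorized kernel only when channel
// count and pointer alignment allow it, otherwise fall back to the scalar one.
void permute_bhnc_to_bnhc(const float* in, float* out,
                          int B, int H, int N, int64_t C, cudaStream_t stream) {
    dim3 grid(N, H, B);
    const int block = 128;
    const bool aligned16 = (reinterpret_cast<uintptr_t>(in)  % 16 == 0) &&
                           (reinterpret_cast<uintptr_t>(out) % 16 == 0);
    if (aligned16 && (C % 4 == 0)) {
        permute_bhnc_to_bnhc_vec4<<<grid, block, 0, stream>>>(
            reinterpret_cast<const float4*>(in),
            reinterpret_cast<float4*>(out), H, N, C / 4);
    } else {
        permute_bhnc_to_bnhc_scalar<<<grid, block, 0, stream>>>(in, out, H, N, C);
    }
}
```

Replacing a generic permute-plus-contiguous call with a dedicated copy kernel like this keeps the inner loop a plain contiguous memcpy, which is presumably where the speedup over PyTorch's permutation comes from.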
- 14 Jul, 2025 1 commit
  - Thorsten Kurth authored:
    * removing duplicate code from distributed convolution
    * replacing from_numpy with as_tensor
    * making preprocess_psi_tensor GPU-ready
- 08 Jul, 2025 1 commit
  - Thorsten Kurth authored:
    * refactoring disco backend code
    * removed get_psi as a member function and instead put it in _disco_convolution
    * setting seeds in tests more consistently
    * parametrized test classes to ensure that tests are always run on both CPU and GPU (if available)
    * cleaning up
- 07 Jul, 2025 2 commits
  - Thorsten Kurth authored
  - Thorsten Kurth authored: Use 64-bit for pointer offsets
- 04 Jul, 2025 2 commits
  - Mauro Bisson authored: Updated pointer offset calculations to use 64-bit integers to prevent overflow with large batch or image sizes.
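For context on the commit above: the failure mode is a 32-bit offset product wrapping around once a tensor exceeds 2^31 elements. The kernel below is a hypothetical example (add_bias is not part of the library); it only shows where the promotion to int64_t has to happen.

```cuda
#include <cstdint>

// Hypothetical kernel adding a per-channel bias to a (batch, channels, height*width)
// tensor. The offset arithmetic is where 32-bit math breaks: for batch=64,
// channels=256, hw=1024*512 the element count is about 8.6e9 > 2^31, so an
// int offset silently wraps around and indexes the wrong memory.
__global__ void add_bias(float* __restrict__ x, const float* __restrict__ bias,
                         int channels, int64_t hw) {
    const int b = blockIdx.y;   // batch index
    const int c = blockIdx.x;   // channel index

    // Overflow-prone pattern being removed:
    //   float* row = x + (b * channels + c) * (int)hw;   // 32-bit product

    // Fixed pattern: promote to 64 bits *before* multiplying.
    float* row = x + ((int64_t)b * channels + c) * hw;

    const float bval = bias[c];
    for (int64_t i = threadIdx.x; i < hw; i += blockDim.x)
        row[i] += bval;
}
```

The cast has to happen before the multiplications; casting the already-overflowed 32-bit product to int64_t afterwards would not help.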
  - Thorsten Kurth authored: using torch tools to change layout in the bwd pass
- 03 Jul, 2025 4 commits
  - Max Rietmann authored
  - Max Rietmann authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored: Optimized forward kernel for attention
- 02 Jul, 2025 4 commits
  - Mauro Bisson authored
  - Mauro Bisson authored:
    * Added a new CSR array, psi_row_index, containing "ho" values sorted in descending order of CSR row length; this is used to process (ho, wo) points corresponding to longer rows before shorter ones, improving overlap and reducing the tail effect (see the sketch below).
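The row reordering in the commit above can be sketched with a generic warp-per-row CSR kernel. psi_row_index is the array named in the commit message; build_psi_row_index, spmv_reordered, and the warp-per-row mapping are assumptions made for this illustration, not the library's actual disco/attention kernels.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Host side: sort row ids by descending row length so that long rows are
// scheduled first and short rows fill in at the end, shrinking the tail
// phase where only a few blocks are still running.
std::vector<int> build_psi_row_index(const std::vector<int64_t>& row_ptr) {
    const int nrows = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> psi_row_index(nrows);
    std::iota(psi_row_index.begin(), psi_row_index.end(), 0);
    std::stable_sort(psi_row_index.begin(), psi_row_index.end(),
                     [&](int a, int b) {
                         return (row_ptr[a + 1] - row_ptr[a]) >
                                (row_ptr[b + 1] - row_ptr[b]);   // longest first
                     });
    return psi_row_index;
}

// Device side: one warp per CSR row, but the row each warp works on is taken
// through psi_row_index instead of using the warp id directly.
__global__ void spmv_reordered(const int* __restrict__ psi_row_index,
                               const int64_t* __restrict__ row_ptr,
                               const int* __restrict__ col_idx,
                               const float* __restrict__ vals,
                               const float* __restrict__ x,
                               float* __restrict__ y, int nrows) {
    const int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane = threadIdx.x % 32;
    if (warp >= nrows) return;
    const int row = psi_row_index[warp];            // longest rows start first
    float acc = 0.f;
    for (int64_t j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
        acc += vals[j] * x[col_idx[j]];
    for (int off = 16; off > 0; off >>= 1)          // warp-level reduction
        acc += __shfl_down_sync(0xffffffffu, acc, off);
    if (lane == 0) y[row] = acc;
}
```

Only the visiting order changes, not the warp-to-row mapping itself, so the result is identical to the unsorted version; the benefit is better overlap between long and short rows near the end of the launch.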
  - Mauro Bisson authored:
    * Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
    * Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses (see the dispatch sketch below).
    * Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
    * Added runtime dispatch logic for kernel specialization.
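A float4 sketch already appears after the 16 Jul section, so the example here focuses on the other half of the commit above: a general/specialized kernel split selected at run time. The 16384 threshold is quoted from the commit message; the row-scaling kernels, their names, and the one-block-per-row mapping are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

constexpr int64_t kSpecializedMaxChan = 16384;   // threshold quoted in the commit message

// General version: grid-stride loop over all elements, valid for any num_channel.
__global__ void scale_rows_general(float* __restrict__ data, int64_t nrows,
                                   int64_t nchan, float alpha) {
    const int64_t total = nrows * nchan;
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total; i += (int64_t)gridDim.x * blockDim.x)
        data[i] *= alpha;
}

// Specialized version: one block per row, reasonable only when a row
// (num_channel elements) is short enough for a single block to cover efficiently.
__global__ void scale_rows_specialized(float* __restrict__ data, int64_t nchan,
                                       float alpha) {
    float* row = data + (int64_t)blockIdx.x * nchan;
    for (int64_t c = threadIdx.x; c < nchan; c += blockDim.x)
        row[c] *= alpha;
}

// Runtime dispatch: num_channel is only known when the op is called, so the
// choice between the two kernels cannot be made at compile time.
void scale_rows(float* data, int64_t nrows, int64_t nchan, float alpha,
                cudaStream_t stream) {
    const int block = 256;
    if (nchan <= kSpecializedMaxChan) {
        scale_rows_specialized<<<static_cast<unsigned int>(nrows), block, 0, stream>>>(
            data, nchan, alpha);
    } else {
        scale_rows_general<<<512, block, 0, stream>>>(data, nrows, nchan, alpha);
    }
}
```

In the real kernels the specialized variant presumably exploits the bounded channel count more aggressively (the commit cites reduced memory accesses); the sketch only shows the dispatch mechanics.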