- 07 Jul, 2025 1 commit
-
Thorsten Kurth authored
Use 64-bit for pointer offsets
-
- 04 Jul, 2025 2 commits
-
Mauro Bisson authored
Updated pointer offset calculations to use 64-bit integers to prevent overflow with large batch or image sizes.
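This is the classic 32-bit indexing problem: for a [batch, chans, height, width] tensor, the flat offset is a product of all four extents and wraps past 2^31 - 1 once the tensor is large enough. A minimal sketch of the fix, with illustrative names rather than the actual torch-harmonics kernel code:

```cuda
// Illustrative sketch, not the actual torch-harmonics code: flat element
// offset for a [batch, chans, height, width] tensor. With plain int
// arithmetic the intermediate products wrap past 2^31 - 1; promoting to
// 64 bits before the first multiplication avoids the overflow.
__device__ __forceinline__
long long flat_offset(int b, int c, int y, int x,
                      int chans, int height, int width) {
    return (((long long)b * chans + c) * height + y) * width + x;
}

// Grid-stride loop over a 64-bit element count, so huge tensors stay safe.
__global__ void scale_kernel(float* __restrict__ data, long long nelem,
                             float s) {
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < nelem; i += stride)
        data[i] *= s;
}
```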
-
Thorsten Kurth authored
Using torch tools to change the memory layout in the backward pass
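In other words, the layout change is expressed with torch's own ops rather than a custom kernel. A rough sketch of what this might look like with the torch C++ API (function names and the channels-last target layout are assumptions, not the actual implementation):

```cuda
#include <torch/torch.h>

// Illustrative sketch (not the actual torch-harmonics code): change the
// memory layout with torch's own permute/contiguous instead of a custom
// kernel, moving channels to the innermost dimension for the backward pass.
torch::Tensor to_channels_last(const torch::Tensor& x) {
    // x: [batch, chans, h, w] -> same logical data, [batch, h, w, chans] in memory
    return x.permute({0, 2, 3, 1}).contiguous();
}

torch::Tensor from_channels_last(const torch::Tensor& x) {
    // inverse: [batch, h, w, chans] -> [batch, chans, h, w]
    return x.permute({0, 3, 1, 2}).contiguous();
}
```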
-
- 03 Jul, 2025 4 commits
-
Max Rietmann authored
-
Max Rietmann authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Optimized forward kernel for attention
-
- 02 Jul, 2025 4 commits
-
Mauro Bisson authored
-
Mauro Bisson authored
* Added a new CSR array, psi_row_index, containing "ho" values sorted in descending order of CSR row length; this is used to process (ho, wo) points corresponding to longer rows before shorter ones, improving overlap and reducing the tail effect.
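A host-side sketch of how such an index can be built from standard CSR row pointers (illustrative; `build_psi_row_index` and its signature are assumptions, not the actual implementation):

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative host-side sketch: given CSR row pointers, produce row ids
// ("ho" values) ordered by descending row length, so the longest rows are
// scheduled first and the tail of short rows is absorbed by the remaining
// occupancy instead of running last.
std::vector<int> build_psi_row_index(const std::vector<int>& row_ptr) {
    const int nrows = (int)row_ptr.size() - 1;
    std::vector<int> idx(nrows);
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](int a, int b) {
        // row length = row_ptr[r + 1] - row_ptr[r]
        return row_ptr[a + 1] - row_ptr[a] > row_ptr[b + 1] - row_ptr[b];
    });
    return idx;  // idx[0] is the longest row, idx[nrows - 1] the shortest
}
```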
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
* Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses.
* Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
* Added runtime dispatch logic for kernel specialization.
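The vectorization and dispatch pattern from the last two bullets, sketched in isolation (kernel and function names are assumptions; the real permutation kernels do more than a straight copy):

```cuda
#include <cstdint>

// Illustrative sketch of the float4 fast path and its runtime dispatch.
__global__ void copy_vec4(const float4* __restrict__ src,
                          float4* __restrict__ dst, long long n4) {
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) dst[i] = src[i];  // one 128-bit load + store per thread
}

__global__ void copy_scalar(const float* __restrict__ src,
                            float* __restrict__ dst, long long n) {
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

static bool aligned16(const void* p) { return ((uintptr_t)p & 15) == 0; }

void dispatch_copy(const float* src, float* dst, long long n,
                   cudaStream_t stream) {
    const int threads = 256;
    // Vectorize only when both pointers are 16-byte aligned and the element
    // count splits evenly into float4 chunks; otherwise fall back to scalar.
    if (aligned16(src) && aligned16(dst) && n % 4 == 0) {
        const long long n4 = n / 4;
        copy_vec4<<<(unsigned)((n4 + threads - 1) / threads), threads, 0,
                    stream>>>((const float4*)src, (float4*)dst, n4);
    } else {
        copy_scalar<<<(unsigned)((n + threads - 1) / threads), threads, 0,
                      stream>>>(src, dst, n);
    }
}
```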
-
- 01 Jul, 2025 3 commits
-
Thorsten Kurth authored
Small fix in metric computation
-
Andrea Paris authored
-
Andrea Paris authored
-
- 18 Jun, 2025 3 commits
-
Thorsten Kurth authored
Optimize backward kernel: incremental updates of qdot_max, alpha, integral, etc.
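The "incremental" part is the online-softmax recurrence: instead of one loop to find max(q·k) and a second to accumulate the exponentials, a single loop maintains the running maximum and rescales the running sums whenever a new maximum appears. A sketch of the per-element update (illustrative; the actual kernel fuses this with the gradient computation):

```cuda
// Illustrative device-side sketch of the recurrence. m = running max of
// q·k, alpha = running sum of exp(q·k - m), integral = running weighted sum.
__device__ void online_softmax_step(float qdotk, float value,
                                    float& m, float& alpha, float& integral) {
    if (qdotk > m) {
        const float scale = expf(m - qdotk);  // rescale old accumulators
        alpha    *= scale;
        integral *= scale;
        m = qdotk;
    }
    const float w = expf(qdotk - m);
    alpha    += w;             // normalizer
    integral += w * value;     // weighted sum
}
// Initialize with m = -INFINITY, alpha = 0, integral = 0; after the loop,
// integral / alpha is the softmax-weighted result.
```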
-
Max Rietmann authored
-
Max Rietmann authored
-
- 17 Jun, 2025 6 commits
-
Thorsten Kurth authored
Adding -lineinfo to the optional debug flags
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
- 16 Jun, 2025 4 commits
-
Max Rietmann authored
-
Max Rietmann authored
-
Max Rietmann authored
-
Max Rietmann authored
Leverage the same qdotk_max "trick" for the backward kernel. This avoids one loop and improves performance by about 20%.
-
- 13 Jun, 2025 10 commits
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Fixing attention perf test (attempt 1)
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Optimized CUDA kernels for S2 Attention (forward and backward)
-
Thorsten Kurth authored
-
Thorsten Kurth authored
Merge branch 'mr/bwd-channel-permute-experiments' of https://github.com/rietmann-nv/torch-harmonics into mr/bwd-channel-permute-experiments
-
Max Rietmann authored
-
Thorsten Kurth authored
-
- 11 Jun, 2025 3 commits
-
Max Rietmann authored
-
Max Rietmann authored
Also: made the forward kernel use the modified memory layout with the standard shape
-
Max Rietmann authored
Also: match the memory layout of the gradient output to that of the input
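Concretely, matching layouts means the backward pass hands back a gradient with the same strides as the forward input, so callers never see a hidden permute. A sketch with the torch C++ API (illustrative; the actual code works at the kernel level):

```cuda
#include <torch/torch.h>

// Illustrative sketch: allocate the input gradient with the same sizes
// *and* strides as the forward input, so what backward returns is laid out
// exactly like what the caller passed in, with no extra copy afterwards.
torch::Tensor alloc_grad_like(const torch::Tensor& input) {
    return torch::empty_strided(input.sizes(), input.strides(),
                                input.options());
}
```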
-