Commit 9a463332 (authored by Mauro Bisson)
    Optimized the BWD kernel, applying the same changes made to the FWD kernel in commit 8cb399ee:
    * Replaced PyTorch's slow permutation.
    * Split the kernel into general and specialized versions (the specialized path handles num_channel <= 8192).
    * Enabled float4-based vectorized memory access when possible.
    * Added runtime dispatch logic for kernel specialization (see the sketch below).
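    The snippet below is a minimal, hypothetical sketch of that pattern, not the actual kernels from this repository: a general scalar kernel, a float4-vectorized variant, and a host-side launcher that dispatches to the vectorized path only when the element count and pointer alignment allow 16-byte accesses. All identifiers (scale_grad_general, scale_grad_vec4, launch_scale_grad) are placeholders, and the real specialization on num_channel <= 8192 involves more than what is shown here.

    #include <cuda_runtime.h>
    #include <cstdint>

    // General fallback: scalar accesses, valid for any size and alignment.
    __global__ void scale_grad_general(const float* __restrict__ grad_in,
                                       float* __restrict__ grad_out,
                                       float scale, long long n) {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i < n) grad_out[i] = grad_in[i] * scale;
    }

    // Specialized variant: each thread moves 16 bytes per access via float4;
    // requires n % 4 == 0 and 16-byte-aligned pointers.
    __global__ void scale_grad_vec4(const float4* __restrict__ grad_in,
                                    float4* __restrict__ grad_out,
                                    float scale, long long n4) {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = grad_in[i];
            v.x *= scale; v.y *= scale; v.z *= scale; v.w *= scale;
            grad_out[i] = v;
        }
    }

    // Host-side runtime dispatch: pick the float4 path when possible,
    // otherwise fall back to the general kernel.
    void launch_scale_grad(const float* grad_in, float* grad_out,
                           float scale, long long n, cudaStream_t stream) {
        const int block = 256;
        const bool aligned16 =
            reinterpret_cast<std::uintptr_t>(grad_in) % 16 == 0 &&
            reinterpret_cast<std::uintptr_t>(grad_out) % 16 == 0;
        if (aligned16 && n % 4 == 0) {
            const long long n4 = n / 4;
            const unsigned int grid = (unsigned int)((n4 + block - 1) / block);
            scale_grad_vec4<<<grid, block, 0, stream>>>(
                reinterpret_cast<const float4*>(grad_in),
                reinterpret_cast<float4*>(grad_out), scale, n4);
        } else {
            const unsigned int grid = (unsigned int)((n + block - 1) / block);
            scale_grad_general<<<grid, block, 0, stream>>>(grad_in, grad_out,
                                                           scale, n);
        }
    }

    The same host-side dispatch shape extends naturally to choosing between a general kernel and a num_channel-specialized one at runtime.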
    
    Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.
    
    Extracted shared host/device functions and declarations into a separate module (sketched after the list):
    * attention_utils.cuh
    * attention_utils.cu
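    As a rough illustration only (the actual contents of the module are not shown in this commit), such a split typically keeps shared device helpers inline in the header and host-side helpers in the .cu file. The names attn_ceil_div and attn_fma below are hypothetical placeholders:

    // attention_utils.cuh
    #pragma once
    #include <cuda_runtime.h>

    // Host-side launch-configuration helper shared by the FWD and BWD launchers.
    int attn_ceil_div(int a, int b);

    // Device helper shared by the FWD and BWD kernels; defined inline in the
    // header so each translation unit gets its own copy (no -rdc needed).
    __device__ __forceinline__ float attn_fma(float a, float b, float c) {
        return fmaf(a, b, c);
    }

    // attention_utils.cu
    #include "attention_utils.cuh"

    int attn_ceil_div(int a, int b) { return (a + b - 1) / b; }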