16 Jul, 2025 (2 commits)
    • cleanup with contiguous checks · 45fc2a46
      Thorsten Kurth authored
    • Optimized the BWD kernel with the same changes made for the FWD kernel in commit 8cb399ee: · 9a463332
      Mauro Bisson authored
      * Replaced PyTorch's slow permutation.
      * Split the kernel into general and specialized versions (for num_channel <= 8192).
      * Enabled float4-based vectorized memory access where possible.
      * Added runtime dispatch logic for kernel specialization.
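
      The dispatch described above (a specialized kernel for num_channel <= 8192, upgraded to a float4-vectorized variant when memory layout permits) can be sketched as host-side selection logic. This is a minimal illustration, not the actual implementation: the variant names and the alignment check are assumptions; only the 8192 threshold comes from the commit message.

      ```cpp
      #include <cstdint>

      // Hypothetical variants; the real kernels live in attention_bwd_cuda.cu.
      enum class KernelVariant { GeneralScalar, SpecializedScalar, SpecializedVec4 };

      // float4 loads need 16-byte-aligned pointers and a channel count
      // divisible by 4 (assumed preconditions for the vectorized path).
      static bool can_use_float4(const float* ptr, int num_channel) {
          return reinterpret_cast<uintptr_t>(ptr) % 16 == 0 && num_channel % 4 == 0;
      }

      // Runtime dispatch: fall back to the general kernel for large channel
      // counts, otherwise prefer the vectorized specialization when legal.
      KernelVariant select_kernel(const float* data, int num_channel) {
          constexpr int kSpecializedMaxChannels = 8192;  // threshold from the commit message
          if (num_channel > kSpecializedMaxChannels)
              return KernelVariant::GeneralScalar;
          if (can_use_float4(data, num_channel))
              return KernelVariant::SpecializedVec4;
          return KernelVariant::SpecializedScalar;
      }
      ```

      In a real launcher each branch would invoke a different `__global__` kernel; keeping the decision on the host avoids per-thread branching on the device.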
      
      Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.
      
      Extracted shared host/device functions and declarations into a separate module:
      * attention_utils.cuh
      * attention_utils.cu