Optimized FWD kernel: custom permutations, reduced gmem accesses, vectorized access
* Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
* Split the kernel into a general version and a specialized version for `num_channel <= 16384`, significantly reducing memory accesses.
* Enabled `float4`-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
* Added runtime dispatch logic to select the appropriate kernel specialization.