Commit 8cb399ee authored by Mauro Bisson

Optimized FWD kernel: custom permutations, gmem accesses reduction, vectorized access

* Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
* Split the kernel into a general version and a version specialized for num_channel <= 16384, significantly reducing global memory accesses.
* Enabled float4-based vectorized memory access when pointer alignment and channel count allow, improving throughput.
* Added runtime dispatch logic for kernel specialization.
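The dispatch described in the last three bullets could look roughly like the host-side sketch below. This is not the code from this commit: the enum, the function names, and the exact vectorization conditions (16-byte aligned base pointers, channel count divisible by 4, rows stored contiguously) are assumptions made for illustration. Only the 16384 threshold and the use of float4 come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All names here are hypothetical; the commit message does not show its dispatch code.
enum class FwdKernelVariant {
    General,          // any channel count, scalar loads/stores
    GeneralVec4,      // any channel count, float4 loads/stores
    Specialized,      // num_channel <= 16384, scalar loads/stores
    SpecializedVec4   // num_channel <= 16384, float4 loads/stores
};

// Threshold taken from the commit message.
constexpr std::int64_t kMaxSpecializedChannels = 16384;
// A float4 access moves 16 bytes and needs a 16-byte aligned address.
constexpr std::size_t kVec4Bytes = 4 * sizeof(float);

inline bool is_aligned_for_float4(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kVec4Bytes == 0;
}

// Assumption: rows are contiguous with a stride of num_channel floats, so an
// aligned base pointer plus a channel count divisible by 4 keeps every row aligned.
inline bool can_use_float4(const float* in, const float* out, std::int64_t num_channel) {
    return num_channel % 4 == 0 &&
           is_aligned_for_float4(in) &&
           is_aligned_for_float4(out);
}

// Pick a kernel variant at runtime from the channel count and pointer alignment.
inline FwdKernelVariant select_fwd_kernel(const float* in, const float* out,
                                          std::int64_t num_channel) {
    assert(num_channel > 0);
    const bool specialized = num_channel <= kMaxSpecializedChannels;
    const bool vec4 = can_use_float4(in, out, num_channel);
    if (specialized) {
        return vec4 ? FwdKernelVariant::SpecializedVec4 : FwdKernelVariant::Specialized;
    }
    return vec4 ? FwdKernelVariant::GeneralVec4 : FwdKernelVariant::General;
}
```

A caller would switch on the returned variant to launch the matching kernel instantiation. Falling back to the scalar variants whenever either condition fails keeps misaligned or odd-sized inputs correct at the cost of throughput.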
parent c485a1fb