- Optimized dispatch: HAVE_TOPK_LENGTH=true and attn_sink enabled.
- Non-target D512/H64 combinations remain on the generic KernelTemplate path, because routing every D512/H64 case through the H64 fast path risked slowdowns in measurement.
Implementation
- Extended KernelTemplate_B_H_64 so the QK pipeline supports D_QK=512 with a 16-chunk q/k schedule, in addition to the existing 18-chunk schedule that only handled D_QK=576.
- Avoided index LDS prefetch overhead when IS_TOPK_2048 is false.
- Added attn_sink output scaling for the D512/H64 topk_length path.
- Added KernelTemplate_D512_H64_TopkLen_AttnSink wrapper and dispatch for D_QK=512 && HAVE_TOPK_LENGTH && h_q=64 && attn_sink.
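The chunk counts and the dispatch condition above can be sketched as follows. This is a minimal host-side illustration, not the committed kernel code. The 32-element chunk width is an assumption inferred from 576 / 18 = 512 / 16 = 32, and `DispatchParams`, `KernelPath`, `num_qk_chunks`, and `select_path` are hypothetical names introduced only for this example.

```cpp
// Assumption: each q/k chunk spans 32 elements of the QK head dimension,
// inferred from the two schedules in this PR (576 / 18 == 512 / 16 == 32).
constexpr int kChunkWidth = 32;

// Number of q/k chunks for a given D_QK under the assumed chunk width.
constexpr int num_qk_chunks(int d_qk) { return d_qk / kChunkWidth; }

static_assert(num_qk_chunks(576) == 18, "existing D_QK=576 schedule");
static_assert(num_qk_chunks(512) == 16, "new D_QK=512 schedule");

// Hypothetical labels for the two paths discussed here. The existing
// D_QK=576 H64 fast path is left out to keep the sketch small.
enum class KernelPath { Generic, D512H64TopkLenAttnSink };

// Hypothetical bundle of the fields the dispatch condition reads.
struct DispatchParams {
    int d_qk;               // QK head dimension
    int h_q;                // number of query heads
    bool have_topk_length;  // mirrors HAVE_TOPK_LENGTH
    bool attn_sink;         // true when attn_sink is enabled
};

// Take the new fast path only when all four conditions hold;
// every other D512/H64 combination stays on the generic path.
constexpr KernelPath select_path(const DispatchParams& p) {
    return (p.d_qk == 512 && p.h_q == 64 && p.have_topk_length && p.attn_sink)
               ? KernelPath::D512H64TopkLenAttnSink
               : KernelPath::Generic;
}
```

Keeping the predicate as a single conjunction makes the narrow scope explicit: turning off any one condition, such as `attn_sink`, falls back to the generic path.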
The baseline values are from the same target benchmark before enabling this fast path. The optimized run also checked correctness for every measured target row.
- A broader D512/H64 dispatch was tested for HAVE_TOPK_LENGTH=false, with attn_sink both enabled and disabled. It gave only small, noisy gains in some rows and could regress the existing performance path, so the committed dispatch is limited to the target slow path, which clears the 30% improvement requirement.
- The test environment prints NumPy 2.x compatibility warnings when PyTorch is imported. They did not prevent the correctness checks or the benchmark from running.