[PERF] Speed-up of GDN attention decode part (Qwen3-Next) (#31722)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Commit 22dffca9 (unverified signature), authored Jan 06, 2026 by Vadim Gimpelson, committed by GitHub Jan 06, 2026

Showing with 1 addition and 1 deletion

vllm/model_executor/layers/fla/ops/fused_recurrent.py +1 -1

--- a/vllm/model_executor/layers/fla/ops/fused_recurrent.py
+++ b/vllm/model_executor/layers/fla/ops/fused_recurrent.py
@@ -189,7 +189,7 @@ def fused_recurrent_gated_delta_rule_fwd(
    B, T, H, K, V = *k.shape, v.shape[-1]
    HV = v.shape[2]
    N = B if cu_seqlens is None else len(cu_seqlens) - 1
-    BK, BV = triton.next_power_of_2(K), min(triton.next_power_of_2(V), 8)
+    BK, BV = triton.next_power_of_2(K), min(triton.next_power_of_2(V), 32)
    NK, NV = triton.cdiv(K, BK), triton.cdiv(V, BV)
    assert NK == 1, "NK > 1 is not supported yet"
    num_stages = 3
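
The effect of raising the `BV` cap from 8 to 32 can be illustrated with a small sketch of the block arithmetic in the hunk above. It uses pure-Python stand-ins for `triton.next_power_of_2` and `triton.cdiv` so it runs without Triton installed. The helper name `block_config` and the head dimensions `K = V = 128` are assumptions chosen for illustration; they are not taken from the commit. The sketch only shows how `NV` changes, and does not measure performance.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (stand-in for triton.next_power_of_2)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def cdiv(a: int, b: int) -> int:
    """Ceiling division (stand-in for triton.cdiv)."""
    return (a + b - 1) // b


def block_config(K: int, V: int, bv_cap: int) -> tuple[int, int, int, int]:
    """Hypothetical helper mirroring the BK/BV/NK/NV lines of the diff."""
    BK = next_power_of_2(K)
    BV = min(next_power_of_2(V), bv_cap)
    NK, NV = cdiv(K, BK), cdiv(V, BV)
    assert NK == 1, "NK > 1 is not supported yet"
    return BK, BV, NK, NV


# Assumed head dimensions, for illustration only.
K, V = 128, 128

before = block_config(K, V, bv_cap=8)   # cap used before this commit
after = block_config(K, V, bv_cap=32)   # cap used after this commit

print("before (BK, BV, NK, NV):", before)
print("after  (BK, BV, NK, NV):", after)
```

Under these assumed dimensions, `NV` drops from 16 to 4, so each program instance covers a wider slice of the value dimension and fewer blocks are needed along `V`. How that translates into decode speed depends on the kernel and hardware, which this sketch does not model.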