[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) (#24667)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>

Commit 89da8d9d authored Sep 13, 2025 by Tao He Committed by simon-mo Sep 12, 2025

Showing with 2 additions and 1 deletion

vllm/v1/attention/backends/gdn_attn.py +2 -1

--- a/vllm/v1/attention/backends/gdn_attn.py
+++ b/vllm/v1/attention/backends/gdn_attn.py
@@ -209,7 +209,8 @@ class GDNAttentionMetadataBuilder(
        # prepare tensors for cudagraph
        if (self.use_full_cuda_graph and num_prefills == 0 and num_decodes == 0
-                and num_spec_decodes <= self.decode_cudagraph_max_bs):
+                and num_spec_decodes <= self.decode_cudagraph_max_bs
+                and m.num_actual_tokens <= self.decode_cudagraph_max_bs):
            num_total_tokens = self.vllm_config.pad_for_cudagraph(
                m.num_actual_tokens)
            batch_size = num_total_tokens // (self.num_spec + 1)
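
Note: the sketch below illustrates why the extra check may matter. It is not the vLLM implementation. `BatchCounts`, `old_condition`, and `new_condition` are hypothetical names, and the numbers are made up. It assumes, based on the `num_total_tokens // (self.num_spec + 1)` line above, that each speculative-decode request can contribute up to `num_spec + 1` tokens, so the token count can exceed the limit even when the request count does not.

```python
from dataclasses import dataclass


@dataclass
class BatchCounts:
    """Hypothetical stand-in for the counts the metadata builder inspects."""
    num_prefills: int
    num_decodes: int
    num_spec_decodes: int
    num_actual_tokens: int


def old_condition(b: BatchCounts, use_full_cuda_graph: bool, max_bs: int) -> bool:
    # Condition before this commit: only the request count is bounded.
    return (use_full_cuda_graph and b.num_prefills == 0
            and b.num_decodes == 0 and b.num_spec_decodes <= max_bs)


def new_condition(b: BatchCounts, use_full_cuda_graph: bool, max_bs: int) -> bool:
    # Condition after this commit: the token count is bounded as well.
    return (old_condition(b, use_full_cuda_graph, max_bs)
            and b.num_actual_tokens <= max_bs)


# Illustrative values: 100 spec-decode requests with num_spec = 3,
# i.e. 100 * (3 + 1) = 400 tokens, against an assumed limit of 256.
batch = BatchCounts(num_prefills=0, num_decodes=0,
                    num_spec_decodes=100, num_actual_tokens=400)
old_result = old_condition(batch, use_full_cuda_graph=True, max_bs=256)
new_result = new_condition(batch, use_full_cuda_graph=True, max_bs=256)
```

With these made-up values the old predicate accepts the batch while the new one rejects it, so the batch would skip the cudagraph tensor preparation in this branch.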