Fix CUDA graph decode capture crash in AITER FlashAttention (#36042)

Signed-off-by: Martin Yuan <myuan@meta.com>
Co-authored-by: Martin Yuan <myuan@meta.com>

Commit 1a971808, authored Mar 06, 2026 by Mengtao (Martin) Yuan, committed by GitHub Mar 06, 2026

Showing 1 changed file with 3 additions and 4 deletions

vllm/v1/attention/backends/rocm_aiter_fa.py +3 -4

--- a/vllm/v1/attention/backends/rocm_aiter_fa.py
+++ b/vllm/v1/attention/backends/rocm_aiter_fa.py
@@ -1152,11 +1152,10 @@ class AiterFlashAttentionImpl(AttentionImpl):
                decode_max_query_len = attn_metadata.decode_metadata.max_query_len

                # Use unified_attention for speculative decoding (multi-token)
-                # or when sliding window is enabled
-                if self.sliding_window[0] != -1 or decode_max_query_len > 1:
+                if decode_max_query_len > 1:
                    assert not rocm_aiter_ops.is_shuffle_kv_cache_enabled(), (
-                        "Shuffle KV cache layout is not supported with sliding "
-                        "window or speculative decoding (multi-token decode)."
+                        "Shuffle KV cache layout is not supported with "
+                        "speculative decoding (multi-token decode)."
                    )
                    from aiter.ops.triton.unified_attention import (
                        unified_attention,
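
Illustration (not part of the commit): the sketch below restates the old and new branch conditions from the hunk above as plain predicates, to make visible which decode case stops taking the `unified_attention` path. The function names and the sample `sliding_window` tuples are hypothetical and chosen only for this example; the one convention taken from the diff is that `sliding_window[0] == -1` means no sliding window. It does not model the kernels themselves or the CUDA graph capture.

```python
# Hypothetical helpers that mirror only the `if` conditions shown in the diff.
# They are not functions from vLLM or AITER.

def took_unified_path_before(sliding_window: tuple, decode_max_query_len: int) -> bool:
    # Old condition: sliding window enabled OR multi-token (speculative) decode.
    return sliding_window[0] != -1 or decode_max_query_len > 1


def takes_unified_path_after(sliding_window: tuple, decode_max_query_len: int) -> bool:
    # New condition: multi-token (speculative) decode only.
    return decode_max_query_len > 1


# Sample inputs (made up): (sliding_window, decode_max_query_len)
cases = [
    ((-1, -1), 1),  # no sliding window, single-token decode
    ((-1, -1), 2),  # no sliding window, multi-token decode
    ((127, 0), 1),  # sliding window, single-token decode
    ((127, 0), 2),  # sliding window, multi-token decode
]

# Collect the cases whose routing differs between the old and new condition.
changed = [
    case
    for case in cases
    if took_unified_path_before(*case) != takes_unified_path_after(*case)
]

for sliding_window, query_len in changed:
    print(f"routing changed: sliding_window={sliding_window}, "
          f"decode_max_query_len={query_len}")
```

Under these sample inputs, the only case whose routing differs is single-token decode with a sliding window enabled, which no longer enters the `unified_attention` branch.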