Unverified Commit 1f30c05d authored by JieXin Liang, committed by GitHub

[fix] fix fa3 forward_decode with spec_decode (#6395)


Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
parent 5dd62c3a
@@ -918,8 +918,11 @@ class FlashAttentionBackend(AttentionBackend):
             and local_attn_metadata is not None
             and (hasattr(layer, "use_irope") and layer.use_irope)
         )
-        # We do cascade attention for Draft Decode with topk > 1
-        use_cascade_attn = self.topk > 1
+        # When Spec Decode is enabled, forward_decode is called in two modes:
+        # 1. DRAFT_DECODE: we enable cascade attention when topk > 1
+        # 2. IDLE: we don't need cascade attention; spec_info is None in this case
+        use_cascade_attn = forward_batch.spec_info is not None and self.topk > 1
         # Calculate window size (can be moved to metadata if layer properties don't change)
         # we don't do layer.sliding_window_size - 1 since in model.get_attention_sliding_window_size() we already - 1
...
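The fix above can be illustrated in isolation. This is a minimal sketch, not the actual SGLang implementation: `ForwardBatch` and `should_use_cascade_attn` here are simplified stand-ins that keep only the fields the diff touches (`spec_info`, `topk`), to show why gating on `spec_info is not None` matters for IDLE batches.

```python
# Hypothetical minimal sketch of the guard this commit fixes. Cascade
# attention should only be enabled for draft-decode batches (spec_info
# present); IDLE batches carry spec_info=None and must not use it.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForwardBatch:
    # Populated for DRAFT_DECODE batches, None for IDLE batches.
    spec_info: Optional[object] = None


def should_use_cascade_attn(forward_batch: ForwardBatch, topk: int) -> bool:
    # Old (buggy) condition: topk > 1 alone, which also fired on IDLE batches.
    # Fixed condition: additionally require spec_info to be present.
    return forward_batch.spec_info is not None and topk > 1


# DRAFT_DECODE with topk > 1: cascade attention enabled.
assert should_use_cascade_attn(ForwardBatch(spec_info=object()), topk=2) is True
# IDLE batch (spec_info is None): disabled even though topk > 1.
assert should_use_cascade_attn(ForwardBatch(spec_info=None), topk=2) is False
```

With the old condition, an IDLE forward pass (no speculative draft in flight) would still take the cascade-attention path whenever `topk > 1`, which is what this patch prevents.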