Unverified Commit 1f30c05d authored by JieXin Liang, committed by GitHub

[fix] fix fa3 forward_decode with spec_decode (#6395)


Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
parent 5dd62c3a
@@ -918,8 +918,11 @@ class FlashAttentionBackend(AttentionBackend):
             and local_attn_metadata is not None
             and (hasattr(layer, "use_irope") and layer.use_irope)
         )
-        # We do cascade attention for Draft Decode with topk > 1
-        use_cascade_attn = self.topk > 1
+        # When Spec Decode is enabled, forward_decode is called in two modes:
+        # 1. DRAFT_DECODE: we enable cascade attention when topk > 1
+        # 2. IDLE: we don't need cascade attention; spec_info is None in this case
+        use_cascade_attn = forward_batch.spec_info is not None and self.topk > 1
         # Calculate window size (can be moved to metadata if layer properties don't change)
         # we don't do layer.sliding_window_size - 1 since in model.get_attention_sliding_window_size() we already - 1
...
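The fix above can be illustrated in isolation. This is a minimal sketch, not the actual SGLang implementation: `ForwardBatch` and `should_use_cascade_attn` here are simplified stand-ins that keep only the fields the diff touches (`spec_info`, `topk`), to show why gating on `spec_info is not None` matters for IDLE batches.

```python
# Hypothetical minimal sketch of the guard this commit fixes. Cascade
# attention should only be enabled for draft-decode batches (spec_info
# present); IDLE batches carry spec_info=None and must not use it.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForwardBatch:
    # Populated for DRAFT_DECODE batches, None for IDLE batches.
    spec_info: Optional[object] = None


def should_use_cascade_attn(forward_batch: ForwardBatch, topk: int) -> bool:
    # Old (buggy) condition: topk > 1 alone, which also fired on IDLE batches.
    # Fixed condition: additionally require spec_info to be present.
    return forward_batch.spec_info is not None and topk > 1


# DRAFT_DECODE with topk > 1: cascade attention enabled.
assert should_use_cascade_attn(ForwardBatch(spec_info=object()), topk=2) is True
# IDLE batch (spec_info is None): disabled even though topk > 1.
assert should_use_cascade_attn(ForwardBatch(spec_info=None), topk=2) is False
```

With the old condition, an IDLE forward pass (no speculative draft in flight) would still take the cascade-attention path whenever `topk > 1`, which is what this patch prevents.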