# only use kv scaling if: 1) fp8 kv is explicitly enabled, 2) RadixAttention
# has corresponding quantization method so that layer.k_scale is not None,
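# A minimal sketch of the gating described above, not the actual diff; the
# `kv_cache_dtype` string and the fp8 check are assumptions about the
# surrounding code, and `layer` is the RadixAttention layer from context.
k_scale, v_scale = None, None
if kv_cache_dtype.startswith("fp8") and layer.k_scale is not None:
    k_scale, v_scale = layer.k_scale, layer.v_scale
# With both scales left as None, KV-cache descaling is skipped entirely.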
...
@@ -684,8 +701,13 @@ class FlashAttentionBackend(AttentionBackend):
)
# We use cascade attention for Target Verify with topk > 1.
# We don't use cascade attention for Sliding Window Attention:
# - Different window sizes would need to be passed in for each q in the first stage of cascade attention, but the FA3 interface doesn't support passing in a list of window sizes.
# - The overhead of duplicating the computation of the common prefix is small for sliding window layers (seq_len <= window_size), so we can just expand it instead.
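# A sketch of the resulting decision, under the assumptions above; the names
# `forward_mode`, `topk`, and `sliding_window_size` are placeholders for
# illustration, not the actual fields used in this diff.
use_cascade_attn = (
    forward_mode.is_target_verify()
    and topk > 1
    and sliding_window_size is None  # skip SWA layers per the notes above
)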