Commit 18093084 (unverified) authored by vllmellm, committed by GitHub

[Misc] Remove unnecessary fallback to prefill-decode attention (#19138)


Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
parent da403802
@@ -171,10 +171,7 @@ class TritonAttentionImpl(AttentionImpl):
         # Whenever making a change in this method, please benchmark the
         # performance to make sure it does not introduce any overhead.
-        num_queries_per_kv = query.shape[1] // key.shape[1]
-        num_q_is_pow2 = (num_queries_per_kv & (num_queries_per_kv - 1)) == 0
-        use_prefill_decode_attn = (self.force_prefill_decode_attn
-                                   or not num_q_is_pow2)
+        use_prefill_decode_attn = self.force_prefill_decode_attn
         num_actual_tokens = attn_metadata.num_actual_tokens
         if use_prefill_decode_attn:
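
The behavioral effect of this hunk can be illustrated with a minimal standalone sketch. It is not the vLLM implementation: the function names below are hypothetical, and it assumes that `query.shape[1]` and `key.shape[1]` are the query-head and KV-head counts, which the diff itself does not state. Before the change, the prefill-decode path was chosen whenever the number of query heads per KV head was not a power of two. After the change, only the explicit `force_prefill_decode_attn` flag selects it.

```python
def is_power_of_two(n: int) -> bool:
    # Same bit trick as the removed line: n & (n - 1) clears the lowest set bit,
    # so the result is 0 exactly when n has at most one bit set.
    return (n & (n - 1)) == 0


def use_prefill_decode_old(force: bool, num_query_heads: int,
                           num_kv_heads: int) -> bool:
    # Hypothetical helper mirroring the removed logic: fall back when the
    # queries-per-KV-head ratio is not a power of two, or when forced.
    num_queries_per_kv = num_query_heads // num_kv_heads
    return force or not is_power_of_two(num_queries_per_kv)


def use_prefill_decode_new(force: bool) -> bool:
    # Hypothetical helper mirroring the new logic: only the flag matters.
    return force


if __name__ == "__main__":
    # 24 query heads over 4 KV heads gives a ratio of 6, not a power of two.
    print(use_prefill_decode_old(False, 24, 4))
    print(use_prefill_decode_new(False))
```

With a ratio such as 6, the old predicate would have selected the fallback path while the new one does not. The diff does not say why the fallback became unnecessary, so that reasoning is left out here.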