Unverified Commit 7a0b011d authored by Jason Zhu, committed by GitHub

Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py (#2553)
parent 63e835cb
@@ -125,6 +125,7 @@ def test_contexted_kv_attention(
     v_cache = v_cache.view(-1, block_size, num_heads,
                            head_size).permute(0, 2, 3, 1).contiguous()
+    # Warm up the Triton kernel by calling it once before actually measuring generation time
     context_attention_fwd(query, k, v, output, k_cache, v_cache, block_table,
                           b_start_loc, b_seq_len, b_ctx_len, max_input_len)
     torch.cuda.synchronize()
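For context, the comment added above documents the usual warm-up-then-measure idiom for JIT-compiled Triton kernels: the first call pays the one-time compilation and CUDA initialization cost, so only the second call is timed. Below is a minimal sketch of that pattern; the `run_kernel` callable and `time.time()`-based timing are illustrative placeholders, not the exact code in test_prefix_prefill.py.

```python
import time

import torch


def time_cuda_kernel(run_kernel):
    """Time a CUDA/Triton kernel call, excluding one-time JIT compilation cost.

    `run_kernel` is a hypothetical zero-argument callable that launches the
    kernel (e.g. a lambda wrapping context_attention_fwd with its arguments).
    """
    # Warm-up call: triggers Triton JIT compilation and CUDA setup,
    # whose cost should not be attributed to the kernel itself.
    run_kernel()
    torch.cuda.synchronize()

    # Timed call: the kernel is already compiled, so this measures
    # steady-state latency only.
    start = time.time()
    run_kernel()
    torch.cuda.synchronize()
    return time.time() - start
```

Without the warm-up call, the measured time would include Triton's kernel compilation, which can dominate the actual kernel runtime and skew the benchmark.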