Unverified Commit 7a0b011d authored by Jason Zhu, committed by GitHub

Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py (#2553)
parent 63e835cb
@@ -125,6 +125,7 @@ def test_contexted_kv_attention(
     v_cache = v_cache.view(-1, block_size, num_heads,
                            head_size).permute(0, 2, 3, 1).contiguous()
+    # Warm up the Triton kernel by calling it once before actually measuring generation time
     context_attention_fwd(query, k, v, output, k_cache, v_cache, block_table,
                           b_start_loc, b_seq_len, b_ctx_len, max_input_len)
     torch.cuda.synchronize()
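For context, the comment added above documents the usual warm-up-then-measure idiom for JIT-compiled Triton kernels: the first call pays the one-time compilation and CUDA initialization cost, so only the second call is timed. Below is a minimal sketch of that pattern; the `run_kernel` callable and `time.time()`-based timing are illustrative placeholders, not the exact code in test_prefix_prefill.py.

```python
import time

import torch


def time_cuda_kernel(run_kernel):
    """Time a CUDA/Triton kernel call, excluding one-time JIT compilation cost.

    `run_kernel` is a hypothetical zero-argument callable that launches the
    kernel (e.g. a lambda wrapping context_attention_fwd with its arguments).
    """
    # Warm-up call: triggers Triton JIT compilation and CUDA setup,
    # whose cost should not be attributed to the kernel itself.
    run_kernel()
    torch.cuda.synchronize()

    # Timed call: the kernel is already compiled, so this measures
    # steady-state latency only.
    start = time.time()
    run_kernel()
    torch.cuda.synchronize()
    return time.time() - start
```

Without the warm-up call, the measured time would include Triton's kernel compilation, which can dominate the actual kernel runtime and skew the benchmark.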