Quick fix for IMA with the Prefix Prefill kernel during graph capture (#25983)

Signed-off-by: Sage Moore <sage@neuralmagic.com>

Commit 5f2cacdb, authored Oct 03, 2025 by Sage Moore; committed by GitHub Oct 03, 2025

Showing with 8 additions and 0 deletions

vllm/v1/attention/backends/rocm_attn.py +8 -0

--- a/vllm/v1/attention/backends/rocm_attn.py
+++ b/vllm/v1/attention/backends/rocm_attn.py
@@ -83,6 +83,14 @@ class RocmAttentionMetadataBuilder(
        # max_model_len will cause graph capture to be extremely
        # slow, so here we set it to 1.
        attn_metadata.seq_lens.fill_(1)
+
+        if envs.VLLM_V1_USE_PREFILL_DECODE_ATTENTION:
+            # Here we set the query start locs to 0. This is to
+            # cover up an invalid memory access in the prefix_prefill kernel
+            # that we run into during graph capture (#25985)
+            common_attn_metadata.query_start_loc.zero_()
+            common_attn_metadata.query_start_loc_cpu.zero_()
+
        return attn_metadata

    def build(self,
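
Reviewer note: the snippet below is a rough, plain-Python illustration of why zeroing `query_start_loc` could keep a kernel from reading past a small dummy batch during graph capture. It is not the prefix prefill kernel's logic, and the root cause of the invalid memory access is not stated in this commit. The helper names (`query_spans`, `max_token_index`) and the sample offsets are hypothetical; the only assumption taken from the diff is that `query_start_loc` holds cumulative per-request start offsets.

```python
def query_spans(query_start_loc):
    """Return (start, length) for each request from cumulative start offsets."""
    return [(s, e - s) for s, e in zip(query_start_loc, query_start_loc[1:])]


def max_token_index(query_start_loc):
    """Highest token index any request would touch, or -1 if no request has tokens."""
    return max(
        (s + n - 1 for s, n in query_spans(query_start_loc) if n > 0),
        default=-1,
    )


# Hypothetical stale offsets left in a reused buffer by an earlier batch.
stale = [0, 4, 9, 16]
# Hypothetical number of tokens actually backing the dummy capture batch.
num_capture_tokens = 3

# Analogue of query_start_loc.zero_(): every request gets a zero-length span.
zeroed = [0] * len(stale)

# Stale offsets point beyond the dummy batch; zeroed offsets touch nothing.
stale_out_of_range = max_token_index(stale) >= num_capture_tokens
zeroed_out_of_range = max_token_index(zeroed) >= num_capture_tokens
```

With the zeroed offsets every span has length 0, so under this model no token index is ever dereferenced, which is consistent with the commit describing the change as covering up the access rather than fixing it.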