[Bugfix][CI/Test][Spec Decode] Fix illegal memory access in...

[Bugfix][CI/Test][Spec Decode] Fix illegal memory access in offline_inference/spec_decode.py (Issue 27619) (#28432)

Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

Unverified Commit 15ae8e07 authored Nov 14, 2025 by rasmith Committed by GitHub Nov 13, 2025

Showing 1 changed file with 4 additions and 2 deletions

vllm/attention/ops/triton_reshape_and_cache_flash.py +4 -2

--- a/vllm/attention/ops/triton_reshape_and_cache_flash.py
+++ b/vllm/attention/ops/triton_reshape_and_cache_flash.py
@@ -97,7 +97,6 @@ def triton_reshape_and_cache_flash(
     k_scale: torch.Tensor,  # float32
     v_scale: torch.Tensor,  # float32
 ):
-    num_tokens = key.shape[0]
     num_heads = key.shape[1]
     head_size = key.shape[2]
     block_size = key_cache.shape[1]
@@ -155,7 +154,10 @@ def triton_reshape_and_cache_flash(
     # TODO(ngl): maybe replace with static launch grid to avoid overhead if
     #   using cudagraphs
-    grid = lambda meta: (int(num_tokens), triton.cdiv(n, meta["TILE_SIZE"]))
+    grid = lambda meta: (
+        slot_mapping.shape[0],
+        triton.cdiv(n, meta["TILE_SIZE"]),
+    )
     reshape_and_cache_kernel_flash[grid](
         key_ptr=key,
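
The commit message does not say why the old grid caused an illegal memory access. One plausible reading, stated here as an assumption, is that `key` can have more rows than `slot_mapping` has entries (for example when `key` is padded), so a grid sized from `key.shape[0]` launches programs whose token index falls past the end of `slot_mapping`. The pure-Python sketch below only models the grid arithmetic; `launch_grid`, `out_of_bounds_programs`, and all sizes are hypothetical, and it does not call Triton or vLLM.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def launch_grid(num_rows: int, n: int, tile_size: int) -> tuple:
    """Hypothetical helper: 2D grid of (token programs, tiles per token)."""
    return (num_rows, cdiv(n, tile_size))


def out_of_bounds_programs(grid: tuple, slot_mapping: list) -> list:
    """Axis-0 program ids that would index past the end of slot_mapping."""
    return [pid for pid in range(grid[0]) if pid >= len(slot_mapping)]


# Assumed sizes for illustration only: key padded to 8 rows, 5 real slots.
key_rows = 8
slot_mapping = [0, 1, 2, 3, 4]
n = 4 * 64       # assumed to be num_heads * head_size
tile_size = 128  # stands in for meta["TILE_SIZE"]

old_grid = launch_grid(key_rows, n, tile_size)           # sized from key
new_grid = launch_grid(len(slot_mapping), n, tile_size)  # sized from slot_mapping

bad_old = out_of_bounds_programs(old_grid, slot_mapping)
bad_new = out_of_bounds_programs(new_grid, slot_mapping)
print(old_grid, bad_old)
print(new_grid, bad_new)
```

Under these assumed sizes, the grid derived from `key` launches three programs with no matching `slot_mapping` entry, while the grid derived from `slot_mapping` launches none.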