[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor (#34870)

Signed-off-by: Huamin Li <3ericli@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Commit 1fe46216, authored Feb 20, 2026 by Huamin Li; committed by GitHub Feb 20, 2026
Showing 1 changed file with 6 additions and 2 deletions

vllm/v1/attention/backends/utils.py +6 -2

--- a/vllm/v1/attention/backends/utils.py
+++ b/vllm/v1/attention/backends/utils.py
@@ -855,8 +855,12 @@ def mamba_get_block_table_tensor(
            (seq_lens - 1) // kv_cache_spec.block_size,
            min=0,
        )
+        # Use int32 for arithmetic to avoid dtype promotion overhead,
+        # then convert to int64 for gather (which requires Long indices)
        offsets = torch.arange(
-            1 + kv_cache_spec.num_speculative_blocks, device=block_table.device
+            1 + kv_cache_spec.num_speculative_blocks,
+            device=block_table.device,
+            dtype=torch.int32,
        )
-        indices_to_gather = start_indices.unsqueeze(1) + offsets
+        indices_to_gather = (start_indices.unsqueeze(1) + offsets).to(torch.int64)
        return torch.gather(block_table, 1, indices_to_gather)
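
The dtype behavior this change relies on can be reproduced in isolation. The following is a minimal CPU sketch, not the vLLM implementation: `gather_block_ids` is a hypothetical standalone function, and the int32 dtypes of `block_table` and `seq_lens`, the block size, and the sample values are all assumptions made for illustration. It shows only which dtypes result from each step; it does not measure any synchronization or timing cost.

```python
import torch


def gather_block_ids(block_table, seq_lens, block_size, num_speculative_blocks):
    """Hypothetical standalone version of the indexing step shown in the diff."""
    # Index of the block holding the last token of each sequence (stays int32).
    start_indices = torch.clamp((seq_lens - 1) // block_size, min=0)
    # Explicit int32 offsets so the addition below does not promote to int64.
    offsets = torch.arange(
        1 + num_speculative_blocks,
        device=block_table.device,
        dtype=torch.int32,
    )
    indices = start_indices.unsqueeze(1) + offsets
    assert indices.dtype == torch.int32
    # Single explicit cast, since the diff notes that gather needs Long indices.
    return torch.gather(block_table, 1, indices.to(torch.int64))


# Assumed sample data: 2 sequences, 6 block slots each, int32 throughout.
block_table = torch.arange(12, dtype=torch.int32).reshape(2, 6)
seq_lens = torch.tensor([5, 17], dtype=torch.int32)
out = gather_block_ids(block_table, seq_lens, block_size=8, num_speculative_blocks=1)

# For comparison: arange without dtype defaults to int64, so adding it to an
# int32 tensor promotes the result to int64 implicitly.
start = torch.clamp((seq_lens - 1) // 8, min=0)
promoted = start.unsqueeze(1) + torch.arange(2)
print(out.tolist(), promoted.dtype)
```

Both variants gather the same block IDs; the difference is only where the int32 to int64 conversion happens (implicitly inside the addition, or once as an explicit `.to(torch.int64)` afterwards).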