[BugFix] Fix Llama4 - Index Error When Single Request Near Max Context (#16209)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

Commit e1a2c699, authored Apr 08, 2025 by Lucas Wilkinson, committed by GitHub Apr 08, 2025

Showing with 1 addition and 1 deletion

vllm/v1/attention/backends/flash_attn.py +1 -1

--- a/vllm/v1/attention/backends/flash_attn.py
+++ b/vllm/v1/attention/backends/flash_attn.py
@@ -264,7 +264,7 @@ def make_local_attention_virtual_batches(
        np.arange(pages_per_local_batch, dtype=np.int32),
        (virtual_batches, pages_per_local_batch)) \
            + np.expand_dims(block_starts, axis=1)
-    block_indices = block_indices.flatten()
+    block_indices = block_indices.flatten().clip(max=block_table.shape[1] - 1)
    batch_indices = np.repeat(np.arange(actual_batch_size, dtype=np.int32),
                              local_blocks * pages_per_local_batch)
    block_table_local = block_table[batch_indices, block_indices]\
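The effect of the one-line change can be reproduced outside vLLM with plain NumPy. The sketch below is illustrative only: the toy `block_table`, the `block_starts` values, and the sizes are made-up assumptions, chosen so that the last virtual batch starts at the final block-table column, which is roughly the situation the commit title describes (a single request near max context). It mirrors the index arithmetic visible in the hunk and shows that the unclipped indices run one column past the end of the table, while the clipped ones stay in range.

```python
import numpy as np

# Toy values, NOT taken from vLLM: one request whose block table has 4 columns.
block_table = np.array([[10, 11, 12, 13]], dtype=np.int32)
actual_batch_size = 1
pages_per_local_batch = 2

# Hypothetical start columns for two virtual batches; the second one
# starts at the last valid column (index 3).
block_starts = np.array([1, 3], dtype=np.int32)
virtual_batches = len(block_starts)
local_blocks = virtual_batches  # single request owns every virtual batch

# Same arithmetic as the hunk: per-batch page offsets plus the start column.
block_indices = np.broadcast_to(
    np.arange(pages_per_local_batch, dtype=np.int32),
    (virtual_batches, pages_per_local_batch)) \
        + np.expand_dims(block_starts, axis=1)

batch_indices = np.repeat(np.arange(actual_batch_size, dtype=np.int32),
                          local_blocks * pages_per_local_batch)

# Before the fix: the flattened indices include column 4, which does not exist.
unclipped = block_indices.flatten()
try:
    block_table[batch_indices, unclipped]
    raised = False
except IndexError:
    raised = True

# After the fix: indices are capped at the last valid column.
clipped = unclipped.clip(max=block_table.shape[1] - 1)
block_table_local = block_table[batch_indices, clipped]

print(unclipped, raised)   # indices before clipping, and whether indexing failed
print(clipped, block_table_local)
```

In this sketch the out-of-range slot is filled by repeating the last valid block id. Whether downstream code ignores such padded slots is not shown in this diff, so treat that as an assumption rather than a documented guarantee.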