[Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

Commit 9855aea2 (unverified), authored Dec 13, 2024 by Cody Yu; committed by GitHub, Dec 13, 2024

Showing 1 changed file with 7 additions and 3 deletions

vllm/v1/core/scheduler.py +7 -3

--- a/vllm/v1/core/scheduler.py
+++ b/vllm/v1/core/scheduler.py
@@ -199,9 +199,13 @@ class Scheduler:
                if num_new_tokens == 0:
                    # The happens when prompt length is divisible by the block
                    # size and all blocks are cached. Now we force to recompute
-                    # the last token.
-                    num_computed_tokens -= 1
-                    num_new_tokens = 1
+                    # the last block. Note that we have to re-compute an entire
+                    # block because allocate_slots() assumes num_computed_tokens
+                    # is always a multiple of the block size. This limitation
+                    # can potentially be removed in the future to slightly
+                    # improve the performance.
+                    num_computed_tokens -= self.block_size
+                    num_new_tokens = self.block_size
                    computed_blocks.pop()
                num_new_tokens = min(num_new_tokens, token_budget)
                assert num_new_tokens > 0
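
The arithmetic behind the patched branch can be illustrated with a small standalone sketch. It is not vLLM code: `adjust_for_full_cache_hit` and its `recompute_whole_block` flag are hypothetical names introduced here, and the sketch only assumes what the diff's comment states, namely that `allocate_slots()` expects `num_computed_tokens` to be a multiple of the block size.

```python
def adjust_for_full_cache_hit(num_prompt_tokens, num_cached_blocks, block_size,
                              recompute_whole_block=True):
    """Hypothetical helper mirroring the branch in the diff (not vLLM's API).

    Returns (num_computed_tokens, num_new_tokens, num_cached_blocks).
    """
    num_computed_tokens = num_cached_blocks * block_size
    num_new_tokens = num_prompt_tokens - num_computed_tokens
    if num_new_tokens == 0:
        # Every prompt token is cached, but at least one token must still be
        # scheduled, so some cached work has to be recomputed.
        if recompute_whole_block:
            # Patched behavior: back off by one full block.
            num_computed_tokens -= block_size
            num_new_tokens = block_size
        else:
            # Previous behavior: back off by a single token.
            num_computed_tokens -= 1
            num_new_tokens = 1
        # Mirrors computed_blocks.pop() in the diff.
        num_cached_blocks -= 1
    return num_computed_tokens, num_new_tokens, num_cached_blocks


block_size = 16

# A 32-token prompt whose two blocks are both cached (full cache hit).
new = adjust_for_full_cache_hit(32, 2, block_size)
old = adjust_for_full_cache_hit(32, 2, block_size, recompute_whole_block=False)

print(new)  # (16, 16, 1): computed tokens stay block-aligned
print(old)  # (31, 1, 1): 31 is not a multiple of 16

# A partial hit is unaffected by the branch.
partial = adjust_for_full_cache_hit(40, 2, block_size)
print(partial)  # (32, 8, 2)
```

With the previous one-token back-off, one block is popped but 31 tokens are still counted as computed, so the count no longer matches the remaining cached blocks. Recomputing the whole block keeps the two consistent at the cost of `block_size - 1` extra tokens of work, which is the performance trade-off the diff's comment mentions.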