OpenDAS / vllm_cscc

Commit 9855aea2
Authored Dec 13, 2024 by Cody Yu
Committed by GitHub, Dec 13, 2024

[Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

Parent: 4b5b8a6a
Showing 1 changed file with 7 additions and 3 deletions

vllm/v1/core/scheduler.py (+7 -3)
@@ -199,9 +199,13 @@ class Scheduler:
     if num_new_tokens == 0:
         # The happens when prompt length is divisible by the block
         # size and all blocks are cached. Now we force to recompute
-        # the last token.
-        num_computed_tokens -= 1
-        num_new_tokens = 1
+        # the last block. Note that we have to re-compute an entire
+        # block because allocate_slots() assumes num_computed_tokens
+        # is always a multiple of the block size. This limitation
+        # can potentially be removed in the future to slightly
+        # improve the performance.
+        num_computed_tokens -= self.block_size
+        num_new_tokens = self.block_size
         computed_blocks.pop()
     num_new_tokens = min(num_new_tokens, token_budget)
     assert num_new_tokens > 0
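
The effect of the change can be illustrated with a minimal standalone sketch. This is not the vLLM `Scheduler`; the function name `plan_prefill`, its signature, the use of plain integers as block IDs, and the assumption that `num_computed_tokens` starts at `len(computed_blocks) * block_size` are all hypothetical and chosen only to mirror the variable names in the diff.

```python
from typing import List, Tuple


def plan_prefill(
    num_prompt_tokens: int,
    computed_blocks: List[int],
    block_size: int,
    token_budget: int,
) -> Tuple[int, int, List[int]]:
    """Return (num_computed_tokens, num_new_tokens, computed_blocks).

    Hypothetical stand-in for the scheduler step shown in the diff.
    Block IDs are plain ints here; every cached block is assumed full.
    """
    computed_blocks = list(computed_blocks)  # do not mutate the caller's list
    num_computed_tokens = len(computed_blocks) * block_size
    num_new_tokens = num_prompt_tokens - num_computed_tokens
    if num_new_tokens == 0:
        # Full cache hit: drop the last cached block and recompute it whole,
        # so num_computed_tokens stays a multiple of block_size.
        num_computed_tokens -= block_size
        num_new_tokens = block_size
        computed_blocks.pop()
    num_new_tokens = min(num_new_tokens, token_budget)
    assert num_new_tokens > 0
    return num_computed_tokens, num_new_tokens, computed_blocks


# Prompt of 32 tokens, block size 16, both blocks already cached.
computed, new, blocks = plan_prefill(32, [7, 9], block_size=16, token_budget=64)
print(computed, new, blocks)  # 16 16 [7]
```

With the pre-fix logic (`num_computed_tokens -= 1`), the same input would yield 31 computed tokens, which is not a multiple of the block size and so violates the assumption that the commit's comment attributes to `allocate_slots()`.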