[MRv2]fix: model accuracy regression caused by reusing the stale...

[MRv2]fix: model accuracy regression caused by reusing the stale last_sampled_tokens and draft_tokens (#39833) Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>

[MRv2]fix: model accuracy regression caused by reusing the stale...
[MRv2]fix: model accuracy regression caused by reusing the stale last_sampled_tokens and draft_tokens (#39833) Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
9a6a66f3 · Zijing Liu · GitHub · 67eb6083 · 9a6a66f3
Unverified Commit 9a6a66f3 authored Apr 21, 2026 by Zijing Liu Committed by GitHub Apr 21, 2026
Hide whitespace changes
Inline Side-by-side

Showing with 12 additions and 0 deletions

vllm/v1/worker/gpu/states.py vllm/v1/worker/gpu/states.py +12 -0

No files found.
--- a/vllm/v1/worker/gpu/states.py
+++ b/vllm/v1/worker/gpu/states.py
@@ -102,6 +102,18 @@ class RequestState:
        self.num_computed_prefill_tokens[req_idx] = num_computed_tokens
        self.num_computed_tokens.stage_write_elem(req_idx, num_computed_tokens)

+        if num_computed_tokens > 0 and num_computed_tokens <= prefill_len:
+            # For PD disagg or resumed requests: set last_sampled to the last
+            # computed token so the first decode step gets the right input_id.
+            # For fresh prefill requests (num_computed_tokens == 0) the tensor
+            # is not read by combine_sampled_and_draft_tokens so we skip the
+            # write. Use a slice assignment rather than scalar indexing so the
+            # write is dispatched through fill_ without a host/device sync.
+            self.last_sampled_tokens[req_idx : req_idx + 1] = all_token_ids[
+                num_computed_tokens - 1
+            ]
+        self.draft_tokens[req_idx].zero_()
+
    def apply_staged_writes(self) -> None:
        self.prompt_len.copy_to_uva()
        self.prefill_len.copy_to_uva()