Merge branch 'v0.9.2-dev_pp_bug' into 'v0.9.2-dev'

Fix a hang in pipeline-parallel (PP) scenarios caused by tokens being wrongly discarded during the decode stage. See merge request dcutoolkit/deeplearing/vllm!363

c1795786 · zhuwenwen · ce5b3c9a · 62a5b28f · c1795786
Commit c1795786 authored Jan 13, 2026 by zhuwenwen

Showing with 10 additions and 0 deletions

vllm/v1/worker/gpu_model_runner.py +10 -0

--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -1600,6 +1600,11 @@ class GPUModelRunnerBase(LoRAModelRunnerMixin):
            seq_len = (req_state.num_computed_tokens +
                       scheduler_output.num_scheduled_tokens[req_id])
            if seq_len < req_state.num_tokens:
+                # If we have already started decoding, seeing a "partial prefill"
+                # condition is suspicious and can lead to discarding the sampled
+                # token forever (PP stall).
+                if req_state.output_token_ids:
+                    continue
                # Ignore the sampled token for partial prefills.
                # Rewind the generator state as if the token was not sampled.
                # This relies on cuda-specific torch-internal impl details
@@ -3461,6 +3466,11 @@ class GPUModelRunnerMTP(GPUModelRunnerBase):
            seq_len = (req_state.num_computed_tokens +
                       scheduler_output.num_scheduled_tokens[req_id])
            if seq_len < req_state.num_tokens:
+                # If we have already started decoding, seeing a "partial prefill"
+                # condition is suspicious and can lead to discarding the sampled
+                # token forever (PP stall).
+                if req_state.output_token_ids:
+                    continue
                # Ignore the sampled token for partial prefills.
                # Rewind the generator state as if the token was not sampled.
                # This relies on cuda-specific torch-internal impl details
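
The guard added in both hunks can be illustrated in isolation with a minimal sketch. It is not the runner's actual code: `RequestState` and `indices_to_discard` are hypothetical stand-ins, and only the field names `num_computed_tokens`, `num_tokens`, and `output_token_ids` and the comparison itself are taken from the diff. The sketch assumes that a request with a non-empty `output_token_ids` has already begun decoding, so its sampled token should be kept even when the "partial prefill" condition is met.

```python
from dataclasses import dataclass, field


@dataclass
class RequestState:
    """Hypothetical stand-in for the runner's per-request state."""
    num_computed_tokens: int
    num_tokens: int
    output_token_ids: list[int] = field(default_factory=list)


def indices_to_discard(requests, num_scheduled_tokens):
    """Return batch indices whose sampled token would be ignored.

    Mirrors the patched condition: a request looks like a partial
    prefill when seq_len < num_tokens, but it is skipped (its token
    is kept) once it has produced at least one output token.
    """
    discard = []
    for i, (req_id, state) in enumerate(requests.items()):
        seq_len = state.num_computed_tokens + num_scheduled_tokens[req_id]
        if seq_len < state.num_tokens:
            if state.output_token_ids:
                # Already decoding: keep the sampled token.
                continue
            # Genuine partial prefill: ignore the sampled token.
            discard.append(i)
    return discard


requests = {
    "prefill": RequestState(num_computed_tokens=0, num_tokens=8),
    "decode": RequestState(num_computed_tokens=8, num_tokens=10,
                           output_token_ids=[101]),
    "full": RequestState(num_computed_tokens=4, num_tokens=8),
}
scheduled = {"prefill": 4, "decode": 1, "full": 4}
result = indices_to_discard(requests, scheduled)
```

With these illustrative values only the `prefill` request (index 0) is marked for discard. The `decode` request also satisfies `seq_len < num_tokens`, but the new guard lets it through. Without the guard it would be marked as well, which is the case the commit message describes as leading to a stall.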