Skip MM Encoder for non-first PP ranks (#24387)
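
Reading the diff below, the embeddings gathered from the multimodal encoder were already consumed only inside the branch gated on `get_pp_group().is_first_rank`, so running the encoder on later pipeline-parallel ranks appears to have been wasted work. The following is a minimal sketch of that gating, not the actual `GPUModelRunner` code: `ToyRunner`, `is_first_pp_rank`, `encoder_calls`, and `prepare_mm` are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRunner:
    """Hypothetical stand-in for a model runner on one PP rank."""

    supports_mm_inputs: bool
    is_first_pp_rank: bool  # assumption: mirrors get_pp_group().is_first_rank
    encoder_calls: int = 0

    def _execute_mm_encoder(self) -> None:
        # Placeholder for the (expensive) multimodal encoder pass.
        self.encoder_calls += 1

    def _gather_mm_embeddings(self) -> List[str]:
        # Placeholder for collecting the encoder outputs in batch order.
        return ["mm_embed"]

    def prepare_mm(self) -> Optional[List[str]]:
        # Run the encoder only where its output is actually consumed:
        # on the first pipeline-parallel rank of a multimodal model.
        if self.supports_mm_inputs and self.is_first_pp_rank:
            self._execute_mm_encoder()
            return self._gather_mm_embeddings()
        # Other ranks skip the encoder entirely.
        return None


first = ToyRunner(supports_mm_inputs=True, is_first_pp_rank=True)
later = ToyRunner(supports_mm_inputs=True, is_first_pp_rank=False)
first_embeds = first.prepare_mm()
later_embeds = later.prepare_mm()
```

In this sketch only `first` invokes the encoder; `later` does no encoder work and gets no embeddings, which matches the intent stated in the commit title.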

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Commit 2e5d2137, authored Sep 07, 2025 by Woosuk Kwon; committed by GitHub Sep 07, 2025

Showing 1 changed file with 1 addition and 4 deletions

vllm/v1/worker/gpu_model_runner.py +1 -4

--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -1620,14 +1620,11 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):

        # _prepare_inputs may reorder the batch, so we must gather multi
        # modal outputs after that to ensure the correct order
-        if self.supports_mm_inputs:
+        if self.supports_mm_inputs and get_pp_group().is_first_rank:
            # Run the multimodal encoder if any.
            self._execute_mm_encoder(scheduler_output)
            mm_embeds = self._gather_mm_embeddings(scheduler_output)
-        else:
-            mm_embeds = []

-        if self.supports_mm_inputs and get_pp_group().is_first_rank:
            # NOTE(woosuk): To unify token ids and soft tokens (vision
            # embeddings), we always use embeddings (rather than token ids)
            # as input to the multimodal model, even when the input is text.