support qwen3-vl handle requests with embeddings (#30037)

Signed-off-by: taoyun <1069423820@qq.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Unverified Commit 6dcb07f6 authored Dec 05, 2025 by Tao Yun Committed by GitHub Dec 04, 2025

Showing with 7 additions and 2 deletions

docs/features/multimodal_inputs.md +2 -0
vllm/model_executor/models/qwen3_vl.py +5 -2

--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -443,6 +443,8 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
        print(generated_text)
    ```
+For Qwen3-VL, the `image_embeds` should contain both the base image embedding and deepstack features.
 #### Audio Embeddings
 You can pass pre-computed audio embeddings similar to image embeddings:

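As a rough illustration of the documentation note in the hunk above, the sketch below computes the width one `image_embeds` row would have if the deepstack features are concatenated to the base embedding along the last dimension, and splits such a row back into its parts. That layout, the helper names `image_embeds_width` and `split_row`, and the sizes used are assumptions made for illustration; this commit does not state them.

```python
from typing import List, Sequence, Tuple


def image_embeds_width(base_hidden_size: int, num_deepstack_levels: int) -> int:
    """Width of one row: the base embedding plus one block per deepstack level.

    Assumption (not confirmed by this commit): every deepstack feature has the
    same width as the base embedding and is concatenated along the last axis.
    """
    if base_hidden_size <= 0 or num_deepstack_levels < 0:
        raise ValueError("sizes must be positive / non-negative")
    return base_hidden_size * (1 + num_deepstack_levels)


def split_row(
    row: Sequence[float], base_hidden_size: int
) -> Tuple[List[float], List[List[float]]]:
    """Split one concatenated row into (base embedding, list of deepstack blocks)."""
    if not row or len(row) % base_hidden_size != 0:
        raise ValueError("row length must be a positive multiple of base_hidden_size")
    chunks = [
        list(row[i : i + base_hidden_size])
        for i in range(0, len(row), base_hidden_size)
    ]
    return chunks[0], chunks[1:]


if __name__ == "__main__":
    # Illustrative sizes only: a base width of 4 with 3 deepstack levels.
    print(image_embeds_width(4, 3))
    print(split_row(list(range(8)), 4))
```

If the real layout differs (for example, deepstack features passed separately), only `split_row` would need to change; the point is that the embedding passed for Qwen3-VL is wider than the base image embedding alone.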
--- a/vllm/model_executor/models/qwen3_vl.py
+++ b/vllm/model_executor/models/qwen3_vl.py
@@ -103,7 +103,7 @@ from .qwen2_5_vl import (
    Qwen2_5_VLVideoInputs,
    Qwen2_5_VLVideoPixelInputs,
 )
-from .qwen2_vl import Qwen2VLProcessingInfo
+from .qwen2_vl import Qwen2VLMultiModalDataParser, Qwen2VLProcessingInfo
 from .qwen3 import Qwen3ForCausalLM, Qwen3Model
 from .utils import (
    AutoWeightsLoader,
@@ -884,7 +884,10 @@ class Qwen3VLDummyInputsBuilder(BaseDummyInputsBuilder[Qwen3VLProcessingInfo]):
 class Qwen3VLMultiModalProcessor(BaseMultiModalProcessor[Qwen3VLProcessingInfo]):
    def _get_data_parser(self) -> MultiModalDataParser:
-        return MultiModalDataParser(video_needs_metadata=True)
+        return Qwen2VLMultiModalDataParser(
+            self.info.get_hf_config().vision_config.spatial_merge_size,
+            video_needs_metadata=True,
+        )
    def _call_hf_processor(
        self,
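A minimal sketch of why the parser is given `spatial_merge_size`: assuming, as in Qwen2-VL style vision encoders, that each `spatial_merge_size` x `spatial_merge_size` block of patches is merged into one embedding, the number of rows expected in `image_embeds` for an image follows from its `(t, h, w)` patch grid. The helper name `num_embed_rows` and the sample grid are hypothetical and are not part of vLLM.

```python
from typing import Tuple


def num_embed_rows(grid_thw: Tuple[int, int, int], spatial_merge_size: int) -> int:
    """Number of embedding rows for one image, given its (t, h, w) patch grid.

    Assumes each merge_size x merge_size block of patches becomes one row.
    """
    t, h, w = grid_thw
    if spatial_merge_size <= 0:
        raise ValueError("spatial_merge_size must be positive")
    if h % spatial_merge_size or w % spatial_merge_size:
        raise ValueError("h and w must be divisible by spatial_merge_size")
    return t * h * w // (spatial_merge_size**2)


if __name__ == "__main__":
    # A 1 x 4 x 6 patch grid with 2 x 2 merging.
    print(num_embed_rows((1, 4, 6), 2))
```

Under this assumption, a parser that knows the merge size can check that the number of rows in a supplied embedding matches the accompanying grid, which a generic parser without that parameter cannot do.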