chore(multimodal): Cleanup multimodal docs and consolidate launch scripts (#7845)

Commit dacb2980 authored Apr 03, 2026 by Ryan McCormick; committed by GitHub Apr 03, 2026
8 changed files
--- a/docs/features/multimodal/multimodal-vllm.md
+++ b/docs/features/multimodal/multimodal-vllm.md
@@ -7,102 +7,54 @@ title: vLLM Multimodal
 This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.

 <Warning>
-**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
-This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
+**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be set explicitly at startup. This is a security feature that prevents unintended processing of multimodal data from untrusted sources. Workers fail at startup if a multimodal worker mode is enabled without `--enable-multimodal`. This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but extends it to all multimodal content (URLs, embeddings, and base64 data).
 </Warning>

 ## Support Matrix

-| Modality | Input Format | Aggregated | Disaggregated | Notes |
-|----------|--------------|------------|---------------|-------|
-| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
-| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
-| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
-| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
+| Modality                 | Aggregated | Disaggregated |
+| ------------------------ | ---------- | ------------- |
+| **Image**                | Yes        | Yes           |
+| **Video**                | Yes        | Yes           |
+| **Audio** (Experimental) | Yes        | Yes           |

 ### Supported URL Formats

 | Format         | Example                              | Description                |
-|--------|---------|-------------|
+| -------------- | ------------------------------------ | -------------------------- |
 | **HTTP/HTTPS** | `http://example.com/image.jpg`       | Remote media files         |
 | **Data URL**   | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |

 ## Deployment Patterns

-vLLM supports all multimodal deployment patterns. See [Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
+The main multimodal vLLM launchers in this repo are:

-| Pattern | Supported | Launch Script | Notes |
-|---------|-----------|---------------|-------|
-| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup |
-| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
-| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
-| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
+| Pattern                     | Launch Script               | Best For                                                                            |
+| --------------------------- | --------------------------- | ----------------------------------------------------------------------------------- |
+| Aggregated                  | `agg_multimodal.sh`         | Simplest image/video serving from a single multimodal worker                        |
+| E/PD (Encode + PD)          | `disagg_multimodal_e_pd.sh` | Simple example of separating encoder, good for testing embedding-cache workflows    |
+| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh`  | Disaggregated image/video serving with separate encode, prefill, and decode workers |

-### Component Flags
+## Image/Video Serving

-| Component | Flag | Purpose |
-|-----------|------|---------|
-| Processor | `--multimodal-processor` | HTTP entry, tokenization |
-| Encode Worker | `--multimodal-encode-worker` | Media encoding |
-| PD Worker | `--multimodal-worker` | Prefill + Decode |
-| Prefill Worker | `--multimodal-worker --disaggregation-mode prefill` | Prefill only |
-| Decode Worker | `--multimodal-decode-worker` | Decode only |
+Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). `Qwen/Qwen3-VL-2B-Instruct` is a good example because the same model can handle both `image_url` and `video_url` requests through the standard OpenAI chat endpoint.

-## Use the Latest Release
+### Aggregated Serving

-We recommend using the latest stable release of dynamo to avoid breaking changes:
-
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-
-You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-
-## Image Serving
-
-### E/PD Serving (Encode Separate)
-
-**Components:**
-
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
-
-**Workflow:**
-
-The EncodeWorkerHandler encodes the image and passes the embeddings to the DecodeWorkerHandler via NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> encode_worker
-  encode_worker --> processor
-  encode_worker --embeddings--> pd_worker
-  pd_worker --> encode_worker
-```
-
-> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
-
-**Launch:**
+Use the single-worker aggregated launcher for the simplest image/video setup:

 ```bash
 cd $DYNAMO_HOME/examples/backends/vllm
-# Serve a LLaVA 1.5 7B model:
-bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
-# Serve a Qwen2.5-VL model:
-bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
+bash launch/agg_multimodal.sh --model Qwen/Qwen3-VL-2B-Instruct
 ```

-**Client:**
+**Image request:**

 ```bash
 curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-      "model": "llava-hf/llava-1.5-7b-hf",
+      "model": "Qwen/Qwen3-VL-2B-Instruct",
      "messages": [
        {
          "role": "user",
@@ -120,205 +72,71 @@ curl http://localhost:8000/v1/chat/completions \
          ]
        }
      ],
-      "max_tokens": 300,
+      "max_tokens": 64,
      "temperature": 0.0,
      "stream": false
    }'
 ```

-### E/P/D Serving (Full Disaggregation)
-
-**Components:**
-
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for decoding, and [PrefillWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
-
-**Workflow:**
-
-For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
-
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> encode_worker
-  encode_worker --> processor
-  encode_worker --embeddings--> prefill_worker
-  prefill_worker --> encode_worker
-  prefill_worker --> decode_worker
-  decode_worker --> prefill_worker
-```
-
-**Launch:**
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
-```
-
-<Note>
-Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
-</Note>
-
-## Llama 4 Serving
-
-The Llama 4 model family is natively multimodal. Unlike LLaVA, they do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
-
-Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8.
-
-### Llama 4 Aggregated Serving
-
-**Workflow:**
-
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> pd_worker
-  pd_worker --> processor
-```
-
-**Launch:**
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/agg_multimodal.sh --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
-```
-
-**Client:**
+**Video request:**

 ```bash
 curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
+      "model": "Qwen/Qwen3-VL-2B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
-              "text": "What is in this image?"
+              "text": "Describe the video in detail"
            },
            {
-              "type": "image_url",
-              "image_url": {
-                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              "type": "video_url",
+              "video_url": {
+                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
              }
            }
          ]
        }
      ],
-      "max_tokens": 300,
-      "temperature": 0.0,
+      "max_tokens": 64,
      "stream": false
-    }'
+    }' | jq
 ```

-### Llama 4 Disaggregated Serving
+### E/PD Serving (Encode + PD)

-**Workflow:**
-
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> prefill_worker
-  prefill_worker --> processor
-  prefill_worker --> decode_worker
-  decode_worker --> prefill_worker
-```
-
-**Launch:**
+Use `disagg_multimodal_e_pd.sh` when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments; use `agg_multimodal.sh` or `disagg_multimodal_epd.sh` for general video serving.

 ```bash
 cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/disagg_multimodal_llama.sh --head-node

-# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node:
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/disagg_multimodal_llama.sh
-```
+# Multi-GPU deployment
+bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct

-## Video Serving
+# Single-GPU (functional testing with small models)
+bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu

-### Video Aggregated Serving
-
-**Components:**
-
- worker: Standard `python -m dynamo.vllm --enable-multimodal` backend.
- frontend: Standard `python -m dynamo.frontend` OpenAI-compatible endpoint.
-
-**Workflow:**
-
-The Rust preprocessor tokenizes the request and forwards `multi_modal_data` with `video_url` entries. The vLLM backend decodes video URLs into sampled RGB frames and attaches them to `TokensPrompt(multi_modal_data=...)` for standard multimodal processing.
-
-```mermaid
-flowchart LR
-  HTTP --> frontend
-  frontend --> vllm_worker
-  vllm_worker --> frontend
 ```

-**Launch:**
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/video_agg.sh
-```
+### E/P/D Serving (Full Disaggregation)

-**Client:**
+Use the full disaggregated launcher when you want separate encode, prefill, and decode workers for image/video workloads:

 ```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "Qwen/Qwen3-VL-2B-Instruct",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "Describe the video in detail"
-            },
-            {
-              "type": "video_url",
-              "video_url": {
-                "url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "stream": false
-    }' | jq
-```
-
-### Video Disaggregated Serving
-
-**Workflow:**
-
-The Rust preprocessor tokenizes the request and forwards `multi_modal_data` with `video_url` entries. The prefill worker decodes the video into sampled RGB frames locally, runs the multimodal prefill, and forwards KV state to the decode worker through the normal disaggregated vLLM path.
-
-```mermaid
-flowchart LR
-  HTTP --> frontend
-  frontend --> prefill_worker
-  prefill_worker --> decode_worker
-  decode_worker --> frontend
-```
+cd $DYNAMO_HOME/examples/backends/vllm

-**Launch:**
+# Multi-GPU deployment
+bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct

-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/video_disagg.sh
+# Single-GPU (functional testing with small models)
+bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
 ```

-## Audio Serving
+## Audio Serving (Experimental)

 ### Audio Aggregated Serving

@@ -409,13 +227,13 @@ bash launch/audio_disagg.sh
 Dynamo supports embedding cache in both aggregated and disaggregated settings:

 | Setting                   | Implementation                                                 | Launch Script               |
-|---------|---------------|---------------|
+| ------------------------- | -------------------------------------------------------------- | --------------------------- |
+| **Aggregated**            | Supported via vLLM ECConnector in vLLM 0.18+                   | `agg_multimodal.sh` (or with `vllm serve` directly) |
 | **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
-| **Aggregated** | Experimental via vLLM git patches | N/A |

 ### Aggregated Worker

-A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Experimental — requires vLLM patches (see below).
+A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely.

 ```mermaid
 ---
@@ -432,19 +250,9 @@ flowchart LR

 **Launch:**

-```bash
-
-cd /opt/dynamo/venv/lib/python3.12/site-packages
-
-curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
-
-curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
-import sys
-chunks = sys.stdin.read().split('diff --git ')
-filtered = [c for c in chunks if c.startswith('a/vllm/')]
-print(''.join('diff --git ' + c for c in filtered))
-" | patch -p1
+<!-- TODO: Add an example of Dynamo+vLLM Agg worker + Embedding Cache -->

+```bash
 vllm serve $model \
    --ec-transfer-config "{
        \"ec_role\": \"ec_both\",
@@ -484,48 +292,7 @@ cd $DYNAMO_HOME/examples/backends/vllm
 bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
 ```

-**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
-
-## NIXL Usage
-
-| Use Case | Script | NIXL Used? | Data Transfer |
-|----------|--------|------------|---------------|
-| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker |
-| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
-| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
-| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
-| EC Both (Local Node) | `vllm_serve_embedding_cache.sh` | No | ECConnector via CPU Embedding Cache |
-
-## ModelInput Types and Registration
-
-Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
-
-| ModelInput Type | Preprocessing | Use Case |
-|-----------------|---------------|----------|
-| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
-| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
-
-**Registration Pattern:**
-
-```python
-# Processor - Entry point from HTTP frontend
-await register_model(
-    ModelInput.Text,        # Frontend sends raw text
-    ModelType.Chat,
-    generate_endpoint,
-    model_name,
-    ...
-)
-
-# Workers - Internal components
-await register_model(
-    ModelInput.Tokens,      # Expect pre-tokenized input
-    ModelType.Chat,         # or ModelType.Prefill for prefill workers
-    generate_endpoint,
-    model_name,
-    ...
-)
-```
+**Client:** Use the same `image_url` request format shown in [Aggregated Serving](#aggregated-serving).

 ## LoRA Adapters on Multimodal Workers

@@ -603,60 +370,6 @@ curl -X POST http://<decode-worker>/load_lora \

 If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.

-## Profiling
-
-Dynamo's multimodal workers include NVTX markers for `nsys` profiling. They are disabled by default (zero overhead) and enabled by setting `DYN_NVTX=1`.
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-DYN_NVTX=1 nsys profile --trace=cuda,nvtx -o profile.nsys-rep \
-    bash launch/agg_multimodal.sh ...
-```
-
-| ENV Variable | Default | Description |
-|---|---|---|
-| `DYN_NVTX` | `0` | Set to `1` to enable NVTX range/mark annotations in encode, prefill, and decode workers for `nsys` profiling |
-
-Key NVTX ranges emitted:
-
-| Range | Worker | Description |
-|-------|--------|-------------|
-| `mm:encode_worker_generate` | Encode | Full encode request lifetime |
-| `mm:enc:cache_check` | Encode | Embedding cache lookup |
-| `mm:enc:image_load` | Encode | Image download/load |
-| `mm:enc:image_preprocess` | Encode | Image processor (CPU) |
-| `mm:enc:vision_encode` | Encode | ViT + projector GPU forward |
-| `mm:enc:embedding_transfer` | Encode | RDMA embedding staging |
-| `mm:pd_worker_generate` | PD | Full PD request lifetime |
-| `mm:pd:ttft` | PD | Worker-side TTFT: from request arrival at the PD worker to first output token (excludes client→frontend→worker network transit) |
-| `mm:pd:load_multimodal` | PD | Fetch embeddings from encode worker |
-| `mm:pd:disagg_prefill` | PD (disagg) | Prefill-only engine call |
-| `mm:pd:disagg_remote_decode` | PD (disagg) | Remote decode round-trip |
-| `mm:decode_worker_generate` | Decode | Full decode request lifetime |
-| `mm:decode:first_token` | Decode | Time to first output token |
-
-## Known Limitations
-
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
-
 ## Supported Models

-The following models have been tested with Dynamo's vLLM multimodal backend:
-
- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
-
-For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
-
-## Key Files
-
-| File | Description |
-|------|-------------|
-| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
-| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
-| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
-| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementations (custom and vLLM-native) |
-| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
+For a list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.
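
The Data URL row under "Supported URL Formats" in the doc above can be exercised with a local file instead of a remote one. The following is a minimal sketch, assuming a shell with the standard `base64` utility; the helper name `to_data_url` and the two-byte stand-in file are hypothetical and not part of the repo.

```bash
# Hypothetical helper: wrap a local file in a data URL for an image_url request.
to_data_url() {
    local mime="$1" path="$2"
    # tr strips the line wraps that some base64 implementations insert.
    printf 'data:%s;base64,%s' "$mime" "$(base64 < "$path" | tr -d '\n')"
}

# Tiny stand-in payload so the sketch runs without a real image.
sample="$(mktemp)"
printf 'hi' > "$sample"
to_data_url image/jpeg "$sample"; echo   # prints data:image/jpeg;base64,aGk=
rm -f "$sample"
```

The resulting string goes in the `"url"` field of an `image_url` entry, in place of the HTTP URL used in the curl examples.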
--- a/examples/backends/vllm/launch/agg_multimodal.sh
+++ b/examples/backends/vllm/launch/agg_multimodal.sh
@@ -2,11 +2,11 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
-# Aggregated multimodal serving with standard Dynamo preprocessing
+# Aggregated multimodal image/video serving with standard Dynamo preprocessing
 #
 # Architecture: Single-worker PD (Prefill-Decode)
-# - Frontend: Rust OpenAIPreprocessor handles image URLs (HTTP and data:// base64)
-# - Worker: Standard vLLM worker with vision model support
+# - Frontend: Rust OpenAIPreprocessor forwards multimodal requests
+# - Worker: Standard vLLM worker with multimodal model support
 #
 # For EPD (Encode-Prefill-Decode) architecture with dedicated encoding worker,
 # see agg_multimodal_epd.sh
@@ -19,7 +19,7 @@ source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
 source "$SCRIPT_DIR/../../../common/launch_utils.sh"

 # Default values
-MODEL_NAME="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
+MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-30B-A3B-Instruct-FP8}"

 # Parse command line arguments
 # Extra arguments are passed through to the vLLM worker
@@ -48,13 +48,41 @@ while [[ $# -gt 0 ]]; do
 done

 HTTP_PORT="${DYN_HTTP_PORT:-8000}"
-print_launch_banner --multimodal "Launching Aggregated Multimodal Serving" "$MODEL_NAME" "$HTTP_PORT"

 # Use TCP transport (instead of default NATS)
 # TCP is preferred for multimodal workloads because it overcomes:
 # - NATS default 1MB max payload limit (multimodal base64 images can exceed this)
 export DYN_REQUEST_PLANE=tcp

+print_launch_banner --no-curl "Launching Aggregated Multimodal Serving" "$MODEL_NAME" "$HTTP_PORT" \
+    "Backend:     dynamo.vllm --enable-multimodal" \
+    "Media:       image_url and video_url (model support dependent)"
+
+print_curl_footer <<CURL
+  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
+    -H 'Content-Type: application/json' \\
+    -d '{
+      "model": "${MODEL_NAME}",
+      "messages": [{"role": "user", "content": [
+        {"type": "text", "text": "Describe the image"},
+        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/300px-PNG_transparency_demonstration_1.png"}}
+      ]}],
+      "max_tokens": 50
+    }'
+
+  # For video-capable models such as Qwen/Qwen3-VL-2B-Instruct:
+  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
+    -H 'Content-Type: application/json' \\
+    -d '{
+      "model": "Qwen/Qwen3-VL-2B-Instruct",
+      "messages": [{"role": "user", "content": [
+        {"type": "text", "text": "Describe the video in detail"},
+        {"type": "video_url", "video_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}}
+      ]}],
+      "max_tokens": 128
+    }'
+CURL
+
 # Start frontend with Rust OpenAIPreprocessor
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python -m dynamo.frontend &
@@ -65,7 +93,7 @@ MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
 MODEL_EXTRA_ARGS=""
 case "$MODEL_NAME" in
    meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8)
-        MAX_MODEL_LEN="${MAX_MODEL_LEN:-108960}"
+        MAX_MODEL_LEN="108960"
        MODEL_EXTRA_ARGS="--tensor-parallel-size=8" ;;
 esac


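With the `MODEL_NAME` change in `agg_multimodal.sh`, the model can come from three places: the built-in default, the `DYN_MODEL_NAME` environment variable, and the `--model` flag parsed afterwards. A minimal sketch of that precedence follows, assuming a much simpler parser than the real script; the function `resolve_model` is hypothetical.

```bash
# Hypothetical stand-in for the script's default + argument parsing.
resolve_model() {
    # ${VAR:-default} falls back when VAR is unset or empty.
    local model="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-30B-A3B-Instruct-FP8}"
    while [ $# -gt 0 ]; do
        case "$1" in
            --model)
                [ $# -ge 2 ] || { echo "--model needs a value" >&2; return 1; }
                model="$2"; shift 2 ;;
            *) shift ;;   # the real script forwards extra args to the worker
        esac
    done
    printf '%s\n' "$model"
}

( unset DYN_MODEL_NAME; resolve_model )                    # built-in default
DYN_MODEL_NAME=Qwen/Qwen3-VL-2B-Instruct resolve_model     # env var wins over default
DYN_MODEL_NAME=Qwen/Qwen3-VL-2B-Instruct resolve_model --model llava-hf/llava-1.5-7b-hf  # flag wins
```

Note that the same commit moves `MAX_MODEL_LEN` for Llama 4 Maverick in the opposite direction: the `${MAX_MODEL_LEN:-108960}` fallback becomes a fixed `108960`, so an exported `MAX_MODEL_LEN` no longer applies to that model.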
--- a/examples/backends/vllm/launch/disagg_multimodal_e_pd.sh
+++ b/examples/backends/vllm/launch/disagg_multimodal_e_pd.sh
@@ -7,6 +7,9 @@ trap 'echo Cleaning up...; kill 0' EXIT
 SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
 source "$SCRIPT_DIR/../../../common/launch_utils.sh"

+# Use TCP transport for multimodal workloads (base64 images can exceed the NATS default 1MB max payload limit)
+export DYN_REQUEST_PLANE=tcp
+
 # Default values
 MODEL_NAME="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
 SINGLE_GPU=false
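
The new comment cites the NATS 1MB payload limit as the reason for `DYN_REQUEST_PLANE=tcp`. Base64 encodes every 3 input bytes as 4 output characters, so an image comfortably under 1 MiB on disk can cross that limit once it is inlined as a data URL. A rough sketch of the arithmetic, assuming the limit is 1048576 bytes and ignoring the rest of the JSON request body; the helper names are hypothetical.

```bash
NATS_MAX_PAYLOAD=1048576  # assumed default limit in bytes (1 MiB)

# Padded base64 length for an input of $1 bytes: 4 * ceil(n / 3).
b64_len() {
    echo $(( ( ($1 + 2) / 3 ) * 4 ))
}

# Succeeds when the encoded size alone is larger than the assumed limit.
exceeds_nats_limit() {
    [ "$(b64_len "$1")" -gt "$NATS_MAX_PAYLOAD" ]
}

for size in 500000 786432 800000; do
    if exceeds_nats_limit "$size"; then
        echo "$size bytes -> $(b64_len "$size") base64 chars: over the limit"
    else
        echo "$size bytes -> $(b64_len "$size") base64 chars: fits"
    fi
done
```

Under these assumptions the crossover sits at 786432 raw bytes (768 KiB), which encodes to exactly 1048576 characters; any surrounding request JSON pushes it over.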

--- a/examples/backends/vllm/launch/disagg_multimodal_epd.sh
+++ b/examples/backends/vllm/launch/disagg_multimodal_epd.sh
@@ -8,6 +8,9 @@ SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
 source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
 source "$SCRIPT_DIR/../../../common/launch_utils.sh"

+# Use TCP transport for multimodal workloads (base64 images can exceed the NATS default 1MB max payload limit)
+export DYN_REQUEST_PLANE=tcp
+
 # Default values
 MODEL_NAME="llava-hf/llava-1.5-7b-hf"

@@ -17,7 +20,7 @@ MODEL_NAME="llava-hf/llava-1.5-7b-hf"
 #   - Enabling --enforce-eager (disables torch.compile and CUDA graph capture)
 #   - Hardcoding P/D KV cache to 512 MB (skips all memory profiling)
 #   - Limiting --max-model-len to 4096 tokens on P/D workers
-#   - Limiting P/D workers to image=1,video=0,audio=0 (--limit-mm-per-prompt)
+#   - Limiting P/D workers to image=3,video=3,audio=0 (--limit-mm-per-prompt)
 #   - Using lower gpu-memory-utilization fractions to share the GPU
 SINGLE_GPU=false

@@ -77,10 +80,17 @@ python -m dynamo.frontend &
 EXTRA_ARGS=""
 PD_EXTRA_ARGS=""

-# GPU assignments (override via environment variables)
-DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
-DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-1}
-DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-2}
+# GPU assignments (override via environment variables).
+# In single-GPU mode all 3 workers default to GPU 0.
+if [[ "$SINGLE_GPU" == "true" ]]; then
+    DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
+    DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-0}
+    DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-0}
+else
+    DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
+    DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-1}
+    DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-2}
+fi

 # GPU memory utilization for workers.
 # NOTE: --kv-cache-memory-bytes (set below for P/D workers) overrides
@@ -93,9 +103,15 @@ if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
    echo "WARNING: _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is set but has no effect here because" >&2
    echo "  --kv-cache-memory-bytes overrides --gpu-memory-utilization in vLLM." >&2
 fi
-DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
-DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.9}
-DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.9}
+if [[ "$SINGLE_GPU" == "true" ]]; then
+    DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.1}
+    DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.4}
+    DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.4}
+else
+    DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
+    DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.9}
+    DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.9}
+fi

 # 512 MB KV cache per P/D worker. Setting --kv-cache-memory-bytes bypasses vLLM's
 # memory profiling entirely (both language model and multimodal encoder), which avoids
@@ -105,7 +121,7 @@ PD_KV_CACHE_BYTES=$((512 * 1024 * 1024))

 if [[ "$SINGLE_GPU" == "true" ]]; then
    EXTRA_ARGS="--enforce-eager"
-    PD_EXTRA_ARGS="--max-model-len 4096 --kv-cache-memory-bytes $PD_KV_CACHE_BYTES --limit-mm-per-prompt {\"image\":1,\"video\":0,\"audio\":0}"
+    PD_EXTRA_ARGS="--max-model-len 4096 --kv-cache-memory-bytes $PD_KV_CACHE_BYTES --limit-mm-per-prompt {\"image\":3,\"video\":3,\"audio\":0}"
 fi

 # Start encode worker
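
The single-GPU branches added in this file all follow one pattern: choose a default per mode, then let an environment variable override it. A condensed sketch of that pattern using values from this diff; the function `pick_default` is hypothetical and does not exist in the script, which spells out each `${VAR:-default}` assignment instead.

```bash
# pick_default <env_value> <single_gpu_default> <multi_gpu_default>
pick_default() {
    if [ -n "$1" ]; then
        echo "$1"                 # explicit override wins
    elif [ "$SINGLE_GPU" = "true" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

SINGLE_GPU=true
DYN_PREFILL_WORKER_GPU=$(pick_default "${DYN_PREFILL_WORKER_GPU:-}" 0 1)
DYN_ENCODE_GPU_MEM=$(pick_default "${DYN_ENCODE_GPU_MEM:-}" 0.1 0.9)
PD_KV_CACHE_BYTES=$((512 * 1024 * 1024))   # 536870912 bytes, as in the script

echo "prefill GPU: $DYN_PREFILL_WORKER_GPU, encode mem: $DYN_ENCODE_GPU_MEM, KV bytes: $PD_KV_CACHE_BYTES"
```

The single-GPU memory defaults in the diff (0.1 for encode, 0.4 each for prefill and decode) add up to 0.9 of one device, whereas the multi-GPU defaults give each worker 0.9 of its own device.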

--- a/examples/backends/vllm/launch/disagg_multimodal_llama.sh
+++ b/examples/backends/vllm/launch/disagg_multimodal_llama.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-set -ex
-
-SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
-source "$SCRIPT_DIR/../../../common/launch_utils.sh"
-
-# Default values
-HEAD_NODE=0
-MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
-EXTRA_ARGS=()
-
-# Parse command line arguments
-while [[ $# -gt 0 ]]; do
-    case $1 in
-        --head-node)
-            HEAD_NODE=1
-            shift 1
-            ;;
-        --model)
-            MODEL_NAME=$2
-            shift 2
-            ;;
-        -h|--help)
-            echo "Usage: $0 [OPTIONS]"
-            echo ""
-            echo "Disaggregated multimodal serving with separate Prefill/Decode workers for Llama 4"
-            echo ""
-            echo "Options:"
-            echo "  --head-node          Run as head node. Head node will run the HTTP server, processor and prefill worker."
-            echo "  --model <model_name> Specify the VLM model to use (default: $MODEL_NAME)"
-            echo "  -h, --help           Show this help message"
-            echo ""
-            echo "Examples:"
-            echo "  # On head node:"
-            echo "  $0 --head-node"
-            echo ""
-            echo "  # On worker node (requires NATS_SERVER and ETCD_ENDPOINTS pointing to head node):"
-            echo "  $0"
-            echo ""
-            exit 0
-            ;;
-        *)
-            EXTRA_ARGS+=("$1")
-            shift
-            ;;
-    esac
-done
-
-trap 'echo Cleaning up...; kill 0' EXIT
-
-HTTP_PORT="${DYN_HTTP_PORT:-8000}"
-if [[ $HEAD_NODE -eq 1 ]]; then
-    print_launch_banner --multimodal "Launching Disaggregated Multimodal Llama 4 (Multi-Node)" "$MODEL_NAME" "$HTTP_PORT"
-else
-    print_launch_banner --no-curl "Launching Disaggregated Multimodal Llama 4 (Multi-Node)" "$MODEL_NAME" "$HTTP_PORT"
-fi
-
-# Use TCP transport to avoid NATS payload limits for multimodal
-export DYN_REQUEST_PLANE=tcp
-
-# Configure model-specific args
-GPU_MEM="0.80"
-KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
-if [[ -n "$KV_BYTES" ]]; then
-    GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
-else
-    GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM"
-fi
-MODEL_SPECIFIC_ARGS=""
-if [[ "$MODEL_NAME" == "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" ]]; then
-    MODEL_SPECIFIC_ARGS="--tensor-parallel-size=8 --max-model-len=208960 $GPU_MEM_ARGS"
-fi
-
-if [[ $HEAD_NODE -eq 1 ]]; then
-    # run ingress
-    # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
-    python -m dynamo.frontend &
-
-    # run processor (CPU-only to avoid competing for GPU memory with workers)
-    CUDA_VISIBLE_DEVICES="" \
-    python -m dynamo.vllm --route-to-encoder --enable-multimodal --model $MODEL_NAME &
-
-    # Prefill worker handles prompt processing and image encoding
-    # Uses all 8 GPUs for tensor-parallel
-    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-    VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
-    python -m dynamo.vllm \
-        --enable-multimodal \
-        --model $MODEL_NAME \
-        --disaggregation-mode prefill \
-        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
-        $MODEL_SPECIFIC_ARGS \
-        --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080"}' \
-        "${EXTRA_ARGS[@]}" &
-else
-    # run decode worker on non-head node
-    # Uses all 8 GPUs for tensor-parallel
-    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-    VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
-    python -m dynamo.vllm \
-        --enable-multimodal \
-        --model $MODEL_NAME \
-        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
-        $MODEL_SPECIFIC_ARGS \
-        --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081"}' \
-        "${EXTRA_ARGS[@]}" &
-fi
-
-# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
-wait_any_exit
--- a/examples/backends/vllm/launch/video_agg.sh
+++ b/examples/backends/vllm/launch/video_agg.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Aggregated video serving with standard Dynamo preprocessing and vLLM backend.
-
-set -euo pipefail
-
-cleanup() {
-    echo "Cleaning up..."
-    local pids
-    pids="$(jobs -pr)"
-    if [[ -n "$pids" ]]; then
-        kill $pids 2>/dev/null || true
-    fi
-}
-
-trap cleanup EXIT
-
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
-source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
-source "$SCRIPT_DIR/../../../common/launch_utils.sh"
-
-export PYTHONPATH="${REPO_ROOT}/components/src:${REPO_ROOT}/lib/bindings/python/src${PYTHONPATH:+:${PYTHONPATH}}"
-
-MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
-HTTP_PORT="${DYN_HTTP_PORT:-8000}"
-GPU_DEVICE="${CUDA_VISIBLE_DEVICES:-0}"
-MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
-MAX_NUM_SEQS="${MAX_NUM_SEQS:-2}"
-
-EXTRA_ARGS=()
-while [[ $# -gt 0 ]]; do
-    case $1 in
-        --model)
-            MODEL_NAME=$2
-            shift 2
-            ;;
-        -h|--help)
-            cat <<USAGE
-Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS]
-
-Options:
-  --model <model_name>   Video-capable VLM to serve (default: $MODEL_NAME)
-  -h, --help             Show this help message
-
-Any arguments after '--' are passed through to the vLLM worker.
-USAGE
-            exit 0
-            ;;
-        --)
-            shift
-            EXTRA_ARGS+=("$@")
-            break
-            ;;
-        *)
-            EXTRA_ARGS+=("$1")
-            shift
-            ;;
-    esac
-done
-
-export DYN_REQUEST_PLANE=tcp
-
-GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
-
-print_launch_banner --no-curl "Launching Aggregated Video Serving" "$MODEL_NAME" "$HTTP_PORT" \
-    "Backend:     dynamo.vllm --enable-multimodal" \
-    "Video path:  Standard TokensPrompt multi_modal_data flow"
-
-print_curl_footer <<CURL
-  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
-    -H 'Content-Type: application/json' \\
-    -d '{
-      "model": "${MODEL_NAME}",
-      "messages": [{"role": "user", "content": [
-        {"type": "text", "text": "Describe the video in detail"},
-        {"type": "video_url", "video_url": {"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"}}
-      ]}],
-      "max_tokens": 128
-    }'
-CURL
-
-python -m dynamo.frontend &
-
-CUDA_VISIBLE_DEVICES="$GPU_DEVICE" \
-    python -m dynamo.vllm \
-        --enable-multimodal \
-        --model "$MODEL_NAME" \
-        --max-model-len "$MAX_MODEL_LEN" \
-        --max-num-seqs "$MAX_NUM_SEQS" \
-        $GPU_MEM_ARGS \
-        "${EXTRA_ARGS[@]}" &
-
-wait_any_exit
--- a/examples/backends/vllm/launch/video_disagg.sh
+++ b/examples/backends/vllm/launch/video_disagg.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Disaggregated video serving with standard Dynamo preprocessing and vLLM backend.
-
-set -euo pipefail
-
-cleanup() {
-    echo "Cleaning up..."
-    local pids
-    pids="$(jobs -pr)"
-    if [[ -n "$pids" ]]; then
-        kill $pids 2>/dev/null || true
-    fi
-}
-
-trap cleanup EXIT
-
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
-source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
-source "$SCRIPT_DIR/../../../common/launch_utils.sh"
-
-export PYTHONPATH="${REPO_ROOT}/components/src:${REPO_ROOT}/lib/bindings/python/src${PYTHONPATH:+:${PYTHONPATH}}"
-
-MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
-HTTP_PORT="${DYN_HTTP_PORT:-8000}"
-SINGLE_GPU=false
-EXTRA_ARGS=()
-
-while [[ $# -gt 0 ]]; do
-    case $1 in
-        --model)
-            MODEL_NAME=$2
-            shift 2
-            ;;
-        --single-gpu)
-            SINGLE_GPU=true
-            shift
-            ;;
-        -h|--help)
-            cat <<USAGE
-Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS]
-
-Options:
-  --model <model_name>   Video-capable VLM to serve (default: $MODEL_NAME)
-  --single-gpu           Run prefill and decode on one GPU for functional testing
-  -h, --help             Show this help message
-
-Any arguments after '--' are passed through to both vLLM workers.
-USAGE
-            exit 0
-            ;;
-        --)
-            shift
-            EXTRA_ARGS+=("$@")
-            break
-            ;;
-        *)
-            EXTRA_ARGS+=("$1")
-            shift
-            ;;
-    esac
-done
-
-export DYN_REQUEST_PLANE=tcp
-
-if [[ "$SINGLE_GPU" == "true" ]]; then
-    GPU_LABEL="1 GPU"
-    PREFILL_GPU="${DYN_PREFILL_WORKER_GPU:-${CUDA_VISIBLE_DEVICES:-0}}"
-    DECODE_GPU="${DYN_DECODE_WORKER_GPU:-${CUDA_VISIBLE_DEVICES:-0}}"
-    MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
-    PD_KV_CACHE_BYTES=$((512 * 1024 * 1024))
-    SHARED_GPU_FRACTION=$(build_gpu_mem_args vllm --workers-per-gpu 2)
-    PREFILL_GPU_MEM="${DYN_PREFILL_GPU_MEM:-${SHARED_GPU_FRACTION:-0.45}}"
-    DECODE_GPU_MEM="${DYN_DECODE_GPU_MEM:-${SHARED_GPU_FRACTION:-0.45}}"
-    SHARED_ARGS=(
-        --enforce-eager
-        --max-model-len "$MAX_MODEL_LEN"
-        --kv-cache-memory-bytes "$PD_KV_CACHE_BYTES"
-        --limit-mm-per-prompt '{"image":1,"video":1,"audio":0}'
-    )
-else
-    GPU_LABEL="2 GPUs"
-    PREFILL_GPU="${DYN_PREFILL_WORKER_GPU:-0}"
-    DECODE_GPU="${DYN_DECODE_WORKER_GPU:-1}"
-    MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
-    GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
-    PREFILL_GPU_MEM="${DYN_PREFILL_GPU_MEM:-${GPU_MEM_ARGS:-0.9}}"
-    DECODE_GPU_MEM="${DYN_DECODE_GPU_MEM:-${GPU_MEM_ARGS:-0.9}}"
-    SHARED_ARGS=(--max-model-len "$MAX_MODEL_LEN")
-fi
-
-print_launch_banner --no-curl "Launching Disaggregated Video Serving ($GPU_LABEL)" "$MODEL_NAME" "$HTTP_PORT" \
-    "Backend:     Prefill + decode workers via dynamo.vllm" \
-    "Video path:  Standard TokensPrompt multi_modal_data flow"
-
-print_curl_footer <<CURL
-  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
-    -H 'Content-Type: application/json' \\
-    -d '{
-      "model": "${MODEL_NAME}",
-      "messages": [{"role": "user", "content": [
-        {"type": "text", "text": "Describe the video in detail"},
-        {"type": "video_url", "video_url": {"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"}}
-      ]}],
-      "max_tokens": 128
-    }'
-CURL
-
-python -m dynamo.frontend &
-
-VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
-CUDA_VISIBLE_DEVICES="$PREFILL_GPU" \
-    python -m dynamo.vllm \
-        --disaggregation-mode prefill \
-        --enable-multimodal \
-        --model "$MODEL_NAME" \
-        --gpu-memory-utilization "$PREFILL_GPU_MEM" \
-        "${SHARED_ARGS[@]}" \
-        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
-        --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081"}' \
-        "${EXTRA_ARGS[@]}" &
-
-VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
-CUDA_VISIBLE_DEVICES="$DECODE_GPU" \
-    python -m dynamo.vllm \
-        --disaggregation-mode decode \
-        --enable-multimodal \
-        --model "$MODEL_NAME" \
-        --gpu-memory-utilization "$DECODE_GPU_MEM" \
-        "${SHARED_ARGS[@]}" \
-        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
-        --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082"}' \
-        "${EXTRA_ARGS[@]}" &
-
-wait_any_exit
--- a/tests/serve/test_vllm.py
+++ b/tests/serve/test_vllm.py
@@ -428,14 +428,6 @@ vllm_configs = {
        model="Qwen/Qwen3-VL-2B-Instruct",
        script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
        timeout=300,
-        env={
-            "DYN_ENCODE_WORKER_GPU": "0",
-            "DYN_PREFILL_WORKER_GPU": "0",
-            "DYN_DECODE_WORKER_GPU": "0",
-            "DYN_ENCODE_GPU_MEM": "0.1",
-            "DYN_PREFILL_GPU_MEM": "0.4",
-            "DYN_DECODE_GPU_MEM": "0.4",
-        },
        request_payloads=[
            chat_payload(
                [
@@ -536,11 +528,11 @@ vllm_configs = {
            ),
        ],
    ),
-    # Video multimodal tests for CI using the vLLM video launch scripts.
+    # Video multimodal tests for CI use the canonical aggregated multimodal launcher.
    "multimodal_video_agg": VLLMConfig(
        name="multimodal_video_agg",
        directory=vllm_dir,
-        script_name="video_agg.sh",
+        script_name="agg_multimodal.sh",
        marks=[
            pytest.mark.gpu_1,
            pytest.mark.pre_merge,
@@ -568,7 +560,7 @@ vllm_configs = {
    "multimodal_video_disagg": VLLMConfig(
        name="multimodal_video_disagg",
        directory=vllm_dir,
-        script_name="video_disagg.sh",
+        script_name="disagg_multimodal_epd.sh",
        marks=[
            pytest.mark.gpu_1,
            pytest.mark.pre_merge,