docs: consolidate multimodal docs (#4842)

Signed-off-by: Neal Vaidya <nealv@nvidia.com>

Unverified Commit 3c4b3069 authored Dec 15, 2025 by Neal Vaidya Committed by GitHub Dec 16, 2025
17 changed files
--- a/docs/api/nixl_connect/README.md
+++ b/docs/api/nixl_connect/README.md
@@ -103,7 +103,7 @@ flowchart LR

 ### Multimodal Example

-In the case of the [Dynamo Multimodal Disaggregated Example](../../backends/vllm/multimodal.md):
+In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):

 1. The HTTP frontend accepts a text prompt and a URL to an image.


--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -38,7 +38,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
 | [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ |  |
 | [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ |  |
-| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ |  |
+| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ |  |
 | [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |


@@ -276,7 +276,7 @@ Below we provide a selected list of advanced examples. Please open up an issue i
 - **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**

 ### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
-- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**
+- **[Run a multimodal model with EPD Disaggregation](../../multimodal/sglang.md)**

 ## Deployment


--- a/docs/backends/sglang/multimodal_epd.md
+++ b/docs/backends/sglang/multimodal_epd.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Encode-Prefill-Decode (EPD) Flow with NIXL
-
-For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
-
-## Use the Latest Release
-
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-
-## Multimodal Aggregated Serving
-
-### Components
-
-- workers: For aggregated serving, we have two workers, [MultimodalEncodeWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding and [MultimodalWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
-- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
-
-### Workflow
-
-
-The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA.
-The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-The MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
-By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
-MultimodalEncodeWorker independently from the prefill and decode workers if needed.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> encode_worker
-  encode_worker --> processor
-  encode_worker --embeddings descriptor--> worker
-  worker --> encode_worker
-```
-
-```bash
-cd $DYNAMO_HOME/examples/backends/sglang
-./launch/multimodal_agg.sh
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
-    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
-    "messages": [
-      {
-        "role": "user",
-        "content": [
-          {
-            "type": "text",
-            "text": "Describe the image."
-          },
-          {
-            "type": "image_url",
-            "image_url": {
-              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
-            }
-          }
-        ]
-      }
-    ],
-    "max_tokens": 50,
-    "stream": false
-  }' | jq
-```
-
-You should see a response similar to this:
-```json
-{
-  "id": "chatcmpl-2546f44756884a14916ce13ebaa09da8",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "content": "This image shows a public transit bus on a dimly lit, street-level track in what appears to be a quiet urban neighborhood or suburban area. The bus displays \"OUT OF SERVICE\" in red on its illuminated sign. It is positioned",
-        "role": "assistant",
-        "reasoning_content": null
-      },
-      "finish_reason": "length"
-    }
-  ],
-  "created": 1758824222,
-  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
-  "object": "chat.completion",
-  "usage": {
-    "prompt_tokens": 0,
-    "completion_tokens": 40,
-    "total_tokens": 40
-  }
-}
-```
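The `curl` request above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of Dynamo, and the sketch assumes the frontend is listening on `localhost:8000` as in the `curl` example.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, image_url: str, max_tokens: int = 50) -> dict:
    """Build an OpenAI-style chat completion body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }


def send_chat_request(body: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the body to /v1/chat/completions and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build the same payload as the curl example; sending it requires a running deployment.
body = build_chat_request(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    prompt="Describe the image.",
    image_url="http://images.cocodataset.org/test2017/000000155781.jpg",
)
print(json.dumps(body, indent=2))
```

Calling `send_chat_request(body)` against a running deployment should return a dictionary shaped like the JSON response shown above.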
-
-## Multimodal Disaggregated Serving
-
-### Components
-
-- workers: For disaggregated serving, we have three workers, [MultimodalEncodeWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding, [MultimodalWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding, and [MultimodalPrefillWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling.
-- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
-
-### Workflow
-
-For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing.
-The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> encode_worker
-  encode_worker --> processor
-  encode_worker --embeddings descriptor--> worker
-  worker --> encode_worker
-  worker --embeddings descriptor--> prefill_worker
-  prefill_worker --> worker
-```
-
-
-```bash
-cd $DYNAMO_HOME/examples/backends/sglang
-./launch/multimodal_disagg.sh
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
-    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
-    "messages": [
-      {
-        "role": "user",
-        "content": [
-          {
-            "type": "text",
-            "text": "Describe the image."
-          },
-          {
-            "type": "image_url",
-            "image_url": {
-              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
-            }
-          }
-        ]
-      }
-    ],
-    "max_tokens": 50,
-    "stream": false
-  }' | jq
-```
-
-You should see a response similar to this:
-```json
-{
-  "id": "chatcmpl-2546f44756884a14916ce13ebaa09da8",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "content": "This image shows a public transit bus on a dimly lit, street-level track in what appears to be a quiet urban neighborhood or suburban area. The bus displays \"OUT OF SERVICE\" in red on its illuminated sign. It is positioned",
-        "role": "assistant",
-        "reasoning_content": null
-      },
-      "finish_reason": "length"
-    }
-  ],
-  "created": 1758824222,
-  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
-  "object": "chat.completion",
-  "usage": {
-    "prompt_tokens": 0,
-    "completion_tokens": 40,
-    "total_tokens": 40
-  }
-}
-```
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -240,7 +240,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t

 ## Multimodal support

-Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).

 ## Logits Processing


--- a/docs/backends/trtllm/multimodal_trtllm_guide.md
+++ b/docs/backends/trtllm/multimodal_trtllm_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# TRT-LLM Multimodal Guide
-
-This document is a guide to multimodal inference using the TensorRT-LLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_support.md).
-
-## Multimodal Support Matrix
-
-| Modality | Input Format | Aggregated | Disaggregated | Notes |
-|----------|--------------|------------|---------------|-------|
-| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
-| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
-| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
-| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
-
-## Architecture Comparison
-
-TRT-LLM multimodal supports three deployment patterns:
-
-```text
-SIMPLE AGGREGATED (agg.sh):
-  Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
-  • 2 components • worker flag `--modality multimodal` • Easiest setup
-
-DISAGGREGATED P->D (disagg_multimodal.sh):
-  Client → Frontend → Prefill [image load, encode] → Decode → Response
-  • 3 components • worker flag `--disaggregation-mode prefill/decode` • Multi-GPU, KV transfer
-
-EPD DISAGGREGATED - WIP:
-  Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response
-  • 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #4668
-```
-
-## Input Format Details
-
-### Supported URL Formats
-
-| Format | Example | Description | Support |
-|--------|---------|-------------|---------|
-| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
-| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) | ✅ |
-
-## Simple Aggregated Mode (PD)
-
-In aggregated mode, all processing (image loading, encoding, prefill, decode) happens within a single worker.
-
-### Architecture
-
-```text
-HTTP Frontend (Rust)
-    ↓
-TRT-LLM Worker (Python - ModelInput.Tokens)
-    ↓ downloads media, encodes, prefill + decode
-Response
-```
-
-### Components
-
-| Component | Flag | ModelInput | Registered | Purpose |
-|-----------|------|-----------|------------|---------|
-| Worker | `--modality multimodal` | Tokens | Yes | Complete inference pipeline |
-
-### Launch Script
-
-Example: [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh)
-
-## Disaggregated Mode (P->D)
-
-In disaggregated mode, prefill and decode are handled by separate workers. The prefill worker handles image loading and encoding internally.
-
-### Architecture
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Prefill Worker (Python - ModelInput.Tokens)
-    ↓ downloads media, encodes, prefill, KV cache transfer
-Decode Worker (Python - ModelInput.Tokens)
-    ↓ decode only, token generation
-Response
-```
-
-### Components
-
-| Component | Flag | ModelInput | Registered | Purpose |
-|-----------|------|-----------|------------|---------|
-| Prefill Worker | `--disaggregation-mode prefill` | Tokens | Yes | Image processing + Prefill |
-| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
-
-### Launch Script
-
-Example: [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)
-
-## Pre-computed Embeddings
-
-TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing.
-
-### Supported File Types
-
-- `.pt` - PyTorch tensor files
-- `.pth` - PyTorch checkpoint files
-- `.bin` - Binary tensor files
-
-### Embedding File Formats
-
-TRT-LLM supports two formats for embedding files:
-
-#### 1. Simple Tensor Format
-
-- Direct tensor saved as `.pt` file
-- Example: `llava_next_mm_embed_seashore.pt`
-- Contains only the embedding tensor
-
-```python
-# Example: Simple tensor format
-embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
-torch.save(embedding_tensor, "embedding.pt")
-```
-
-#### 2. Dictionary Format with Auxiliary Data
-
-- Dictionary containing multiple keys
-- Used by models like Llama-4 that require additional metadata
-- Must contain `mm_embeddings` key with the main tensor
-- Can include auxiliary data like special tokens, offsets, etc.
-
-```python
-# Example: Dictionary format (Llama-4 style)
-embedding_dict = {
-    "mm_embeddings": torch.rand(1, 576, 4096),
-    "special_tokens": [128256, 128257],
-    "image_token_offsets": [[0, 576]],
-    # ... other model-specific metadata
-}
-torch.save(embedding_dict, "llama4_embedding.pt")
-```
-
-**How They're Used:**
-- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
-- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately
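The handling of the two formats described above can be sketched as follows. This is an illustration only, not the code used by the TRT-LLM worker: `split_embedding_payload` is a hypothetical helper, and plain Python lists stand in for tensors so the sketch runs without PyTorch. In practice the `loaded` argument would typically be the result of `torch.load(path)`.

```python
from typing import Any, Dict, Tuple


def split_embedding_payload(loaded: Any) -> Tuple[Any, Dict[str, Any]]:
    """Return (main_tensor, auxiliary_data) for either embedding file format."""
    if isinstance(loaded, dict):
        # Dictionary format: the main tensor lives under the required key.
        if "mm_embeddings" not in loaded:
            raise KeyError("dictionary-format embedding file must contain 'mm_embeddings'")
        auxiliary = {key: value for key, value in loaded.items() if key != "mm_embeddings"}
        return loaded["mm_embeddings"], auxiliary
    # Simple tensor format: the loaded object is the embedding itself.
    return loaded, {}


# Placeholder values standing in for tensors (assumption for illustration).
simple_main, simple_aux = split_embedding_payload([[0.1, 0.2], [0.3, 0.4]])
dict_main, dict_aux = split_embedding_payload(
    {"mm_embeddings": [[0.1, 0.2]], "special_tokens": [128256, 128257]}
)
print(simple_aux)  # {}
print(dict_aux)    # {'special_tokens': [128256, 128257]}
```

Keeping the auxiliary keys in a separate dictionary mirrors the description above, where they are preserved and transferred separately from the main tensor.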
-
-### Launch Script
-
-Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh)
-
-### Security Considerations
-
-For EPD mode with local embedding files:
-
-- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`)
-- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`)
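To make the intent of these two flags concrete, here is a minimal sketch of the kind of check they imply. `validate_local_media` is a hypothetical helper written for this illustration; it is not Dynamo's implementation, and the actual validation logic may differ.

```python
import os
from pathlib import Path


def validate_local_media(path: str, allowed_dir: str = "/tmp", max_file_size_mb: int = 50) -> Path:
    """Resolve `path` and reject it if it escapes `allowed_dir` or exceeds the size limit."""
    resolved = Path(path).resolve()
    allowed = Path(allowed_dir).resolve()
    # resolve() normalizes ".." and follows symlinks, so traversal attempts are caught here.
    if allowed not in resolved.parents:
        raise PermissionError(f"{resolved} is outside the allowed directory {allowed}")
    if not resolved.is_file():
        raise FileNotFoundError(f"{resolved} is not a regular file")
    size_bytes = os.path.getsize(resolved)
    if size_bytes > max_file_size_mb * 1024 * 1024:
        raise ValueError(f"{resolved} is {size_bytes} bytes, over the {max_file_size_mb} MB limit")
    return resolved
```

Resolving the path before comparing it with the allowed directory is the important step: a request for `/tmp/../etc/passwd` resolves to a location outside `/tmp` and is rejected.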
-
-## EPD Disaggregated Mode (E->P->D) - WIP
-
-**Status:** Work In Progress (WIP PR #4668) - Full EPD flow with MultimodalEncoder
-
-In EPD mode, encoding, prefill, and decode are handled by separate workers. The encode worker uses TensorRT-LLM's `MultimodalEncoder` to process images and transfer embeddings via disaggregated parameters.
-
-### Architecture
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Encode Worker (Python - NOT registered, uses MultimodalEncoder)
-    ↓ downloads image, encodes with vision model, transfers via disaggregated_params
-Prefill Worker (Python - ModelInput.Tokens)
-    ↓ receives embeddings via disaggregated_params, prefill only, KV cache transfer
-Decode Worker (Python - ModelInput.Tokens)
-    ↓ decode only, token generation
-Response
-```
-
-**Note (WIP):** The encode worker uses `MultimodalEncoder` from TensorRT-LLM to actually encode images, not just load pre-computed embeddings. This is a significant change from the legacy NIXL-based embedding transfer.
-
-### Components
-
-| Component | Flag | ModelInput | Registered | Purpose |
-|-----------|------|-----------|------------|---------|
-| Encode Worker | `--disaggregation-mode encode` | N/A | No | Image encoding with MultimodalEncoder |
-| Prefill Worker | `--disaggregation-mode prefill --encode-endpoint` | Tokens | Yes | Prefill only |
-| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
-
-
-## ModelInput Types and Registration
-
-### Understanding ModelInput
-
-TRT-LLM workers register with Dynamo using:
-
-| ModelInput Type | Preprocessing | Use Case |
-|-----------------|---------------|----------|
-| `ModelInput.Tokens` | Rust SDK tokenizes text (bypassed for multimodal) | All TRT-LLM workers |
-
-### Component Registration Pattern
-
-```python
-# TRT-LLM Worker - Register with Tokens
-await register_llm(
-    ModelInput.Tokens,      # Rust does minimal preprocessing
-    model_type,             # ModelType.Chat or ModelType.Prefill
-    generate_endpoint,
-    model_name,
-    ...
-)
-```
-
-## Inter-Component Communication
-
-| Transfer Stage | Message      | NIXL Transfer |
-|----------------|--------------|---------------|
-| **Frontend → Prefill** | Request with image URL or embedding path | No |
-| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
-| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
-| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) |
-
-
-## NIXL Usage
-
-| Use Case | Script | NIXL Used? | Data Transfer |
-|----------|--------|------------|---------------|
-| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker |
-| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) |
-| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
-| E->P->D Disaggregated (WIP) | X | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)<br>Prefill → Decode (KV cache via UCX/NIXL) |
-
-**Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
-
-
-## Key Files
-
-| File | Description |
-|------|-------------|
-| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
-| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
-| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
-| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
-| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
-
-## Known Limitations
-
-- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
-- **No video support** - No video encoder implementation
-- **No audio support** - No audio encoder implementation
-- **No Rust preprocessing** - All preprocessing happens in Python workers
-- **E->P->D mode is WIP** - Full EPD with image URLs under development
-
-## Supported Models
-
-Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
-
-Common examples:
-- Llama 4 Vision models (Maverick, Scout)
-- Qwen2-VL models
-- Other vision-language models with TRT-LLM support
-
--- a/docs/backends/trtllm/multinode/multinode-multimodal-example.md
+++ b/docs/backends/trtllm/multinode/multinode-multimodal-example.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Example: Multi-node TRTLLM Workers with Dynamo on Slurm for multimodal models
-
-> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
-
-This guide demonstrates how to deploy large multimodal models that require a multi-node setup. It builds on the general multi-node deployment process described in the main [multinode-examples.md](./multinode-examples.md) guide.
-
-Before you begin, ensure you have completed the initial environment configuration by following the **Setup** section in that guide.
-
-The following sections provide specific instructions for deploying `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, including environment variable setup and launch commands. These steps can be adapted for other large multimodal models.
-
-## Environment Variable Setup
-
-Assuming you have already allocated your nodes via `salloc`, and are
-inside an interactive shell on one of the allocated nodes, set the
-following environment variables:
-```bash
-# NOTE: IMAGE must be set manually for now
-# To build an image, see the steps here:
-# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
-export IMAGE="<dynamo_trtllm_image>"
-
-# MOUNTS are the host:container path pairs that are mounted into the containers
-# launched by each `srun` command.
-#
-# If you want to reference files, such as $MODEL_PATH below, in a
-# different location, you can customize MOUNTS or specify additional
-# comma-separated mount pairs here.
-#
-# NOTE: Currently, this example assumes that the local bash scripts and configs
-# referenced are mounted into /mnt inside the container. If you want to
-# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
-# accordingly for the new locations of `start_frontend_services.sh` and
-# `start_trtllm_worker.sh`.
-#
-# For example, assuming your cluster had a `/lustre` directory on the host, you
-# could add that as a mount like so:
-#
-# export MOUNTS="${PWD}/../../../../:/mnt,/lustre:/lustre"
-export MOUNTS="${PWD}/../../../../:/mnt"
-
-# Can point to the local filesystem as well
-# export MODEL_PATH="/location/to/model"
-export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-
-# The name the model will be served/queried under, matching what's
-# returned by the /v1/models endpoint.
-#
-# By default this is inferred from MODEL_PATH, but when using locally downloaded
-# model weights, it can be nice to have explicit control over the name.
-export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-
-export MODALITY=${MODALITY:-"multimodal"}
-```
-
-## Disaggregated Mode
-
-Assuming you have at least four 4xGB200 nodes allocated (2 for prefill, 2 for decode)
-following the setup above, follow the steps below to launch a **disaggregated**
-deployment across 4 nodes:
-
-> [!Tip]
-> Make sure you have a fresh environment and that the aggregated
-> example is no longer deployed on the same set of nodes.
-
-```bash
-# Defaults set in srun_disaggregated.sh, but can customize here.
-# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
-# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
-
-# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
-# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
-# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
-# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
-# GPUs necessary to satisfy the requested parallelism in each config.
-# export NUM_PREFILL_NODES=2
-# export NUM_DECODE_NODES=2
-
-# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
-# export NUM_GPUS_PER_NODE=4
-
-# Launches:
-# - frontend + etcd/nats on current (head) node.
-# - one large prefill trtllm worker across multiple nodes via MPI tasks
-# - one large decode trtllm worker across multiple nodes via MPI tasks
-./srun_disaggregated.sh
-```
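As a rough sanity check for the node counts described in the comments above, the following sketch computes how many nodes a worker needs from its parallelism settings. The function name and the `tensor_parallel` and `pipeline_parallel` parameters are illustrative assumptions; they are not read from the engine config files.

```python
def nodes_needed(tensor_parallel: int, pipeline_parallel: int, gpus_per_node: int) -> int:
    """Return the node count whose total GPUs exactly match the requested parallelism."""
    total_gpus = tensor_parallel * pipeline_parallel
    if total_gpus % gpus_per_node != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split evenly across nodes with {gpus_per_node} GPUs each"
        )
    return total_gpus // gpus_per_node


# TP8 on GB200 nodes with 4 GPUs each needs 2 nodes per worker.
print(nodes_needed(tensor_parallel=8, pipeline_parallel=1, gpus_per_node=4))  # 2
```

With these values, `NUM_PREFILL_NODES=2` and `NUM_DECODE_NODES=2` give 8 GPUs to each worker, which matches the commented defaults above.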
-
-## Understanding the Output
-
-1. The `srun_disaggregated.sh` script launches three srun jobs instead of two: one for the frontend, one for the prefill worker, and one for the decode worker.
-
-2. The OpenAI frontend will listen for and dynamically discover workers as
-   they register themselves with Dynamo's distributed runtime:
-   ```
-   0: 2025-06-13T02:36:48.160Z  INFO dynamo_run::input::http: Watching for remote model at models
-   0: 2025-06-13T02:36:48.161Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
-   ```
-3. The TRTLLM worker will consist of N (N=8 for TP8) MPI ranks, 1 rank on each
-   GPU on each node, which will each output their progress while loading the model.
-   You can see each rank's output prefixed with the rank at the start of each log line
-   until the model successfully finishes loading:
-    ```
-     7: rank7 run mgmn worker node with mpi_world_size: 8 ...
-    ```
-4. After the model fully finishes loading on all ranks, the worker will register itself,
-   and the OpenAI frontend will detect it, signaled by this output:
-    ```
-    0: 2025-06-13T02:46:35.040Z  INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-    ```
-5. At this point, with the worker fully initialized and detected by the frontend,
-   it is now ready for inference.
-
-## Example Request
-
-To verify the deployed model is working, send a `curl` request:
-```bash
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
-    "messages": [
-        {
-            "role": "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "Describe the image"
-                },
-                {
-                    "type": "image_url",
-                    "image_url": {
-                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
-                    }
-                }
-            ]
-        }
-    ],
-    "stream": false,
-    "max_tokens": 160
-}'
-```
-
-## Cleanup
-
-To clean up background `srun` processes launched by `srun_aggregated.sh` or
-`srun_disaggregated.sh`, you can run:
-```bash
-pkill srun
-```
-
-## Known Issues
-
-- Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` on 8 H100 nodes with TP=16 is not possible: Llama 4 Maverick's config sets `"num_attention_heads": 40`, and the TRT-LLM engine asserts `self.num_heads % tp_size == 0`, so the engine crashes while loading the model.
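The arithmetic behind this known issue can be checked directly. The sketch below only reproduces the divisibility condition quoted above; `valid_tp_sizes` is a hypothetical helper, and the engine may impose further constraints that are not modeled here.

```python
from typing import List


def valid_tp_sizes(num_attention_heads: int, max_tp_size: int) -> List[int]:
    """List tensor-parallel sizes that divide the attention head count evenly."""
    return [tp for tp in range(1, max_tp_size + 1) if num_attention_heads % tp == 0]


# 40 heads split 16 ways leaves a remainder, so the assertion fails for TP=16.
print(40 % 16)                 # 8
print(valid_tp_sizes(40, 16))  # [1, 2, 4, 5, 8, 10]
```

Under this condition alone, TP=8 passes (40 is divisible by 8) while TP=16 does not.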
--- a/docs/backends/vllm/multimodal.md
+++ b/docs/backends/vllm/multimodal.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Multimodal Support
-
-Dynamo supports multimodal models with vLLM v1. In general, multimodal models can be served using the aggregated serving setup with [`agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh).
-
-> [!WARNING]
-> **LLaVA Model Limitation**: Do not use LLaVA models (e.g., `llava-hf/llava-1.5-7b-hf`) with the standard aggregated serving setup, as they contain keywords that Dynamo cannot yet parse. LLaVA models can still be used with the EPD (Encode-Prefill-Decode) setup described below.
-
-> [!IMPORTANT]
-> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
-This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but extends it to all multimodal content (URLs, embeddings, base64).
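The startup check described above can be sketched with `argparse`. This is a hypothetical illustration of the behavior, not the worker's actual argument parser: the flag names come from the text above, but treating them as boolean switches and the `parse_worker_args` helper itself are assumptions.

```python
import argparse
from typing import List


def parse_worker_args(argv: List[str]) -> argparse.Namespace:
    """Parse flags and fail fast if multimodal roles are requested without the opt-in flag."""
    parser = argparse.ArgumentParser(description="Hypothetical multimodal worker CLI")
    parser.add_argument("--enable-multimodal", action="store_true")
    parser.add_argument("--multimodal-worker", action="store_true")
    parser.add_argument("--multimodal-processor", action="store_true")
    args = parser.parse_args(argv)
    if (args.multimodal_worker or args.multimodal_processor) and not args.enable_multimodal:
        # parser.error prints a usage message and exits with status 2.
        parser.error("multimodal flags require --enable-multimodal")
    return args


args = parse_worker_args(["--enable-multimodal", "--multimodal-worker"])
print(args.enable_multimodal, args.multimodal_worker)  # True True
```

Passing `["--multimodal-worker"]` alone makes the parser exit with an error, which corresponds to the fail-at-startup behavior described above.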
-
-# Multimodal EPD Deployment Examples
-
-This section provides example workflows and reference implementations for deploying a multimodal model using Dynamo and vLLM v1 with the EPD (Encode-Prefill-Decode) pipeline.
-
-## Use the Latest Release
-
-We recommend using the latest stable release of dynamo to avoid breaking changes:
-
-[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
-
-You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
-
-```bash
-git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
-```
-
-## Multimodal Aggregated Serving
-
-### Components
-
-- workers: For aggregated serving, we have two workers, [EncodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
-- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
-- frontend: HTTP endpoint to handle incoming requests.
-
-### Workflow
-
-The EncodeWorkerHandler is responsible for encoding the image and passing the embeddings to the MultimodalPDWorkerHandler via a combination of NATS and RDMA.
-The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-The MultimodalPDWorkerHandler then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
-By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
-EncodeWorkerHandler independently from the prefill and decode workers if needed.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> encode_worker
-  encode_worker --> processor
-  encode_worker --embeddings--> pd_worker
-  pd_worker --> encode_worker
-```
-
-***Note*** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct today. Disaggregated serving is currently only confirmed for LLaVA (see note below).
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-# Serve a LLaVA 1.5 7B model:
-bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
-# Serve a Qwen2.5-VL model:
-bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "llava-hf/llava-1.5-7b-hf",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "What is in this image?"
-            },
-            {
-              "type": "image_url",
-              "image_url": {
-                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "temperature": 0.0,
-      "stream": false
-    }'
-```
-
-If serving the example Qwen model, replace `"llava-hf/llava-1.5-7b-hf"` in the `"model"` field with `"Qwen/Qwen2.5-VL-7B-Instruct"`.
-
-You should see a response similar to this:
-```json
-{"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
-```
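
The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names (`build_request`, `extract_reply`, `ask`) are illustrative assumptions and not part of Dynamo, and the payload simply mirrors the fields shown in the curl example.

```python
import json
import urllib.request


def build_request(base_url, model, prompt, image_url, max_tokens=300):
    """Build (but do not send) a chat-completions request with one image."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a non-streaming chat.completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(base_url, model, prompt, image_url):
    """Send the request and return the reply (requires a running frontend)."""
    request = build_request(base_url, model, prompt, image_url)
    with urllib.request.urlopen(request) as response:
        return extract_reply(response.read())
```

With the aggregated deployment running, `ask("http://localhost:8000", "llava-hf/llava-1.5-7b-hf", "What is in this image?", "http://images.cocodataset.org/test2017/000000155781.jpg")` would return the assistant's description as a string.
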
-
-## Multimodal Disaggregated Serving
-
-### Components
-
- workers: For disaggregated serving, we have three workers, [EncodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
-
-### Workflow
-
-In this workflow, we have three workers, [EncodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py), [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py), and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py).
-For the LLaVA model, embeddings are only required during the prefill stage. As such, the EncodeWorkerHandler is connected directly to the prefill worker.
-The EncodeWorkerHandler is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
-The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> encode_worker
-  encode_worker --> processor
-  encode_worker --embeddings--> prefill_worker
-  prefill_worker --> encode_worker
-  prefill_worker --> decode_worker
-  decode_worker --> prefill_worker
-```
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "llava-hf/llava-1.5-7b-hf",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "What is in this image?"
-            },
-            {
-              "type": "image_url",
-              "image_url": {
-                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "temperature": 0.0,
-      "stream": false
-    }'
-```
-
-You should see a response similar to this:
-```json
-{"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
-```
-
-***Note***: Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
-
-## Llama 4 Family Serving
-
-The Llama 4 family of models is natively multimodal. However, unlike
-LLaVA, these models do not directly consume image embeddings as input
-(see the [support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)
-from vLLM for the types of multimodal inputs each model supports).
-Therefore, the encode worker is not used in the following example, and
-encoding is done alongside prefill.
-
-`meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` is used as the example model
-below. The examples assume nodes with 8 H100 GPUs each, where one node can hold
-one instance of the model.
-
-### Multimodal Aggregated Serving
-
-#### Components
-
- workers: For aggregated serving, we have one worker, [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the MultimodalPDWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
-
-#### Workflow
-
-In this workflow, the [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) encodes the image, then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> pd_worker
-  pd_worker --> processor
-```
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/agg_multimodal_llama.sh
-```
-
-#### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "What is in this image?"
-            },
-            {
-              "type": "image_url",
-              "image_url": {
-                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "temperature": 0.0,
-      "stream": false
-    }'
-```
-
-You should see a response similar to this:
-```json
-{"id": "b8f060fa95584e34b9204eaba7b105cc", "object": "chat.completion", "created": 1752706281, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall ambiance.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a dreamy or nostalgic feel, inviting the viewer to reflect on the scene."}, "finish_reason": "stop"}]}
-```
-
-### Multimodal Disaggregated Serving
-
-#### Components
-
- workers: For disaggregated serving, we have two workers, [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for encoding and prefilling.
- processor: Tokenizes the prompt and passes it to the MultimodalPDWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
-
-#### Workflow
-
-In this workflow, we have two workers, [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py), and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py).
-The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --image_url--> prefill_worker
-  prefill_worker --> processor
-  prefill_worker --> decode_worker
-  decode_worker --> prefill_worker
-```
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/disagg_multimodal_llama.sh --head-node
-
-# On a separate worker node that has completed the standard Dynamo setup.
-# The worker node needs the NATS_SERVER and ETCD_ENDPOINTS environment variables
-# to point to the head node's external IP address for distributed coordination.
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/disagg_multimodal_llama.sh
-```
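
The worker-node requirement noted in the comments above can be checked before launching. The sketch below is only an illustration: the function name `missing_coordination_vars` is hypothetical, and the only names taken from this document are the two environment variables.

```python
import os

# Environment variables the worker node needs for distributed coordination.
REQUIRED_VARS = ("NATS_SERVER", "ETCD_ENDPOINTS")


def missing_coordination_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_coordination_vars()` on the worker node returns an empty list when both variables are set; any names it returns still need to be exported before running the launch script.
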
-
-#### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "What is in this image?"
-            },
-            {
-              "type": "image_url",
-              "image_url": {
-                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "temperature": 0.0,
-      "stream": false
-    }'
-```
-
-You should see a response similar to this:
-```json
-{"id": "6cc99123ad6948d685b8695428238d4b", "object": "chat.completion", "created": 1752708043, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall mood.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a contemplative ambiance, inviting the viewer to reflect on the situation."}, "finish_reason": "stop"}]}
-```
-
-## Multimodal Aggregated Video Serving
-
-This example demonstrates deploying an aggregated multimodal model that can process video inputs.
-
-### Components
-
- workers: For video serving, we use the [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py) to decode the video into frames and send the frames to the [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
-
-### Workflow
-
-In this workflow, we have two workers, [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py) and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
-The VideoEncodeWorker is responsible for decoding the video into a series of frames. Unlike the image pipeline, which generates embeddings,
-this pipeline passes the raw frames directly to the VllmPDWorker via a combination of NATS and RDMA.
-The VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
-By separating the video processing from the prefill and decode stages, we can have a more flexible deployment and scale the
-VideoEncodeWorker independently from the prefill and decode workers if needed.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --video_url--> video_encode_worker
-  video_encode_worker --> processor
-  video_encode_worker --frames--> pd_worker
-  pd_worker --> video_encode_worker
-```
-
-```bash
-cd $DYNAMO_HOME/examples/multimodal
-bash launch/video_agg.sh
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "Describe the video in detail"
-            },
-            {
-              "type": "video_url",
-              "video_url": {
-                "url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "stream": false
-    }' | jq
-```
-
-You should see a response describing the video's content, similar to this:
-```json
-{
-  "id": "7587e7d152014bae8e5c4e25f9fda0ed",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "content": " The video takes us away to a lively world of wildlife and natural beauty, featuring a white rabbit in a vibrant forest setting. At the beginning of the clip, the white rabbit is seen standing on a rock, facing towards the right side of the frame, with bushes and trees in the backdrop. The rabbit appears to be alert, given its ears are up and its ears perked in the air. As the clip progresses, the movement of the rabbit brings it around a tree, where its legs are partially hidden by the dense vegetation. It then sits down and grooms its fur, a behavior that suggests it is comfortable in its surroundings. \n\nThe scene then switches to a close-up shot of the rabbit, giving us a better view of its features and expressions. In this camera angle, the rabbit appears more dynamic and alert, with its breathing more visible, signaling its health and well-being. The camera pans out, and we see the rabbit heading towards the left side of the screen, possibly curious or hunting for food, with its ears perked up again. The lush greenery of the forest unfolds in the background, adding to the feeling of a wild and thriving environment.\n\n\nThe rabbit, upturned slightly while walking, finds a pile of dirt and rocks and sits there, fully clothed, perhaps taking a break from its exploration. There's a mention of a blue bird that appears to perch atop a log, adding a touch of whimsy to the scene. Lastly, the rabbit is observed relaxing on the rocks, resting comfortably, and looking off to the right side—a moment of tranquility in a bustling ecosystem. Throughout the clip, the rabbit's outfit remains the same, allowing for a clear focus on its behavior and characteristics while fitting in its habitat.",
-        "role": "assistant",
-        "reasoning_content": null
-      },
-      "finish_reason": "stop"
-    }
-  ],
-  "created": 1756251832,
-  "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
-  "object": "chat.completion",
-  "usage": null
-}
-```
-
-## Multimodal Disaggregated Video Serving
-
-This example demonstrates deploying a disaggregated multimodal model that can process video inputs.
-
-### Components
-
- workers: For disaggregated video serving, we have three workers, [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames,
-[VllmDecodeWorker](../../../examples/multimodal/components/worker.py) for decoding, and [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
-
-### Workflow
-
-In this workflow, we have three workers, [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py), [VllmDecodeWorker](../../../examples/multimodal/components/worker.py), and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
-For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. As such, the VideoEncodeWorker is connected directly to the prefill worker.
-The VideoEncodeWorker is responsible for decoding the video into a series of frames and passing them to the prefill worker via RDMA.
-The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --video_url--> video_encode_worker
-  video_encode_worker --> processor
-  video_encode_worker --frames--> prefill_worker
-  prefill_worker --> video_encode_worker
-  prefill_worker --> decode_worker
-  decode_worker --> prefill_worker
-```
-
-```bash
-cd $DYNAMO_HOME/examples/multimodal
-bash launch/video_disagg.sh
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "Describe the video in detail"
-            },
-            {
-              "type": "video_url",
-              "video_url": {
-                "url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 300,
-      "stream": false
-    }' | jq
-```
-
-You should see a response describing the video's content, similar to this:
-```json
-{
-  "id": "7587e7d152014bae8e5c4e25f9fda0ed",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "content": " The video takes us away to a lively world of wildlife and natural beauty, featuring a white rabbit in a vibrant forest setting. At the beginning of the clip, the white rabbit is seen standing on a rock, facing towards the right side of the frame, with bushes and trees in the backdrop. The rabbit appears to be alert, given its ears are up and its ears perked in the air. As the clip progresses, the movement of the rabbit brings it around a tree, where its legs are partially hidden by the dense vegetation. It then sits down and grooms its fur, a behavior that suggests it is comfortable in its surroundings. \n\nThe scene then switches to a close-up shot of the rabbit, giving us a better view of its features and expressions. In this camera angle, the rabbit appears more dynamic and alert, with its breathing more visible, signaling its health and well-being. The camera pans out, and we see the rabbit heading towards the left side of the screen, possibly curious or hunting for food, with its ears perked up again. The lush greenery of the forest unfolds in the background, adding to the feeling of a wild and thriving environment.\n\n\nThe rabbit, upturned slightly while walking, finds a pile of dirt and rocks and sits there, fully clothed, perhaps taking a break from its exploration. There's a mention of a blue bird that appears to perch atop a log, adding a touch of whimsy to the scene. Lastly, the rabbit is observed relaxing on the rocks, resting comfortably, and looking off to the right side—a moment of tranquility in a bustling ecosystem. Throughout the clip, the rabbit's outfit remains the same, allowing for a clear focus on its behavior and characteristics while fitting in its habitat.",
-        "role": "assistant",
-        "reasoning_content": null
-      },
-      "finish_reason": "stop"
-    }
-  ],
-  "created": 1756251832,
-  "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
-  "object": "chat.completion",
-  "usage": null
-}
-```
-
-## Multimodal Aggregated Audio Serving
-
-This example demonstrates deploying an aggregated multimodal model that can process audio inputs.
-
-### Components
-
- workers: For audio serving, we use the [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py) to encode audio into embeddings and send the embeddings to the [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
-
-### Workflow
-
-In this workflow, we have two workers, [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py) and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
-The AudioEncodeWorker is responsible for encoding the audio into embeddings.
-The VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
-By separating the audio processing from the prefill and decode stages, we can have a more flexible deployment and scale the
-AudioEncodeWorker independently from the prefill and decode workers if needed.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --audio_url--> audio_encode_worker
-  audio_encode_worker --> processor
-  audio_encode_worker --embeddings--> pd_worker
-  pd_worker --> audio_encode_worker
-```
-
-```bash
-pip install "vllm[audio]" accelerate # multimodal audio models dependency
-cd $DYNAMO_HOME/examples/multimodal
-bash launch/audio_agg.sh
-```
-
-### Client
-
-In another terminal:
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-      "model": "Qwen/Qwen2-Audio-7B-Instruct",
-      "messages": [
-        {
-          "role": "user",
-          "content": [
-            {
-              "type": "text",
-              "text": "What is recited in the audio?"
-            },
-            {
-              "type": "audio_url",
-              "audio_url": {
-                "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
-              }
-            }
-          ]
-        }
-      ],
-      "max_tokens": 6000,
-      "temperature": 0.8,
-      "stream": false
-    }' | jq
-```
-
-You should see a response describing the audio's content, similar to this:
-```json
-{
-  "id": "e2d8d67c37634b309400974eaa058ce8",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "content": "The original content of this audio is:'yet these thoughts affected Hester Pynne less with hope than apprehension.'",
-        "refusal": null,
-        "tool_calls": null,
-        "role": "assistant",
-        "function_call": null,
-        "audio": null
-      },
-      "finish_reason": "stop",
-      "logprobs": null
-    }
-  ],
-  "created": 1756368148,
-  "model": "Qwen/Qwen2-Audio-7B-Instruct",
-  "service_tier": null,
-  "system_fingerprint": null,
-  "object": "chat.completion",
-  "usage": null
-}
-```
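
The image, video, and audio requests in this document differ only in the type of the media content part (`image_url`, `video_url`, or `audio_url`). A minimal sketch of a shared builder follows; `media_part` and `user_message` are hypothetical helper names introduced here for illustration.

```python
SUPPORTED_KINDS = ("image", "video", "audio")


def media_part(kind, url):
    """Build the media content part used in the curl examples."""
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported media kind: {kind!r}")
    key = f"{kind}_url"
    return {"type": key, key: {"url": url}}


def user_message(prompt, kind, url):
    """Build a user message with one text part followed by one media part."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, media_part(kind, url)],
    }
```

For example, `user_message("What is recited in the audio?", "audio", url)` produces the same `messages` entry as the audio request above, given the same URL.
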
-
-## Multimodal Disaggregated Audio Serving
-
-This example demonstrates deploying a disaggregated multimodal model that can process audio inputs.
-
-### Components
-
- workers: For disaggregated audio serving, we have three workers, [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py) for encoding audio into embeddings,
-[VllmDecodeWorker](../../../examples/multimodal/components/worker.py) for decoding, and [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
-
-### Workflow
-
-In this workflow, we have three workers, [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py), [VllmDecodeWorker](../../../examples/multimodal/components/worker.py), and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
-For the Qwen/Qwen2-Audio-7B-Instruct model, audio embeddings are only required during the prefill stage. As such, the AudioEncodeWorker is connected directly to the prefill worker.
-The AudioEncodeWorker is responsible for encoding the audio into embeddings and passing them to the prefill worker via RDMA.
-The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
-
-This figure illustrates the workflow:
-```mermaid
-flowchart LR
-  HTTP --> processor
-  processor --> HTTP
-  processor --audio_url--> audio_encode_worker
-  audio_encode_worker --> processor
-  audio_encode_worker --embeddings--> prefill_worker
-  prefill_worker --> audio_encode_worker
-  prefill_worker --> decode_worker
-  decode_worker --> prefill_worker
-```
-
-```bash
-pip install "vllm[audio]" accelerate # multimodal audio models dependency
-cd $DYNAMO_HOME/examples/multimodal
-bash launch/audio_disagg.sh
-```
--- a/docs/backends/vllm/multimodal_vllm_guide.md
+++ b/docs/backends/vllm/multimodal_vllm_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# vLLM Multimodal Guide
-
-This document provides a comprehensive guide for multimodal inference using the vLLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal.md).
-
-## Multimodal Support Matrix
-
-| Modality | Input Format | Aggregated | Disaggregated | Notes |
-|----------|--------------|------------|---------------|-------|
-| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
-| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
-| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
-| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
-
-## Architecture Comparison
-
-vLLM multimodal supports three deployment patterns:
-
-```text
-SIMPLE AGGREGATED ([examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)):
-  Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
-  • 2 components • --connector none • Easiest setup
-
-EPD AGGREGATED ([examples/backends/vllm/launch/agg_multimodal_epd.sh](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh)):
-  Client → Frontend → Processor → Encoder [NIXL] → PD Worker → Response
-  • 4 components • --multimodal-processor • Custom templates, NIXL
-
-DISAGGREGATED ([examples/backends/vllm/launch/disagg_multimodal_epd.sh](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh)):
-  Client → Frontend → Processor → Encoder [NIXL] → Prefill [NIXL] → Decode → Response
-  • 5 components • Separate P/D workers • Multi-node, max optimization
-```
-
-## Input Format Details
-
-### Supported URL Formats
-
-| Format | Example | Description | Support |
-|--------|---------|-------------|---------|
-| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
-| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ |
-
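
A Data URL as listed in the table can be produced from raw image bytes with the standard library. This is a minimal sketch; the helper name `to_data_url` and the default MIME type are assumptions made for illustration.

```python
import base64


def to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw media bytes as a base64 Data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting string can be used as the `url` value of an `image_url` content part in place of an HTTP/HTTPS URL. JPEG files typically begin with the bytes `FF D8 FF E0`, which is why their base64 form starts with `/9j/4`, as in the table's example.
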
-## Simple Aggregated Mode (PD)
-
-In simple aggregated mode, encoding, prefill, and decode happen within the same worker.
-
-### Architecture
-
-```text
-HTTP Frontend with Rust processor
-    ↓
-Worker (Python - ModelInput.Tokens)
-    ↓ encode + prefill + decode
-Response
-```
-
-## EPD Aggregated Mode (PD)
-
-In EPD aggregated mode, encoding happens in a separate worker, while prefill and decode happen within a single PD worker.
-
-### Architecture
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python - ModelInput.Text)
-    ↓ tokenizes, extracts media URL
-Encode Worker (Python - not registered)
-    ↓ downloads media, generates embeddings, NIXL transfer
-PD Worker (Python - ModelInput.Tokens)
-    ↓ prefill + decode
-Response
-```
-
-### Components
-
-| Component | Flag | ModelInput | Registered | Purpose |
-|-----------|------|-----------|------------|---------|
-| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
-| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
-| PD Worker | `--multimodal-worker` | Tokens | Yes | Prefill + Decode |
-
-## EPD Disaggregated Mode (E->P->D)
-
-In EPD disaggregated mode, encoding, prefill, and decode are handled by separate workers.
-
-### Architecture
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python - ModelInput.Text)
-    ↓ tokenizes, extracts media URL
-Encode Worker (Python - not registered)
-    ↓ downloads media, generates embeddings, NIXL transfer
-Prefill Worker (Python - ModelInput.Tokens)
-    ↓ prefill only, KV cache NIXL transfer
-Decode Worker (Python - ModelInput.Tokens)
-    ↓ decode only, token generation
-Response
-```
-
-### Components
-
-| Component | Flag | ModelInput | Registered | Purpose |
-|-----------|------|-----------|------------|---------|
-| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
-| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
-| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Tokens | Yes | Prefill only |
-| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
-
-## Traditional Disagg (EP->D)
-
-Llama 4 models don't support pre-computed embeddings, so they use a combined Encode+Prefill worker.
-
-### Architecture
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python - ModelInput.Text)
-    ↓ tokenizes, extracts media URL
-Encode+Prefill Worker (Python - ModelInput.Tokens)
-    ↓ downloads media, encodes inline, prefill, KV cache NIXL transfer
-Decode Worker (Python - ModelInput.Tokens)
-    ↓ decode only, token generation
-Response
-```
-
-### Components
-
-| Component | Flag | ModelInput | Registered | Purpose |
-|-----------|------|-----------|------------|---------|
-| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
-| Encode+Prefill | `--multimodal-encode-prefill-worker --is-prefill-worker` | Tokens | Yes | Encode + Prefill |
-| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
-
-### Launch Script
-
-Example: [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)
-
-## ModelInput Types and Registration
-
-### Understanding ModelInput
-
-Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
-
-| ModelInput Type | Preprocessing | Use Case |
-|-----------------|---------------|----------|
-| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
-| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
-
-### Component Registration Pattern
-
-```python
-# Processor - Entry point from HTTP frontend
-await register_llm(
-    ModelInput.Text,        # Frontend sends raw text
-    ModelType.Chat,
-    generate_endpoint,
-    model_name,
-    ...
-)
-
-# Workers - Internal components
-await register_llm(
-    ModelInput.Tokens,      # Expect pre-tokenized input
-    ModelType.Chat,         # or ModelType.Prefill for prefill workers
-    generate_endpoint,
-    model_name,
-    ...
-)
-```
-
-## **NIXL USE**
-
-| Use Case | Script | NIXL Used? | Data Transfer |
-|----------|--------|------------|---------------|
-| Simple Aggregated | [`examples/backends/vllm/launch/agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh) | ❌ No | All in one worker |
-| E->PD Aggregated | [`examples/backends/vllm/launch/agg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh) | ✅ Yes | Encoder → PD (embeddings) |
-| E->P->D Disaggregated | [`examples/backends/vllm/launch/disagg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh) | ✅ Yes | Encoder → Prefill (embeddings)<br>Prefill → Decode (KV cache) |
-| EP->D Disaggregated (Llama 4) | [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) | ✅ Yes | Prefill → Decode (KV cache) |
-
-
-## Known Limitations
-
-- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
-
-## Supported Models
-
-The following models have been tested with Dynamo's vLLM multimodal backend:
-
-- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
-- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
-- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
-- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
-- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
-- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
-
-For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
-
-## Key Files
-
-| File | Description |
-|------|-------------|
-| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
-| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
-| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
-| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation |
-| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
-
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -86,6 +86,15 @@ redirects = {
    "dynamo_glossary": "../reference/glossary.html",
    "support_matrix": "../reference/support-matrix.html",
    "components/router/README": "../router/README.html",
+    # Multimodal documentation consolidation
+    "backends/vllm/multimodal": "../../multimodal/vllm.html",
+    "backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html",
+    "backends/trtllm/multimodal_support": "../../multimodal/trtllm.html",
+    "backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html",
+    "backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html",
+    "backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
+    "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
+    "multimodal/multimodal_intro": "index.html",
 }
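A quick way to sanity-check the new redirect entries is to resolve each relative target against the directory of the old page. The sketch below assumes targets are resolved the way a browser resolves a relative link from the old page's HTML file; that assumption is not confirmed by this diff, and the helper name `resolve_redirect` is hypothetical rather than part of Sphinx or any redirect extension.

```python
import posixpath

# New redirect entries copied from the hunk above.
NEW_REDIRECTS = {
    "backends/vllm/multimodal": "../../multimodal/vllm.html",
    "backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html",
    "backends/trtllm/multimodal_support": "../../multimodal/trtllm.html",
    "backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html",
    "backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html",
    "backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
    "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
    "multimodal/multimodal_intro": "index.html",
}


def resolve_redirect(source: str, target: str) -> str:
    """Resolve a relative target against the old page's directory (hypothetical helper)."""
    base_dir = posixpath.dirname(source)
    return posixpath.normpath(posixpath.join(base_dir, target))


# Map every old document name to the docs-root-relative page it would land on.
resolved = {src: resolve_redirect(src, dst) for src, dst in NEW_REDIRECTS.items()}
for src, dst in resolved.items():
    print(f"{src} -> {dst}")
```

Under that assumption, every old multimodal page should resolve to a file under `multimodal/`; a target with one `..` too many or too few would show up as a path outside that directory.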

 # Custom extensions

--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -46,11 +46,8 @@
   fault_tolerance/request_cancellation.md

   backends/trtllm/multinode/multinode-examples.md
-   backends/trtllm/multinode/multinode-multimodal-example.md
   backends/trtllm/llama4_plus_eagle.md
   backends/trtllm/kv-cache-transfer.md
-   backends/trtllm/multimodal_support.md
-   backends/trtllm/multimodal_trtllm_guide.md
   backends/trtllm/gemma3_sliding_window_attention.md
   backends/trtllm/gpt-oss.md
   backends/trtllm/prometheus.md
@@ -60,8 +57,6 @@
   backends/sglang/dsr1-wideep-h100.md
   backends/sglang/expert-distribution-eplb.md
   backends/sglang/gpt-oss.md
-   backends/sglang/multimodal_epd.md
-   backends/sglang/multimodal_sglang_guide.md
   backends/sglang/profiling.md
   backends/sglang/sgl-hicache-example.md
   backends/sglang/sglang-disaggregation.md
@@ -77,8 +72,6 @@
   backends/vllm/gpt-oss.md
   backends/vllm/LMCache_Integration.md
   backends/vllm/multi-node.md
-   backends/vllm/multimodal.md
-   backends/vllm/multimodal_vllm_guide.md
   backends/vllm/prometheus.md
   backends/vllm/speculative_decoding.md


--- a/docs/index.rst
+++ b/docs/index.rst
@@ -58,7 +58,7 @@ Quickstart
   :caption: User Guides

   Tool Calling <agents/tool-calling.md>
-   Multimodality Support <multimodal/multimodal_intro.md>
+   Multimodality Support <multimodal/index.md>
   Finding Best Initial Configs <performance/aiconfigurator.md>
   Benchmarking <benchmarks/benchmarking.md>
   Tuning Disaggregated Performance <performance/tuning.md>

--- a/docs/multimodal/index.md
+++ b/docs/multimodal/index.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Multimodal Inference in Dynamo
+
+Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
+
+> [!IMPORTANT]
+> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
+> See the relevant documentation for each backend for the necessary flags.
+>
+> This prevents unintended processing of multimodal data from untrusted sources.
+
+## Backend Documentation
+
+```{toctree}
+:maxdepth: 1
+
+vLLM Multimodal <vllm.md>
+TensorRT-LLM Multimodal <trtllm.md>
+SGLang Multimodal <sglang.md>
+```
+
+## Support Matrix
+
+### Backend Capabilities
+
+| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
+|-------|------|-------|------|-----|-------|-------|-------|
+| **[vLLM](vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
+| **[TRT-LLM](trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
+| **[SGLang](sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
+
+\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
+
+**Pattern Key:**
+
+- **EPD** - All-in-one worker (Simple Aggregated)
+- **E/PD** - Separate encode, combined prefill+decode
+- **E/P/D** - All stages separate
+- **EP/D** - Combined encode+prefill, separate decode
+
+**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
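For readers who want to query the capability matrix programmatically, the table can be transcribed into a small lookup. This is only an illustrative sketch: the dictionary layout, the status strings, and the function name `backends_supporting` are assumptions made for this example and do not come from Dynamo.

```python
# Status strings mirror the legend: "yes" (supported), "wip", "experimental", "no".
CAPABILITIES = {
    "vLLM": {"E/PD": "yes", "E/P/D": "yes", "EP/D": "yes", "EPD": "yes",
             "Image": "yes", "Video": "yes", "Audio": "experimental"},
    "TRT-LLM": {"E/PD": "no", "E/P/D": "wip", "EP/D": "yes", "EPD": "yes",
                "Image": "yes", "Video": "no", "Audio": "no"},
    "SGLang": {"E/PD": "yes", "E/P/D": "yes", "EP/D": "no", "EPD": "no",
               "Image": "yes", "Video": "no", "Audio": "no"},
}


def backends_supporting(feature: str, include_partial: bool = False) -> list:
    """Return the backends whose status for `feature` is acceptable.

    With include_partial=True, WIP and experimental entries also count.
    """
    accepted = {"yes"}
    if include_partial:
        accepted |= {"wip", "experimental"}
    return [name for name, caps in CAPABILITIES.items() if caps[feature] in accepted]


print(backends_supporting("EP/D"))
print(backends_supporting("Audio", include_partial=True))
```

Keeping WIP and experimental entries behind an explicit flag avoids treating partially supported combinations as production-ready by accident.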
+
+### Input Format Support
+
+| Format | vLLM | TRT-LLM | SGLang |
+|--------|------|---------|--------|
+| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
+| Data URL (Base64) | ✅ | ❌ | ❌ |
+| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
+
+## Architecture Patterns
+
+Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
+
+1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
+   - *Inline*: Simpler setup, encoding happens in the prefill worker
+   - *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling
+
+2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
+   - *Aggregated*: Single worker handles both prefill and decode
+   - *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
+
+These combine into four deployment patterns:
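As a rough illustration of how the two dimensions combine, the mapping can be written as a two-flag function. The function name and parameter names are hypothetical and chosen only for this sketch.

```python
def deployment_pattern(separate_encode: bool, disaggregated: bool) -> str:
    """Map the two dimensions onto the pattern names used in this document.

    separate_encode: True when a dedicated Encode Worker handles media encoding.
    disaggregated:   True when prefill and decode run in separate workers.
    """
    if separate_encode:
        return "E/P/D" if disaggregated else "E/PD"
    return "EP/D" if disaggregated else "EPD"


for separate_encode in (False, True):
    for disaggregated in (False, True):
        name = deployment_pattern(separate_encode, disaggregated)
        print(f"separate_encode={separate_encode}, disaggregated={disaggregated} -> {name}")
```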
+
+### EPD - Simple Aggregated
+
+All processing happens within a single worker - the simplest setup.
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Worker (Python)
+    ↓ image load + encode + prefill + decode
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
+| Worker | Complete inference pipeline (encode + prefill + decode) |
+
+**When to use:** Quick setup, smaller models, development/testing.
+
+### E/PD - Encode Separate
+
+Encoding happens in a separate worker; prefill and decode share the same engine.
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python)
+    ↓ downloads media, generates embeddings, NIXL transfer
+PD Worker (Python)
+    ↓ receives embeddings via NIXL, prefill + decode
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point |
+| Processor (Python) | Tokenization, extracts media URLs |
+| Encode Worker | Media encoding, embeddings generation |
+| PD Worker | Prefill + Decode with embeddings |
+
+**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.
+
+### E/P/D - Full Disaggregation
+
+Full disaggregation with separate workers for encoding, prefill, and decode.
+There are two variants of this workflow:
+- Prefill-first, used by vLLM
+- Decode-first, used by SGLang
+
+Prefill-first:
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python)
+    ↓ downloads media, generates embeddings, NIXL transfer
+Prefill Worker (Python)
+    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
+Decode Worker (Python)
+    ↓ decode only, token generation
+Response
+```
+
+OR
+
+Decode-first:
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python)
+    ↓ downloads media, generates embeddings, NIXL transfer
+Decode Worker (Python)
+    ↓ Bootstraps prefill worker
+Prefill Worker (Python)
+    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
+Decode Worker (Python)
+    ↓ decode only, token generation
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point |
+| Processor (Python) | Tokenization, extracts media URLs |
+| Encode Worker | Media encoding, embeddings generation |
+| Prefill Worker | Prefill only, transfers KV cache |
+| Decode Worker | Decode only, token generation |
+
+**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
+
+### EP/D - Traditional Disaggregated
+
+Encoding is combined with prefill, with decode separate.
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode+Prefill Worker (Python)
+    ↓ downloads media, encodes inline, prefill, KV cache transfer
+Decode Worker (Python)
+    ↓ decode only, token generation
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point |
+| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
+| Encode+Prefill Worker | Combined encoding and prefill |
+| Decode Worker | Decode only, token generation |
+
+> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
+> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
+
+**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
+
+## Example Workflows
+
+You can find example workflows and reference implementations for deploying multimodal models in:
+
+- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
+- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
+- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
+- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
--- a/docs/multimodal/multimodal_intro.md
+++ b/docs/multimodal/multimodal_intro.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-
-# Multimodal Inference in Dynamo:
-
-You can find example workflows and reference implementations for deploying a multimodal model using Dynamo in [multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal).
-
-##  EPD vs. PD Disaggregation
-Dynamo supports two primary approaches for processing multimodal inputs, which differ in how the initial media encoding step is handled relative to the main LLM inference engine.
-
-### 1. EPD (Encode-Prefill-Decode) Disaggregation
-The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.
-
-* **Media Input:** Image, video, audio, or an embedding URL is provided.
-* **Process Flow:**
-    1.  A dedicated **Encode Worker** is launched separately to handle the embedding extraction from the media input.
-    2.  The extracted embeddings are transferred to the main engine via the **NVIDIA Inference Xfer Library (NIXL)**.
-    3.  The main **Engine** performs the remaining **Prefill Decode Disaggregation** steps to generate the output.
-* **Benefit:** This disaggregation allows for the decoupling of media encoding hardware/resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
-
-### 2. PD (Prefill-Decode) Disaggregation
-
-The PD approach is a more traditional, aggregated method where the inference engine handles the entire process.
-* **Media Input:** Image, video, or audio is loaded.
-* **Process Flow:**
-    1.  The main **Engine** receives the media input.
-    2.  The Engine executes the full sequence: **Encode + Prefill + Decode**.
-* **Note:** In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
-
-## Inference Framework Support Matrix
-
-Dynamo supports multimodal capabilities across leading LLM inference backends, including **vLLM**, **TensorRT-LLM (TRT-LLM)**, and **SGLang**. The table below details the current support level for EPD/PD and various media types for each stack.
-
-| Stack | EPD Support | PD Support | Image | Video | Audio |
-| --------- | --------- | --------- | --------- |---------| --------- |
-| **vLLM** | ✅  | ✅  | ✅  | ✅  | 🚧 |
-| **TRT-LLM** | ✅  (Currently via precomputed Embeddings URL) | ✅  | ✅  | ❌  | ❌  |
-| **SGLang** | ✅  | ❌  | ✅  | ❌  | ❌  |
--- a/docs/backends/sglang/multimodal_sglang_guide.md
+++ b/docs/backends/sglang/multimodal_sglang_guide.md
@@ -15,171 +15,231 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-# SGLang Multimodal Guide
+# SGLang Multimodal

-This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_epd.md).
+This document provides a comprehensive guide for multimodal inference using the SGLang backend in Dynamo. SGLang multimodal uses specialized **E/PD or E/P/D** flows with **NIXL (RDMA)** for zero-copy tensor transfer.

-## Multimodal Support Matrix
+## Support Matrix

 | Modality | Input Format | Aggregated | Disaggregated | Notes |
 |----------|--------------|------------|---------------|-------|
-| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
-| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
-| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
-| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings |
+| **Image** | Data URL (Base64) | No | No |  |
+| **Video** | HTTP/HTTPS URL | No | No |  |
+| **Audio** | HTTP/HTTPS URL | No | No |  |

-## Architecture Comparison
+### Supported URL Formats

-SGLang multimodal supports two deployment patterns:
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |

-```text
-AGGREGATED (E->PD):
-  Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
-  • 3 components • Vision encoder in Python • NIXL embeddings transfer
+## Deployment Patterns

-DISAGGREGATED (E->P->D):
-  Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
-  • 4 components • Vision encoder in Python • KV cache transfer via bootstrap mechanism
-```
+SGLang supports only the E/PD and E/P/D patterns, since it always runs a separate encode worker. See [Multimodal Architecture Patterns](index.md#architecture-patterns) for detailed explanations.

-## Aggregated Mode (E->PD)
+| Pattern | Supported | Launch Script | Notes |
+|---------|-----------|---------------|-------|
+| EPD (Simple Aggregated) | ❌ | N/A | Not supported |
+| E/PD (Encode Separate) | ✅ | `multimodal_agg.sh` | Vision encoder separate |
+| E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap |
+| EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported |

-In aggregated mode, encoding happens in a separate worker, but prefill and decode share the same engine.
+### Component Flags

-### Architecture
+| Component | Flag | Purpose |
+|-----------|------|---------|
+| Processor | `--multimodal-processor` | HTTP entry, OpenAI→SGLang conversion |
+| Encode Worker | `--multimodal-encode-worker` | Vision encoder, embeddings generation |
+| PD Worker | `--multimodal-worker` | Prefill + Decode with embeddings |
+| Decode Worker | `--multimodal-worker --serving-mode=decode` | Entry point for disaggregation |
+| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | Called by Decode, bootstrap coordination |

-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python - ModelInput.Text - REGISTERED)
-    ↓ tokenizes with chat template, extracts image URL
-Encode Worker (Python - NOT registered)
-    ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
-PD Worker (Python - NOT registered)
-    ↓ receives embeddings via NIXL, prefill + decode
-Response → Processor → Frontend
+### SGLang-Specific Characteristics
+
+- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
+- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
+- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
+- **No Rust Processing**: All tokenization and image handling happens in Python
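The token expansion step listed above can be sketched in a few lines. This is a simplified illustration, not the handler's actual code: it works on token strings rather than token IDs, and the count `num_image_tokens` is passed in directly, whereas the real value would be derived from the embedding shape. The function name is hypothetical.

```python
IMAGE_PAD = "<|image_pad|>"


def expand_image_tokens(tokens: list, num_image_tokens: int) -> list:
    """Replace each single image placeholder with `num_image_tokens` copies."""
    expanded = []
    for token in tokens:
        if token == IMAGE_PAD:
            # One placeholder per embedding row, so positions line up downstream.
            expanded.extend([IMAGE_PAD] * num_image_tokens)
        else:
            expanded.append(token)
    return expanded


prompt = ["Hello", IMAGE_PAD, "world"]
expanded = expand_image_tokens(prompt, 3)
print(expanded)
```

Expanding before the request reaches the downstream worker means that worker sees a token sequence whose placeholder count already matches the number of embedding vectors it receives.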
+
+## Use the Latest Release
+
+We recommend using the latest stable release of Dynamo to avoid breaking changes:
+
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
+
+You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
+
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ```

+## E/PD Serving (Encode Separate)
+
 ### Components

-| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
-|-----------|------|-----------|------------|-------------------|---------|
-| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
-| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
-| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + Decode with embeddings |
+- workers:
+  - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
+  - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
+- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py)
+  - tokenizes the prompt using the chat template
+  - passes the text and image URL to the MultimodalEncodeWorker.

-### Key Characteristics
+### Workflow

-- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
-- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
-- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
-- **No Rust Processing**: All tokenization and image handling happens in Python
+The `MultimodalEncodeWorker` downloads and encodes the image, then passes the embeddings to the `MultimodalWorker`. The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../backends/sglang/README.md) example. Only the processor registers with the Dynamo frontend as an available endpoint. The workers do not register: they are internal components that communicate via NATS.

-## Disaggregated Mode (E->P->D)
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --tokenized request + image_url--> encode_worker
+  encode_worker --request + embeddings--> worker

-In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination.
+  worker -.-> encode_worker
+  encode_worker -.-> processor
+  processor -.-> HTTP
+```

-### Architecture

-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python - ModelInput.Text - REGISTERED)
-    ↓ tokenizes with chat template, extracts image URL
-Encode Worker (Python - NOT registered)
-    ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
-Prefill Worker (Python - NOT registered)
-    ↓ receives embeddings via NIXL, prefill only, returns bootstrap info
-Decode Worker (Python - NOT registered)
-    ↓ uses bootstrap info, decode only, token generation
-Response → Processor → Frontend
+### Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/multimodal_agg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "Describe the image."
+          },
+          {
+            "type": "image_url",
+            "image_url": {
+              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+            }
+          }
+        ]
+      }
+    ],
+    "max_tokens": 50,
+    "stream": false
+  }' | jq
 ```
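The same request can be built from Python. The sketch below mirrors the curl payload using only the standard library; the helper names `build_chat_request` and `post_chat_request` are hypothetical, and the endpoint URL is the one from the curl example, so it assumes the frontend is listening on `localhost:8000`.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, image_url: str, max_tokens: int = 50) -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }


def post_chat_request(payload: dict, url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """Send the payload and return the decoded JSON response (needs a running frontend)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_request(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "Describe the image.",
    "http://images.cocodataset.org/test2017/000000155781.jpg",
)
body = json.dumps(payload)
```

Calling `post_chat_request(payload)` only works once the launch script above has started the deployment; building the payload does not require a server.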

+## E/P/D Serving (Full Disaggregation)
+
 ### Components

-| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
-|-----------|------|-----------|------------|-------------------|---------|
-| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
-| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
-| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill |
-| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination |
+- workers:
+  - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
+  - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding
+  - [MultimodalPrefillWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling
+- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker.
+
+### Workflow
+
+In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --tokenized request + image_url--> encode_worker
+  encode_worker --request + embeddings--> worker
+  worker --request + embeddings--> prefill_worker
+
+  prefill_worker --KV Cache--> worker
+  encode_worker -.-> processor
+  worker -.-> encode_worker
+  processor -.-> HTTP
+```

-### Bootstrap Coordination
+### Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/multimodal_disagg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "Describe the image."
+          },
+          {
+            "type": "image_url",
+            "image_url": {
+              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+            }
+          }
+        ]
+      }
+    ],
+    "max_tokens": 50,
+    "stream": false
+  }' | jq
+```
+
+## Bootstrap Coordination

 SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

-**Request Flow (Important):**
+### Request Flow (Important)
+
 ```text
 Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
                                               ↑
                                    Entry point for disaggregation!
 ```

-**Bootstrap Process:**
+### Bootstrap Process
+
 1. **Decode Worker** receives request from Encode Worker
 2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info
 3. **Prefill Worker** generates `{host, port, room}` and returns immediately
 4. **Both workers** connect to same "room" using bootstrap coordinates
 5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL)
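The handshake in steps 1 to 4 can be modeled with two toy classes. This is a sketch under stated assumptions, not SGLang or Dynamo code: the class names, the `request_bootstrap` method, and the host and port values are hypothetical, plain method calls stand in for the NATS request, and the KV cache transfer of step 5 is left out because SGLang performs it internally.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class BootstrapInfo:
    """The {host, port, room} coordinates returned by the prefill side."""
    host: str
    port: int
    room: int


class ToyPrefillWorker:
    """Hypothetical stand-in for the Prefill Worker."""

    def __init__(self, host: str, port: int):
        self.host = host
        self.port = port
        self._room_ids = itertools.count(1)
        self.joined_rooms = set()

    def request_bootstrap(self) -> BootstrapInfo:
        # Step 3: generate coordinates and return immediately.
        info = BootstrapInfo(self.host, self.port, next(self._room_ids))
        self.joined_rooms.add(info.room)  # Step 4, prefill side.
        return info


class ToyDecodeWorker:
    """Hypothetical stand-in for the Decode Worker, the entry point."""

    def __init__(self, prefill: ToyPrefillWorker):
        self.prefill = prefill
        self.joined_rooms = set()

    def handle(self, request_id: str) -> BootstrapInfo:
        # Step 1: the request arrives from the Encode Worker.
        # Step 2: ask the Prefill Worker for bootstrap info.
        info = self.prefill.request_bootstrap()
        self.joined_rooms.add(info.room)  # Step 4, decode side.
        return info


prefill = ToyPrefillWorker(host="127.0.0.1", port=8998)  # illustrative values
decode = ToyDecodeWorker(prefill)
first = decode.handle("req-1")
second = decode.handle("req-2")
print(first, second)
```

Giving each request its own room keeps concurrent prefill and decode pairs from sharing a transfer channel, which is the role the `room` field plays in this model.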

-**Key Difference from vLLM:**
+### Key Difference from vLLM
+
 - vLLM: Frontend → Prefill → Decode (Prefill is entry point)
 - SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point)

-## ModelInput Types and Registration
-
-**Only the Processor registers with Dynamo Rust.**
-
-### Registration Pattern
-
-```python
-# ONLY Processor registers with Dynamo Rust
-await register_llm_with_readiness_gate(
-    None,                   # No engine for processor
-    generate_endpoint,
-    server_args,
-    dynamo_args,
-    input_type=ModelInput.Text,  # Receives raw OpenAI format
-    readiness_gate=ready_event,
-)
-
-# Workers do NOT register - they are internal components
-# They communicate via NATS clients created in main.py
-```
-
-### Component Initialization
-
-```python
-# Encode Worker - connects to downstream PD worker
-pd_worker_client = (
-    await runtime.namespace(dynamo_args.namespace)
-    .component("backend")
-    .endpoint("generate")
-    .client()
-)
-
-# PD Worker (Decode mode) - connects to upstream Prefill worker
-prefill_client = (
-    await runtime.namespace(dynamo_args.namespace)
-    .component("prefill")
-    .endpoint("generate")
-    .client()
-)
-```
-
 ## Inter-Component Communication

 ### Control Flow (NATS)

 All component-to-component communication happens via NATS:

-**Aggregated Mode (E→PD):**
+#### E/PD Mode (Encode Separate)
+
 ```text
 Processor → Encode Worker → PD Worker
  (NATS)        (NATS + NIXL embeddings)
 ```

-**Disaggregated Mode (E→P→D):**
+#### E/P/D Mode (Full Disaggregation)
+
 ```text
 Processor → Encode Worker → DECODE Worker → Prefill Worker
  (NATS)        (NATS)            (NATS)
@@ -193,7 +253,7 @@ Processor → Encode Worker → DECODE Worker → Prefill Worker
                    SGLang internal KV cache transfer
 ```

-**Detailed Message Flow:**
+### Detailed Message Flow

 ```text
 Processor → Encode Worker:
@@ -220,19 +280,18 @@ Prefill ↔ Decode (via bootstrap):
 NIXL is used only for embedding transfer:

 ```python
-Encode Worker:
-  descriptor = connect.Descriptor(precomputed_embeddings)
-  with await connector.create_readable(descriptor) as readable:
-      request.serialized_request = readable.metadata()
-      # Send request with NIXL metadata
-      await pd_worker_client.round_robin(request)
-      await readable.wait_for_completion()
-
-PD Worker:
-  embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
-  descriptor = connect.Descriptor(embeddings)
-  read_op = await connector.begin_read(request.serialized_request, descriptor)
-  await read_op.wait_for_completion()
+# Encode Worker
+descriptor = connect.Descriptor(precomputed_embeddings)
+with connector.create_readable(descriptor) as readable:
+    request.serialized_request = readable.metadata()
+    await pd_worker_client.round_robin(request)
+    await readable.wait_for_completion()
+
+# PD Worker
+embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
+descriptor = connect.Descriptor(embeddings)
+read_op = await connector.begin_read(request.serialized_request, descriptor)
+await read_op.wait_for_completion()
 ```

 ## Vision Encoding Details
@@ -242,7 +301,6 @@ PD Worker:
 The encode worker loads and runs the vision model in Python:

 ```python
-# Vision components loaded in encode worker
 self.image_processor = AutoImageProcessor.from_pretrained(
    model_path, trust_remote_code=True
 )
@@ -262,6 +320,7 @@ self.vision_model = AutoModel.from_pretrained(
 4. Downstream worker receives expanded token sequence

 Example:
+
 ```python
 # Before: ["Hello", "<|image_pad|>", "world"]
 # After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
@@ -281,14 +340,14 @@ processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")

 Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.

-## NIXL USE
+## NIXL Usage

 | Use Case | NIXL Used? | Data Transfer | Notes |
 |----------|------------|---------------|-------|
-| E→PD Aggregated | ✅ Yes | Encoder → PD (embeddings) | Vision encoder separate |
-| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
+| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate |
+| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |

-**Key Difference:** SGLang P→D uses bootstrap mechanism, not NIXL for KV cache like vLLM.
+**Key Difference:** SGLang P/D transfers the KV cache through its own bootstrap mechanism rather than through NIXL, which vLLM uses for KV cache transfer.

 ## Known Limitations

@@ -304,12 +363,10 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.

 SGLang multimodal **only supports image-based vision-language models**:

-### ✅ Supported (Images Only)
 - **Qwen2-VL** / **Qwen2.5-VL** (primary support)
 - Models with `AutoImageProcessor` and vision tower
 - Models compatible with SGLang's image embedding format

-
 ## Key Files

 | File | Description |
@@ -321,4 +378,3 @@ SGLang multimodal **only supports image-based vision-language models**:
 | `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
 | `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
 | `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) |
-
--- a/docs/backends/trtllm/multimodal_support.md
+++ b/docs/backends/trtllm/multimodal_support.md
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
+
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
+
 http://www.apache.org/licenses/LICENSE-2.0
+
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -12,33 +15,66 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-# Multimodal Support
+# TensorRT-LLM Multimodal

-TRTLLM supports multimodal models with dynamo. You can provide multimodal inputs in the following ways:
+This document provides a comprehensive guide for multimodal inference using the TensorRT-LLM backend in Dynamo.

+You can provide multimodal inputs in the following ways:
 - By sending image URLs
 - By providing paths to pre-computed embedding files

-Please note that you should provide **either image URLs or embedding file paths** in a single request.
+> **Note:** Provide **either image URLs or embedding file paths** in a single request, not both.
+
+## Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
+| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
+| **Video** | HTTP/HTTPS URL | No | No | Not implemented |
+| **Audio** | HTTP/HTTPS URL | No | No | Not implemented |
+
+### Supported URL Formats
+
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
+| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) |
+
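As a rough illustration of the two input formats in the table above, the sketch below classifies a reference string and rejects a request that mixes both kinds. The helper names `classify_media_reference` and `check_single_kind` are hypothetical and not part of Dynamo; the worker's actual detection logic may differ.

```python
from pathlib import PurePosixPath
from typing import Iterable
from urllib.parse import urlparse

# Extensions listed above for pre-computed embedding files.
EMBEDDING_SUFFIXES = {".pt", ".pth", ".bin"}


def classify_media_reference(ref: str) -> str:
    """Return "image_url" or "embedding_file" (hypothetical helper)."""
    scheme = urlparse(ref).scheme.lower()
    if scheme in ("http", "https"):
        return "image_url"
    # Assumption: embedding files are given as plain local paths.
    if scheme == "" and PurePosixPath(ref).suffix.lower() in EMBEDDING_SUFFIXES:
        return "embedding_file"
    raise ValueError(f"unsupported media reference: {ref!r}")


def check_single_kind(refs: Iterable[str]) -> str:
    """Ensure all references in one request are of exactly one kind."""
    kinds = {classify_media_reference(r) for r in refs}
    if len(kinds) != 1:
        raise ValueError("a request must use exactly one kind of media reference")
    return kinds.pop()
```

Under these assumptions, a `data:` URL falls through both checks and raises `ValueError`, which matches the limitation that only HTTP/HTTPS URLs are accepted.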
+## Deployment Patterns
+
+TRT-LLM supports aggregated and traditional disaggregated patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
+
+| Pattern | Supported | Launch Script | Notes |
+|---------|-----------|---------------|-------|
+| EPD (Simple Aggregated) | ✅ | `agg.sh` | Easiest setup |
+| E/PD (Encode Separate) | ❌ | N/A | Not supported |
+| E/P/D (Full Disaggregation) | 🚧 WIP | N/A | PR #4668 in progress |
+| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding |
+
+### Component Flags
+
+| Component | Flag | Purpose |
+|-----------|------|---------|
+| Worker | `--modality multimodal` | Complete pipeline (aggregated) |
+| Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) |
+| Decode Worker | `--disaggregation-mode decode` | Decode only |
+| Encode Worker (WIP) | `--disaggregation-mode encode` | Image encoding (E/P/D flow) |

-## Aggregated
+## Aggregated Serving
+
+Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:

-Here are quick steps to launch Llama-4 Maverick BF16 in aggregated mode
 ```bash
 cd $DYNAMO_HOME

 export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
 export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
 export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
-./launch/agg.sh
+./examples/backends/trtllm/launch/agg.sh
 ```
-## Example Requests
-
-### With Image URL
-
-Below is an example of an image being sent to `Llama-4-Maverick-17B-128E-Instruct` model

-Request :
+**Client:**
 ```bash
 curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
@@ -63,38 +99,53 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
    "max_tokens": 160
 }'
 ```
-Response :
-
-```
-{"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
-```

-## Disaggregated
+## Disaggregated Serving

-Here are quick steps to launch in disaggregated mode.
+Example using `Qwen/Qwen2-VL-7B-Instruct`:

-The following is an example of launching a model in disaggregated mode. While this example uses `Qwen/Qwen2-VL-7B-Instruct`, you can adapt it for other models by modifying the environment variables for the model path and engine configurations.
 ```bash
 cd $DYNAMO_HOME

-export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen2-VL-7B-Instruct"}
-export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen2-VL-7B-Instruct"}
-export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"}
-export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"}
-export MODALITY=${MODALITY:-"multimodal"}
+export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
+export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
+export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
+export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
+export MODALITY="multimodal"

-./launch/disagg.sh
+./examples/backends/trtllm/launch/disagg.sh
 ```

-For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving, while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires a setup of 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
-
-In general, disaggregated serving can run on a single node, provided the model fits on the GPU. The multi-node requirement in this example is specific to the size and configuration of the `meta-llama/Llama-4-Maverick-17B-128E-Instruct` model.
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+    "model": "Qwen/Qwen2-VL-7B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Describe the image"
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
+                    }
+                }
+            ]
+        }
+    ],
+    "stream": false,
+    "max_tokens": 160
+}'
+```
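The same request can be assembled from Python using only the standard library. This is a minimal sketch, assuming the frontend listens on `localhost:8000` as in the `curl` example; `build_chat_request` is a hypothetical helper name, not a Dynamo API.

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url,
                       base_url="http://localhost:8000", max_tokens=160):
    """Build (but do not send) a chat completions request with one image."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "Qwen/Qwen2-VL-7B-Instruct",
    "Describe the image",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png",
)
# To send it against a running deployment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it keeps the payload easy to inspect before a deployment is available.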

-To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md).
+For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, disaggregated serving requires a multi-node setup (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. In a disaggregated configuration the model is too large to fit on a single node's GPUs: for instance, it requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.

-## Pre-computed Embeddings with EPD Flow
+## Pre-computed Embeddings with E/P/D Flow

-For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
+For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.

 ### Supported File Types

@@ -102,12 +153,42 @@ For high-performance multimodal inference, Dynamo supports pre-computed embeddin
 - `.pth` - PyTorch checkpoint files
 - `.bin` - Binary tensor files

+### Embedding File Formats
+
+TRT-LLM supports two formats for embedding files:
+
+**1. Simple Tensor Format**
+
+Direct tensor saved as `.pt` file containing only the embedding tensor:
+
+```python
+embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
+torch.save(embedding_tensor, "embedding.pt")
+```
+
+**2. Dictionary Format with Auxiliary Data**
+
+Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:
+
+```python
+embedding_dict = {
+    "mm_embeddings": torch.rand(1, 576, 4096),
+    "special_tokens": [128256, 128257],
+    "image_token_offsets": [[0, 576]],
+    # ... other model-specific metadata
+}
+torch.save(embedding_dict, "llama4_embedding.pt")
+```
+
+- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
+- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data
+
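The handling of the two formats can be sketched as follows. This is only an illustration of the rule described above, not the worker's implementation: `split_embedding_payload` is a hypothetical name, and the payload is assumed to have been loaded already (for example with `torch.load`).

```python
from typing import Any, Dict, Tuple


def split_embedding_payload(payload: Any) -> Tuple[Any, Dict[str, Any]]:
    """Return (main_embeddings, auxiliary_data) for either file format."""
    if isinstance(payload, dict):
        # Dictionary format: `mm_embeddings` is the main tensor,
        # every other key is preserved as auxiliary data.
        if "mm_embeddings" not in payload:
            raise KeyError("dictionary format requires an 'mm_embeddings' key")
        auxiliary = {k: v for k, v in payload.items() if k != "mm_embeddings"}
        return payload["mm_embeddings"], auxiliary
    # Simple tensor format: the payload itself is the embedding tensor.
    return payload, {}


# Assumption: in real use, `payload = torch.load("embedding.pt")`.
main, aux = split_embedding_payload(
    {"mm_embeddings": [[0.1, 0.2]], "special_tokens": [128256, 128257]}
)
```

Plain lists stand in for tensors here so the sketch runs without PyTorch installed.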
 ### How to Launch

 ```bash
 cd $DYNAMO_HOME/examples/backends/trtllm

-# Launch 3-worker EPD flow with NIXL
+# Launch 3-worker E/P/D flow with NIXL
 ./launch/epd_disagg.sh
 ```

@@ -126,7 +207,7 @@ export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
 export MAX_FILE_SIZE_MB=50
 ```

-### Example Request
+### Example Request with Pre-computed Embeddings

 ```bash
 curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
@@ -144,16 +225,14 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
 }'
 ```

-### Architecture
+### E/P/D Architecture

-The EPD flow implements a **3-worker architecture**:
+The E/P/D flow implements a **3-worker architecture**:

 - **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL
 - **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation
 - **Decode Worker**: Performs streaming token generation

-### Request Flow
-
 ```mermaid
 sequenceDiagram
    participant Client
@@ -176,6 +255,129 @@ sequenceDiagram
    Frontend->>Client: Stream response
 ```

-## Supported Multimodal Models
+## Multi-node Deployment (Slurm)
+
+This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
+
+> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
+
+### Environment Setup
+
+Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
+
+```bash
+# Container image (build using docs/backends/trtllm/README.md#build-container)
+export IMAGE="<dynamo_trtllm_image>"
+
+# Host:container path pairs for mounting
+export MOUNTS="${PWD}/../../../../:/mnt"
+
+# Model configuration
+export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+export MODALITY=${MODALITY:-"multimodal"}
+```
+
+### Multi-node Disaggregated Launch
+
+For four 4xGB200 nodes (2 for prefill, 2 for decode):
+
+```bash
+# Customize parallelism to match your engine configs
+# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
+# export NUM_PREFILL_NODES=2
+# export NUM_DECODE_NODES=2
+# export NUM_GPUS_PER_NODE=4
+
+# Launches frontend + etcd/nats on head node, plus prefill and decode workers
+./srun_disaggregated.sh
+```
+
+### Understanding the Output
+
+1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
+2. The OpenAI frontend will dynamically discover workers as they register:
+   ```
+   INFO dynamo_run::input::http: Watching for remote model at models
+   INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
+   ```
+3. TRT-LLM workers output progress from each MPI rank while loading
+4. When ready, the frontend logs:
+   ```
+   INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+   ```
+
+### Cleanup
+
+```bash
+pkill srun
+```
+
+## NIXL Usage
+
+| Use Case | Script | NIXL Used? | Data Transfer |
+|----------|--------|------------|---------------|
+| EPD (Simple Aggregated) | `agg.sh` | No | All in one worker |
+| EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
+| E/P/D (pre-computed embeddings) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL) |
+| E/P/D (WIP) | N/A | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
+
+> **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
+
+## ModelInput Types and Registration
+
+TRT-LLM workers register with Dynamo using:
+
+| ModelInput Type | Preprocessing | Use Case |
+|-----------------|---------------|----------|
+| `ModelInput.Tokens` | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |
+
+```python
+# TRT-LLM Worker - Register with Tokens
+await register_llm(
+    ModelInput.Tokens,      # Rust does minimal preprocessing
+    model_type,             # ModelType.Chat or ModelType.Prefill
+    generate_endpoint,
+    model_name,
+    ...
+)
+```
+
+## Inter-Component Communication
+
+| Transfer Stage | Message | NIXL Transfer |
+|----------------|---------|---------------|
+| **Frontend → Prefill** | Request with image URL or embedding path | No |
+| **Encode → Prefill (pre-computed)** | NIXL metadata | Yes (Embeddings tensor) |
+| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No |
+| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) |
+
+## Known Limitations
+
+- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
+- **No video support** - No video encoder implementation
+- **No audio support** - No audio encoder implementation
+- **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
+- **E/P/D mode is WIP** - Full E/P/D with image URLs under development
+- **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`)
+
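The head-count constraint in the last item is plain divisibility arithmetic. The sketch below checks it for the numbers quoted above; `valid_tp_sizes` is a hypothetical helper, and it considers only the attention-head constraint, not memory or any other sizing requirement.

```python
from typing import List


def valid_tp_sizes(num_attention_heads: int, max_tp: int) -> List[int]:
    """Tensor-parallel sizes up to max_tp that evenly divide the head count."""
    return [tp for tp in range(1, max_tp + 1) if num_attention_heads % tp == 0]


print(40 % 16)                 # 8, so TP=16 leaves a remainder
print(valid_tp_sizes(40, 16))  # [1, 2, 4, 5, 8, 10]
```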
+## Supported Models

 Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
+
+Common examples:
+- Llama 4 Vision models (Maverick, Scout)
+- Qwen2-VL models
+- Other vision-language models with TRT-LLM support
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
+| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
+| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
+| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
+| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
+
--- a/docs/multimodal/vllm.md
+++ b/docs/multimodal/vllm.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# vLLM Multimodal
+
+This document provides a comprehensive guide for multimodal inference using the vLLM backend in Dynamo.
+
+> [!IMPORTANT]
+> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
+> This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but extends it to all multimodal content (URLs, embeddings, base64).
+
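The startup rule described above can be sketched with `argparse`. This is not Dynamo's actual argument parser: the flag names are taken from this document, while `parse_args` and the error handling are assumptions made for illustration.

```python
import argparse

# Role flags named in this document; each requires --enable-multimodal.
MULTIMODAL_ROLE_FLAGS = (
    "multimodal_processor",
    "multimodal_encode_worker",
    "multimodal_worker",
    "multimodal_decode_worker",
    "multimodal_encode_prefill_worker",
)


def parse_args(argv):
    """Parse flags and fail if a multimodal role lacks --enable-multimodal."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--enable-multimodal", action="store_true")
    for name in MULTIMODAL_ROLE_FLAGS:
        parser.add_argument("--" + name.replace("_", "-"), action="store_true")
    args = parser.parse_args(argv)

    used = [name for name in MULTIMODAL_ROLE_FLAGS if getattr(args, name)]
    if used and not args.enable_multimodal:
        flags = ", ".join("--" + n.replace("_", "-") for n in used)
        raise ValueError(f"{flags} requires --enable-multimodal")
    return args


args = parse_args(["--enable-multimodal", "--multimodal-worker"])
```

Failing at parse time, rather than on the first request, means a misconfigured worker never starts serving.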
+## Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
+| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
+| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
+| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
+
+### Supported URL Formats
+
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
+| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |
+
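A data URL of the form shown in the table can be produced with the standard library. This is a minimal sketch; `to_data_url` is a hypothetical helper name, and the MIME type must match the actual image bytes.

```python
import base64


def to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


print(to_data_url(b"abc"))  # data:image/jpeg;base64,YWJj
```

In a request, the resulting string would presumably take the place of the HTTP URL in the `url` field of an `image_url` content item.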
+## Deployment Patterns
+
+vLLM supports all multimodal deployment patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
+
+| Pattern | Supported | Launch Script | Notes |
+|---------|-----------|---------------|-------|
+| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup |
+| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
+| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
+| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
+
+### Component Flags
+
+| Component | Flag | Purpose |
+|-----------|------|---------|
+| Processor | `--multimodal-processor` | HTTP entry, tokenization |
+| Encode Worker | `--multimodal-encode-worker` | Media encoding |
+| PD Worker | `--multimodal-worker` | Prefill + Decode |
+| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only |
+| Decode Worker | `--multimodal-decode-worker` | Decode only |
+| Encode+Prefill Worker | `--multimodal-encode-prefill-worker --is-prefill-worker` | Combined (Llama 4) |
+
+## Use the Latest Release
+
+We recommend using the latest stable release of Dynamo to avoid breaking changes:
+
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
+
+You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding tag with:
+
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
+```
+
+## Image Serving
+
+### E/PD Serving (Encode Separate)
+
+**Components:**
+
+- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+The EncodeWorkerHandler encodes the image and passes the embeddings to the MultimodalPDWorkerHandler via NATS and RDMA: the work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> encode_worker
+  encode_worker --> processor
+  encode_worker --embeddings--> pd_worker
+  pd_worker --> encode_worker
+```
+
+> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+# Serve a LLaVA 1.5 7B model:
+bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
+# Serve a Qwen2.5-VL model:
+bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "llava-hf/llava-1.5-7b-hf",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "What is in this image?"
+            },
+            {
+              "type": "image_url",
+              "image_url": {
+                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 300,
+      "temperature": 0.0,
+      "stream": false
+    }'
+```
+
+### E/P/D Serving (Full Disaggregation)
+
+**Components:**
+
+- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
+- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> encode_worker
+  encode_worker --> processor
+  encode_worker --embeddings--> prefill_worker
+  prefill_worker --> encode_worker
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
+```
+
+> [!NOTE]
+> Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
+
+## Llama 4 Serving
+
+The Llama 4 model family is natively multimodal. Unlike LLaVA, these models do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encode worker is not used, and encoding is done alongside prefill.
+
+Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8.
+
+### Llama 4 Aggregated Serving
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> pd_worker
+  pd_worker --> processor
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/agg_multimodal_llama.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "What is in this image?"
+            },
+            {
+              "type": "image_url",
+              "image_url": {
+                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 300,
+      "temperature": 0.0,
+      "stream": false
+    }'
+```
+
+### Llama 4 Disaggregated Serving
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> prefill_worker
+  prefill_worker --> processor
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_llama.sh --head-node
+
+# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node:
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_llama.sh
+```
+
+## Video Serving
+
+### Video Aggregated Serving
+
+**Components:**
+
+- workers: [VideoEncodeWorker](../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+The VideoEncodeWorker decodes the video into frames. Unlike the image pipeline which generates embeddings, this pipeline passes raw frames directly to the VllmPDWorker via NATS and RDMA.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --video_url--> video_encode_worker
+  video_encode_worker --> processor
+  video_encode_worker --frames--> pd_worker
+  pd_worker --> video_encode_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/video_agg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "Describe the video in detail"
+            },
+            {
+              "type": "video_url",
+              "video_url": {
+                "url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 300,
+      "stream": false
+    }' | jq
+```
+
+### Video Disaggregated Serving
+
+**Workflow:**
+
+For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. The VideoEncodeWorker is connected directly to the prefill worker, decoding the video into frames and passing them via RDMA.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --video_url--> video_encode_worker
+  video_encode_worker --> processor
+  video_encode_worker --frames--> prefill_worker
+  prefill_worker --> video_encode_worker
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/video_disagg.sh
+```
+
+## Audio Serving
+
+### Audio Aggregated Serving
+
+**Components:**
+
+- workers: [AudioEncodeWorker](../../examples/multimodal/components/audio_encode_worker.py) for encoding audio into embeddings, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --audio_url--> audio_encode_worker
+  audio_encode_worker --> processor
+  audio_encode_worker --embeddings--> pd_worker
+  pd_worker --> audio_encode_worker
+```
+
+**Launch:**
+
+```bash
+pip install "vllm[audio]" accelerate  # dependencies for multimodal audio models
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/audio_agg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "Qwen/Qwen2-Audio-7B-Instruct",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "What is recited in the audio?"
+            },
+            {
+              "type": "audio_url",
+              "audio_url": {
+                "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 6000,
+      "temperature": 0.8,
+      "stream": false
+    }' | jq
+```
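
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_audio_request` and `send_chat_request` are hypothetical and not part of Dynamo, and the endpoint and field values simply mirror the curl example above.

```python
import json
import urllib.request


def build_audio_request(model, prompt, audio_url, max_tokens=6000, temperature=0.8):
    """Build the same JSON body that the curl example sends (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat completions endpoint and parse the JSON reply.

    Assumes the frontend from the launch step is listening on base_url.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the aggregated audio deployment running, calling `send_chat_request(build_audio_request("Qwen/Qwen2-Audio-7B-Instruct", "What is recited in the audio?", "<audio URL>"))` should return the parsed response body that the curl example pipes through `jq`.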
+
+### Audio Disaggregated Serving
+
+**Workflow:**
+
+For the Qwen2-Audio model, audio embeddings are required only during the prefill stage, so the AudioEncodeWorker connects directly to the prefill worker.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --audio_url--> audio_encode_worker
+  audio_encode_worker --> processor
+  audio_encode_worker --embeddings--> prefill_worker
+  prefill_worker --> audio_encode_worker
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+pip install "vllm[audio]" accelerate # multimodal audio models dependency; quoted so the shell does not glob the brackets
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/audio_disagg.sh
+```
+
+## NIXL Usage
+
+| Use Case | Script | NIXL Used? | Data Transfer |
+|----------|--------|------------|---------------|
+| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker |
+| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
+| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
+| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
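
For tooling that needs to reason about which launch scripts involve NIXL transfers, the table above can be encoded as a small lookup. This is only an illustrative sketch: `NixlUsage`, `NIXL_USAGE`, and `scripts_transferring` are hypothetical names, not part of Dynamo, and the entries are copied from the table rather than read from the scripts.

```python
from typing import Dict, List, NamedTuple, Tuple


class NixlUsage(NamedTuple):
    uses_nixl: bool
    transfers: Tuple[str, ...]


# One entry per row of the table above (hypothetical structure, data from the table).
NIXL_USAGE: Dict[str, NixlUsage] = {
    "agg_multimodal.sh": NixlUsage(False, ()),
    "agg_multimodal_epd.sh": NixlUsage(True, ("Encoder -> PD (embeddings)",)),
    "disagg_multimodal_epd.sh": NixlUsage(
        True,
        ("Encoder -> Prefill (embeddings)", "Prefill -> Decode (KV cache)"),
    ),
    "disagg_multimodal_llama.sh": NixlUsage(True, ("Prefill -> Decode (KV cache)",)),
}


def scripts_transferring(kind: str) -> List[str]:
    """Return the scripts whose NIXL transfers mention `kind`, in table order."""
    return [
        script
        for script, usage in NIXL_USAGE.items()
        if any(kind in transfer for transfer in usage.transfers)
    ]
```

For example, `scripts_transferring("KV cache")` selects the two disaggregated scripts, which are the only rows where prefill and decode run in separate workers.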
+
+## ModelInput Types and Registration
+
+Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
+
+| ModelInput Type | Preprocessing | Use Case |
+|-----------------|---------------|----------|
+| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
+| `ModelInput.Tokens` | Tokenization by the Rust SDK (bypassed in multimodal flows) | Components expecting pre-tokenized input |
+
+**Registration Pattern:**
+
+```python
+# Processor - Entry point from HTTP frontend
+await register_llm(
+    ModelInput.Text,        # Frontend sends raw text
+    ModelType.Chat,
+    generate_endpoint,
+    model_name,
+    ...
+)
+
+# Workers - Internal components
+await register_llm(
+    ModelInput.Tokens,      # Expect pre-tokenized input
+    ModelType.Chat,         # or ModelType.Prefill for prefill workers
+    generate_endpoint,
+    model_name,
+    ...
+)
+```
+
+## Known Limitations
+
+- **Disaggregated flows require the Python Processor** - Every multimodal disaggregated flow needs the Python Processor component, which registers with `ModelInput.Text`.
+
+## Supported Models
+
+The following models have been tested with Dynamo's vLLM multimodal backend:
+
+- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
+- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
+- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
+- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
+- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
+- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
+
+For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not have been explicitly tested.
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
+| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
+| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
+| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation |
+| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
--- a/docs/project.json
+++ b/docs/project.json
-{"name": "NVIDIA Dynamo", "version": "dev"}
+{"name": "NVIDIA Dynamo", "version": "dev"}
\ No newline at end of file