docs: migrate Multimodal docs to three-tier structure (#5999)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit c5709980 authored Feb 05, 2026 by dagil-nvidia Committed by GitHub Feb 05, 2026
8 changed files
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -95,8 +95,13 @@ redirects = {
    "backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
    "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
    "multimodal/multimodal_intro": "index.html",
-    # Speculative decoding consolidation (PR speculative-migration)
+    # Speculative decoding consolidation
    "backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
+    # Multimodal migration to features/multimodal/
+    "multimodal/index": "../features/multimodal/README.html",
+    "multimodal/vllm": "../features/multimodal/multimodal_vllm.html",
+    "multimodal/sglang": "../features/multimodal/multimodal_sglang.html",
+    "multimodal/trtllm": "../features/multimodal/multimodal_trtllm.html",
 }

 # Custom extensions
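
The new redirect entries can be sanity-checked with a small sketch like the one below. It assumes the redirect extension resolves each target relative to the old page's location, which this diff does not confirm. `resolve_redirect` is a hypothetical helper, not part of `conf.py`.

```python
import posixpath

# Subset of the redirects added in this diff (old docname -> relative target).
redirects = {
    "multimodal/index": "../features/multimodal/README.html",
    "multimodal/vllm": "../features/multimodal/multimodal_vllm.html",
    "multimodal/sglang": "../features/multimodal/multimodal_sglang.html",
    "multimodal/trtllm": "../features/multimodal/multimodal_trtllm.html",
}


def resolve_redirect(source_docname: str, target: str) -> str:
    """Resolve a relative target against the old page's directory (assumed behavior)."""
    source_dir = posixpath.dirname(source_docname)
    return posixpath.normpath(posixpath.join(source_dir, target))


resolved = {src: resolve_redirect(src, dst) for src, dst in redirects.items()}
for src, dst in resolved.items():
    print(f"{src}.html -> {dst}")
```

Under that assumption, every old `multimodal/*` page should resolve to a page under `features/multimodal/`.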

--- a/docs/features/multimodal/README.md
+++ b/docs/features/multimodal/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Multimodal Inference in Dynamo
+
+Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
+
+> [!IMPORTANT]
+> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
+> See the relevant documentation for each backend for the necessary flags.
+>
+> This prevents unintended processing of multimodal data from untrusted sources.
+
+## Backend Documentation
+
+```{toctree}
+:maxdepth: 1
+
+vLLM Multimodal <multimodal_vllm.md>
+TensorRT-LLM Multimodal <multimodal_trtllm.md>
+SGLang Multimodal <multimodal_sglang.md>
+```
+
+## Support Matrix
+
+### Backend Capabilities
+
+| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
+|-------|------|-------|------|-----|-------|-------|-------|
+| **[vLLM](multimodal_vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
+| **[TRT-LLM](multimodal_trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
+| **[SGLang](multimodal_sglang.md)** | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
+
+\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
+
+**Pattern Key:**
+
+- **EPD** - All-in-one worker (Simple Aggregated)
+- **E/PD** - Separate encode, combined prefill+decode
+- **E/P/D** - All stages separate
+- **EP/D** - Combined encode+prefill, separate decode
+
+**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
+
+### Input Format Support
+
+| Format | vLLM | TRT-LLM | SGLang |
+|--------|------|---------|--------|
+| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
+| Data URL (Base64) | ✅ | ❌ | ❌ |
+| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
+
+## Architecture Patterns
+
+Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
+
+1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
+   - *Inline*: Simpler setup; encoding happens in the same worker as prefill (the EPD and EP/D patterns)
+   - *Separate*: A dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling (the E/PD and E/P/D patterns)
+
+2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
+   - *Aggregated*: Single worker handles both prefill and decode
+   - *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
+
+These combine into four deployment patterns:
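
As a rough sketch, the two dimensions map onto the four pattern names as shown below. `deployment_pattern` is a hypothetical helper written for illustration; it is not a Dynamo API.

```python
def deployment_pattern(separate_encode: bool, disaggregated: bool) -> str:
    """Map the two architecture dimensions to a pattern name from the Pattern Key."""
    if separate_encode:
        # A dedicated encode worker feeds either a combined PD worker or separate P and D workers.
        return "E/P/D" if disaggregated else "E/PD"
    # Encoding is inline: either everything in one worker, or encode+prefill split from decode.
    return "EP/D" if disaggregated else "EPD"


for separate_encode in (False, True):
    for disaggregated in (False, True):
        name = deployment_pattern(separate_encode, disaggregated)
        print(f"separate_encode={separate_encode}, disaggregated={disaggregated} -> {name}")
```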
+
+### EPD - Simple Aggregated
+
+All processing happens within a single worker - the simplest setup.
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Worker (Python)
+    ↓ image load + encode + prefill + decode
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
+| Worker | Complete inference pipeline (encode + prefill + decode) |
+
+**When to use:** Quick setup, smaller models, development/testing.
+
+### E/PD - Encode Separate
+
+Encoding happens in a separate worker; prefill and decode share the same engine.
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python)
+    ↓ downloads media, generates embeddings, NIXL transfer
+PD Worker (Python)
+    ↓ receives embeddings via NIXL, prefill + decode
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point |
+| Processor (Python) | Tokenization, extracts media URLs |
+| Encode Worker | Media encoding, embeddings generation |
+| PD Worker | Prefill + Decode with embeddings |
+
+**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.
+
+### E/P/D - Full Disaggregation
+
+Full disaggregation with separate workers for encoding, prefill, and decode.
+There are two variants of this workflow:
+- Prefill-first, used by vLLM
+- Decode-first, used by SGLang
+
+Prefill-first:
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python)
+    ↓ downloads media, generates embeddings, NIXL transfer
+Prefill Worker (Python)
+    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
+Decode Worker (Python)
+    ↓ decode only, token generation
+Response
+```
+
+OR
+
+Decode-first:
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python)
+    ↓ downloads media, generates embeddings, NIXL transfer
+Decode Worker (Python)
+    ↓ Bootstraps prefill worker
+Prefill Worker (Python)
+    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
+Decode Worker (Python)
+    ↓ decode only, token generation
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point |
+| Processor (Python) | Tokenization, extracts media URLs |
+| Encode Worker | Media encoding, embeddings generation |
+| Prefill Worker | Prefill only, transfers KV cache |
+| Decode Worker | Decode only, token generation |
+
+**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
+
+### EP/D - Traditional Disaggregated
+
+Encoding is combined with prefill, with decode separate.
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python)
+    ↓ tokenizes, extracts media URL
+Encode+Prefill Worker (Python)
+    ↓ downloads media, encodes inline, prefill, KV cache transfer
+Decode Worker (Python)
+    ↓ decode only, token generation
+Response
+```
+
+| Component | Purpose |
+|-----------|---------|
+| Frontend (Rust) | HTTP entry point |
+| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
+| Encode+Prefill Worker | Combined encoding and prefill |
+| Decode Worker | Decode only, token generation |
+
+> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
+> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
+
+**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
+
+## Example Workflows
+
+You can find example workflows and reference implementations for deploying multimodal models in:
+
+- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
+- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
+- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
+- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
--- a/docs/features/multimodal/multimodal_sglang.md
+++ b/docs/features/multimodal/multimodal_sglang.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# SGLang Multimodal
+
+This document provides a comprehensive guide for multimodal inference using the SGLang backend in Dynamo. SGLang multimodal supports **EPD**, **E/PD**, and **E/P/D** flows, with NIXL (RDMA) for zero-copy tensor transfer in disaggregated modes.
+
+## Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings |
+| **Image** | Data URL (Base64) | No | No |  |
+| **Video** | HTTP/HTTPS URL | No | No |  |
+| **Audio** | HTTP/HTTPS URL | No | No |  |
+
+### Supported URL Formats
+
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
+
+## Deployment Patterns
+
+SGLang supports EPD, E/PD, and E/P/D patterns. See [Multimodal Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
+
+| Pattern | Supported | Launch Script | Notes |
+|---------|-----------|---------------|-------|
+| EPD (Simple Aggregated) | ✅ | `agg.sh` | Internal encoding |
+| E/PD (Encode Separate) | ✅ | `multimodal_epd.sh` | Vision encoder separate |
+| E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap |
+| EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported |
+
+### Component Flags
+
+| Component | Flag | Purpose |
+|-----------|------|---------|
+| Processor | `--multimodal-processor` | HTTP entry, OpenAI→SGLang conversion |
+| Encode Worker | `--multimodal-encode-worker` | Vision encoder, embeddings generation |
+| PD Worker | `--multimodal-worker` | Prefill + Decode with embeddings |
+| Decode Worker | `--multimodal-worker --serving-mode=decode` | Entry point for disaggregation |
+| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | Called by Decode, bootstrap coordination |
+
+### SGLang-Specific Characteristics
+
+- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
+- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
+- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
+- **No Rust Processing**: All tokenization and image handling happens in Python
+
+## Use the Latest Release
+
+We recommend using the latest stable release of Dynamo to avoid breaking changes:
+
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
+
+You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
+
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
+```
+
+## EPD Serving (Simple Aggregated)
+
+### Components
+
+- worker: [DecodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/llm/decode_handler.py) handles encoding, prefilling, and decoding in a single process.
+
+### Workflow
+
+The `DecodeWorkerHandler` receives multimodal requests with image URLs and passes them directly to SGLang's engine. SGLang's internal `mm_data_processor` handles image fetching, loading, encoding, and token expansion.
+
+```mermaid
+flowchart LR
+  HTTP --> worker
+  worker --tokenized text + image_urls--> SGLang[SGLang Engine]
+```
+
+### Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "Describe the image."
+          },
+          {
+            "type": "image_url",
+            "image_url": {
+              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+            }
+          }
+        ]
+      }
+    ],
+    "max_tokens": 50,
+    "stream": false
+  }' | jq
+```
+
+## E/PD Serving (Encode Separate)
+
+### Components
+
+- workers:
+  - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
+  - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
+- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py)
+  - tokenizes the prompt using the chat template
+  - passes the text and image URL to the `MultimodalEncodeWorker`.
+
+### Workflow
+
+The `MultimodalEncodeWorker` downloads and encodes the image, then passes the embeddings to the `MultimodalWorker`. The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../../backends/sglang/README.md) example. Only the processor registers with the Dynamo frontend as an available endpoint; the workers do not register, because they are internal components that communicate via NATS.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --tokenized request + image_url--> encode_worker
+  encode_worker --request + embeddings--> worker
+
+  worker -.-> encode_worker
+  encode_worker -.-> processor
+  processor -.-> HTTP
+```
+
+
+### Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/multimodal_epd.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "Describe the image."
+          },
+          {
+            "type": "image_url",
+            "image_url": {
+              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+            }
+          }
+        ]
+      }
+    ],
+    "max_tokens": 50,
+    "stream": false
+  }' | jq
+```
+
+## E/P/D Serving (Full Disaggregation)
+
+### Components
+
+- workers:
+  - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
+  - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding
+  - [MultimodalPrefillWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling
+- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker.
+
+### Workflow
+
+In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --tokenized request + image_url--> encode_worker
+  encode_worker --request + embeddings--> worker
+  worker --request + embeddings--> prefill_worker
+
+  prefill_worker --KV Cache--> worker
+  encode_worker -.-> processor
+  worker -.-> encode_worker
+  processor -.-> HTTP
+```
+
+### Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/multimodal_disagg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "Describe the image."
+          },
+          {
+            "type": "image_url",
+            "image_url": {
+              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+            }
+          }
+        ]
+      }
+    ],
+    "max_tokens": 50,
+    "stream": false
+  }' | jq
+```
+
+## Bootstrap Coordination
+
+SGLang disaggregation uses a bootstrap mechanism for P->D coordination:
+
+### Request Flow (Important)
+
+```text
+Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
+                                               ↑
+                                    Entry point for disaggregation!
+```
+
+### Bootstrap Process
+
+1. **Decode Worker** receives request from Encode Worker
+2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info
+3. **Prefill Worker** generates `{host, port, room}` and returns immediately
+4. **Both workers** connect to same "room" using bootstrap coordinates
+5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL)
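
The steps above can be sketched as a toy, in-process simulation. It only illustrates the order of operations. The real exchange runs over NATS and SGLang's internal bootstrap connection, and every class and method name here is hypothetical.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class BootstrapInfo:
    host: str
    port: int
    room: int


class ToyPrefillWorker:
    def __init__(self, host: str, port: int) -> None:
        self.host, self.port = host, port
        self._rooms = itertools.count(1)
        self.kv_by_room: dict[int, list[str]] = {}

    def request_bootstrap(self) -> BootstrapInfo:
        """Step 3: hand out {host, port, room} immediately, before prefill runs."""
        return BootstrapInfo(self.host, self.port, next(self._rooms))

    def prefill(self, info: BootstrapInfo, token_ids: list[int]) -> None:
        """Producer side of step 5: publish a stand-in for the KV cache in the room."""
        self.kv_by_room[info.room] = [f"kv({t})" for t in token_ids]


class ToyDecodeWorker:
    def handle(self, prefill: ToyPrefillWorker, token_ids: list[int]):
        info = prefill.request_bootstrap()      # steps 1-3: decode asks prefill for coordinates
        prefill.prefill(info, token_ids)        # step 4: both sides use the same room
        kv_cache = prefill.kv_by_room.pop(info.room)  # step 5: decode picks up the KV state
        return info, kv_cache


info, kv_cache = ToyDecodeWorker().handle(ToyPrefillWorker("10.0.0.1", 9000), [1, 2, 3])
print(info, kv_cache)
```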
+
+### Key Difference from vLLM
+
+- vLLM: Frontend → Prefill → Decode (Prefill is entry point)
+- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point)
+
+## Inter-Component Communication
+
+### Control Flow (NATS)
+
+All component-to-component communication happens via NATS:
+
+#### E/PD Mode (Encode Separate)
+
+```text
+Processor → Encode Worker → PD Worker
+  (NATS)        (NATS + NIXL embeddings)
+```
+
+#### E/P/D Mode (Full Disaggregation)
+
+```text
+Processor → Encode Worker → DECODE Worker → Prefill Worker
+  (NATS)        (NATS)            (NATS)
+                             ↓
+                    Decode requests bootstrap
+                             ↓
+                    Prefill returns {host, port, room}
+                             ↓
+                    Both connect via bootstrap
+                             ↓
+                    SGLang internal KV cache transfer
+```
+
+### Detailed Message Flow
+
+```text
+Processor → Encode Worker:
+  - NATS round_robin with SglangMultimodalRequest
+  - Contains: tokenized input_ids, image URL, sampling params
+
+Encode Worker → Decode/PD Worker:
+  - NATS round_robin to "backend" component
+  - Contains: expanded token_ids, NIXL metadata, embeddings shape
+  - NIXL transfer: embeddings tensor
+
+Decode Worker → Prefill Worker (disagg only):
+  - NATS call to "prefill" component
+  - Decode requests bootstrap coordinates
+  - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}
+
+Prefill ↔ Decode (via bootstrap):
+  - SGLang internal connection (not NATS)
+  - KV cache state shared via bootstrap mechanism
+```
+
+### Data Transfer (NIXL)
+
+NIXL is used only for embedding transfer:
+
+```python
+# Encode Worker
+descriptor = connect.Descriptor(precomputed_embeddings)
+with connector.create_readable(descriptor) as readable:
+    request.serialized_request = readable.metadata()
+    await pd_worker_client.round_robin(request)
+    await readable.wait_for_completion()
+
+# PD Worker
+embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
+descriptor = connect.Descriptor(embeddings)
+read_op = await connector.begin_read(request.serialized_request, descriptor)
+await read_op.wait_for_completion()
+```
+
+## Vision Encoding Details
+
+### Encode Worker Components
+
+The encode worker loads and runs the vision model in Python:
+
+```python
+self.image_processor = AutoImageProcessor.from_pretrained(
+    model_path, trust_remote_code=True
+)
+self.vision_model = AutoModel.from_pretrained(
+    model_path,
+    device_map="auto",
+    torch_dtype=torch.float16,
+    trust_remote_code=True
+)
+```
+
+### Token Expansion Process
+
+1. Processor inserts single image token (e.g., `<|image_pad|>`)
+2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)`
+3. Encode worker replaces single token with `num_patches` tokens
+4. Downstream worker receives expanded token sequence
+
+Example:
+
+```python
+# Before: ["Hello", "<|image_pad|>", "world"]
+# After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
+```
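
A minimal runnable sketch of the expansion step follows, operating on token ids rather than strings. The function name and the placeholder id `99` are hypothetical. The actual encode worker logic is model-specific and may differ.

```python
def expand_image_token(token_ids: list[int], image_token_id: int, num_patches: int) -> list[int]:
    """Replace the single image placeholder with num_patches copies of itself."""
    if token_ids.count(image_token_id) != 1:
        raise ValueError("expected exactly one image placeholder token")
    idx = token_ids.index(image_token_id)
    return token_ids[:idx] + [image_token_id] * num_patches + token_ids[idx + 1:]


IMAGE_PAD = 99  # hypothetical id standing in for <|image_pad|>
expanded = expand_image_token([1, IMAGE_PAD, 2], IMAGE_PAD, num_patches=576)
print(len(expanded))  # 1 + 576 + 1 = 578
```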
+
+## Chat Template Processing
+
+SGLang uses its own chat template system:
+
+```python
+from sglang.srt.parser.conversation import chat_templates
+
+conv = chat_templates["qwen2-vl"].copy()
+conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
+processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
+```
+
+Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
+
+## NIXL Usage
+
+| Use Case | NIXL Used? | Data Transfer | Notes |
+|----------|------------|---------------|-------|
+| EPD (Simple Aggregated) | No | N/A | All processing internal to SGLang |
+| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate |
+| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
+
+**Key Difference:** Unlike vLLM, which uses NIXL for the KV cache, SGLang P/D transfers the KV cache through its own bootstrap mechanism.
+
+## Known Limitations
+
+- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
+- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request
+- **No video support** - No video encoder implementation
+- **No audio support** - No audio encoder implementation
+- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
+- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers
+- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates
+
+## Supported Models
+
+SGLang multimodal **only supports image-based vision-language models**:
+
+- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
+- Models with `AutoImageProcessor` and vision tower
+- Models compatible with SGLang's image embedding format
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/sglang/main.py` | Component initialization, only Processor registers |
+| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang |
+| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
+| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
+| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
+| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
+| `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) |
--- a/docs/features/multimodal/multimodal_trtllm.md
+++ b/docs/features/multimodal/multimodal_trtllm.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# TensorRT-LLM Multimodal
+
+This document provides a comprehensive guide for multimodal inference using the TensorRT-LLM backend in Dynamo.
+
+You can provide multimodal inputs in the following ways:
+- By sending image URLs
+- By providing paths to pre-computed embedding files
+
+> **Note:** Provide **either image URLs or embedding file paths** in a single request, not both.
+
+## Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
+| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
+| **Video** | HTTP/HTTPS URL | No | No | Not implemented |
+| **Audio** | HTTP/HTTPS URL | No | No | Not implemented |
+
+### Supported URL Formats
+
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
+| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) |
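
To illustrate the two input kinds and the "either/or" rule, here is a hedged sketch of a client-side check. `classify_input` and `validate_request` are hypothetical helpers. The backend's own validation may be implemented differently.

```python
from urllib.parse import urlparse

EMBEDDING_SUFFIXES = (".pt", ".pth", ".bin")


def classify_input(ref: str) -> str:
    """Classify a reference as an image URL, an embedding file path, or unsupported."""
    if urlparse(ref).scheme.lower() in ("http", "https"):
        return "image_url"
    if ref.lower().endswith(EMBEDDING_SUFFIXES):
        return "embedding_file"
    return "unsupported"


def validate_request(refs: list[str]) -> str:
    """Require a single request to use only one kind of input."""
    if not refs:
        raise ValueError("no multimodal inputs given")
    kinds = {classify_input(ref) for ref in refs}
    if "unsupported" in kinds:
        raise ValueError("unsupported input format")
    if len(kinds) > 1:
        raise ValueError("mixing image URLs and embedding files in one request")
    return kinds.pop()


print(validate_request(["http://example.com/image.jpg"]))
print(validate_request(["/path/to/embedding.pt"]))
```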
+
+## Deployment Patterns
+
+TRT-LLM supports aggregated, traditional disaggregated (EP/D), and full E/P/D patterns. See [Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
+
+| Pattern | Supported | Launch Script | Notes |
+|---------|-----------|---------------|-------|
+| Aggregated | ✅ | `agg.sh` | Easiest setup, single worker |
+| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding, 2 workers |
+| E/P/D (Full - Image URLs) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with `MultimodalEncoder`, 3 workers |
+| E/P/D (Full - Pre-computed Embeddings) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with NIXL transfer, 3 workers |
+| E/P/D (Large Models) | ✅ | `epd_disagg.sh` | For Llama-4 Scout/Maverick, multi-node |
+
+### Component Flags
+
+| Component | Flag | Purpose |
+|-----------|------|---------|
+| Worker | `--modality multimodal` | Complete pipeline (aggregated) |
+| Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) |
+| Decode Worker | `--disaggregation-mode decode` | Decode only |
+| Encode Worker | `--disaggregation-mode encode` | Image encoding (E/P/D flow) |
+
+## Aggregated Serving
+
+Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:
+
+```bash
+cd $DYNAMO_HOME
+
+export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
+export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+./examples/backends/trtllm/launch/agg.sh
+```
+
+**Client:**
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Describe the image"
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
+                    }
+                }
+            ]
+        }
+    ],
+    "stream": false,
+    "max_tokens": 160
+}'
+```
+
+## Disaggregated Serving
+
+Example using `Qwen/Qwen2-VL-7B-Instruct`:
+
+```bash
+cd $DYNAMO_HOME
+
+export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
+export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
+export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
+export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
+export MODALITY="multimodal"
+
+./examples/backends/trtllm/launch/disagg.sh
+```
+
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+    "model": "Qwen/Qwen2-VL-7B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Describe the image"
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
+                    }
+                }
+            ]
+        }
+    ],
+    "stream": false,
+    "max_tokens": 160
+}'
+```
+
+For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
+
+## Full E/P/D Flow (Image URLs)
+
+For high-performance multimodal inference, Dynamo supports a standalone encoder with an **Encode-Prefill-Decode (E/P/D)** flow using TRT-LLM's `MultimodalEncoder`. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability.
+
+### Supported Input Formats
+
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS URL** | `https://example.com/image.jpg` | Remote image files |
+| **Base64 Data URL** | `data:image/jpeg;base64,...` | Inline base64-encoded images |
+
+### How It Works
+
+In the full E/P/D flow:
+
+1. **Encode Worker**: Runs TRT-LLM's `MultimodalEncoder.generate()` to process image URLs through the vision encoder and projector
+2. **Prefill Worker**: Receives `disaggregated_params` containing multimodal embedding handles, processes context and generates KV cache
+3. **Decode Worker**: Performs streaming token generation using the KV cache
+
+The encode worker uses TRT-LLM's `MultimodalEncoder` class (which inherits from `BaseLLM`) and only requires the model path and batch size - no KV cache configuration is needed since it only runs the vision encoder + projector.
+
+### How to Launch
+
+```bash
+cd $DYNAMO_HOME
+
+# Launch 3-worker E/P/D flow with image URL support
+./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh
+```
+
+### Example Request
+
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+    "model": "llava-v1.6-mistral-7b-hf",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe the image"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
+                    }
+                }
+            ]
+        }
+    ],
+    "max_tokens": 160
+}'
+```
+
+### E/P/D Architecture (Image URLs)
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant Frontend
+    participant PrefillWorker as "Prefill Worker"
+    participant EncodeWorker as "Encode Worker"
+    participant DecodeWorker as "Decode Worker"
+
+    Client->>Frontend: POST /v1/chat/completions (image URL)
+    Frontend->>PrefillWorker: Route to prefill worker
+    PrefillWorker->>EncodeWorker: Send request (image URL)
+    Note over EncodeWorker: MultimodalEncoder.generate()<br/>runs vision encoder + projector
+    EncodeWorker->>PrefillWorker: Return disaggregated_params<br/>(multimodal_embedding_handles)
+    Note over PrefillWorker: Process context with embeddings<br/>Generate KV cache
+    PrefillWorker->>Frontend: Return prefill response
+    Frontend->>DecodeWorker: Route to decode worker
+    DecodeWorker->>Frontend: Stream response chunks
+    Frontend->>Client: Stream response
+```
+
+### Key Differences from EP/D (Traditional Disaggregated)
+
+| Aspect | EP/D (Traditional) | E/P/D (Full) |
+|--------|-------------------|--------------|
+| **Encoding** | Prefill worker handles image encoding | Dedicated encode worker |
+| **Prefill Load** | Higher (encoding + prefill) | Lower (prefill only) |
+| **Use Case** | Simpler setup | Better scalability for vision-heavy workloads |
+| **Launch Script** | `disagg_multimodal.sh` | `epd_multimodal_image_and_embeddings.sh` |
+
+## Pre-computed Embeddings with E/P/D Flow
+
+For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
+
+### Supported File Types
+
+- `.pt` - PyTorch tensor files
+- `.pth` - PyTorch checkpoint files
+- `.bin` - Binary tensor files
+
+### Embedding File Formats
+
+TRT-LLM supports two formats for embedding files:
+
+**1. Simple Tensor Format**
+
+Direct tensor saved as `.pt` file containing only the embedding tensor:
+
+```python
+import torch
+
+embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
+torch.save(embedding_tensor, "embedding.pt")
+```
+
+**2. Dictionary Format with Auxiliary Data**
+
+Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:
+
+```python
+import torch
+
+embedding_dict = {
+    "mm_embeddings": torch.rand(1, 576, 4096),
+    "special_tokens": [128256, 128257],
+    "image_token_offsets": [[0, 576]],
+    # ... other model-specific metadata
+}
+torch.save(embedding_dict, "llama4_embedding.pt")
+```
+
+- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
+- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data
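+
+The two loading rules above can be sketched as a single function. This is an illustration only: `split_embedding_payload` is a hypothetical name, and `loaded` stands in for whatever `torch.load()` returned, so the sketch runs without PyTorch installed.
+
+```python
+from typing import Any, Dict, Optional, Tuple
+
+
+def split_embedding_payload(loaded: Any) -> Tuple[Any, Optional[Dict[str, Any]]]:
+    """Separate the main embedding from auxiliary data (hypothetical helper)."""
+    if isinstance(loaded, dict):
+        # Dictionary format: `mm_embeddings` is the main tensor.
+        if "mm_embeddings" not in loaded:
+            raise KeyError("dictionary format requires an 'mm_embeddings' key")
+        auxiliary = {k: v for k, v in loaded.items() if k != "mm_embeddings"}
+        return loaded["mm_embeddings"], auxiliary
+    # Simple tensor format: the loaded object itself is the embedding.
+    return loaded, None
+```
+
+Returning `None` for the auxiliary data in the simple case lets a caller tell "no metadata" apart from "an empty metadata dictionary".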
+
+### How to Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+
+# Launch 3-worker E/P/D flow with NIXL
+./launch/epd_disagg.sh
+```
+
+> **Note:** This script is designed for an 8-node H200 setup with the `Llama-4-Scout-17B-16E-Instruct` model and assumes you have a model-specific embedding file ready.
+
+### Configuration
+
+```bash
+# Encode endpoint for Prefill → Encode communication
+export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
+
+# Security: Allowed directory for embedding files (default: /tmp)
+export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
+
+# Security: Max file size to prevent DoS attacks (default: 50MB)
+export MAX_FILE_SIZE_MB=50
+```
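+
+To illustrate what the two security settings guard against, here is a sketch of a path check driven by the same environment variables. It is an assumption about how such a check could look, not Dynamo's actual implementation; `validate_embedding_path` and `ALLOWED_SUFFIXES` are hypothetical names, and the suffix set mirrors the file types listed under Supported File Types.
+
+```python
+import os
+from pathlib import Path
+
+# File extensions listed under "Supported File Types".
+ALLOWED_SUFFIXES = {".pt", ".pth", ".bin"}
+
+
+def validate_embedding_path(path: str) -> Path:
+    """Hypothetical check; not Dynamo's actual implementation."""
+    allowed_root = Path(os.environ.get("ALLOWED_LOCAL_MEDIA_PATH", "/tmp")).resolve()
+    max_bytes = int(os.environ.get("MAX_FILE_SIZE_MB", "50")) * 1024 * 1024
+
+    # resolve() collapses ".." segments and follows symlinks before the check.
+    resolved = Path(path).resolve()
+    try:
+        resolved.relative_to(allowed_root)
+    except ValueError:
+        raise PermissionError(f"{resolved} is outside {allowed_root}") from None
+    if resolved.suffix not in ALLOWED_SUFFIXES:
+        raise ValueError(f"unsupported embedding file type: {resolved.suffix!r}")
+    if not resolved.is_file():
+        raise FileNotFoundError(resolved)
+    if resolved.stat().st_size > max_bytes:
+        raise ValueError(f"{resolved} exceeds {max_bytes} bytes")
+    return resolved
+```
+
+Resolving the path before comparing it with the allowed directory is the important step: a request such as `/tmp/../etc/passwd` would otherwise pass a naive prefix check.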
+
+### Example Request with Pre-computed Embeddings
+
+```bash
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe the image"},
+                {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
+            ]
+        }
+    ],
+    "max_tokens": 160
+}'
+```
+
+### E/P/D Architecture
+
+The E/P/D flow implements a **3-worker architecture**:
+
+- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL
+- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation
+- **Decode Worker**: Performs streaming token generation
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant Frontend
+    participant PrefillWorker as "Prefill Worker"
+    participant EncodeWorker as "Encode Worker"
+    participant DecodeWorker as "Decode Worker"
+    participant NIXL as "NIXL (RDMA)"
+
+    Client->>Frontend: POST /v1/chat/completions
+    Frontend->>PrefillWorker: Route to prefill worker
+    PrefillWorker->>EncodeWorker: Send request (embedding paths)
+    EncodeWorker->>NIXL: Create readable operation
+    EncodeWorker->>PrefillWorker: Send metadata + NIXL info
+    PrefillWorker->>NIXL: Begin read operation
+    NIXL-->>PrefillWorker: Zero-copy transfer complete
+    PrefillWorker->>Frontend: Return prefill response
+    Frontend->>DecodeWorker: Route to decode worker
+    DecodeWorker->>Frontend: Stream response chunks
+    Frontend->>Client: Stream response
+```
+
+## Multi-node Deployment (Slurm)
+
+This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
+
+> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
+
+### Environment Setup
+
+Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
+
+```bash
+# Container image (build using docs/backends/trtllm/README.md#build-container)
+export IMAGE="<dynamo_trtllm_image>"
+
+# Host:container path pairs for mounting
+export MOUNTS="${PWD}/../../../../:/mnt"
+
+# Model configuration
+export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+export MODALITY=${MODALITY:-"multimodal"}
+```
+
+### Multi-node Disaggregated Launch
+
+For four 4xGB200 nodes (2 for prefill, 2 for decode):
+
+```bash
+# Customize parallelism to match your engine configs
+# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
+# export NUM_PREFILL_NODES=2
+# export NUM_DECODE_NODES=2
+# export NUM_GPUS_PER_NODE=4
+
+# Launches frontend + etcd/nats on head node, plus prefill and decode workers
+./srun_disaggregated.sh
+```
+
+### Understanding the Output
+
+1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
+2. The OpenAI frontend will dynamically discover workers as they register:
+   ```text
+   INFO dynamo_run::input::http: Watching for remote model at models
+   INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
+   ```
+3. TRT-LLM workers output progress from each MPI rank while loading
+4. When ready, the frontend logs:
+   ```text
+   INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
+   ```
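+
+If you capture the frontend output, the readiness message in step 4 can be detected programmatically. The sketch below assumes the log line keeps the `added model model_name="..."` shape shown above; `first_registered_model` is a hypothetical helper, not part of Dynamo.
+
+```python
+import re
+from typing import Iterable, Optional
+
+# Matches the frontend log line emitted when a worker registers a model.
+MODEL_ADDED = re.compile(r'added model model_name="(?P<name>[^"]+)"')
+
+
+def first_registered_model(log_lines: Iterable[str]) -> Optional[str]:
+    """Return the first model name announced in the log, or None."""
+    for line in log_lines:
+        match = MODEL_ADDED.search(line)
+        if match:
+            return match.group("name")
+    return None
+```
+
+Because the function accepts any iterable of lines, it can be fed an open log file and stops reading as soon as the first registration appears.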
+
+### Cleanup
+
+```bash
+pkill srun
+```
+
+## NIXL Usage
+
+| Use Case | Script | NIXL Used? | Data Transfer |
+|----------|--------|------------|---------------|
+| Aggregated | `agg.sh` | No | All in one worker |
+| EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
+| E/P/D (Image URLs) | `epd_multimodal_image_and_embeddings.sh` | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
+| E/P/D (Pre-computed Embeddings) | `epd_multimodal_image_and_embeddings.sh` | Yes | Encoder → Prefill (embeddings via NIXL RDMA) |
+| E/P/D (Large Models) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) |
+
+> **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
+
+## ModelInput Types and Registration
+
+TRT-LLM workers register with Dynamo using:
+
+| ModelInput Type | Preprocessing | Use Case |
+|-----------------|---------------|----------|
+| `ModelInput.Tokens` | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |
+
+```python
+# TRT-LLM Worker - Register with Tokens
+await register_llm(
+    ModelInput.Tokens,      # Rust does minimal preprocessing
+    model_type,             # ModelType.Chat or ModelType.Prefill
+    generate_endpoint,
+    model_name,
+    ...
+)
+```
+
+## Inter-Component Communication
+
+| Transfer Stage | Message | NIXL Transfer |
+|----------------|---------|---------------|
+| **Frontend → Prefill** | Request with image URL or embedding path | No |
+| **Prefill → Encode (Image URL)** | Request with image URL | No |
+| **Encode → Prefill (Image URL)** | `ep_disaggregated_params` with `multimodal_embedding_handles`, processed prompt, and token IDs | No |
+| **Prefill → Encode (Embedding Path)** | Request with embedding file path | No |
+| **Encode → Prefill (Embedding Path)** | NIXL readable metadata + shape/dtype + auxiliary data | Yes (Embeddings tensor via RDMA) |
+| **Prefill → Decode** | `disaggregated_params` with `_epd_metadata` (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) |
+
+## Known Limitations
+
+- **No video support** - No video encoder implementation
+- **No audio support** - No audio encoder implementation
+- **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
+- **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`)
+- **llava-v1.6-mistral-7b-hf model crash** - Known issue with TRTLLM backend compatibility with `TensorRT LLM version: 1.2.0rc6.post1`. To use the LLaVA model, download revision `52320fb52229` locally from Hugging Face.
+- **Embeddings file crash** - Known issue with TRTLLM backend compatibility with `TensorRT LLM version: 1.2.0rc6.post1`. Embedding file parsing crashes in `attach_multimodal_embeddings()`. To be fixed in the next TRTLLM upgrade.
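+
+The head count constraint in the multi-node H100 limitation is plain divisibility, which the sketch below makes concrete. It checks only that one condition; a real deployment has further constraints (GPU count, memory, other head dimensions) that are not modeled here, and `valid_tp_sizes` is a hypothetical helper.
+
+```python
+from typing import List
+
+
+def valid_tp_sizes(num_attention_heads: int, max_tp: int) -> List[int]:
+    """List tensor-parallel sizes that evenly divide the attention head count."""
+    return [tp for tp in range(1, max_tp + 1) if num_attention_heads % tp == 0]
+
+
+# With 40 attention heads, TP=16 leaves a remainder of 8, so it is rejected.
+print(valid_tp_sizes(40, 16))  # [1, 2, 4, 5, 8, 10]
+```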
+
+## Supported Models
+
+Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
+
+Common examples:
+- **Llama 4 Vision models** (Maverick, Scout) - Recommended for large-scale deployments
+- **LLaVA models** (e.g., `llava-hf/llava-v1.6-mistral-7b-hf`) - Default model for E/P/D examples
+- **Qwen2-VL models** - Supported in traditional disaggregated mode
+- Other vision-language models with TRT-LLM support
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
+| `components/src/dynamo/trtllm/engine.py` | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) |
+| `components/src/dynamo/trtllm/constants.py` | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) |
+| `components/src/dynamo/trtllm/encode_helper.py` | Encode worker request processing (embedding-path and full EPD flows) |
+| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
+| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) |
+| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler with disaggregated params encoding/decoding |
+| `components/src/dynamo/trtllm/utils/disagg_utils.py` | DisaggregatedParamsCodec for network transfer |
+| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
+
--- a/docs/features/multimodal/multimodal_vllm.md
+++ b/docs/features/multimodal/multimodal_vllm.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# vLLM Multimodal
+
+This document provides a comprehensive guide for multimodal inference using the vLLM backend in Dynamo.
+
+> [!IMPORTANT]
+> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
+> This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but it extends to all multimodal content (URLs, embeddings, and base64 data).
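+
+The startup rule described above can be pictured with a small `argparse` sketch. This is an illustration of the gating behavior only, not Dynamo's argument parser: it models just two of the component flags, and the error text is made up.
+
+```python
+import argparse
+from typing import List
+
+# Only two role flags are modeled in this sketch.
+ROLE_FLAGS = ("multimodal_processor", "multimodal_worker")
+
+
+def parse_args(argv: List[str]) -> argparse.Namespace:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--enable-multimodal", action="store_true")
+    parser.add_argument("--multimodal-processor", action="store_true")
+    parser.add_argument("--multimodal-worker", action="store_true")
+    args = parser.parse_args(argv)
+
+    # Fail at startup if a multimodal role is requested without the opt-in flag.
+    used = [flag for flag in ROLE_FLAGS if getattr(args, flag)]
+    if used and not args.enable_multimodal:
+        parser.error("multimodal flags require --enable-multimodal")
+    return args
+```
+
+`parser.error()` prints the message and exits with status 2, so a misconfigured worker stops before it can accept any request.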
+
+## Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
+| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
+| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
+| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
+
+### Supported URL Formats
+
+| Format | Example | Description |
+|--------|---------|-------------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
+| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |
+
+## Deployment Patterns
+
+vLLM supports all multimodal deployment patterns. See [Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
+
+| Pattern | Supported | Launch Script | Notes |
+|---------|-----------|---------------|-------|
+| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup |
+| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
+| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
+| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
+| E/PD (EC Connector) | ✅ | `agg_multimodal_ec_connector.sh` | vLLM-native encoder with ECConnector |
+
+### Component Flags
+
+| Component | Flag | Purpose |
+|-----------|------|---------|
+| Processor | `--multimodal-processor` | HTTP entry, tokenization |
+| Encode Worker | `--multimodal-encode-worker` | Media encoding |
+| PD Worker | `--multimodal-worker` | Prefill + Decode |
+| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only |
+| Decode Worker | `--multimodal-decode-worker` | Decode only |
+| Encode+Prefill Worker | `--multimodal-encode-prefill-worker --is-prefill-worker` | Combined (Llama 4) |
+| vLLM Native Encoder | `--vllm-native-encoder-worker` | vLLM-native encoding with ECConnector |
+
+## Use the Latest Release
+
+We recommend using the latest stable release of Dynamo to avoid breaking changes:
+
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
+
+You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding tag with:
+
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
+```
+
+## Image Serving
+
+### E/PD Serving (Encode Separate)
+
+**Components:**
+
+- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+The EncodeWorkerHandler encodes the image and passes the embeddings to the MultimodalPDWorkerHandler via NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> encode_worker
+  encode_worker --> processor
+  encode_worker --embeddings--> pd_worker
+  pd_worker --> encode_worker
+```
+
+> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+# Serve a LLaVA 1.5 7B model:
+bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
+# Serve a Qwen2.5-VL model:
+bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "llava-hf/llava-1.5-7b-hf",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "What is in this image?"
+            },
+            {
+              "type": "image_url",
+              "image_url": {
+                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 300,
+      "temperature": 0.0,
+      "stream": false
+    }'
+```
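+
+For clients written in Python, the same request can be assembled with the standard library. The sketch below only builds the request object; `build_chat_request` is a hypothetical helper, and the field values mirror the `curl` example above.
+
+```python
+import json
+import urllib.request
+
+
+def build_chat_request(
+    base_url: str, model: str, prompt: str, image_url: str, max_tokens: int = 300
+) -> urllib.request.Request:
+    """Build (but do not send) a chat completions request with one image."""
+    payload = {
+        "model": model,
+        "messages": [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": prompt},
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                ],
+            }
+        ],
+        "max_tokens": max_tokens,
+        "temperature": 0.0,
+        "stream": False,
+    }
+    return urllib.request.Request(
+        f"{base_url}/v1/chat/completions",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+```
+
+Passing the returned object to `urllib.request.urlopen()` sends it to a running frontend. Keeping construction separate from sending makes the payload easy to inspect or log first.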
+
+### E/P/D Serving (Full Disaggregation)
+
+**Components:**
+
+- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
+- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> encode_worker
+  encode_worker --> processor
+  encode_worker --embeddings--> prefill_worker
+  prefill_worker --> encode_worker
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
+```
+
+> [!NOTE]
+> Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
+
+## ECConnector Serving
+
+ECConnector is vLLM's native connector for transferring multimodal embeddings via an Embedding Cache. The encoder worker acts as a **producer** (writes embeddings), while the PD worker acts as a **consumer** (reads embeddings).
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor[EC Processor]
+  processor --image_url--> encoder[vLLM Native Encoder<br/>Producer]
+  encoder --writes--> cache[(Embedding Cache)]
+  cache --reads--> pd[PD Worker<br/>Consumer]
+  pd --> processor
+  processor --> HTTP
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/agg_multimodal_ec_connector.sh --model llava-hf/llava-1.5-7b-hf
+
+# Custom storage path for Embedding Cache
+bash launch/agg_multimodal_ec_connector.sh --ec-storage-path /shared/encoder-cache
+```
+
+**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
+
+## Llama 4 Serving
+
+The Llama 4 model family is natively multimodal. Unlike LLaVA, these models do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
+
+Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on 8xH100.
+
+### Llama 4 Aggregated Serving
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> pd_worker
+  pd_worker --> processor
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/agg_multimodal_llama.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "What is in this image?"
+            },
+            {
+              "type": "image_url",
+              "image_url": {
+                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 300,
+      "temperature": 0.0,
+      "stream": false
+    }'
+```
+
+### Llama 4 Disaggregated Serving
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> prefill_worker
+  prefill_worker --> processor
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_llama.sh --head-node
+
+# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node:
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_llama.sh
+```
+
+## Video Serving
+
+### Video Aggregated Serving
+
+**Components:**
+
+- workers: [VideoEncodeWorker](../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+The VideoEncodeWorker decodes the video into frames. Unlike the image pipeline, which generates embeddings, this pipeline passes raw frames directly to the VllmPDWorker via NATS and RDMA.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --video_url--> video_encode_worker
+  video_encode_worker --> processor
+  video_encode_worker --frames--> pd_worker
+  pd_worker --> video_encode_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/video_agg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "Describe the video in detail"
+            },
+            {
+              "type": "video_url",
+              "video_url": {
+                "url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 300,
+      "stream": false
+    }' | jq
+```
+
+### Video Disaggregated Serving
+
+**Workflow:**
+
+For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. The VideoEncodeWorker is connected directly to the prefill worker, decoding the video into frames and passing them via RDMA.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --video_url--> video_encode_worker
+  video_encode_worker --> processor
+  video_encode_worker --frames--> prefill_worker
+  prefill_worker --> video_encode_worker
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/video_disagg.sh
+```
+
+## Audio Serving
+
+### Audio Aggregated Serving
+
+**Components:**
+
+- workers: [AudioEncodeWorker](../../examples/multimodal/components/audio_encode_worker.py) for encoding audio into embeddings, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
+- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
+- frontend: HTTP endpoint to handle incoming requests.
+
+**Workflow:**
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --audio_url--> audio_encode_worker
+  audio_encode_worker --> processor
+  audio_encode_worker --embeddings--> pd_worker
+  pd_worker --> audio_encode_worker
+```
+
+**Launch:**
+
+```bash
+pip install 'vllm[audio]' accelerate # multimodal audio models dependency
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/audio_agg.sh
+```
+
+**Client:**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "Qwen/Qwen2-Audio-7B-Instruct",
+      "messages": [
+        {
+          "role": "user",
+          "content": [
+            {
+              "type": "text",
+              "text": "What is recited in the audio?"
+            },
+            {
+              "type": "audio_url",
+              "audio_url": {
+                "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
+              }
+            }
+          ]
+        }
+      ],
+      "max_tokens": 6000,
+      "temperature": 0.8,
+      "stream": false
+    }' | jq
+```
+
+### Audio Disaggregated Serving
+
+**Workflow:**
+
+For the Qwen2-Audio model, audio embeddings are only required during the prefill stage. The AudioEncodeWorker is connected directly to the prefill worker.
+
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --audio_url--> audio_encode_worker
+  audio_encode_worker --> processor
+  audio_encode_worker --embeddings--> prefill_worker
+  prefill_worker --> audio_encode_worker
+  prefill_worker --> decode_worker
+  decode_worker --> prefill_worker
+```
+
+**Launch:**
+
+```bash
+pip install 'vllm[audio]' accelerate # multimodal audio models dependency
+cd $DYNAMO_HOME/examples/multimodal
+bash launch/audio_disagg.sh
+```
+
+## NIXL Usage
+
+| Use Case | Script | NIXL Used? | Data Transfer |
+|----------|--------|------------|---------------|
+| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker |
+| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
+| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
+| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
+| E/PD (EC Connector) | `agg_multimodal_ec_connector.sh` | No | ECConnector via Embedding Cache |
+
+## ModelInput Types and Registration
+
+Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
+
+| ModelInput Type | Preprocessing | Use Case |
+|-----------------|---------------|----------|
+| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
+| `ModelInput.Tokens` | Rust SDK tokenizes (bypassed in multimodal flows) | Components expecting pre-tokenized input |
+
+**Registration Pattern:**
+
+```python
+# Processor - Entry point from HTTP frontend
+await register_llm(
+    ModelInput.Text,        # Frontend sends raw text
+    ModelType.Chat,
+    generate_endpoint,
+    model_name,
+    ...
+)
+
+# Workers - Internal components
+await register_llm(
+    ModelInput.Tokens,      # Expect pre-tokenized input
+    ModelType.Chat,         # or ModelType.Prefill for prefill workers
+    generate_endpoint,
+    model_name,
+    ...
+)
+```
+
+## Known Limitations
+
+- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
+
+## Supported Models
+
+The following models have been tested with Dynamo's vLLM multimodal backend:
+
+- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
+- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
+- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
+- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
+- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
+- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
+
+For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
+| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
+| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
+| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementations (custom and vLLM-native) |
+| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -90,6 +90,11 @@

   mocker/mocker.md

+   multimodal/index.md
+   multimodal/vllm.md
+   multimodal/sglang.md
+   multimodal/trtllm.md
+
   frontends/kserve.md
   _sections/frontends.rst


--- a/docs/index.rst
+++ b/docs/index.rst
@@ -59,7 +59,7 @@ Quickstart
   :caption: User Guides

   Tool Calling <agents/tool-calling.md>
-   Multimodality Support <multimodal/index.md>
+   Multimodality Support <features/multimodal/README.md>
   Finding Best Initial Configs <performance/aiconfigurator.md>
   Benchmarking <benchmarks/benchmarking.md>
   Tuning Disaggregated Performance <performance/tuning.md>

--- a/docs/multimodal/index.md
+++ b/docs/multimodal/index.md
@@ -15,6 +15,11 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

+> [!NOTE]
+> **This content has moved.** The canonical location for this documentation is now
+> [docs/features/multimodal/](../features/multimodal/README.md).
+> This file will be removed in a future release.
+
 # Multimodal Inference in Dynamo

 Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.