docs: Add multimodal documentation vllm, sglang, and trtllm backends (#4510)

Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com> Co-authored-by: krishung5 <krish@nvidia.com>

Unverified Commit 94d145a9 authored Dec 09, 2025 by Indrajit Bhosale Committed by GitHub Dec 09, 2025
6 changed files
--- a/docs/backends/sglang/multimodal_sglang_guide.md
+++ b/docs/backends/sglang/multimodal_sglang_guide.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# SGLang Multimodal Guide
+
+This document provides a comprehensive guide for multimodal inference using the SGLang backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_epd.md).
+
+## Multimodal Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
+| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
+| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
+| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
+
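The matrix above implies a simple input check: only `http` and `https` image URLs are accepted, while `data:` URLs are rejected. A minimal sketch of such a check follows; the function name `is_supported_image_url` is hypothetical and is not taken from the Dynamo code base.

```python
from urllib.parse import urlparse


def is_supported_image_url(url: str) -> bool:
    """Return True only for HTTP/HTTPS URLs, mirroring the support matrix."""
    # urlparse extracts the scheme ("http", "https", "data", or "" for bare paths).
    scheme = urlparse(url).scheme.lower()
    return scheme in ("http", "https")


if __name__ == "__main__":
    for candidate in (
        "https://example.com/cat.jpg",
        "data:image/png;base64,iVBORw0KGgo=",
    ):
        print(candidate, "->", is_supported_image_url(candidate))
```

A check like this could run in the Processor before any work is dispatched to the Encode Worker, so unsupported inputs fail early.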
+## Architecture Comparison
+
+SGLang multimodal supports two deployment patterns:
+
+```text
+AGGREGATED (E->PD):
+  Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
+  • 3 components • Vision encoder in Python • NIXL embeddings transfer
+
+DISAGGREGATED (E->P->D):
+  Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
+  • 4 components • Vision encoder in Python • KV cache transfer via bootstrap mechanism
+```
+
+## Aggregated Mode (E->PD)
+
+In aggregated mode, encoding happens in a separate worker, but prefill and decode share the same engine.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python - ModelInput.Text - REGISTERED)
+    ↓ tokenizes with chat template, extracts image URL
+Encode Worker (Python - NOT registered)
+    ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
+PD Worker (Python - NOT registered)
+    ↓ receives embeddings via NIXL, prefill + decode
+Response → Processor → Frontend
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
+|-----------|------|-----------|------------|-------------------|---------|
+| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
+| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
+| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + Decode with embeddings |
+
+### Key Characteristics
+
+- **Vision Encoder in Python**: The encode worker loads the vision model (`AutoModel`) and the image processor (`AutoImageProcessor`)
+- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
+- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
+- **No Rust Processing**: All tokenization and image handling happens in Python
+
+## Disaggregated Mode (E->P->D)
+
+In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python - ModelInput.Text - REGISTERED)
+    ↓ tokenizes with chat template, extracts image URL
+Encode Worker (Python - NOT registered)
+    ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
+Prefill Worker (Python - NOT registered)
+    ↓ receives embeddings via NIXL, prefill only, returns bootstrap info
+Decode Worker (Python - NOT registered)
+    ↓ uses bootstrap info, decode only, token generation
+Response → Processor → Frontend
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
+|-----------|------|-----------|------------|-------------------|---------|
+| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
+| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
+| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill |
+| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination |
+
+### Bootstrap Coordination
+
+SGLang disaggregation uses a bootstrap mechanism for P->D coordination:
+
+**Request Flow (Important):**
+```text
+Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
+                                               ↑
+                                    Entry point for disaggregation!
+```
+
+**Bootstrap Process:**
+1. **Decode Worker** receives request from Encode Worker
+2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info
+3. **Prefill Worker** generates `{host, port, room}` and returns immediately
+4. **Both workers** connect to same "room" using bootstrap coordinates
+5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL)
+
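The bootstrap steps above can be sketched with two in-process stand-ins. This is an illustration only: `PrefillStub`, `DecodeStub`, and the random room id are assumptions made for the example, the NATS call is replaced by a direct method call, and the KV cache transfer itself (which SGLang handles internally) is not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BootstrapInfo:
    host: str
    port: int
    room: int


class PrefillStub:
    """Hypothetical stand-in for the Prefill Worker."""

    def __init__(self, host: str, port: int) -> None:
        self.host = host
        self.port = port

    def request_bootstrap(self) -> BootstrapInfo:
        # Step 3: return coordinates immediately. A random room id is an
        # assumption here; it only needs to identify one request's session.
        return BootstrapInfo(self.host, self.port, random.randint(0, 2**63 - 1))


class DecodeStub:
    """Hypothetical stand-in for the Decode Worker (the entry point)."""

    def __init__(self, prefill: PrefillStub) -> None:
        self.prefill = prefill

    def handle(self, request_id: str) -> dict:
        # Steps 1-2: receive the request and ask prefill for bootstrap info.
        info = self.prefill.request_bootstrap()
        # Step 4: both sides would now join the same room using these fields.
        return {
            "request_id": request_id,
            "bootstrap_host": info.host,
            "bootstrap_port": info.port,
            "bootstrap_room": info.room,
        }


if __name__ == "__main__":
    print(DecodeStub(PrefillStub("10.0.0.5", 8998)).handle("req-1"))
```

The point of the sketch is the direction of the call: decode asks prefill for coordinates, rather than prefill pushing work to decode.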
+**Key Difference from vLLM:**
+- vLLM: Frontend → Prefill → Decode (Prefill is entry point)
+- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point)
+
+## ModelInput Types and Registration
+
+**Only the Processor registers with Dynamo Rust.**
+
+### Registration Pattern
+
+```python
+# ONLY Processor registers with Dynamo Rust
+await register_llm_with_readiness_gate(
+    None,                   # No engine for processor
+    generate_endpoint,
+    server_args,
+    dynamo_args,
+    input_type=ModelInput.Text,  # Receives raw OpenAI format
+    readiness_gate=ready_event,
+)
+
+# Workers do NOT register - they are internal components
+# They communicate via NATS clients created in main.py
+```
+
+### Component Initialization
+
+```python
+# Encode Worker - connects to downstream PD worker
+pd_worker_client = (
+    await runtime.namespace(dynamo_args.namespace)
+    .component("backend")
+    .endpoint("generate")
+    .client()
+)
+
+# PD Worker (Decode mode) - connects to upstream Prefill worker
+prefill_client = (
+    await runtime.namespace(dynamo_args.namespace)
+    .component("prefill")
+    .endpoint("generate")
+    .client()
+)
+```
+
+## Inter-Component Communication
+
+### Control Flow (NATS)
+
+All component-to-component communication happens via NATS:
+
+**Aggregated Mode (E→PD):**
+```text
+Processor → Encode Worker → PD Worker
+  (NATS)        (NATS + NIXL embeddings)
+```
+
+**Disaggregated Mode (E→P→D):**
+```text
+Processor → Encode Worker → DECODE Worker → Prefill Worker
+  (NATS)        (NATS)            (NATS)
+                             ↓
+                    Decode requests bootstrap
+                             ↓
+                    Prefill returns {host, port, room}
+                             ↓
+                    Both connect via bootstrap
+                             ↓
+                    SGLang internal KV cache transfer
+```
+
+**Detailed Message Flow:**
+
+```text
+Processor → Encode Worker:
+  - NATS round_robin with SglangMultimodalRequest
+  - Contains: tokenized input_ids, image URL, sampling params
+
+Encode Worker → Decode/PD Worker:
+  - NATS round_robin to "backend" component
+  - Contains: expanded token_ids, NIXL metadata, embeddings shape
+  - NIXL transfer: embeddings tensor
+
+Decode Worker → Prefill Worker (disagg only):
+  - NATS call to "prefill" component
+  - Decode requests bootstrap coordinates
+  - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}
+
+Prefill ↔ Decode (via bootstrap):
+  - SGLang internal connection (not NATS)
+  - KV cache state shared via bootstrap mechanism
+```
+
+### Data Transfer (NIXL)
+
+NIXL is used only for embedding transfer:
+
+```python
+# Encode Worker: expose the embeddings for a remote NIXL read
+descriptor = connect.Descriptor(precomputed_embeddings)
+with connector.create_readable(descriptor) as readable:
+    request.serialized_request = readable.metadata()
+    # Send request with NIXL metadata
+    await pd_worker_client.round_robin(request)
+    await readable.wait_for_completion()
+
+# PD Worker: allocate a matching buffer and read the embeddings into it
+embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
+descriptor = connect.Descriptor(embeddings)
+read_op = await connector.begin_read(request.serialized_request, descriptor)
+await read_op.wait_for_completion()
+```
+
+## Vision Encoding Details
+
+### Encode Worker Components
+
+The encode worker loads and runs the vision model in Python:
+
+```python
+# Vision components loaded in encode worker
+self.image_processor = AutoImageProcessor.from_pretrained(
+    model_path, trust_remote_code=True
+)
+self.vision_model = AutoModel.from_pretrained(
+    model_path,
+    device_map="auto",
+    torch_dtype=torch.float16,
+    trust_remote_code=True
+)
+```
+
+### Token Expansion Process
+
+1. Processor inserts single image token (e.g., `<|image_pad|>`)
+2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)`
+3. Encode worker replaces single token with `num_patches` tokens
+4. Downstream worker receives expanded token sequence
+
+Example:
+```python
+# Before: ["Hello", "<|image_pad|>", "world"]
+# After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
+```
+
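The expansion step can be sketched on token IDs rather than strings. This is a minimal illustration, not the encode worker's actual code: the function name and the token ID `99` are hypothetical, and real models may also wrap the image span in start/end tokens.

```python
from typing import List


def expand_image_tokens(
    token_ids: List[int], image_token_id: int, num_patches: int
) -> List[int]:
    """Replace each image placeholder token with `num_patches` copies."""
    expanded: List[int] = []
    for tok in token_ids:
        if tok == image_token_id:
            # One placeholder becomes one slot per embedding row.
            expanded.extend([image_token_id] * num_patches)
        else:
            expanded.append(tok)
    return expanded


if __name__ == "__main__":
    ids = [1, 99, 2]  # 99 plays the role of <|image_pad|>
    out = expand_image_tokens(ids, image_token_id=99, num_patches=576)
    print(len(out))
```

With `num_patches` taken from the second dimension of the embedding shape, the expanded sequence has exactly one placeholder position per embedding vector, which is what the downstream worker relies on.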
+## Chat Template Processing
+
+SGLang uses its own chat template system:
+
+```python
+from sglang.srt.parser.conversation import chat_templates
+
+conv = chat_templates["qwen2-vl"].copy()
+conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
+processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
+```
+
+Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
+
+## NIXL Usage
+
+| Use Case | NIXL Used? | Data Transfer | Notes |
+|----------|------------|---------------|-------|
+| E→PD Aggregated | ✅ Yes | Encoder → PD (embeddings) | Vision encoder separate |
+| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
+
+**Key Difference:** SGLang P→D transfers the KV cache through its bootstrap mechanism, whereas vLLM uses NIXL for the KV cache.
+
+## Known Limitations
+
+- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
+- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request
+- **No video support** - No video encoder implementation
+- **No audio support** - No audio encoder implementation
+- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
+- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers
+- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates
+
+## Supported Models
+
+SGLang multimodal **only supports image-based vision-language models**:
+
+### ✅ Supported (Images Only)
+- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
+- Models with `AutoImageProcessor` and vision tower
+- Models compatible with SGLang's image embedding format
+
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/sglang/main.py` | Component initialization, only Processor registers |
+| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang |
+| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
+| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
+| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
+| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
+| `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) |
+
--- a/docs/backends/trtllm/multimodal_epd.md
+++ b/docs/backends/trtllm/multimodal_epd.md
-# Encode-Prefill-Decode (EPD) Flow with NIXL
-
-For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
-
-## Enabling the Feature
-
-This is an experimental feature that requires using a specific TensorRT-LLM commit.
-To enable it build the dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
-
-```bash
-./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3
-```
-
-## Key Features
-
-- **High Performance**: Zero-copy RDMA transfer for embeddings
-- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
-- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
-- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON
-
-## How to use
-
-```bash
-cd $DYNAMO_HOME/examples/backends/trtllm
-
-# Launch 3-worker EPD flow with NIXL.
-./launch/epd_disagg.sh
-```
-
-## Pre-requsites
-
-This script is specifically designed to work on 8 node H200 and `Llama-4-Maverick-17B-128E-Instruct` model with assumption that you already have a model specific embedding file ready.
-
-## Configuration
-
-The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:
-
-```bash
-export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
-```
-
-This endpoint follows Dynamo's standard format: `dyn://namespace.component.endpoint` where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
-
-For local embedding file access, use the `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` parameter to specify the secure directory path where embedding files can be loaded from (default: `/tmp`). This prevents path traversal attacks while allowing flexible file access within the designated directory.
-
-```bash
-export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
-```
-
-For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"` parameter to limit the maximum size of downloadable embedding files/Image URLs (default: `50MB`). This prevents Denial of Service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.
-
-```bash
-export MAX_FILE_SIZE_MB=50
-```
-
-## Architecture Overview
-
-The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:
-
-- **Encode Worker**: Loads and processes multimodal embeddings
-- **Prefill Worker**: Handles initial context processing and KV-cache generation
-- **Decode Worker**: Performs streaming token generation
-
-## Request Flow Diagram
-
-```mermaid
-sequenceDiagram
-    participant Client
-    participant Frontend
-    participant PrefillWorker as "Prefill Worker<br/>(PrefillHandler)"
-    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
-    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)"
-    participant NIXL as "NIXL<br/>(RDMA Transfer)"
-
-    Note over Client,NIXL: Unified Frontend: Context processing followed by streaming generation
-
-    Client->>Frontend: POST /v1/chat/completions<br/>(multimodal request)
-    Frontend->>PrefillWorker: Route to prefill worker
-
-    Note over PrefillWorker: Check for multimodal content
-    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)
-
-    Note over EncodeWorker: Load embeddings from file/url<br/>
-    EncodeWorker->>NIXL: Create readable operation<br/>
-    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)
-
-    Note over PrefillWorker: Allocate tensor with dynamic shape
-    PrefillWorker->>NIXL: Begin read operation
-    NIXL-->>PrefillWorker: Zero-copy transfer complete<br/>
-
-    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
-    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
-    Note over PrefillWorker: Generate KV-cache<br/>(max_tokens=1 in prefill mode)
-
-    PrefillWorker->>Frontend: Return prefill response<br/>(disaggregated_params)
-
-    Frontend->>DecodeWorker: Route to decode worker<br/>with disaggregated_params
-
-    Note over DecodeWorker: Continue generation<br/>(streaming tokens)
-    DecodeWorker->>Frontend: Stream response chunk 1
-    Frontend->>Client: Response chunk 1
-    DecodeWorker->>Frontend: Stream response chunk 2
-    Frontend->>Client: Response chunk 2
-    DecodeWorker->>Frontend: ... (continue streaming)
-    Frontend->>Client: ... (continue streaming)
-    DecodeWorker->>Frontend: Final response + [DONE]
-    Frontend->>Client: Final response + [DONE]
-```
-
-## How the System Works
-
-1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed by the frontend to prefill workers
-2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
-3. **NIXL Transfer**: Main tensors transferred via zero-copy RDMA, small metadata via JSON for efficiency
-4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker
-5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing
-
-## Example Request
-
-The request format is identical to regular multimodal requests:
-
-```bash
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
-    "messages": [
-        {
-            "role": "user",
-            "content": [
-                {"type": "text", "text": "Describe the image"},
-                {
-                    "type": "image_url",
-                    "image_url": {"url": "/path/to/embeddings.pt"}
-                }
-            ]
-        }
-    ],
-    "max_tokens": 160
-}'
-```
--- a/docs/backends/trtllm/multimodal_support.md
+++ b/docs/backends/trtllm/multimodal_support.md
@@ -92,23 +92,50 @@ In general, disaggregated serving can run on a single node, provided the model f

 To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md).

-## Using Pre-computed Embeddings (Experimental)
+## Pre-computed Embeddings with EPD Flow

-Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
+For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.

-### How to Use
+### Enabling the Feature

-Once the container is built, you can send requests with paths to local embedding files.
+This is an experimental feature that requires using a specific TensorRT-LLM commit.
+To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag:

--   **Format:** Provide the embedding as part of the `messages` array, using the `image_url` content type.
--   **URL:** The `url` field should contain the absolute or relative path to your embedding file on the local filesystem.
--   **File Types:** Supported embedding file extensions are `.pt`, `.pth`, and `.bin`. Dynamo will automatically detect these extensions.
+```bash
+./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3
+```

-When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline.
+### Supported File Types

-### Example Request
+- `.pt` - PyTorch tensor files
+- `.pth` - PyTorch checkpoint files
+- `.bin` - Binary tensor files
+
+### How to Launch
+
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+
+# Launch 3-worker EPD flow with NIXL
+./launch/epd_disagg.sh
+```
+
+> **Note:** This script is designed for an 8-node H200 setup with the `Llama-4-Scout-17B-16E-Instruct` model and assumes you already have a model-specific embedding file ready.
+
+### Configuration
+
+```bash
+# Encode endpoint for Prefill → Encode communication
+export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
+
+# Security: Allowed directory for embedding files (default: /tmp)
+export ALLOWED_LOCAL_MEDIA_PATH="/tmp"

-Here is an example of how to send a request with a pre-computed embedding file.
+# Security: Max file size to prevent DoS attacks (default: 50MB)
+export MAX_FILE_SIZE_MB=50
+```
+
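The `ENCODE_ENDPOINT` value follows the `dyn://namespace.component.endpoint` format described in the removed `multimodal_epd.md`. A minimal sketch of splitting such a value into its three parts is shown below; `parse_dyn_endpoint` and `DynEndpoint` are hypothetical names for illustration, not Dynamo APIs.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split 'dyn://namespace.component.endpoint' into its three parts."""
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a dyn:// URI, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    # Exactly three non-empty, dot-separated fields are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


if __name__ == "__main__":
    print(parse_dyn_endpoint("dyn://dynamo.tensorrt_llm_encode.generate"))
```

Validating the value at startup surfaces a malformed endpoint before the Prefill worker tries to reach the Encode worker.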
+### Example Request

 ```bash
 curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
@@ -117,27 +144,47 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
        {
            "role": "user",
            "content": [
-                {
-                    "type": "text",
-                    "text": "Describe the content represented by the embeddings"
-                },
-                {
-                    "type": "image_url",
-                    "image_url": {
-                        "url": "/path/to/your/embedding.pt"
-                    }
-                }
+                {"type": "text", "text": "Describe the image"},
+                {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
            ]
        }
    ],
-    "stream": false,
    "max_tokens": 160
 }'
 ```
-## Encode-Prefill-Decode (EPD) Flow with NIXL

-Dynamo with the TensorRT-LLM backend supports multimodal models in Encode -> Decode -> Prefill fashion, enabling you to process embeddings seperately in a seperate worker. For detailed setup instructions, example requests, and best practices, see the [Multimodal EPD Support Guide](./multimodal_epd.md).
+### Architecture
+
+The EPD flow implements a **3-worker architecture**:
+
+- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL
+- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation
+- **Decode Worker**: Performs streaming token generation
+
+### Request Flow
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant Frontend
+    participant PrefillWorker as "Prefill Worker"
+    participant EncodeWorker as "Encode Worker"
+    participant DecodeWorker as "Decode Worker"
+    participant NIXL as "NIXL (RDMA)"
+
+    Client->>Frontend: POST /v1/chat/completions
+    Frontend->>PrefillWorker: Route to prefill worker
+    PrefillWorker->>EncodeWorker: Send request (embedding paths)
+    EncodeWorker->>NIXL: Create readable operation
+    EncodeWorker->>PrefillWorker: Send metadata + NIXL info
+    PrefillWorker->>NIXL: Begin read operation
+    NIXL-->>PrefillWorker: Zero-copy transfer complete
+    PrefillWorker->>Frontend: Return prefill response
+    Frontend->>DecodeWorker: Route to decode worker
+    DecodeWorker->>Frontend: Stream response chunks
+    Frontend->>Client: Stream response
+```

 ## Supported Multimodal Models

-Multimodel models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by dynamo.
\ No newline at end of file
+Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
--- a/docs/backends/trtllm/multimodal_trtllm_guide.md
+++ b/docs/backends/trtllm/multimodal_trtllm_guide.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# TRT-LLM Multimodal Guide
+
+This document provides a comprehensive guide for multimodal inference using the TensorRT-LLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_support.md).
+
+## Multimodal Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Full support for all image models |
+| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | ✅ Yes | ✅ Yes | Direct embedding files |
+| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
+| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
+
+## Architecture Comparison
+
+TRT-LLM multimodal supports three deployment patterns:
+
+```text
+SIMPLE AGGREGATED (agg.sh):
+  Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
+  • 2 components • worker flag `--modality multimodal` • Easiest setup
+
+DISAGGREGATED P->D (disagg_multimodal.sh):
+  Client → Frontend → Prefill [image load, encode] → Decode → Response
+  • 3 components • worker flag `--disaggregation-mode prefill/decode` • Multi-GPU, KV transfer
+
+EPD DISAGGREGATED - WIP:
+  Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response
+  • 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #4668
+```
+
+## Input Format Details
+
+### Supported URL Formats
+
+| Format | Example | Description | Support |
+|--------|---------|-------------|---------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
+| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) | ✅ |
+
+## Simple Aggregated Mode (PD)
+
+In aggregated mode, all processing (image loading, encoding, prefill, decode) happens within a single worker.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+TRT-LLM Worker (Python - ModelInput.Tokens)
+    ↓ downloads media, encodes, prefill + decode
+Response
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Purpose |
+|-----------|------|-----------|------------|---------|
+| Worker | `--modality multimodal` | Tokens | Yes | Complete inference pipeline |
+
+### Launch Script
+
+Example: [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh)
+
+## Disaggregated Mode (P->D)
+
+In disaggregated mode, prefill and decode are handled by separate workers. The prefill worker handles image loading and encoding internally.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Prefill Worker (Python - ModelInput.Tokens)
+    ↓ downloads media, encodes, prefill, KV cache transfer
+Decode Worker (Python - ModelInput.Tokens)
+    ↓ decode only, token generation
+Response
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Purpose |
+|-----------|------|-----------|------------|---------|
+| Prefill Worker | `--disaggregation-mode prefill` | Tokens | Yes | Image processing + Prefill |
+| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
+
+### Launch Script
+
+Example: [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)
+
+## Pre-computed Embeddings
+
+TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing.
+
+### Supported File Types
+
+- `.pt` - PyTorch tensor files
+- `.pth` - PyTorch checkpoint files
+- `.bin` - Binary tensor files
+
+### Embedding File Formats
+
+TRT-LLM supports two formats for embedding files:
+
+#### 1. Simple Tensor Format
+
+- Direct tensor saved as `.pt` file
+- Example: `llava_next_mm_embed_seashore.pt`
+- Contains only the embedding tensor
+
+```python
+# Example: Simple tensor format
+import torch
+
+embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
+torch.save(embedding_tensor, "embedding.pt")
+```
+
+#### 2. Dictionary Format with Auxiliary Data
+
+- Dictionary containing multiple keys
+- Used by models like Llama-4 that require additional metadata
+- Must contain `mm_embeddings` key with the main tensor
+- Can include auxiliary data like special tokens, offsets, etc.
+
+```python
+# Example: Dictionary format (Llama-4 style)
+import torch
+
+embedding_dict = {
+    "mm_embeddings": torch.rand(1, 576, 4096),
+    "special_tokens": [128256, 128257],
+    "image_token_offsets": [[0, 576]],
+    # ... other model-specific metadata
+}
+torch.save(embedding_dict, "llama4_embedding.pt")
+```
+
+**How They're Used:**
+- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
+- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately
+
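The handling of the two formats can be sketched as a small helper that separates the main tensor from any auxiliary data. This is an illustration under assumptions: `split_embedding_payload` is a hypothetical name, and the input is whatever object was deserialized from the file (for example, the result of `torch.load`).

```python
from typing import Any, Dict, Tuple


def split_embedding_payload(loaded: Any) -> Tuple[Any, Dict[str, Any]]:
    """Return (main_embeddings, auxiliary_data) for either file format."""
    if isinstance(loaded, dict):
        # Dictionary format: `mm_embeddings` is mandatory, the rest is auxiliary.
        if "mm_embeddings" not in loaded:
            raise KeyError("dictionary-format files must contain 'mm_embeddings'")
        aux = {k: v for k, v in loaded.items() if k != "mm_embeddings"}
        return loaded["mm_embeddings"], aux
    # Simple tensor format: the object itself is the embedding tensor.
    return loaded, {}


if __name__ == "__main__":
    payload = {"mm_embeddings": [[0.1, 0.2]], "special_tokens": [128256, 128257]}
    main, aux = split_embedding_payload(payload)
    print(main, aux)
```

Keeping the auxiliary keys in a separate dictionary matches the description above, where the main tensor and the small metadata travel separately.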
+### Launch Script
+
+Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh)
+
+### Security Considerations
+
+For EPD mode with local embedding files:
+
+- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`)
+- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`)
+
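The two checks above can be sketched as follows. This is a minimal illustration of the idea, not the implementation behind the flags: the helper names are hypothetical, and the defaults mirror the documented `/tmp` and 50 MB values.

```python
import os


def is_within_allowed_dir(path: str, allowed_dir: str = "/tmp") -> bool:
    """Reject paths that resolve outside the allowed directory."""
    # realpath collapses ".." segments and symlinks before comparing.
    real_path = os.path.realpath(path)
    real_root = os.path.realpath(allowed_dir)
    return os.path.commonpath([real_path, real_root]) == real_root


def is_size_allowed(size_bytes: int, max_file_size_mb: int = 50) -> bool:
    """Reject files larger than the configured limit."""
    return size_bytes <= max_file_size_mb * 1024 * 1024


if __name__ == "__main__":
    print(is_within_allowed_dir("/tmp/embeddings/a.pt"))
    print(is_within_allowed_dir("/tmp/../etc/passwd"))
    print(is_size_allowed(10 * 1024 * 1024))
```

For a local file, `size_bytes` could come from `os.path.getsize(path)`. Resolving the path before the comparison is what blocks traversal attempts such as `/tmp/../etc/passwd`.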
+## EPD Disaggregated Mode (E->P->D) - WIP
+
+**Status:** Work In Progress (WIP PR #4668) - Full EPD flow with MultimodalEncoder
+
+In EPD mode, encoding, prefill, and decode are handled by separate workers. The encode worker uses TensorRT-LLM's `MultimodalEncoder` to process images and transfer embeddings via disaggregated parameters.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Encode Worker (Python - NOT registered, uses MultimodalEncoder)
+    ↓ downloads image, encodes with vision model, transfers via disaggregated_params
+Prefill Worker (Python - ModelInput.Tokens)
+    ↓ receives embeddings via disaggregated_params, prefill only, KV cache transfer
+Decode Worker (Python - ModelInput.Tokens)
+    ↓ decode only, token generation
+Response
+```
+
+**Note (WIP):** The encode worker uses `MultimodalEncoder` from TensorRT-LLM to actually encode images, not just load pre-computed embeddings. This is a significant change from the legacy NIXL-based embedding transfer.
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Purpose |
+|-----------|------|-----------|------------|---------|
+| Encode Worker | `--disaggregation-mode encode` | N/A | No | Image encoding with MultimodalEncoder |
+| Prefill Worker | `--disaggregation-mode prefill --encode-endpoint` | Tokens | Yes | Prefill only |
+| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
+
+
+## ModelInput Types and Registration
+
+### Understanding ModelInput
+
+TRT-LLM workers register with Dynamo using a single input type:
+
+| ModelInput Type | Preprocessing | Use Case |
+|-----------------|---------------|----------|
+| `ModelInput.Tokens` | Rust SDK tokenizes text (bypassed for multimodal) | All TRT-LLM workers |
+
+### Component Registration Pattern
+
+```python
+# TRT-LLM Worker - Register with Tokens
+await register_llm(
+    ModelInput.Tokens,      # Rust does minimal preprocessing
+    model_type,             # ModelType.Chat or ModelType.Prefill
+    generate_endpoint,
+    model_name,
+    ...
+)
+```
+
+## Inter-Component Communication
+
+| Transfer Stage | Message      | NIXL Transfer |
+|----------------|--------------|---------------|
+| **Frontend → Prefill** | Request with image URL or embedding path | No |
+| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
+| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
+| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) |
+
+
+## NIXL Usage
+
+| Use Case | Script | NIXL Used? | Data Transfer |
+|----------|--------|------------|---------------|
+| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker |
+| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) |
+| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
+| E->P->D Disaggregated (WIP) | N/A (no launch script yet) | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)<br>Prefill → Decode (KV cache via UCX/NIXL) |
+
+**Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
+
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
+| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
+| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
+| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
+| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
+
+## Known Limitations
+
+- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
+- **No video support** - No video encoder implementation
+- **No audio support** - No audio encoder implementation
+- **No Rust preprocessing** - All preprocessing happens in Python workers
+- **E->P->D mode is WIP** - Full EPD with image URLs under development
+
+## Supported Models
+
+Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
+
+Common examples:
+- Llama 4 Vision models (Maverick, Scout)
+- Qwen2-VL models
+- Other vision-language models with TRT-LLM support
+
--- a/docs/backends/vllm/multimodal_vllm_guide.md
+++ b/docs/backends/vllm/multimodal_vllm_guide.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# vLLM Multimodal Guide
+
+This document provides a comprehensive guide to multimodal inference using the vLLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal.md).
+
+## Multimodal Support Matrix
+
+| Modality | Input Format | Aggregated | Disaggregated | Notes |
+|----------|--------------|------------|---------------|-------|
+| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
+| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
+| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
+| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
+
+## Architecture Comparison
+
+vLLM multimodal supports three deployment patterns:
+
+```text
+SIMPLE AGGREGATED (examples/backends/vllm/launch/agg_multimodal.sh):
+  Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
+  • 2 components • --connector none • Easiest setup
+
+EPD AGGREGATED (examples/backends/vllm/launch/agg_multimodal_epd.sh):
+  Client → Frontend → Processor → Encoder [NIXL] → PD Worker → Response
+  • 4 components • --multimodal-processor • Custom templates, NIXL
+
+DISAGGREGATED (examples/backends/vllm/launch/disagg_multimodal_epd.sh):
+  Client → Frontend → Processor → Encoder [NIXL] → Prefill [NIXL] → Decode → Response
+  • 5 components • Separate P/D workers • Multi-node, max optimization
+```
+
+## Input Format Details
+
+### Supported URL Formats
+
+| Format | Example | Description | Support |
+|--------|---------|-------------|---------|
+| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
+| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ |
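+
+As an illustration of the Data URL format, the sketch below encodes raw image bytes and places the result in a chat request body. It assumes the frontend accepts the OpenAI-style `image_url` content part; `to_data_url` and `build_chat_request` are hypothetical helpers, and the model name is taken from the tested models listed later in this guide.
+
+```python
+import base64
+import json
+
+
+def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
+    """Encode raw bytes as a base64 data URL (hypothetical helper)."""
+    encoded = base64.b64encode(image_bytes).decode("ascii")
+    return f"data:{mime_type};base64,{encoded}"
+
+
+def build_chat_request(model: str, prompt: str, image_url: str) -> dict:
+    """Build an OpenAI-style chat body with one text part and one image part."""
+    return {
+        "model": model,
+        "messages": [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": prompt},
+                    # `image_url` may be an HTTP/HTTPS URL or a data URL.
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                ],
+            }
+        ],
+    }
+
+
+# The first four bytes of a JPEG file, used here as stand-in image data.
+data_url = to_data_url(b"\xff\xd8\xff\xe0")
+request = build_chat_request("Qwen/Qwen2.5-VL-7B-Instruct", "Describe the image.", data_url)
+print(json.dumps(request, indent=2))
+```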
+
+## Simple Aggregated Mode (PD)
+
+In simple aggregated mode, encoding, prefill, and decode happen within the same worker.
+
+### Architecture
+
+```text
+HTTP Frontend with Rust processor
+    ↓
+Worker (Python - ModelInput.Tokens)
+    ↓ encode + prefill + decode
+Response
+```
+
+## EPD Aggregated Mode (E->PD)
+
+In EPD aggregated mode, encoding happens in a separate worker, while prefill and decode run together in a single PD worker.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python - ModelInput.Text)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python - not registered)
+    ↓ downloads media, generates embeddings, NIXL transfer
+PD Worker (Python - ModelInput.Tokens)
+    ↓ prefill + decode
+Response
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Purpose |
+|-----------|------|-----------|------------|---------|
+| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
+| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
+| PD Worker | `--multimodal-worker` | Tokens | Yes | Prefill + Decode |
+
+## EPD Disaggregated Mode (E->P->D)
+
+In EPD disaggregated mode, encoding, prefill, and decode are handled by separate workers.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python - ModelInput.Text)
+    ↓ tokenizes, extracts media URL
+Encode Worker (Python - not registered)
+    ↓ downloads media, generates embeddings, NIXL transfer
+Prefill Worker (Python - ModelInput.Tokens)
+    ↓ prefill only, KV cache NIXL transfer
+Decode Worker (Python - ModelInput.Tokens)
+    ↓ decode only, token generation
+Response
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Purpose |
+|-----------|------|-----------|------------|---------|
+| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
+| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
+| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Tokens | Yes | Prefill only |
+| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
+
+## Traditional Disaggregated Mode (EP->D)
+
+Llama 4 models don't support pre-computed embeddings, so they use a combined Encode+Prefill worker.
+
+### Architecture
+
+```text
+HTTP Frontend (Rust)
+    ↓
+Processor (Python - ModelInput.Text)
+    ↓ tokenizes, extracts media URL
+Encode+Prefill Worker (Python - ModelInput.Tokens)
+    ↓ downloads media, encodes inline, prefill, KV cache NIXL transfer
+Decode Worker (Python - ModelInput.Tokens)
+    ↓ decode only, token generation
+Response
+```
+
+### Components
+
+| Component | Flag | ModelInput | Registered | Purpose |
+|-----------|------|-----------|------------|---------|
+| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
+| Encode+Prefill | `--multimodal-encode-prefill-worker --is-prefill-worker` | Tokens | Yes | Encode + Prefill |
+| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
+
+### Launch Script
+
+Example: [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)
+
+## ModelInput Types and Registration
+
+### Understanding ModelInput
+
+Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
+
+| ModelInput Type | Preprocessing | Use Case |
+|-----------------|---------------|----------|
+| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
+| `ModelInput.Tokens` | Rust SDK tokenizes text (bypassed for multimodal requests) | Components expecting pre-tokenized input |
+
+### Component Registration Pattern
+
+```python
+# Processor - Entry point from HTTP frontend
+await register_llm(
+    ModelInput.Text,        # Frontend sends raw text
+    ModelType.Chat,
+    generate_endpoint,
+    model_name,
+    ...
+)
+
+# Workers - Internal components
+await register_llm(
+    ModelInput.Tokens,      # Expect pre-tokenized input
+    ModelType.Chat,         # or ModelType.Prefill for prefill workers
+    generate_endpoint,
+    model_name,
+    ...
+)
+```
+
+## NIXL Usage
+
+| Use Case | Script | NIXL Used? | Data Transfer |
+|----------|--------|------------|---------------|
+| Simple Aggregated | [`examples/backends/vllm/launch/agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh) | ❌ No | All in one worker |
+| E->PD Aggregated | [`examples/backends/vllm/launch/agg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh) | ✅ Yes | Encoder → PD (embeddings) |
+| E->P->D Disaggregated | [`examples/backends/vllm/launch/disagg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh) | ✅ Yes | Encoder → Prefill (embeddings)<br>Prefill → Decode (KV cache) |
+| EP->D Disaggregated (Llama 4) | [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) | ✅ Yes | Prefill → Decode (KV cache) |
+
+
+## Known Limitations
+
+- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
+
+## Supported Models
+
+The following models have been tested with Dynamo's vLLM multimodal backend:
+
+- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
+- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
+- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
+- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
+- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
+- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
+
+For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
+
+## Key Files
+
+| File | Description |
+|------|-------------|
+| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
+| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
+| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
+| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation |
+| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
+
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -50,7 +50,7 @@
   backends/trtllm/llama4_plus_eagle.md
   backends/trtllm/kv-cache-transfer.md
   backends/trtllm/multimodal_support.md
-   backends/trtllm/multimodal_epd.md
+   backends/trtllm/multimodal_trtllm_guide.md
   backends/trtllm/gemma3_sliding_window_attention.md
   backends/trtllm/gpt-oss.md
   backends/trtllm/prometheus.md
@@ -61,6 +61,7 @@
   backends/sglang/expert-distribution-eplb.md
   backends/sglang/gpt-oss.md
   backends/sglang/multimodal_epd.md
+   backends/sglang/multimodal_sglang_guide.md
   backends/sglang/profiling.md
   backends/sglang/sgl-hicache-example.md
   backends/sglang/sglang-disaggregation.md
@@ -74,8 +75,10 @@

   backends/vllm/deepseek-r1.md
   backends/vllm/gpt-oss.md
+   backends/vllm/LMCache_Integration.md
   backends/vllm/multi-node.md
   backends/vllm/multimodal.md
+   backends/vllm/multimodal_vllm_guide.md
   backends/vllm/prometheus.md
   backends/vllm/speculative_decoding.md