Unverified commit 94d145a9, authored by Indrajit Bhosale and committed by GitHub

docs: Add multimodal documentation vllm, sglang, and trtllm backends (#4510)


Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Co-authored-by: krishung5 <krish@nvidia.com>
parent 09f2314d
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# SGLang Multimodal Guide
This document provides a comprehensive guide for multimodal inference using the SGLang backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_epd.md).
## Multimodal Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
## Architecture Comparison
SGLang multimodal supports two deployment patterns:
```text
AGGREGATED (E->PD):
Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
• 3 components • Vision encoder in Python • NIXL embeddings transfer
DISAGGREGATED (E->P->D):
Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
• 4 components • Vision encoder in Python • KV cache transfer via bootstrap mechanism
```
## Aggregated Mode (E->PD)
In aggregated mode, encoding happens in a separate worker, but prefill and decode share the same engine.
### Architecture
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
PD Worker (Python - NOT registered)
↓ receives embeddings via NIXL, prefill + decode
Response → Processor → Frontend
```
### Components
| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|-----------|------------|-------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + Decode with embeddings |
### Key Characteristics
- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
- **No Rust Processing**: All tokenization and image handling happens in Python
## Disaggregated Mode (E->P->D)
In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination.
### Architecture
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
Prefill Worker (Python - NOT registered)
↓ receives embeddings via NIXL, prefill only, returns bootstrap info
Decode Worker (Python - NOT registered)
↓ uses bootstrap info, decode only, token generation
Response → Processor → Frontend
```
### Components
| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|-----------|------------|-------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination |
### Bootstrap Coordination
SGLang disaggregation uses a bootstrap mechanism for P->D coordination:
**Request Flow (Important):**
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
                                         ↑ Entry point for disaggregation!
```
**Bootstrap Process:**
1. **Decode Worker** receives request from Encode Worker
2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info
3. **Prefill Worker** generates `{host, port, room}` and returns immediately
4. **Both workers** connect to same "room" using bootstrap coordinates
5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL)
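The handshake above can be sketched with plain `asyncio`. This is a minimal illustration under stated assumptions, not Dynamo or SGLang code: `BootstrapInfo`, `prefill_bootstrap`, and `decode_handle` are hypothetical names, the NATS call is replaced by a direct coroutine call, and the room ID range is chosen only for the example.

```python
import asyncio
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BootstrapInfo:
    """Coordinates both workers use to find each other (hypothetical type)."""

    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int


async def prefill_bootstrap(host: str, port: int) -> BootstrapInfo:
    # Step 3: prefill picks a room and returns immediately.
    # The 63-bit room range is an assumption made for this sketch.
    return BootstrapInfo(host, port, random.randint(0, 2**63 - 1))


async def decode_handle(request: dict, prefill_host: str, prefill_port: int) -> dict:
    # Step 2: decode asks prefill for bootstrap info
    # (a NATS call in the real system, a direct call here).
    info = await prefill_bootstrap(prefill_host, prefill_port)
    # Step 4: decode attaches the same coordinates to its own request,
    # so both sides can join the same room.
    return {**request, **asdict(info)}


decode_request = asyncio.run(
    decode_handle({"token_ids": [1, 2, 3]}, "10.0.0.5", 8998)
)
print(decode_request)
```

The point of the sketch is the ordering: the coordinates are exchanged over the control channel first, and the KV cache moves afterwards over the connection those coordinates describe.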
**Key Difference from vLLM:**
- vLLM: Frontend → Prefill → Decode (Prefill is entry point)
- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point)
## ModelInput Types and Registration
**Only the Processor registers with Dynamo Rust.**
### Registration Pattern
```python
# ONLY Processor registers with Dynamo Rust
await register_llm_with_readiness_gate(
    None,  # No engine for processor
    generate_endpoint,
    server_args,
    dynamo_args,
    input_type=ModelInput.Text,  # Receives raw OpenAI format
    readiness_gate=ready_event,
)

# Workers do NOT register - they are internal components
# They communicate via NATS clients created in main.py
```
### Component Initialization
```python
# Encode Worker - connects to downstream PD worker
pd_worker_client = (
    await runtime.namespace(dynamo_args.namespace)
    .component("backend")
    .endpoint("generate")
    .client()
)

# PD Worker (Decode mode) - connects to upstream Prefill worker
prefill_client = (
    await runtime.namespace(dynamo_args.namespace)
    .component("prefill")
    .endpoint("generate")
    .client()
)
```
## Inter-Component Communication
### Control Flow (NATS)
All component-to-component communication happens via NATS:
**Aggregated Mode (E→PD):**
```text
Processor → Encode Worker → PD Worker
(NATS) (NATS + NIXL embeddings)
```
**Disaggregated Mode (E→P→D):**
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
(NATS) (NATS) (NATS)
Decode requests bootstrap
Prefill returns {host, port, room}
Both connect via bootstrap
SGLang internal KV cache transfer
```
**Detailed Message Flow:**
```text
Processor → Encode Worker:
- NATS round_robin with SglangMultimodalRequest
- Contains: tokenized input_ids, image URL, sampling params
Encode Worker → Decode/PD Worker:
- NATS round_robin to "backend" component
- Contains: expanded token_ids, NIXL metadata, embeddings shape
- NIXL transfer: embeddings tensor
Decode Worker → Prefill Worker (disagg only):
- NATS call to "prefill" component
- Decode requests bootstrap coordinates
- Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}
Prefill ↔ Decode (via bootstrap):
- SGLang internal connection (not NATS)
- KV cache state shared via bootstrap mechanism
```
### Data Transfer (NIXL)
NIXL is used only for embedding transfer:
```python
# Encode Worker: expose the embeddings for a remote read
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    # Send request with NIXL metadata
    await pd_worker_client.round_robin(request)
    await readable.wait_for_completion()

# PD Worker: allocate a buffer of the advertised shape and read into it
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
```
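To see why the tensor goes over NIXL while only its metadata rides in the NATS message, it helps to estimate the payload size. A rough sketch, assuming float16 (2 bytes per element, matching the `torch.empty` call above) and an illustrative shape of `(1, 576, 4096)`; `embedding_nbytes` is a hypothetical helper, not part of Dynamo:

```python
import math
from typing import Sequence


def embedding_nbytes(shape: Sequence[int], itemsize: int = 2) -> int:
    """Bytes needed for a dense tensor of the given shape (float16 by default)."""
    return math.prod(shape) * itemsize


size = embedding_nbytes((1, 576, 4096))
print(f"{size} bytes = {size / 2**20:.1f} MiB")  # 4718592 bytes = 4.5 MiB
```

A single image at this shape is already several MiB, whereas the shape and NIXL metadata fit in a small message.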
## Vision Encoding Details
### Encode Worker Components
The encode worker loads and runs the vision model in Python:
```python
# Vision components loaded in encode worker
self.image_processor = AutoImageProcessor.from_pretrained(
    model_path, trust_remote_code=True
)
self.vision_model = AutoModel.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```
### Token Expansion Process
1. Processor inserts single image token (e.g., `<|image_pad|>`)
2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)`
3. Encode worker replaces single token with `num_patches` tokens
4. Downstream worker receives expanded token sequence
Example:
```python
# Before: ["Hello", "<|image_pad|>", "world"]
# After: ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
```
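The expansion step can be sketched on token IDs rather than strings. This is a minimal illustration, assuming every occurrence of the image token expands to the same number of patches; `expand_image_token` and the placeholder ID `99` are hypothetical, and the actual logic is model-specific:

```python
from typing import List


def expand_image_token(
    token_ids: List[int], image_token_id: int, num_patches: int
) -> List[int]:
    """Replace each image placeholder with `num_patches` copies of itself."""
    expanded: List[int] = []
    for tok in token_ids:
        if tok == image_token_id:
            # One placeholder becomes one slot per embedding row.
            expanded.extend([image_token_id] * num_patches)
        else:
            expanded.append(tok)
    return expanded


# 99 stands in for the image token ID; 3 patches keeps the output short.
print(expand_image_token([1, 99, 2], image_token_id=99, num_patches=3))
# [1, 99, 99, 99, 2]
```

Here `num_patches` would come from the second dimension of the embedding shape, so the expanded sequence has exactly one position per embedding vector.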
## Chat Template Processing
SGLang uses its own chat template system:
```python
from sglang.srt.parser.conversation import chat_templates
conv = chat_templates["qwen2-vl"].copy()
conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
```
Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
## NIXL Usage
| Use Case | NIXL Used? | Data Transfer | Notes |
|----------|------------|---------------|-------|
| E→PD Aggregated | ✅ Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
**Key Difference:** Unlike vLLM, which uses NIXL for KV cache transfer, SGLang P→D transfers the KV cache through its bootstrap mechanism.
## Known Limitations
- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers
- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates
## Supported Models
SGLang multimodal **only supports image-based vision-language models**:
### ✅ Supported (Images Only)
- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
- Models with `AutoImageProcessor` and vision tower
- Models compatible with SGLang's image embedding format
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/sglang/main.py` | Component initialization, only Processor registers |
| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang |
| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
| `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) |
# Encode-Prefill-Decode (EPD) Flow with NIXL
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
## Enabling the Feature
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
```bash
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3
```
## Key Features
- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON
## How to use
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
# Launch 3-worker EPD flow with NIXL.
./launch/epd_disagg.sh
```
## Prerequisites
This script is designed for an 8-node H200 setup with the `Llama-4-Maverick-17B-128E-Instruct` model, and assumes that you already have a model-specific embedding file ready.
## Configuration
The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:
```bash
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
```
This endpoint follows Dynamo's standard format: `dyn://namespace.component.endpoint` where the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
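The `dyn://namespace.component.endpoint` format can be illustrated with a small parser. This is only a sketch of the naming scheme described above: `parse_dyn_endpoint` and `DynEndpoint` are hypothetical names, not Dynamo APIs, and the strict three-part check is an assumption.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(uri: str) -> DynEndpoint:
    """Split a dyn:// URI into its namespace, component, and endpoint parts."""
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a dyn:// URI, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    return DynEndpoint(*parts)


print(parse_dyn_endpoint("dyn://dynamo.tensorrt_llm_encode.generate"))
```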
For local embedding file access, use the `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` parameter to specify the secure directory path where embedding files can be loaded from (default: `/tmp`). This prevents path traversal attacks while allowing flexible file access within the designated directory.
```bash
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
```
For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"` parameter to limit the maximum size of downloadable embedding files/Image URLs (default: `50MB`). This prevents Denial of Service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.
```bash
export MAX_FILE_SIZE_MB=50
```
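The two protections above amount to a containment check and a size check. The following is a minimal sketch of that idea, not the Dynamo implementation: `is_path_allowed` and `is_size_allowed` are hypothetical helpers, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os


def is_path_allowed(path: str, allowed_root: str) -> bool:
    """True if `path` resolves to a location inside `allowed_root`."""
    # realpath collapses ".." segments and follows symlinks before comparing.
    root = os.path.realpath(allowed_root)
    candidate = os.path.realpath(path)
    return os.path.commonpath([root, candidate]) == root


def is_size_allowed(size_bytes: int, max_file_size_mb: int = 50) -> bool:
    """True if the file size is within the configured limit."""
    return size_bytes <= max_file_size_mb * 1024 * 1024


print(is_path_allowed("/tmp/embeddings/a.pt", "/tmp"))
print(is_path_allowed("/tmp/../etc/passwd", "/tmp"))
```

Resolving the path before comparing is what defeats `..` traversal; a plain string prefix check would also wrongly accept sibling directories such as `/tmpfoo`.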
## Architecture Overview
The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:
- **Encode Worker**: Loads and processes multimodal embeddings
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
## Request Flow Diagram
```mermaid
sequenceDiagram
participant Client
participant Frontend
participant PrefillWorker as "Prefill Worker<br/>(PrefillHandler)"
participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)"
participant NIXL as "NIXL<br/>(RDMA Transfer)"
Note over Client,NIXL: Unified Frontend: Context processing followed by streaming generation
Client->>Frontend: POST /v1/chat/completions<br/>(multimodal request)
Frontend->>PrefillWorker: Route to prefill worker
Note over PrefillWorker: Check for multimodal content
PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)
Note over EncodeWorker: Load embeddings from file/url<br/>
EncodeWorker->>NIXL: Create readable operation<br/>
EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)
Note over PrefillWorker: Allocate tensor with dynamic shape
PrefillWorker->>NIXL: Begin read operation
NIXL-->>PrefillWorker: Zero-copy transfer complete<br/>
Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
Note over PrefillWorker: Generate KV-cache<br/>(max_tokens=1 in prefill mode)
PrefillWorker->>Frontend: Return prefill response<br/>(disaggregated_params)
Frontend->>DecodeWorker: Route to decode worker<br/>with disaggregated_params
Note over DecodeWorker: Continue generation<br/>(streaming tokens)
DecodeWorker->>Frontend: Stream response chunk 1
Frontend->>Client: Response chunk 1
DecodeWorker->>Frontend: Stream response chunk 2
Frontend->>Client: Response chunk 2
DecodeWorker->>Frontend: ... (continue streaming)
Frontend->>Client: ... (continue streaming)
DecodeWorker->>Frontend: Final response + [DONE]
Frontend->>Client: Final response + [DONE]
```
## How the System Works
1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed by the frontend to prefill workers
2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
3. **NIXL Transfer**: Main tensors transferred via zero-copy RDMA, small metadata via JSON for efficiency
4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker
5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing
## Example Request
The request format is identical to regular multimodal requests:
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image"},
{
"type": "image_url",
"image_url": {"url": "/path/to/embeddings.pt"}
}
]
}
],
"max_tokens": 160
}'
```
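The same request can be built from Python. This is a sketch using only the standard library; `build_chat_request` and `send_chat_request` are hypothetical helpers, and the host and port are taken from the `curl` example above.

```python
import json
import urllib.request


def build_chat_request(
    model: str, prompt: str, media_url: str, max_tokens: int = 160
) -> dict:
    """Build a chat completions payload with one text part and one media part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": media_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the frontend; requires a running deployment."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


payload = build_chat_request(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "Describe the image",
    "/path/to/embeddings.pt",
)
print(json.dumps(payload, indent=2))
```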
......@@ -92,23 +92,50 @@ In general, disaggregated serving can run on a single node, provided the model f
To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md).
## Using Pre-computed Embeddings (Experimental)
## Pre-computed Embeddings with EPD Flow
Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
### How to Use
### Enabling the Feature
Once the container is built, you can send requests with paths to local embedding files.
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag:
- **Format:** Provide the embedding as part of the `messages` array, using the `image_url` content type.
- **URL:** The `url` field should contain the absolute or relative path to your embedding file on the local filesystem.
- **File Types:** Supported embedding file extensions are `.pt`, `.pth`, and `.bin`. Dynamo will automatically detect these extensions.
```bash
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit v1.2.0rc3
```
When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline.
### Supported File Types
### Example Request
- `.pt` - PyTorch tensor files
- `.pth` - PyTorch checkpoint files
- `.bin` - Binary tensor files
### How to Launch
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
```
> **Note:** This script is designed for an 8-node H200 setup with the `Llama-4-Scout-17B-16E-Instruct` model and assumes you have a model-specific embedding file ready.
### Configuration
```bash
# Encode endpoint for Prefill → Encode communication
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
# Security: Allowed directory for embedding files (default: /tmp)
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
Here is an example of how to send a request with a pre-computed embedding file.
# Security: Max file size to prevent DoS attacks (default: 50MB)
export MAX_FILE_SIZE_MB=50
```
### Example Request
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
......@@ -117,27 +144,47 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the content represented by the embeddings"
},
{
"type": "image_url",
"image_url": {
"url": "/path/to/your/embedding.pt"
}
}
{"type": "text", "text": "Describe the image"},
{"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
]
}
],
"stream": false,
"max_tokens": 160
}'
```
## Encode-Prefill-Decode (EPD) Flow with NIXL
Dynamo with the TensorRT-LLM backend supports multimodal models in Encode -> Prefill -> Decode fashion, enabling you to process embeddings separately in a dedicated worker. For detailed setup instructions, example requests, and best practices, see the [Multimodal EPD Support Guide](./multimodal_epd.md).
### Architecture
The EPD flow implements a **3-worker architecture**:
- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL
- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
### Request Flow
```mermaid
sequenceDiagram
participant Client
participant Frontend
participant PrefillWorker as "Prefill Worker"
participant EncodeWorker as "Encode Worker"
participant DecodeWorker as "Decode Worker"
participant NIXL as "NIXL (RDMA)"
Client->>Frontend: POST /v1/chat/completions
Frontend->>PrefillWorker: Route to prefill worker
PrefillWorker->>EncodeWorker: Send request (embedding paths)
EncodeWorker->>NIXL: Create readable operation
EncodeWorker->>PrefillWorker: Send metadata + NIXL info
PrefillWorker->>NIXL: Begin read operation
NIXL-->>PrefillWorker: Zero-copy transfer complete
PrefillWorker->>Frontend: Return prefill response
Frontend->>DecodeWorker: Route to decode worker
DecodeWorker->>Frontend: Stream response chunks
Frontend->>Client: Stream response
```
## Supported Multimodal Models
Multimodel models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by dynamo.
\ No newline at end of file
Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# TRT-LLM Multimodal Guide
This document provides a comprehensive guide for multimodal inference using the TensorRT-LLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_support.md).
## Multimodal Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
## Architecture Comparison
TRT-LLM multimodal supports three deployment patterns:
```text
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • worker flag `--modality multimodal` • Easiest setup
DISAGGREGATED P->D (disagg_multimodal.sh):
Client → Frontend → Prefill [image load, encode] → Decode → Response
• 3 components • worker flag `--disaggregation-mode prefill/decode` • Multi-GPU, KV transfer
EPD DISAGGREGATED - WIP:
Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response
• 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #4668
```
## Input Format Details
### Supported URL Formats
| Format | Example | Description | Support |
|--------|---------|-------------|---------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) | ✅ |
## Simple Aggregated Mode (PD)
In aggregated mode, all processing (image loading, encoding, prefill, decode) happens within a single worker.
### Architecture
```text
HTTP Frontend (Rust)
TRT-LLM Worker (Python - ModelInput.Tokens)
↓ downloads media, encodes, prefill + decode
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Worker | `--modality multimodal` | Tokens | Yes | Complete inference pipeline |
### Launch Script
Example: [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh)
## Disaggregated Mode (P->D)
In disaggregated mode, prefill and decode are handled by separate workers. The prefill worker handles image loading and encoding internally.
### Architecture
```text
HTTP Frontend (Rust)
Prefill Worker (Python - ModelInput.Tokens)
↓ downloads media, encodes, prefill, KV cache transfer
Decode Worker (Python - ModelInput.Tokens)
↓ decode only, token generation
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Prefill Worker | `--disaggregation-mode prefill` | Tokens | Yes | Image processing + Prefill |
| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
### Launch Script
Example: [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)
## Pre-computed Embeddings
TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing.
### Supported File Types
- `.pt` - PyTorch tensor files
- `.pth` - PyTorch checkpoint files
- `.bin` - Binary tensor files
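Deciding whether an `image_url` points at an embedding file rather than an image comes down to an extension check. A minimal sketch of that idea, assuming a case-insensitive match on the path component of the URL; `is_embedding_file` is a hypothetical helper and the real detection logic may differ:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

EMBEDDING_EXTENSIONS = {".pt", ".pth", ".bin"}


def is_embedding_file(url: str) -> bool:
    """True if the URL or local path ends in a supported embedding extension."""
    # urlparse strips any query string; plain local paths pass through unchanged.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in EMBEDDING_EXTENSIONS


print(is_embedding_file("/path/to/embedding.pt"))
print(is_embedding_file("http://example.com/image.jpg"))
```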
### Embedding File Formats
TRT-LLM supports two formats for embedding files:
#### 1. Simple Tensor Format
- Direct tensor saved as `.pt` file
- Example: `llava_next_mm_embed_seashore.pt`
- Contains only the embedding tensor
```python
# Example: Simple tensor format
import torch

embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```
#### 2. Dictionary Format with Auxiliary Data
- Dictionary containing multiple keys
- Used by models like Llama-4 that require additional metadata
- Must contain `mm_embeddings` key with the main tensor
- Can include auxiliary data like special tokens, offsets, etc.
```python
# Example: Dictionary format (Llama-4 style)
import torch

embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),
    "special_tokens": [128256, 128257],
    "image_token_offsets": [[0, 576]],
    # ... other model-specific metadata
}
torch.save(embedding_dict, "llama4_embedding.pt")
```
**How They're Used:**
- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately
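The handling of the two formats can be sketched as a single function that separates the main tensor from auxiliary data. This is an illustration of the rule above, not the actual handler: `split_embedding_payload` is a hypothetical name, and the tensor is treated as an opaque object so the example runs without PyTorch.

```python
from typing import Any, Dict, Tuple


def split_embedding_payload(loaded: Any) -> Tuple[Any, Dict[str, Any]]:
    """Return (main tensor, auxiliary data) for either embedding file format."""
    if isinstance(loaded, dict):
        if "mm_embeddings" not in loaded:
            raise KeyError("dictionary-format embeddings need an 'mm_embeddings' key")
        # Everything except the main tensor is kept as auxiliary data.
        aux = {key: value for key, value in loaded.items() if key != "mm_embeddings"}
        return loaded["mm_embeddings"], aux
    # Simple tensor format: the object itself is the embedding, with no aux data.
    return loaded, {}


tensor_stand_in = object()  # placeholder for a torch.Tensor
main, aux = split_embedding_payload(
    {"mm_embeddings": tensor_stand_in, "special_tokens": [128256, 128257]}
)
print(main is tensor_stand_in, aux)
```

In the EPD flow, the first element would be the part sent over NIXL and the second the part serialized alongside the request metadata.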
### Launch Script
Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh)
### Security Considerations
For EPD mode with local embedding files:
- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`)
- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`)
## EPD Disaggregated Mode (E->P->D) - WIP
**Status:** Work In Progress (WIP PR #4668) - Full EPD flow with MultimodalEncoder
In EPD mode, encoding, prefill, and decode are handled by separate workers. The encode worker uses TensorRT-LLM's `MultimodalEncoder` to process images and transfer embeddings via disaggregated parameters.
### Architecture
```text
HTTP Frontend (Rust)
Encode Worker (Python - NOT registered, uses MultimodalEncoder)
↓ downloads image, encodes with vision model, transfers via disaggregated_params
Prefill Worker (Python - ModelInput.Tokens)
↓ receives embeddings via disaggregated_params, prefill only, KV cache transfer
Decode Worker (Python - ModelInput.Tokens)
↓ decode only, token generation
Response
```
**Note (WIP):** The encode worker uses `MultimodalEncoder` from TensorRT-LLM to actually encode images, not just load pre-computed embeddings. This is a significant change from the legacy NIXL-based embedding transfer.
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Encode Worker | `--disaggregation-mode encode` | N/A | No | Image encoding with MultimodalEncoder |
| Prefill Worker | `--disaggregation-mode prefill --encode-endpoint` | Tokens | Yes | Prefill only |
| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
## ModelInput Types and Registration
### Understanding ModelInput
TRT-LLM workers register with Dynamo using:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Tokens` | Rust SDK tokenizes text (bypassed for multimodal) | All TRT-LLM workers |
### Component Registration Pattern
```python
# TRT-LLM Worker - Register with Tokens
await register_llm(
    ModelInput.Tokens,  # Rust does minimal preprocessing
    model_type,  # ModelType.Chat or ModelType.Prefill
    generate_endpoint,
    model_name,
    ...
)
```
## Inter-Component Communication
| Transfer Stage | Message | NIXL Transfer |
|----------------|--------------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) |
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker |
| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
| E->P->D Disaggregated (WIP) | N/A | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)<br>Prefill → Decode (KV cache via UCX/NIXL) |
**Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
## Known Limitations
- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **No Rust preprocessing** - All preprocessing happens in Python workers
- **E->P->D mode is WIP** - Full EPD with image URLs under development
## Supported Models
Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
Common examples:
- Llama 4 Vision models (Maverick, Scout)
- Qwen2-VL models
- Other vision-language models with TRT-LLM support
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# vLLM Multimodal Guide
This document provides a comprehensive guide for multimodal inference using the vLLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal.md).
## Multimodal Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
## Architecture Comparison
vLLM multimodal supports three deployment patterns:
```text
SIMPLE AGGREGATED (examples/backends/vllm/launch/agg_multimodal.sh):
  Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
  • 2 components • --connector none • Easiest setup

EPD AGGREGATED (examples/backends/vllm/launch/agg_multimodal_epd.sh):
  Client → Frontend → Processor → Encoder [NIXL] → PD Worker → Response
  • 4 components • --multimodal-processor • Custom templates, NIXL

DISAGGREGATED (examples/backends/vllm/launch/disagg_multimodal_epd.sh):
  Client → Frontend → Processor → Encoder [NIXL] → Prefill [NIXL] → Decode → Response
  • 5 components • Separate P/D workers • Multi-node, max optimization
```
## Input Format Details
### Supported URL Formats
| Format | Example | Description | Support |
|--------|---------|-------------|---------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ |
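As a minimal sketch of the two formats above, the snippet below builds a chat request body that carries an image either as a remote URL or as an inline data URL. It assumes the frontend accepts the OpenAI-style `image_url` content part; the helper names `to_data_url` and `build_chat_request` are illustrative and not part of Dynamo.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_chat_request(model: str, prompt: str, image_url: str) -> dict:
    """Build a chat body with one text part and one image part.

    Assumption: the frontend accepts the OpenAI-style `image_url` part.
    """
    # Only the two formats from the table above are allowed through.
    if not image_url.startswith(("http://", "https://", "data:")):
        raise ValueError(f"unsupported URL format: {image_url!r}")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
```

Either form goes through the same request builder, for example `build_chat_request("Qwen/Qwen2.5-VL-7B-Instruct", "Describe the image.", "http://example.com/image.jpg")`, or with `to_data_url(raw_bytes)` as the last argument for inline data.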
## Simple Aggregated Mode (PD)
In simple aggregated mode, encoding, prefill, and decode happen within the same worker.
### Architecture
```text
HTTP Frontend with Rust processor
    ↓
Worker (Python - ModelInput.Tokens)
    ↓ encode + prefill + decode
Response
```
## EPD Aggregated Mode (E->PD)
In EPD aggregated mode, encoding happens in a separate encode worker, while prefill and decode happen within the same PD worker.
### Architecture
```text
HTTP Frontend (Rust)
    ↓
Processor (Python - ModelInput.Text)
    ↓ tokenizes, extracts media URL
Encode Worker (Python - not registered)
    ↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python - ModelInput.Tokens)
    ↓ prefill + decode
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
| PD Worker | `--multimodal-worker` | Tokens | Yes | Prefill + Decode |
## EPD Disaggregated Mode (E->P->D)
In EPD disaggregated mode, encoding, prefill, and decode are handled by separate workers.
### Architecture
```text
HTTP Frontend (Rust)
    ↓
Processor (Python - ModelInput.Text)
    ↓ tokenizes, extracts media URL
Encode Worker (Python - not registered)
    ↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python - ModelInput.Tokens)
    ↓ prefill only, KV cache NIXL transfer
Decode Worker (Python - ModelInput.Tokens)
    ↓ decode only, token generation
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Tokens | Yes | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
## Traditional Disagg (EP->D)
Llama 4 models don't support pre-computed embeddings, so they use a combined Encode+Prefill worker.
### Architecture
```text
HTTP Frontend (Rust)
    ↓
Processor (Python - ModelInput.Text)
    ↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python - ModelInput.Tokens)
    ↓ downloads media, encodes inline, prefill, KV cache NIXL transfer
Decode Worker (Python - ModelInput.Tokens)
    ↓ decode only, token generation
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
| Encode+Prefill | `--multimodal-encode-prefill-worker --is-prefill-worker` | Tokens | Yes | Encode + Prefill |
| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
### Launch Script
Example: [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)
## ModelInput Types and Registration
### Understanding ModelInput
Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
| `ModelInput.Tokens` | Rust SDK tokenizes (bypassed in multimodal flows) | Components expecting pre-tokenized input |
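To make the split concrete, the sketch below records the E->P->D component table as plain data and separates the HTTP entry point from the internal workers. The `ModelInput` enum here is a local stand-in defined only for illustration, not Dynamo's real class, and the `Component` dataclass and helper functions are likewise hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ModelInput(Enum):
    """Local stand-in for Dynamo's ModelInput; not the real import."""
    TEXT = "Text"
    TOKENS = "Tokens"


@dataclass(frozen=True)
class Component:
    name: str
    flag: str
    model_input: Optional[ModelInput]  # None where the table says N/A
    registered: bool


# Rows copied from the E->P->D components table.
EPD_DISAGG = [
    Component("Processor", "--multimodal-processor", ModelInput.TEXT, True),
    Component("Encode Worker", "--multimodal-encode-worker", None, False),
    Component("Prefill Worker", "--multimodal-worker --is-prefill-worker",
              ModelInput.TOKENS, True),
    Component("Decode Worker", "--multimodal-decode-worker",
              ModelInput.TOKENS, True),
]


def http_entry_points(components: List[Component]) -> List[str]:
    """Names of registered components that receive raw text."""
    return [c.name for c in components
            if c.registered and c.model_input is ModelInput.TEXT]


def token_workers(components: List[Component]) -> List[str]:
    """Names of components that expect pre-tokenized input."""
    return [c.name for c in components
            if c.model_input is ModelInput.TOKENS]
```

Under these assumptions only the Processor is an entry point for raw text, while the prefill and decode workers consume tokens and the encode worker is not registered at all.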
### Component Registration Pattern
```python
# Processor - Entry point from HTTP frontend
await register_llm(
    ModelInput.Text,    # Frontend sends raw text
    ModelType.Chat,
    generate_endpoint,
    model_name,
    ...
)

# Workers - Internal components
await register_llm(
    ModelInput.Tokens,  # Expect pre-tokenized input
    ModelType.Chat,     # or ModelType.Prefill for prefill workers
    generate_endpoint,
    model_name,
    ...
)
```
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| Simple Aggregated | [`examples/backends/vllm/launch/agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh) | ❌ No | All in one worker |
| E->PD Aggregated | [`examples/backends/vllm/launch/agg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh) | ✅ Yes | Encoder → PD (embeddings) |
| E->P->D Disaggregated | [`examples/backends/vllm/launch/disagg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh) | ✅ Yes | Encoder → Prefill (embeddings)<br>Prefill → Decode (KV cache) |
| EP->D Disaggregated (Llama 4) | [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) | ✅ Yes | Prefill → Decode (KV cache) |
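The table above can be read as a lookup from deployment mode to the transfers that go over NIXL. The following sketch encodes it that way; the mode keys and function names are hypothetical labels chosen for this example, not identifiers from Dynamo.

```python
from typing import Dict, List, Tuple

# (source, destination, payload) per NIXL transfer, copied from the table.
# The dictionary keys are illustrative names, not Dynamo identifiers.
NIXL_TRANSFERS: Dict[str, List[Tuple[str, str, str]]] = {
    "simple_aggregated": [],
    "e_pd_aggregated": [("Encoder", "PD", "embeddings")],
    "e_p_d_disaggregated": [
        ("Encoder", "Prefill", "embeddings"),
        ("Prefill", "Decode", "KV cache"),
    ],
    "ep_d_disaggregated": [("Prefill", "Decode", "KV cache")],
}


def uses_nixl(mode: str) -> bool:
    """True if the mode has at least one NIXL transfer."""
    return bool(NIXL_TRANSFERS[mode])


def describe(mode: str) -> str:
    """Summarize the NIXL transfers for a mode in one line."""
    transfers = NIXL_TRANSFERS[mode]
    if not transfers:
        return "no NIXL transfer (all in one worker)"
    return "; ".join(f"{src} -> {dst} ({payload})"
                     for src, dst, payload in transfers)
```

A lookup like this makes it easy to see that only the full E->P->D mode moves both embeddings and KV cache over NIXL, while the Llama 4 EP->D mode moves KV cache only.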
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
## Supported Models
The following models have been tested with Dynamo's vLLM multimodal backend:
- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation |
| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
......@@ -50,7 +50,7 @@
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/multimodal_support.md
backends/trtllm/multimodal_epd.md
backends/trtllm/multimodal_trtllm_guide.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md
......@@ -61,6 +61,7 @@
backends/sglang/expert-distribution-eplb.md
backends/sglang/gpt-oss.md
backends/sglang/multimodal_epd.md
backends/sglang/multimodal_sglang_guide.md
backends/sglang/profiling.md
backends/sglang/sgl-hicache-example.md
backends/sglang/sglang-disaggregation.md
......@@ -74,8 +75,10 @@
backends/vllm/deepseek-r1.md
backends/vllm/gpt-oss.md
backends/vllm/LMCache_Integration.md
backends/vllm/multi-node.md
backends/vllm/multimodal.md
backends/vllm/multimodal_vllm_guide.md
backends/vllm/prometheus.md
backends/vllm/speculative_decoding.md
......