Unverified Commit 3c4b3069 authored by Neal Vaidya's avatar Neal Vaidya Committed by GitHub
Browse files

docs: consolidate multimodal docs (#4842)


Signed-off-by: default avatarNeal Vaidya <nealv@nvidia.com>
parent 29ec4969
......@@ -103,7 +103,7 @@ flowchart LR
### Multimodal Example
In the case of the [Dynamo Multimodal Disaggregated Example](../../backends/vllm/multimodal.md):
In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
1. The HTTP frontend accepts a text prompt and a URL to an image.
......
......@@ -38,7 +38,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |
......@@ -276,7 +276,7 @@ Below we provide a selected list of advanced examples. Please open up an issue i
- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**
### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**
- **[Run a multimodal model with EPD Disaggregation](../../multimodal/sglang.md)**
## Deployment
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Encode-Prefill-Decode (EPD) Flow with NIXL
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Multimodal Aggregated Serving
### Components
- workers: For aggregated serving, we have two workers, [MultimodalEncodeWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding and [MultimodalWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow
The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
MultimodalEncodeWorker independently from the prefill and decode workers if needed.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings descriptor--> worker
worker --> encode_worker
```
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_agg.sh
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image."
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 50,
"stream": false
}' | jq
```
You should see a response similar to this:
```json
{
"id": "chatcmpl-2546f44756884a14916ce13ebaa09da8",
"choices": [
{
"index": 0,
"message": {
"content": "This image shows a public transit bus on a dimly lit, street-level track in what appears to be a quiet urban neighborhood or suburban area. The bus displays \"OUT OF SERVICE\" in red on its illuminated sign. It is positioned",
"role": "assistant",
"reasoning_content": null
},
"finish_reason": "length"
}
],
"created": 1758824222,
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"object": "chat.completion",
"usage": {
"prompt_tokens": 0,
"completion_tokens": 40,
"total_tokens": 40
}
}
```
## Multimodal Disaggregated Serving
### Components
- workers: For disaggregated serving, we have three workers, [MultimodalEncodeWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding, [MultimodalWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding, and [MultimodalPrefillWorkerHandler](../../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow
For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing.
The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings descriptor--> worker
worker --> encode_worker
worker --embeddings descriptor--> prefill_worker
prefill_worker --> worker
```
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_disagg.sh
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image."
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 50,
"stream": false
}' | jq
```
You should see a response similar to this:
```json
{
"id": "chatcmpl-2546f44756884a14916ce13ebaa09da8",
"choices": [
{
"index": 0,
"message": {
"content": "This image shows a public transit bus on a dimly lit, street-level track in what appears to be a quiet urban neighborhood or suburban area. The bus displays \"OUT OF SERVICE\" in red on its illuminated sign. It is positioned",
"role": "assistant",
"reasoning_content": null
},
"finish_reason": "length"
}
],
"created": 1758824222,
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"object": "chat.completion",
"usage": {
"prompt_tokens": 0,
"completion_tokens": 40,
"total_tokens": 40
}
}
```
......@@ -240,7 +240,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
## Multimodal support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
## Logits Processing
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# TRT-LLM Multimodal Guide
This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_support.md).
## Multimodal Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
## Architecture Comparison
TRT-LLM multimodal supports three deployment patterns:
```text
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • worker flag `--modality multimodal` • Easiest setup
DISAGGREGATED P->D (disagg_multimodal.sh):
Client → Frontend → Prefill [image load, encode] → Decode → Response
• 3 components • worker flag `--disaggregation-mode prefill/decode` • Multi-GPU, KV transfer
EPD DISAGGREGATED - WIP:
Client → Frontend → Encode [MultimodalEncoder] → Prefill [via params] → Decode → Response
• 4 components • worker flag `--disaggregation-mode encode/prefill/decode` • WIP PR #4668
```
## Input Format Details
### Supported URL Formats
| Format | Example | Description | Support |
|--------|---------|-------------|---------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) | ✅ |
## Simple Aggregated Mode (PD)
In aggregated mode, all processing (image loading, encoding, prefill, decode) happens within a single worker.
### Architecture
```text
HTTP Frontend (Rust)
TRT-LLM Worker (Python - ModelInput.Tokens)
↓ downloads media, encodes, prefill + decode
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Worker | `--modality multimodal` | Tokens | Yes | Complete inference pipeline |
### Launch Script
Example: [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh)
## Disaggregated Mode (P->D)
In disaggregated mode, prefill and decode are handled by separate workers. The prefill worker handles image loading and encoding internally.
### Architecture
```text
HTTP Frontend (Rust)
Prefill Worker (Python - ModelInput.Tokens)
↓ downloads media, encodes, prefill, KV cache transfer
Decode Worker (Python - ModelInput.Tokens)
↓ decode only, token generation
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Prefill Worker | `--disaggregation-mode prefill` | Tokens | Yes | Image processing + Prefill |
| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
### Launch Script
Example: [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh)
## Pre-computed Embeddings
TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding processing.
### Supported File Types
- `.pt` - PyTorch tensor files
- `.pth` - PyTorch checkpoint files
- `.bin` - Binary tensor files
### Embedding File Formats
TRT-LLM supports two formats for embedding files:
#### 1. Simple Tensor Format
- Direct tensor saved as `.pt` file
- Example: `llava_next_mm_embed_seashore.pt`
- Contains only the embedding tensor
```python
# Example: Simple tensor format
embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```
#### 2. Dictionary Format with Auxiliary Data
- Dictionary containing multiple keys
- Used by models like Llama-4 that require additional metadata
- Must contain `mm_embeddings` key with the main tensor
- Can include auxiliary data like special tokens, offsets, etc.
```python
# Example: Dictionary format (Llama-4 style)
embedding_dict = {
"mm_embeddings": torch.rand(1, 576, 4096),
"special_tokens": [128256, 128257],
"image_token_offsets": [[0, 576]],
# ... other model-specific metadata
}
torch.save(embedding_dict, "llama4_embedding.pt")
```
**How They're Used:**
- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data and transferred separately
### Launch Script
Example: [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh)
### Security Considerations
For EPD mode with local embedding files:
- `--allowed-local-media-path` - Specify secure directory for embedding files (default: `/tmp`)
- `--max-file-size-mb` - Limit max file size to prevent DoS attacks (default: `50MB`)
## EPD Disaggregated Mode (E->P->D) - WIP
**Status:** Work In Progress (WIP PR #4668) - Full EPD flow with MultimodalEncoder
In EPD mode, encoding, prefill, and decode are handled by separate workers. The encode worker uses TensorRT-LLM's `MultimodalEncoder` to process images and transfer embeddings via disaggregated parameters.
### Architecture
```text
HTTP Frontend (Rust)
Encode Worker (Python - NOT registered, uses MultimodalEncoder)
↓ downloads image, encodes with vision model, transfers via disaggregated_params
Prefill Worker (Python - ModelInput.Tokens)
↓ receives embeddings via disaggregated_params, prefill only, KV cache transfer
Decode Worker (Python - ModelInput.Tokens)
↓ decode only, token generation
Response
```
**Note (WIP):** The encode worker uses `MultimodalEncoder` from TensorRT-LLM to actually encode images, not just load pre-computed embeddings. This is a significant change from the legacy NIXL-based embedding transfer.
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Encode Worker | `--disaggregation-mode encode` | N/A | No | Image encoding with MultimodalEncoder |
| Prefill Worker | `--disaggregation-mode prefill --encode-endpoint` | Tokens | Yes | Prefill only |
| Decode Worker | `--disaggregation-mode decode` | Tokens | Yes | Decode only |
## ModelInput Types and Registration
### Understanding ModelInput
TRT-LLM workers register with Dynamo using:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Tokens` | Rust SDK tokenizes text (bypassed for multimodal) | All TRT-LLM workers |
### Component Registration Pattern
```python
# TRT-LLM Worker - Register with Tokens
await register_llm(
ModelInput.Tokens, # Rust does minimal preprocessing
model_type, # ModelType.Chat or ModelType.Prefill
generate_endpoint,
model_name,
...
)
```
## Inter-Component Communication
| Transfer Stage | Message | NIXL Transfer |
|----------------|--------------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) |
## **NIXL USE**
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker |
| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
| E->P->D Disaggregated (WIP) | X | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)<br>Prefill → Decode (KV cache via UCX/NIXL) |
**Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
## Known Limitations
- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **No Rust preprocessing** - All preprocessing happens in Python workers
- **E->P->D mode is WIP** - Full EPD with image URLs under development
## Supported Models
Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
Common examples:
- Llama 4 Vision models (Maverick, Scout)
- Qwen2-VL models
- Other vision-language models with TRT-LLM support
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Example: Multi-node TRTLLM Workers with Dynamo on Slurm for multimodal models
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
This guide demonstrates how to deploy large multimodal models that require a multi-node setup. It builds on the general multi-node deployment process described in the main [multinode-examples.md](./multinode-examples.md) guide.
Before you begin, ensure you have completed the initial environment configuration by following the **Setup** section in that guide.
The following sections provide specific instructions for deploying `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, including environment variable setup and launch commands. These steps can be adapted for other large multimodal models.
## Environment Variable Setup
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based:
```bash
# NOTE: IMAGE must be set manually for now
# To build an iamge, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
export IMAGE="<dynamo_trtllm_image>"
# MOUNTS are the host:container path pairs that are mounted into the containers
# launched by each `srun` command.
#
# If you want to reference files, such as $MODEL_PATH below, in a
# different location, you can customize MOUNTS or specify additional
# comma-separated mount pairs here.
#
# NOTE: Currently, this example assumes that the local bash scripts and configs
# referenced are mounted into into /mnt inside the container. If you want to
# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
# accordingly for the new locations of `start_frontend_services.sh` and
# `start_trtllm_worker.sh`.
#
# For example, assuming your cluster had a `/lustre` directory on the host, you
# could add that as a mount like so:
#
# export MOUNTS="${PWD}/../../../../:/mnt,/lustre:/lustre"
export MOUNTS="${PWD}/../../../../:/mnt"
# Can point to local FS as weel
# export MODEL_PATH="/location/to/model"
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
# The name the model will be served/queried under, matching what's
# returned by the /v1/models endpoint.
#
# By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it can be nice to have explicit control over the name.
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODALITY=${MODALITY:-"multimodal"}
```
## Disaggregated Mode
Assuming you have at least 4 4xGB200 nodes allocated (2 for prefill, 2 for decode)
following the setup above, follow these steps below to launch a **disaggregated**
deployment across 4 nodes:
> [!Tip]
> Make sure you have a fresh environment and don't still have the aggregated
> example above still deployed on the same set of nodes.
```bash
# Defaults set in srun_disaggregated.sh, but can customize here.
# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
# GPUs necessary to satisfy the requested parallelism in each config.
# export NUM_PREFILL_NODES=2
# export NUM_DECODE_NODES=2
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
# export NUM_GPUS_PER_NODE=4
# Launches:
# - frontend + etcd/nats on current (head) node.
# - one large prefill trtllm worker across multiple nodes via MPI tasks
# - one large decode trtllm worker across multiple nodes via MPI tasks
./srun_disaggregated.sh
```
## Understanding the Output
1. The `srun_disaggregated.sh` launches three srun jobs instead of two. One for frontend, one for prefill worker, and one for decode worker.
2. The OpenAI frontend will listen for and dynamically discover workers as
they register themselves with Dynamo's distributed runtime:
```
0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models
0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
```
3. The TRTLLM worker will consist of N (N=8 for TP8) MPI ranks, 1 rank on each
GPU on each node, which will each output their progress while loading the model.
You can see each rank's output prefixed with the rank at the start of each log line
until the model succesfully finishes loading:
```
7: rank7 run mgmn worker node with mpi_world_size: 8 ...
```
4. After the model fully finishes loading on all ranks, the worker will register itself,
and the OpenAI frontend will detect it, signaled by this output:
```
0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
```
5. At this point, with the worker fully initialized and detected by the frontend,
it is now ready for inference.
## Example Request
To verify the deployed model is working, send a `curl` request:
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
}
}
]
}
],
"stream": false,
"max_tokens": 160
}'
```
## Cleanup
To cleanup background `srun` processes launched by `srun_aggregated.sh` or
`srun_disaggregated.sh`, you can run:
```bash
pkill srun
```
## Known Issues
- Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not posssible due to Llama4 Maverick has a config `"num_attention_heads": 40` , trtllm engine asserts on assert `self.num_heads % tp_size == 0` causing the engine to crash on model loading.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multimodal Support
Dynamo supports multimodal models with vLLM v1. In general, multimodal models can be served using the aggregated serving setup with [`agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh).
> [!WARNING]
> **LLaVA Model Limitation**: Do not use LLaVA models (e.g., `llava-hf/llava-1.5-7b-hf`) with the standard aggregated serving setup, as they contain keywords that Dynamo cannot yet parse. LLaVA models can still be used with the EPD (Encode-Prefill-Decode) setup described below.
> [!IMPORTANT]
> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
This flag is analogus to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
# Multimodal EPD Deployment Examples
This section provides example workflows and reference implementations for deploying a multimodal model using Dynamo and vLLM v1 with EPD(Encode-Prefill-Decode) pipeline.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Multimodal Aggregated Serving
### Components
- workers: For aggregated serving, we have two workers, [EncodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
### Workflow
The EncodeWorkerHandler is responsible for encoding the image and passing the embeddings to the MultimodalPDWorkerHandler via a combination of NATS and RDMA.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
Its MultimodalPDWorkerHandler then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
MultimodalPDWorkerHandler independently from the prefill and decode workers if needed.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> pd_worker
pd_worker --> encode_worker
```
***Note*** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct today. Disaggregated serving is currently only confirmed for LLaVA (see note below).
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# Serve a LLaVA 1.5 7B model:
bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
# Serve a Qwen2.5-VL model:
bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
If serving the example Qwen model, replace `"llava-hf/llava-1.5-7b-hf"` in the `"model"` field with `"Qwen/Qwen2.5-VL-7B-Instruct"`.
You should see a response similar to this:
```json
{"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
```
## Multimodal Disaggregated Serving
### Components
- workers: For disaggregated serving, we have three workers, [EncodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
### Workflow
In this workflow, we have three workers, [EncodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py), [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py), and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py).
For the LLaVA model, embeddings are only required during the prefill stage. As such, the EncodeWorkerHandler is connected directly to the prefill worker.
The EncodeWorkerHandler is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA.
Its work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> prefill_worker
prefill_worker --> encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
You should see a response similar to this:
```json
{"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
```
***Note***: disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
## Llama 4 family Serving
The family of Llama 4 models is natively multimodal, however, different
from Llava, they do not directly consume image embedding as input
(see the [support metrics](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)
from vLLM for the types of multi-modal inputs supported by the model).
Therefore, encoder worker will not be used in the following example and the
encoding will be done along side with prefill.
`meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` will be used as an example
for the content below. And the system will be H100x8 which can hold one instance
of the model per node.
### Multimodal Aggregated Serving
#### Components
- workers: For aggregated serving, we have one worker, [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the MultimodalPDWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
#### Workflow
In this workflow, we have [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) which will encode the image, prefill and decode the prompt, just like the [LLM aggregated serving](README.md) example.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> pd_worker
pd_worker --> processor
```
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal_llama.sh
```
#### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
You should see a response similar to this:
```json
{"id": "b8f060fa95584e34b9204eaba7b105cc", "object": "chat.completion", "created": 1752706281, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall ambiance.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a dreamy or nostalgic feel, inviting the viewer to reflect on the scene."}, "finish_reason": "stop"}]}
```
### Multimodal Disaggregated Serving
#### Components
- workers: For disaggregated serving, we have two workers, [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for encoding and prefilling.
- processor: Tokenizes the prompt and passes it to the MultimodalPDWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
#### Workflow
In this workflow, we have two workers, [MultimodalDecodeWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py), and [MultimodalPDWorkerHandler](../../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py).
The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> prefill_worker
prefill_worker --> processor
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh --head-node
# On a separate node that has finished standard dynamo setup, i.e.
# the worker node needs NATS_SERVER and ETCD_ENDPOINTS environment variables
# pointing to the head node's external IP address for distributed coordination
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh
```
#### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
You should see a response similar to this:
```json
{"id": "6cc99123ad6948d685b8695428238d4b", "object": "chat.completion", "created": 1752708043, "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", "choices": [{"index": 0, "message": {"role": "assistant", "content": "The image depicts a street scene with a trolley bus as the central focus. The trolley bus is positioned on the left side of the road, facing the camera, and features a white and yellow color scheme. A prominent sign on the front of the bus reads \"OUT OF SERVICE\" in orange letters.\n\n**Key Elements:**\n\n* **Trolley Bus:** The bus is the main subject of the image, showcasing its distinctive design and color.\n* **Sign:** The \"OUT OF SERVICE\" sign is clearly visible on the front of the bus, indicating its current status.\n* **Street Scene:** The surrounding environment includes trees, buildings, and power lines, creating a sense of context and atmosphere.\n* **Lighting:** The image is characterized by a misty or foggy quality, with soft lighting that adds to the overall mood.\n\n**Overall Impression:**\n\nThe image presents a serene and somewhat melancholic scene, with the out-of-service trolley bus serving as a focal point. The misty atmosphere and soft lighting contribute to a contemplative ambiance, inviting the viewer to reflect on the situation."}, "finish_reason": "stop"}]}
```
## Multimodal Aggregated Video Serving
This example demonstrates deploying an aggregated multimodal model that can process video inputs.
### Components
- workers: For video serving, we use the [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames, and send the frames to [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
### Workflow
In this workflow, we have two workers, [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py) and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
The VideoEncodeWorker is responsible for decoding the video into a series of frames. Unlike the image pipeline which generates embeddings,
this pipeline passes the raw frames directly to the VllmPDWorker via a combination of NATS and RDMA.
Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
By separating the video processing from the prefill and decode stages, we can have a more flexible deployment and scale the
VideoEncodeWorker independently from the prefill and decode workers if needed.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --video_url--> video_encode_worker
video_encode_worker --> processor
video_encode_worker --frames--> pd_worker
pd_worker --> video_encode_worker
```
```bash
cd $DYNAMO_HOME/examples/multimodal
bash launch/video_agg.sh
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video in detail"
},
{
"type": "video_url",
"video_url": {
"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
}
}
]
}
],
"max_tokens": 300,
"stream": false
}' | jq
```
You should see a response describing the video's content similar to
```json
{
"id": "7587e7d152014bae8e5c4e25f9fda0ed",
"choices": [
{
"index": 0,
"message": {
"content": " The video takes us away to a lively world of wildlife and natural beauty, featuring a white rabbit in a vibrant forest setting. At the beginning of the clip, the white rabbit is seen standing on a rock, facing towards the right side of the frame, with bushes and trees in the backdrop. The rabbit appears to be alert, given its ears are up and its ears perked in the air. As the clip progresses, the movement of the rabbit brings it around a tree, where its legs are partially hidden by the dense vegetation. It then sits down and grooms its fur, a behavior that suggests it is comfortable in its surroundings. \n\nThe scene then switches to a close-up shot of the rabbit, giving us a better view of its features and expressions. In this camera angle, the rabbit appears more dynamic and alert, with its breathing more visible, signaling its health and well-being. The camera pans out, and we see the rabbit heading towards the left side of the screen, possibly curious or hunting for food, with its ears perked up again. The lush greenery of the forest unfolds in the background, adding to the feeling of a wild and thriving environment.\n\n\nThe rabbit, upturned slightly while walking, finds a pile of dirt and rocks and sits there, fully clothed, perhaps taking a break from its exploration. There's a mention of a blue bird that appears to perch atop a log, adding a touch of whimsy to the scene. Lastly, the rabbit is observed relaxing on the rocks, resting comfortably, and looking off to the right side—a moment of tranquility in a bustling ecosystem. Throughout the clip, the rabbit's outfit remains the same, allowing for a clear focus on its behavior and characteristics while fitting in its habitat.",
"role": "assistant",
"reasoning_content": null
},
"finish_reason": "stop"
}
],
"created": 1756251832,
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
"object": "chat.completion",
"usage": null
}
```
## Multimodal Disaggregated Video Serving
This example demonstrates deploying a disaggregated multimodal model that can process video inputs.
### Components
- workers: For disaggregated video serving, we have three workers, [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames,
[VllmDecodeWorker](../../../examples/multimodal/components/worker.py) for decoding, and [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
### Workflow
In this workflow, we have three workers, [VideoEncodeWorker](../../../examples/multimodal/components/video_encode_worker.py), [VllmDecodeWorker](../../../examples/multimodal/components/worker.py), and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. As such, the VideoEncodeWorker is connected directly to the prefill worker.
The VideoEncodeWorker is responsible for decoding the video into a series of frames and passing them to the prefill worker via RDMA.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --video_url--> video_encode_worker
video_encode_worker --> processor
video_encode_worker --frames--> prefill_worker
prefill_worker --> video_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
```bash
cd $DYNAMO_HOME/examples/multimodal
bash launch/video_disagg.sh
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video in detail"
},
{
"type": "video_url",
"video_url": {
"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
}
}
]
}
],
"max_tokens": 300,
"stream": false
}' | jq
```
You should see a response describing the video's content similar to
```json
{
"id": "7587e7d152014bae8e5c4e25f9fda0ed",
"choices": [
{
"index": 0,
"message": {
"content": " The video takes us away to a lively world of wildlife and natural beauty, featuring a white rabbit in a vibrant forest setting. At the beginning of the clip, the white rabbit is seen standing on a rock, facing towards the right side of the frame, with bushes and trees in the backdrop. The rabbit appears to be alert, given its ears are up and its ears perked in the air. As the clip progresses, the movement of the rabbit brings it around a tree, where its legs are partially hidden by the dense vegetation. It then sits down and grooms its fur, a behavior that suggests it is comfortable in its surroundings. \n\nThe scene then switches to a close-up shot of the rabbit, giving us a better view of its features and expressions. In this camera angle, the rabbit appears more dynamic and alert, with its breathing more visible, signaling its health and well-being. The camera pans out, and we see the rabbit heading towards the left side of the screen, possibly curious or hunting for food, with its ears perked up again. The lush greenery of the forest unfolds in the background, adding to the feeling of a wild and thriving environment.\n\n\nThe rabbit, upturned slightly while walking, finds a pile of dirt and rocks and sits there, fully clothed, perhaps taking a break from its exploration. There's a mention of a blue bird that appears to perch atop a log, adding a touch of whimsy to the scene. Lastly, the rabbit is observed relaxing on the rocks, resting comfortably, and looking off to the right side—a moment of tranquility in a bustling ecosystem. Throughout the clip, the rabbit's outfit remains the same, allowing for a clear focus on its behavior and characteristics while fitting in its habitat.",
"role": "assistant",
"reasoning_content": null
},
"finish_reason": "stop"
}
],
"created": 1756251832,
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
"object": "chat.completion",
"usage": null
}
```
## Multimodal Aggregated Audio Serving
This example demonstrates deploying an aggregated multimodal model that can process audio inputs.
### Components
- workers: For audio serving, we use the [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py) for decoding audio into audio embeddings, and send the embeddings to [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
### Workflow
In this workflow, we have two workers, [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py) and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
The AudioEncodeWorker is responsible for decoding the audio into embeddings.
Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
By separating the audio processing from the prefill and decode stages, we can have a more flexible deployment and scale the
AudioEncodeWorker independently from the prefill and decode workers if needed.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> pd_worker
pd_worker --> audio_encode_worker
```
```bash
pip install vllm["audio"] accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_agg.sh
```
### Client
In another terminal:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-Audio-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is recited in the audio?"
},
{
"type": "audio_url",
"audio_url": {
"url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
}
}
]
}
],
"max_tokens": 6000,
"temperature": 0.8,
"stream": false
}' | jq
```
You should see a response describing the audio's content similar to
```json
{
"id": "e2d8d67c37634b309400974eaa058ce8",
"choices": [
{
"index": 0,
"message": {
"content": "The original content of this audio is:'yet these thoughts affected Hester Pynne less with hope than apprehension.'",
"refusal": null,
"tool_calls": null,
"role": "assistant",
"function_call": null,
"audio": null
},
"finish_reason": "stop",
"logprobs": null
}
],
"created": 1756368148,
"model": "Qwen/Qwen2-Audio-7B-Instruct",
"service_tier": null,
"system_fingerprint": null,
"object": "chat.completion",
"usage": null
}
```
## Multimodal Disaggregated Audio Serving
This example demonstrates deploying a disaggregated multimodal model that can process audio inputs.
### Components
- workers: For disaggregated audio serving, we have three workers, [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py) for decoding audio into embeddings,
[VllmDecodeWorker](../../../examples/multimodal/components/worker.py) for decoding, and [VllmPDWorker](../../../examples/multimodal/components/worker.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
### Workflow
In this workflow, we have three workers, [AudioEncodeWorker](../../../examples/multimodal/components/audio_encode_worker.py), [VllmDecodeWorker](../../../examples/multimodal/components/worker.py), and [VllmPDWorker](../../../examples/multimodal/components/worker.py).
For the Qwen/Qwen2-Audio-7B-Instruct model, audio embeddings are only required during the prefill stage. As such, the AudioEncodeWorker is connected directly to the prefill worker.
The AudioEncodeWorker is responsible for decoding the audio into embeddings and passing them to the prefill worker via RDMA.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
This figure illustrates the workflow:
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> prefill_worker
prefill_worker --> audio_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
```bash
pip install vllm["audio"] accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_disagg.sh
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# vLLM Multimodal Guide
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal.md).
## Multimodal Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
## Architecture Comparison
vLLM multimodal supports three deployment patterns:
```text
SIMPLE AGGREGATED ([examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)):
Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
• 2 components • --connector none • Easiest setup
EPD AGGREGATED ([examples/backends/vllm/launch/agg_multimodal_epd.sh](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh)):
Client → Frontend → Processor → Encoder [NIXL] → PD Worker → Response
• 4 components • --multimodal-processor • Custom templates, NIXL
DISAGGREGATED ([examples/backends/vllm/launch/disagg_multimodal_epd.sh](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh)):
Client → Frontend → Processor → Encoder [NIXL] → Prefill [NIXL] → Decode → Response
• 5 components • Separate P/D workers • Multi-node, max optimization
```
## Input Format Details
### Supported URL Formats
| Format | Example | Description | Support |
|--------|---------|-------------|---------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | ✅ |
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ |
## Simple Aggregated Mode (PD)
In simple aggregated mode, encoding, prefill, and decode happen within the same worker.
### Architecture
```text
HTTP Frontend with Rust processor
Worker (Python - ModelInput.Tokens)
↓ encode + prefill + decode
Response
```
## EPD Aggregated Mode (PD)
In EPD aggregated mode, encoding happens in a separate worker and prefill and decode happen within the same pipeline.
### Architecture
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
↓ tokenizes, extracts media URL
Encode Worker (Python - not registered)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python - ModelInput.Tokens)
↓ prefill + decode
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
| PD Worker | `--multimodal-worker` | Tokens | Yes | Prefill + Decode |
## EPD Disaggregated Mode (E->P->D)
In EPD disaggregated mode, encoding, prefill, and decode are handled by separate workers.
### Architecture
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
↓ tokenizes, extracts media URL
Encode Worker (Python - not registered)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python - ModelInput.Tokens)
↓ prefill only, KV cache NIXL transfer
Decode Worker (Python - ModelInput.Tokens)
↓ decode only, token generation
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | N/A | No | Media encoding |
| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Tokens | Yes | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
## Traditional Disagg (EP->D)
Llama 4 models don't support pre-computed embeddings, so they use a combined Encode+Prefill worker.
### Architecture
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python - ModelInput.Tokens)
↓ downloads media, encodes inline, prefill, KV cache NIXL transfer
Decode Worker (Python - ModelInput.Tokens)
↓ decode only, token generation
Response
```
### Components
| Component | Flag | ModelInput | Registered | Purpose |
|-----------|------|-----------|------------|---------|
| Processor | `--multimodal-processor` | Text | Yes | HTTP entry, tokenization |
| Encode+Prefill | `--multimodal-encode-prefill-worker --is-prefill-worker` | Tokens | Yes | Encode + Prefill |
| Decode Worker | `--multimodal-decode-worker` | Tokens | Yes | Decode only |
### Launch Script
Example: [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh)
## ModelInput Types and Registration
### Understanding ModelInput
Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
### Component Registration Pattern
```python
# Processor - Entry point from HTTP frontend
await register_llm(
ModelInput.Text, # Frontend sends raw text
ModelType.Chat,
generate_endpoint,
model_name,
...
)
# Workers - Internal components
await register_llm(
ModelInput.Tokens, # Expect pre-tokenized input
ModelType.Chat, # or ModelType.Prefill for prefill workers
generate_endpoint,
model_name,
...
)
```
## **NIXL USE**
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| Simple Aggregated | [`examples/backends/vllm/launch/agg_multimodal.sh`](../../../examples/backends/vllm/launch/agg_multimodal.sh) | ❌ No | All in one worker |
| E->PD Aggregated | [`examples/backends/vllm/launch/agg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/agg_multimodal_epd.sh) | ✅ Yes | Encoder → PD (embeddings) |
| E->P->D Disaggregated | [`examples/backends/vllm/launch/disagg_multimodal_epd.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_epd.sh) | ✅ Yes | Encoder → Prefill (embeddings)<br>Prefill → Decode (KV cache) |
| EP->D Disaggregated (Llama 4) | [`examples/backends/vllm/launch/disagg_multimodal_llama.sh`](../../../examples/backends/vllm/launch/disagg_multimodal_llama.sh) | ✅ Yes | Prefill → Decode (KV cache) |
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
## Supported Models
The following models have been tested with Dynamo's vLLM multimodal backend:
- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation |
| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
......@@ -86,6 +86,15 @@ redirects = {
"dynamo_glossary": "../reference/glossary.html",
"support_matrix": "../reference/support-matrix.html",
"components/router/README": "../router/README.html",
# Multimodal documentation consolidation
"backends/vllm/multimodal": "../../multimodal/vllm.html",
"backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html",
"backends/trtllm/multimodal_support": "../../multimodal/trtllm.html",
"backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html",
"backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html",
"backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
"backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
"multimodal/multimodal_intro": "index.html",
}
# Custom extensions
......
......@@ -46,11 +46,8 @@
fault_tolerance/request_cancellation.md
backends/trtllm/multinode/multinode-examples.md
backends/trtllm/multinode/multinode-multimodal-example.md
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/multimodal_support.md
backends/trtllm/multimodal_trtllm_guide.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md
......@@ -60,8 +57,6 @@
backends/sglang/dsr1-wideep-h100.md
backends/sglang/expert-distribution-eplb.md
backends/sglang/gpt-oss.md
backends/sglang/multimodal_epd.md
backends/sglang/multimodal_sglang_guide.md
backends/sglang/profiling.md
backends/sglang/sgl-hicache-example.md
backends/sglang/sglang-disaggregation.md
......@@ -77,8 +72,6 @@
backends/vllm/gpt-oss.md
backends/vllm/LMCache_Integration.md
backends/vllm/multi-node.md
backends/vllm/multimodal.md
backends/vllm/multimodal_vllm_guide.md
backends/vllm/prometheus.md
backends/vllm/speculative_decoding.md
......
......@@ -58,7 +58,7 @@ Quickstart
:caption: User Guides
Tool Calling <agents/tool-calling.md>
Multimodality Support <multimodal/multimodal_intro.md>
Multimodality Support <multimodal/index.md>
Finding Best Initial Configs <performance/aiconfigurator.md>
Benchmarking <benchmarks/benchmarking.md>
Tuning Disaggregated Performance <performance/tuning.md>
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multimodal Inference in Dynamo
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
> [!IMPORTANT]
> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
> See the relevant documentation for each backend for the necessary flags.
>
> This prevents unintended processing of multimodal data from untrusted sources.
## Backend Documentation
```{toctree}
:maxdepth: 1
vLLM Multimodal <vllm.md>
TensorRT-LLM Multimodal <trtllm.md>
SGLang Multimodal <sglang.md>
```
## Support Matrix
### Backend Capabilities
| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|-------|------|-------|------|-----|-------|-------|-------|
| **[vLLM](vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| **[TRT-LLM](trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| **[SGLang](sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
**Pattern Key:**
- **EPD** - All-in-one worker (Simple Aggregated)
- **E/PD** - Separate encode, combined prefill+decode
- **E/P/D** - All stages separate
- **EP/D** - Combined encode+prefill, separate decode
**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
### Input Format Support
| Format | vLLM | TRT-LLM | SGLang |
|--------|------|---------|--------|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | ✅ | ❌ | ❌ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
## Architecture Patterns
Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
- *Inline*: Simpler setup, encoding happens in the prefill worker
- *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling
2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
- *Aggregated*: Single worker handles both prefill and decode
- *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
These combine into four deployment patterns:
### EPD - Simple Aggregated
All processing happens within a single worker - the simplest setup.
```text
HTTP Frontend (Rust)
Worker (Python)
↓ image load + encode + prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
| Worker | Complete inference pipeline (encode + prefill + decode) |
**When to use:** Quick setup, smaller models, development/testing.
### E/PD - Encode Separate
Encoding happens in a separate worker; prefill and decode share the same engine.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
↓ receives embeddings via NIXL, prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| PD Worker | Prefill + Decode with embeddings |
**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.
### E/P/D - Full Disaggregation
Full disaggregation with separate workers for encoding, prefill, and decode.
There are two variants of this workflow:
- Prefill-first, used by vLLM
- Decode-first, used by SGlang
Prefill-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
OR
Decode-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
↓ Bootstraps prefill worker
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| Prefill Worker | Prefill only, transfers KV cache |
| Decode Worker | Decode only, token generation |
**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
### EP/D - Traditional Disaggregated
Encoding is combined with prefill, with decode separate.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
| Encode+Prefill Worker | Combined encoding and prefill |
| Decode Worker | Decode only, token generation |
> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
## Example Workflows
You can find example workflows and reference implementations for deploying multimodal models in:
- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multimodal Inference in Dynamo:
You can find example workflows and reference implementations for deploying a multimodal model using Dynamo in [multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal).
## EPD vs. PD Disaggregation
Dynamo supports two primary approaches for processing multimodal inputs, which differ in how the initial media encoding step is handled relative to the main LLM inference engine.
### 1. EPD (Encode-Prefill-Decode) Disaggregation
The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.
* **Media Input:** Image, video, audio, or an embedding URL is provided.
* **Process Flow:**
1. A dedicated **Encode Worker** is launched separately to handle the embedding extraction from the media input.
2. The extracted embeddings are transferred to the main engine via the **NVIDIA Inference Xfer Library (NIXL)**.
3. The main **Engine** performs the remaining **Prefill Decode Disaggregation** steps to generate the output.
* **Benefit:** This disaggregation allows for the decoupling of media encoding hardware/resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
### 2. PD (Prefill-Decode) Disaggregation
The PD approach is a more traditional, aggregated method where the inference engine handles the entire process.
* **Media Input:** Image, video, or audio is loaded.
* **Process Flow:**
1. The main **Engine** receives the media input.
2. The Engine executes the full sequence: **Encode + Prefill + Decode**.
* **Note:** In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
## Inference Framework Support Matrix
Dynamo supports multimodal capabilities across leading LLM inference backends, including **vLLM**, **TensorRT-LLM (TRT-LLM)**, and **SGLang**. The table below details the current support level for EPD/PD and various media types for each stack.
| Stack | EPD Support | PD Support | Image | Video | Audio |
| --------- | --------- | --------- | --------- |---------| --------- |
| **vLLM** | ✅ | ✅ | ✅ | ✅ | 🚧 |
| **TRT-LLM** | ✅ (Currently via precomputed Embeddings URL) | ✅ | ✅ | ❌ | ❌ |
| **SGLang** | ✅ | ❌ | ✅ | ❌ | ❌ |
......@@ -15,171 +15,231 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# SGLang Multimodal Guide
# SGLang Multimodal
This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. For more details on the multimodal examples, see [Multimodal Examples Documentation](./multimodal_epd.md).
This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal uses specialized **E/PD or E/P/D** flows with **NIXL (RDMA)** for zero-copy tensor transfer.
## Multimodal Support Matrix
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | No | No | Not supported |
| **Video** | HTTP/HTTPS URL | No | No | Not implemented |
| **Audio** | HTTP/HTTPS URL | No | No | Not implemented |
| **Image** | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | No | No | |
| **Video** | HTTP/HTTPS URL | No | No | |
| **Audio** | HTTP/HTTPS URL | No | No | |
## Architecture Comparison
### Supported URL Formats
SGLang multimodal supports two deployment patterns:
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
```text
AGGREGATED (E->PD):
Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
• 3 components • Vision encoder in Python • NIXL embeddings transfer
## Deployment Patterns
DISAGGREGATED (E->P->D):
Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
• 4 components • Vision encoder in Python • KV cache transfer via bootstrap mechanism
```
SGLang supports E/PD and E/P/D patterns only (always has a separate encode worker). See [Multimodal Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
## Aggregated Mode (E->PD)
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ❌ | N/A | Not supported |
| E/PD (Encode Separate) | ✅ | `multimodal_agg.sh` | Vision encoder separate |
| E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap |
| EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported |
In aggregated mode, encoding happens in a separate worker, but prefill and decode share the same engine.
### Component Flags
### Architecture
| Component | Flag | Purpose |
|-----------|------|---------|
| Processor | `--multimodal-processor` | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | Prefill + Decode with embeddings |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | Entry point for disaggregation |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | Called by Decode, bootstrap coordination |
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
PD Worker (Python - NOT registered)
↓ receives embeddings via NIXL, prefill + decode
Response → Processor → Frontend
### SGLang-Specific Characteristics
- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
- **No Rust Processing**: All tokenization and image handling happens in Python
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## E/PD Serving (Encode Separate)
### Components
| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|-----------|------------|-------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + Decode with embeddings |
- workers:
- [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
- [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py)
- tokenizes the prompt using the chat template
- passes the text and image url to the MultimodalEncodeWorker.
### Key Characteristics
### Workflow
- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
- **No Rust Processing**: All tokenization and image handling happens in Python
The `MultimodalEncodeWorker` downloads and encodes the image and passes the embeddings to the MultimodalWorker. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../backends/sglang/README.md) example. Only the processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register - they are internal components and communicate via NATS.
## Disaggregated Mode (E->P->D)
```mermaid
flowchart LR
HTTP --> processor
processor --tokenized request + image_url--> encode_worker
encode_worker --request + embeddings--> worker
In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination.
worker -.-> encode_worker
encode_worker -.-> processor
processor -.-> HTTP
```
### Architecture
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
Prefill Worker (Python - NOT registered)
↓ receives embeddings via NIXL, prefill only, returns bootstrap info
Decode Worker (Python - NOT registered)
↓ uses bootstrap info, decode only, token generation
Response → Processor → Frontend
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_agg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image."
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 50,
"stream": false
}' | jq
```
## E/P/D Serving (Full Disaggregation)
### Components
| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|-----------|------------|-------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination |
- workers:
- [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
- [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding
- [MultimodalPrefillWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling
- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow
In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation.
```mermaid
flowchart LR
HTTP --> processor
processor --tokenized request + image_url--> encode_worker
encode_worker --request + embeddings--> worker
worker --request + embeddings--> prefill_worker
prefill_worker --KV Cache--> worker
encode_worker -.-> processor
worker -.-> encode_worker
processor -.-> HTTP
```
### Bootstrap Coordination
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_disagg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image."
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 50,
"stream": false
}' | jq
```
## Bootstrap Coordination
SGLang disaggregation uses a bootstrap mechanism for P->D coordination:
**Request Flow (Important):**
### Request Flow (Important)
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
Entry point for disaggregation!
```
**Bootstrap Process:**
### Bootstrap Process
1. **Decode Worker** receives request from Encode Worker
2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info
3. **Prefill Worker** generates `{host, port, room}` and returns immediately
4. **Both workers** connect to same "room" using bootstrap coordinates
5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL)
**Key Difference from vLLM:**
### Key Difference from vLLM
- vLLM: Frontend → Prefill → Decode (Prefill is entry point)
- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point)
## ModelInput Types and Registration
**Only the Processor registers with Dynamo Rust.**
### Registration Pattern
```python
# ONLY Processor registers with Dynamo Rust
await register_llm_with_readiness_gate(
None, # No engine for processor
generate_endpoint,
server_args,
dynamo_args,
input_type=ModelInput.Text, # Receives raw OpenAI format
readiness_gate=ready_event,
)
# Workers do NOT register - they are internal components
# They communicate via NATS clients created in main.py
```
### Component Initialization
```python
# Encode Worker - connects to downstream PD worker
pd_worker_client = (
await runtime.namespace(dynamo_args.namespace)
.component("backend")
.endpoint("generate")
.client()
)
# PD Worker (Decode mode) - connects to upstream Prefill worker
prefill_client = (
await runtime.namespace(dynamo_args.namespace)
.component("prefill")
.endpoint("generate")
.client()
)
```
## Inter-Component Communication
### Control Flow (NATS)
All component-to-component communication happens via NATS:
**Aggregated Mode (E→PD):**
#### E/PD Mode (Encode Separate)
```text
Processor → Encode Worker → PD Worker
(NATS) (NATS + NIXL embeddings)
```
**Disaggregated Mode (E→P→D):**
#### E/P/D Mode (Full Disaggregation)
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
(NATS) (NATS) (NATS)
......@@ -193,7 +253,7 @@ Processor → Encode Worker → DECODE Worker → Prefill Worker
SGLang internal KV cache transfer
```
**Detailed Message Flow:**
### Detailed Message Flow
```text
Processor → Encode Worker:
......@@ -220,19 +280,18 @@ Prefill ↔ Decode (via bootstrap):
NIXL is used only for embedding transfer:
```python
Encode Worker:
descriptor = connect.Descriptor(precomputed_embeddings)
with await connector.create_readable(descriptor) as readable:
request.serialized_request = readable.metadata()
# Send request with NIXL metadata
await pd_worker_client.round_robin(request)
await readable.wait_for_completion()
PD Worker:
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
# Encode Worker
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
request.serialized_request = readable.metadata()
await pd_worker_client.round_robin(request)
await readable.wait_for_completion()
# PD Worker
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
```
## Vision Encoding Details
......@@ -242,7 +301,6 @@ PD Worker:
The encode worker loads and runs the vision model in Python:
```python
# Vision components loaded in encode worker
self.image_processor = AutoImageProcessor.from_pretrained(
model_path, trust_remote_code=True
)
......@@ -262,6 +320,7 @@ self.vision_model = AutoModel.from_pretrained(
4. Downstream worker receives expanded token sequence
Example:
```python
# Before: ["Hello", "<|image_pad|>", "world"]
# After: ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
......@@ -281,14 +340,14 @@ processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
## NIXL USE
## NIXL Usage
| Use Case | NIXL Used? | Data Transfer | Notes |
|----------|------------|---------------|-------|
| EPD Aggregated | Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
**Key Difference:** SGLang PD uses bootstrap mechanism, not NIXL for KV cache like vLLM.
**Key Difference:** SGLang P/D uses bootstrap mechanism, not NIXL for KV cache like vLLM.
## Known Limitations
......@@ -304,12 +363,10 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
SGLang multimodal **only supports image-based vision-language models**:
### ✅ Supported (Images Only)
- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
- Models with `AutoImageProcessor` and vision tower
- Models compatible with SGLang's image embedding format
## Key Files
| File | Description |
......@@ -321,4 +378,3 @@ SGLang multimodal **only supports image-based vision-language models**:
| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
| `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
......@@ -12,33 +15,66 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Multimodal Support
# TensorRT-LLM Multimodal
TRTLLM supports multimodal models with dynamo. You can provide multimodal inputs in the following ways:
This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo.
You can provide multimodal inputs in the following ways:
- By sending image URLs
- By providing paths to pre-computed embedding files
Please note that you should provide **either image URLs or embedding file paths** in a single request.
> **Note:** You should provide **either image URLs or embedding file paths** in a single request.
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
| **Video** | HTTP/HTTPS URL | No | No | Not implemented |
| **Audio** | HTTP/HTTPS URL | No | No | Not implemented |
### Supported URL Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) |
## Deployment Patterns
TRT-LLM supports aggregated and traditional disaggregated patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ✅ | `agg.sh` | Easiest setup |
| E/PD (Encode Separate) | ❌ | N/A | Not supported |
| E/P/D (Full Disaggregation) | 🚧 WIP | N/A | PR #4668 in progress |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding |
### Component Flags
| Component | Flag | Purpose |
|-----------|------|---------|
| Worker | `--modality multimodal` | Complete pipeline (aggregated) |
| Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) |
| Decode Worker | `--disaggregation-mode decode` | Decode only |
| Encode Worker (WIP) | `--disaggregation-mode encode` | Image encoding (E/P/D flow) |
## Aggregated
## Aggregated Serving
Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:
Here are quick steps to launch Llama-4 Maverick BF16 in aggregated mode
```bash
cd $DYNAMO_HOME
export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
./launch/agg.sh
./examples/backends/trtllm/launch/agg.sh
```
## Example Requests
### With Image URL
Below is an example of an image being sent to `Llama-4-Maverick-17B-128E-Instruct` model
Request :
**Client:**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
......@@ -63,38 +99,53 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
"max_tokens": 160
}'
```
Response :
```
{"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
```
## Disaggregated
## Disaggregated Serving
Here are quick steps to launch in disaggregated mode.
Example using `Qwen/Qwen2-VL-7B-Instruct`:
The following is an example of launching a model in disaggregated mode. While this example uses `Qwen/Qwen2-VL-7B-Instruct`, you can adapt it for other models by modifying the environment variables for the model path and engine configurations.
```bash
cd $DYNAMO_HOME
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen2-VL-7B-Instruct"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen2-VL-7B-Instruct"}
export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"}
export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"}
export MODALITY=${MODALITY:-"multimodal"}
export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
export MODALITY="multimodal"
./launch/disagg.sh
./examples/backends/trtllm/launch/disagg.sh
```
For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving, while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires a setup of 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
In general, disaggregated serving can run on a single node, provided the model fits on the GPU. The multi-node requirement in this example is specific to the size and configuration of the `meta-llama/Llama-4-Maverick-17B-128E-Instruct` model.
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
}
}
]
}
],
"stream": false,
"max_tokens": 160
}'
```
To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md).
For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
## Pre-computed Embeddings with EPD Flow
## Pre-computed Embeddings with E/P/D Flow
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
### Supported File Types
......@@ -102,12 +153,42 @@ For high-performance multimodal inference, Dynamo supports pre-computed embeddin
- `.pth` - PyTorch checkpoint files
- `.bin` - Binary tensor files
### Embedding File Formats
TRT-LLM supports two formats for embedding files:
**1. Simple Tensor Format**
Direct tensor saved as `.pt` file containing only the embedding tensor:
```python
embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```
**2. Dictionary Format with Auxiliary Data**
Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:
```python
embedding_dict = {
"mm_embeddings": torch.rand(1, 576, 4096),
"special_tokens": [128256, 128257],
"image_token_offsets": [[0, 576]],
# ... other model-specific metadata
}
torch.save(embedding_dict, "llama4_embedding.pt")
```
- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data
### How to Launch
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
# Launch 3-worker EPD flow with NIXL
# Launch 3-worker E/P/D flow with NIXL
./launch/epd_disagg.sh
```
......@@ -126,7 +207,7 @@ export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
export MAX_FILE_SIZE_MB=50
```
### Example Request
### Example Request with Pre-computed Embeddings
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
......@@ -144,16 +225,14 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
}'
```
### Architecture
### E/P/D Architecture
The EPD flow implements a **3-worker architecture**:
The E/P/D flow implements a **3-worker architecture**:
- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL
- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
### Request Flow
```mermaid
sequenceDiagram
participant Client
......@@ -176,6 +255,129 @@ sequenceDiagram
Frontend->>Client: Stream response
```
## Supported Multimodal Models
## Multi-node Deployment (Slurm)
This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
### Environment Setup
Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
```bash
# Container image (build using docs/backends/trtllm/README.md#build-container)
export IMAGE="<dynamo_trtllm_image>"
# Host:container path pairs for mounting
export MOUNTS="${PWD}/../../../../:/mnt"
# Model configuration
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODALITY=${MODALITY:-"multimodal"}
```
### Multi-node Disaggregated Launch
For 4 4xGB200 nodes (2 for prefill, 2 for decode):
```bash
# Customize parallelism to match your engine configs
# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
# export NUM_PREFILL_NODES=2
# export NUM_DECODE_NODES=2
# export NUM_GPUS_PER_NODE=4
# Launches frontend + etcd/nats on head node, plus prefill and decode workers
./srun_disaggregated.sh
```
### Understanding the Output
1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
2. The OpenAI frontend will dynamically discover workers as they register:
```
INFO dynamo_run::input::http: Watching for remote model at models
INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
```
3. TRT-LLM workers output progress from each MPI rank while loading
4. When ready, the frontend logs:
```
INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
```
### Cleanup
```bash
pkill srun
```
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| EPD (Simple Aggregated) | `agg.sh` | No | All in one worker |
| EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E/P/D (pre-computed embeddings) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL) |
| E/P/D (WIP) | N/A | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
> **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
## ModelInput Types and Registration
TRT-LLM workers register with Dynamo using:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Tokens` | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |
```python
# TRT-LLM Worker - Register with Tokens
await register_llm(
ModelInput.Tokens, # Rust does minimal preprocessing
model_type, # ModelType.Chat or ModelType.Prefill
generate_endpoint,
model_name,
...
)
```
## Inter-Component Communication
| Transfer Stage | Message | NIXL Transfer |
|----------------|---------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (pre-computed)** | NIXL metadata | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No |
| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) |
## Known Limitations
- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
- **E/P/D mode is WIP** - Full E/P/D with image URLs under development
- **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`)
## Supported Models
Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
Common examples:
- Llama 4 Vision models (Maverick, Scout)
- Qwen2-VL models
- Other vision-language models with TRT-LLM support
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory |
| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# vLLM Multimodal
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
> [!IMPORTANT]
> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
> This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
### Supported URL Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |
## Deployment Patterns
vLLM supports all multimodal deployment patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup |
| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
### Component Flags
| Component | Flag | Purpose |
|-----------|------|---------|
| Processor | `--multimodal-processor` | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | Media encoding |
| PD Worker | `--multimodal-worker` | Prefill + Decode |
| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Decode only |
| Encode+Prefill Worker | `--multimodal-encode-prefill-worker --is-prefill-worker` | Combined (Llama 4) |
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Image Serving
### E/PD Serving (Encode Separate)
**Components:**
- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
The EncodeWorkerHandler encodes the image and passes the embeddings to the MultimodalPDWorkerHandler via NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> pd_worker
pd_worker --> encode_worker
```
> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# Serve a LLaVA 1.5 7B model:
bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
# Serve a Qwen2.5-VL model:
bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
### E/P/D Serving (Full Disaggregation)
**Components:**
- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> prefill_worker
prefill_worker --> encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
```
> [!NOTE] Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
## Llama 4 Serving
The Llama 4 model family is natively multimodal. Unlike LLaVA, they do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8.
### Llama 4 Aggregated Serving
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> pd_worker
pd_worker --> processor
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal_llama.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
### Llama 4 Disaggregated Serving
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> prefill_worker
prefill_worker --> processor
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh --head-node
# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node:
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh
```
## Video Serving
### Video Aggregated Serving
**Components:**
- workers: [VideoEncodeWorker](../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
The VideoEncodeWorker decodes the video into frames. Unlike the image pipeline which generates embeddings, this pipeline passes raw frames directly to the VllmPDWorker via NATS and RDMA.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --video_url--> video_encode_worker
video_encode_worker --> processor
video_encode_worker --frames--> pd_worker
pd_worker --> video_encode_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/multimodal
bash launch/video_agg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video in detail"
},
{
"type": "video_url",
"video_url": {
"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
}
}
]
}
],
"max_tokens": 300,
"stream": false
}' | jq
```
### Video Disaggregated Serving
**Workflow:**
For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. The VideoEncodeWorker is connected directly to the prefill worker, decoding the video into frames and passing them via RDMA.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --video_url--> video_encode_worker
video_encode_worker --> processor
video_encode_worker --frames--> prefill_worker
prefill_worker --> video_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/multimodal
bash launch/video_disagg.sh
```
## Audio Serving
### Audio Aggregated Serving
**Components:**
- workers: [AudioEncodeWorker](../../examples/multimodal/components/audio_encode_worker.py) for decoding audio into embeddings, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> pd_worker
pd_worker --> audio_encode_worker
```
**Launch:**
```bash
pip install vllm["audio"] accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_agg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-Audio-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is recited in the audio?"
},
{
"type": "audio_url",
"audio_url": {
"url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
}
}
]
}
],
"max_tokens": 6000,
"temperature": 0.8,
"stream": false
}' | jq
```
### Audio Disaggregated Serving
**Workflow:**
For the Qwen2-Audio model, audio embeddings are only required during the prefill stage. The AudioEncodeWorker is connected directly to the prefill worker.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> prefill_worker
prefill_worker --> audio_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
pip install vllm["audio"] accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_disagg.sh
```
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker |
| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
## ModelInput Types and Registration
Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
**Registration Pattern:**
```python
# Processor - Entry point from HTTP frontend
await register_llm(
ModelInput.Text, # Frontend sends raw text
ModelType.Chat,
generate_endpoint,
model_name,
...
)
# Workers - Internal components
await register_llm(
ModelInput.Tokens, # Expect pre-tokenized input
ModelType.Chat, # or ModelType.Prefill for prefill workers
generate_endpoint,
model_name,
...
)
```
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
## Supported Models
The following models have been tested with Dynamo's vLLM multimodal backend:
- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementation |
| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
{"name": "NVIDIA Dynamo", "version": "dev"}
{"name": "NVIDIA Dynamo", "version": "dev"}
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment