Commit f0bcabe0 authored by Ryan McCormick, committed by GitHub

docs(multimodality): Refactor MM docs into high level features and VLM/Diffusion sections (#6831)


Signed-off-by: Ryan McCormick <mccormick.codes@gmail.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
parent 76c96c5d
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Multimodality Support
title: Multimodal Model Serving
subtitle: Deploy multimodal models with image, video, and audio support in Dynamo
---
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.
> [!IMPORTANT]
> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
> See the relevant documentation for each backend for the necessary flags.
>
> This prevents unintended processing of multimodal data from untrusted sources.
<Warning>
**Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.
</Warning>
## Backend Documentation
## Support Matrix
### Backend Capabilities
| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|-------|------|-------|------|-----|-------|-------|-------|
| **[vLLM](multimodal-vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| **[TRT-LLM](multimodal-trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| **[SGLang](multimodal-sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
## Key Features
```mermaid
---
title: Sample flow for an aggregated VLM serving scenario
---
flowchart TD
A[Request] --> B{KV cache hit?}
B -->|Yes| C[Use KV]
B -->|No| D{Embedding cache hit?}
D -->|Yes| E[Load embedding]
D -->|No| F[Run encoder]
F --> G[Save to cache]
G --> H["PREFILL (image tokens + text tokens → KV cache)"]
E --> H
C --> I[DECODE]
H --> I
I --> J[Response]
```
Dynamo improves latency and throughput for vision-language workloads through the following features, which can be used together or separately depending on your workload characteristics:
| Feature | Description |
|---------|-------------|
| **[Embedding Cache](embedding-cache.md)** | CPU-side LRU cache that skips re-encoding repeated images |
| **[Encoder Disaggregation](encoder-disaggregation.md)** | Separate vision encoder worker for independent scaling |
| **[Multimodal KV Routing](multimodal-kv-routing.md)** | MM-aware KV cache routing for optimal worker selection |
**Pattern Key:**
## Support Matrix
- **EPD** - All-in-one worker (Simple Aggregated)
- **E/PD** - Separate encode, combined prefill+decode
- **E/P/D** - All stages separate
- **EP/D** - Combined encode+prefill, separate decode
| Stack | Image | Video | Audio |
|-------|-------|-------|-------|
| **[vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md)** | ✅ | 🧪 | 🧪 |
| **[TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md)** | ✅ | ❌ | ❌ |
| **[SGLang](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)** | ✅ | ❌ | ❌ |
**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
**Status:** ✅ Supported | 🧪 Experimental | ❌ Not supported
### Input Format Support
......@@ -43,150 +54,19 @@ Dynamo supports multimodal inference across multiple LLM backends, enabling mode
| Data URL (Base64) | ❌ | ❌ | ✅ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
## Architecture Patterns
Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
- *Inline*: Simpler setup, encoding happens in the prefill worker
- *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling
2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
- *Aggregated*: Single worker handles both prefill and decode
- *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
These combine into four deployment patterns:
### EPD - Simple Aggregated
All processing happens within a single worker - the simplest setup.
```text
HTTP Frontend (Rust)
Worker (Python)
↓ image load + encode + prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
| Worker | Complete inference pipeline (encode + prefill + decode) |
**When to use:** Quick setup, smaller models, development/testing.
### E/PD - Encode Separate
Encoding happens in a separate worker; prefill and decode share the same engine.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
↓ receives embeddings via NIXL, prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| PD Worker | Prefill + Decode with embeddings |
**When to use:** Offload vision encoding to a separate GPU and scale encode workers independently.
### E/P/D - Full Disaggregation
Full disaggregation with separate workers for encoding, prefill, and decode.
There are two variants of this workflow:
- Prefill-first, used by vLLM
- Decode-first, used by SGLang
Prefill-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
OR
Decode-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
↓ bootstraps prefill worker
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| Prefill Worker | Prefill only, transfers KV cache |
| Decode Worker | Decode only, token generation |
**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
### EP/D - Traditional Disaggregated
Encoding is combined with prefill, with decode separate.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
| Encode+Prefill Worker | Combined encoding and prefill |
| Decode Worker | Decode only, token generation |
> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
## Example Workflows
You can find example workflows and reference implementations for deploying multimodal models in:
Reference implementations for deploying multimodal models:
- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
- [Experimental multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
## Backend Documentation
Detailed deployment guides, configuration, and examples for each backend:
- **[vLLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md)**
- **[TensorRT-LLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md)**
- **[SGLang Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Diffusion
subtitle: Deploy diffusion models for text-to-image, text-to-video, and more in Dynamo
---
## Overview
Dynamo supports serving diffusion models across multiple backends, enabling generation of images and video from text prompts. Backends expose diffusion capabilities through the same Dynamo pipeline infrastructure used for LLM inference, including frontend routing, scaling, and observability.
## Support Matrix
| Modality | vLLM-Omni | SGLang | TRT-LLM |
|----------|-----------|--------|---------|
| Text-to-Text | ✅ | ✅ | ❌ |
| Text-to-Image | ✅ | ✅ | ❌ |
| Text-to-Video | ✅ | ✅ | ✅ |
| Image-to-Video | ❌ | ❌ | ❌ |
**Status:** ✅ Supported | ❌ Not supported
> [!NOTE]
> Image-to-video support is planned and coming soon across all backends.
## Backend Documentation
For deployment guides, configuration, and examples for each backend:
- **[vLLM-Omni](../../backends/vllm/vllm-omni.md)**
- **[SGLang Diffusion](../../backends/sglang/sglang-diffusion.md)**
- **[TRT-LLM Diffusion](../../backends/trtllm/trtllm-video-diffusion.md)**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Embedding Cache
subtitle: Cache vision encoder embeddings to skip re-encoding repeated images
---
## Overview
The embedding cache is a CPU-side LRU cache that stores vision encoder outputs. When the same image appears in multiple requests, the cached embedding is reused instead of running the vision encoder again. This reduces GPU load on the encoder and lowers latency for repeated images.
> Note: This feature can also be referred to as **encoder cache**. Embedding cache is separate from KV cache, which reuses attention key/value state after prefill to skip prefill and go straight to decode. For KV cache reuse and routing, see [Multimodal KV Routing](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-kv-routing.md).
## When to Use
Use the embedding cache when your workload includes repeated images across requests. Common scenarios:
- Product catalog queries where users ask about the same product images
- Document processing pipelines that reference shared diagrams or figures
- Chat sessions where the same image is discussed across multiple turns, such as an architecture diagram in a code-generation use case
If your workload consists entirely of unique images, the cache provides no benefit.
## Support Matrix
| Backend | Aggregated | Disaggregated (E/PD) | Notes |
|---------|------------|----------------------|-------|
| **vLLM** | ✅* | ✅ | Aggregated uses vLLM-native `ec_both`; disaggregated uses Dynamo `EmbeddingCacheManager` |
| **TRT-LLM** | ❌ | ✅ | Dynamo `MultimodalEmbeddingCacheManager` in PD worker |
| **SGLang** | ❌ | ❌ | Not supported yet |
\* Requires an upcoming vLLM version that has not yet been released. Support will be available once that release is published.
## How It Works
The prefill worker owns the CPU-side LRU cache. On a hit, the encode worker is skipped entirely. On a miss, the encode worker produces the embedding, transfers it via NIXL, and the prefill worker saves it to the cache.
```mermaid
flowchart LR
req[Request] --> check{CPU cache hit?}
check -. hit .-> use[Use cached embedding]
check -- miss --> E[Encode Worker]
E -- embeddings via NIXL --> save[Save to cache]
save --> engine[Inference Engine]
use --> engine
```
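The hit/miss behaviour described above can be sketched as a byte-budgeted LRU keyed by an image identifier. This is only an illustration: the class name `EmbeddingLRU`, the `encode` callback (standing in for the encode worker plus NIXL transfer), and the use of `bytes` in place of tensors are assumptions, not the API of Dynamo's `EmbeddingCacheManager`.

```python
from collections import OrderedDict
from typing import Callable


class EmbeddingLRU:
    """Hypothetical byte-budgeted LRU cache for vision encoder outputs."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # image_key -> embedding bytes

    def get_or_encode(self, image_key: str, encode: Callable[[], bytes]) -> bytes:
        cached = self._entries.get(image_key)
        if cached is not None:
            # Hit: mark as most recently used and skip the encoder entirely.
            self._entries.move_to_end(image_key)
            return cached
        # Miss: run the encoder, then save the result to the cache.
        embedding = encode()
        self._put(image_key, embedding)
        return embedding

    def _put(self, image_key: str, embedding: bytes) -> None:
        if len(embedding) > self.capacity_bytes:
            return  # Larger than the whole budget: serve it without storing.
        self._entries[image_key] = embedding
        self.used_bytes += len(embedding)
        # Evict least recently used entries until the budget is respected.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
```

Budgeting by bytes rather than entry count mirrors a capacity expressed in GB, since embeddings for different images may differ in size.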
**Launch (vLLM):**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```
**Launch (TRT-LLM):**
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```
## Configuration
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--multimodal-embedding-cache-capacity-gb` | CPU-side LRU cache size in GB | 0 (disabled) |
Set the capacity based on your expected working set of unique images. A larger cache holds more embeddings but consumes more host memory.
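As a rough sizing aid, the number of embeddings that fit is the capacity divided by the per-image embedding size. The sketch below assumes each embedding is a dense `[tokens, hidden]` tensor and treats 1 GB as 1024³ bytes. The token count, hidden size, and element width are placeholder values, not measurements for any particular model.

```python
def embeddings_that_fit(capacity_gb: float, tokens_per_image: int,
                        hidden_size: int, bytes_per_element: int = 2) -> int:
    """Estimate how many image embeddings fit in the cache (illustrative only)."""
    bytes_per_embedding = tokens_per_image * hidden_size * bytes_per_element
    capacity_bytes = int(capacity_gb * 1024**3)
    return capacity_bytes // bytes_per_embedding


# Placeholder numbers: 1,024 visual tokens, hidden size 4,096, 2-byte elements.
print(embeddings_that_fit(10, tokens_per_image=1024, hidden_size=4096))  # 1280
```

With these placeholder values each embedding occupies 8 MiB, so a capacity of 10 holds roughly 1,280 distinct images before eviction starts.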
See the backend-specific documentation ([vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache), [TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md#embedding-cache)) for more details.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Encoder Disaggregation
subtitle: Separate vision encoding into a dedicated worker for independent scaling
---
## Overview
Encoder disaggregation separates the vision encoder from the prefill/decode pipeline into its own worker. Instead of running image encoding inline, a dedicated encode worker handles media processing and transfers the resulting embeddings to downstream workers via NIXL (RDMA).
This enables:
- Independent scaling of encode workers based on vision workload
- Reduced GPU memory pressure on prefill/decode workers
- Better GPU utilization by matching worker counts to actual bottlenecks
## When to Use
Use encoder disaggregation when:
- Vision encoding is a bottleneck and you need to scale encoders independently of LLM workers
- You want to run the vision encoder on different hardware (e.g., smaller GPUs for encoding, larger GPUs for LLM inference)
- Your deployment handles high volumes of multimodal requests and encoding throughput is limiting
For simple deployments or development/testing, the aggregated (EPD) pattern is easier to set up.
## Support Matrix
| Backend | E/PD | E/P/D | Notes |
|---------|------|-------|-------|
| **vLLM** | ✅ | ✅ | NIXL transfer for embeddings; NIXL KV cache transfer for P/D |
| **TRT-LLM** | ❌ | ✅ | Supports image URLs (via `MultimodalEncoder`) and pre-computed embeddings (via NIXL) |
| **SGLang** | ✅ | ✅ | NIXL for embeddings; bootstrap mechanism for P/D KV transfer |
## Deployment Patterns
**E/PD** — Separate encoder, combined prefill+decode:
```text
Frontend → Processor → Encode Worker → PD Worker → Response
(NIXL)
```
The encode worker runs the vision model and transfers embeddings via NIXL to a combined prefill+decode worker.
**E/P/D** — All stages separate:
```text
Frontend → Processor → Encode Worker → Prefill Worker → Decode Worker → Response
(NIXL) (KV transfer)
```
Full disaggregation with separate workers for each stage. The encode worker transfers embeddings to the prefill worker, which then transfers KV cache to the decode worker.
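The two patterns differ only in how stages are grouped into workers, which determines what crosses each worker boundary. A small sketch of that relationship follows; the `PATTERNS` table and `transfers` helper are illustrative names, not part of Dynamo.

```python
from typing import List

# Stage groupings per pattern, as described above (names are illustrative).
PATTERNS = {
    "E/PD": [("encode",), ("prefill", "decode")],
    "E/P/D": [("encode",), ("prefill",), ("decode",)],
}

# What a worker hands to the next worker, keyed by the last stage it ran.
PAYLOAD_AFTER = {"encode": "embeddings (NIXL)", "prefill": "KV cache"}


def transfers(pattern: str) -> List[str]:
    """List what crosses each worker boundary for a given pattern."""
    workers = PATTERNS[pattern]
    hops = []
    for sender, receiver in zip(workers, workers[1:]):
        payload = PAYLOAD_AFTER[sender[-1]]
        hops.append(f"{'+'.join(sender)} -> {'+'.join(receiver)}: {payload}")
    return hops


for name in PATTERNS:
    print(name, transfers(name))
```

E/PD has a single hop carrying embeddings, while E/P/D adds a second hop carrying the KV cache from prefill to decode.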
## Launching
### vLLM
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# E/PD
bash launch/disagg_multimodal_e_pd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
# E/P/D
bash launch/disagg_multimodal_epd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
```
### TRT-LLM
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
# E/PD
bash launch/disagg_e_pd.sh
# E/P/D
./launch/epd_multimodal_image_and_embeddings.sh
```
### SGLang
```bash
cd $DYNAMO_HOME/examples/backends/sglang
# E/PD
./launch/multimodal_epd.sh
# E/P/D
./launch/multimodal_disagg.sh
```
See the backend-specific documentation ([vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md), [TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md), [SGLang](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)) for full configuration details and component flags.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Multimodal KV Routing
subtitle: Route multimodal requests to workers with the best KV cache overlap
---
## Overview
Multimodal KV routing extends Dynamo's KV-aware router to account for image content when computing cache overlap scores. A dedicated MM router worker sits between the frontend and backend workers. It downloads images, computes a hash of each image (`mm_hash`), and includes this hash in per-block routing metadata. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks.
Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse.
> Note: KV cache is separate from embedding cache (also called encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. For encoder-side reuse see [Embedding Cache](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/embedding-cache.md).
## When to Use
Use multimodal KV routing when:
- You have multiple backend workers serving multimodal requests
- Your workload includes repeated images across requests (e.g., the same product photo, shared reference images)
- You want to maximize KV cache hit rates for multimodal content
Without MM-aware routing, the standard router treats image token blocks as opaque and cannot match which worker has cached a particular image's KV blocks.
## Support Matrix
| Backend | Supported | Notes |
|---------|-----------|-------|
| **vLLM** | ✅* | Requires vLLM with KV events `extra_keys` support ([PR #33304](https://github.com/vllm-project/vllm/pull/33304)) |
| **TRT-LLM** | ✅ | Requires `--publish-events-and-metrics` on TRT-LLM workers |
| **SGLang** | ❌ | Not supported yet |
\* Requires an upcoming vLLM version that has not yet been released. Support will be available once that release is published.
## How It Works
```text
Frontend (round-robin) → MM Router Worker → Backend Workers
├─ Download image
├─ Compute mm_hash
├─ Build per-block MM metadata
└─ KvRouter selects best worker
```
1. The frontend routes to the MM router worker via round-robin
2. The MM router downloads each image and computes an `mm_hash`
3. Per-block routing metadata (`block_mm_infos`) is built, tagging blocks that contain image tokens
4. The KV router evaluates overlap across all backend workers, accounting for image-bearing blocks
5. The request is forwarded to the worker with the highest overlap
On repeated requests with the same image, the selected worker shows higher cached block counts, reducing prefill latency.
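The steps above can be sketched as follows. Image placeholder tokens look identical regardless of which image they stand for, so the sketch mixes the image hash into the key of every block that overlaps an image span. Everything here is an assumption for illustration: the helper names (`block_keys`, `overlap`, `select_worker`), the tiny `BLOCK_SIZE`, the chained block keys, and the choice of SHA-256 are not taken from Dynamo's `KvRouter`.

```python
import hashlib
from typing import Dict, List, Sequence, Set, Tuple

BLOCK_SIZE = 4  # tokens per KV block; tiny value for illustration


def mm_hash(image_bytes: bytes) -> str:
    """Content hash of an image (SHA-256 is an assumption)."""
    return hashlib.sha256(image_bytes).hexdigest()


def block_keys(token_ids: Sequence[int],
               image_spans: Sequence[Tuple[int, int, str]]) -> List[str]:
    """One key per full block; image-bearing blocks also carry the image hash.

    image_spans holds (start, end, mm_hash) token ranges, end exclusive.
    """
    keys: List[str] = []
    prev = ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        end = start + BLOCK_SIZE
        hashes = tuple(h for s, e, h in image_spans if s < end and start < e)
        payload = repr((prev, tuple(token_ids[start:end]), hashes)).encode()
        prev = hashlib.sha256(payload).hexdigest()  # chain keys to form a prefix
        keys.append(prev)
    return keys


def overlap(request_keys: Sequence[str], cached: Set[str]) -> int:
    """Count leading request blocks already cached on a worker."""
    count = 0
    for key in request_keys:
        if key not in cached:
            break
        count += 1
    return count


def select_worker(request_keys: Sequence[str],
                  workers: Dict[str, Set[str]]) -> str:
    """Pick the worker with the highest prefix overlap."""
    return max(workers, key=lambda w: overlap(request_keys, workers[w]))
```

Two requests with the same prompt tokens but different images then produce different block keys, so only the worker that cached the matching image scores an overlap.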
## Launching
### vLLM
```bash
cd $DYNAMO_HOME/examples/backends/vllm/mm_router_worker
MODEL=Qwen/Qwen3-VL-2B-Instruct ./launch.sh
```
### TRT-LLM
```bash
cd $DYNAMO_HOME/examples/backends/trtllm/mm_router_worker
./launch.sh
```
See the [vLLM MM Router README](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/mm_router_worker/README.md) and [TRT-LLM MM Router README](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/mm_router_worker/README.md) for full setup instructions and configuration options.
## Known Limitations
- Currently supports Qwen-family multimodal processors (Qwen2-VL, Qwen2.5-VL, Qwen3-VL) for per-image visual token counting
- Images are downloaded twice: once in the MM router (for hash computation) and once in the backend worker (for processing)
......@@ -84,15 +84,26 @@ navigation:
path: components/kvbm/kvbm-guide.md
- page: Dynamo Benchmarking
path: benchmarks/benchmarking.md
- section: Multimodality Support
path: features/multimodal/README.md
- section: Multimodal Model Serving
contents:
- page: vLLM Multimodal
path: features/multimodal/multimodal-vllm.md
- page: TensorRT-LLM Multimodal
path: features/multimodal/multimodal-trtllm.md
- page: SGLang Multimodal
path: features/multimodal/multimodal-sglang.md
- section: Vision Language Models (VLMs)
path: features/multimodal/README.md
contents:
- page: Embedding Cache
path: features/multimodal/embedding-cache.md
- page: Encoder Disaggregation
path: features/multimodal/encoder-disaggregation.md
- page: Multimodal KV Routing
path: features/multimodal/multimodal-kv-routing.md
- section: Diffusion (Experimental)
path: features/multimodal/diffusion.md
contents:
- page: vLLM-Omni
path: backends/vllm/vllm-omni.md
- page: SGLang Diffusion
path: backends/sglang/sglang-diffusion.md
- page: TRT-LLM Diffusion
path: backends/trtllm/trtllm-video-diffusion.md
- page: Tool Calling
path: agents/tool-calling.md
- page: LoRA Adapters
......