README.md 7.16 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Multimodal Inference in Dynamo

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.

> [!IMPORTANT]
> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
> See the relevant documentation for each backend for the necessary flags.
>
> This prevents unintended processing of multimodal data from untrusted sources.

## Backend Documentation

```{toctree}
:maxdepth: 1

vLLM Multimodal <multimodal_vllm.md>
TensorRT-LLM Multimodal <multimodal_trtllm.md>
SGLang Multimodal <multimodal_sglang.md>
```

## Support Matrix

### Backend Capabilities

| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|-------|------|-------|------|-----|-------|-------|-------|
| **[vLLM](multimodal_vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| **[TRT-LLM](multimodal_trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| **[SGLang](multimodal_sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |

\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))

**Pattern Key:**

- **EPD** - All-in-one worker (Simple Aggregated)
- **E/PD** - Separate encode, combined prefill+decode
- **E/P/D** - All stages separate
- **EP/D** - Combined encode+prefill, separate decode

**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported

### Input Format Support

| Format | vLLM | TRT-LLM | SGLang |
|--------|------|---------|--------|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | ✅ | ❌ | ❌ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |

## Architecture Patterns

Dynamo supports several deployment patterns for multimodal inference based on two dimensions:

1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
   - *Inline*: Simpler setup, encoding happens in the prefill worker
   - *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling

2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
   - *Aggregated*: Single worker handles both prefill and decode
   - *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them

These combine into four deployment patterns:

### EPD - Simple Aggregated

All processing happens within a single worker - the simplest setup.

```text
HTTP Frontend (Rust)

Worker (Python)
    ↓ image load + encode + prefill + decode
Response
```

| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
| Worker | Complete inference pipeline (encode + prefill + decode) |

**When to use:** Quick setup, smaller models, development/testing.

### E/PD - Encode Separate

Encoding happens in a separate worker; prefill and decode share the same engine.

```text
HTTP Frontend (Rust)

Processor (Python)
    ↓ tokenizes, extracts media URL
Encode Worker (Python)
    ↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
    ↓ receives embeddings via NIXL, prefill + decode
Response
```

| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| PD Worker | Prefill + Decode with embeddings |

**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.

### E/P/D - Full Disaggregation

Full disaggregation with separate workers for encoding, prefill, and decode.
There are two variants of this workflow:
- Prefill-first, used by vLLM
- Decode-first, used by SGLang

Prefill-first:

```text
HTTP Frontend (Rust)

Processor (Python)
    ↓ tokenizes, extracts media URL
Encode Worker (Python)
    ↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
    ↓ decode only, token generation
Response
```

OR

Decode-first:

```text
HTTP Frontend (Rust)

Processor (Python)
    ↓ tokenizes, extracts media URL
Encode Worker (Python)
    ↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
    ↓ Bootstraps prefill worker
Prefill Worker (Python)
    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
    ↓ decode only, token generation
Response
```

| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| Prefill Worker | Prefill only, transfers KV cache |
| Decode Worker | Decode only, token generation |

**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.

### EP/D - Traditional Disaggregated

Encoding is combined with prefill, with decode separate.

```text
HTTP Frontend (Rust)

Processor (Python)
    ↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
    ↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
    ↓ decode only, token generation
Response
```

| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
| Encode+Prefill Worker | Combined encoding and prefill |
| Decode Worker | Decode only, token generation |

> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.

**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.

## Example Workflows

You can find example workflows and reference implementations for deploying multimodal models in:

- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)