---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: vLLM Multimodal
---

This guide covers multimodal inference with the vLLM backend in Dynamo.

<Warning>
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if a multimodal worker mode is enabled without `--enable-multimodal`. This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but it covers all multimodal content (URLs, embeddings, and base64 data).
</Warning>

## Support Matrix

| Modality                 | Aggregated | Disaggregated |
| ------------------------ | ---------- | ------------- |
| **Image**                | Yes        | Yes           |
| **Video**                | Yes        | Yes           |
| **Audio**                | Yes        | No            |

### Supported URL Formats

| Format         | Example                              | Description                |
| -------------- | ------------------------------------ | -------------------------- |
| **HTTP/HTTPS** | `http://example.com/image.jpg`       | Remote media files         |
| **Data URL**   | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |

## Deployment Patterns

The main multimodal vLLM launchers in this repo are:

| Pattern                     | Launch Script               | Best For                                                                            |
| --------------------------- | --------------------------- | ----------------------------------------------------------------------------------- |
| Aggregated                  | `agg_multimodal.sh`         | Simplest image/video serving from a single multimodal worker                        |
| E/PD (Encode + PD)          | `disagg_multimodal_e_pd.sh` | Simple example of separating encoder, good for testing embedding-cache workflows    |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh`  | Disaggregated image/video serving with separate encode, prefill, and decode workers |

## Image/Video Serving

Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). `Qwen/Qwen3-VL-2B-Instruct` is a good example because the same model can handle both `image_url` and `video_url` requests through the standard OpenAI chat endpoint.

### Aggregated Serving

Use the single-worker aggregated launcher for the simplest image/video setup:

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal.sh --model Qwen/Qwen3-VL-2B-Instruct
```

**Image request:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "Qwen/Qwen3-VL-2B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "What is in this image?"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
              }
            }
          ]
        }
      ],
      "max_tokens": 64,
      "temperature": 0.0,
      "stream": false
    }'
```
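The request above uses a remote URL. For a local file, the same `image_url` field can carry a data URL instead (see Supported URL Formats). A minimal sketch, assuming a POSIX shell with `base64` and `tr` available; the helper name `to_data_url` and the file `./cat.jpg` are hypothetical:

```bash
# Hypothetical helper: wrap a local file in a base64 data URL.
#   $1 = path to the media file
#   $2 = MIME type, for example image/jpeg
to_data_url() {
  # Strip newlines because some base64 implementations wrap their output.
  printf 'data:%s;base64,%s' "$2" "$(base64 < "$1" | tr -d '\n')"
}

# Example (not run here): url=$(to_data_url ./cat.jpg image/jpeg)
```

Large images produce long strings, so it is usually safer to write the JSON body to a file and pass it with `curl -d @body.json` than to inline it on the command line.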

**Video request:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "Qwen/Qwen3-VL-2B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "Describe the video in detail"
            },
            {
              "type": "video_url",
              "video_url": {
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
              }
            }
          ]
        }
      ],
      "max_tokens": 64,
      "stream": false
    }' | jq
```
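The image and video requests differ only in the type of the media content part. A small sketch of a payload builder that makes this explicit; `mm_payload` is a hypothetical helper, and it assumes none of the arguments contain double quotes or backslashes, since it does no JSON escaping:

```bash
# Hypothetical helper: print a chat-completions body with one text part and
# one media part.
#   $1 = model, $2 = part type (image_url | video_url),
#   $3 = media URL, $4 = prompt text
mm_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":[' "$1"
  printf '{"type":"text","text":"%s"},' "$4"
  printf '{"type":"%s","%s":{"url":"%s"}}' "$2" "$2" "$3"
  printf ']}],"max_tokens":64,"stream":false}\n'
}
```

Piping the output into `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-` sends it to the same endpoint as the requests above.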

### E/PD Serving (Encode + PD)

Use `disagg_multimodal_e_pd.sh` when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments.

<Warning>
When a separate encode worker is deployed with the current vLLM path, only `image_url` inputs are routed to it. `video_url` inputs are still processed on the combined PD worker.
</Warning>

```bash
cd $DYNAMO_HOME/examples/backends/vllm

# Multi-GPU deployment
bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct

# Single-GPU (functional testing with small models)
bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
```

### E/P/D Serving (Full Disaggregation)

Use `disagg_multimodal_epd.sh` when you want separate encode, prefill, and decode workers for multimodal workloads.

<Warning>
In the current vLLM implementation, the separate encode worker is only used for `image_url` inputs. `video_url` inputs are still processed on the prefill worker, not on the encode worker.
</Warning>

```bash
cd $DYNAMO_HOME/examples/backends/vllm

# Multi-GPU deployment
bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct

# Single-GPU (functional testing with small models)
bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
```
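The two warnings above describe where each media type is processed in the disaggregated launchers. As a summary, the rule can be written as a lookup. This is an illustrative sketch only, not Dynamo's routing code; the pattern labels `e_pd` and `epd` and the function name `media_stage` are made up here:

```bash
# Illustrative only: which worker processes a media part, per the notes above.
#   $1 = deployment pattern label (e_pd | epd)
#   $2 = content part type (image_url | video_url)
media_stage() {
  case "$1:$2" in
    e_pd:image_url | epd:image_url) echo "encode worker" ;;
    e_pd:video_url)                 echo "prefill/decode worker" ;;
    epd:video_url)                  echo "prefill worker" ;;
    *)                              echo "not covered"; return 1 ;;
  esac
}
```

For example, `media_stage epd video_url` prints `prefill worker`, which matches the note that video inputs bypass the encode worker in the current vLLM path.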

## Audio Serving

Dynamo supports `audio_url` requests for audio-capable models. The backend worker loads audio through vLLM's `AudioMediaIO` at the native sample rate; vLLM's model-specific processor then handles resampling and feature extraction internally. Omni models can handle `image_url`, `video_url`, and `audio_url` in the same request.

### Aggregated Serving

Use the same aggregated multimodal launcher with an audio-capable model:

```bash
pip install 'vllm[audio]'  # installs librosa and other audio dependencies
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct
```

```mermaid
flowchart LR
  HTTP --> frontend
  frontend --> HTTP
  frontend --audio_url--> vllm_worker
  vllm_worker --> frontend
```

**Audio request:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "What sound is this?"
            },
            {
              "type": "audio_url",
              "audio_url": {
                "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
              }
            }
          ]
        }
      ],
      "max_tokens": 100,
      "stream": false
    }' | jq
```

## Embedding Cache

Dynamo supports an embedding cache in both aggregated and disaggregated settings:

| Setting                   | Implementation                                                 | Launch Script                                       |
| ------------------------- | -------------------------------------------------------------- | --------------------------------------------------- |
| **Aggregated**            | Supported via vLLM ECConnector in vLLM 0.17+                   | `agg_multimodal.sh` (or with `vllm serve` directly) |
| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh`                         |

### Aggregated Worker

A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. This is supported natively in vLLM 0.17+.

```mermaid
---
title: Embedding Cache — Aggregated Encoder (e.g. aggregated EP or EPD node)
---
flowchart LR
  req[Multimodal Request] --> gpu{GPU Encoder Cache<br/>hit?}
  gpu -- yes --> skip[Use cached GPU embedding<br/>no encoder, no connector]
  gpu -- no --> cpu{CPU Embedding Cache<br/>hit?}
  cpu -- yes --> load[Load: CPU → GPU<br/>skip encoder]
  cpu -- no --> encode[Run Encoder]
  encode -- save: GPU → CPU --> store[(CPU Embedding Cache<br/>LRU)]
```

**Launch with Dynamo:**

```bash
bash examples/backends/vllm/launch/agg_multimodal.sh \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
    --multimodal-embedding-cache-capacity-gb 10
```

`dynamo.vllm` automatically configures `ec_both` mode with the `DynamoMultimodalEmbeddingCacheConnector` when the capacity is greater than 0.

**Launch with `vllm serve` (standalone, no Dynamo):**

```bash
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
    --ec-transfer-config "{
        \"ec_role\": \"ec_both\",
        \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
        \"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
        \"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": 10}
    }"
```
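The escaped quotes in the command above are easy to get wrong. One option is to generate the JSON in a small shell function and pass the result as a single argument. A sketch using only the fields shown above; the function name `ec_transfer_config` is hypothetical:

```bash
# Hypothetical helper: print the --ec-transfer-config JSON for a given
# CPU cache capacity in GB ($1).
ec_transfer_config() {
  printf '{"ec_role":"ec_both",'
  printf '"ec_connector":"DynamoMultimodalEmbeddingCacheConnector",'
  printf '"ec_connector_module_path":"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector",'
  printf '"ec_connector_extra_config":{"multimodal_embedding_cache_capacity_gb":%s}}\n' "$1"
}

# Example (not run here):
#   vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
#       --ec-transfer-config "$(ec_transfer_config 10)"
```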

The `multimodal_embedding_cache_capacity_gb` parameter sets the size of the CPU-side LRU cache in GB; `0` disables the cache. This requires vLLM 0.17+.

### Disaggregated Encoder (Embedding Cache in Prefill Worker)

In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request, P checks the cache first. On a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives the embeddings via NIXL, saves them to the cache, and then feeds them along with the request into the vLLM instance for prefill.

```mermaid
---
title: Embedding Cache — Disaggregated Encoder
---
flowchart LR
    req[Request] --> cpu_check{"CPU cache hit?<br/>(EmbeddingCacheManager)"}

    subgraph P ["Prefill Worker (P)"]
        cpu_check -. hit .-> use[Use cached embedding]
        use --> vllm[vLLM Instance]
    end

    cpu_check -- miss --> E["Encode Worker (E)"]
    E -- "embeddings via NIXL" --> save["Save to cache"]
    save --> vllm
```
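The hit/miss flow in the diagram can be sketched in a few lines of shell. This is a toy model, not the `EmbeddingCacheManager` implementation: it tracks only keys (for example, a hash of the image), counts entries rather than gigabytes, and all names (`CACHE_CAPACITY`, `CACHE_KEYS`, `cache_get`) are hypothetical. Keys must not contain whitespace or glob characters.

```bash
CACHE_CAPACITY=2   # maximum number of entries (the real cache is sized in GB)
CACHE_KEYS=""      # space-separated keys, least recently used first

# Look up $1. Sets CACHE_RESULT to "hit" or "miss" and updates recency.
cache_get() {
  key=$1
  CACHE_RESULT=miss
  kept=""
  for k in $CACHE_KEYS; do
    if [ "$k" = "$key" ]; then
      CACHE_RESULT=hit          # hit: the encode worker would be skipped
    else
      kept="$kept $k"
    fi
  done
  # On a miss, the encode worker would run here and return embeddings.
  kept="$kept $key"             # the requested key becomes most recently used
  set -- $kept
  if [ "$#" -gt "$CACHE_CAPACITY" ]; then
    shift                       # evict the least recently used key
  fi
  CACHE_KEYS="$*"
}
```

The result is returned through `CACHE_RESULT` rather than standard output because calling `cache_get` inside a command substitution would run it in a subshell and discard the cache update.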

**Launch:**

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```

**Client:** Use the same `image_url` request format shown in [Aggregated Serving](#aggregated-serving).

## LoRA Adapters on Multimodal Workers

Multimodal workers support dynamic loading and unloading of LoRA adapters at runtime via the management API. This enables serving fine-tuned multimodal models alongside the base model.

### Loading a LoRA Adapter

Load an adapter on a running multimodal worker via the `load_lora` endpoint:

```bash
# For components workers (URI-based, requires DYN_LORA_ENABLED=true)
curl -X POST http://<worker-host>:<port>/load_lora \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-vlm-adapter",
    "source": {"uri": "s3://my-bucket/adapters/my-vlm-adapter"}
  }'

# For example workers (path-based)
curl -X POST http://<worker-host>:<port>/load_lora \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-vlm-adapter",
    "lora_path": "/path/to/adapter"
  }'
```

### Sending Requests with a LoRA

Set the `model` field in the request to the LoRA adapter name:

```bash
curl -X POST http://<frontend-host>:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-vlm-adapter",
    "messages": [
      {"role": "user", "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]}
    ]
  }'
```

Requests without a LoRA name (or with the base model name) will use the base model.

### Unloading a LoRA Adapter

```bash
curl -X POST http://<worker-host>:<port>/unload_lora \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-vlm-adapter"}'
```

### Listing Loaded Adapters

```bash
curl -X POST http://<worker-host>:<port>/list_loras
```

### Disaggregated Mode

In disaggregated (prefill/decode) deployments, the **same LoRA adapter must be loaded on both the prefill and decode workers**. The LoRA identity (`model` field) is automatically propagated from the prefill worker to the decode worker in the forwarded request.

```bash
# Load on prefill worker
curl -X POST http://<prefill-worker>/load_lora \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'

# Load on decode worker (same adapter)
curl -X POST http://<decode-worker>/load_lora \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
```
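Since both workers need the same adapter, it can help to load it in one loop and stop at the first failure. A minimal sketch, assuming the URI-based `load_lora` request body shown above; the function name `load_lora_everywhere` and the `CURL` override are hypothetical conveniences for this example:

```bash
# Hypothetical helper: load one adapter on every worker passed in.
#   $1 = adapter name, $2 = adapter URI, remaining args = worker base URLs
# CURL is a dry-run hook for this sketch: CURL=echo prints the calls instead.
load_lora_everywhere() {
  name=$1
  uri=$2
  shift 2
  for worker in "$@"; do
    ${CURL:-curl} -fsS -X POST "$worker/load_lora" \
      -H "Content-Type: application/json" \
      -d "{\"lora_name\": \"$name\", \"source\": {\"uri\": \"$uri\"}}" \
      || return 1   # stop at the first worker that fails to load
  done
}

# Example (not run here; the worker addresses are placeholders):
#   load_lora_everywhere my-adapter s3://bucket/adapter \
#       http://<prefill-worker> http://<decode-worker>
```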

If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.

## Supported Models

For a list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.