---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: vLLM Multimodal
---
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if a multimodal worker mode is enabled without `--enable-multimodal`. This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
## Support Matrix
| Modality | Aggregated | Disaggregated |
| ------------------------ | ---------- | ------------- |
| **Image** | Yes | Yes |
| **Video** | Yes | Yes |
| **Audio** (Experimental) | Yes | Yes |
### Supported URL Formats
| Format | Example | Description |
| -------------- | ------------------------------------ | -------------------------- |
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |
## Deployment Patterns
The main multimodal vLLM launchers in this repo are:
| Pattern | Launch Script | Best For |
| --------------------------- | --------------------------- | ----------------------------------------------------------------------------------- |
| Aggregated | `agg_multimodal.sh` | Simplest image/video serving from a single multimodal worker |
| E/PD (Encode + PD) | `disagg_multimodal_e_pd.sh` | Simple example of separating encoder, good for testing embedding-cache workflows |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Disaggregated image/video serving with separate encode, prefill, and decode workers |
## Image/Video Serving
Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). `Qwen/Qwen3-VL-2B-Instruct` is a good example because the same model can handle both `image_url` and `video_url` requests through the standard OpenAI chat endpoint.
### Aggregated Serving
Use the single-worker aggregated launcher for the simplest image/video setup:
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal.sh --model Qwen/Qwen3-VL-2B-Instruct
```
**Image request:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 64,
"temperature": 0.0,
"stream": false
}'
```
**Video request:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video in detail"
},
{
"type": "video_url",
"video_url": {
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
}
}
]
}
],
"max_tokens": 64,
"stream": false
}' | jq
```
### E/PD Serving (Encode + PD)
Use `disagg_multimodal_e_pd.sh` when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments.
When a separate encode worker is deployed with the current vLLM path, only `image_url` inputs are routed to it. `video_url` inputs are still processed on the combined PD worker.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# Multi-GPU deployment
bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct
# Single-GPU (functional testing with small models)
bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
```
### E/P/D Serving (Full Disaggregation)
Use `disagg_multimodal_epd.sh` when you want separate encode, prefill, and decode workers for multimodal workloads.
In the current vLLM implementation, the separate encode worker is only used for `image_url` inputs. `video_url` inputs are still processed on the prefill worker, not on the encode worker.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# Multi-GPU deployment
bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct
# Single-GPU (functional testing with small models)
bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
```
## Audio Serving (Experimental)
### Audio Aggregated Serving
**Components:**
- workers: [AudioEncodeWorker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/audio_encode_worker.py) for decoding audio into embeddings, and [VllmPDWorker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> pd_worker
pd_worker --> audio_encode_worker
```
**Launch:**
```bash
pip install 'vllm[audio]' accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_agg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-Audio-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is recited in the audio?"
},
{
"type": "audio_url",
"audio_url": {
"url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
}
}
]
}
],
"max_tokens": 6000,
"temperature": 0.8,
"stream": false
}' | jq
```
### Audio Disaggregated Serving
**Workflow:**
For the Qwen2-Audio model, audio embeddings are only required during the prefill stage. The AudioEncodeWorker is connected directly to the prefill worker.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> prefill_worker
prefill_worker --> audio_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
pip install 'vllm[audio]' accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_disagg.sh
```
## Embedding Cache
Dynamo supports embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script |
| ------------------------- | -------------------------------------------------------------- | --------------------------- |
| **Aggregated** | Supported via vLLM ECConnector in vLLM 0.17+ | `agg_multimodal.sh` (or with `vllm serve` directly) |
| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
### Aggregated Worker
A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Supported natively with vLLM 0.17+.
```mermaid
---
title: Embedding Cache — Aggregated Encoder (e.g. aggregated EP or EPD node)
---
flowchart LR
req[Multimodal Request] --> gpu{GPU Encoder Cache
hit?}
gpu -- yes --> skip[Use cached GPU embedding
no encoder, no connector]
gpu -- no --> cpu{CPU Embedding Cache
hit?}
cpu -- yes --> load[Load: CPU → GPU
skip encoder]
cpu -- no --> encode[Run Encoder]
encode -- save: GPU → CPU --> store[(CPU Embedding Cache
LRU)]
```
**Launch with Dynamo:**
```bash
bash examples/backends/vllm/launch/agg_multimodal.sh \
--model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--multimodal-embedding-cache-capacity-gb 10
```
`dynamo.vllm` automatically configures `ec_both` mode with the `DynamoMultimodalEmbeddingCacheConnector` when the capacity is > 0.
**Launch with `vllm serve` (standalone, no Dynamo):**
```bash
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--ec-transfer-config "{
\"ec_role\": \"ec_both\",
\"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
\"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
\"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": 10}
}"
```
The `multimodal_embedding_cache_capacity_gb` parameter controls the CPU-side LRU cache size in GB (0 = disabled). Requires vLLM 0.17+.
### Disaggregated Encoder (Embedding Cache in Prefill Worker)
In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the vLLM Instance for prefill.
```mermaid
---
title: Embedding Cache — Disaggregated Encoder
---
flowchart LR
req[Request] --> cpu_check{"CPU cache hit?
(EmbeddingCacheManager)"}
subgraph P ["Prefill Worker (P)"]
cpu_check -. hit .-> use[Use cached embedding]
use --> vllm[vLLM Instance]
end
cpu_check -- miss --> E["Encode Worker (E)"]
E -- "embeddings via NIXL" --> save["Save to cache"]
save --> vllm
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```
**Client:** Use the same `image_url` request format shown in [Aggregated Serving](#aggregated-serving).
## LoRA Adapters on Multimodal Workers
Multimodal workers support dynamic loading and unloading of LoRA adapters at runtime via the management API. This enables serving fine-tuned multimodal models alongside the base model.
### Loading a LoRA Adapter
Load an adapter on a running multimodal worker via the `load_lora` endpoint:
```bash
# For components workers (URI-based, requires DYN_LORA_ENABLED=true)
curl -X POST http://:/load_lora \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-vlm-adapter",
"source": {"uri": "s3://my-bucket/adapters/my-vlm-adapter"}
}'
# For example workers (path-based)
curl -X POST http://:/load_lora \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-vlm-adapter",
"lora_path": "/path/to/adapter"
}'
```
### Sending Requests with a LoRA
Set the `model` field in the request to the LoRA adapter name:
```bash
curl -X POST http://:/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-vlm-adapter",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]}
]
}'
```
Requests without a LoRA name (or with the base model name) will use the base model.
### Unloading a LoRA Adapter
```bash
curl -X POST http://:/unload_lora \
-H "Content-Type: application/json" \
-d '{"lora_name": "my-vlm-adapter"}'
```
### Listing Loaded Adapters
```bash
curl -X POST http://:/list_loras
```
### Disaggregated Mode
In disaggregated (prefill/decode) deployments, the **same LoRA adapter must be loaded on both the prefill and decode workers**. The LoRA identity (`model` field) is automatically propagated from the prefill worker to the decode worker in the forwarded request.
```bash
# Load on prefill worker
curl -X POST http:///load_lora \
-d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
# Load on decode worker (same adapter)
curl -X POST http:///load_lora \
-d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
```
If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
## Supported Models
For a list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.