This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
<Warning>
<Warning>
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if a multimodal worker mode is enabled without `--enable-multimodal`. This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). `Qwen/Qwen3-VL-2B-Instruct` is a good example because the same model can handle both `image_url` and `video_url` requests through the standard OpenAI chat endpoint.
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
The EncodeWorkerHandler encodes the image and passes the embeddings to the DecodeWorkerHandler via NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> pd_worker
pd_worker --> encode_worker
```
> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for decoding, and [PrefillWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
</Note>
## Llama 4 Serving
The Llama 4 model family is natively multimodal. Unlike LLaVA, they do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8.
Use `disagg_multimodal_e_pd.sh` when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments; use `agg_multimodal.sh` or `disagg_multimodal_epd.sh` for general video serving.
- worker: Standard `python -m dynamo.vllm --enable-multimodal` backend.
- frontend: Standard `python -m dynamo.frontend` OpenAI-compatible endpoint.
**Workflow:**
The Rust preprocessor tokenizes the request and forwards `multi_modal_data` with `video_url` entries. The vLLM backend decodes video URLs into sampled RGB frames and attaches them to `TokensPrompt(multi_modal_data=...)` for standard multimodal processing.
```mermaid
flowchart LR
HTTP --> frontend
frontend --> vllm_worker
vllm_worker --> frontend
```
```
**Launch:**
### E/P/D Serving (Full Disaggregation)
```bash
cd$DYNAMO_HOME/examples/backends/vllm
bash launch/video_agg.sh
```
**Client:**
Use the full disaggregated launcher when you want separate encode, prefill, and decode workers for image/video workloads:
The Rust preprocessor tokenizes the request and forwards `multi_modal_data` with `video_url` entries. The prefill worker decodes the video into sampled RGB frames locally, runs the multimodal prefill, and forwards KV state to the decode worker through the normal disaggregated vLLM path.
| EC Both (Local Node) | `vllm_serve_embedding_cache.sh` | No | ECConnector via CPU Embedding Cache |
## ModelInput Types and Registration
Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
**Registration Pattern:**
```python
# Processor - Entry point from HTTP frontend
await register_model(
ModelInput.Text, # Frontend sends raw text
ModelType.Chat,
generate_endpoint,
model_name,
...
)
# Workers - Internal components
await register_model(
ModelInput.Tokens, # Expect pre-tokenized input
ModelType.Chat, # or ModelType.Prefill for prefill workers
generate_endpoint,
model_name,
...
)
```
## LoRA Adapters on Multimodal Workers
## LoRA Adapters on Multimodal Workers
...
@@ -603,60 +370,6 @@ curl -X POST http://<decode-worker>/load_lora \
...
@@ -603,60 +370,6 @@ curl -X POST http://<decode-worker>/load_lora \
If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
## Profiling
Dynamo's multimodal workers include NVTX markers for `nsys` profiling. They are disabled by default (zero overhead) and enabled by setting `DYN_NVTX=1`.
| `mm:pd_worker_generate` | PD | Full PD request lifetime |
| `mm:pd:ttft` | PD | Worker-side TTFT: from request arrival at the PD worker to first output token (excludes client→frontend→worker network transit) |
| `mm:decode_worker_generate` | Decode | Full decode request lifetime |
| `mm:decode:first_token` | Decode | Time to first output token |
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
## Supported Models
## Supported Models
The following models have been tested with Dynamo's vLLM multimodal backend:
For a list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.
For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |