Unverified commit dacb2980, authored by Ryan McCormick, committed by GitHub

chore(multimodal): Cleanup multimodal docs and consolidate launch scripts (#7845)

parent 2075eb67
......@@ -7,102 +7,54 @@ title: vLLM Multimodal
This document provides a comprehensive guide for multimodal inference using the vLLM backend in Dynamo.
<Warning>
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but extends it to all multimodal content (URLs, embeddings, and base64 data).
**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if a multimodal worker mode is enabled without `--enable-multimodal`. This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but extends it to all multimodal content (URLs, embeddings, and base64 data).
</Warning>
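The fail-at-startup behavior described in the warning can be pictured with a small `argparse` sketch. This is an illustration only, not the parser in `components/src/dynamo/vllm/args.py`: the function name `parse_worker_args` is hypothetical, and the two worker-mode flags are borrowed from the component flags listed in this diff, so the real set of modes may differ.

```python
import argparse
from typing import List


def parse_worker_args(argv: List[str]) -> argparse.Namespace:
    """Hypothetical parser: reject multimodal worker modes unless explicitly enabled."""
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument("--enable-multimodal", action="store_true")
    # Worker-mode flags assumed for illustration; the real CLI may expose others.
    parser.add_argument("--multimodal-worker", action="store_true")
    parser.add_argument("--multimodal-encode-worker", action="store_true")
    args = parser.parse_args(argv)

    mode_requested = args.multimodal_worker or args.multimodal_encode_worker
    if mode_requested and not args.enable_multimodal:
        # parser.error prints a usage message and exits with status 2.
        parser.error("multimodal worker modes require --enable-multimodal")
    return args


args = parse_worker_args(["--enable-multimodal", "--multimodal-worker"])
print(args.enable_multimodal, args.multimodal_worker)
```

Calling `parse_worker_args(["--multimodal-worker"])` without the opt-in flag exits instead of returning, which mirrors the idea of refusing to start rather than silently processing untrusted media.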
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
| Modality | Aggregated | Disaggregated |
| ------------------------ | ---------- | ------------- |
| **Image** | Yes | Yes |
| **Video** | Yes | Yes |
| **Audio** (Experimental) | Yes | Yes |
### Supported URL Formats
| Format | Example | Description |
|--------|---------|-------------|
| -------------- | ------------------------------------ | -------------------------- |
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |
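As a minimal sketch of the Data URL row above, the snippet below builds a `data:` URL from raw media bytes using only the Python standard library. The helper name `to_data_url` and its default MIME type are assumptions for illustration and are not part of Dynamo.

```python
import base64


def to_data_url(raw: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw media bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first six bytes of a JFIF-style JPEG header, standing in for a real file.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])
data_url = to_data_url(jpeg_header)
print(data_url)  # data:image/jpeg;base64,/9j/4AAQ
```

Base64 emits four characters for every three input bytes, so an inline image is roughly one third larger on the wire than the original file.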
## Deployment Patterns
vLLM supports all multimodal deployment patterns. See [Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
The main multimodal vLLM launchers in this repo are:
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup |
| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
| Pattern | Launch Script | Best For |
| --------------------------- | --------------------------- | ----------------------------------------------------------------------------------- |
| Aggregated | `agg_multimodal.sh` | Simplest image/video serving from a single multimodal worker |
| E/PD (Encode + PD) | `disagg_multimodal_e_pd.sh` | Simple example of separating encoder, good for testing embedding-cache workflows |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Disaggregated image/video serving with separate encode, prefill, and decode workers |
### Component Flags
## Image/Video Serving
| Component | Flag | Purpose |
|-----------|------|---------|
| Processor | `--multimodal-processor` | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | Media encoding |
| PD Worker | `--multimodal-worker` | Prefill + Decode |
| Prefill Worker | `--multimodal-worker --disaggregation-mode prefill` | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Decode only |
Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). `Qwen/Qwen3-VL-2B-Instruct` is a good example because the same model can handle both `image_url` and `video_url` requests through the standard OpenAI chat endpoint.
## Use the Latest Release
### Aggregated Serving
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Image Serving
### E/PD Serving (Encode Separate)
**Components:**
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
The EncodeWorkerHandler encodes the image and passes the embeddings to the DecodeWorkerHandler via NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> pd_worker
pd_worker --> encode_worker
```
> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
**Launch:**
Use the single-worker aggregated launcher for the simplest image/video setup:
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# Serve a LLaVA 1.5 7B model:
bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
# Serve a Qwen2.5-VL model:
bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
bash launch/agg_multimodal.sh --model Qwen/Qwen3-VL-2B-Instruct
```
**Client:**
**Image request:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [
{
"role": "user",
......@@ -120,205 +72,71 @@ curl http://localhost:8000/v1/chat/completions \
]
}
],
"max_tokens": 300,
"max_tokens": 64,
"temperature": 0.0,
"stream": false
}'
```
### E/P/D Serving (Full Disaggregation)
**Components:**
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for decoding, and [PrefillWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/handlers.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> prefill_worker
prefill_worker --> encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
```
<Note>
Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
</Note>
## Llama 4 Serving
The Llama 4 model family is natively multimodal. Unlike LLaVA, Llama 4 models do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encode worker is not used and encoding is done alongside prefill.
Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8.
### Llama 4 Aggregated Serving
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> pd_worker
pd_worker --> processor
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal.sh --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
```
**Client:**
**Video request:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
"text": "Describe the video in detail"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
"type": "video_url",
"video_url": {
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"max_tokens": 64,
"stream": false
}'
}' | jq
```
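The image and video requests above differ only in the media part of the message content. A minimal Python sketch of that shape, using only the standard library, might look like the following. The helper names `build_chat_request` and `send` are hypothetical, the endpoint is the `localhost:8000` address from the curl examples, and only the two media types shown in this section are handled.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, media_url: str,
                       media_type: str = "image_url") -> dict:
    """Build an OpenAI-style chat payload with one text part and one media part."""
    if media_type not in ("image_url", "video_url"):
        raise ValueError(f"unsupported media type in this sketch: {media_type}")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # The nested key repeats the type, e.g. "video_url": {"url": ...}.
                    {"type": media_type, media_type: {"url": media_url}},
                ],
            }
        ],
        "max_tokens": 64,
        "stream": False,
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


video_request = build_chat_request(
    "Qwen/Qwen3-VL-2B-Instruct",
    "Describe the video in detail",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4",
    media_type="video_url",
)
print(json.dumps(video_request, indent=2))
```

With a worker running, `send(video_request)` would submit the payload; it is left uncalled here so the snippet runs without a server.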
### Llama 4 Disaggregated Serving
### E/PD Serving (Encode + PD)
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> prefill_worker
prefill_worker --> processor
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
Use `disagg_multimodal_e_pd.sh` when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments; use `agg_multimodal.sh` or `disagg_multimodal_epd.sh` for general video serving.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh --head-node
# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node:
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh
```
# Multi-GPU deployment
bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct
## Video Serving
# Single-GPU (functional testing with small models)
bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
### Video Aggregated Serving
**Components:**
- worker: Standard `python -m dynamo.vllm --enable-multimodal` backend.
- frontend: Standard `python -m dynamo.frontend` OpenAI-compatible endpoint.
**Workflow:**
The Rust preprocessor tokenizes the request and forwards `multi_modal_data` with `video_url` entries. The vLLM backend decodes video URLs into sampled RGB frames and attaches them to `TokensPrompt(multi_modal_data=...)` for standard multimodal processing.
```mermaid
flowchart LR
HTTP --> frontend
frontend --> vllm_worker
vllm_worker --> frontend
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/video_agg.sh
```
### E/P/D Serving (Full Disaggregation)
**Client:**
Use the full disaggregated launcher when you want separate encode, prefill, and decode workers for image/video workloads:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video in detail"
},
{
"type": "video_url",
"video_url": {
"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
}
}
]
}
],
"max_tokens": 300,
"stream": false
}' | jq
```
### Video Disaggregated Serving
**Workflow:**
The Rust preprocessor tokenizes the request and forwards `multi_modal_data` with `video_url` entries. The prefill worker decodes the video into sampled RGB frames locally, runs the multimodal prefill, and forwards KV state to the decode worker through the normal disaggregated vLLM path.
```mermaid
flowchart LR
HTTP --> frontend
frontend --> prefill_worker
prefill_worker --> decode_worker
decode_worker --> frontend
```
cd $DYNAMO_HOME/examples/backends/vllm
**Launch:**
# Multi-GPU deployment
bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/video_disagg.sh
# Single-GPU (functional testing with small models)
bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
```
## Audio Serving
## Audio Serving (Experimental)
### Audio Aggregated Serving
......@@ -409,13 +227,13 @@ bash launch/audio_disagg.sh
Dynamo supports an embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script |
|---------|---------------|---------------|
| ------------------------- | -------------------------------------------------------------- | --------------------------- |
| **Aggregated** | Supported via vLLM ECConnector in vLLM 0.18+ | `agg_multimodal.sh` (or with `vllm serve` directly) |
| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
| **Aggregated** | Experimental via vLLM git patches | N/A |
### Aggregated Worker
A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Experimental — requires vLLM patches (see below).
A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely.
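To illustrate how repeated images can skip the encoder, here is a minimal sketch of a size-bounded LRU cache keyed by a hash of the media bytes. It is not the vLLM ECConnector or Dynamo's cache implementation: the class name `EmbeddingCache`, the SHA-256 keying, and the `encode` callback are all assumptions made for illustration, and embeddings are modeled as plain `bytes`.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """Hypothetical LRU cache bounded by the total size of stored embeddings."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get_or_encode(self, media: bytes, encode: Callable[[bytes], bytes]) -> bytes:
        key = hashlib.sha256(media).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        embedding = encode(media)  # cache miss: run the expensive encoder
        self._entries[key] = embedding
        self.used_bytes += len(embedding)
        # Evict least recently used entries until the cache fits its budget.
        while self.used_bytes > self.capacity_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        return embedding


calls: List[bytes] = []


def fake_encode(media: bytes) -> bytes:
    """Stand-in encoder that records each call and returns a dummy embedding."""
    calls.append(media)
    return media * 4


cache = EmbeddingCache(capacity_bytes=64)
cache.get_or_encode(b"image-a", fake_encode)
cache.get_or_encode(b"image-a", fake_encode)  # served from cache, encoder not called
print(len(calls))  # 1
```

Bounding the cache by bytes rather than entry count loosely mirrors a capacity setting expressed in gigabytes, since embedding sizes vary with the input.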
```mermaid
---
......@@ -432,19 +250,9 @@ flowchart LR
**Launch:**
```bash
cd /opt/dynamo/venv/lib/python3.12/site-packages
curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
import sys
chunks = sys.stdin.read().split('diff --git ')
filtered = [c for c in chunks if c.startswith('a/vllm/')]
print(''.join('diff --git ' + c for c in filtered))
" | patch -p1
<!-- TODO: Add an example of Dynamo+vLLM Agg worker + Embedding Cache -->
```bash
vllm serve $model \
--ec-transfer-config "{
\"ec_role\": \"ec_both\",
......@@ -484,48 +292,7 @@ cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```
**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker |
| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
| EC Both (Local Node) | `vllm_serve_embedding_cache.sh` | No | ECConnector via CPU Embedding Cache |
## ModelInput Types and Registration
Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
**Registration Pattern:**
```python
# Processor - Entry point from HTTP frontend
await register_model(
ModelInput.Text, # Frontend sends raw text
ModelType.Chat,
generate_endpoint,
model_name,
...
)
# Workers - Internal components
await register_model(
ModelInput.Tokens, # Expect pre-tokenized input
ModelType.Chat, # or ModelType.Prefill for prefill workers
generate_endpoint,
model_name,
...
)
```
**Client:** Use the same `image_url` request format shown in [Aggregated Serving](#aggregated-serving).
## LoRA Adapters on Multimodal Workers
......@@ -603,60 +370,6 @@ curl -X POST http://<decode-worker>/load_lora \
If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
## Profiling
Dynamo's multimodal workers include NVTX markers for `nsys` profiling. They are disabled by default (zero overhead) and enabled by setting `DYN_NVTX=1`.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
DYN_NVTX=1 nsys profile --trace=cuda,nvtx -o profile.nsys-rep \
bash launch/agg_multimodal.sh ...
```
| ENV Variable | Default | Description |
|---|---|---|
| `DYN_NVTX` | `0` | Set to `1` to enable NVTX range/mark annotations in encode, prefill, and decode workers for `nsys` profiling |
Key NVTX ranges emitted:
| Range | Worker | Description |
|-------|--------|-------------|
| `mm:encode_worker_generate` | Encode | Full encode request lifetime |
| `mm:enc:cache_check` | Encode | Embedding cache lookup |
| `mm:enc:image_load` | Encode | Image download/load |
| `mm:enc:image_preprocess` | Encode | Image processor (CPU) |
| `mm:enc:vision_encode` | Encode | ViT + projector GPU forward |
| `mm:enc:embedding_transfer` | Encode | RDMA embedding staging |
| `mm:pd_worker_generate` | PD | Full PD request lifetime |
| `mm:pd:ttft` | PD | Worker-side TTFT: from request arrival at the PD worker to first output token (excludes client→frontend→worker network transit) |
| `mm:pd:load_multimodal` | PD | Fetch embeddings from encode worker |
| `mm:pd:disagg_prefill` | PD (disagg) | Prefill-only engine call |
| `mm:pd:disagg_remote_decode` | PD (disagg) | Remote decode round-trip |
| `mm:decode_worker_generate` | Decode | Full decode request lifetime |
| `mm:decode:first_token` | Decode | Time to first output token |
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
## Supported Models
The following models have been tested with Dynamo's vLLM multimodal backend:
- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementations (custom and vLLM-native) |
| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
For a list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.
......@@ -2,11 +2,11 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated multimodal serving with standard Dynamo preprocessing
# Aggregated multimodal image/video serving with standard Dynamo preprocessing
#
# Architecture: Single-worker PD (Prefill-Decode)
# - Frontend: Rust OpenAIPreprocessor handles image URLs (HTTP and data:// base64)
# - Worker: Standard vLLM worker with vision model support
# - Frontend: Rust OpenAIPreprocessor forwards multimodal requests
# - Worker: Standard vLLM worker with multimodal model support
#
# For EPD (Encode-Prefill-Decode) architecture with dedicated encoding worker,
# see agg_multimodal_epd.sh
......@@ -19,7 +19,7 @@ source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
# Default values
MODEL_NAME="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-30B-A3B-Instruct-FP8}"
# Parse command line arguments
# Extra arguments are passed through to the vLLM worker
......@@ -48,13 +48,41 @@ while [[ $# -gt 0 ]]; do
done
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner --multimodal "Launching Aggregated Multimodal Serving" "$MODEL_NAME" "$HTTP_PORT"
# Use TCP transport (instead of default NATS)
# TCP is preferred for multimodal workloads because it overcomes:
# - NATS default 1MB max payload limit (multimodal base64 images can exceed this)
export DYN_REQUEST_PLANE=tcp
print_launch_banner --no-curl "Launching Aggregated Multimodal Serving" "$MODEL_NAME" "$HTTP_PORT" \
"Backend: dynamo.vllm --enable-multimodal" \
"Media: image_url and video_url (model support dependent)"
print_curl_footer <<CURL
curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
-H 'Content-Type: application/json' \\
-d '{
"model": "${MODEL_NAME}",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe the image"},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/300px-PNG_transparency_demonstration_1.png"}}
]}],
"max_tokens": 50
}'
# For video-capable models such as Qwen/Qwen3-VL-2B-Instruct:
curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
-H 'Content-Type: application/json' \\
-d '{
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe the video in detail"},
{"type": "video_url", "video_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}}
]}],
"max_tokens": 128
}'
CURL
# Start frontend with Rust OpenAIPreprocessor
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python -m dynamo.frontend &
......@@ -65,7 +93,7 @@ MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
MODEL_EXTRA_ARGS=""
case "$MODEL_NAME" in
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8)
MAX_MODEL_LEN="${MAX_MODEL_LEN:-108960}"
MAX_MODEL_LEN="108960"
MODEL_EXTRA_ARGS="--tensor-parallel-size=8" ;;
esac
......
......@@ -7,6 +7,9 @@ trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
# Use TCP transport for multimodal workloads (base64 images can exceed NATS 1MB limit)
export DYN_REQUEST_PLANE=tcp
# Default values
MODEL_NAME="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
SINGLE_GPU=false
......
......@@ -8,6 +8,9 @@ SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
# Use TCP transport for multimodal workloads (base64 images can exceed NATS 1MB limit)
export DYN_REQUEST_PLANE=tcp
# Default values
MODEL_NAME="llava-hf/llava-1.5-7b-hf"
......@@ -17,7 +20,7 @@ MODEL_NAME="llava-hf/llava-1.5-7b-hf"
# - Enabling --enforce-eager (disables torch.compile and CUDA graph capture)
# - Hardcoding P/D KV cache to 512 MB (skips all memory profiling)
# - Limiting --max-model-len to 4096 tokens on P/D workers
# - Limiting P/D workers to image=1,video=0,audio=0 (--limit-mm-per-prompt)
# - Limiting P/D workers to image=3,video=3,audio=0 (--limit-mm-per-prompt)
# - Using lower gpu-memory-utilization fractions to share the GPU
SINGLE_GPU=false
......@@ -77,10 +80,17 @@ python -m dynamo.frontend &
EXTRA_ARGS=""
PD_EXTRA_ARGS=""
# GPU assignments (override via environment variables)
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-1}
DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-2}
# GPU assignments (override via environment variables).
# In single-GPU mode all 3 workers default to GPU 0.
if [[ "$SINGLE_GPU" == "true" ]]; then
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-0}
DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-0}
else
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-1}
DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-2}
fi
# GPU memory utilization for workers.
# NOTE: --kv-cache-memory-bytes (set below for P/D workers) overrides
......@@ -93,9 +103,15 @@ if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
echo "WARNING: _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is set but has no effect here because" >&2
echo " --kv-cache-memory-bytes overrides --gpu-memory-utilization in vLLM." >&2
fi
DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.9}
DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.9}
if [[ "$SINGLE_GPU" == "true" ]]; then
DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.1}
DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.4}
DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.4}
else
DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.9}
DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.9}
fi
# 512 MB KV cache per P/D worker. Setting --kv-cache-memory-bytes bypasses vLLM's
# memory profiling entirely (both language model and multimodal encoder), which avoids
......@@ -105,7 +121,7 @@ PD_KV_CACHE_BYTES=$((512 * 1024 * 1024))
if [[ "$SINGLE_GPU" == "true" ]]; then
EXTRA_ARGS="--enforce-eager"
PD_EXTRA_ARGS="--max-model-len 4096 --kv-cache-memory-bytes $PD_KV_CACHE_BYTES --limit-mm-per-prompt {\"image\":1,\"video\":0,\"audio\":0}"
PD_EXTRA_ARGS="--max-model-len 4096 --kv-cache-memory-bytes $PD_KV_CACHE_BYTES --limit-mm-per-prompt {\"image\":3,\"video\":3,\"audio\":0}"
fi
# Start encode worker
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -ex
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
# Default values
HEAD_NODE=0
MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
EXTRA_ARGS=()
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
--head-node)
HEAD_NODE=1
shift 1
;;
--model)
MODEL_NAME=$2
shift 2
;;
-h|--help)
echo "Usage: $0 [OPTIONS]"
echo ""
echo "Disaggregated multimodal serving with separate Prefill/Decode workers for Llama 4"
echo ""
echo "Options:"
echo " --head-node Run as head node. Head node will run the HTTP server, processor and prefill worker."
echo " --model <model_name> Specify the VLM model to use (default: $MODEL_NAME)"
echo " -h, --help Show this help message"
echo ""
echo "Examples:"
echo " # On head node:"
echo " $0 --head-node"
echo ""
echo " # On worker node (requires NATS_SERVER and ETCD_ENDPOINTS pointing to head node):"
echo " $0"
echo ""
exit 0
;;
*)
EXTRA_ARGS+=("$1")
shift
;;
esac
done
trap 'echo Cleaning up...; kill 0' EXIT
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
if [[ $HEAD_NODE -eq 1 ]]; then
print_launch_banner --multimodal "Launching Disaggregated Multimodal Llama 4 (Multi-Node)" "$MODEL_NAME" "$HTTP_PORT"
else
print_launch_banner --no-curl "Launching Disaggregated Multimodal Llama 4 (Multi-Node)" "$MODEL_NAME" "$HTTP_PORT"
fi
# Use TCP transport to avoid NATS payload limits for multimodal
export DYN_REQUEST_PLANE=tcp
# Configure model-specific args
GPU_MEM="0.80"
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
else
GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM"
fi
MODEL_SPECIFIC_ARGS=""
if [[ "$MODEL_NAME" == "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" ]]; then
MODEL_SPECIFIC_ARGS="--tensor-parallel-size=8 --max-model-len=208960 $GPU_MEM_ARGS"
fi
if [[ $HEAD_NODE -eq 1 ]]; then
# run ingress
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python -m dynamo.frontend &
# run processor (CPU-only to avoid competing for GPU memory with workers)
CUDA_VISIBLE_DEVICES="" \
python -m dynamo.vllm --route-to-encoder --enable-multimodal --model $MODEL_NAME &
# Prefill worker handles prompt processing and image encoding
# Uses all 8 GPUs for tensor-parallel
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
python -m dynamo.vllm \
--enable-multimodal \
--model $MODEL_NAME \
--disaggregation-mode prefill \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
$MODEL_SPECIFIC_ARGS \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080"}' \
"${EXTRA_ARGS[@]}" &
else
# run decode worker on non-head node
# Uses all 8 GPUs for tensor-parallel
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
python -m dynamo.vllm \
--enable-multimodal \
--model $MODEL_NAME \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
$MODEL_SPECIFIC_ARGS \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081"}' \
"${EXTRA_ARGS[@]}" &
fi
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated video serving with standard Dynamo preprocessing and vLLM backend.
set -euo pipefail
cleanup() {
echo "Cleaning up..."
local pids
pids="$(jobs -pr)"
if [[ -n "$pids" ]]; then
kill $pids 2>/dev/null || true
fi
}
trap cleanup EXIT
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
export PYTHONPATH="${REPO_ROOT}/components/src:${REPO_ROOT}/lib/bindings/python/src${PYTHONPATH:+:${PYTHONPATH}}"
MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
GPU_DEVICE="${CUDA_VISIBLE_DEVICES:-0}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-2}"
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL_NAME=$2
shift 2
;;
-h|--help)
cat <<USAGE
Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS]
Options:
--model <model_name> Video-capable VLM to serve (default: $MODEL_NAME)
-h, --help Show this help message
Any arguments after '--' are passed through to the vLLM worker.
USAGE
exit 0
;;
--)
shift
EXTRA_ARGS+=("$@")
break
;;
*)
EXTRA_ARGS+=("$1")
shift
;;
esac
done
export DYN_REQUEST_PLANE=tcp
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
print_launch_banner --no-curl "Launching Aggregated Video Serving" "$MODEL_NAME" "$HTTP_PORT" \
"Backend: dynamo.vllm --enable-multimodal" \
"Video path: Standard TokensPrompt multi_modal_data flow"
print_curl_footer <<CURL
curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
-H 'Content-Type: application/json' \\
-d '{
"model": "${MODEL_NAME}",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe the video in detail"},
{"type": "video_url", "video_url": {"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"}}
]}],
"max_tokens": 128
}'
CURL
python -m dynamo.frontend &
CUDA_VISIBLE_DEVICES="$GPU_DEVICE" \
python -m dynamo.vllm \
--enable-multimodal \
--model "$MODEL_NAME" \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_NUM_SEQS" \
$GPU_MEM_ARGS \
"${EXTRA_ARGS[@]}" &
wait_any_exit
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated video serving with standard Dynamo preprocessing and vLLM backend.
set -euo pipefail
cleanup() {
echo "Cleaning up..."
local pids
pids="$(jobs -pr)"
if [[ -n "$pids" ]]; then
kill $pids 2>/dev/null || true
fi
}
trap cleanup EXIT
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
export PYTHONPATH="${REPO_ROOT}/components/src:${REPO_ROOT}/lib/bindings/python/src${PYTHONPATH:+:${PYTHONPATH}}"
MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
SINGLE_GPU=false
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL_NAME=$2
shift 2
;;
--single-gpu)
SINGLE_GPU=true
shift
;;
-h|--help)
cat <<USAGE
Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS]
Options:
--model <model_name> Video-capable VLM to serve (default: $MODEL_NAME)
--single-gpu Run prefill and decode on one GPU for functional testing
-h, --help Show this help message
Any arguments after '--' are passed through to both vLLM workers.
USAGE
exit 0
;;
--)
shift
EXTRA_ARGS+=("$@")
break
;;
*)
EXTRA_ARGS+=("$1")
shift
;;
esac
done
export DYN_REQUEST_PLANE=tcp
if [[ "$SINGLE_GPU" == "true" ]]; then
GPU_LABEL="1 GPU"
PREFILL_GPU="${DYN_PREFILL_WORKER_GPU:-${CUDA_VISIBLE_DEVICES:-0}}"
DECODE_GPU="${DYN_DECODE_WORKER_GPU:-${CUDA_VISIBLE_DEVICES:-0}}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
PD_KV_CACHE_BYTES=$((512 * 1024 * 1024))
SHARED_GPU_FRACTION=$(build_gpu_mem_args vllm --workers-per-gpu 2)
PREFILL_GPU_MEM="${DYN_PREFILL_GPU_MEM:-${SHARED_GPU_FRACTION:-0.45}}"
DECODE_GPU_MEM="${DYN_DECODE_GPU_MEM:-${SHARED_GPU_FRACTION:-0.45}}"
SHARED_ARGS=(
--enforce-eager
--max-model-len "$MAX_MODEL_LEN"
--kv-cache-memory-bytes "$PD_KV_CACHE_BYTES"
--limit-mm-per-prompt '{"image":1,"video":1,"audio":0}'
)
else
GPU_LABEL="2 GPUs"
PREFILL_GPU="${DYN_PREFILL_WORKER_GPU:-0}"
DECODE_GPU="${DYN_DECODE_WORKER_GPU:-1}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
PREFILL_GPU_MEM="${DYN_PREFILL_GPU_MEM:-${GPU_MEM_ARGS:-0.9}}"
DECODE_GPU_MEM="${DYN_DECODE_GPU_MEM:-${GPU_MEM_ARGS:-0.9}}"
SHARED_ARGS=(--max-model-len "$MAX_MODEL_LEN")
fi
print_launch_banner --no-curl "Launching Disaggregated Video Serving ($GPU_LABEL)" "$MODEL_NAME" "$HTTP_PORT" \
"Backend: Prefill + decode workers via dynamo.vllm" \
"Video path: Standard TokensPrompt multi_modal_data flow"
print_curl_footer <<CURL
curl http://localhost:${HTTP_PORT}/v1/chat/completions \\
-H 'Content-Type: application/json' \\
-d '{
"model": "${MODEL_NAME}",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe the video in detail"},
{"type": "video_url", "video_url": {"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"}}
]}],
"max_tokens": 128
}'
CURL
python -m dynamo.frontend &
VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
CUDA_VISIBLE_DEVICES="$PREFILL_GPU" \
python -m dynamo.vllm \
--disaggregation-mode prefill \
--enable-multimodal \
--model "$MODEL_NAME" \
--gpu-memory-utilization "$PREFILL_GPU_MEM" \
"${SHARED_ARGS[@]}" \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081"}' \
"${EXTRA_ARGS[@]}" &
VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
CUDA_VISIBLE_DEVICES="$DECODE_GPU" \
python -m dynamo.vllm \
--disaggregation-mode decode \
--enable-multimodal \
--model "$MODEL_NAME" \
--gpu-memory-utilization "$DECODE_GPU_MEM" \
"${SHARED_ARGS[@]}" \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082"}' \
"${EXTRA_ARGS[@]}" &
wait_any_exit
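The curl footer printed by both video launchers sends an OpenAI-style chat completion request whose user message carries a `video_url` content part. As a minimal sketch, the same request body can be built in Python with only the standard library. The helper name `build_video_chat_payload` is hypothetical and not part of the repository; the model name, sample video URL, prompt, and `max_tokens` value are copied from the scripts above.

```python
import json


def build_video_chat_payload(model, video_url,
                             prompt="Describe the video in detail",
                             max_tokens=128):
    """Build the JSON body shown in the launch scripts' curl footer.

    Hypothetical helper for illustration only; it is not part of Dynamo.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Text instruction first, then the video reference.
                    {"type": "text", "text": prompt},
                    {"type": "video_url", "video_url": {"url": video_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


payload = build_video_chat_payload(
    "Qwen/Qwen3-VL-2B-Instruct",
    "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
)
# Serialized request body, ready to POST with Content-Type: application/json.
body = json.dumps(payload)
```

Assuming the frontend is listening on the scripts' default port (8000), `body` would be posted to `http://localhost:8000/v1/chat/completions`, for example with `urllib.request`.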
......@@ -428,14 +428,6 @@ vllm_configs = {
model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
timeout=300,
env={
"DYN_ENCODE_WORKER_GPU": "0",
"DYN_PREFILL_WORKER_GPU": "0",
"DYN_DECODE_WORKER_GPU": "0",
"DYN_ENCODE_GPU_MEM": "0.1",
"DYN_PREFILL_GPU_MEM": "0.4",
"DYN_DECODE_GPU_MEM": "0.4",
},
request_payloads=[
chat_payload(
[
......@@ -536,11 +528,11 @@ vllm_configs = {
),
],
),
# Video multimodal tests for CI using the vLLM video launch scripts.
# Video multimodal tests for CI use the canonical aggregated multimodal launcher.
"multimodal_video_agg": VLLMConfig(
name="multimodal_video_agg",
directory=vllm_dir,
script_name="video_agg.sh",
script_name="agg_multimodal.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.pre_merge,
......@@ -568,7 +560,7 @@ vllm_configs = {
"multimodal_video_disagg": VLLMConfig(
name="multimodal_video_disagg",
directory=vllm_dir,
script_name="video_disagg.sh",
script_name="disagg_multimodal_epd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.pre_merge,
......