---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: vLLM-Omni
---

Dynamo supports multimodal generation through the [vLLM-Omni](https://github.com/vllm-project/vllm-omni) backend. This integration exposes text-to-text, text-to-image, text-to-video, image-to-video, and text-to-audio (TTS) capabilities via OpenAI-compatible API endpoints.

## Prerequisites

This guide assumes familiarity with deploying Dynamo with vLLM as described in the [vLLM backend guide](README.md).

### Installation

Dynamo container images include vLLM-Omni pre-installed. If you are using `pip install ai-dynamo[vllm]`, vLLM-Omni is **not** included automatically because the matching release is not yet available on PyPI. Install it separately from source, pinning the vLLM-Omni release that matches your installed vLLM version (see the [vLLM-Omni releases](https://github.com/vllm-project/vllm-omni/releases) page):

```bash
pip install git+https://github.com/vllm-project/vllm-omni.git@<version>
```

> **ARM64 not supported:** vLLM-Omni is currently only installed on `amd64` builds. On `arm64`, the container build skips the install and vLLM-Omni features are unavailable.

## Supported Modalities

| Modality | Endpoint(s) | `--output-modalities` |
|---|---|---|
| Text-to-Text | `/v1/chat/completions` | `text` (default) |
| Text-to-Image | `/v1/chat/completions`, `/v1/images/generations` | `image` |
| Text-to-Video | `/v1/videos` | `video` |
| Image-to-Video | `/v1/videos` | `video` |
| Text-to-Audio (TTS) | `/v1/audio/speech` | `audio` |

The `--output-modalities` flag determines which endpoint(s) the worker registers. When set to `text` (the default), the worker serves `/v1/chat/completions`. When set to `image`, both `/v1/chat/completions` (returns inline base64 images) and `/v1/images/generations` are available. When set to `video`, the worker serves `/v1/videos`. When set to `audio`, the worker serves `/v1/audio/speech`.
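
For client code that needs to pick an endpoint from the configured modality, the table above can be restated as a lookup. This is only a sketch: the mapping is copied from the table, while the names `ENDPOINTS_BY_OUTPUT_MODALITY` and `endpoints_for` are illustrative and not part of Dynamo.

```python
# Mapping copied from the "Supported Modalities" table.
# The names below are illustrative and not part of Dynamo.
ENDPOINTS_BY_OUTPUT_MODALITY = {
    "text": ["/v1/chat/completions"],
    "image": ["/v1/chat/completions", "/v1/images/generations"],
    "video": ["/v1/videos"],
    "audio": ["/v1/audio/speech"],
}


def endpoints_for(output_modality="text"):
    """Return the endpoints a worker is expected to register for a modality."""
    try:
        return list(ENDPOINTS_BY_OUTPUT_MODALITY[output_modality])
    except KeyError:
        raise ValueError(f"unknown output modality: {output_modality!r}") from None


print(endpoints_for("image"))
```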

## Tested Models

| Modality | Models |
|---|---|
| Text-to-Text | `Qwen/Qwen2.5-Omni-7B` |
| Text-to-Image | `Qwen/Qwen-Image`, `AIDC-AI/Ovis-Image-7B`, `zai-org/GLM-Image` (disagg) |
| Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers` |
| Image-to-Video | `Wan-AI/Wan2.2-TI2V-5B-Diffusers`, `Wan-AI/Wan2.2-I2V-A14B-Diffusers` |
| Text-to-Audio (TTS) | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`, `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` |

To run a non-default model, pass `--model` to any launch script:

```bash
bash examples/backends/vllm/launch/agg_omni_image.sh --model AIDC-AI/Ovis-Image-7B
bash examples/backends/vllm/launch/agg_omni_video.sh --model Wan-AI/Wan2.2-T2V-A14B-Diffusers
```

## Text-to-Text

Launch an aggregated deployment (frontend + omni worker):

```bash
bash examples/backends/vllm/launch/agg_omni.sh
```

This starts `Qwen/Qwen2.5-Omni-7B` with a single-stage thinker config on one GPU.

Verify the deployment:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Omni-7B",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 50,
    "stream": false
  }'
```

The launch script uses a custom stage config (`stage_configs/single_stage_llm.yaml`) that configures the thinker stage for text generation. See [Stage Configuration](#stage-configuration) for details.

## Text-to-Image

Launch using the provided script with `Qwen/Qwen-Image`:

```bash
bash examples/backends/vllm/launch/agg_omni_image.sh
```

### Via `/v1/chat/completions`

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-Image",
    "messages": [{"role": "user", "content": "A cat sitting on a windowsill"}],
    "stream": false
  }'
```

The response includes base64-encoded images inline:

```json
{
  "choices": [{
    "delta": {
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }
  }]
}
```
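
To turn that inline payload back into image bytes, a client can walk the `content` parts and decode each data URI. The sketch below assumes the response shape shown above; it also checks a `message` key as a precaution, which is an assumption rather than documented behavior. The function name `extract_inline_images` is illustrative.

```python
import base64


def extract_inline_images(response):
    """Decode every base64 data-URI image found in a chat completion response."""
    images = []
    for choice in response.get("choices", []):
        # The sample above uses "delta"; "message" is checked too (assumption).
        body = choice.get("delta") or choice.get("message") or {}
        content = body.get("content")
        if not isinstance(content, list):
            continue  # plain-text content carries no image parts
        for part in content:
            if part.get("type") != "image_url":
                continue
            url = part.get("image_url", {}).get("url", "")
            header, sep, payload = url.partition(",")
            if sep and header.startswith("data:") and header.endswith(";base64"):
                images.append(base64.b64decode(payload))
    return images
```

Each returned item is raw image bytes that can be written directly to a `.png` file.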

### Via `/v1/images/generations`

```bash
curl -s http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-Image",
    "prompt": "A cat sitting on a windowsill",
    "size": "1024x1024",
    "response_format": "url"
  }'
```

## Text-to-Video

Launch using the provided script with `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`:

```bash
bash examples/backends/vllm/launch/agg_omni_video.sh
```

Generate a video via `/v1/videos`:

```bash
curl -s http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    "prompt": "A drone flyover of a mountain landscape",
    "seconds": 2,
    "size": "832x480",
    "response_format": "url"
  }'
```

The response returns a video URL or base64 data depending on `response_format`:

```json
{
  "id": "...",
  "object": "video",
  "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "status": "completed",
  "data": [{"url": "file:///tmp/dynamo_media/videos/req-abc123.mp4"}]
}
```

The `/v1/videos` endpoint also accepts NVIDIA extensions via the `nvext` field for fine-grained control:

| Field | Description | Default |
|---|---|---|
| `nvext.fps` | Frames per second | 24 |
| `nvext.num_frames` | Number of frames (overrides `fps * seconds`) | -- |
| `nvext.negative_prompt` | Negative prompt for guidance | -- |
| `nvext.num_inference_steps` | Number of denoising steps | 50 |
| `nvext.guidance_scale` | CFG guidance scale | 5.0 |
| `nvext.seed` | Random seed for reproducibility | -- |
| `nvext.boundary_ratio` | MoE expert switching boundary (I2V) | 0.875 |
| `nvext.guidance_scale_2` | CFG scale for low-noise expert (I2V) | 1.0 |
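
As a rough illustration of how these fields combine, the sketch below builds a `/v1/videos` request body and computes the frame count implied by the table: `num_frames` when given, otherwise `fps * seconds` with the default `fps` of 24. The helper names are hypothetical, and the server may round or otherwise adjust the frame count; this only restates the arithmetic in the table.

```python
DEFAULT_FPS = 24  # default for nvext.fps according to the table above


def build_video_request(model, prompt, seconds=None, size="832x480", **nvext):
    """Assemble a /v1/videos request body; keyword extras go under `nvext`."""
    payload = {"model": model, "prompt": prompt, "size": size, "response_format": "url"}
    if seconds is not None:
        payload["seconds"] = seconds
    if nvext:
        payload["nvext"] = dict(nvext)
    return payload


def implied_num_frames(payload):
    """Frame count implied by the table: num_frames overrides fps * seconds."""
    nvext = payload.get("nvext", {})
    if "num_frames" in nvext:
        return nvext["num_frames"]
    if "seconds" not in payload:
        raise ValueError("need either nvext.num_frames or seconds")
    return nvext.get("fps", DEFAULT_FPS) * payload["seconds"]


request = build_video_request(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    "A drone flyover of a mountain landscape",
    seconds=2,
    seed=42,
)
print(implied_num_frames(request))
```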

## Image-to-Video

Image-to-video (I2V) uses the same `/v1/videos` endpoint as text-to-video, with an additional `input_reference` field that provides the source image. The image can be an HTTP URL, a base64 data URI, or a local file path.

Launch with the provided script using `Wan-AI/Wan2.2-TI2V-5B-Diffusers`:

```bash
bash examples/backends/vllm/launch/agg_omni_i2v.sh
```

Generate a video from an image:

```bash
curl -s http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
    "prompt": "A bear playing with yarn, smooth motion",
    "input_reference": "https://example.com/bear.png",
    "size": "832x480",
    "response_format": "url",
    "nvext": {
      "num_inference_steps": 40,
      "num_frames": 33,
      "guidance_scale": 1.0,
      "boundary_ratio": 0.875,
      "guidance_scale_2": 1.0,
      "seed": 42
    }
  }'
```

The `input_reference` field accepts:
- **HTTP/HTTPS URL**: `"https://example.com/image.png"`
- **Base64 data URI**: `"data:image/png;base64,iVBORw0KGgo..."`
- **Local file path**: `"/path/to/image.png"` or `"file:///path/to/image.png"`
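
When the image lives on the client machine rather than on the server, the base64 data URI form is the portable option. A minimal sketch of producing one with the standard library follows; the function name `image_to_data_uri` is illustrative.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Encode a local image file as a data URI usable in `input_reference`."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path.name!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the `input_reference` value in the request body shown above.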

The I2V-specific `nvext` fields (`boundary_ratio`, `guidance_scale_2`) control the dual-expert MoE denoising schedule in Wan2.x models. See the [Wan2.2-I2V model card](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers) for details.

## Text-to-Audio (TTS)

Launch using the provided script with `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`:

```bash
bash examples/backends/vllm/launch/agg_omni_audio.sh
```

### CustomVoice (predefined speakers)

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "language": "English"
  }' --output output.wav
```

### CustomVoice with style instructions

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I am so excited!",
    "voice": "vivian",
    "instructions": "Speak with great enthusiasm"
  }' --output excited.wav
```

### VoiceDesign (describe a voice)

```bash
bash examples/backends/vllm/launch/agg_omni_audio.sh --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world",
    "task_type": "VoiceDesign",
    "instructions": "A warm, friendly female voice with a gentle tone"
  }' --output voicedesign.wav
```

### Parameters

The `/v1/audio/speech` endpoint follows the [vLLM-Omni](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/qwen3_tts/) API format. All TTS-specific parameters are top-level fields:

| Field | Description | Default |
|---|---|---|
| `input` | Text to synthesize (required) | -- |
| `model` | TTS model name | auto-detected |
| `voice` | Speaker name (e.g., vivian, ryan). Validated against model config. | Vivian |
| `response_format` | Audio format: wav, mp3, pcm, flac, aac, opus | wav |
| `speed` | Speed factor (0.25-4.0) | 1.0 |
| `task_type` | CustomVoice, VoiceDesign, or Base (Qwen3-TTS) | CustomVoice |
| `language` | Language code. Validated against model config. | Auto |
| `instructions` | Voice style/emotion description. Required for VoiceDesign. | -- |
| `ref_audio` | Reference audio URL or base64 data URI. Required for Base. | -- |
| `ref_text` | Transcript of reference audio (Base task) | -- |
| `max_new_tokens` | Maximum tokens to generate (1-4096) | 2048 |

Available voices and languages are loaded dynamically from the model's `config.json` at startup. Non-Qwen3-TTS audio models (e.g., MiMo-Audio) use a generic text prompt and ignore TTS-specific parameters.
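
The constraints in the table can be checked on the client before sending a request. The sketch below mirrors only the ranges and required fields listed above; the helper name `build_speech_request` is hypothetical, and the server remains the authority on validation, including voice and language names.

```python
ALLOWED_FORMATS = {"wav", "mp3", "pcm", "flac", "aac", "opus"}


def build_speech_request(text, task_type=None, voice=None, language=None,
                         instructions=None, ref_audio=None, ref_text=None,
                         response_format=None, speed=None, max_new_tokens=None):
    """Build a /v1/audio/speech body, enforcing the limits from the table."""
    if not text:
        raise ValueError("`input` is required")
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError("`speed` must be between 0.25 and 4.0")
    if max_new_tokens is not None and not 1 <= max_new_tokens <= 4096:
        raise ValueError("`max_new_tokens` must be between 1 and 4096")
    if response_format is not None and response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported `response_format`: {response_format!r}")
    effective_task = task_type or "CustomVoice"  # table default
    if effective_task == "VoiceDesign" and not instructions:
        raise ValueError("`instructions` is required for VoiceDesign")
    if effective_task == "Base" and not ref_audio:
        raise ValueError("`ref_audio` is required for Base")
    optional = {
        "task_type": task_type, "voice": voice, "language": language,
        "instructions": instructions, "ref_audio": ref_audio, "ref_text": ref_text,
        "response_format": response_format, "speed": speed,
        "max_new_tokens": max_new_tokens,
    }
    payload = {"input": text}
    # Only send fields the caller set, so server-side defaults still apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```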

## CLI Reference

The omni backend uses a dedicated entrypoint: `python -m dynamo.vllm.omni`.

| Flag | Description |
|---|---|
| `--omni` | Enable the vLLM-Omni orchestrator (required for all omni workloads) |
| `--output-modalities <modality>` | Output modality: `text`, `image`, `video`, or `audio` |
| `--stage-configs-path <path>` | Path to stage config YAML (optional; vLLM-Omni uses model defaults if omitted) |
| `--boundary-ratio <float>` | MoE expert switching boundary (default: 0.875) |
| `--flow-shift <float>` | Scheduler flow_shift (5.0 for 720p, 12.0 for 480p) |
| `--vae-use-slicing` | Enable VAE slicing for memory optimization |
| `--vae-use-tiling` | Enable VAE tiling for memory optimization |
| `--default-video-fps <int>` | Default frames per second for generated videos (default: 16) |
| `--enable-layerwise-offload` | Enable layerwise offloading on DiT modules to reduce GPU memory |
| `--layerwise-num-gpu-layers <int>` | Number of ready layers to keep on GPU during generation (default: 1) |
| `--cache-backend <backend>` | Diffusion cache: `cache_dit` or `tea_cache` |
| `--cache-config <json>` | Cache configuration as JSON string (overrides defaults) |
| `--enable-cache-dit-summary` | Enable cache-dit summary logging after diffusion forward passes |
| `--enforce-eager` | Disable torch.compile for diffusion models |
| `--enable-cpu-offload` | Enable CPU offloading for diffusion models |
| `--ulysses-degree <int>` | GPUs for Ulysses sequence parallelism in diffusion (default: 1) |
| `--ring-degree <int>` | GPUs for ring sequence parallelism in diffusion (default: 1) |
| `--cfg-parallel-size <int>` | GPUs for classifier-free guidance parallelism (1 or 2, default: 1) |
| `--media-output-fs-url <url>` | Filesystem URL for storing generated media (default: `file:///tmp/dynamo_media`) |
| `--media-output-http-url <url>` | Base URL for rewriting media paths in responses (optional) |

## Storage Configuration

Generated images, videos, and audio files are stored via [fsspec](https://filesystem-spec.readthedocs.io/), which supports local filesystems, S3, GCS, and Azure Blob.

By default, media is written to the local filesystem at `file:///tmp/dynamo_media`. To use cloud storage:

```bash
bash examples/backends/vllm/launch/agg_omni_video.sh \
  --media-output-fs-url s3://my-bucket/media \
  --media-output-http-url https://cdn.example.com/media
```

When `--media-output-http-url` is set, response URLs are rewritten as `{base-url}/{storage-path}` (e.g., `https://cdn.example.com/media/videos/req-id.mp4`). When unset, the raw filesystem path is returned.
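
The rewriting rule can be pictured as a simple join. This sketch is not Dynamo's implementation: the function name is hypothetical, and the slash handling is an assumption made so that the output matches the examples in this guide.

```python
def media_response_url(storage_path, http_base_url=None,
                       fs_url="file:///tmp/dynamo_media"):
    """Return the URL a response would carry for a stored media file."""
    relative = storage_path.lstrip("/")
    # With an HTTP base URL the result is {base-url}/{storage-path};
    # otherwise the raw filesystem location is returned.
    base = http_base_url if http_base_url else fs_url
    return f"{base.rstrip('/')}/{relative}"


print(media_response_url("videos/req-id.mp4", "https://cdn.example.com/media"))
print(media_response_url("videos/req-abc123.mp4"))
```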

For S3 credential configuration, set the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or use IAM roles. See the [fsspec S3 docs](https://s3fs.readthedocs.io/en/latest/#credentials) for details.

## Stage Configuration

Omni pipelines are configured via YAML stage configs. See [`examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/stage_configs/single_stage_llm.yaml) for an example. For full documentation on stage config format and multi-stage pipelines, refer to the [vLLM-Omni Stage Configs documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/).

## Disaggregated Multi-Stage Serving

For models with multiple pipeline stages (e.g., AR + Diffusion), Dynamo supports disaggregated serving where each stage runs as an independent process on its own GPU. This enables independent scaling, GPU isolation, and multi-worker replicas per stage.

### Architecture

A lightweight router coordinates the stage processes, acting as a **pure message broker**: it never inspects or transforms inter-stage data.

```mermaid
flowchart LR
  client(Client) --> frontend(Frontend)
  frontend --> router(Router)
  router -->|request| s0(Stage 0)
  s0 -->|ref| router
  router -->|ref| s1(Stage 1)
  s1 -->|result| router
  router --> frontend --> client
  s0 <-->|bulk data| conn[(Connector)]
  conn <--> s1
```

**How it works:**

- The router sends the initial request to Stage 0 and receives back a lightweight connector reference (pointer to the output in shared memory).
- The router forwards that reference — unchanged — to Stage 1. It never reads the bulk data.
- Each stage fetches its inputs from the connector, runs any model-specific processor (e.g., `ar2diffusion`, `thinker2talker`), then runs its engine.
- The final stage's result goes back to the router for formatting and response.
- Connector references accumulate as the pipeline progresses, so any stage can access outputs from all previous stages.
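
The steps above can be modeled in a few lines of Python. This is a toy, single-process model and not Dynamo code: `Connector`, `make_stage`, and `route` are hypothetical names, the connector is a plain dictionary rather than shared memory, and the model-specific processor is folded into the `engine` callable.

```python
class Connector:
    """In-memory stand-in for the bulk-data connector."""

    def __init__(self):
        self._store = {}

    def put(self, ref, data):
        self._store[ref] = data
        return ref  # only this lightweight reference travels via the router

    def get(self, ref):
        return self._store[ref]


def make_stage(stage_id, connector, engine, final=False):
    """Build a stage that fetches inputs by reference and runs its engine."""
    def run(request, refs):
        inputs = [connector.get(ref) for ref in refs]
        output = engine(request, inputs)
        if final:
            return output  # the final result goes back to the router
        return connector.put(f"stage{stage_id}/{request['id']}", output)
    return run


def route(request, stages):
    """Forward accumulated references stage by stage without reading bulk data."""
    refs = []
    for stage in stages[:-1]:
        refs.append(stage(request, list(refs)))
    return stages[-1](request, list(refs)), refs


connector = Connector()
ar = make_stage(0, connector, lambda req, inputs: [101, 102, 103])
dit = make_stage(1, connector, lambda req, inputs: {"image_from_tokens": inputs[0]}, final=True)
result, refs = route({"id": "req-1", "prompt": "A red apple"}, [ar, dit])
print(result, refs)
```

Note that `route` never calls `connector.get`: only the stages touch the stored data, which is the property the architecture relies on.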

### Data Flow

```mermaid
sequenceDiagram
  participant C as Client
  participant R as Router
  participant S0 as Stage 0 (AR)
  participant SHM as Connector
  participant S1 as Stage 1 (DiT)

  C->>R: POST /v1/images/generations
  R->>S0: request + prompt
  S0->>SHM: store output
  S0-->>R: connector ref
  R->>S1: connector ref (opaque)
  S1->>SHM: fetch output
  S1->>S1: processor → engine
  S1-->>R: result
  R-->>C: {"data": [...]}
```

### Quick Start: GLM-Image (2-Stage, 2 GPUs)

GLM-Image is a 2-stage text-to-image model with an AR stage (generates prior token IDs) and a DiT stage (diffusion denoising + VAE decode). The built-in vLLM-Omni stage config already assigns each stage to a separate GPU.

> **Experimental:** GLM-Image support is experimental; generation may fail or produce incorrect/garbled outputs for some prompts and sizes.
>
> **Known issue:** GLM-Image requires `transformers>=5.0` to recognize the `glm_image` architecture. Older versions fail at model config creation with `The checkpoint you are trying to load has model type 'glm_image' but Transformers does not recognize this architecture`.

```bash
bash examples/backends/vllm/launch/disagg_omni_glm_image.sh
```

Test:

```bash
curl -s http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-Image",
    "prompt": "A red apple on a white table",
    "size": "1024x1024",
    "response_format": "url"
  }' | jq
```

### Scaling Stage Replicas

Each stage registers independently with Dynamo's service discovery. To scale a bottleneck stage, launch additional workers with the same `--stage-id` on different GPUs — the router automatically load-balances across all replicas for that stage. Other stages are unaffected.

### Tested Models

| Model | Stages | Output | Stage Config |
|---|---|---|---|
| GLM-Image (`zai-org/GLM-Image`) | AR -> DiT | Image | `glm_image.yaml` (built-in) |

### CLI Flags (Disaggregated Mode)

| Flag | Description |
|---|---|
| `--stage-id <int>` | Run as a single-stage worker for the given stage ID. Requires `--stage-configs-path`. |
| `--omni-router` | Run as the stage router. Requires `--stage-configs-path`. Mutually exclusive with `--stage-id`. |
| `--stage-configs-path <path>` | Path to vLLM-Omni stage configuration YAML. |

## Current Limitations

- Image input is supported only for I2V via `input_reference` in `/v1/videos`. Other endpoints accept text prompts only.
- KV cache events are not published for omni workers.
- Each worker supports a single output modality at a time.
- Audio: streaming (`stream: true`) is not yet supported.
- Audio: Base task (voice cloning) is not yet supported.
- Disaggregated mode: `async_chunk=true` (streaming between stages) is not yet supported.