--prompt"Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'"\
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0
```
### Multiple Image Editing (Qwen-Image-Edit-2509)
For multiple image inputs, use `Qwen/Qwen-Image-Edit-2509` or `Qwen/Qwen-Image-Edit-2511`:
```bash
python image_edit.py \
--model Qwen/Qwen-Image-Edit-2509 \
--image img1.png img2.png \
--prompt"Combine these images into a single scene"\
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
```
Key arguments:
- `--model`: model name or path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple-image support.
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Multiple images may be specified.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting `cfg_scale > 1` and providing a `negative_prompt`. A higher guidance scale encourages images that follow the text prompt more closely, usually at the expense of image quality.
- `--cfg_parallel_size`: number of devices used for CFG Parallel. CFG Parallel takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2 (see the sketch after this list).
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (`--cfg_scale`), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when `guidance_scale > 1`; ignored for models that are not guidance-distilled.
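For instance, a minimal sketch that enables true classifier-free guidance (a negative prompt plus `cfg_scale > 1`) and runs CFG Parallel on two devices; the prompt strings are placeholders:
```bash
python image_edit.py \
    --model Qwen/Qwen-Image-Edit-2509 \
    --image img1.png \
    --prompt "Turn this photo into a watercolor painting" \
    --negative_prompt "blurry, low quality" \
    --cfg_scale 4.0 \
    --cfg_parallel_size 2 \
    --output output_image_edit.png
```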
This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.
## Local CLI Usage
### Wan2.2-I2V-A14B-Diffusers (MoE)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image input.png \
--prompt"A cat playing with yarn, smooth motion"\
--negative_prompt"<optional quality filter>"\
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 5.0 \
--guidance_scale_high 6.0 \
--num_inference_steps 40 \
--boundary_ratio 0.875 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
### Wan2.2-TI2V-5B-Diffusers (Unified)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--image input.png \
--prompt"A cat playing with yarn, smooth motion"\
--negative_prompt"<optional quality filter>"\
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 4.0 \
--num_inference_steps 40 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
Key arguments:
- `--model`: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
- `--image`: Path to the input image (required).
- `--prompt`: Text description of the desired motion/animation.
- `--height/--width`: Output resolution (auto-calculated from the image if not set). Dimensions should be multiples of 16.
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--cfg_parallel_size`: number of devices used for CFG Parallel. CFG Parallel takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2 (see the sketch after this list).
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
- `--fps`: Frames per second for the saved MP4 (requires `diffusers` `export_to_video`).
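For example, a sketch that appends the `--cfg_parallel_size` flag to the MoE command above (effective only while classifier-free guidance is active):
```bash
python image_to_video.py \
    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
    --image input.png \
    --prompt "A cat playing with yarn, smooth motion" \
    --guidance_scale 5.0 \
    --cfg_parallel_size 2 \
    --output i2v_output.mp4
```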
This directory contains examples of using LoRA (Low-Rank Adaptation) adapters with vLLM-Omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models supported by vLLM-Omni.
## Overview
Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:
- **Pre-loaded LoRA**: Loaded at initialization via `--lora-path` (pre-loaded into the cache)
- **Per-request LoRA**: Loaded on demand. In the example, the LoRA is loaded via `--lora-request-path` in each request
Both approaches use the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If a request provides no LoRA request, all adapters are deactivated.
## Usage
### Pre-loaded LoRA (via --lora-path)
Load a LoRA adapter at initialization; the adapter is pre-loaded into the cache and can be activated by requests. A usage sketch follows the argument list below.
**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path)
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.
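A sketch of both approaches, assuming the example script is named `lora_example.py` (a hypothetical name; substitute the actual script in this directory):
```bash
# Pre-loaded LoRA: the adapter is loaded into the cache at initialization
# (script name is hypothetical; adapter path is a placeholder)
python lora_example.py \
    --lora-path /path/to/lora_adapter \
    --lora-scale 1.0 \
    --prompt "A watercolor landscape"

# Per-request LoRA: the adapter is loaded on demand for each request
python lora_example.py \
    --lora-request-path /path/to/lora_adapter \
    --lora-request-id 1 \
    --prompt "A watercolor landscape"
```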
### Standard Parameters
- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Get into the example folder
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Get into the example folder
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
### Modality control
If you want to control the output modalities, e.g. to output only text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type mixed_modalities \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via CLI arguments.
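For example, a sketch passing a local image and audio file together (paths are placeholders):
```bash
# a sketch; media paths are placeholders
python end2end.py \
    --query-type mixed_modalities \
    --image-path /path/to/image.png \
    --audio-path /path/to/audio.wav
```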
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Get into the example folder
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Get into the example folder
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
If you do not have enough memory, you can run the thinker with tensor parallelism. Just run the command below.
```bash
bash run_single_prompt_tp.sh
```
### Modality control
If you want to control the output modalities, e.g. to output only text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type use_audio \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via command-line arguments:
This directory contains an offline demo for running Qwen3 TTS models with vLLM Omni. It builds task-specific inputs and generates WAV files locally.
## Model Overview
Qwen3 TTS provides multiple task variants for speech generation:
- **CustomVoice**: Generate speech with a known speaker identity (speaker ID) and an optional instruction.
- **VoiceDesign**: Generate speech from text plus a descriptive instruction that designs a new voice.
- **Base**: Voice cloning using a reference audio + reference transcript, with optional mode selection.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Quick Start
Run a single sample for a task:
```bash
python end2end.py --query-type CustomVoice
```
Generated audio files are saved to `output_audio/` by default.
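The other task variants are selected the same way; a sketch, assuming `--query-type` accepts the variant names listed above:
```bash
# Design a new voice from a descriptive instruction
python end2end.py --query-type VoiceDesign

# Clone a voice from reference audio plus its transcript
python end2end.py --query-type Base
```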
- `--cfg_parallel_size`: number of devices used for CFG Parallel. CFG Parallel takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--num_images_per_prompt`: number of images to generate per prompt (saved as `output`, `output_1`, ...).
> ℹ️ Qwen-Image currently works best at the preset resolutions `1328x1328`, `1664x928`, `928x1664`, `1472x1140`, `1140x1472`, `1584x1056`, and `1056x1584`. Adjust `--height/--width` accordingly for the most reliable results.
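For instance, a sketch selecting the landscape preset (the script name `text_to_image.py` is an assumption; use the actual example script in this directory):
```bash
# script name is hypothetical; 1664x928 is one of the presets above
python text_to_image.py \
    --prompt "A coastal city at dusk" \
    --width 1664 \
    --height 928 \
    --num_images_per_prompt 2
```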
## Web UI Demo
Launch the gradio demo:
```bash
python gradio_demo.py --port 7862
```
Then open `http://localhost:7862/` in your local browser to interact with the web UI.
The `Wan-AI/Wan2.2-T2V-A14B-Diffusers` pipeline generates short videos from text prompts.
## Local CLI Usage
```bash
python text_to_video.py \
--prompt"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."\
--negative_prompt"<optional quality filter>"\
--height 480 \
--width 640 \
--num_frames 32 \
--guidance_scale 4.0 \
--guidance_scale_high 3.0 \
--num_inference_steps 40 \
--fps 16 \
--output t2v_out.mp4
```
Key arguments:
- `--prompt`: text description (string).
- `--height/--width`: output resolution (default 720x1280). Dimensions should align with Wan VAE downsampling (multiples of 8).
- `--num_frames`: number of frames (Wan default is 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages).
- `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
- `--cfg_parallel_size`: number of devices used for CFG Parallel. CFG Parallel takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2 (see the sketch after this list).
- `--boundary_ratio`: boundary split ratio for the low/high DiT.
- `--fps`: frames per second for the saved MP4 (requires `diffusers` `export_to_video`).
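As a sketch, CFG Parallel can be enabled for this command by adding a negative prompt (so classifier-free guidance is active) and the parallel flag:
```bash
python text_to_video.py \
    --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
    --negative_prompt "blurry, low quality" \
    --guidance_scale 4.0 \
    --cfg_parallel_size 2 \
    --output t2v_out.mp4
```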
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
- `--height`: Image height in pixels (default: 512)
- `--width`: Image width in pixels (default: 512)
- `--steps`: Number of inference steps (default: 25)
- `--seed`: Random seed (default: 42)
- `--negative`: Negative prompt for image generation
Example with custom parameters:
```bash
python openai_chat_client.py \
--prompt"A futuristic city"\
--modality text2img \
--height 768 \
--width 768 \
--steps 50 \
--seed 42 \
--negative"blurry, low quality"
```
## Modality Control
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}]
"modalities": ["text"]
}'
```
## FAQ
- If you encounter an error about the librosa backend, try installing ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you encounter an OOM error, try decreasing `max_model_len` (see the sketch below).
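For example, a sketch lowering the context length at serve time (the value 8192 is illustrative; pick one that fits your VRAM):
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --max-model-len 8192
```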
- `--video-path` (or `-v`): Path to a local video file or URL. If not provided and the query type uses video, the default video URL is used. Supports local file paths (automatically encoded to base64) and HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): Path to a local image file or URL. If not provided and the query type uses image, the default image URL is used. Supports local file paths (automatically encoded to base64), HTTP/HTTPS URLs, and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to a local audio file or URL. If not provided and the query type uses audio, the default audio URL is used. Supports local file paths (automatically encoded to base64), HTTP/HTTPS URLs, and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, the default prompt for the selected query type is used. Example: `--prompt "What are the main activities shown in this video?"`
For example, a sketch using mixed modalities with all local files (paths are placeholders):
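```bash
# a sketch; all media paths are placeholders
python end2end.py \
    --query-type mixed_modalities \
    --video-path /path/to/video.mp4 \
    --image-path /path/to/image.png \
    --audio-path /path/to/audio.wav \
    --prompt "Describe what you see and hear."
```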
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text", "audio"]
}'
```
If you want to enable streaming output, set the argument as shown below. The final output of each stage is returned as soon as that stage generates it. Currently only text supports streaming output; other modalities are returned normally.
- `--model` (or `-m`): Model name/path (default: `Qwen/Qwen3-Omni-30B-A3B-Instruct`)
- `--video-path` (or `-v`): Path to a local video file or URL. If not provided and the query type is `use_video`, the default video URL is used. Supports local file paths (automatically encoded to base64) and HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): Path to a local image file or URL. If not provided and the query type is `use_image`, the default image URL is used. Supports local file paths (automatically encoded to base64), HTTP/HTTPS URLs, and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to a local audio file or URL. If not provided and the query type is `use_audio`, the default audio URL is used. Supports local file paths (automatically encoded to base64), HTTP/HTTPS URLs, and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, the default prompt for the selected query type is used. Example: `--prompt "What are the main activities shown in this video?"`
For example, a sketch using a local video file with a custom prompt (assuming the `end2end.py` client described earlier; the path is a placeholder):
```bash
python end2end.py --query-type use_video \
    --video-path /path/to/video.mp4 \
    --prompt "What are the main activities shown in this video?"
```
#### Send request via curl
```bash
bash run_curl_multimodal_generation.sh use_image
```
### FAQ
If you encounter an error about the librosa backend, try installing ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Modality control
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text", "audio"]
}'
```
If you want to enable streaming output, set the argument as shown below. The final output of each stage is returned as soon as that stage generates it. Currently only text supports streaming output; other modalities are returned normally.
This example demonstrates how to deploy Qwen-Image model for online image generation service using vLLM-Omni.
## Start Server
### Basic Start
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091
```
!!! note
    If you encounter Out-of-Memory (OOM) issues or have limited GPU memory, you can enable VAE slicing and tiling to reduce memory usage by adding `--vae-use-slicing --vae-use-tiling`.
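For example, a sketch of the serve command with both flags enabled:
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 --vae-use-slicing --vae-use-tiling
```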
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
Get into the bagel folder
```bash
cd examples/offline_inference/bagel
```
### Modality Control
BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:
#### Text to Image (text2img)
- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
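A sketch invoking this mode with the client shown earlier:
```bash
python openai_chat_client.py \
    --prompt "A futuristic city" \
    --modality text2img
```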
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
--prompt"Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'"\
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0
```
### Multiple Image Editing (Qwen-Image-Edit-2509)
For multiple image inputs, use `Qwen/Qwen-Image-Edit-2509` or `Qwen/Qwen-Image-Edit-2511`:
```bash
python image_edit.py \
--model Qwen/Qwen-Image-Edit-2509 \
--image img1.png img2.png \
--prompt"Combine these images into a single scene"\
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
```
Key arguments:
- `--model`: model name or path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple-image support.
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Multiple images may be specified.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting `cfg_scale > 1` and providing a `negative_prompt`. A higher guidance scale encourages images that follow the text prompt more closely, usually at the expense of image quality.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (`--cfg_scale`), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when `guidance_scale > 1`; ignored for models that are not guidance-distilled.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
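For instance, a memory-constrained sketch of an edit run that reuses the editing script above (the input image path is a placeholder):
```bash
python image_edit.py \
    --model Qwen/Qwen-Image-Edit-2509 \
    --image input.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. Position the bear standing in front of the art board as if painting" \
    --vae_use_slicing \
    --vae_use_tiling \
    --output output_image_edit.png
```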
--prompt"Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting"\
This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.
## Local CLI Usage
### Wan2.2-I2V-A14B-Diffusers (MoE)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image input.png \
--prompt"A cat playing with yarn, smooth motion"\
--negative_prompt"<optional quality filter>"\
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 5.0 \
--guidance_scale_high 6.0 \
--num_inference_steps 40 \
--boundary_ratio 0.875 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
### Wan2.2-TI2V-5B-Diffusers (Unified)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--image input.png \
--prompt"A cat playing with yarn, smooth motion"\
--negative_prompt"<optional quality filter>"\
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 4.0 \
--num_inference_steps 40 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
Key arguments:
- `--model`: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
- `--image`: Path to the input image (required).
- `--prompt`: Text description of the desired motion/animation.
- `--height/--width`: Output resolution (auto-calculated from the image if not set). Dimensions should be multiples of 16.
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
- `--fps`: Frames per second for the saved MP4 (requires `diffusers` `export_to_video`).
- `--output`: Path to save the generated video.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: Enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
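For example, a memory-constrained sketch combining both VAE flags with the TI2V command above:
```bash
python image_to_video.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --image input.png \
    --prompt "A cat playing with yarn, smooth motion" \
    --vae_use_slicing \
    --vae_use_tiling \
    --output i2v_output.mp4
```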