# Image-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_image>.
This example edits an input image with `Qwen/Qwen-Image-Edit` using the `image_edit.py` CLI.
## Local CLI Usage
### Single Image Editing
Download the example image:
```bash
wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png
```
Then run:
```bash
python image_edit.py \
--image qwen-bear.png \
--prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0
```
### Multiple Image Editing (Qwen-Image-Edit-2509)
For multiple image inputs, use `Qwen/Qwen-Image-Edit-2509` or `Qwen/Qwen-Image-Edit-2511`:
```bash
python image_edit.py \
--model Qwen/Qwen-Image-Edit-2509 \
--image img1.png img2.png \
--prompt "Combine these images into a single scene" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
```
Key arguments:
- `--model`: model name or path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple image support.
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--output`: path to save the generated PNG.
## Example materials
??? abstract "image_edit.py"
``````py
--8<-- "examples/offline_inference/image_to_image/image_edit.py"
``````
??? abstract "run_qwen_image_edit_2511.sh"
``````sh
--8<-- "examples/offline_inference/image_to_image/run_qwen_image_edit_2511.sh"
``````
# Image-To-Video
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video>.
This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.
## Local CLI Usage
### Wan2.2-I2V-A14B-Diffusers (MoE)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 5.0 \
--guidance_scale_high 6.0 \
--num_inference_steps 40 \
--boundary_ratio 0.875 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
### Wan2.2-TI2V-5B-Diffusers (Unified)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 4.0 \
--num_inference_steps 40 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
Key arguments:
- `--model`: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
- `--image`: Path to input image (required).
- `--prompt`: Text description of desired motion/animation.
- `--height/--width`: Output resolution (auto-calculated from image if not set). Dimensions should be multiples of 16.
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
- `--fps`: Frames per second for the saved MP4 (uses the `export_to_video` helper from `diffusers`; see the sketch after this list).
- `--output`: Path to save the generated video.
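The video export itself relies on `diffusers`' `export_to_video` helper. As a minimal sketch (the example script already handles this internally; `frames` stands in for whatever list of PIL frames your pipeline run produces):
```python
# Minimal sketch; image_to_video.py already does this internally. `frames` is
# assumed to be a list of PIL.Image frames produced by the pipeline run.
from diffusers.utils import export_to_video

def save_frames_as_mp4(frames, path="i2v_output.mp4", fps=16):
    # export_to_video assembles the frame list into an MP4 at the given frame rate.
    export_to_video(frames, path, fps=fps)
```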
## Example materials
??? abstract "image_to_video.py"
``````py
--8<-- "examples/offline_inference/image_to_video/image_to_video.py"
``````
# LoRA-Inference
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.
This contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The example uses the `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models in vLLM-omni.
## Overview
Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:
- **Pre-loaded LoRA**: Loaded at initialization via `--lora-path` (pre-loaded into cache)
- **Per-request LoRA**: Loaded on-demand. In the example, the LoRA is loaded via `--lora-request-path` in each request
Both approaches use the same underlying mechanism - all LoRA adapters are handled uniformly through `set_active_adapter()`. If no LoRA request is provided in a request, all adapters are deactivated.
## Usage
### Pre-loaded LoRA (via --lora-path)
Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests:
```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_preloaded.png
```
**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
### Per-request LoRA (via --lora-request-path)
Load a LoRA adapter on-demand for each request:
```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-request-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_per_request.png
```
### No LoRA
If no LoRA request is provided, we will use the base model without any LoRA adapters:
```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_no_lora.png
```
## Parameters
### LoRA Parameters
- `--lora-path`: Path to LoRA adapter folder to pre-load at initialization (loads into cache with a stable ID derived from the path)
- `--lora-request-path`: Path to LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, will derive a stable ID from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.
### Standard Parameters
- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works
All LoRA adapters are handled uniformly:
1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded/activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated
The system uses LRU cache management - adapters are cached and evicted when the cache is full (unless pinned).
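To make the eviction behavior concrete, here is a purely conceptual sketch of an LRU cache with pinning (not vLLM-Omni's actual implementation):
```python
from collections import OrderedDict

# Conceptual sketch only -- not vLLM-Omni's implementation. It illustrates the
# described policy: entries are evicted least-recently-used when the cache is
# full, except for pinned entries, which are never evicted.
class LoRACacheSketch:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # adapter_id -> adapter, oldest first
        self.pinned = set()

    def get(self, adapter_id):
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)  # mark as most recently used
        return self.entries.get(adapter_id)

    def put(self, adapter_id, adapter, pin=False):
        self.entries[adapter_id] = adapter
        self.entries.move_to_end(adapter_id)
        if pin:
            self.pinned.add(adapter_id)
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:  # everything is pinned, nothing can be evicted
                break
            del self.entries[victim]
```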
## LoRA Adapter Format
LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:
```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
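Before pointing `--lora-path` or `--lora-request-path` at a folder, you can sanity-check that it matches this layout with a small standard-library snippet (an illustrative sketch, not part of the example code):
```python
import json
from pathlib import Path

def looks_like_peft_adapter(adapter_dir):
    """Rough check that a folder matches the PEFT adapter layout above."""
    root = Path(adapter_dir)
    config = root / "adapter_config.json"
    weights = root / "adapter_model.safetensors"
    if not (config.is_file() and weights.is_file()):
        return False
    # adapter_config.json normally records the LoRA settings (e.g. "peft_type", rank "r").
    cfg = json.loads(config.read_text())
    return cfg.get("peft_type", "").upper() == "LORA" or "r" in cfg

print(looks_like_peft_adapter("lora_adapter/"))
```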
## Example materials
??? abstract "lora_inference.py"
``````py
--8<-- "examples/offline_inference/lora_inference/lora_inference.py"
``````
# Qwen2.5-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen2_5_omni>.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Get into the example folder
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Get into the example folder
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
### Modality control
If you want to control output modalities, e.g. only output text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type mixed_modalities \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via CLI arguments:
```bash
# Use single local media files
python end2end.py --query-type use_image --image-path /path/to/image.jpg
python end2end.py --query-type use_video --video-path /path/to/video.mp4
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav
# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
--video-path /path/to/video.mp4 \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav
# Use audio from video file
python end2end.py --query-type use_audio_in_video --video-path /path/to/video.mp4
```
If media file paths are not provided, the script will use default assets. Supported query types:
- `use_image`: Image input only
- `use_video`: Video input only
- `use_audio`: Audio input only
- `mixed_modalities`: Audio + image + video
- `use_audio_in_video`: Extract audio from video
- `text`: Text-only query
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Example materials
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen2_5_omni/end2end.py"
``````
??? abstract "extract_prompts.py"
``````py
--8<-- "examples/offline_inference/qwen2_5_omni/extract_prompts.py"
``````
??? abstract "run_multiple_prompts.sh"
``````sh
--8<-- "examples/offline_inference/qwen2_5_omni/run_multiple_prompts.sh"
``````
??? abstract "run_single_prompt.sh"
``````sh
--8<-- "examples/offline_inference/qwen2_5_omni/run_single_prompt.sh"
``````
# Qwen3-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_omni>.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Get into the example folder
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Get into the example folder
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
If you do not have enough memory, you can run the thinker with tensor parallelism. Just run the command below.
```bash
bash run_single_prompt_tp.sh
```
### Modality control
If you want to control output modalities, e.g. only output text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type use_audio \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via command-line arguments:
```bash
# Use local video file
python end2end.py --query-type use_video --video-path /path/to/video.mp4
# Use local image file
python end2end.py --query-type use_image --image-path /path/to/image.jpg
# Use local audio file
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav
# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
--video-path /path/to/video.mp4 \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav
```
If media file paths are not provided, the script will use default assets. Supported query types:
- `use_video`: Video input
- `use_image`: Image input
- `use_audio`: Audio input
- `text`: Text-only query
- `multi_audios`: Multiple audio inputs
- `mixed_modalities`: Combination of video, image, and audio inputs
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Example materials
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen3_omni/end2end.py"
``````
??? abstract "run_multiple_prompts.sh"
``````sh
--8<-- "examples/offline_inference/qwen3_omni/run_multiple_prompts.sh"
``````
??? abstract "run_single_prompt.sh"
``````sh
--8<-- "examples/offline_inference/qwen3_omni/run_single_prompt.sh"
``````
??? abstract "run_single_prompt_tp.sh"
``````sh
--8<-- "examples/offline_inference/qwen3_omni/run_single_prompt_tp.sh"
``````
??? abstract "text_prompts_10.txt"
``````txt
--8<-- "examples/offline_inference/qwen3_omni/text_prompts_10.txt"
``````
# Qwen3-TTS Offline Inference
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_tts>.
This directory contains an offline demo for running Qwen3 TTS models with vLLM Omni. It builds task-specific inputs and generates WAV files locally.
## Model Overview
Qwen3 TTS provides multiple task variants for speech generation:
- **CustomVoice**: Generate speech with a known speaker identity (speaker ID) and optional instruction.
- **VoiceDesign**: Generate speech from text plus a descriptive instruction that designs a new voice.
- **Base**: Voice cloning using a reference audio + reference transcript, with optional mode selection.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Quick Start
Run a single sample for a task:
```
python end2end.py --query-type CustomVoice
```
Generated audio files are saved to `output_audio/` by default.
## Task Usage
### CustomVoice
Single sample:
```
python end2end.py --query-type CustomVoice
```
Batch sample (multiple prompts in one run):
```
python end2end.py --query-type CustomVoice --use-batch-sample
```
### VoiceDesign
Single sample:
```
python end2end.py --query-type VoiceDesign
```
Batch sample:
```
python end2end.py --query-type VoiceDesign --use-batch-sample
```
### Base (Voice Clone)
Single sample:
```
python end2end.py --query-type Base
```
Batch sample:
```
python end2end.py --query-type Base --use-batch-sample
```
Mode selection for Base:
- `--mode-tag icl` (default): standard mode
- `--mode-tag xvec_only`: enable `x_vector_only_mode` in the request
Examples:
```
python end2end.py --query-type Base --mode-tag icl
```
## Notes
- The script uses the model paths embedded in `end2end.py`. Update them if your local cache path differs.
- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.
## Example materials
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen3_tts/end2end.py"
``````
# Text-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_image>.
This folder provides several entrypoints for experimenting with `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512`, and `Tongyi-MAI/Z-Image-Turbo` using vLLM-Omni:
- `text_to_image.py`: command-line script for single image generation with advanced options.
- `gradio_demo.py`: lightweight Gradio UI for interactive prompt/seed/CFG exploration.
Note that when you pass in multiple independent prompts, they will be processed sequentially. Batching requests is currently not supported.
## Basic Usage
```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    prompt = "a cup of coffee on the table"
    outputs = omni.generate(prompt)
    images = outputs[0].request_output[0].images
    images[0].save("coffee.png")
```
Or put more than one prompt in a request.
```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    outputs = omni.generate(prompts)
    for i, output in enumerate(outputs):
        output.request_output[0].images[0].save(f"{i}.jpg")
```
!!! info
However, it is not currently recommended to do so
because not all models support batch inference,
and batch requesting mostly does not provide significant performance improvement (despite the impression that it does).
This feature is primarily for the sake of interface compatibility with vLLM and to allow for future improvements.
!!! info
For diffusion pipelines, the stage config field `stage_args.[].runtime.max_batch_size` is 1 by default, and the input
list is sliced into single-item requests before feeding into the diffusion pipeline. For models that do internally support
batched inputs, you can [modify this configuration](../../../configuration/stage_configs.md) to let the model accept a longer batch of prompts.
Apart from string prompts, vLLM-Omni also supports dictionary prompts in the same style as vLLM.
This is useful for models that support negative prompts.
```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    outputs = omni.generate([
        {
            "prompt": "a cup of coffee on a table",
            "negative_prompt": "low resolution",
        },
        {
            "prompt": "a toy dinosaur on a sandy beach",
            "negative_prompt": "cinematic, realistic",
        },
    ])
    for i, output in enumerate(outputs):
        output.request_output[0].images[0].save(f"{i}.jpg")
```
## Local CLI Usage
```bash
python text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--prompt "a cup of coffee on the table" \
--seed 42 \
--cfg_scale 4.0 \
--num_images_per_prompt 1 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output outputs/coffee.png
```
Key arguments:
- `--prompt`: text description (string).
- `--seed`: integer seed for deterministic sampling.
- `--cfg_scale`: true CFG scale (model-specific guidance strength).
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--num_images_per_prompt`: number of images to generate per prompt (saves as `output`, `output_1`, ...).
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--height/--width`: output resolution (defaults 1024x1024).
- `--output`: path to save the generated PNG.
> ℹ️ Qwen-Image currently publishes best-effort presets at `1328x1328`, `1664x928`, `928x1664`, `1472x1140`, `1140x1472`, `1584x1056`, and `1056x1584`. Adjust `--height/--width` accordingly for the most reliable outcomes.
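If you are starting from an arbitrary target size, a small helper can pick the closest preset for you (illustrative only, not part of `text_to_image.py`; the presets are assumed to be listed as width x height):
```python
# Illustrative helper (not part of text_to_image.py). The presets are assumed
# to be listed as width x height; pass the chosen pair via --width/--height.
PRESETS_WXH = [
    (1328, 1328), (1664, 928), (928, 1664),
    (1472, 1140), (1140, 1472), (1584, 1056), (1056, 1584),
]

def closest_preset(target_w, target_h):
    target_ratio = target_w / target_h
    return min(PRESETS_WXH, key=lambda wh: abs(wh[0] / wh[1] - target_ratio))

w, h = closest_preset(1920, 1080)  # roughly 16:9 -> picks 1664x928
print(f"--width {w} --height {h}")
```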
## Web UI Demo
Launch the gradio demo:
```bash
python gradio_demo.py --port 7862
```
Then open `http://localhost:7862/` on your local browser to interact with the web UI.
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/offline_inference/text_to_image/gradio_demo.py"
``````
??? abstract "text_to_image.py"
``````py
--8<-- "examples/offline_inference/text_to_image/text_to_image.py"
``````
# Text-To-Video
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_video>.
The `Wan-AI/Wan2.2-T2V-A14B-Diffusers` pipeline generates short videos from text prompts.
## Local CLI Usage
```bash
python text_to_video.py \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 640 \
--num_frames 32 \
--guidance_scale 4.0 \
--guidance_scale_high 3.0 \
--num_inference_steps 40 \
--fps 16 \
--output t2v_out.mp4
```
Key arguments:
- `--prompt`: text description (string).
- `--height/--width`: output resolution (defaults 720x1280). Dimensions should align with Wan VAE downsampling (multiples of 8).
- `--num_frames`: Number of frames (Wan default is 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages).
- `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for low/high DiT.
- `--fps`: frames per second for the saved MP4 (requires `diffusers` export_to_video).
- `--output`: path to save the generated video.
## Example materials
??? abstract "text_to_video.py"
``````py
--8<-- "examples/offline_inference/text_to_video/text_to_video.py"
``````
# BAGEL-7B-MoT
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/bagel>.
## 🛠️ Installation
Please refer to [README.md](../../../README.md)
## Run examples (BAGEL-7B-MoT)
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
### Launch the Server
```bash
# Use default configuration
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091
```
Or use the convenience script:
```bash
cd /workspace/vllm-omni/examples/online_serving/bagel
bash run_server.sh
```
If you have a custom stage configs file, launch the server with the command below:
```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
### Send Multi-modal Request
Get into the bagel folder:
```bash
cd examples/online_serving/bagel
```
Send request via Python
```bash
python openai_chat_client.py --prompt "A cute cat" --modality text2img
```
The Python client supports the following command-line arguments:
- `--prompt` (or `-p`): Text prompt for generation (default: `A cute cat`)
- `--output` (or `-o`): Output file path for image results (default: `bagel_output.png`)
- `--server` (or `-s`): Server URL (default: `http://localhost:8091`)
- `--image-url` (or `-i`): Input image URL or local file path (for img2img/img2text modes)
- `--modality` (or `-m`): Task modality (default: `text2img`). Options: `text2img`, `img2img`, `img2text`, `text2text`
- `--height`: Image height in pixels (default: 512)
- `--width`: Image width in pixels (default: 512)
- `--steps`: Number of inference steps (default: 25)
- `--seed`: Random seed (default: 42)
- `--negative`: Negative prompt for image generation
Example with custom parameters:
```bash
python openai_chat_client.py \
--prompt "A futuristic city" \
--modality text2img \
--height 768 \
--width 768 \
--steps 50 \
--seed 42 \
--negative "blurry, low quality"
```
## Modality Control
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
| Modality | Input | Output | Description |
| ----------- | ------------ | ------ | -------------------------------------- |
| `text2img` | Text | Image | Generate images from text prompts |
| `img2img` | Image + Text | Image | Transform images using text guidance |
| `img2text` | Image + Text | Text | Generate text descriptions from images |
| `text2text` | Text | Text | Pure text generation |
### Text to Image (text2img)
Generate images from text prompts:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "A beautiful sunset over mountains" \
--modality text2img \
--output sunset.png \
--steps 50
```
**Using curl**
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>A beautiful sunset over mountains<|im_end|>"}]}],
"modalities": ["image"],
"height": 512,
"width": 512,
"num_inference_steps": 50,
"seed": 42
}'
```
### Image to Image (img2img)
Transform images based on text prompts:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "Make the cat stand up" \
--modality img2img \
--image-url /path/to/input.jpg \
--output transformed.png
```
**Using curl**
```bash
IMAGE_BASE64=$(base64 -w 0 cat.jpg)
cat <<EOF > payload.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "<|im_start|>Make the cat stand up<|im_end|>"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
]
}],
"modalities": ["image"],
"height": 512,
"width": 512,
"num_inference_steps": 50,
"seed": 42
}
EOF
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d @payload.json
```
### Image to Text (img2text)
Generate text descriptions from images:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "Describe this image in detail" \
--modality img2text \
--image-url /path/to/image.jpg
```
**Using curl**
```bash
IMAGE_BASE64=$(base64 -w 0 cat.jpg)
cat <<EOF > payload.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "<|im_start|>user\n<|image_pad|>\nDescribe this image in detail<|im_end|>\n<|im_start|>assistant\n"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
]
}],
"modalities": ["text"]
}
EOF
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d @payload.json
```
### Text to Text (text2text)
Pure text generation:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "What is the capital of France?" \
--modality text2text
```
**Using curl**
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}]
"modalities": ["text"]
}'
```
## FAQ
- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs or you encounter an OOM error, you can try decreasing `max_model_len`.
| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |
# Image-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/image_to_image>.
This example demonstrates how to deploy Qwen-Image-Edit model for online image editing service using vLLM-Omni.
For **multi-image** input editing, use **Qwen-Image-Edit-2509** (QwenImageEditPlusPipeline) and send multiple images in the user message content.
## Start Server
### Basic Start
```bash
vllm serve Qwen/Qwen-Image-Edit --omni --port 8092
```
### Multi-Image Edit (Qwen-Image-Edit-2509)
```bash
vllm serve Qwen/Qwen-Image-Edit-2509 --omni --port 8092
```
### Start with Parameters
Or use the startup script:
```bash
bash run_server.sh
```
To serve Qwen-Image-Edit-2509 with the script:
```bash
MODEL=Qwen/Qwen-Image-Edit-2509 bash run_server.sh
```
## API Calls
### Method 1: Using curl (Image Editing)
```bash
# Image editing
bash run_curl_image_edit.sh input.png "Convert this image to watercolor style"
# Or execute directly
IMG_B64=$(base64 -w0 input.png)
cat <<EOF > request.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Convert this image to watercolor style"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,$IMG_B64"}}
]
}],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"guidance_scale": 1,
"seed": 42
}
}
EOF
curl -s http://localhost:8092/v1/chat/completions -H "Content-Type: application/json" -d @request.json | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2 | base64 -d > output.png
```
### Method 2: Using Python Client
```bash
python openai_chat_client.py --input input.png --prompt "Convert to oil painting style" --output output.png
# Multi-image editing (Qwen-Image-Edit-2509 server required)
python openai_chat_client.py --input input1.png input2.png --prompt "Combine these images into a single scene" --output output.png
```
### Method 3: Using Gradio Demo
```bash
python gradio_demo.py
# Visit http://localhost:7861
```
## Request Format
### Image Editing (Using image_url Format)
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Convert this image to watercolor style"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}
]
}
```
### Image Editing (Using Simplified image Format)
```json
{
"messages": [
{
"role": "user",
"content": [
{"text": "Convert this image to watercolor style"},
{"image": "BASE64_IMAGE_DATA"}
]
}
]
}
```
### Image Editing with Parameters
Use `extra_body` to pass generation parameters:
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Convert to ink wash painting style"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}
],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"guidance_scale": 7.5,
"seed": 42
}
}
```
### Multi-Image Editing (Qwen-Image-Edit-2509)
Provide multiple images in `content` (order matters):
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Combine these images into a single scene"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} },
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} }
]
}
]
}
```
## Generation Parameters (extra_body)
| Parameter | Type | Default | Description |
| ------------------------ | ----- | ------- | ------------------------------------- |
| `height` | int | None | Output image height in pixels |
| `width` | int | None | Output image width in pixels |
| `size` | str | None | Output image size (e.g., "1024x1024") |
| `num_inference_steps` | int | 50 | Number of denoising steps |
| `guidance_scale` | float | 7.5 | CFG guidance scale |
| `seed` | int | None | Random seed (reproducible) |
| `negative_prompt` | str | None | Negative prompt |
| `num_outputs_per_prompt` | int | 1 | Number of images to generate |
## Response Format
```json
{
"id": "chatcmpl-xxx",
"created": 1234567890,
"model": "Qwen/Qwen-Image-Edit",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": [{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,..."
}
}]
},
"finish_reason": "stop"
}],
"usage": {...}
}
```
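The same request and decoding that the curl pipeline above performs with `jq` and `base64` can be done from Python with `requests` (a minimal sketch against the documented request/response format; adjust the host, port, and file names to your setup):
```python
import base64
import requests

# Minimal sketch: send one image-edit request and save the returned image.
# Mirrors the documented request/response format; adjust paths and port as needed.
with open("input.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this image to watercolor style"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
    "extra_body": {"num_inference_steps": 50, "seed": 42},
}

resp = requests.post("http://localhost:8092/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
# The edited image comes back as a data URL: strip the "data:image/png;base64," prefix.
data_url = resp.json()["choices"][0]["message"]["content"][0]["image_url"]["url"]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(data_url.split(",", 1)[1]))
```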
## Common Editing Instructions Examples
| Instruction | Description |
| ---------------------------------------- | ---------------- |
| `Convert this image to watercolor style` | Style transfer |
| `Convert the image to black and white` | Desaturation |
| `Enhance the color saturation` | Color adjustment |
| `Convert to cartoon style` | Cartoonization |
| `Add vintage filter effect` | Filter effect |
| `Convert daytime scene to nighttime` | Scene conversion |
## File Description
| File | Description |
| ------------------------ | ---------------------------- |
| `run_server.sh` | Server startup script |
| `run_curl_image_edit.sh` | curl image editing example |
| `openai_chat_client.py` | Python client |
| `gradio_demo.py` | Gradio interactive interface |
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/image_to_image/gradio_demo.py"
``````
??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/image_to_image/openai_chat_client.py"
``````
??? abstract "run_curl_image_edit.sh"
``````sh
--8<-- "examples/online_serving/image_to_image/run_curl_image_edit.sh"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/image_to_image/run_server.sh"
``````
# LoRA-Inference
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/lora_inference>.
This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.
> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
> Note: This example uses `/v1/chat/completions`. LoRA payloads for other OpenAI endpoints are not implemented here.
## Start Server
```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image
bash run_server.sh
```
## Call API (curl)
```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter
# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123
bash run_curl_lora_inference.sh
```
## Call API (Python)
```bash
python openai_chat_client.py \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora_adapter \
--lora-name my_lora \
--lora-scale 1.0 \
--output output.png
```
## LoRA Format
LoRA adapters should be in PEFT format, for example:
```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/lora_inference/openai_chat_client.py"
``````
??? abstract "run_curl_lora_inference.sh"
``````sh
--8<-- "examples/online_serving/lora_inference/run_curl_lora_inference.sh"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/lora_inference/run_server.sh"
``````
# Qwen2.5-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen2_5_omni>.
## 🛠️ Installation
Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md)
## Run examples (Qwen2.5-Omni)
### Launch the Server
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
```
If you have a custom stage configs file, launch the server with the command below:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
### Send Multi-modal Request
Get into the example folder
```bash
cd examples/online_serving/qwen2_5_omni
```
#### Send request via python
```bash
python openai_chat_completion_client_for_multimodal_generation.py --query-type mixed_modalities
```
The Python client supports the following command-line arguments:
- `--query-type` (or `-q`): Query type (default: `mixed_modalities`). Options: `mixed_modalities`, `use_audio_in_video`, `multi_audios`, `text`
- `--video-path` (or `-v`): Path to local video file or URL. If not provided and query-type uses video, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): Path to local image file or URL. If not provided and query-type uses image, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to local audio file or URL. If not provided and query-type uses audio, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
For example, to use mixed modalities with all local files:
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type mixed_modalities \
--video-path /path/to/your/video.mp4 \
--image-path /path/to/your/image.jpg \
--audio-path /path/to/your/audio.wav \
--prompt "Analyze all the media content and provide a comprehensive summary."
```
#### Send request via curl
```bash
bash run_curl_multimodal_generation.sh mixed_modalities
```
## Modality control
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["audio"]
}'
```
### Using Python client
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type mixed_modalities \
--modalities text
```
### Using OpenAI Python SDK
#### Text only
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Omni-7B",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text"]
)
print(response.choices[0].message.content)
```
#### Text + Audio
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Omni-7B",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content) # Text response
print(response.choices[1].message.audio) # Audio response
```
## Streaming Output
If you want to enable streaming output, set the argument as shown below. Each final output is returned as soon as the corresponding stage generates it. Currently only text supports streaming output; other modalities are returned normally.
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type mixed_modalities \
--stream
```
## Run Local Web UI Demo
This Web UI demo allows users to interact with the model through a web browser.
### Running Gradio Demo
The Gradio demo connects to a vLLM API server. You have two options:
#### Option 1: One-step Launch Script (Recommended)
The convenience script launches both the vLLM server and Gradio demo together:
```bash
./run_gradio_demo.sh --model Qwen/Qwen2.5-Omni-7B --server-port 8091 --gradio-port 7861
```
This script will:
1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C
The script supports the following arguments:
- `--model`: Model name/path (default: Qwen/Qwen2.5-Omni-7B)
- `--server-port`: Port for vLLM server (default: 8091)
- `--gradio-port`: Port for Gradio demo (default: 7861)
- `--stage-configs-path`: Path to custom stage configs YAML file (optional)
- `--server-host`: Host for vLLM server (default: 0.0.0.0)
- `--gradio-ip`: IP for Gradio demo (default: 127.0.0.1)
- `--share`: Share Gradio demo publicly (creates a public link)
#### Option 2: Manual Launch (Two-Step Process)
**Step 1: Launch the vLLM API server**
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
```
If you have a custom stage configs file:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
**Step 2: Run the Gradio demo**
In a separate terminal:
```bash
python gradio_demo.py --model Qwen/Qwen2.5-Omni-7B --api-base http://localhost:8091/v1 --port 7861
```
Then open `http://localhost:7861/` on your local browser to interact with the web UI.
The gradio script supports the following arguments:
- `--model`: Model name/path (should match the server model)
- `--api-base`: Base URL for the vLLM API server (default: http://localhost:8091/v1)
- `--ip`: Host/IP for Gradio server (default: 127.0.0.1)
- `--port`: Port for Gradio server (default: 7861)
- `--share`: Share the Gradio demo publicly (creates a public link)
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/qwen2_5_omni/gradio_demo.py"
``````
??? abstract "openai_chat_completion_client_for_multimodal_generation.py"
``````py
--8<-- "examples/online_serving/qwen2_5_omni/openai_chat_completion_client_for_multimodal_generation.py"
``````
??? abstract "run_curl_multimodal_generation.sh"
``````sh
--8<-- "examples/online_serving/qwen2_5_omni/run_curl_multimodal_generation.sh"
``````
??? abstract "run_gradio_demo.sh"
``````sh
--8<-- "examples/online_serving/qwen2_5_omni/run_gradio_demo.sh"
``````
# Qwen3-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen3_omni>.
## 🛠️ Installation
Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md)
## Run examples (Qwen3-Omni)
### Launch the Server
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
```
If you want to enable async chunking for Qwen3-Omni, launch the server with the command below:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml
```
If you have a custom stage configs file, launch the server with the command below:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
### Send Multi-modal Request
Get into the example folder
```bash
cd examples/online_serving/qwen3_omni
```
#### Send request via python
```bash
python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image
```
The Python client supports the following command-line arguments:
- `--query-type` (or `-q`): Query type (default: `use_video`). Options: `text`, `use_audio`, `use_image`, `use_video`
- `--model` (or `-m`): Model name/path (default: `Qwen/Qwen3-Omni-30B-A3B-Instruct`)
- `--video-path` (or `-v`): Path to local video file or URL. If not provided and query-type is `use_video`, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): Path to local image file or URL. If not provided and query-type is `use_image`, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to local audio file or URL. If not provided and query-type is `use_audio`, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
For example, to use a local video file with custom prompt:
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_video \
--video-path /path/to/your/video.mp4 \
--prompt "What are the main activities shown in this video?"
```
#### Send request via curl
```bash
bash run_curl_multimodal_generation.sh use_image
```
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Modality control
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["audio"]
}'
```
### Using Python client
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--modalities text
```
### Using OpenAI Python SDK
#### Text only
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text"]
)
print(response.choices[0].message.content)
```
#### Text + Audio
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content) # Text response
print(response.choices[1].message.audio) # Audio response
```
## Streaming Output
If you want to enable streaming output, set the argument as shown below. Each final output is returned as soon as the corresponding stage generates it. Currently only text supports streaming output; other modalities are returned normally.
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--stream
```
## Run Local Web UI Demo
This Web UI demo allows users to interact with the model through a web browser.
### Running Gradio Demo
The Gradio demo connects to a vLLM API server. You have two options:
#### Option 1: One-step Launch Script (Recommended)
The convenience script launches both the vLLM server and Gradio demo together:
```bash
./run_gradio_demo.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct --server-port 8091 --gradio-port 7861
```
This script will:
1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C
The script supports the following arguments:
- `--model`: Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)
- `--server-port`: Port for vLLM server (default: 8091)
- `--gradio-port`: Port for Gradio demo (default: 7861)
- `--stage-configs-path`: Path to custom stage configs YAML file (optional)
- `--server-host`: Host for vLLM server (default: 0.0.0.0)
- `--gradio-ip`: IP for Gradio demo (default: 127.0.0.1)
- `--share`: Share Gradio demo publicly (creates a public link)
#### Option 2: Manual Launch (Two-Step Process)
**Step 1: Launch the vLLM API server**
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
```
If you have a custom stage configs file:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
**Step 2: Run the Gradio demo**
In a separate terminal:
```bash
python gradio_demo.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --api-base http://localhost:8091/v1 --port 7861
```
Then open `http://localhost:7861/` on your local browser to interact with the web UI.
The gradio script supports the following arguments:
- `--model`: Model name/path (should match the server model)
- `--api-base`: Base URL for the vLLM API server (default: http://localhost:8091/v1)
- `--ip`: Host/IP for Gradio server (default: 127.0.0.1)
- `--port`: Port for Gradio server (default: 7861)
- `--share`: Share the Gradio demo publicly (creates a public link)
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/qwen3_omni/gradio_demo.py"
``````
??? abstract "openai_chat_completion_client_for_multimodal_generation.py"
``````py
--8<-- "examples/online_serving/qwen3_omni/openai_chat_completion_client_for_multimodal_generation.py"
``````
??? abstract "qwen3_omni_moe_thinking.yaml"
``````yaml
--8<-- "examples/online_serving/qwen3_omni/qwen3_omni_moe_thinking.yaml"
``````
??? abstract "run_curl_multimodal_generation.sh"
``````sh
--8<-- "examples/online_serving/qwen3_omni/run_curl_multimodal_generation.sh"
``````
??? abstract "run_gradio_demo.sh"
``````sh
--8<-- "examples/online_serving/qwen3_omni/run_gradio_demo.sh"
``````
# Text-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_image>.
This example demonstrates how to deploy Qwen-Image model for online image generation service using vLLM-Omni.
## Start Server
### Basic Start
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091
```
!!! note
If you encounter out-of-memory (OOM) issues or have limited GPU memory, you can enable VAE slicing and tiling to reduce memory usage by passing `--vae-use-slicing --vae-use-tiling`.
### Start with Parameters
Or use the startup script:
```bash
bash run_server.sh
```
## API Calls
### Method 1: Using curl
```bash
# Basic text-to-image generation
bash run_curl_text_to_image.sh
# Or execute directly
curl -s http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"true_cfg_scale": 4.0,
"seed": 42
}
}' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png
```
### Method 2: Using Python Client
```bash
python openai_chat_client.py --prompt "A beautiful landscape painting" --output output.png
```
### Method 3: Using Gradio Demo
```bash
python gradio_demo.py
# Visit http://localhost:7860
```
## Request Format
### Simple Text Generation
```json
{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
]
}
```
### Generation with Parameters
Use `extra_body` to pass generation parameters:
```json
{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"true_cfg_scale": 4.0,
"seed": 42
}
}
```
### Multimodal Input (Text + Structured Content)
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "A beautiful landscape painting"}
]
}
]
}
```
## Generation Parameters (extra_body)
| Parameter | Type | Default | Description |
| ------------------------ | ----- | ------- | ------------------------------ |
| `height` | int | None | Image height in pixels |
| `width` | int | None | Image width in pixels |
| `size` | str | None | Image size (e.g., "1024x1024") |
| `num_inference_steps` | int | 50 | Number of denoising steps |
| `true_cfg_scale` | float | 4.0 | Qwen-Image CFG scale |
| `seed` | int | None | Random seed (reproducible) |
| `negative_prompt` | str | None | Negative prompt |
| `num_outputs_per_prompt` | int | 1 | Number of images to generate |
| `--cfg-parallel-size` | int | 1 | Number of GPUs for CFG parallelism |
## Response Format
```json
{
"id": "chatcmpl-xxx",
"created": 1234567890,
"model": "Qwen/Qwen-Image",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": [{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,..."
}
}]
},
"finish_reason": "stop"
}],
"usage": {...}
}
```
## Extract Image
```bash
# Extract base64 from response and decode to image
cat response.json | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png
```
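The same extraction can be done from Python with `requests` (a minimal sketch against the documented request and response formats; adjust the host and port to your deployment):
```python
import base64
import requests

# Minimal sketch of the documented request/response shapes using the
# generation parameters above; adjust host, port, and values for your setup.
payload = {
    "messages": [{"role": "user", "content": "A beautiful landscape painting"}],
    "extra_body": {
        "height": 1024,
        "width": 1024,
        "num_inference_steps": 50,
        "true_cfg_scale": 4.0,
        "seed": 42,
    },
}

resp = requests.post("http://localhost:8091/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
# The generated image is returned as a base64 data URL.
data_url = resp.json()["choices"][0]["message"]["content"][0]["image_url"]["url"]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(data_url.split(",", 1)[1]))
```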
## File Description
| File | Description |
| --------------------------- | ---------------------------- |
| `run_server.sh` | Server startup script |
| `run_curl_text_to_image.sh` | curl example |
| `openai_chat_client.py` | Python client |
| `gradio_demo.py` | Gradio interactive interface |
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/text_to_image/gradio_demo.py"
``````
??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/text_to_image/openai_chat_client.py"
``````
??? abstract "run_curl_text_to_image.sh"
``````sh
--8<-- "examples/online_serving/text_to_image/run_curl_text_to_image.sh"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/text_to_image/run_server.sh"
``````
# BAGEL-7B-MoT
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
Get into the bagel folder
```bash
cd examples/offline_inference/bagel
```
### Modality Control
BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:
#### Text to Image (text2img)
- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends KV cache to DiT for conditioned generation
Generate images from text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat"
```
#### Image to Image (img2img)
- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage, direct image-to-image transformation
Transform images based on text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2img \
--image-path /path/to/image.jpg \
--prompts "Let the woman wear a blue dress"
```
#### Image to Text (img2text)
- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding AND ViT semantic encoding for comprehensive image understanding
Generate text descriptions from images:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2text \
--image-path /path/to/image.jpg \
--prompts "Describe this image in detail"
```
#### Text to Text (text2text)
- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved, operates as pure language model
Pure text generation:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--prompts "What is the capital of France?"
# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--txt-prompts /path/to/prompts.txt
```
### Inference Steps
Control the number of inference steps for image generation:
```bash
# Increase --steps (e.g., to 100) to improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--steps 50 \
--prompts "A cute cat"
```
### Key arguments
The default YAML configuration deploys the Thinker and DiT stages on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
#### 📌 Command Line Arguments (end2end.py)
| Argument | Type | Default | Description |
| :--------------------- | :----- | :---------------------------- | :----------------------------------------------------------- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts directly |
| `--txt-prompts` | string | `None` | Path to txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Initialization sleep time |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |
------
#### ⚙️ Stage Configuration Parameters (bagel.yaml)
**Stage 0 - Thinker (LLM Stage)**
| Parameter | Value | Description |
| :------------------------------- | :------------------------------ | :----------------------- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send KV cache |
------
**Stage 1 - DiT (Diffusion Stage)**
| Parameter | Value | Description |
| :------------------------------- | :---------- | :-------------------------- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive KV cache |
| `engine_input_source` | `[0]` | Input source from Stage 0 |
------
#### 🔗 Runtime Configuration
| Parameter | Value | Description |
| :-------------------- | :------ | :------------------------------- |
| `window_size` | `-1` | Window size (-1 means unlimited) |
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |
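To run with a customized stage configuration (for example, placing the Thinker and DiT stages on different GPUs), point `end2end.py` at your YAML file via `--stage-configs-path`, or pass it to the `Omni` constructor directly as `end2end.py` does. A minimal sketch; the file name is illustrative:
```python
from vllm_omni.entrypoints.omni import Omni

# Use a customized copy of bagel.yaml (file name is illustrative), e.g. with
# Stage 0 on device "0" and Stage 1 on device "1" for a dual-GPU setup.
omni = Omni(
    model="ByteDance-Seed/BAGEL-7B-MoT",
    stage_configs_path="bagel_dual_gpu.yaml",
)
```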
## FAQ
- If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you run into an OOM error, try decreasing `max_model_len`. Approximate per-stage VRAM usage is listed below, followed by a quick sanity-check sketch.
| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |
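As a rough pre-flight check, you can compare the figures above with the free memory reported by your GPU; this is only a hedged sketch, since the extra headroom needed for the KV cache still depends on `max_model_len` and batch size:
```python
import torch

# Weights-only requirement from the table above (GiB); the Thinker KV cache comes on top.
REQUIRED_GIB = 15.04 + 26.50  # Thinker + DiT on a single GPU

free_bytes, _total_bytes = torch.cuda.mem_get_info()
free_gib = free_bytes / 1024**3
print(f"Free VRAM: {free_gib:.1f} GiB, required (weights only): ~{REQUIRED_GIB:.2f} GiB")
if free_gib < REQUIRED_GIB:
    print("Consider a dual-GPU stage configuration, or lower max_model_len / gpu_memory_utilization.")
```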
import argparse
import os
from typing import cast
from vllm_omni.inputs.data import OmniDiffusionSamplingParams, OmniPromptType
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model",
default="ByteDance-Seed/BAGEL-7B-MoT",
help="Path to merged model directory.",
)
parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.")
parser.add_argument(
"--txt-prompts",
type=str,
default=None,
help="Path to a .txt file with one prompt per line (preferred).",
)
parser.add_argument("--prompt_type", default="text", choices=["text"])
parser.add_argument(
"--modality",
default="text2img",
choices=["text2img", "img2img", "img2text", "text2text"],
help="Modality mode to control stage execution.",
)
parser.add_argument(
"--image-path",
type=str,
default=None,
help="Path to input image for img2img.",
)
# OmniLLM init args
parser.add_argument("--enable-stats", action="store_true", default=False)
parser.add_argument("--init-sleep-seconds", type=int, default=20)
parser.add_argument("--batch-timeout", type=int, default=5)
parser.add_argument("--init-timeout", type=int, default=300)
parser.add_argument("--shm-threshold-bytes", type=int, default=65536)
parser.add_argument("--worker-backend", type=str, default="process", choices=["process", "ray"])
parser.add_argument("--ray-address", type=str, default=None)
parser.add_argument("--stage-configs-path", type=str, default=None)
parser.add_argument("--steps", type=int, default=50, help="Number of inference steps.")
args = parser.parse_args()
return args
def main():
args = parse_args()
model_name = args.model
prompts: list[OmniPromptType] = []
try:
# Preferred: load from txt file (one prompt per line)
if getattr(args, "txt_prompts", None) and args.prompt_type == "text":
with open(args.txt_prompts, encoding="utf-8") as f:
lines = [ln.strip() for ln in f.readlines()]
prompts = [ln for ln in lines if ln != ""]
print(f"[Info] Loaded {len(prompts)} prompts from {args.txt_prompts}")
else:
prompts = args.prompts
except Exception as e:
print(f"[Error] Failed to load prompts: {e}")
raise
if not prompts:
# Default prompt for text2img test if none provided
prompts = ["<|im_start|>A cute cat<|im_end|>"]
print(f"[Info] No prompts provided, using default: {prompts}")
omni_outputs = []
from PIL import Image
if args.modality == "img2img":
from PIL import Image
from vllm_omni.entrypoints.omni_diffusion import OmniDiffusion
print("[Info] Running in img2img mode (Stage 1 only)")
client = OmniDiffusion(model=model_name)
if args.image_path:
if os.path.exists(args.image_path):
loaded_image = Image.open(args.image_path).convert("RGB")
prompts = [
{
"prompt": cast(str, p),
"multi_modal_data": {"image": loaded_image},
}
for p in prompts
]
else:
print(f"[Warning] Image path {args.image_path} does not exist.")
result = client.generate(
prompts,
OmniDiffusionSamplingParams(
seed=52,
need_kv_receive=False,
num_inference_steps=args.steps,
),
)
# Ensure result is a list for iteration
if not isinstance(result, list):
omni_outputs = [result]
else:
omni_outputs = result
else:
from vllm_omni.entrypoints.omni import Omni
omni_kwargs = {}
if args.stage_configs_path:
omni_kwargs["stage_configs_path"] = args.stage_configs_path
omni_kwargs.update(
{
"log_stats": args.enable_stats,
"init_sleep_seconds": args.init_sleep_seconds,
"batch_timeout": args.batch_timeout,
"init_timeout": args.init_timeout,
"shm_threshold_bytes": args.shm_threshold_bytes,
"worker_backend": args.worker_backend,
"ray_address": args.ray_address,
}
)
omni = Omni(model=model_name, **omni_kwargs)
formatted_prompts = []
        for p in prompts:  # use the resolved prompt list (supports --txt-prompts and the default prompt)
if args.modality == "img2text":
if args.image_path:
loaded_image = Image.open(args.image_path).convert("RGB")
final_prompt_text = f"<|im_start|>user\n<|image_pad|>\n{p}<|im_end|>\n<|im_start|>assistant\n"
prompt_dict = {
"prompt": final_prompt_text,
"multi_modal_data": {"image": loaded_image},
"modalities": ["text"],
}
formatted_prompts.append(prompt_dict)
elif args.modality == "text2text":
final_prompt_text = f"<|im_start|>user\n{p}<|im_end|>\n<|im_start|>assistant\n"
prompt_dict = {"prompt": final_prompt_text, "modalities": ["text"]}
formatted_prompts.append(prompt_dict)
else:
# text2img
final_prompt_text = f"<|im_start|>{p}<|im_end|>"
prompt_dict = {"prompt": final_prompt_text, "modalities": ["image"]}
formatted_prompts.append(prompt_dict)
params_list = omni.default_sampling_params_list
if args.modality == "text2img":
params_list[0].max_tokens = 1 # type: ignore # The first stage is a SamplingParam (vllm)
if len(params_list) > 1:
params_list[1].num_inference_steps = args.steps # type: ignore # The second stage is an OmniDiffusionSamplingParam
omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list))
for i, req_output in enumerate(omni_outputs):
images = getattr(req_output, "images", None)
if not images and hasattr(req_output, "output"):
if isinstance(req_output.output, list):
images = req_output.output
else:
images = [req_output.output]
if images:
for j, img in enumerate(images):
img.save(f"output_{i}_{j}.png")
if hasattr(req_output, "request_output") and req_output.request_output:
for stage_out in req_output.request_output:
if hasattr(stage_out, "images") and stage_out.images:
for k, img in enumerate(stage_out.images):
save_path = f"output_{i}_stage_{getattr(stage_out, 'stage_id', '?')}_{k}.png"
img.save(save_path)
print(f"[Info] Saved stage output image to {save_path}")
print(omni_outputs)
if __name__ == "__main__":
main()
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example script for image editing with Qwen-Image-Edit.
Usage (single image):
python image_edit.py \
--image input.png \
--prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
Usage (multiple images):
python image_edit.py \
--image input1.png input2.png input3.png \
--prompt "Combine these images into a single scene" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
Usage (with cache-dit acceleration):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--cache_backend cache_dit \
--cache_dit_max_continuous_cached_steps 3 \
--cache_dit_residual_diff_threshold 0.24 \
--cache_dit_enable_taylorseer
Usage (with tea_cache acceleration):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--cache_backend tea_cache \
--tea_cache_rel_l1_thresh 0.25
Usage (layered):
python image_edit.py \
--model "Qwen/Qwen-Image-Layered" \
--image input.png \
--prompt "" \
--output "layered" \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--layers 4 \
--color-format "RGBA"
Usage (with CFG Parallel):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--cfg_parallel_size 2 \
--num_inference_steps 50 \
--cfg_scale 4.0
Usage (disable torch.compile):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--enforce_eager \
--num_inference_steps 50 \
--cfg_scale 4.0
For more options, run:
python image_edit.py --help
"""
import argparse
import os
import time
from pathlib import Path
import torch
from PIL import Image
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.outputs import OmniRequestOutput
from vllm_omni.platforms import current_omni_platform
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Edit an image with Qwen-Image-Edit.")
parser.add_argument(
"--model",
default="Qwen/Qwen-Image-Edit",
help=(
"Diffusion model name or local path. "
"For multiple image inputs, use Qwen/Qwen-Image-Edit-2509 or Qwen/Qwen-Image-Edit-2511"
"which supports QwenImageEditPlusPipeline."
),
)
parser.add_argument(
"--image",
type=str,
nargs="+",
required=True,
help="Path(s) to input image file(s) (PNG, JPG, etc.). Can specify multiple images.",
)
parser.add_argument(
"--prompt",
type=str,
required=True,
help="Text prompt describing the edit to make to the image.",
)
parser.add_argument(
"--negative_prompt",
type=str,
default=None,
required=False,
)
parser.add_argument(
"--seed",
type=int,
default=0,
help="Random seed for deterministic results.",
)
parser.add_argument(
"--cfg_scale",
type=float,
default=4.0,
help=(
"True classifier-free guidance scale (default: 4.0). Guidance scale as defined in Classifier-Free "
"Diffusion Guidance. Classifier-free guidance is enabled by setting cfg_scale > 1 and providing "
"a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, "
"usually at the expense of lower image quality."
),
)
parser.add_argument(
"--guidance_scale",
type=float,
default=1.0,
help=(
"Guidance scale for guidance-distilled models (default: 1.0, disabled). "
"Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale "
"directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models."
),
)
parser.add_argument(
"--output",
type=str,
default="output_image_edit.png",
help=("Path to save the edited image (PNG). Or prefix for Qwen-Image-Layered model save images(PNG)."),
)
parser.add_argument(
"--num_outputs_per_prompt",
type=int,
default=1,
help="Number of images to generate for the given prompt.",
)
parser.add_argument(
"--num_inference_steps",
type=int,
default=50,
help="Number of denoising steps for the diffusion sampler.",
)
parser.add_argument(
"--cache_backend",
type=str,
default=None,
choices=["cache_dit", "tea_cache"],
help=(
"Cache backend to use for acceleration. "
"Options: 'cache_dit' (DBCache + SCM + TaylorSeer), 'tea_cache' (Timestep Embedding Aware Cache). "
"Default: None (no cache acceleration)."
),
)
parser.add_argument(
"--ulysses_degree",
type=int,
default=1,
help="Number of GPUs used for ulysses sequence parallelism.",
)
parser.add_argument(
"--ring_degree",
type=int,
default=1,
help="Number of GPUs used for ring sequence parallelism.",
)
parser.add_argument(
"--tensor_parallel_size",
type=int,
default=1,
help="Number of GPUs used for tensor parallelism (TP) inside the DiT.",
)
parser.add_argument("--layers", type=int, default=4, help="Number of layers to decompose the input image into.")
parser.add_argument(
"--resolution",
type=int,
default=640,
help="Bucket in (640, 1024) to determine the condition and output resolution",
)
parser.add_argument(
"--color-format",
type=str,
default="RGB",
help="For Qwen-Image-Layered, set to RGBA.",
)
# Cache-DiT specific parameters
parser.add_argument(
"--cache_dit_fn_compute_blocks",
type=int,
default=1,
help="[cache-dit] Number of forward compute blocks. Optimized for single-transformer models.",
)
parser.add_argument(
"--cache_dit_bn_compute_blocks",
type=int,
default=0,
help="[cache-dit] Number of backward compute blocks.",
)
parser.add_argument(
"--cache_dit_max_warmup_steps",
type=int,
default=4,
help="[cache-dit] Maximum warmup steps (works for few-step models).",
)
parser.add_argument(
"--cache_dit_residual_diff_threshold",
type=float,
default=0.24,
help="[cache-dit] Residual diff threshold. Higher values enable more aggressive caching.",
)
parser.add_argument(
"--cache_dit_max_continuous_cached_steps",
type=int,
default=3,
help="[cache-dit] Maximum continuous cached steps to prevent precision degradation.",
)
parser.add_argument(
"--cache_dit_enable_taylorseer",
action="store_true",
default=False,
help="[cache-dit] Enable TaylorSeer acceleration (not suitable for few-step models).",
)
parser.add_argument(
"--cache_dit_taylorseer_order",
type=int,
default=1,
help="[cache-dit] TaylorSeer polynomial order.",
)
parser.add_argument(
"--cache_dit_scm_steps_mask_policy",
type=str,
default=None,
choices=[None, "slow", "medium", "fast", "ultra"],
help="[cache-dit] SCM mask policy: None (disabled), slow, medium, fast, ultra.",
)
parser.add_argument(
"--cache_dit_scm_steps_policy",
type=str,
default="dynamic",
choices=["dynamic", "static"],
help="[cache-dit] SCM steps policy: dynamic or static.",
)
# TeaCache specific parameters
parser.add_argument(
"--tea_cache_rel_l1_thresh",
type=float,
default=0.2,
help="[tea_cache] Threshold for accumulated relative L1 distance.",
)
parser.add_argument(
"--cfg_parallel_size",
type=int,
default=1,
choices=[1, 2],
help="Number of GPUs used for classifier free guidance parallel size.",
)
parser.add_argument(
"--enforce_eager",
action="store_true",
help="Disable torch.compile and force eager execution.",
)
parser.add_argument(
"--vae_use_slicing",
action="store_true",
help="Enable VAE slicing for memory optimization.",
)
parser.add_argument(
"--vae_use_tiling",
action="store_true",
help="Enable VAE tiling for memory optimization.",
)
parser.add_argument(
"--enable-cpu-offload",
action="store_true",
help="Enable CPU offloading for diffusion models.",
)
parser.add_argument(
"--enable-layerwise-offload",
action="store_true",
help="Enable layerwise (blockwise) offloading on DiT modules.",
)
parser.add_argument(
"--layerwise-num-gpu-layers",
type=int,
default=1,
help="Number of ready layers (blocks) to keep on GPU during generation.",
)
return parser.parse_args()
def main():
args = parse_args()
# Validate input images exist and load them
input_images = []
for image_path in args.image:
if not os.path.exists(image_path):
raise FileNotFoundError(f"Input image not found: {image_path}")
img = Image.open(image_path).convert(args.color_format)
input_images.append(img)
# Use single image or list based on number of inputs
if len(input_images) == 1:
input_image = input_images[0]
else:
input_image = input_images
generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)
parallel_config = DiffusionParallelConfig(
ulysses_degree=args.ulysses_degree,
ring_degree=args.ring_degree,
cfg_parallel_size=args.cfg_parallel_size,
tensor_parallel_size=args.tensor_parallel_size,
)
# Configure cache based on backend type
cache_config = None
if args.cache_backend == "cache_dit":
# cache-dit configuration: Hybrid DBCache + SCM + TaylorSeer
cache_config = {
"Fn_compute_blocks": args.cache_dit_fn_compute_blocks,
"Bn_compute_blocks": args.cache_dit_bn_compute_blocks,
"max_warmup_steps": args.cache_dit_max_warmup_steps,
"residual_diff_threshold": args.cache_dit_residual_diff_threshold,
"max_continuous_cached_steps": args.cache_dit_max_continuous_cached_steps,
"enable_taylorseer": args.cache_dit_enable_taylorseer,
"taylorseer_order": args.cache_dit_taylorseer_order,
"scm_steps_mask_policy": args.cache_dit_scm_steps_mask_policy,
"scm_steps_policy": args.cache_dit_scm_steps_policy,
}
elif args.cache_backend == "tea_cache":
# TeaCache configuration
cache_config = {
"rel_l1_thresh": args.tea_cache_rel_l1_thresh,
# Note: coefficients will use model-specific defaults based on model_type
}
# Initialize Omni with appropriate pipeline
omni = Omni(
model=args.model,
enable_layerwise_offload=args.enable_layerwise_offload,
layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
vae_use_slicing=args.vae_use_slicing,
vae_use_tiling=args.vae_use_tiling,
cache_backend=args.cache_backend,
cache_config=cache_config,
parallel_config=parallel_config,
enforce_eager=args.enforce_eager,
enable_cpu_offload=args.enable_cpu_offload,
)
print("Pipeline loaded")
# Check if profiling is requested via environment variable
profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
# Time profiling for generation
print(f"\n{'=' * 60}")
print("Generation Configuration:")
print(f" Model: {args.model}")
print(f" Inference steps: {args.num_inference_steps}")
print(f" Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
if isinstance(input_image, list):
print(f" Number of input images: {len(input_image)}")
for idx, img in enumerate(input_image):
print(f" Image {idx + 1} size: {img.size}")
else:
print(f" Input image size: {input_image.size}")
print(
f" Parallel configuration: ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}, cfg_parallel_size={args.cfg_parallel_size}, tensor_parallel_size={args.tensor_parallel_size}"
)
print(f"{'=' * 60}\n")
generation_start = time.perf_counter()
if profiler_enabled:
print("[Profiler] Starting profiling...")
omni.start_profile()
# Generate edited image
outputs = omni.generate(
{
"prompt": args.prompt,
"negative_prompt": args.negative_prompt,
"multi_modal_data": {"image": input_image},
},
OmniDiffusionSamplingParams(
generator=generator,
true_cfg_scale=args.cfg_scale,
guidance_scale=args.guidance_scale,
num_inference_steps=args.num_inference_steps,
num_outputs_per_prompt=args.num_outputs_per_prompt,
layers=args.layers,
resolution=args.resolution,
),
)
generation_end = time.perf_counter()
generation_time = generation_end - generation_start
# Print profiling results
print(f"Total generation time: {generation_time:.4f} seconds ({generation_time * 1000:.2f} ms)")
if profiler_enabled:
print("\n[Profiler] Stopping profiler and collecting results...")
profile_results = omni.stop_profile()
if profile_results and isinstance(profile_results, dict):
traces = profile_results.get("traces", [])
print("\n" + "=" * 60)
print("PROFILING RESULTS:")
for rank, trace in enumerate(traces):
print(f"\nRank {rank}:")
if trace:
print(f" • Trace: {trace}")
if not traces:
print(" No traces collected.")
print("=" * 60)
else:
print("[Profiler] No valid profiling data returned.")
if not outputs:
raise ValueError("No output generated from omni.generate()")
# Extract images from OmniRequestOutput
# omni.generate() returns list[OmniRequestOutput], extract images from request_output[0].images
first_output = outputs[0]
if not hasattr(first_output, "request_output") or not first_output.request_output:
raise ValueError("No request_output found in OmniRequestOutput")
req_out = first_output.request_output[0]
if not isinstance(req_out, OmniRequestOutput) or not hasattr(req_out, "images"):
raise ValueError("Invalid request_output structure or missing 'images' key")
images = req_out.images
if not images:
raise ValueError("No images found in request_output")
# Save output image(s)
output_path = Path(args.output)
output_path.parent.mkdir(parents=True, exist_ok=True)
suffix = output_path.suffix or ".png"
stem = output_path.stem or "output_image_edit"
# Handle layered output (each image may be a list of layers)
if args.num_outputs_per_prompt <= 1:
img = images[0]
# Check if this is a layered output (list of images)
if isinstance(img, list):
for sub_idx, sub_img in enumerate(img):
save_path = output_path.parent / f"{stem}_{sub_idx}{suffix}"
sub_img.save(save_path)
print(f"Saved edited image to {os.path.abspath(save_path)}")
else:
img.save(output_path)
print(f"Saved edited image to {os.path.abspath(output_path)}")
else:
for idx, img in enumerate(images):
# Check if this is a layered output (list of images)
if isinstance(img, list):
for sub_idx, sub_img in enumerate(img):
save_path = output_path.parent / f"{stem}_{idx}_{sub_idx}{suffix}"
sub_img.save(save_path)
print(f"Saved edited image to {os.path.abspath(save_path)}")
else:
save_path = output_path.parent / f"{stem}_{idx}{suffix}"
img.save(save_path)
print(f"Saved edited image to {os.path.abspath(save_path)}")
if __name__ == "__main__":
main()
# Image-To-Image
This example edits an input image with `Qwen/Qwen-Image-Edit` using the `image_edit.py` CLI.
## Local CLI Usage
### Single Image Editing
Download the example image:
```bash
wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png
```
Then run:
```bash
python image_edit.py \
--image qwen-bear.png \
--prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0
```
### Multiple Image Editing (Qwen-Image-Edit-2509)
For multiple image inputs, use `Qwen/Qwen-Image-Edit-2509` or `Qwen/Qwen-Image-Edit-2511`:
```bash
python image_edit.py \
--model Qwen/Qwen-Image-Edit-2509 \
--image img1.png img2.png \
--prompt "Combine these images into a single scene" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
```
Key arguments:
- `--model`: model name or path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple image support.
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--output`: path to save the generated PNG.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
The following example runs `Qwen/Qwen-Image-Edit-2511` with cache-dit acceleration enabled:
```bash
python image_edit.py \
    --model Qwen/Qwen-Image-Edit-2511 \
    --image qwen-bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. Position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0 \
    --cache_backend cache_dit
```
# Image-To-Video
This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.
## Local CLI Usage
### Wan2.2-I2V-A14B-Diffusers (MoE)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 5.0 \
--guidance_scale_high 6.0 \
--num_inference_steps 40 \
--boundary_ratio 0.875 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
### Wan2.2-TI2V-5B-Diffusers (Unified)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 4.0 \
--num_inference_steps 40 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
Key arguments:
- `--model`: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
- `--image`: Path to input image (required).
- `--prompt`: Text description of desired motion/animation.
- `--height/--width`: Output resolution (auto-calculated from the input image if not set). Dimensions should be multiples of 16 (see the sketch after this list).
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
- `--fps`: Frames per second for the saved MP4 (requires `diffusers` export_to_video).
- `--output`: Path to save the generated video.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
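Since the output dimensions should be multiples of 16, a small helper like the one below (a hedged sketch, not part of `image_to_video.py`) can derive `--height`/`--width` from your input image while roughly preserving its aspect ratio:
```python
from PIL import Image

def snap_resolution(path: str, target_area: int = 480 * 832) -> tuple[int, int]:
    """Scale the input image to roughly `target_area` pixels and round each
    side down to a multiple of 16."""
    w, h = Image.open(path).size
    scale = (target_area / (w * h)) ** 0.5
    width = max(16, int(w * scale) // 16 * 16)
    height = max(16, int(h * scale) // 16 * 16)
    return height, width

height, width = snap_resolution("input.png")
print(f"--height {height} --width {width}")
```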