# Image-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_image>.
This example edits an input image with `Qwen/Qwen-Image-Edit` using the `image_edit.py` CLI.
## Local CLI Usage
### Single Image Editing
Download the example image:
```bash
wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png
```
Then run:
```bash
python image_edit.py \
--image qwen-bear.png \
--prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0
```
### Multiple Image Editing (Qwen-Image-Edit-2509)
For multiple image inputs, use `Qwen/Qwen-Image-Edit-2509` or `Qwen/Qwen-Image-Edit-2511`:
```bash
python image_edit.py \
--model Qwen/Qwen-Image-Edit-2509 \
--image img1.png img2.png \
--prompt "Combine these images into a single scene" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
```
Key arguments:
- `--model`: model name or path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple image support.
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--output`: path to save the generated PNG.
## Example materials
??? abstract "image_edit.py"
``````py
--8<-- "examples/offline_inference/image_to_image/image_edit.py"
``````
??? abstract "run_qwen_image_edit_2511.sh"
``````sh
--8<-- "examples/offline_inference/image_to_image/run_qwen_image_edit_2511.sh"
``````
# Image-To-Video
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video>.
This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.
## Local CLI Usage
### Wan2.2-I2V-A14B-Diffusers (MoE)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 5.0 \
--guidance_scale_high 6.0 \
--num_inference_steps 40 \
--boundary_ratio 0.875 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
### Wan2.2-TI2V-5B-Diffusers (Unified)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 4.0 \
--num_inference_steps 40 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
Key arguments:
- `--model`: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
- `--image`: Path to input image (required).
- `--prompt`: Text description of desired motion/animation.
- `--height/--width`: Output resolution (auto-calculated from image if not set). Dimensions should be multiples of 16.
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
- `--fps`: Frames per second for the saved MP4 (uses the `export_to_video` helper from `diffusers`; see the sketch after this list).
- `--output`: Path to save the generated video.
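The video export itself relies on `diffusers`' `export_to_video` helper. As a minimal sketch (the example script already handles this internally; `frames` stands in for whatever list of PIL frames your pipeline run produces):
```python
# Minimal sketch; image_to_video.py already does this internally. `frames` is
# assumed to be a list of PIL.Image frames produced by the pipeline run.
from diffusers.utils import export_to_video

def save_frames_as_mp4(frames, path="i2v_output.mp4", fps=16):
    # export_to_video assembles the frame list into an MP4 at the given frame rate.
    export_to_video(frames, path, fps=fps)
```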
## Example materials
??? abstract "image_to_video.py"
``````py
--8<-- "examples/offline_inference/image_to_video/image_to_video.py"
``````
# LoRA-Inference
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.
This contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The example uses the `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models in vLLM-omni.
## Overview
Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:
- **Pre-loaded LoRA**: Loaded at initialization via `--lora-path` (pre-loaded into cache)
- **Per-request LoRA**: Loaded on-demand. In the example, the LoRA is loaded via `--lora-request-path` in each request
Both approaches use the same underlying mechanism - all LoRA adapters are handled uniformly through `set_active_adapter()`. If no LoRA request is provided in a request, all adapters are deactivated.
## Usage
### Pre-loaded LoRA (via --lora-path)
Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests:
```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_preloaded.png
```
**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
### Per-request LoRA (via --lora-request-path)
Load a LoRA adapter on-demand for each request:
```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--lora-request-path /path/to/lora/ \
--lora-scale 1.0 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_per_request.png
```
### No LoRA
If no LoRA request is provided, we will use the base model without any LoRA adapters:
```bash
python -m examples.offline_inference.lora_inference.lora_inference \
--prompt "A piece of cheesecake" \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output output_no_lora.png
```
## Parameters
### LoRA Parameters
- `--lora-path`: Path to LoRA adapter folder to pre-load at initialization (loads into cache with a stable ID derived from the path)
- `--lora-request-path`: Path to LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, will derive a stable ID from the path.
- `--lora-scale`: Scale factor for LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.
### Standard Parameters
- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
- `--output`: Output file path (default: `lora_output.png`)
## How LoRA Works
All LoRA adapters are handled uniformly:
1. **Initialization**: If `--lora-path` is provided, the adapter is loaded into cache with a stable ID derived from the adapter path
2. **Per-request**: If `--lora-request-path` is provided, the adapter is loaded/activated for that request
3. **No LoRA**: If no LoRA request is provided (`req.lora_request` is None), all adapters are deactivated
The system uses LRU cache management - adapters are cached and evicted when the cache is full (unless pinned).
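To make the eviction behavior concrete, here is a purely conceptual sketch of an LRU cache with pinning (not vLLM-Omni's actual implementation):
```python
from collections import OrderedDict

# Conceptual sketch only -- not vLLM-Omni's implementation. It illustrates the
# described policy: entries are evicted least-recently-used when the cache is
# full, except for pinned entries, which are never evicted.
class LoRACacheSketch:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # adapter_id -> adapter, oldest first
        self.pinned = set()

    def get(self, adapter_id):
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)  # mark as most recently used
        return self.entries.get(adapter_id)

    def put(self, adapter_id, adapter, pin=False):
        self.entries[adapter_id] = adapter
        self.entries.move_to_end(adapter_id)
        if pin:
            self.pinned.add(adapter_id)
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:  # everything is pinned, nothing can be evicted
                break
            del self.entries[victim]
```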
## LoRA Adapter Format
LoRA adapters must be in PEFT (Parameter-Efficient Fine-Tuning) format. A typical LoRA adapter directory structure:
```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
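Before pointing `--lora-path` or `--lora-request-path` at a folder, you can sanity-check that it matches this layout with a small standard-library snippet (an illustrative sketch, not part of the example code):
```python
import json
from pathlib import Path

def looks_like_peft_adapter(adapter_dir):
    """Rough check that a folder matches the PEFT adapter layout above."""
    root = Path(adapter_dir)
    config = root / "adapter_config.json"
    weights = root / "adapter_model.safetensors"
    if not (config.is_file() and weights.is_file()):
        return False
    # adapter_config.json normally records the LoRA settings (e.g. "peft_type", rank "r").
    cfg = json.loads(config.read_text())
    return cfg.get("peft_type", "").upper() == "LORA" or "r" in cfg

print(looks_like_peft_adapter("lora_adapter/"))
```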
## Example materials
??? abstract "lora_inference.py"
``````py
--8<-- "examples/offline_inference/lora_inference/lora_inference.py"
``````
# Qwen2.5-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen2_5_omni>.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Get into the example folder
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Get into the example folder
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
### Modality control
If you want to control output modalities, e.g. only output text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type mixed_modalities \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via CLI arguments:
```bash
# Use single local media files
python end2end.py --query-type use_image --image-path /path/to/image.jpg
python end2end.py --query-type use_video --video-path /path/to/video.mp4
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav
# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
--video-path /path/to/video.mp4 \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav
# Use audio from video file
python end2end.py --query-type use_audio_in_video --video-path /path/to/video.mp4
```
If media file paths are not provided, the script will use default assets. Supported query types:
- `use_image`: Image input only
- `use_video`: Video input only
- `use_audio`: Audio input only
- `mixed_modalities`: Audio + image + video
- `use_audio_in_video`: Extract audio from video
- `text`: Text-only query
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Example materials
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen2_5_omni/end2end.py"
``````
??? abstract "extract_prompts.py"
``````py
--8<-- "examples/offline_inference/qwen2_5_omni/extract_prompts.py"
``````
??? abstract "run_multiple_prompts.sh"
``````sh
--8<-- "examples/offline_inference/qwen2_5_omni/run_multiple_prompts.sh"
``````
??? abstract "run_single_prompt.sh"
``````sh
--8<-- "examples/offline_inference/qwen2_5_omni/run_single_prompt.sh"
``````
# Qwen3-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_omni>.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Get into the example folder
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Get into the example folder
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
If you do not have enough memory, you can run the thinker with tensor parallelism. Just run the command below.
```bash
bash run_single_prompt_tp.sh
```
### Modality control
If you want to control output modalities, e.g. only output text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type use_audio \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via command-line arguments:
```bash
# Use local video file
python end2end.py --query-type use_video --video-path /path/to/video.mp4
# Use local image file
python end2end.py --query-type use_image --image-path /path/to/image.jpg
# Use local audio file
python end2end.py --query-type use_audio --audio-path /path/to/audio.wav
# Combine multiple local media files
python end2end.py --query-type mixed_modalities \
--video-path /path/to/video.mp4 \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav
```
If media file paths are not provided, the script will use default assets. Supported query types:
- `use_video`: Video input
- `use_image`: Image input
- `use_audio`: Audio input
- `text`: Text-only query
- `multi_audios`: Multiple audio inputs
- `mixed_modalities`: Combination of video, image, and audio inputs
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Example materials
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen3_omni/end2end.py"
``````
??? abstract "run_multiple_prompts.sh"
``````sh
--8<-- "examples/offline_inference/qwen3_omni/run_multiple_prompts.sh"
``````
??? abstract "run_single_prompt.sh"
``````sh
--8<-- "examples/offline_inference/qwen3_omni/run_single_prompt.sh"
``````
??? abstract "run_single_prompt_tp.sh"
``````sh
--8<-- "examples/offline_inference/qwen3_omni/run_single_prompt_tp.sh"
``````
??? abstract "text_prompts_10.txt"
``````txt
--8<-- "examples/offline_inference/qwen3_omni/text_prompts_10.txt"
``````
# Qwen3-TTS Offline Inference
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_tts>.
This directory contains an offline demo for running Qwen3 TTS models with vLLM Omni. It builds task-specific inputs and generates WAV files locally.
## Model Overview
Qwen3 TTS provides multiple task variants for speech generation:
- **CustomVoice**: Generate speech with a known speaker identity (speaker ID) and optional instruction.
- **VoiceDesign**: Generate speech from text plus a descriptive instruction that designs a new voice.
- **Base**: Voice cloning using a reference audio + reference transcript, with optional mode selection.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Quick Start
Run a single sample for a task:
```
python end2end.py --query-type CustomVoice
```
Generated audio files are saved to `output_audio/` by default.
## Task Usage
### CustomVoice
Single sample:
```
python end2end.py --query-type CustomVoice
```
Batch sample (multiple prompts in one run):
```
python end2end.py --query-type CustomVoice --use-batch-sample
```
### VoiceDesign
Single sample:
```
python end2end.py --query-type VoiceDesign
```
Batch sample:
```
python end2end.py --query-type VoiceDesign --use-batch-sample
```
### Base (Voice Clone)
Single sample:
```
python end2end.py --query-type Base
```
Batch sample:
```
python end2end.py --query-type Base --use-batch-sample
```
Mode selection for Base:
- `--mode-tag icl` (default): standard mode
- `--mode-tag xvec_only`: enable `x_vector_only_mode` in the request
Examples:
```
python end2end.py --query-type Base --mode-tag icl
```
## Notes
- The script uses the model paths embedded in `end2end.py`. Update them if your local cache path differs.
- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.
## Example materials
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen3_tts/end2end.py"
``````
# Text-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_image>.
This folder provides several entrypoints for experimenting with `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512`, and `Tongyi-MAI/Z-Image-Turbo` using vLLM-Omni:
- `text_to_image.py`: command-line script for single image generation with advanced options.
- `gradio_demo.py`: lightweight Gradio UI for interactive prompt/seed/CFG exploration.
Note that when you pass in multiple independent prompts, they will be processed sequentially. Batching requests is currently not supported.
## Basic Usage
```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    prompt = "a cup of coffee on the table"
    outputs = omni.generate(prompt)
    images = outputs[0].request_output[0].images
    images[0].save("coffee.png")
```
Or put more than one prompt in a request.
```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    outputs = omni.generate(prompts)
    for i, output in enumerate(outputs):
        output.request_output[0].images[0].save(f"{i}.jpg")
```
!!! info
However, it is not currently recommended to do so
because not all models support batch inference,
and batch requesting mostly does not provide significant performance improvement (despite the impression that it does).
This feature is primarily for the sake of interface compatibility with vLLM and to allow for future improvements.
!!! info
For diffusion pipelines, the stage config field `stage_args.[].runtime.max_batch_size` is 1 by default, and the input
list is sliced into single-item requests before feeding into the diffusion pipeline. For models that do internally support
batched inputs, you can [modify this configuration](../../../configuration/stage_configs.md) to let the model accept a longer batch of prompts.
Apart from string prompts, vLLM-Omni also supports dictionary prompts in the same style as vLLM.
This is useful for models that support negative prompts.
```python
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Qwen/Qwen-Image")
    outputs = omni.generate([
        {
            "prompt": "a cup of coffee on a table",
            "negative_prompt": "low resolution",
        },
        {
            "prompt": "a toy dinosaur on a sandy beach",
            "negative_prompt": "cinematic, realistic",
        },
    ])
    for i, output in enumerate(outputs):
        output.request_output[0].images[0].save(f"{i}.jpg")
```
## Local CLI Usage
```bash
python text_to_image.py \
--model Tongyi-MAI/Z-Image-Turbo \
--prompt "a cup of coffee on the table" \
--seed 42 \
--cfg_scale 4.0 \
--num_images_per_prompt 1 \
--num_inference_steps 50 \
--height 1024 \
--width 1024 \
--output outputs/coffee.png
```
Key arguments:
- `--prompt`: text description (string).
- `--seed`: integer seed for deterministic sampling.
- `--cfg_scale`: true CFG scale (model-specific guidance strength).
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--num_images_per_prompt`: number of images to generate per prompt (saves as `output`, `output_1`, ...).
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--height/--width`: output resolution (defaults 1024x1024).
- `--output`: path to save the generated PNG.
> ℹ️ Qwen-Image currently publishes best-effort presets at `1328x1328`, `1664x928`, `928x1664`, `1472x1140`, `1140x1472`, `1584x1056`, and `1056x1584`. Adjust `--height/--width` accordingly for the most reliable outcomes.
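If you are starting from an arbitrary target size, a small helper can pick the closest preset for you (illustrative only, not part of `text_to_image.py`; the presets are assumed to be listed as width x height):
```python
# Illustrative helper (not part of text_to_image.py). The presets are assumed
# to be listed as width x height; pass the chosen pair via --width/--height.
PRESETS_WXH = [
    (1328, 1328), (1664, 928), (928, 1664),
    (1472, 1140), (1140, 1472), (1584, 1056), (1056, 1584),
]

def closest_preset(target_w, target_h):
    target_ratio = target_w / target_h
    return min(PRESETS_WXH, key=lambda wh: abs(wh[0] / wh[1] - target_ratio))

w, h = closest_preset(1920, 1080)  # roughly 16:9 -> picks 1664x928
print(f"--width {w} --height {h}")
```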
## Web UI Demo
Launch the gradio demo:
```bash
python gradio_demo.py --port 7862
```
Then open `http://localhost:7862/` on your local browser to interact with the web UI.
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/offline_inference/text_to_image/gradio_demo.py"
``````
??? abstract "text_to_image.py"
``````py
--8<-- "examples/offline_inference/text_to_image/text_to_image.py"
``````
# Text-To-Video
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_video>.
The `Wan-AI/Wan2.2-T2V-A14B-Diffusers` pipeline generates short videos from text prompts.
## Local CLI Usage
```bash
python text_to_video.py \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 640 \
--num_frames 32 \
--guidance_scale 4.0 \
--guidance_scale_high 3.0 \
--num_inference_steps 40 \
--fps 16 \
--output t2v_out.mp4
```
Key arguments:
- `--prompt`: text description (string).
- `--height/--width`: output resolution (defaults 720x1280). Dimensions should align with Wan VAE downsampling (multiples of 8).
- `--num_frames`: Number of frames (Wan default is 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to the low/high-noise stages).
- `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
- `--cfg_parallel_size`: the number of devices used for CFG parallelism. CFG parallelism takes effect only when classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
- `--boundary_ratio`: Boundary split ratio for low/high DiT.
- `--fps`: frames per second for the saved MP4 (requires `diffusers` export_to_video).
- `--output`: path to save the generated video.
## Example materials
??? abstract "text_to_video.py"
``````py
--8<-- "examples/offline_inference/text_to_video/text_to_video.py"
``````
# BAGEL-7B-MoT
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/bagel>.
## 🛠️ Installation
Please refer to [README.md](../../../README.md)
## Run examples (BAGEL-7B-MoT)
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
### Launch the Server
```bash
# Use default configuration
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091
```
Or use the convenience script:
```bash
cd /workspace/vllm-omni/examples/online_serving/bagel
bash run_server.sh
```
If you have a custom stage configs file, launch the server with the command below:
```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
### Send Multi-modal Request
Get into the bagel folder:
```bash
cd examples/online_serving/bagel
```
Send request via Python
```bash
python openai_chat_client.py --prompt "A cute cat" --modality text2img
```
The Python client supports the following command-line arguments:
- `--prompt` (or `-p`): Text prompt for generation (default: `A cute cat`)
- `--output` (or `-o`): Output file path for image results (default: `bagel_output.png`)
- `--server` (or `-s`): Server URL (default: `http://localhost:8091`)
- `--image-url` (or `-i`): Input image URL or local file path (for img2img/img2text modes)
- `--modality` (or `-m`): Task modality (default: `text2img`). Options: `text2img`, `img2img`, `img2text`, `text2text`
- `--height`: Image height in pixels (default: 512)
- `--width`: Image width in pixels (default: 512)
- `--steps`: Number of inference steps (default: 25)
- `--seed`: Random seed (default: 42)
- `--negative`: Negative prompt for image generation
Example with custom parameters:
```bash
python openai_chat_client.py \
--prompt "A futuristic city" \
--modality text2img \
--height 768 \
--width 768 \
--steps 50 \
--seed 42 \
--negative "blurry, low quality"
```
## Modality Control
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
| Modality | Input | Output | Description |
| ----------- | ------------ | ------ | -------------------------------------- |
| `text2img` | Text | Image | Generate images from text prompts |
| `img2img` | Image + Text | Image | Transform images using text guidance |
| `img2text` | Image + Text | Text | Generate text descriptions from images |
| `text2text` | Text | Text | Pure text generation |
### Text to Image (text2img)
Generate images from text prompts:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "A beautiful sunset over mountains" \
--modality text2img \
--output sunset.png \
--steps 50
```
**Using curl**
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>A beautiful sunset over mountains<|im_end|>"}]}],
"modalities": ["image"],
"height": 512,
"width": 512,
"num_inference_steps": 50,
"seed": 42
}'
```
### Image to Image (img2img)
Transform images based on text prompts:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "Make the cat stand up" \
--modality img2img \
--image-url /path/to/input.jpg \
--output transformed.png
```
**Using curl**
```bash
IMAGE_BASE64=$(base64 -w 0 cat.jpg)
cat <<EOF > payload.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "<|im_start|>Make the cat stand up<|im_end|>"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
]
}],
"modalities": ["image"],
"height": 512,
"width": 512,
"num_inference_steps": 50,
"seed": 42
}
EOF
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d @payload.json
```
### Image to Text (img2text)
Generate text descriptions from images:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "Describe this image in detail" \
--modality img2text \
--image-url /path/to/image.jpg
```
**Using curl**
```bash
IMAGE_BASE64=$(base64 -w 0 cat.jpg)
cat <<EOF > payload.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "<|im_start|>user\n<|image_pad|>\nDescribe this image in detail<|im_end|>\n<|im_start|>assistant\n"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_BASE64}"}}
]
}],
"modalities": ["text"]
}
EOF
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d @payload.json
```
### Text to Text (text2text)
Pure text generation:
**Using Python client**
```bash
python openai_chat_client.py \
--prompt "What is the capital of France?" \
--modality text2text
```
**Using curl**
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}]
"modalities": ["text"]
}'
```
## FAQ
- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs or you encounter an OOM error, you can try decreasing `max_model_len`.
| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |
# Image-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/image_to_image>.
This example demonstrates how to deploy Qwen-Image-Edit model for online image editing service using vLLM-Omni.
For **multi-image** input editing, use **Qwen-Image-Edit-2509** (QwenImageEditPlusPipeline) and send multiple images in the user message content.
## Start Server
### Basic Start
```bash
vllm serve Qwen/Qwen-Image-Edit --omni --port 8092
```
### Multi-Image Edit (Qwen-Image-Edit-2509)
```bash
vllm serve Qwen/Qwen-Image-Edit-2509 --omni --port 8092
```
### Start with Parameters
Or use the startup script:
```bash
bash run_server.sh
```
To serve Qwen-Image-Edit-2509 with the script:
```bash
MODEL=Qwen/Qwen-Image-Edit-2509 bash run_server.sh
```
## API Calls
### Method 1: Using curl (Image Editing)
```bash
# Image editing
bash run_curl_image_edit.sh input.png "Convert this image to watercolor style"
# Or execute directly
IMG_B64=$(base64 -w0 input.png)
cat <<EOF > request.json
{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Convert this image to watercolor style"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,$IMG_B64"}}
]
}],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"guidance_scale": 1,
"seed": 42
}
}
EOF
curl -s http://localhost:8092/v1/chat/completions -H "Content-Type: application/json" -d @request.json | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2 | base64 -d > output.png
```
### Method 2: Using Python Client
```bash
python openai_chat_client.py --input input.png --prompt "Convert to oil painting style" --output output.png
# Multi-image editing (Qwen-Image-Edit-2509 server required)
python openai_chat_client.py --input input1.png input2.png --prompt "Combine these images into a single scene" --output output.png
```
### Method 3: Using Gradio Demo
```bash
python gradio_demo.py
# Visit http://localhost:7861
```
## Request Format
### Image Editing (Using image_url Format)
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Convert this image to watercolor style"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}
]
}
```
### Image Editing (Using Simplified image Format)
```json
{
"messages": [
{
"role": "user",
"content": [
{"text": "Convert this image to watercolor style"},
{"image": "BASE64_IMAGE_DATA"}
]
}
]
}
```
### Image Editing with Parameters
Use `extra_body` to pass generation parameters:
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Convert to ink wash painting style"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}
],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"guidance_scale": 7.5,
"seed": 42
}
}
```
### Multi-Image Editing (Qwen-Image-Edit-2509)
Provide multiple images in `content` (order matters):
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Combine these images into a single scene"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} },
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} }
]
}
]
}
```
## Generation Parameters (extra_body)
| Parameter | Type | Default | Description |
| ------------------------ | ----- | ------- | ------------------------------------- |
| `height` | int | None | Output image height in pixels |
| `width` | int | None | Output image width in pixels |
| `size` | str | None | Output image size (e.g., "1024x1024") |
| `num_inference_steps` | int | 50 | Number of denoising steps |
| `guidance_scale` | float | 7.5 | CFG guidance scale |
| `seed` | int | None | Random seed (reproducible) |
| `negative_prompt` | str | None | Negative prompt |
| `num_outputs_per_prompt` | int | 1 | Number of images to generate |
## Response Format
```json
{
"id": "chatcmpl-xxx",
"created": 1234567890,
"model": "Qwen/Qwen-Image-Edit",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": [{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,..."
}
}]
},
"finish_reason": "stop"
}],
"usage": {...}
}
```
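The same request and decoding that the curl pipeline above performs with `jq` and `base64` can be done from Python with `requests` (a minimal sketch against the documented request/response format; adjust the host, port, and file names to your setup):
```python
import base64
import requests

# Minimal sketch: send one image-edit request and save the returned image.
# Mirrors the documented request/response format; adjust paths and port as needed.
with open("input.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this image to watercolor style"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
    "extra_body": {"num_inference_steps": 50, "seed": 42},
}

resp = requests.post("http://localhost:8092/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
# The edited image comes back as a data URL: strip the "data:image/png;base64," prefix.
data_url = resp.json()["choices"][0]["message"]["content"][0]["image_url"]["url"]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(data_url.split(",", 1)[1]))
```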
## Common Editing Instructions Examples
| Instruction | Description |
| ---------------------------------------- | ---------------- |
| `Convert this image to watercolor style` | Style transfer |
| `Convert the image to black and white` | Desaturation |
| `Enhance the color saturation` | Color adjustment |
| `Convert to cartoon style` | Cartoonization |
| `Add vintage filter effect` | Filter effect |
| `Convert daytime scene to nighttime` | Scene conversion |
## File Description
| File | Description |
| ------------------------ | ---------------------------- |
| `run_server.sh` | Server startup script |
| `run_curl_image_edit.sh` | curl image editing example |
| `openai_chat_client.py` | Python client |
| `gradio_demo.py` | Gradio interactive interface |
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/image_to_image/gradio_demo.py"
``````
??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/image_to_image/openai_chat_client.py"
``````
??? abstract "run_curl_image_edit.sh"
``````sh
--8<-- "examples/online_serving/image_to_image/run_curl_image_edit.sh"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/image_to_image/run_server.sh"
``````
# LoRA-Inference
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/lora_inference>.
This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.
> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
> Note: This example uses `/v1/chat/completions`. LoRA payloads for other OpenAI endpoints are not implemented here.
## Start Server
```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image
bash run_server.sh
```
## Call API (curl)
```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter
# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123
bash run_curl_lora_inference.sh
```
## Call API (Python)
```bash
python openai_chat_client.py \
--prompt "A piece of cheesecake" \
--lora-path /path/to/lora_adapter \
--lora-name my_lora \
--lora-scale 1.0 \
--output output.png
```
## LoRA Format
LoRA adapters should be in PEFT format, for example:
```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/lora_inference/openai_chat_client.py"
``````
??? abstract "run_curl_lora_inference.sh"
``````sh
--8<-- "examples/online_serving/lora_inference/run_curl_lora_inference.sh"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/lora_inference/run_server.sh"
``````
# Qwen2.5-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen2_5_omni>.
## 🛠️ Installation
Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md)
## Run examples (Qwen2.5-Omni)
### Launch the Server
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
```
If you have a custom stage configs file, launch the server with the command below:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
### Send Multi-modal Request
Get into the example folder
```bash
cd examples/online_serving/qwen2_5_omni
```
#### Send request via python
```bash
python openai_chat_completion_client_for_multimodal_generation.py --query-type mixed_modalities
```
The Python client supports the following command-line arguments:
- `--query-type` (or `-q`): Query type (default: `mixed_modalities`). Options: `mixed_modalities`, `use_audio_in_video`, `multi_audios`, `text`
- `--video-path` (or `-v`): Path to local video file or URL. If not provided and query-type uses video, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): Path to local image file or URL. If not provided and query-type uses image, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to local audio file or URL. If not provided and query-type uses audio, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
For example, to use mixed modalities with all local files:
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type mixed_modalities \
--video-path /path/to/your/video.mp4 \
--image-path /path/to/your/image.jpg \
--audio-path /path/to/your/audio.wav \
--prompt "Analyze all the media content and provide a comprehensive summary."
```
#### Send request via curl
```bash
bash run_curl_multimodal_generation.sh mixed_modalities
```
## Modality control
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["audio"]
}'
```
### Using Python client
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type mixed_modalities \
--modalities text
```
### Using OpenAI Python SDK
#### Text only
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Omni-7B",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text"]
)
print(response.choices[0].message.content)
```
#### Text + Audio
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Omni-7B",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content) # Text response
print(response.choices[1].message.audio) # Audio response
```
## Streaming Output
If you want to enable streaming output, set the argument as shown below. Each final output is returned as soon as the corresponding stage generates it. Currently only text supports streaming output; other modalities are returned normally.
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type mixed_modalities \
--stream
```
## Run Local Web UI Demo
This Web UI demo allows users to interact with the model through a web browser.
### Running Gradio Demo
The Gradio demo connects to a vLLM API server. You have two options:
#### Option 1: One-step Launch Script (Recommended)
The convenience script launches both the vLLM server and Gradio demo together:
```bash
./run_gradio_demo.sh --model Qwen/Qwen2.5-Omni-7B --server-port 8091 --gradio-port 7861
```
This script will:
1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C
The script supports the following arguments:
- `--model`: Model name/path (default: Qwen/Qwen2.5-Omni-7B)
- `--server-port`: Port for vLLM server (default: 8091)
- `--gradio-port`: Port for Gradio demo (default: 7861)
- `--stage-configs-path`: Path to custom stage configs YAML file (optional)
- `--server-host`: Host for vLLM server (default: 0.0.0.0)
- `--gradio-ip`: IP for Gradio demo (default: 127.0.0.1)
- `--share`: Share Gradio demo publicly (creates a public link)
#### Option 2: Manual Launch (Two-Step Process)
**Step 1: Launch the vLLM API server**
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
```
If you have a custom stage configs file:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
**Step 2: Run the Gradio demo**
In a separate terminal:
```bash
python gradio_demo.py --model Qwen/Qwen2.5-Omni-7B --api-base http://localhost:8091/v1 --port 7861
```
Then open `http://localhost:7861/` on your local browser to interact with the web UI.
The gradio script supports the following arguments:
- `--model`: Model name/path (should match the server model)
- `--api-base`: Base URL for the vLLM API server (default: http://localhost:8091/v1)
- `--ip`: Host/IP for Gradio server (default: 127.0.0.1)
- `--port`: Port for Gradio server (default: 7861)
- `--share`: Share the Gradio demo publicly (creates a public link)
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/qwen2_5_omni/gradio_demo.py"
``````
??? abstract "openai_chat_completion_client_for_multimodal_generation.py"
``````py
--8<-- "examples/online_serving/qwen2_5_omni/openai_chat_completion_client_for_multimodal_generation.py"
``````
??? abstract "run_curl_multimodal_generation.sh"
``````sh
--8<-- "examples/online_serving/qwen2_5_omni/run_curl_multimodal_generation.sh"
``````
??? abstract "run_gradio_demo.sh"
``````sh
--8<-- "examples/online_serving/qwen2_5_omni/run_gradio_demo.sh"
``````
# Qwen3-Omni
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen3_omni>.
## 🛠️ Installation
Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md)
## Run examples (Qwen3-Omni)
### Launch the Server
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
```
If you want to enable async chunking for Qwen3-Omni, launch the server with the command below:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml
```
If you have a custom stage configs file, launch the server with the command below:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
### Send Multi-modal Request
Get into the example folder
```bash
cd examples/online_serving/qwen3_omni
```
#### Send request via python
```bash
python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image
```
The Python client supports the following command-line arguments:
- `--query-type` (or `-q`): Query type (default: `use_video`). Options: `text`, `use_audio`, `use_image`, `use_video`
- `--model` (or `-m`): Model name/path (default: `Qwen/Qwen3-Omni-30B-A3B-Instruct`)
- `--video-path` (or `-v`): Path to local video file or URL. If not provided and query-type is `use_video`, uses default video URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): Path to local image file or URL. If not provided and query-type is `use_image`, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to local audio file or URL. If not provided and query-type is `use_audio`, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
For example, to use a local video file with custom prompt:
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_video \
--video-path /path/to/your/video.mp4 \
--prompt "What are the main activities shown in this video?"
```
#### Send request via curl
```bash
bash run_curl_multimodal_generation.sh use_image
```
### FAQ
If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
## Modality control
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["audio"]
}'
```
### Using Python client
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--modalities text
```
### Using OpenAI Python SDK
#### Text only
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["text"]
)
print(response.choices[0].message.content)
```
#### Text + Audio
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Describe vLLM in brief."}],
modalities=["audio"]
)
# Response contains two choices: one with text, one with audio
print(response.choices[0].message.content) # Text response
print(response.choices[1].message.audio) # Audio response
```
## Streaming Output
If you want to enable streaming output, set the argument as shown below. Each final output is returned as soon as the corresponding stage generates it. Currently only text supports streaming output; other modalities are returned normally.
```bash
python openai_chat_completion_client_for_multimodal_generation.py \
--query-type use_image \
--stream
```
## Run Local Web UI Demo
This Web UI demo allows users to interact with the model through a web browser.
### Running Gradio Demo
The Gradio demo connects to a vLLM API server. You have two options:
#### Option 1: One-step Launch Script (Recommended)
The convenience script launches both the vLLM server and Gradio demo together:
```bash
./run_gradio_demo.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct --server-port 8091 --gradio-port 7861
```
This script will:
1. Start the vLLM server in the background
2. Wait for the server to be ready
3. Launch the Gradio demo
4. Handle cleanup when you press Ctrl+C
The script supports the following arguments:
- `--model`: Model name/path (default: Qwen/Qwen3-Omni-30B-A3B-Instruct)
- `--server-port`: Port for vLLM server (default: 8091)
- `--gradio-port`: Port for Gradio demo (default: 7861)
- `--stage-configs-path`: Path to custom stage configs YAML file (optional)
- `--server-host`: Host for vLLM server (default: 0.0.0.0)
- `--gradio-ip`: IP for Gradio demo (default: 127.0.0.1)
- `--share`: Share Gradio demo publicly (creates a public link)
#### Option 2: Manual Launch (Two-Step Process)
**Step 1: Launch the vLLM API server**
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
```
If you have a custom stage configs file:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
**Step 2: Run the Gradio demo**
In a separate terminal:
```bash
python gradio_demo.py --model Qwen/Qwen3-Omni-30B-A3B-Instruct --api-base http://localhost:8091/v1 --port 7861
```
Then open `http://localhost:7861/` on your local browser to interact with the web UI.
The gradio script supports the following arguments:
- `--model`: Model name/path (should match the server model)
- `--api-base`: Base URL for the vLLM API server (default: http://localhost:8091/v1)
- `--ip`: Host/IP for Gradio server (default: 127.0.0.1)
- `--port`: Port for Gradio server (default: 7861)
- `--share`: Share the Gradio demo publicly (creates a public link)
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/qwen3_omni/gradio_demo.py"
``````
??? abstract "openai_chat_completion_client_for_multimodal_generation.py"
``````py
--8<-- "examples/online_serving/qwen3_omni/openai_chat_completion_client_for_multimodal_generation.py"
``````
??? abstract "qwen3_omni_moe_thinking.yaml"
``````yaml
--8<-- "examples/online_serving/qwen3_omni/qwen3_omni_moe_thinking.yaml"
``````
??? abstract "run_curl_multimodal_generation.sh"
``````sh
--8<-- "examples/online_serving/qwen3_omni/run_curl_multimodal_generation.sh"
``````
??? abstract "run_gradio_demo.sh"
``````sh
--8<-- "examples/online_serving/qwen3_omni/run_gradio_demo.sh"
``````
# Text-To-Image
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_image>.
This example demonstrates how to deploy Qwen-Image model for online image generation service using vLLM-Omni.
## Start Server
### Basic Start
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091
```
!!! note
If you encounter out-of-memory (OOM) issues or have limited GPU memory, you can enable VAE slicing and tiling to reduce memory usage by passing `--vae-use-slicing --vae-use-tiling`.
### Start with Parameters
Or use the startup script:
```bash
bash run_server.sh
```
## API Calls
### Method 1: Using curl
```bash
# Basic text-to-image generation
bash run_curl_text_to_image.sh
# Or execute directly
curl -s http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"true_cfg_scale": 4.0,
"seed": 42
}
}' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png
```
### Method 2: Using Python Client
```bash
python openai_chat_client.py --prompt "A beautiful landscape painting" --output output.png
```
### Method 3: Using Gradio Demo
```bash
python gradio_demo.py
# Visit http://localhost:7860
```
## Request Format
### Simple Text Generation
```json
{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
]
}
```
### Generation with Parameters
Use `extra_body` to pass generation parameters:
```json
{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
],
"extra_body": {
"height": 1024,
"width": 1024,
"num_inference_steps": 50,
"true_cfg_scale": 4.0,
"seed": 42
}
}
```
### Multimodal Input (Text + Structured Content)
```json
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "A beautiful landscape painting"}
]
}
]
}
```
## Generation Parameters (extra_body)
| Parameter | Type | Default | Description |
| ------------------------ | ----- | ------- | ------------------------------ |
| `height` | int | None | Image height in pixels |
| `width` | int | None | Image width in pixels |
| `size` | str | None | Image size (e.g., "1024x1024") |
| `num_inference_steps` | int | 50 | Number of denoising steps |
| `true_cfg_scale` | float | 4.0 | Qwen-Image CFG scale |
| `seed` | int | None | Random seed (reproducible) |
| `negative_prompt` | str | None | Negative prompt |
| `num_outputs_per_prompt` | int | 1 | Number of images to generate |
| `--cfg-parallel-size` | int | 1 | Number of GPUs for CFG parallelism |
## Response Format
```json
{
"id": "chatcmpl-xxx",
"created": 1234567890,
"model": "Qwen/Qwen-Image",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": [{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,..."
}
}]
},
"finish_reason": "stop"
}],
"usage": {...}
}
```
## Extract Image
```bash
# Extract base64 from response and decode to image
cat response.json | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png
```
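The same extraction can be done from Python with `requests` (a minimal sketch against the documented request and response formats; adjust the host and port to your deployment):
```python
import base64
import requests

# Minimal sketch of the documented request/response shapes using the
# generation parameters above; adjust host, port, and values for your setup.
payload = {
    "messages": [{"role": "user", "content": "A beautiful landscape painting"}],
    "extra_body": {
        "height": 1024,
        "width": 1024,
        "num_inference_steps": 50,
        "true_cfg_scale": 4.0,
        "seed": 42,
    },
}

resp = requests.post("http://localhost:8091/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
# The generated image is returned as a base64 data URL.
data_url = resp.json()["choices"][0]["message"]["content"][0]["image_url"]["url"]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(data_url.split(",", 1)[1]))
```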
## File Description
| File | Description |
| --------------------------- | ---------------------------- |
| `run_server.sh` | Server startup script |
| `run_curl_text_to_image.sh` | curl example |
| `openai_chat_client.py` | Python client |
| `gradio_demo.py` | Gradio interactive interface |
## Example materials
??? abstract "gradio_demo.py"
``````py
--8<-- "examples/online_serving/text_to_image/gradio_demo.py"
``````
??? abstract "openai_chat_client.py"
``````py
--8<-- "examples/online_serving/text_to_image/openai_chat_client.py"
``````
??? abstract "run_curl_text_to_image.sh"
``````sh
--8<-- "examples/online_serving/text_to_image/run_curl_text_to_image.sh"
``````
??? abstract "run_server.sh"
``````sh
--8<-- "examples/online_serving/text_to_image/run_server.sh"
``````
# BAGEL-7B-MoT
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
Get into the bagel folder
```bash
cd examples/offline_inference/bagel
```
### Modality Control
BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:
#### Text to Image (text2img)
- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends KV cache to DiT for conditioned generation
Generate images from text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat"
```
#### Image to Image (img2img)
- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage, direct image-to-image transformation
Transform images based on text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2img \
--image-path /path/to/image.jpg \
--prompts "Let the woman wear a blue dress"
```
#### Image to Text (img2text)
- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding AND ViT semantic encoding for comprehensive image understanding
Generate text descriptions from images:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2text \
--image-path /path/to/image.jpg \
--prompts "Describe this image in detail"
```
#### Text to Text (text2text)
- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved, operates as pure language model
Pure text generation:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--prompts "What is the capital of France?"
# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--txt-prompts /path/to/prompts.txt
```
### Inference Steps
Control the number of inference steps for image generation:
```bash
# Increase --steps (e.g., to 100) to improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--steps 50 \
--prompts "A cute cat"
```
### Key arguments
The default YAML configuration deploys the Thinker and DiT stages on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
#### 📌 Command Line Arguments (end2end.py)
| Argument | Type | Default | Description |
| :--------------------- | :----- | :---------------------------- | :----------------------------------------------------------- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts directly |
| `--txt-prompts` | string | `None` | Path to txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Initialization sleep time |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |
------
#### ⚙️ Stage Configuration Parameters (bagel.yaml)
**Stage 0 - Thinker (LLM Stage)**
| Parameter | Value | Description |
| :------------------------------- | :------------------------------ | :----------------------- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send KV cache |
------
**Stage 1 - DiT (Diffusion Stage)**
| Parameter | Value | Description |
| :------------------------------- | :---------- | :-------------------------- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive KV cache |
| `engine_input_source` | `[0]` | Input source from Stage 0 |
------
#### 🔗 Runtime Configuration
| Parameter | Value | Description |
| :-------------------- | :------ | :------------------------------- |
| `window_size` | `-1` | Window size (-1 means unlimited) |
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |
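To run with a customized stage configuration (for example, placing the Thinker and DiT stages on different GPUs), point `end2end.py` at your YAML file via `--stage-configs-path`, or pass it to the `Omni` constructor directly as `end2end.py` does. A minimal sketch; the file name is illustrative:
```python
from vllm_omni.entrypoints.omni import Omni

# Use a customized copy of bagel.yaml (file name is illustrative), e.g. with
# Stage 0 on device "0" and Stage 1 on device "1" for a dual-GPU setup.
omni = Omni(
    model="ByteDance-Seed/BAGEL-7B-MoT",
    stage_configs_path="bagel_dual_gpu.yaml",
)
```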
## FAQ
- If you encounter an error about the librosa backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you run into an OOM error, try decreasing `max_model_len`. Approximate per-stage VRAM usage is listed below, followed by a quick sanity-check sketch.
| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |
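As a rough pre-flight check, you can compare the figures above with the free memory reported by your GPU; this is only a hedged sketch, since the extra headroom needed for the KV cache still depends on `max_model_len` and batch size:
```python
import torch

# Weights-only requirement from the table above (GiB); the Thinker KV cache comes on top.
REQUIRED_GIB = 15.04 + 26.50  # Thinker + DiT on a single GPU

free_bytes, _total_bytes = torch.cuda.mem_get_info()
free_gib = free_bytes / 1024**3
print(f"Free VRAM: {free_gib:.1f} GiB, required (weights only): ~{REQUIRED_GIB:.2f} GiB")
if free_gib < REQUIRED_GIB:
    print("Consider a dual-GPU stage configuration, or lower max_model_len / gpu_memory_utilization.")
```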
import argparse
import os
from typing import cast
from vllm_omni.inputs.data import OmniDiffusionSamplingParams, OmniPromptType
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model",
default="ByteDance-Seed/BAGEL-7B-MoT",
help="Path to merged model directory.",
)
parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.")
parser.add_argument(
"--txt-prompts",
type=str,
default=None,
help="Path to a .txt file with one prompt per line (preferred).",
)
parser.add_argument("--prompt_type", default="text", choices=["text"])
parser.add_argument(
"--modality",
default="text2img",
choices=["text2img", "img2img", "img2text", "text2text"],
help="Modality mode to control stage execution.",
)
parser.add_argument(
"--image-path",
type=str,
default=None,
help="Path to input image for img2img.",
)
# OmniLLM init args
parser.add_argument("--enable-stats", action="store_true", default=False)
parser.add_argument("--init-sleep-seconds", type=int, default=20)
parser.add_argument("--batch-timeout", type=int, default=5)
parser.add_argument("--init-timeout", type=int, default=300)
parser.add_argument("--shm-threshold-bytes", type=int, default=65536)
parser.add_argument("--worker-backend", type=str, default="process", choices=["process", "ray"])
parser.add_argument("--ray-address", type=str, default=None)
parser.add_argument("--stage-configs-path", type=str, default=None)
parser.add_argument("--steps", type=int, default=50, help="Number of inference steps.")
args = parser.parse_args()
return args
def main():
args = parse_args()
model_name = args.model
prompts: list[OmniPromptType] = []
try:
# Preferred: load from txt file (one prompt per line)
if getattr(args, "txt_prompts", None) and args.prompt_type == "text":
with open(args.txt_prompts, encoding="utf-8") as f:
lines = [ln.strip() for ln in f.readlines()]
prompts = [ln for ln in lines if ln != ""]
print(f"[Info] Loaded {len(prompts)} prompts from {args.txt_prompts}")
else:
prompts = args.prompts
except Exception as e:
print(f"[Error] Failed to load prompts: {e}")
raise
if not prompts:
# Default prompt for text2img test if none provided
prompts = ["<|im_start|>A cute cat<|im_end|>"]
print(f"[Info] No prompts provided, using default: {prompts}")
omni_outputs = []
from PIL import Image
if args.modality == "img2img":
from PIL import Image
from vllm_omni.entrypoints.omni_diffusion import OmniDiffusion
print("[Info] Running in img2img mode (Stage 1 only)")
client = OmniDiffusion(model=model_name)
if args.image_path:
if os.path.exists(args.image_path):
loaded_image = Image.open(args.image_path).convert("RGB")
prompts = [
{
"prompt": cast(str, p),
"multi_modal_data": {"image": loaded_image},
}
for p in prompts
]
else:
print(f"[Warning] Image path {args.image_path} does not exist.")
result = client.generate(
prompts,
OmniDiffusionSamplingParams(
seed=52,
need_kv_receive=False,
num_inference_steps=args.steps,
),
)
# Ensure result is a list for iteration
if not isinstance(result, list):
omni_outputs = [result]
else:
omni_outputs = result
else:
from vllm_omni.entrypoints.omni import Omni
omni_kwargs = {}
if args.stage_configs_path:
omni_kwargs["stage_configs_path"] = args.stage_configs_path
omni_kwargs.update(
{
"log_stats": args.enable_stats,
"init_sleep_seconds": args.init_sleep_seconds,
"batch_timeout": args.batch_timeout,
"init_timeout": args.init_timeout,
"shm_threshold_bytes": args.shm_threshold_bytes,
"worker_backend": args.worker_backend,
"ray_address": args.ray_address,
}
)
omni = Omni(model=model_name, **omni_kwargs)
formatted_prompts = []
        for p in prompts:  # use the resolved prompt list (supports --txt-prompts and the default prompt)
if args.modality == "img2text":
if args.image_path:
loaded_image = Image.open(args.image_path).convert("RGB")
final_prompt_text = f"<|im_start|>user\n<|image_pad|>\n{p}<|im_end|>\n<|im_start|>assistant\n"
prompt_dict = {
"prompt": final_prompt_text,
"multi_modal_data": {"image": loaded_image},
"modalities": ["text"],
}
formatted_prompts.append(prompt_dict)
elif args.modality == "text2text":
final_prompt_text = f"<|im_start|>user\n{p}<|im_end|>\n<|im_start|>assistant\n"
prompt_dict = {"prompt": final_prompt_text, "modalities": ["text"]}
formatted_prompts.append(prompt_dict)
else:
# text2img
final_prompt_text = f"<|im_start|>{p}<|im_end|>"
prompt_dict = {"prompt": final_prompt_text, "modalities": ["image"]}
formatted_prompts.append(prompt_dict)
params_list = omni.default_sampling_params_list
if args.modality == "text2img":
params_list[0].max_tokens = 1 # type: ignore # The first stage is a SamplingParam (vllm)
if len(params_list) > 1:
params_list[1].num_inference_steps = args.steps # type: ignore # The second stage is an OmniDiffusionSamplingParam
omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list))
for i, req_output in enumerate(omni_outputs):
images = getattr(req_output, "images", None)
if not images and hasattr(req_output, "output"):
if isinstance(req_output.output, list):
images = req_output.output
else:
images = [req_output.output]
if images:
for j, img in enumerate(images):
img.save(f"output_{i}_{j}.png")
if hasattr(req_output, "request_output") and req_output.request_output:
for stage_out in req_output.request_output:
if hasattr(stage_out, "images") and stage_out.images:
for k, img in enumerate(stage_out.images):
save_path = f"output_{i}_stage_{getattr(stage_out, 'stage_id', '?')}_{k}.png"
img.save(save_path)
print(f"[Info] Saved stage output image to {save_path}")
print(omni_outputs)
if __name__ == "__main__":
main()
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example script for image editing with Qwen-Image-Edit.
Usage (single image):
python image_edit.py \
--image input.png \
--prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
Usage (multiple images):
python image_edit.py \
--image input1.png input2.png input3.png \
--prompt "Combine these images into a single scene" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
Usage (with cache-dit acceleration):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--cache_backend cache_dit \
--cache_dit_max_continuous_cached_steps 3 \
--cache_dit_residual_diff_threshold 0.24 \
--cache_dit_enable_taylorseer
Usage (with tea_cache acceleration):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--cache_backend tea_cache \
--tea_cache_rel_l1_thresh 0.25
Usage (layered):
python image_edit.py \
--model "Qwen/Qwen-Image-Layered" \
--image input.png \
--prompt "" \
--output "layered" \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--layers 4 \
--color-format "RGBA"
Usage (with CFG Parallel):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--cfg_parallel_size 2 \
--num_inference_steps 50 \
--cfg_scale 4.0
Usage (disable torch.compile):
python image_edit.py \
--image input.png \
--prompt "Edit description" \
--enforce_eager \
--num_inference_steps 50 \
--cfg_scale 4.0
For more options, run:
python image_edit.py --help
"""
import argparse
import os
import time
from pathlib import Path
import torch
from PIL import Image
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.outputs import OmniRequestOutput
from vllm_omni.platforms import current_omni_platform
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Edit an image with Qwen-Image-Edit.")
parser.add_argument(
"--model",
default="Qwen/Qwen-Image-Edit",
help=(
"Diffusion model name or local path. "
"For multiple image inputs, use Qwen/Qwen-Image-Edit-2509 or Qwen/Qwen-Image-Edit-2511"
"which supports QwenImageEditPlusPipeline."
),
)
parser.add_argument(
"--image",
type=str,
nargs="+",
required=True,
help="Path(s) to input image file(s) (PNG, JPG, etc.). Can specify multiple images.",
)
parser.add_argument(
"--prompt",
type=str,
required=True,
help="Text prompt describing the edit to make to the image.",
)
parser.add_argument(
"--negative_prompt",
type=str,
default=None,
required=False,
)
parser.add_argument(
"--seed",
type=int,
default=0,
help="Random seed for deterministic results.",
)
parser.add_argument(
"--cfg_scale",
type=float,
default=4.0,
help=(
"True classifier-free guidance scale (default: 4.0). Guidance scale as defined in Classifier-Free "
"Diffusion Guidance. Classifier-free guidance is enabled by setting cfg_scale > 1 and providing "
"a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, "
"usually at the expense of lower image quality."
),
)
parser.add_argument(
"--guidance_scale",
type=float,
default=1.0,
help=(
"Guidance scale for guidance-distilled models (default: 1.0, disabled). "
"Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale "
"directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models."
),
)
parser.add_argument(
"--output",
type=str,
default="output_image_edit.png",
help=("Path to save the edited image (PNG). Or prefix for Qwen-Image-Layered model save images(PNG)."),
)
parser.add_argument(
"--num_outputs_per_prompt",
type=int,
default=1,
help="Number of images to generate for the given prompt.",
)
parser.add_argument(
"--num_inference_steps",
type=int,
default=50,
help="Number of denoising steps for the diffusion sampler.",
)
parser.add_argument(
"--cache_backend",
type=str,
default=None,
choices=["cache_dit", "tea_cache"],
help=(
"Cache backend to use for acceleration. "
"Options: 'cache_dit' (DBCache + SCM + TaylorSeer), 'tea_cache' (Timestep Embedding Aware Cache). "
"Default: None (no cache acceleration)."
),
)
parser.add_argument(
"--ulysses_degree",
type=int,
default=1,
help="Number of GPUs used for ulysses sequence parallelism.",
)
parser.add_argument(
"--ring_degree",
type=int,
default=1,
help="Number of GPUs used for ring sequence parallelism.",
)
parser.add_argument(
"--tensor_parallel_size",
type=int,
default=1,
help="Number of GPUs used for tensor parallelism (TP) inside the DiT.",
)
parser.add_argument("--layers", type=int, default=4, help="Number of layers to decompose the input image into.")
parser.add_argument(
"--resolution",
type=int,
default=640,
help="Bucket in (640, 1024) to determine the condition and output resolution",
)
parser.add_argument(
"--color-format",
type=str,
default="RGB",
help="For Qwen-Image-Layered, set to RGBA.",
)
# Cache-DiT specific parameters
parser.add_argument(
"--cache_dit_fn_compute_blocks",
type=int,
default=1,
help="[cache-dit] Number of forward compute blocks. Optimized for single-transformer models.",
)
parser.add_argument(
"--cache_dit_bn_compute_blocks",
type=int,
default=0,
help="[cache-dit] Number of backward compute blocks.",
)
parser.add_argument(
"--cache_dit_max_warmup_steps",
type=int,
default=4,
help="[cache-dit] Maximum warmup steps (works for few-step models).",
)
parser.add_argument(
"--cache_dit_residual_diff_threshold",
type=float,
default=0.24,
help="[cache-dit] Residual diff threshold. Higher values enable more aggressive caching.",
)
parser.add_argument(
"--cache_dit_max_continuous_cached_steps",
type=int,
default=3,
help="[cache-dit] Maximum continuous cached steps to prevent precision degradation.",
)
parser.add_argument(
"--cache_dit_enable_taylorseer",
action="store_true",
default=False,
help="[cache-dit] Enable TaylorSeer acceleration (not suitable for few-step models).",
)
parser.add_argument(
"--cache_dit_taylorseer_order",
type=int,
default=1,
help="[cache-dit] TaylorSeer polynomial order.",
)
parser.add_argument(
"--cache_dit_scm_steps_mask_policy",
type=str,
default=None,
choices=[None, "slow", "medium", "fast", "ultra"],
help="[cache-dit] SCM mask policy: None (disabled), slow, medium, fast, ultra.",
)
parser.add_argument(
"--cache_dit_scm_steps_policy",
type=str,
default="dynamic",
choices=["dynamic", "static"],
help="[cache-dit] SCM steps policy: dynamic or static.",
)
# TeaCache specific parameters
parser.add_argument(
"--tea_cache_rel_l1_thresh",
type=float,
default=0.2,
help="[tea_cache] Threshold for accumulated relative L1 distance.",
)
parser.add_argument(
"--cfg_parallel_size",
type=int,
default=1,
choices=[1, 2],
help="Number of GPUs used for classifier free guidance parallel size.",
)
parser.add_argument(
"--enforce_eager",
action="store_true",
help="Disable torch.compile and force eager execution.",
)
parser.add_argument(
"--vae_use_slicing",
action="store_true",
help="Enable VAE slicing for memory optimization.",
)
parser.add_argument(
"--vae_use_tiling",
action="store_true",
help="Enable VAE tiling for memory optimization.",
)
parser.add_argument(
"--enable-cpu-offload",
action="store_true",
help="Enable CPU offloading for diffusion models.",
)
parser.add_argument(
"--enable-layerwise-offload",
action="store_true",
help="Enable layerwise (blockwise) offloading on DiT modules.",
)
parser.add_argument(
"--layerwise-num-gpu-layers",
type=int,
default=1,
help="Number of ready layers (blocks) to keep on GPU during generation.",
)
return parser.parse_args()
def main():
args = parse_args()
# Validate input images exist and load them
input_images = []
for image_path in args.image:
if not os.path.exists(image_path):
raise FileNotFoundError(f"Input image not found: {image_path}")
img = Image.open(image_path).convert(args.color_format)
input_images.append(img)
# Use single image or list based on number of inputs
if len(input_images) == 1:
input_image = input_images[0]
else:
input_image = input_images
generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)
parallel_config = DiffusionParallelConfig(
ulysses_degree=args.ulysses_degree,
ring_degree=args.ring_degree,
cfg_parallel_size=args.cfg_parallel_size,
tensor_parallel_size=args.tensor_parallel_size,
)
# Configure cache based on backend type
cache_config = None
if args.cache_backend == "cache_dit":
# cache-dit configuration: Hybrid DBCache + SCM + TaylorSeer
cache_config = {
"Fn_compute_blocks": args.cache_dit_fn_compute_blocks,
"Bn_compute_blocks": args.cache_dit_bn_compute_blocks,
"max_warmup_steps": args.cache_dit_max_warmup_steps,
"residual_diff_threshold": args.cache_dit_residual_diff_threshold,
"max_continuous_cached_steps": args.cache_dit_max_continuous_cached_steps,
"enable_taylorseer": args.cache_dit_enable_taylorseer,
"taylorseer_order": args.cache_dit_taylorseer_order,
"scm_steps_mask_policy": args.cache_dit_scm_steps_mask_policy,
"scm_steps_policy": args.cache_dit_scm_steps_policy,
}
elif args.cache_backend == "tea_cache":
# TeaCache configuration
cache_config = {
"rel_l1_thresh": args.tea_cache_rel_l1_thresh,
# Note: coefficients will use model-specific defaults based on model_type
}
# Initialize Omni with appropriate pipeline
omni = Omni(
model=args.model,
enable_layerwise_offload=args.enable_layerwise_offload,
layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
vae_use_slicing=args.vae_use_slicing,
vae_use_tiling=args.vae_use_tiling,
cache_backend=args.cache_backend,
cache_config=cache_config,
parallel_config=parallel_config,
enforce_eager=args.enforce_eager,
enable_cpu_offload=args.enable_cpu_offload,
)
print("Pipeline loaded")
# Check if profiling is requested via environment variable
profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
# Time profiling for generation
print(f"\n{'=' * 60}")
print("Generation Configuration:")
print(f" Model: {args.model}")
print(f" Inference steps: {args.num_inference_steps}")
print(f" Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
if isinstance(input_image, list):
print(f" Number of input images: {len(input_image)}")
for idx, img in enumerate(input_image):
print(f" Image {idx + 1} size: {img.size}")
else:
print(f" Input image size: {input_image.size}")
print(
f" Parallel configuration: ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}, cfg_parallel_size={args.cfg_parallel_size}, tensor_parallel_size={args.tensor_parallel_size}"
)
print(f"{'=' * 60}\n")
generation_start = time.perf_counter()
if profiler_enabled:
print("[Profiler] Starting profiling...")
omni.start_profile()
# Generate edited image
outputs = omni.generate(
{
"prompt": args.prompt,
"negative_prompt": args.negative_prompt,
"multi_modal_data": {"image": input_image},
},
OmniDiffusionSamplingParams(
generator=generator,
true_cfg_scale=args.cfg_scale,
guidance_scale=args.guidance_scale,
num_inference_steps=args.num_inference_steps,
num_outputs_per_prompt=args.num_outputs_per_prompt,
layers=args.layers,
resolution=args.resolution,
),
)
generation_end = time.perf_counter()
generation_time = generation_end - generation_start
# Print profiling results
print(f"Total generation time: {generation_time:.4f} seconds ({generation_time * 1000:.2f} ms)")
if profiler_enabled:
print("\n[Profiler] Stopping profiler and collecting results...")
profile_results = omni.stop_profile()
if profile_results and isinstance(profile_results, dict):
traces = profile_results.get("traces", [])
print("\n" + "=" * 60)
print("PROFILING RESULTS:")
for rank, trace in enumerate(traces):
print(f"\nRank {rank}:")
if trace:
print(f" • Trace: {trace}")
if not traces:
print(" No traces collected.")
print("=" * 60)
else:
print("[Profiler] No valid profiling data returned.")
if not outputs:
raise ValueError("No output generated from omni.generate()")
# Extract images from OmniRequestOutput
# omni.generate() returns list[OmniRequestOutput], extract images from request_output[0].images
first_output = outputs[0]
if not hasattr(first_output, "request_output") or not first_output.request_output:
raise ValueError("No request_output found in OmniRequestOutput")
req_out = first_output.request_output[0]
if not isinstance(req_out, OmniRequestOutput) or not hasattr(req_out, "images"):
raise ValueError("Invalid request_output structure or missing 'images' key")
images = req_out.images
if not images:
raise ValueError("No images found in request_output")
# Save output image(s)
output_path = Path(args.output)
output_path.parent.mkdir(parents=True, exist_ok=True)
suffix = output_path.suffix or ".png"
stem = output_path.stem or "output_image_edit"
# Handle layered output (each image may be a list of layers)
if args.num_outputs_per_prompt <= 1:
img = images[0]
# Check if this is a layered output (list of images)
if isinstance(img, list):
for sub_idx, sub_img in enumerate(img):
save_path = output_path.parent / f"{stem}_{sub_idx}{suffix}"
sub_img.save(save_path)
print(f"Saved edited image to {os.path.abspath(save_path)}")
else:
img.save(output_path)
print(f"Saved edited image to {os.path.abspath(output_path)}")
else:
for idx, img in enumerate(images):
# Check if this is a layered output (list of images)
if isinstance(img, list):
for sub_idx, sub_img in enumerate(img):
save_path = output_path.parent / f"{stem}_{idx}_{sub_idx}{suffix}"
sub_img.save(save_path)
print(f"Saved edited image to {os.path.abspath(save_path)}")
else:
save_path = output_path.parent / f"{stem}_{idx}{suffix}"
img.save(save_path)
print(f"Saved edited image to {os.path.abspath(save_path)}")
if __name__ == "__main__":
main()
# Image-To-Image
This example edits an input image with `Qwen/Qwen-Image-Edit` using the `image_edit.py` CLI.
## Local CLI Usage
### Single Image Editing
Download the example image:
```bash
wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png
```
Then run:
```bash
python image_edit.py \
--image qwen-bear.png \
--prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0
```
### Multiple Image Editing (Qwen-Image-Edit-2509)
For multiple image inputs, use `Qwen/Qwen-Image-Edit-2509` or `Qwen/Qwen-Image-Edit-2511`:
```bash
python image_edit.py \
--model Qwen/Qwen-Image-Edit-2509 \
--image img1.png img2.png \
--prompt "Combine these images into a single scene" \
--output output_image_edit.png \
--num_inference_steps 50 \
--cfg_scale 4.0 \
--guidance_scale 1.0
```
Key arguments:
- `--model`: model name or path. Use `Qwen/Qwen-Image-Edit-2509` or later for multiple image support.
- `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
- `--prompt` / `--negative_prompt`: text description (string).
- `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
- `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
- `--output`: path to save the generated PNG.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
The following example runs `Qwen/Qwen-Image-Edit-2511` with cache-dit acceleration enabled:
```bash
python image_edit.py \
    --model Qwen/Qwen-Image-Edit-2511 \
    --image qwen-bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. Position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0 \
    --cache_backend cache_dit
```
# Image-To-Video
This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.
## Local CLI Usage
### Wan2.2-I2V-A14B-Diffusers (MoE)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 5.0 \
--guidance_scale_high 6.0 \
--num_inference_steps 40 \
--boundary_ratio 0.875 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
### Wan2.2-TI2V-5B-Diffusers (Unified)
```bash
python image_to_video.py \
--model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--image input.png \
--prompt "A cat playing with yarn, smooth motion" \
--negative_prompt "<optional quality filter>" \
--height 480 \
--width 832 \
--num_frames 48 \
--guidance_scale 4.0 \
--num_inference_steps 40 \
--flow_shift 12.0 \
--fps 16 \
--output i2v_output.mp4
```
Key arguments:
- `--model`: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
- `--image`: Path to input image (required).
- `--prompt`: Text description of desired motion/animation.
- `--height/--width`: Output resolution (auto-calculated from the input image if not set). Dimensions should be multiples of 16 (see the sketch after this list).
- `--num_frames`: Number of frames (default 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
- `--negative_prompt`: Optional list of artifacts to suppress.
- `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
- `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
- `--num_inference_steps`: Number of denoising steps (default 50).
- `--fps`: Frames per second for the saved MP4 (requires `diffusers` export_to_video).
- `--output`: Path to save the generated video.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
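Since the output dimensions should be multiples of 16, a small helper like the one below (a hedged sketch, not part of `image_to_video.py`) can derive `--height`/`--width` from your input image while roughly preserving its aspect ratio:
```python
from PIL import Image

def snap_resolution(path: str, target_area: int = 480 * 832) -> tuple[int, int]:
    """Scale the input image to roughly `target_area` pixels and round each
    side down to a multiple of 16."""
    w, h = Image.open(path).size
    scale = (target_area / (w * h)) ** 0.5
    width = max(16, int(w * scale) // 16 * 16)
    height = max(16, int(h * scale) // 16 * 16)
    return height, width

height, width = snap_resolution("input.png")
print(f"--height {height} --width {width}")
```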