This directory contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
The examples use `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other diffusion models supported by vLLM-omni.
## Overview
Similar to vLLM, vLLM-omni uses a unified LoRA handling mechanism:
- **Pre-loaded LoRA**: Loaded at initialization via `--lora-path` (pre-loaded into the cache)
- **Per-request LoRA**: Loaded on demand. In the example, the LoRA is loaded via `--lora-request-path` for each request
Both approaches use the same underlying mechanism: all LoRA adapters are handled uniformly through `set_active_adapter()`. If a request does not include a LoRA request, all adapters are deactivated.
## Usage
### Pre-loaded LoRA (via --lora-path)
Load a LoRA adapter at initialization. This adapter is pre-loaded into the cache and can be activated by requests; see the example command after the parameter list below.
**Note**: When using `--lora-path`, the adapter is loaded at init time with a stable ID derived from the adapter path. This example activates it automatically for the request.
- `--lora-path`: Path to a LoRA adapter folder to pre-load at initialization (loaded into the cache with a stable ID derived from the path)
- `--lora-request-path`: Path to a LoRA adapter folder for per-request loading
- `--lora-request-id`: Integer ID for the LoRA adapter (optional). If not provided and `--lora-request-path` is set, a stable ID is derived from the path.
- `--lora-scale`: Scale factor for the LoRA weights (default: 1.0). Higher values increase the influence of the LoRA adapter.
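As a minimal sketch of the pre-loaded case, a run might look like the following; the script name `text_to_image_lora.py` is only a placeholder for the actual example script in this folder, while the flags are the ones documented above:

```bash
# Placeholder script name; substitute the actual example script in this directory.
# The adapter at --lora-path is pre-loaded into the cache at init time and is
# activated for the request by this example.
python text_to_image_lora.py \
    --prompt "a watercolor fox sitting in a bamboo forest" \
    --lora-path /path/to/lora_adapter \
    --lora-scale 1.0
```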
### Standard Parameters
- `--prompt`: Text prompt for image generation (required)
- `--seed`: Random seed for reproducibility (default: 42)
- `--height`: Image height in pixels (default: 1024)
- `--width`: Image width in pixels (default: 1024)
- `--num_inference_steps`: Number of denoising steps (default: 50)
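For per-request loading, a sketch combining the LoRA flags with the standard parameters could look like this (again, the script name is only a placeholder; the flags match those listed above):

```bash
# Placeholder script name; substitute the actual example script in this directory.
# The adapter at --lora-request-path is loaded on demand for the request; if
# --lora-request-id is omitted, a stable ID is derived from the path.
python text_to_image_lora.py \
    --prompt "an oil painting of a lighthouse at dusk" \
    --lora-request-path /path/to/lora_adapter \
    --lora-scale 0.8 \
    --seed 42 --height 1024 --width 1024 --num_inference_steps 50
```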
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Change into the example folder:
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Change into the example folder:
```bash
cd examples/offline_inference/qwen2_5_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
### Modality control
If you want to control the output modalities, e.g. generate only text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type mixed_modalities \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via CLI arguments.
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
### Multiple Prompts
Change into the example folder:
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below. Note: for processing large volumes of data, it uses `py_generator` mode, which returns a Python generator from the `Omni` class.
```bash
bash run_multiple_prompts.sh
```
### Single Prompt
Change into the example folder:
```bash
cd examples/offline_inference/qwen3_omni
```
Then run the command below.
```bash
bash run_single_prompt.sh
```
If you do not have enough memory, you can enable tensor parallelism for the thinker. Just run the command below:
```bash
bash run_single_prompt_tp.sh
```
### Modality control
If you want to control the output modalities, e.g. generate only text, you can run the command below:
```bash
python end2end.py --output-wav output_audio \
--query-type use_audio \
--modalities text
```
#### Using Local Media Files
The `end2end.py` script supports local media files (audio, video, image) via command-line arguments.
This directory contains an offline demo for running Qwen3 TTS models with vLLM Omni. It builds task-specific inputs and generates WAV files locally.
## Model Overview
Qwen3 TTS provides multiple task variants for speech generation:
- **CustomVoice**: Generate speech with a known speaker identity (speaker ID) and an optional instruction.
- **VoiceDesign**: Generate speech from text plus a descriptive instruction that designs a new voice.
- **Base**: Voice cloning using a reference audio clip and its transcript, with optional mode selection.
## Setup
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
### ROCm Dependencies
You will need to install two dependencies: `onnxruntime-rocm` and `sox`.
```bash
pip uninstall onnxruntime  # onnxruntime must be removed before installing onnxruntime-rocm
pip install onnxruntime-rocm sox
```
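Optionally, you can confirm that the ROCm execution provider is visible to ONNX Runtime after the reinstall (a generic ONNX Runtime check, not specific to vLLM-omni):

```bash
# ROCMExecutionProvider should appear in the printed list.
python -c "import onnxruntime; print(onnxruntime.get_available_providers())"
```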
## Quick Start
Run a single sample for a task:
```bash
python end2end.py --query-type CustomVoice
```
Generated audio files are saved to `output_audio/` by default.
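The same flag selects the other task variants listed above. For example, a VoiceDesign run would look like the following; the Base variant may require additional reference-audio arguments, so check `end2end.py` for the exact flags:

```bash
python end2end.py --query-type VoiceDesign
```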
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
> ℹ️ Qwen-Image currently publishes best-effort presets at `1328x1328`, `1664x928`, `928x1664`, `1472x1140`, `1140x1472`, `1584x1056`, and `1056x1584`. Adjust `--height/--width` accordingly for the most reliable outcomes.
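As a sketch, the memory flags can be combined with one of the preset resolutions above; the script name `text_to_image.py` is only a placeholder for whichever example script you are running:

```bash
# Placeholder script name; substitute the actual example script in this directory.
python text_to_image.py \
    --prompt "a snow-covered mountain village at sunrise" \
    --height 1664 --width 928 \
    --vae_use_slicing --vae_use_tiling \
    --enable-cpu-offload
```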
## Web UI Demo
Launch the gradio demo:
```bash
python gradio_demo.py --port 7862
```
Then open `http://localhost:7862/` in your browser to interact with the web UI.