# Supported Models
vLLM-Omni supports unified multimodal comprehension and generation models across various tasks.
## Model Implementation
If vLLM-Omni natively supports a model, its implementation can be found in <gh-file:vllm_omni/model_executor/models> and <gh-file:vllm_omni/diffusion/models>.
## List of Supported Models for Nvidia GPU / AMD GPU
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `Qwen3OmniMoeForConditionalGeneration` | Qwen3-Omni | `Qwen/Qwen3-Omni-30B-A3B-Instruct` |
| `Qwen2_5OmniForConditionalGeneration` | Qwen2.5-Omni | `Qwen/Qwen2.5-Omni-7B`, `Qwen/Qwen2.5-Omni-3B` |
| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` |
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImagePipeline` | Qwen-Image-2512 | `Qwen/Qwen-Image-2512` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
|`ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` |
| `WanPipeline` | Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` |
| `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` |
| `OvisImagePipeline` | Ovis-Image | `OvisAI/Ovis-Image` |
|`LongcatImagePipeline` | LongCat-Image | `meituan-longcat/LongCat-Image` |
|`LongCatImageEditPipeline` | LongCat-Image-Edit | `meituan-longcat/LongCat-Image-Edit` |
|`StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` |
|`Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` |
|`FluxPipeline` | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` |
|`StableAudioPipeline` | Stable-Audio-Open | `stabilityai/stable-audio-open-1.0` |
|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` |
|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` |
|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` |
## List of Supported Models for NPU
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `Qwen2_5OmniForConditionalGeneration` | Qwen2.5-Omni | `Qwen/Qwen2.5-Omni-7B`, `Qwen/Qwen2.5-Omni-3B`|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImagePipeline` | Qwen-Image-2512 | `Qwen/Qwen-Image-2512` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2511 | `Qwen/Qwen-Image-Edit-2511` |
|`ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` |
# Image Edit API
vLLM-Omni provides an OpenAI DALL-E compatible API for image editing using diffusion models.
Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).
## Quick Start
### Start the Server
For example...
```bash
# Qwen-Image-Edit-2511
vllm serve Qwen/Qwen-Image-Edit-2511 --omni --port 8000
```
### Generate Images
**Using curl:**
```bash
curl -s -D >(grep -i x-request-id >&2) \
-o >(jq -r '.data[0].b64_json' | base64 --decode > gift-basket.png) \
-X POST "http://localhost:8000/v1/images/edits" \
-F "model=xxx" \
-F "image=@./xx.png" \
-F "prompt='this bear is wearing sportwear. holding a basketball, and bending one leg.'" \
-F "size=1024x1024" \
-F "output_format=png"
```
**Using OpenAI SDK:**
```python
import base64

from openai import OpenAI

client = OpenAI(
    api_key="None",
    base_url="http://localhost:8000/v1",
)

input_image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png"

result = client.images.edit(
    image=[],  # input images are passed via the `url` extension parameter below
    model="Qwen/Qwen-Image-Edit-2511",
    prompt="Change the bears in the two input images into walking together.",
    size="512x512",
    stream=False,
    output_format="jpeg",
    # URL-style input: one entry per input image (the same asset is reused here for illustration)
    extra_body={
        "url": [input_image_url, input_image_url],
        "num_inference_steps": 50,
        "guidance_scale": 1,
        "seed": 777,
    },
)

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("edit_out_http.jpeg", "wb") as f:
    f.write(image_bytes)
```
## API Reference
### Endpoint
```
POST /v1/images/edits
Content-Type: multipart/form-data
```
### Request Parameters
#### OpenAI Standard Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | string | **required** | A text description of the desired image |
| `model` | string | server's model | Model to use (optional, should match server if specified) |
| `image` | string or array | **required** | The image(s) to edit. |
| `n` | integer | 1 | Number of images to generate (1-10) |
| `size` | string | "auto" | Image dimensions in WxH format (e.g., "1024x1024", "512x512"); when set to "auto", the size is inferred from the first input image. |
| `response_format` | string | "b64_json" | Response format (only "b64_json" supported) |
| `user` | string | null | User identifier for tracking |
| `output_format` | string | "png" | The format in which the generated images are returned. Must be one of "png", "jpg", "jpeg", "webp". |
| `output_compression` | integer | 100 | The compression level (0-100%) for the generated images. |
| `background` | string or null | "auto" | Sets the transparency of the background for the generated image(s). |
#### vllm-omni Extension Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string or array | None | The image(s) to edit. |
| `negative_prompt` | string | null | Text describing what to avoid in the image |
| `num_inference_steps` | integer | model defaults | Number of diffusion steps |
| `guidance_scale` | float | model defaults | Classifier-free guidance scale (typically 0.0-20.0) |
| `true_cfg_scale` | float | model defaults | True CFG scale (model-specific parameter, may be ignored if not supported) |
| `seed` | integer | null | Random seed for reproducibility |
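For example, the extension parameters can be sent as additional form fields alongside the standard ones. The request below is a sketch that assumes the endpoint accepts these fields by name, as the table implies; the input file is a placeholder:

```bash
curl -X POST "http://localhost:8000/v1/images/edits" \
  -F "image=@./input.png" \
  -F "prompt=make the sky a deep sunset orange" \
  -F "negative_prompt=blurry, low quality" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.0" \
  -F "seed=42" \
  -F "size=1024x1024" \
  -F "output_format=png"
```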
### Response Format
```json
{
"created": 1701234567,
"data": [
{
"b64_json": "<base64-encoded PNG>",
"url": null,
"revised_prompt": null
}
],
"output_format": null,
"size": null,
}
```
## Examples
### Multiple Image Inputs
```bash
curl -s -D >(grep -i x-request-id >&2) \
-o >(jq -r '.data[0].b64_json' | base64 --decode > gift-basket.png) \
-X POST "http://localhost:8000/v1/images/edits" \
-F "model=xxx" \
-F "image=@xx.png" \
-F "image=@xx.png"
-F "prompt='this bear is wearing sportwear. holding a basketball, and bending one leg.'" \
-F "size=1024x1024" \
-F "output_format=png"
```
## Parameter Handling
The API passes parameters directly to the diffusion pipeline without model-specific transformation:
- **Default values**: When parameters are not specified, the underlying model uses its own defaults
- **Pass-through design**: User-provided values are forwarded directly to the diffusion engine
- **Minimal validation**: Only basic type checking and range validation at the API level
### Parameter Compatibility
The API passes parameters directly to the diffusion pipeline without model-specific validation.
- Unsupported parameters may be silently ignored by the model
- Incompatible values will result in errors from the underlying pipeline
- Recommended values vary by model - consult model documentation
**Best Practice:** Start with the model's recommended parameters, then adjust based on your needs.
## Error Responses
### 400 Bad Request
Invalid parameters (e.g., model mismatch):
```json
{
"detail": "Invalid size format: '1024x'. Expected format: 'WIDTHxHEIGHT' (e.g., '1024x1024')."
}
```
### 422 Unprocessable Entity
Validation errors (missing required fields):
```json
{
"detail": "Field 'image' or 'url' is required"
}
```
## Troubleshooting
### Server Not Running
```bash
# Check if server is responding
curl -X POST http://localhost:8000/v1/images/edits \
-F "prompt=test"
```
### Out of Memory
If you encounter OOM errors:
1. Reduce image size: `"size": "512x512"`
2. Reduce inference steps: `"num_inference_steps": 25`
## Development
Enable debug logging to see prompts and generation details:
```bash
vllm serve Qwen/Qwen-Image-Edit-2511 --omni \
--uvicorn-log-level debug
```
# Image Generation API
vLLM-Omni provides an OpenAI DALL-E compatible API for text-to-image generation using diffusion models.
Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).
## Quick Start
### Start the Server
For example...
```bash
# Qwen-Image
vllm serve Qwen/Qwen-Image --omni --port 8000
# Z-Image Turbo
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000
```
### Generate Images
**Using curl:**
```bash
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
"size": "1024x1024",
"seed": 42
}' | jq -r '.data[0].b64_json' | base64 -d > dragon.png
```
**Using Python:**
```python
import requests
import base64
from PIL import Image
import io
response = requests.post(
"http://localhost:8000/v1/images/generations",
json={
"prompt": "a black and white cat wearing a princess tiara",
"size": "1024x1024",
"num_inference_steps": 50,
"seed": 42,
}
)
# Decode and save
img_data = response.json()["data"][0]["b64_json"]
img_bytes = base64.b64decode(img_data)
img = Image.open(io.BytesIO(img_bytes))
img.save("cat.png")
```
**Using OpenAI SDK:**
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.images.generate(
model="Qwen/Qwen-Image",
prompt="a horse jumping over a fence nearby a babbling brook",
n=1,
size="1024x1024",
response_format="b64_json"
)
# Note: extension parameters (seed, num_inference_steps, guidance_scale) can be
# passed via extra_body or a direct HTTP request
```
## API Reference
### Endpoint
```
POST /v1/images/generations
Content-Type: application/json
```
### Request Parameters
#### OpenAI Standard Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | string | **required** | Text description of the desired image |
| `model` | string | server's model | Model to use (optional, should match server if specified) |
| `n` | integer | 1 | Number of images to generate (1-10) |
| `size` | string | model defaults | Image dimensions in WxH format (e.g., "1024x1024", "512x512") |
| `response_format` | string | "b64_json" | Response format (only "b64_json" supported) |
| `user` | string | null | User identifier for tracking |
#### vllm-omni Extension Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `negative_prompt` | string | null | Text describing what to avoid in the image |
| `num_inference_steps` | integer | model defaults | Number of diffusion steps |
| `guidance_scale` | float | model defaults | Classifier-free guidance scale (typically 0.0-20.0) |
| `true_cfg_scale` | float | model defaults | True CFG scale (model-specific parameter, may be ignored if not supported) |
| `seed` | integer | null | Random seed for reproducibility |
### Response Format
```json
{
"created": 1701234567,
"data": [
{
"b64_json": "<base64-encoded PNG>",
"url": null,
"revised_prompt": null
}
]
}
```
## Examples
### Multiple Images
```bash
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a steampunk city set in a valley of the Adirondack mountains",
"n": 4,
"size": "1024x1024",
"seed": 123
}'
```
This generates 4 images in a single request.
### With Negative Prompt
```python
response = requests.post(
"http://localhost:8000/v1/images/generations",
json={
"prompt": "a portrait of a skier in deep powder snow",
"negative_prompt": "blurry, low quality, distorted, ugly",
"num_inference_steps": 100,
"size": "1024x1024",
}
)
```
## Parameter Handling
The API passes parameters directly to the diffusion pipeline without model-specific transformation:
- **Default values**: When parameters are not specified, the underlying model uses its own defaults
- **Pass-through design**: User-provided values are forwarded directly to the diffusion engine
- **Minimal validation**: Only basic type checking and range validation at the API level
### Parameter Compatibility
The API passes parameters directly to the diffusion pipeline without model-specific validation.
- Unsupported parameters may be silently ignored by the model
- Incompatible values will result in errors from the underlying pipeline
- Recommended values vary by model - consult model documentation
**Best Practice:** Start with the model's recommended parameters, then adjust based on your needs.
## Error Responses
### 400 Bad Request
Invalid parameters (e.g., model mismatch):
```json
{
"detail": "Invalid size format: '1024x'. Expected format: 'WIDTHxHEIGHT' (e.g., '1024x1024')."
}
```
### 422 Unprocessable Entity
Validation errors (missing required fields):
```json
{
"detail": [
{
"loc": ["body", "prompt"],
"msg": "field required",
"type": "value_error.missing"
}
]
}
```
### 503 Service Unavailable
Diffusion engine not initialized:
```json
{
"detail": "Diffusion engine not initialized. Start server with a diffusion model."
}
```
## Troubleshooting
### Server Not Running
```bash
# Check if server is responding
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "test"}'
```
### Out of Memory
If you encounter OOM errors:
1. Reduce image size: `"size": "512x512"`
2. Reduce inference steps: `"num_inference_steps": 25`
3. Generate fewer images: `"n": 1`
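For example, a single request applying all three reductions might look like this (a sketch reusing the generations endpoint shown above):

```bash
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a misty forest at dawn",
    "size": "512x512",
    "num_inference_steps": 25,
    "n": 1
  }'
```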
## Testing
Run the test suite to verify functionality:
```bash
# All image generation tests
pytest tests/entrypoints/openai_api/test_image_server.py -v
# Specific test
pytest tests/entrypoints/openai_api/test_image_server.py::test_generate_single_image -v
```
## Development
Enable debug logging to see prompts and generation details:
```bash
vllm serve Qwen/Qwen-Image --omni \
--uvicorn-log-level debug
```
# Frequently Asked Questions
> Q: How many chips do I need to infer a model in vLLM-Omni?
A: We currently support natively disaggregated deployment of the different model stages within a model. One restriction is that a chip can host only one AutoRegressive (AR) model stage, because of vLLM's unified KV cache management. Stages of other types can coexist on a chip. This restriction will be lifted in a later version.
> Q: When trying to run the examples, I encounter an error about the backend of librosa or soundfile. How do I solve it?
A: If you encounter an error about the librosa backend, try installing ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
> Q: I see GPU OOM or "free memory is less than desired GPU memory utilization" errors. How can I fix it?
A: Refer to [GPU memory calculation and configuration](../configuration/gpu_memory_utilization.md) for guidance on tuning `gpu_memory_utilization` and related settings.
> Q: I encountered a bug or an urgent CI problem. How can I get it solved?
A: First, check the current [issues](https://github.com/vllm-project/vllm-omni/issues) for possible solutions. If none of them resolves your problem and it is urgent, please reach out to these [volunteers](https://docs.vllm.ai/projects/vllm-omni/en/latest/community/volunteers/) for help.
> Q: Does vLLM-Omni support AWQ or other quantization methods?
A: vLLM-Omni partitions a model into several stages. AR stages reuse the main logic of vLLM's LLMEngine, so quantization methods currently supported in vLLM should also work for those stages in vLLM-Omni, although systematic verification is still ongoing. Quantization for the DiffusionEngine is in progress. Please stay tuned, and contributions are welcome!
> Q: Does vLLM-Omni support multimodal streaming input and output?
A: Not yet. We already put it on the [Roadmap](https://github.com/vllm-project/vllm-omni/issues/165). Please stay tuned!
# Cache-DiT Acceleration Guide
This guide explains how to use cache-dit acceleration in vLLM-Omni to speed up diffusion model inference.
## Overview
Cache-dit is a library that accelerates diffusion transformer models through intelligent caching mechanisms. It supports multiple acceleration techniques that can be combined for optimal performance:
- **DBCache**: Dual Block Cache for reducing redundant computations
- **TaylorSeer**: Taylor expansion-based forecasting for faster inference
- **SCM**: Step Computation Masking for selective step computation
## Quick Start
### Basic Usage
Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`. Cache-dit will use its recommended default parameters:
```python
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
# Simplest way: just enable cache-dit with default parameters
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
)
images = omni.generate(
"a beautiful landscape",
OmniDiffusionSamplingParams(num_inference_steps=50),
)
```
**Default Parameters**: When `cache_config` is not provided, cache-dit uses optimized default values. See the [Configuration Reference](#configuration-reference) section for a complete list of all parameters and their default values.
### Custom Configuration
To customize cache-dit settings, provide a `cache_config` dictionary, for example:
```python
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
cache_config={
"Fn_compute_blocks": 1,
"Bn_compute_blocks": 0,
"max_warmup_steps": 4,
"residual_diff_threshold": 0.12,
},
)
```
## Online Serving (OpenAI-Compatible)
Enable Cache-DiT for online serving by passing `--cache-backend cache_dit` when starting the server:
```bash
# Use Cache-DiT default (recommended) parameters
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit
```
To customize Cache-DiT settings for online serving, pass a JSON string via `--cache-config`:
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
--cache-backend cache_dit \
--cache-config '{"Fn_compute_blocks": 1, "Bn_compute_blocks": 0, "max_warmup_steps": 4, "residual_diff_threshold": 0.12}'
```
## Acceleration Methods
For a comprehensive illustration, please see the cache-dit [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/)
### 1. DBCache (Dual Block Cache)
DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality.
**Key Parameters**:
- `Fn_compute_blocks` (int, default: 1): Number of **first n** transformer blocks used to compute stable feature differences. Higher values provide more accurate caching decisions but increase computation.
- `Bn_compute_blocks` (int, default: 0): Number of **last n** transformer blocks used for additional fusion. These blocks act as an auto-scaler for approximate hidden states.
- `max_warmup_steps` (int, default: 4): Number of initial steps where caching is disabled to ensure the model learns sufficient features before caching begins. Optimized for few-step distilled models.
- `residual_diff_threshold` (float, default: 0.24): Threshold for residual difference. Higher values lead to faster performance but may reduce precision. Default uses a relatively higher threshold for more aggressive caching.
- `max_cached_steps` (int, default: -1): Maximum number of cached steps. Set to -1 for unlimited caching.
- `max_continuous_cached_steps` (int, default: 3): Maximum number of consecutive cached steps. Limits consecutive caching to prevent precision degradation.
**Example Configuration**:
```python
cache_config={
"Fn_compute_blocks": 8, # Use first 8 blocks for difference computation
"Bn_compute_blocks": 0, # No additional fusion blocks
"max_warmup_steps": 8, # Cache after 8 warmup steps
"residual_diff_threshold": 0.12, # Higher threshold for faster inference
"max_cached_steps": -1, # No limit on cached steps
}
```
**Performance Tips**:
- Default `Fn_compute_blocks=1` works well for most cases. Increase to 8-12 for larger models or when more accuracy is needed
- Decrease `residual_diff_threshold` from the default 0.24 (e.g., to 0.12-0.15) for higher quality, or keep it at 0.24 or above for faster inference with a slight quality trade-off
- Default `max_warmup_steps=4` is optimized for few-step models. Increase to 6-8 for more steps if needed
### 2. TaylorSeer
TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality.
**Key Parameters**:
- `enable_taylorseer` (bool, default: False): Enable TaylorSeer acceleration
- `taylorseer_order` (int, default: 1): Order of Taylor expansion. Higher orders provide better accuracy but require more computation.
**Example Configuration**:
```python
cache_config={
"enable_taylorseer": True,
"taylorseer_order": 1, # First-order Taylor expansion
}
```
**Performance Tips**:
- Use `taylorseer_order=1` for most cases (good balance of speed and quality)
- Combine with DBCache for maximum acceleration
- Higher orders (2-3) may improve quality but reduce speed gains
### 3. SCM (Step Computation Masking)
SCM allows you to specify which steps must be computed and which can use cached results, similar to LeMiCa/EasyCache style acceleration.
**Key Parameters**:
- `scm_steps_mask_policy` (str | None, default: None): Predefined mask policy. Options:
- `None`: SCM disabled (default)
- `"slow"`: More compute steps, higher quality (18 compute steps out of 28)
- `"medium"`: Balanced (15 compute steps out of 28)
- `"fast"`: More cache steps, faster inference (11 compute steps out of 28)
- `"ultra"`: Maximum speed (8 compute steps out of 28)
- `scm_steps_policy` (str, default: "dynamic"): Policy for cached steps:
- `"dynamic"`: Use dynamic cache for masked steps (recommended)
- `"static"`: Use static cache for masked steps
**Example Configuration**:
```python
cache_config={
"scm_steps_mask_policy": "medium", # Balanced speed/quality
"scm_steps_policy": "dynamic", # Use dynamic cache
}
```
**Performance Tips**:
- SCM is disabled by default (`scm_steps_mask_policy=None`). Enable it by setting a policy value if you need additional acceleration
- Start with `"medium"` policy and adjust based on quality requirements
- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised
- `"dynamic"` policy generally provides better quality than `"static"`
- SCM mask is automatically regenerated when `num_inference_steps` changes during inference
## Configuration Reference
### DiffusionCacheConfig Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) |
| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion |
| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) |
| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) |
| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) |
| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) |
| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) |
| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) |
| `taylorseer_order` | int | 1 | Taylor expansion order |
| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") |
| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") |
## Example: Accelerate Text-to-Image Generation with CacheDiT
See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example with cache-dit acceleration.
```bash
# Enable cache-dit with hybrid acceleration
cd examples/offline_inference/text_to_image
python text_to_image.py \
--model Qwen/Qwen-Image \
--prompt "a cup of coffee on the table" \
--cache_backend cache_dit \
--num_inference_steps 50
```
The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer:
```python
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
cache_config={
# Scheme: Hybrid DBCache + SCM + TaylorSeer
# DBCache
"Fn_compute_blocks": 8,
"Bn_compute_blocks": 0,
"max_warmup_steps": 4,
"residual_diff_threshold": 0.12,
# TaylorSeer
"enable_taylorseer": True,
"taylorseer_order": 1,
# SCM
"scm_steps_mask_policy": "fast", # Set to None to disable SCM
"scm_steps_policy": "dynamic",
},
)
```
You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements.
To test another model, change `--model` to the target model identifier (e.g., `Tongyi-MAI/Z-Image-Turbo`) and update `cache_config` according to the model architecture (e.g., the number of transformer blocks), as sketched below.
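For instance, an invocation for Z-Image-Turbo might look like the following (a sketch reusing the script flags from the example above; the step count follows the Z-Image-Turbo examples used elsewhere in these docs, and the `cache_config` dictionary is edited inside the script):

```bash
cd examples/offline_inference/text_to_image
python text_to_image.py \
    --model Tongyi-MAI/Z-Image-Turbo \
    --prompt "a cup of coffee on the table" \
    --cache_backend cache_dit \
    --num_inference_steps 9
```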
## Additional Resources
- [Cache-DiT User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/)
- [Cache-DiT Benchmark](https://cache-dit.readthedocs.io/en/latest/benchmark/HYBRID_CACHE/)
- [DBCache Technical Details](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)
# CPU Offloading for Diffusion Model
## Overview
vLLM-Omni provides two offloading strategies to reduce GPU memory usage for diffusion models, allowing you to run larger models on GPUs with limited VRAM:
1. **Model-level (Component) Offloading**: Swaps entire model components (DiT transformer, VAE, encoders) between GPU and CPU.
2. **Layerwise (Blockwise) Offloading**: Keeps only one or a few transformer blocks on GPU at a time, overlapping compute with memory copies.
Both approaches use pinned memory for faster CPU-GPU transfers. For now, the two offloading strategies cannot be used at the same time.
## Model-level CPU Offloading
### Implementation
CPU offload lets the diffusion worker move large model components between GPU and CPU memory on demand. It keeps the DiT transformer resident on GPU only while it is actively running, and swaps it out when encoder modules need the device. This reduces peak VRAM usage so that bigger checkpoints can run on smaller GPUs, or multiple requests can share the same GPU.
**Execution Flow**:
1. Text encoders run on GPU while the DiT transformer is offloaded to CPU.
2. Before denoising, the transformer weights are prefetched back to GPU using pinned-memory copies for speed.
3. After the diffusion step, the transformer returns to CPU and the process repeats as needed.
Transfers use pinned host buffers, and the worker coordinates swaps via mutex-style hooks so components never compete for memory.
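The transfer pattern can be sketched in plain PyTorch. The snippet below is only an illustration of pinned-memory swapping, not the actual worker code; `swap_out` and `swap_in` are hypothetical helpers:

```python
import torch
import torch.nn as nn


def swap_out(module: nn.Module) -> None:
    # Hypothetical helper: park a component's weights in pinned CPU memory
    # so the later copy back to GPU can be asynchronous.
    for p in module.parameters():
        p.data = p.data.detach().cpu().pin_memory()


def swap_in(module: nn.Module, device: str = "cuda") -> None:
    # Hypothetical helper: prefetch the component back to the GPU; pinned
    # source buffers allow non-blocking host-to-device copies.
    for p in module.parameters():
        p.data = p.data.to(device, non_blocking=True)
```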
### Configuration
You can enable CPU offload in two ways:
1. **Python API**: set `enable_cpu_offload=True`.
```python
from vllm_omni import Omni
if __name__ == "__main__":
    m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```
2. **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.
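For example (a sketch assuming the same `vllm serve ... --omni` entrypoint used elsewhere in these docs):

```bash
vllm serve Qwen/Qwen-Image --omni --port 8000 --enable-cpu-offload
```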
### Limitations
- Cold start latency increases by over one minute for some models (e.g., Qwen-Image)
## Layerwise (Blockwise) Offloading
### Implementation
Layerwise offload operates at transformer block granularity, keeping a single transformer block, or a specified number of blocks, on GPU while others stay in CPU memory.
Unlike full model-level CPU offload, which swaps entire components such as the DiT and encoders, layerwise offloading loads and offloads weights between GPU and CPU in a sliding-window fashion: while block `i` computes, block `i+1` is prefetched asynchronously via pinned memory. Only a few blocks reside on GPU at any moment during inference, which greatly reduces memory occupancy.
**Execution Flow**:
1. During model initialization, all components are loaded to CPU first. Components other than the DiT model(s) in the pipeline, such as the VAE and encoders, are then moved to GPU. The weights of the target transformer blocks are collected into contiguous, pinned tensors per layer on CPU, while non-block modules (embeddings, norms, etc.) of the DiT model are moved to, and stay on, GPU.
2. The first block(s) are transferred to GPU during initialization of `LayerwiseOffloader`, before the first denoising step of the very first request.
3. As each block executes, the next block is prefetched on a separate CUDA stream, overlapping compute with memory copies. After execution, the current block is immediately freed from GPU memory.
4. When the last block completes, the first block is prefetched for the next denoising step.
Example of hook execution for a DiT model with n layers (by default, a single layer is kept on GPU):
| Layer (block) idx | forward pre-hook | forward | forward post-hook |
|-------------------|--------------------------------|------------------|---------------------------|
| layer-0 | prefetch layer 1 (copy stream) | compute layer 0 | free layer-0 gpu weights |
| layer-1 | prefetch layer 2 (copy stream) | compute layer 1 | free layer-1 gpu weights |
| layer-2 | prefetch layer 3 (copy stream) | compute layer 2 | free layer-2 gpu weights |
| ... | ... | ... | ... |
| layer-(n-1) | **prefetch layer 0 (copy stream)** | compute layer (n-1) | free layer (n-1) gpu weights |
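The table above can be approximated with ordinary PyTorch forward hooks. The following is an illustrative toy of the prefetch/free pattern, not the actual `LayerwiseOffloader` implementation:

```python
import torch
import torch.nn as nn


def attach_layerwise_hooks(blocks: nn.ModuleList, device: str = "cuda") -> None:
    """Toy illustration of the prefetch/free pattern (not the real offloader)."""
    copy_stream = torch.cuda.Stream()
    n = len(blocks)

    def make_pre_hook(i: int):
        def pre_hook(module, args):
            # Prefetch the next block on a side stream while this block computes.
            # The last block wraps around and prefetches block 0 for the next step.
            with torch.cuda.stream(copy_stream):
                blocks[(i + 1) % n].to(device, non_blocking=True)
            # Real code must also synchronize copy_stream before the
            # prefetched block actually runs.
        return pre_hook

    def post_hook(module, args, output):
        module.to("cpu")  # free this block's GPU weights right after its forward
        return output

    for i, blk in enumerate(blocks):
        blk.register_forward_pre_hook(make_pre_hook(i))
        blk.register_forward_hook(post_hook)
```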
### Configuration
1. **Python API**: set `enable_layerwise_offload=True` and optionally `layerwise_num_gpu_layers`.
```python
from vllm_omni import Omni
if __name__ == "__main__":
    m = Omni(
        model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
        enable_layerwise_offload=True,
        ...
    )
```
2. **CLI**: pass `--enable-layerwise-offload` and `--layerwise-num-gpu-layers` to the diffusion service entrypoint.
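For example (a sketch assuming the same `vllm serve ... --omni` entrypoint used elsewhere in these docs; the layer count is illustrative):

```bash
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8000 \
    --enable-layerwise-offload --layerwise-num-gpu-layers 2
```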
### Supported Models
| Architecture | Models | Example HF Models | DiT Model Cls | Blocks Attr Name |
|--------------|--------|-------------------|----------|----------|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` | `QwenImageTransformer2DModel` | "transformer_blocks" |
| `Wan22Pipeline` | Wan2.2 | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `WanTransformer3DModel` | "blocks" |
NOTE: Models must define the `_layerwise_offload_blocks_attr` class attribute so that the layerwise offloader can find the target transformer blocks.
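A minimal sketch of what this class attribute might look like on a DiT model, using the attribute name from the table above (`MyDiTModel` is a hypothetical stand-in):

```python
import torch.nn as nn


class MyDiTModel(nn.Module):
    # Tells the layerwise offloader which ModuleList holds the transformer
    # blocks to stream between CPU and GPU.
    _layerwise_offload_blocks_attr = "transformer_blocks"

    def __init__(self, num_layers: int = 4, dim: int = 64):
        super().__init__()
        self.transformer_blocks = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_layers)
        )
```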
### Limitations
- Cold start latency increases because
    1) all components are first loaded to CPU during initialization, and
    2) the block weights must be consolidated and pinned.
- Performance depends on the CPU <-> GPU interconnect (e.g., PCIe bandwidth).
- Only a single GPU is supported for now.
# Parallelism Acceleration Guide
This guide describes how to use parallelism methods in vLLM-Omni to speed up diffusion model inference and reduce the per-device memory requirement.
## Overview
The following parallelism methods are currently supported in vLLM-Omni:
1. DeepSpeed Ulysses Sequence Parallel (DeepSpeed Ulysses-SP) ([arxiv paper](https://arxiv.org/pdf/2309.14509)): Ulysses-SP splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
2. [Ring-Attention](#ring-attention): Ring-Attention splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
3. Classifier-Free-Guidance Parallel (CFG-Parallel): CFG-Parallel runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
4. [Tensor Parallelism](#tensor-parallelism): Tensor parallelism shards model weights across devices. This can reduce per-GPU memory usage. Note that for diffusion models we currently shard the majority of layers within the DiT.
The following table shows which models are currently supported by parallelism method:
### ImageGen
| Model | Model Identifier | Ulysses-SP | Ring-SP | CFG-Parallel | Tensor-Parallel |
|--------------------------|--------------------------------------|:----------:|:-------:|:------------:|:---------------:|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ✅ | ✅ | ❌ | ✅ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ✅ | ✅ | ❌ | ✅ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ❌ | ❌ | ❌ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ✅ (TP=2 only) |
| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ❌ | ❌ | ❌ |
| **FLUX.2-klein** | `black-forest-labs/FLUX.2-klein-4B` | ❌ | ❌ | ❌ | ✅ |
| **FLUX.1-dev** | `black-forest-labs/FLUX.1-dev` | ❌ | ❌ | ❌ | ✅ |
!!! note "TP Limitations for Diffusion Models"
We currently implement Tensor Parallelism (TP) only for the DiT (Diffusion Transformer) blocks. This is because the `text_encoder` component in vLLM-Omni uses the original Transformers implementation, which does not yet support TP.
- Good news: The text_encoder typically has minimal impact on overall inference performance.
- Bad news: When TP is enabled, every TP process retains a full copy of the text_encoder weights, leading to significant GPU memory waste.
We are actively refactoring this design to address this. For details and progress, please refer to [Issue #771](https://github.com/vllm-project/vllm-omni/issues/771).
!!! note "Why Z-Image is TP=2 only"
Z-Image Turbo is currently limited to `tensor_parallel_size` of **1 or 2** due to model shape divisibility constraints.
For example, the model has `n_heads=30` and a final projection out dimension of `64`, so valid TP sizes must divide both 30 and 64; the only common divisors are **1 and 2**.
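A quick way to sanity-check that constraint (a trivial sketch, not project code):

```python
# Valid TP sizes must divide both the number of attention heads (30)
# and the final projection output dimension (64).
n_heads, proj_out = 30, 64
valid_tp = [d for d in range(1, min(n_heads, proj_out) + 1)
            if n_heads % d == 0 and proj_out % d == 0]
print(valid_tp)  # [1, 2]
```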
### VideoGen
| Model | Model Identifier | Ulysses-SP | Ring-SP | Tensor-Parallel |
|-------|------------------|------------|---------|--------------------------|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ✅ | ✅ | ❌ |
### Tensor Parallelism
Tensor parallelism splits model parameters across GPUs. In vLLM-Omni, tensor parallelism is configured via `DiffusionParallelConfig.tensor_parallel_size`.
#### Offline Inference
```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Tongyi-MAI/Z-Image-Turbo",
parallel_config=DiffusionParallelConfig(tensor_parallel_size=2),
)
outputs = omni.generate(
"a cat reading a book",
OmniDiffusionSamplingParams(
num_inference_steps=9,
width=512,
height=512,
),
)
```
### Sequence Parallelism
#### Ulysses-SP
##### Offline Inference
An example of offline inference script using [Ulysses-SP](https://arxiv.org/pdf/2309.14509) is shown below:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ulysses_degree=2)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
##### Online Serving
You can enable Ulysses-SP in online serving for diffusion models via `--usp`:
```bash
# Text-to-image (requires >= 2 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2
```
##### Benchmarks
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA H800 GPUs, and `sdpa` is the attention backend.
| Configuration | Ulysses degree |Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 112.5s | 1.0x |
| Ulysses-SP | 2 | 65.2s | 1.73x |
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |
#### Ring-Attention
Ring-Attention ([arxiv paper](https://arxiv.org/abs/2310.01889)) splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results. Unlike Ulysses-SP which uses all-to-all communication, Ring-Attention keeps the sequence dimension sharded throughout the computation and circulates Key/Value blocks through a ring topology.
##### Offline Inference
An example of offline inference script using Ring-Attention is shown below:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ring_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ring_degree=2)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
##### Online Serving
You can enable Ring-Attention in online serving for diffusion models via `--ring`:
```bash
# Text-to-image (requires >= 2 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --ring 2
```
##### Benchmarks
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **1024x1024** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA A100 GPUs, and `flash_attn` is the attention backend.
| Configuration | Ring degree |Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 45.2s | 1.0x |
| Ring-Attention | 2 | 29.9s | 1.51x |
| Ring-Attention | 4 | 23.3s | 1.94x |
#### Hybrid Ulysses + Ring
You can combine both Ulysses-SP and Ring-Attention for larger scale parallelism. The total sequence parallel size equals `ulysses_degree × ring_degree`.
##### Offline Inference
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
# Hybrid: 2 Ulysses × 2 Ring = 4 GPUs total
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ulysses_degree=2, ring_degree=2)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
##### Online Serving
```bash
# Text-to-image (requires >= 4 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 --ring 2
```
##### Benchmarks
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **1024x1024** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA A100 GPUs, and `flash_attn` is the attention backend.
| Configuration | Ulysses degree | Ring degree | Generation Time | Speedup |
|---------------|----------------|-------------|-----------------|---------|
| **Baseline (diffusers)** | - | - | 45.2s | 1.0x |
| Hybrid Ulysses + Ring | 2 | 2 | 24.3s | 1.87x |
##### How to parallelize a new model
NOTE: "Terminology: SP vs CP"
Our "Sequence Parallelism" (SP) corresponds to "Context Parallelism" (CP) in the [diffusers library](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/_modeling_parallel.py).
We use "Sequence Parallelism" to align with vLLM-Omni's terminology.
---
###### Non-intrusive `_sp_plan` (Recommended)
The `_sp_plan` mechanism allows SP without modifying `forward()` logic. The framework automatically registers hooks to shard inputs and gather outputs at module boundaries.
**Requirements for `forward()` function:**
- Tensor operations that need sharding/gathering must happen at **`nn.Module` boundaries** (not inline Python operations)
- If your `forward()` contains inline tensor operations (e.g., `torch.cat`, `pad_sequence`) that need sharding, **extract them into a submodule**
**When to create a submodule:**
```python
# ❌ BAD: Inline operations - hooks cannot intercept
def forward(self, x, cap_feats):
unified = torch.cat([x, cap_feats], dim=1) # Cannot be sharded via _sp_plan
...
# ✅ GOOD: Extract into a submodule
class UnifiedPrepare(nn.Module):
def forward(self, x, cap_feats):
return torch.cat([x, cap_feats], dim=1) # Now can be sharded via _sp_plan
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.unified_prepare = UnifiedPrepare()  # Submodule

    def forward(self, x, cap_feats):
        unified = self.unified_prepare(x, cap_feats)  # Hook can intercept here
```
---
###### Defining `_sp_plan`
**Type definitions** (see [diffusers `_modeling_parallel.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/_modeling_parallel.py) for reference):
```python
from vllm_omni.diffusion.distributed.sp_plan import (
SequenceParallelInput, # Corresponds to diffusers' ContextParallelInput
SequenceParallelOutput, # Corresponds to diffusers' ContextParallelOutput
)
```
| Parameter | Description |
|-----------|-------------|
| `split_dim` | Dimension to split/gather (usually `1` for sequence) |
| `expected_dims` | Expected tensor rank for validation (optional) |
| `split_output` | `False`: shard **input** parameters; `True`: shard **output** tensors |
| `auto_pad` | Auto-pad if sequence not divisible by world_size (Ulysses only) |
**Key naming convention:**
| Key | Meaning | Python equivalent |
|-----|---------|-------------------|
| `""` | Root model | `model` |
| `"blocks.0"` | First element of ModuleList | `model.blocks[0]` |
| `"blocks.*"` | All elements of ModuleList | `for b in model.blocks` |
| `"outputs.main"` | ModuleDict entry | `model.outputs["main"]` |
**Dictionary key types:**
| Key type | `split_output` | Description |
|----------|----------------|-------------|
| `"param_name"` (str) | `False` | Shard **input parameter** by name |
| `0`, `1` (int) | `True` | Shard **output tuple** by index |
**Example** (similar to [diffusers `transformer_wan.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_wan.py)):
```python
class MyTransformer(nn.Module):
_sp_plan = {
# Shard rope module OUTPUTS (returns tuple)
"rope": {
0: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True), # cos
1: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True), # sin
},
# Shard transformer block INPUT parameter
"blocks.0": {
"hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
},
# Gather at final projection
"proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
}
```
---
###### Hook flow
```
Input → [SequenceParallelSplitHook: pre_forward] → Module.forward() → [post_forward] → ...
... → [SequenceParallelGatherHook: post_forward] → Output
```
1. **SplitHook** shards tensors before/after the target module
2. **Attention layers** handle Ulysses/Ring communication internally
3. **GatherHook** collects sharded outputs
The framework automatically applies these hooks when `sequence_parallel_size > 1`.
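Conceptually, the split hook performs something like the following before the target module runs; this is an illustrative sketch with generic `torch.distributed` calls, not the framework's actual hook code:

```python
import torch
import torch.distributed as dist


def shard_along_seq(x: torch.Tensor, split_dim: int = 1) -> torch.Tensor:
    # Each rank keeps only its chunk of the sequence dimension; the attention
    # layers later handle cross-rank communication (all-to-all for Ulysses,
    # ring P2P for Ring-Attention), and a gather hook restores the full
    # sequence at the configured output module.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    return torch.chunk(x, world_size, dim=split_dim)[rank].contiguous()
```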
---
###### Method 2: Intrusive modification (For complex cases)
For models with dynamic sharding logic that cannot be expressed via `_sp_plan`:
```python
from vllm_omni.diffusion.distributed.sp_sharding import sp_shard, sp_gather
def forward(self, hidden_states, ...):
if self.parallel_config.sequence_parallel_size > 1:
hidden_states = sp_shard(hidden_states, dim=1)
# ... computation ...
output = sp_gather(output, dim=1)
return output
```
---
###### Choosing the right approach
| Scenario | Approach |
|----------|----------|
| Standard transformer | `_sp_plan` |
| Inline tensor ops need sharding | Extract to submodule + `_sp_plan` |
| Dynamic/conditional sharding | Intrusive modification |
### CFG-Parallel
#### Offline Inference
CFG-Parallel is enabled through `DiffusionParallelConfig(cfg_parallel_size=2)`, which runs one rank for the positive branch and one rank for the negative branch.
An example of offline inference using CFG-Parallel (image-to-image) is shown below:
```python
from PIL import Image

from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
image_path = "path_to_image.png"
omni = Omni(
model="Qwen/Qwen-Image-Edit",
parallel_config=DiffusionParallelConfig(cfg_parallel_size=2),
)
input_image = Image.open(image_path).convert("RGB")
outputs = omni.generate(
{
"prompt": "turn this cat to a dog",
"negative_prompt": "low quality, blurry",
"multi_modal_data": {"image": input_image},
},
OmniDiffusionSamplingParams(
true_cfg_scale=4.0,
num_inference_steps=50,
),
)
```
Notes:
- CFG-Parallel is only effective when a `negative_prompt` is provided AND a guidance scale (or `cfg_scale`) is greater than 1.
See `examples/offline_inference/image_to_image/image_edit.py` for a complete working example.
```bash
cd examples/offline_inference/image_to_image/
python image_edit.py \
--model "Qwen/Qwen-Image-Edit" \
--image "qwen_image_output.png" \
--prompt "turn this cat to a dog" \
--negative_prompt "low quality, blurry" \
--cfg_scale 4.0 \
--output "edited_image.png" \
--cfg_parallel_size 2
```
#### Online Serving
You can enable CFG-Parallel in online serving for diffusion models via `--cfg-parallel-size`:
```bash
vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 --cfg-parallel-size 2
```
#### How to parallelize a pipeline
This section describes how to add CFG-Parallel to a diffusion **pipeline**. We use the Qwen-Image pipeline (`vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py`) as the reference implementation.
In `QwenImagePipeline`, each diffusion step runs two denoiser forward passes sequentially:
- positive (prompt-conditioned)
- negative (negative-prompt-conditioned)
CFG-Parallel assigns these two branches to different ranks in the **CFG group** and synchronizes the results.
vLLM-Omni provides a `CFGParallelMixin` base class that encapsulates the CFG-parallel logic. By inheriting from this mixin and calling its methods, pipelines can implement CFG parallelism without writing repetitive code.
**Key Methods in CFGParallelMixin:**
- `predict_noise_maybe_with_cfg()`: Automatically handles CFG parallel noise prediction
- `scheduler_step_maybe_with_cfg()`: Scheduler step with automatic CFG rank synchronization
**Example Implementation:**
```python
class QwenImageCFGParallelMixin(CFGParallelMixin):
"""
Base Mixin class for Qwen Image pipelines providing shared CFG methods.
"""
def diffuse(
self,
prompt_embeds: torch.Tensor,
prompt_embeds_mask: torch.Tensor,
negative_prompt_embeds: torch.Tensor,
negative_prompt_embeds_mask: torch.Tensor,
latents: torch.Tensor,
img_shapes: torch.Tensor,
txt_seq_lens: torch.Tensor,
negative_txt_seq_lens: torch.Tensor,
timesteps: torch.Tensor,
do_true_cfg: bool,
guidance: torch.Tensor,
true_cfg_scale: float,
image_latents: torch.Tensor | None = None,
cfg_normalize: bool = True,
additional_transformer_kwargs: dict[str, Any] | None = None,
) -> torch.Tensor:
self.transformer.do_true_cfg = do_true_cfg
for i, t in enumerate(timesteps):
timestep = t.expand(latents.shape[0]).to(device=latents.device, dtype=latents.dtype)
# Prepare kwargs for positive (conditional) prediction
positive_kwargs = {
"hidden_states": latents,
"timestep": timestep / 1000,
"guidance": guidance,
"encoder_hidden_states_mask": prompt_embeds_mask,
"encoder_hidden_states": prompt_embeds,
"img_shapes": img_shapes,
"txt_seq_lens": txt_seq_lens,
}
# Prepare kwargs for negative (unconditional) prediction
if do_true_cfg:
negative_kwargs = {
"hidden_states": latents,
"timestep": timestep / 1000,
"guidance": guidance,
"encoder_hidden_states_mask": negative_prompt_embeds_mask,
"encoder_hidden_states": negative_prompt_embeds,
"img_shapes": img_shapes,
"txt_seq_lens": negative_txt_seq_lens,
}
else:
negative_kwargs = None
# Predict noise with automatic CFG parallel handling
# - In CFG parallel mode: rank0 computes positive, rank1 computes negative
# - Automatically gathers results and combines them on rank0
noise_pred = self.predict_noise_maybe_with_cfg(
do_true_cfg=do_true_cfg,
true_cfg_scale=true_cfg_scale,
positive_kwargs=positive_kwargs,
negative_kwargs=negative_kwargs,
cfg_normalize=cfg_normalize,
)
# Step scheduler with automatic CFG synchronization
# - Only rank0 computes the scheduler step
# - Automatically broadcasts updated latents to all ranks
latents = self.scheduler_step_maybe_with_cfg(
noise_pred, t, latents, do_true_cfg
)
return latents
```
**How it works:**
1. Prepare separate `positive_kwargs` and `negative_kwargs` for conditional and unconditional predictions
2. Call `predict_noise_maybe_with_cfg()` which:
- Detects if CFG parallel is enabled (`get_classifier_free_guidance_world_size() > 1`)
- Distributes computation: rank0 processes positive, rank1 processes negative
- Gathers predictions and combines them using `combine_cfg_noise()` on rank0
- Returns combined noise prediction (only valid on rank0)
3. Call `scheduler_step_maybe_with_cfg()` which:
- Only rank0 computes the scheduler step
- Broadcasts the updated latents to all ranks for synchronization
**How to customize**
Some pipelines may need to customize the following functions in `CFGParallelMixin`:
1. You may need to override the `predict_noise` function for custom behavior.
```python
def predict_noise(self, *args, **kwargs):
"""
Forward pass through transformer to predict noise.
Subclasses should override this if they need custom behavior,
but the default implementation calls self.transformer.
"""
return self.transformer(*args, **kwargs)[0]
```
2. The default normalization function after combining the noise predictions from both branches is as follows. You may need to customize it.
```python
def cfg_normalize_function(self, noise_pred, comb_pred):
"""
Normalize the combined noise prediction.
Args:
noise_pred: positive noise prediction
comb_pred: combined noise prediction after CFG
Returns:
Normalized noise prediction tensor
"""
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)
return noise_pred
```
# TeaCache Configuration Guide
TeaCache speeds up diffusion model inference by caching transformer computations when consecutive timesteps are similar. This typically provides **1.5x-2.0x speedup** with minimal quality loss.
## Quick Start
Enable TeaCache by setting `cache_backend` to `"tea_cache"`:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={
"rel_l1_thresh": 0.2 # Optional, defaults to 0.2
}
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
### Using Environment Variable
You can also enable TeaCache via environment variable:
```bash
export DIFFUSION_CACHE_BACKEND=tea_cache
```
Then initialize without explicitly setting `cache_backend`:
```python
from vllm_omni import Omni
omni = Omni(
model="Qwen/Qwen-Image",
cache_config={"rel_l1_thresh": 0.2} # Optional
)
```
## Online Serving (OpenAI-Compatible)
Enable TeaCache for online serving by passing `--cache-backend tea_cache` when starting the server:
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
--cache-backend tea_cache \
--cache-config '{"rel_l1_thresh": 0.2}'
```
## Configuration Parameters
### `rel_l1_thresh` (float, default: `0.2`)
Controls the balance between speed and quality. Lower values prioritize quality, higher values prioritize speed.
**Recommended values:**
- `0.2` - **~1.5x speedup** with minimal quality loss (recommended)
- `0.4` - **~1.8x speedup** with slight quality loss
- `0.6` - **~2.0x speedup** with noticeable quality loss
- `0.8` - **~2.25x speedup** with significant quality loss
## Examples
### Python API
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={"rel_l1_thresh": 0.2}
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
## Performance Tuning
Start with the default `rel_l1_thresh=0.2` and adjust based on your needs:
- **Maximum quality**: Use `0.1-0.2`
- **Balanced**: Use `0.2-0.4` (recommended)
- **Maximum speed**: Use `0.6-0.8` (may reduce quality)
## Troubleshooting
### Quality Degradation
If you notice quality issues, lower the threshold:
```python
cache_config={"rel_l1_thresh": 0.1} # More conservative caching
```
## Supported Models
### ImageGen
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` |
### VideoGen
No VideoGen models are supported by TeaCache yet.
### Coming Soon
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `FluxPipeline` | Flux | - |
| `CogVideoXPipeline` | CogVideoX | - |
# Diffusion Acceleration Overview
vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods**, which intelligently cache intermediate computations to avoid redundant work across diffusion timesteps, and **parallelism methods**, which distribute the computation across multiple devices.
## Supported Acceleration Methods
vLLM-Omni currently supports two main cache acceleration backends:
1. **[TeaCache](diffusion/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
2. **[Cache-DiT](diffusion/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
- **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
- **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
- **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality.
vLLM-Omni also supports parallelism methods for diffusion models, including:
1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
2. [Ring-Attention](diffusion/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
3. [CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
## Quick Comparison
### Cache Methods
| Method | Configuration | Description | Best For |
|--------|--------------|-------------|----------|
| **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality |
| **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control |
## Supported Models
The following table shows which models are currently supported by each acceleration method:
### ImageGen
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:----------:|:-----------:|:-----------:|:----------------:|:----------------:|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ❌ | ❌ |
| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Bagel** | `ByteDance-Seed/BAGEL-7B-MoT` | ✅ | ✅ | ❌ | ❌ | ❌ |
### VideoGen
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention |CFG-Parallel |
|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:----------------:|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ |
## Performance Benchmarks
The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Image-Edit** models generating 1024x1024 images with 50 inference steps:
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
For optimal performance in your specific scenario, we recommend experimenting with different parameter configurations as described in the detailed guides below.
| Model | Cache Backend | Cache Config | Generation Time | Speedup | Notes |
|-------|---------------|--------------|----------------|---------|-------|
| **Qwen/Qwen-Image** | None | None | 20.0s | 1.0x | Baseline (diffusers) |
| **Qwen/Qwen-Image** | TeaCache | `rel_l1_thresh=0.2` | 10.47s | **1.91x** | Recommended default setting |
| **Qwen/Qwen-Image** | Cache-DiT | DBCache + TaylorSeer (Fn=1, Bn=0, W=8, TaylorSeer order=1) | 10.8s | **1.85x** | - |
| **Qwen/Qwen-Image** | Cache-DiT | DBCache + TaylorSeer + SCM (Fn=8, Bn=0, W=4, TaylorSeer order=1, SCM fast) | 14.0s | **1.43x** | - |
| **Qwen/Qwen-Image-Edit** | None | No acceleration | 51.5s | 1.0x | Baseline (diffusers) |
| **Qwen/Qwen-Image-Edit** | Cache-DiT | Default (Fn=1, Bn=0, W=4, TaylorSeer disabled, SCM disabled) | 21.6s | **2.38x** | - |
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA H800 GPUs, and `sdpa` is the attention backend.
| Configuration | Ulysses degree |Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 112.5s | 1.0x |
| Ulysses-SP | 2 | 65.2s | 1.73x |
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |
## Quick Start
### Using TeaCache
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={"rel_l1_thresh": 0.2} # Optional, defaults to 0.2
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
### Using Cache-DiT
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
cache_config={
"Fn_compute_blocks": 1,
"Bn_compute_blocks": 0,
"max_warmup_steps": 8,
"enable_taylorseer": True,
"taylorseer_order": 1,
}
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
### Using Ulysses-SP
Run text-to-image:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
Run image-to-image:
```python
from PIL import Image
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
# Load the source image to edit (placeholder path; a PIL image is assumed here)
input_image = Image.open("/path/to/image.jpg")
ulysses_degree = 2
omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
)
outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "multi_modal_data": {"image": input_image}
    },
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```
### Using Ring-Attention
Run text-to-image:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ring_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ring_degree=ring_degree)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
### Using CFG-Parallel
CFG-Parallel splits the CFG positive/negative branches across GPUs. Use it when you set a non-trivial `true_cfg_scale`.
Run image-to-image:
```python
from PIL import Image
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
# Load the source image to edit (placeholder path; a PIL image is assumed here)
input_image = Image.open("/path/to/image.jpg")
cfg_parallel_size = 2
omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(cfg_parallel_size=cfg_parallel_size)
)
outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "multi_modal_data": {"image": input_image}
    },
    OmniDiffusionSamplingParams(num_inference_steps=50, true_cfg_scale=4.0),
)
```
## Documentation
For detailed information on each acceleration method:
- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on configuring sequence parallelism (Ulysses-SP and Ring-Attention)
- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on configuring CFG-Parallel to run the positive/negative branches across ranks
# BAGEL-7B-MoT
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel>.
## Set up
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, modify the stage configuration to distribute the model across devices (a sketch follows the configuration tables below).
Change into the bagel folder:
```bash
cd examples/offline_inference/bagel
```
### Modality Control
BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:
#### Text to Image (text2img)
- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends KV cache to DiT for conditioned generation
Generate images from text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat"
```
#### Image to Image (img2img)
- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage, direct image-to-image transformation
Transform images based on text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2img \
--image-path /path/to/image.jpg \
--prompts "Let the woman wear a blue dress"
```
#### Image to Text (img2text)
- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding AND ViT semantic encoding for comprehensive image understanding
Generate text descriptions from images:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2text \
--image-path /path/to/image.jpg \
--prompts "Describe this image in detail"
```
#### Text to Text (text2text)
- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved, operates as pure language model
Pure text generation:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--prompts "What is the capital of France?"
# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--txt-prompts /path/to/prompts.txt
```
### Inference Steps
Control the number of inference steps for image generation:
```bash
# The default is 50 steps; increasing --steps (e.g., to 100) can improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--steps 50 \
--prompts "A cute cat"
```
### Key arguments
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default YAML configuration deploys the Thinker and DiT stages on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
#### 📌 Command Line Arguments (end2end.py)
| Argument | Type | Default | Description |
| :--------------------- | :----- | :---------------------------- | :----------------------------------------------------------- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts directly |
| `--txt-prompts` | string | `None` | Path to txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Initialization sleep time |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |
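For example, to point `end2end.py` at a custom stage configuration (useful for the dual-GPU setup mentioned above), combine the documented arguments; the YAML path below is a placeholder:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --stage-configs-path /path/to/custom_bagel.yaml \
    --prompts "A cute cat"
```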
------
#### ⚙️ Stage Configuration Parameters (bagel.yaml)
**Stage 0 - Thinker (LLM Stage)**
| Parameter | Value | Description |
| :------------------------------- | :------------------------------ | :----------------------- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send KV cache |
------
**Stage 1 - DiT (Diffusion Stage)**
| Parameter | Value | Description |
| :------------------------------- | :---------- | :-------------------------- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive KV cache |
| `engine_input_source` | `[0]` | Input source from Stage 0 |
------
#### 🔗 Runtime Configuration
| Parameter | Value | Description |
| :-------------------- | :------ | :------------------------------- |
| `window_size` | `-1` | Window size (-1 means unlimited) |
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |
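For dual-GPU setups, the stage configuration can place the Thinker and DiT stages on different devices. Below is a minimal sketch assembled only from the parameters documented above; the actual top-level layout of `bagel.yaml` may differ, so treat it as an illustration rather than a drop-in replacement:
```yaml
# Sketch only: field names are taken from the tables above; the real
# bagel.yaml may nest or name the stage entries differently.
- stage_type: llm
  devices: "0"                      # Thinker on GPU 0
  max_batch_size: 1
  model_stage: thinker
  model_arch: BagelForConditionalGeneration
  gpu_memory_utilization: 0.4
  tensor_parallel_size: 1
  max_num_batched_tokens: 32768
  omni_kv_config:
    need_send_cache: true           # Thinker sends its KV cache to the DiT stage
- stage_type: diffusion
  devices: "1"                      # DiT moved to GPU 1 instead of sharing GPU 0
  max_batch_size: 1
  model_stage: dit
  gpu_memory_utilization: 0.4
  omni_kv_config:
    need_recv_cache: true
  engine_input_source: [0]          # Consumes the output of Stage 0
```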
## FAQ
- If you encounter an error about librosa's audio backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you hit an out-of-memory (OOM) error, try decreasing `max_model_len`. Approximate per-stage VRAM usage:
| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |