# Supported Models
vLLM-Omni supports unified multimodal comprehension and generation models across various tasks.
## Model Implementation
If vLLM-Omni natively supports a model, its implementation can be found in <gh-file:vllm_omni/model_executor/models> and <gh-file:vllm_omni/diffusion/models>.
## List of Supported Models for Nvidia GPU / AMD GPU
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `Qwen3OmniMoeForConditionalGeneration` | Qwen3-Omni | `Qwen/Qwen3-Omni-30B-A3B-Instruct` |
| `Qwen2_5OmniForConditionalGeneration` | Qwen2.5-Omni | `Qwen/Qwen2.5-Omni-7B`, `Qwen/Qwen2.5-Omni-3B` |
| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` |
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImagePipeline` | Qwen-Image-2512 | `Qwen/Qwen-Image-2512` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
|`ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` |
| `WanPipeline` | Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` |
| `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` |
| `OvisImagePipeline` | Ovis-Image | `OvisAI/Ovis-Image` |
|`LongcatImagePipeline` | LongCat-Image | `meituan-longcat/LongCat-Image` |
|`LongCatImageEditPipeline` | LongCat-Image-Edit | `meituan-longcat/LongCat-Image-Edit` |
|`StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` |
|`Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` |
|`FluxPipeline` | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` |
|`StableAudioPipeline` | Stable-Audio-Open | `stabilityai/stable-audio-open-1.0` |
|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` |
|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` |
|`Qwen3TTSForConditionalGeneration` | Qwen3-TTS-12Hz-1.7B-Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` |
## List of Supported Models for NPU
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `Qwen2_5OmniForConditionalGeneration` | Qwen2.5-Omni | `Qwen/Qwen2.5-Omni-7B`, `Qwen/Qwen2.5-Omni-3B`|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImagePipeline` | Qwen-Image-2512 | `Qwen/Qwen-Image-2512` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2511 | `Qwen/Qwen-Image-Edit-2511` |
|`ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` |
# Image Edit API
vLLM-Omni provides an OpenAI DALL-E compatible API for image editing using diffusion models.
Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).
## Quick Start
### Start the Server
For example...
```bash
# Qwen-Image-Edit-2511
vllm serve Qwen/Qwen-Image-Edit-2511 --omni --port 8000
```
### Generate Images
**Using curl:**
```bash
curl -s -D >(grep -i x-request-id >&2) \
-o >(jq -r '.data[0].b64_json' | base64 --decode > gift-basket.png) \
-X POST "http://localhost:8000/v1/images/edits" \
-F "model=xxx" \
-F "image=@./xx.png" \
-F "prompt='this bear is wearing sportwear. holding a basketball, and bending one leg.'" \
-F "size=1024x1024" \
-F "output_format=png"
```
**Using OpenAI SDK:**
```python
import base64

from openai import OpenAI

client = OpenAI(
    api_key="None",
    base_url="http://localhost:8000/v1",
)

input_image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/omni-assets/qwen-bear.png"

result = client.images.edit(
    image=[],  # input images are passed via the `url` extension parameter below
    model="Qwen/Qwen-Image-Edit-2511",
    prompt="Change the bears in the two input images into walking together.",
    size="512x512",
    stream=False,
    output_format="jpeg",
    # URL-style input: one entry per input image (the same asset is reused here for illustration)
    extra_body={
        "url": [input_image_url, input_image_url],
        "num_inference_steps": 50,
        "guidance_scale": 1,
        "seed": 777,
    },
)

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("edit_out_http.jpeg", "wb") as f:
    f.write(image_bytes)
```
## API Reference
### Endpoint
```
POST /v1/images/edits
Content-Type: multipart/form-data
```
### Request Parameters
#### OpenAI Standard Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | string | **required** | A text description of the desired image |
| `model` | string | server's model | Model to use (optional, should match server if specified) |
| `image` | string or array | **required** | The image(s) to edit. |
| `n` | integer | 1 | Number of images to generate (1-10) |
| `size` | string | "auto" | Image dimensions in WxH format (e.g., "1024x1024", "512x512"); when set to "auto", the size is inferred from the first input image. |
| `response_format` | string | "b64_json" | Response format (only "b64_json" supported) |
| `user` | string | null | User identifier for tracking |
| `output_format` | string | "png" | The format in which the generated images are returned. Must be one of "png", "jpg", "jpeg", "webp". |
| `output_compression` | integer | 100 | The compression level (0-100%) for the generated images. |
| `background` | string or null | "auto" | Sets the transparency of the background for the generated image(s). |
#### vllm-omni Extension Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string or array | None | The image(s) to edit. |
| `negative_prompt` | string | null | Text describing what to avoid in the image |
| `num_inference_steps` | integer | model defaults | Number of diffusion steps |
| `guidance_scale` | float | model defaults | Classifier-free guidance scale (typically 0.0-20.0) |
| `true_cfg_scale` | float | model defaults | True CFG scale (model-specific parameter, may be ignored if not supported) |
| `seed` | integer | null | Random seed for reproducibility |
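For example, the extension parameters can be sent as additional form fields alongside the standard ones. The request below is a sketch that assumes the endpoint accepts these fields by name, as the table implies; the input file is a placeholder:

```bash
curl -X POST "http://localhost:8000/v1/images/edits" \
  -F "image=@./input.png" \
  -F "prompt=make the sky a deep sunset orange" \
  -F "negative_prompt=blurry, low quality" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.0" \
  -F "seed=42" \
  -F "size=1024x1024" \
  -F "output_format=png"
```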
### Response Format
```json
{
"created": 1701234567,
"data": [
{
"b64_json": "<base64-encoded PNG>",
"url": null,
"revised_prompt": null
}
],
"output_format": null,
"size": null,
}
```
## Examples
### Multiple Image Inputs
```bash
curl -s -D >(grep -i x-request-id >&2) \
-o >(jq -r '.data[0].b64_json' | base64 --decode > gift-basket.png) \
-X POST "http://localhost:8000/v1/images/edits" \
-F "model=xxx" \
-F "image=@xx.png" \
-F "image=@xx.png"
-F "prompt='this bear is wearing sportwear. holding a basketball, and bending one leg.'" \
-F "size=1024x1024" \
-F "output_format=png"
```
## Parameter Handling
The API passes parameters directly to the diffusion pipeline without model-specific transformation:
- **Default values**: When parameters are not specified, the underlying model uses its own defaults
- **Pass-through design**: User-provided values are forwarded directly to the diffusion engine
- **Minimal validation**: Only basic type checking and range validation at the API level
### Parameter Compatibility
The API passes parameters directly to the diffusion pipeline without model-specific validation.
- Unsupported parameters may be silently ignored by the model
- Incompatible values will result in errors from the underlying pipeline
- Recommended values vary by model - consult model documentation
**Best Practice:** Start with the model's recommended parameters, then adjust based on your needs.
## Error Responses
### 400 Bad Request
Invalid parameters (e.g., model mismatch):
```json
{
"detail": "Invalid size format: '1024x'. Expected format: 'WIDTHxHEIGHT' (e.g., '1024x1024')."
}
```
### 422 Unprocessable Entity
Validation errors (missing required fields):
```json
{
"detail": "Field 'image' or 'url' is required"
}
```
## Troubleshooting
### Server Not Running
```bash
# Check if server is responding
curl -X POST http://localhost:8000/v1/images/edits \
-F "prompt=test"
```
### Out of Memory
If you encounter OOM errors:
1. Reduce image size: `"size": "512x512"`
2. Reduce inference steps: `"num_inference_steps": 25`
## Development
Enable debug logging to see prompts and generation details:
```bash
vllm serve Qwen/Qwen-Image-Edit-2511 --omni \
--uvicorn-log-level debug
```
# Image Generation API
vLLM-Omni provides an OpenAI DALL-E compatible API for text-to-image generation using diffusion models.
Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).
## Quick Start
### Start the Server
For example...
```bash
# Qwen-Image
vllm serve Qwen/Qwen-Image --omni --port 8000
# Z-Image Turbo
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000
```
### Generate Images
**Using curl:**
```bash
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
"size": "1024x1024",
"seed": 42
}' | jq -r '.data[0].b64_json' | base64 -d > dragon.png
```
**Using Python:**
```python
import requests
import base64
from PIL import Image
import io
response = requests.post(
"http://localhost:8000/v1/images/generations",
json={
"prompt": "a black and white cat wearing a princess tiara",
"size": "1024x1024",
"num_inference_steps": 50,
"seed": 42,
}
)
# Decode and save
img_data = response.json()["data"][0]["b64_json"]
img_bytes = base64.b64decode(img_data)
img = Image.open(io.BytesIO(img_bytes))
img.save("cat.png")
```
**Using OpenAI SDK:**
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.images.generate(
model="Qwen/Qwen-Image",
prompt="a horse jumping over a fence nearby a babbling brook",
n=1,
size="1024x1024",
response_format="b64_json"
)
# Note: extension parameters (seed, num_inference_steps, guidance_scale) can be
# passed via extra_body or a direct HTTP request
```
## API Reference
### Endpoint
```
POST /v1/images/generations
Content-Type: application/json
```
### Request Parameters
#### OpenAI Standard Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | string | **required** | Text description of the desired image |
| `model` | string | server's model | Model to use (optional, should match server if specified) |
| `n` | integer | 1 | Number of images to generate (1-10) |
| `size` | string | model defaults | Image dimensions in WxH format (e.g., "1024x1024", "512x512") |
| `response_format` | string | "b64_json" | Response format (only "b64_json" supported) |
| `user` | string | null | User identifier for tracking |
#### vllm-omni Extension Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `negative_prompt` | string | null | Text describing what to avoid in the image |
| `num_inference_steps` | integer | model defaults | Number of diffusion steps |
| `guidance_scale` | float | model defaults | Classifier-free guidance scale (typically 0.0-20.0) |
| `true_cfg_scale` | float | model defaults | True CFG scale (model-specific parameter, may be ignored if not supported) |
| `seed` | integer | null | Random seed for reproducibility |
### Response Format
```json
{
"created": 1701234567,
"data": [
{
"b64_json": "<base64-encoded PNG>",
"url": null,
"revised_prompt": null
}
]
}
```
## Examples
### Multiple Images
```bash
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a steampunk city set in a valley of the Adirondack mountains",
"n": 4,
"size": "1024x1024",
"seed": 123
}'
```
This generates 4 images in a single request.
### With Negative Prompt
```python
response = requests.post(
"http://localhost:8000/v1/images/generations",
json={
"prompt": "a portrait of a skier in deep powder snow",
"negative_prompt": "blurry, low quality, distorted, ugly",
"num_inference_steps": 100,
"size": "1024x1024",
}
)
```
## Parameter Handling
The API passes parameters directly to the diffusion pipeline without model-specific transformation:
- **Default values**: When parameters are not specified, the underlying model uses its own defaults
- **Pass-through design**: User-provided values are forwarded directly to the diffusion engine
- **Minimal validation**: Only basic type checking and range validation at the API level
### Parameter Compatibility
The API passes parameters directly to the diffusion pipeline without model-specific validation.
- Unsupported parameters may be silently ignored by the model
- Incompatible values will result in errors from the underlying pipeline
- Recommended values vary by model - consult model documentation
**Best Practice:** Start with the model's recommended parameters, then adjust based on your needs.
## Error Responses
### 400 Bad Request
Invalid parameters (e.g., model mismatch):
```json
{
"detail": "Invalid size format: '1024x'. Expected format: 'WIDTHxHEIGHT' (e.g., '1024x1024')."
}
```
### 422 Unprocessable Entity
Validation errors (missing required fields):
```json
{
"detail": [
{
"loc": ["body", "prompt"],
"msg": "field required",
"type": "value_error.missing"
}
]
}
```
### 503 Service Unavailable
Diffusion engine not initialized:
```json
{
"detail": "Diffusion engine not initialized. Start server with a diffusion model."
}
```
## Troubleshooting
### Server Not Running
```bash
# Check if server is responding
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "test"}'
```
### Out of Memory
If you encounter OOM errors:
1. Reduce image size: `"size": "512x512"`
2. Reduce inference steps: `"num_inference_steps": 25`
3. Generate fewer images: `"n": 1`
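For example, a single request applying all three reductions might look like this (a sketch reusing the generations endpoint shown above):

```bash
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a misty forest at dawn",
    "size": "512x512",
    "num_inference_steps": 25,
    "n": 1
  }'
```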
## Testing
Run the test suite to verify functionality:
```bash
# All image generation tests
pytest tests/entrypoints/openai_api/test_image_server.py -v
# Specific test
pytest tests/entrypoints/openai_api/test_image_server.py::test_generate_single_image -v
```
## Development
Enable debug logging to see prompts and generation details:
```bash
vllm serve Qwen/Qwen-Image --omni \
--uvicorn-log-level debug
```
# Frequently Asked Questions
> Q: How many chips do I need to infer a model in vLLM-Omni?
A: We currently support natively disaggregated deployment of the different model stages within a model. One restriction is that a chip can host only one AutoRegressive (AR) model stage, because of vLLM's unified KV cache management. Stages of other types can coexist on a chip. This restriction will be lifted in a later version.
> Q: When trying to run the examples, I encounter an error about the backend of librosa or soundfile. How do I solve it?
A: If you encounter an error about the librosa backend, try installing ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
> Q: I see GPU OOM or "free memory is less than desired GPU memory utilization" errors. How can I fix it?
A: Refer to [GPU memory calculation and configuration](../configuration/gpu_memory_utilization.md) for guidance on tuning `gpu_memory_utilization` and related settings.
> Q: I encountered a bug or an urgent CI problem. How can I get it solved?
A: First, check the current [issues](https://github.com/vllm-project/vllm-omni/issues) for possible solutions. If none of them resolves your problem and it is urgent, please reach out to these [volunteers](https://docs.vllm.ai/projects/vllm-omni/en/latest/community/volunteers/) for help.
> Q: Does vLLM-Omni support AWQ or other quantization methods?
A: vLLM-Omni partitions a model into several stages. AR stages reuse the main logic of vLLM's LLMEngine, so quantization methods currently supported in vLLM should also work for those stages in vLLM-Omni, although systematic verification is still ongoing. Quantization for the DiffusionEngine is in progress. Please stay tuned, and contributions are welcome!
> Q: Does vLLM-Omni support multimodal streaming input and output?
A: Not yet. We already put it on the [Roadmap](https://github.com/vllm-project/vllm-omni/issues/165). Please stay tuned!
# Cache-DiT Acceleration Guide
This guide explains how to use cache-dit acceleration in vLLM-Omni to speed up diffusion model inference.
## Overview
Cache-dit is a library that accelerates diffusion transformer models through intelligent caching mechanisms. It supports multiple acceleration techniques that can be combined for optimal performance:
- **DBCache**: Dual Block Cache for reducing redundant computations
- **TaylorSeer**: Taylor expansion-based forecasting for faster inference
- **SCM**: Step Computation Masking for selective step computation
## Quick Start
### Basic Usage
Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`. Cache-dit will use its recommended default parameters:
```python
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
# Simplest way: just enable cache-dit with default parameters
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
)
images = omni.generate(
"a beautiful landscape",
OmniDiffusionSamplingParams(num_inference_steps=50),
)
```
**Default Parameters**: When `cache_config` is not provided, cache-dit uses optimized default values. See the [Configuration Reference](#configuration-reference) section for a complete list of all parameters and their default values.
### Custom Configuration
To customize cache-dit settings, provide a `cache_config` dictionary, for example:
```python
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
cache_config={
"Fn_compute_blocks": 1,
"Bn_compute_blocks": 0,
"max_warmup_steps": 4,
"residual_diff_threshold": 0.12,
},
)
```
## Online Serving (OpenAI-Compatible)
Enable Cache-DiT for online serving by passing `--cache-backend cache_dit` when starting the server:
```bash
# Use Cache-DiT default (recommended) parameters
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit
```
To customize Cache-DiT settings for online serving, pass a JSON string via `--cache-config`:
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
--cache-backend cache_dit \
--cache-config '{"Fn_compute_blocks": 1, "Bn_compute_blocks": 0, "max_warmup_steps": 4, "residual_diff_threshold": 0.12}'
```
## Acceleration Methods
For a comprehensive illustration, please see the cache-dit [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/)
### 1. DBCache (Dual Block Cache)
DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality.
**Key Parameters**:
- `Fn_compute_blocks` (int, default: 1): Number of **first n** transformer blocks used to compute stable feature differences. Higher values provide more accurate caching decisions but increase computation.
- `Bn_compute_blocks` (int, default: 0): Number of **last n** transformer blocks used for additional fusion. These blocks act as an auto-scaler for approximate hidden states.
- `max_warmup_steps` (int, default: 4): Number of initial steps where caching is disabled to ensure the model learns sufficient features before caching begins. Optimized for few-step distilled models.
- `residual_diff_threshold` (float, default: 0.24): Threshold for residual difference. Higher values lead to faster performance but may reduce precision. Default uses a relatively higher threshold for more aggressive caching.
- `max_cached_steps` (int, default: -1): Maximum number of cached steps. Set to -1 for unlimited caching.
- `max_continuous_cached_steps` (int, default: 3): Maximum number of consecutive cached steps. Limits consecutive caching to prevent precision degradation.
**Example Configuration**:
```python
cache_config={
"Fn_compute_blocks": 8, # Use first 8 blocks for difference computation
"Bn_compute_blocks": 0, # No additional fusion blocks
"max_warmup_steps": 8, # Cache after 8 warmup steps
"residual_diff_threshold": 0.12, # Higher threshold for faster inference
"max_cached_steps": -1, # No limit on cached steps
}
```
**Performance Tips**:
- Default `Fn_compute_blocks=1` works well for most cases. Increase to 8-12 for larger models or when more accuracy is needed
- Decrease `residual_diff_threshold` from the default 0.24 (e.g., to 0.12-0.15) for higher quality, or keep it at 0.24 or above for faster inference with a slight quality trade-off
- Default `max_warmup_steps=4` is optimized for few-step models. Increase to 6-8 for more steps if needed
### 2. TaylorSeer
TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality.
**Key Parameters**:
- `enable_taylorseer` (bool, default: False): Enable TaylorSeer acceleration
- `taylorseer_order` (int, default: 1): Order of Taylor expansion. Higher orders provide better accuracy but require more computation.
**Example Configuration**:
```python
cache_config={
"enable_taylorseer": True,
"taylorseer_order": 1, # First-order Taylor expansion
}
```
**Performance Tips**:
- Use `taylorseer_order=1` for most cases (good balance of speed and quality)
- Combine with DBCache for maximum acceleration
- Higher orders (2-3) may improve quality but reduce speed gains
### 3. SCM (Step Computation Masking)
SCM allows you to specify which steps must be computed and which can use cached results, similar to LeMiCa/EasyCache style acceleration.
**Key Parameters**:
- `scm_steps_mask_policy` (str | None, default: None): Predefined mask policy. Options:
- `None`: SCM disabled (default)
- `"slow"`: More compute steps, higher quality (18 compute steps out of 28)
- `"medium"`: Balanced (15 compute steps out of 28)
- `"fast"`: More cache steps, faster inference (11 compute steps out of 28)
- `"ultra"`: Maximum speed (8 compute steps out of 28)
- `scm_steps_policy` (str, default: "dynamic"): Policy for cached steps:
- `"dynamic"`: Use dynamic cache for masked steps (recommended)
- `"static"`: Use static cache for masked steps
**Example Configuration**:
```python
cache_config={
"scm_steps_mask_policy": "medium", # Balanced speed/quality
"scm_steps_policy": "dynamic", # Use dynamic cache
}
```
**Performance Tips**:
- SCM is disabled by default (`scm_steps_mask_policy=None`). Enable it by setting a policy value if you need additional acceleration
- Start with `"medium"` policy and adjust based on quality requirements
- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised
- `"dynamic"` policy generally provides better quality than `"static"`
- SCM mask is automatically regenerated when `num_inference_steps` changes during inference
## Configuration Reference
### DiffusionCacheConfig Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) |
| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion |
| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) |
| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) |
| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) |
| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) |
| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) |
| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) |
| `taylorseer_order` | int | 1 | Taylor expansion order |
| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") |
| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") |
## Example: Accelerate Text-to-Image Generation with CacheDiT
See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example with cache-dit acceleration.
```bash
# Enable cache-dit with hybrid acceleration
cd examples/offline_inference/text_to_image
python text_to_image.py \
--model Qwen/Qwen-Image \
--prompt "a cup of coffee on the table" \
--cache_backend cache_dit \
--num_inference_steps 50
```
The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer:
```python
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
cache_config={
# Scheme: Hybrid DBCache + SCM + TaylorSeer
# DBCache
"Fn_compute_blocks": 8,
"Bn_compute_blocks": 0,
"max_warmup_steps": 4,
"residual_diff_threshold": 0.12,
# TaylorSeer
"enable_taylorseer": True,
"taylorseer_order": 1,
# SCM
"scm_steps_mask_policy": "fast", # Set to None to disable SCM
"scm_steps_policy": "dynamic",
},
)
```
You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements.
To test another model, change `--model` to the target model identifier (e.g., `Tongyi-MAI/Z-Image-Turbo`) and update `cache_config` according to the model architecture (e.g., the number of transformer blocks), as sketched below.
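For instance, an invocation for Z-Image-Turbo might look like the following (a sketch reusing the script flags from the example above; the step count follows the Z-Image-Turbo examples used elsewhere in these docs, and the `cache_config` dictionary is edited inside the script):

```bash
cd examples/offline_inference/text_to_image
python text_to_image.py \
    --model Tongyi-MAI/Z-Image-Turbo \
    --prompt "a cup of coffee on the table" \
    --cache_backend cache_dit \
    --num_inference_steps 9
```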
## Additional Resources
- [Cache-DiT User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/)
- [Cache-DiT Benchmark](https://cache-dit.readthedocs.io/en/latest/benchmark/HYBRID_CACHE/)
- [DBCache Technical Details](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)
# CPU Offloading for Diffusion Model
## Overview
vLLM-Omni provides two offloading strategies to reduce GPU memory usage for diffusion models, allowing you to run larger models on GPUs with limited VRAM:
1. **Model-level (Component) Offloading**: Swaps entire model components (DiT transformer, VAE, encoders) between GPU and CPU.
2. **Layerwise (Blockwise) Offloading**: Keeps only one or a few transformer blocks on GPU at a time, overlapping compute with memory copies.
Both approaches use pinned memory for faster CPU-GPU transfers. For now, the two offloading strategies cannot be used at the same time.
## Model-level CPU Offloading
### Implementation
CPU offload lets the diffusion worker move large model components between GPU and CPU memory on demand. It keeps the DiT transformer resident on GPU only while it is actively running, and swaps it out when encoder modules need the device. This reduces peak VRAM usage so that bigger checkpoints can run on smaller GPUs, or multiple requests can share the same GPU.
**Execution Flow**:
1. Text encoders run on GPU while the DiT transformer is offloaded to CPU.
2. Before denoising, the transformer weights are prefetched back to GPU using pinned-memory copies for speed.
3. After the diffusion step, the transformer returns to CPU and the process repeats as needed.
Transfers use pinned host buffers, and the worker coordinates swaps via mutex-style hooks so components never compete for memory.
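The transfer pattern can be sketched in plain PyTorch. The snippet below is only an illustration of pinned-memory swapping, not the actual worker code; `swap_out` and `swap_in` are hypothetical helpers:

```python
import torch
import torch.nn as nn


def swap_out(module: nn.Module) -> None:
    # Hypothetical helper: park a component's weights in pinned CPU memory
    # so the later copy back to GPU can be asynchronous.
    for p in module.parameters():
        p.data = p.data.detach().cpu().pin_memory()


def swap_in(module: nn.Module, device: str = "cuda") -> None:
    # Hypothetical helper: prefetch the component back to the GPU; pinned
    # source buffers allow non-blocking host-to-device copies.
    for p in module.parameters():
        p.data = p.data.to(device, non_blocking=True)
```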
### Configuration
You can enable CPU offload in two ways:
1. **Python API**: set `enable_cpu_offload=True`.
```python
from vllm_omni import Omni
if __name__ == "__main__":
    m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```
2. **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.
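For example (a sketch assuming the same `vllm serve ... --omni` entrypoint used elsewhere in these docs):

```bash
vllm serve Qwen/Qwen-Image --omni --port 8000 --enable-cpu-offload
```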
### Limitations
- Cold start latency increases by over one minute for some models (e.g., Qwen-Image)
## Layerwise (Blockwise) Offloading
### Implementation
Layerwise offload operates at transformer block granularity, keeping a single transformer block, or a specified number of blocks, on GPU while others stay in CPU memory.
Unlike full model-level CPU offload, which swaps entire components such as the DiT and encoders, layerwise offloading loads and offloads weights between GPU and CPU in a sliding-window fashion: while block `i` computes, block `i+1` is prefetched asynchronously via pinned memory. Only a few blocks reside on GPU at any moment during inference, which greatly reduces memory occupancy.
**Execution Flow**:
1. During model initialization, all components are loaded to CPU first. Components other than the DiT model(s) in the pipeline, such as the VAE and encoders, are then moved to GPU. The weights of the target transformer blocks are collected into contiguous, pinned tensors per layer on CPU, while non-block modules (embeddings, norms, etc.) of the DiT model are moved to, and stay on, GPU.
2. The first block(s) are transferred to GPU during initialization of `LayerwiseOffloader`, before the first denoising step of the very first request.
3. As each block executes, the next block is prefetched on a separate CUDA stream, overlapping compute with memory copies. After execution, the current block is immediately freed from GPU memory.
4. When the last block completes, the first block is prefetched for the next denoising step.
Example of hook execution for a DiT model with n layers (by default, a single layer is kept on GPU):
| Layer (block) idx | forward pre-hook | forward | forward post-hook |
|-------------------|--------------------------------|------------------|---------------------------|
| layer-0 | prefetch layer 1 (copy stream) | compute layer 0 | free layer-0 gpu weights |
| layer-1 | prefetch layer 2 (copy stream) | compute layer 1 | free layer-1 gpu weights |
| layer-2 | prefetch layer 3 (copy stream) | compute layer 2 | free layer-2 gpu weights |
| ... | ... | ... | ... |
| layer-(n-1) | **prefetch layer 0 (copy stream)** | compute layer (n-1) | free layer (n-1) gpu weights |
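The table above can be approximated with ordinary PyTorch forward hooks. The following is an illustrative toy of the prefetch/free pattern, not the actual `LayerwiseOffloader` implementation:

```python
import torch
import torch.nn as nn


def attach_layerwise_hooks(blocks: nn.ModuleList, device: str = "cuda") -> None:
    """Toy illustration of the prefetch/free pattern (not the real offloader)."""
    copy_stream = torch.cuda.Stream()
    n = len(blocks)

    def make_pre_hook(i: int):
        def pre_hook(module, args):
            # Prefetch the next block on a side stream while this block computes.
            # The last block wraps around and prefetches block 0 for the next step.
            with torch.cuda.stream(copy_stream):
                blocks[(i + 1) % n].to(device, non_blocking=True)
            # Real code must also synchronize copy_stream before the
            # prefetched block actually runs.
        return pre_hook

    def post_hook(module, args, output):
        module.to("cpu")  # free this block's GPU weights right after its forward
        return output

    for i, blk in enumerate(blocks):
        blk.register_forward_pre_hook(make_pre_hook(i))
        blk.register_forward_hook(post_hook)
```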
### Configuration
1. **Python API**: set `enable_layerwise_offload=True` and optionally `layerwise_num_gpu_layers`.
```python
from vllm_omni import Omni
if __name__ == "__main__":
    m = Omni(
        model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
        enable_layerwise_offload=True,
        ...
    )
```
2. **CLI**: pass `--enable-layerwise-offload` and `--layerwise-num-gpu-layers` to the diffusion service entrypoint.
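For example (a sketch assuming the same `vllm serve ... --omni` entrypoint used elsewhere in these docs; the layer count is illustrative):

```bash
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8000 \
    --enable-layerwise-offload --layerwise-num-gpu-layers 2
```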
### Supported Models
| Architecture | Models | Example HF Models | DiT Model Cls | Blocks Attr Name |
|--------------|--------|-------------------|----------|----------|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` | `QwenImageTransformer2DModel` | "transformer_blocks" |
| `Wan22Pipeline` | Wan2.2 | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `WanTransformer3DModel` | "blocks" |
NOTE: Models must define the `_layerwise_offload_blocks_attr` class attribute so that the layerwise offloader can find the target transformer blocks.
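A minimal sketch of what this class attribute might look like on a DiT model, using the attribute name from the table above (`MyDiTModel` is a hypothetical stand-in):

```python
import torch.nn as nn


class MyDiTModel(nn.Module):
    # Tells the layerwise offloader which ModuleList holds the transformer
    # blocks to stream between CPU and GPU.
    _layerwise_offload_blocks_attr = "transformer_blocks"

    def __init__(self, num_layers: int = 4, dim: int = 64):
        super().__init__()
        self.transformer_blocks = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_layers)
        )
```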
### Limitations
- Cold start latency increases because
    1) all components are first loaded to CPU during initialization, and
    2) the block weights must be consolidated and pinned.
- Performance depends on the CPU <-> GPU interconnect (e.g., PCIe bandwidth).
- Only a single GPU is supported for now.
# Parallelism Acceleration Guide
This guide describes how to use parallelism methods in vLLM-Omni to speed up diffusion model inference and reduce the per-device memory requirement.
## Overview
The following parallelism methods are currently supported in vLLM-Omni:
1. DeepSpeed Ulysses Sequence Parallel (DeepSpeed Ulysses-SP) ([arxiv paper](https://arxiv.org/pdf/2309.14509)): Ulysses-SP splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
2. [Ring-Attention](#ring-attention): Ring-Attention splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
3. Classifier-Free-Guidance Parallel (CFG-Parallel): CFG-Parallel runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
4. [Tensor Parallelism](#tensor-parallelism): Tensor parallelism shards model weights across devices. This can reduce per-GPU memory usage. Note that for diffusion models we currently shard the majority of layers within the DiT.
The following table shows which models are currently supported by parallelism method:
### ImageGen
| Model | Model Identifier | Ulysses-SP | Ring-SP | CFG-Parallel | Tensor-Parallel |
|--------------------------|--------------------------------------|:----------:|:-------:|:------------:|:---------------:|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ✅ | ✅ | ❌ | ✅ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ✅ | ✅ | ❌ | ✅ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ❌ | ❌ | ❌ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ✅ (TP=2 only) |
| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ❌ | ❌ | ❌ |
| **FLUX.2-klein** | `black-forest-labs/FLUX.2-klein-4B` | ❌ | ❌ | ❌ | ✅ |
| **FLUX.1-dev** | `black-forest-labs/FLUX.1-dev` | ❌ | ❌ | ❌ | ✅ |
!!! note "TP Limitations for Diffusion Models"
We currently implement Tensor Parallelism (TP) only for the DiT (Diffusion Transformer) blocks. This is because the `text_encoder` component in vLLM-Omni uses the original Transformers implementation, which does not yet support TP.
- Good news: The text_encoder typically has minimal impact on overall inference performance.
- Bad news: When TP is enabled, every TP process retains a full copy of the text_encoder weights, leading to significant GPU memory waste.
We are actively refactoring this design to address this. For details and progress, please refer to [Issue #771](https://github.com/vllm-project/vllm-omni/issues/771).
!!! note "Why Z-Image is TP=2 only"
Z-Image Turbo is currently limited to `tensor_parallel_size` of **1 or 2** due to model shape divisibility constraints.
For example, the model has `n_heads=30` and a final projection out dimension of `64`, so valid TP sizes must divide both 30 and 64; the only common divisors are **1 and 2**.
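A quick way to sanity-check that constraint (a trivial sketch, not project code):

```python
# Valid TP sizes must divide both the number of attention heads (30)
# and the final projection output dimension (64).
n_heads, proj_out = 30, 64
valid_tp = [d for d in range(1, min(n_heads, proj_out) + 1)
            if n_heads % d == 0 and proj_out % d == 0]
print(valid_tp)  # [1, 2]
```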
### VideoGen
| Model | Model Identifier | Ulysses-SP | Ring-SP | Tensor-Parallel |
|-------|------------------|------------|---------|--------------------------|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ✅ | ✅ | ❌ |
### Tensor Parallelism
Tensor parallelism splits model parameters across GPUs. In vLLM-Omni, tensor parallelism is configured via `DiffusionParallelConfig.tensor_parallel_size`.
#### Offline Inference
```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Tongyi-MAI/Z-Image-Turbo",
parallel_config=DiffusionParallelConfig(tensor_parallel_size=2),
)
outputs = omni.generate(
"a cat reading a book",
OmniDiffusionSamplingParams(
num_inference_steps=9,
width=512,
height=512,
),
)
```
### Sequence Parallelism
#### Ulysses-SP
##### Offline Inference
An example of offline inference script using [Ulysses-SP](https://arxiv.org/pdf/2309.14509) is shown below:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ulysses_degree=2)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
##### Online Serving
You can enable Ulysses-SP in online serving for diffusion models via `--usp`:
```bash
# Text-to-image (requires >= 2 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2
```
##### Benchmarks
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA H800 GPUs, and `sdpa` is the attention backend.
| Configuration | Ulysses degree |Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 112.5s | 1.0x |
| Ulysses-SP | 2 | 65.2s | 1.73x |
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |
#### Ring-Attention
Ring-Attention ([arxiv paper](https://arxiv.org/abs/2310.01889)) splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results. Unlike Ulysses-SP which uses all-to-all communication, Ring-Attention keeps the sequence dimension sharded throughout the computation and circulates Key/Value blocks through a ring topology.
##### Offline Inference
An example of offline inference script using Ring-Attention is shown below:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ring_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ring_degree=2)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
##### Online Serving
You can enable Ring-Attention in online serving for diffusion models via `--ring`:
```bash
# Text-to-image (requires >= 2 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --ring 2
```
##### Benchmarks
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **1024x1024** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA A100 GPUs, and `flash_attn` is the attention backend.
| Configuration | Ring degree |Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 45.2s | 1.0x |
| Ring-Attention | 2 | 29.9s | 1.51x |
| Ring-Attention | 4 | 23.3s | 1.94x |
#### Hybrid Ulysses + Ring
You can combine both Ulysses-SP and Ring-Attention for larger scale parallelism. The total sequence parallel size equals `ulysses_degree × ring_degree`.
##### Offline Inference
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
# Hybrid: 2 Ulysses × 2 Ring = 4 GPUs total
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ulysses_degree=2, ring_degree=2)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
##### Online Serving
```bash
# Text-to-image (requires >= 4 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 --ring 2
```
##### Benchmarks
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **1024x1024** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA A100 GPUs, and `flash_attn` is the attention backend.
| Configuration | Ulysses degree | Ring degree | Generation Time | Speedup |
|---------------|----------------|-------------|-----------------|---------|
| **Baseline (diffusers)** | - | - | 45.2s | 1.0x |
| Hybrid Ulysses + Ring | 2 | 2 | 24.3s | 1.87x |
##### How to parallelize a new model
NOTE: "Terminology: SP vs CP"
Our "Sequence Parallelism" (SP) corresponds to "Context Parallelism" (CP) in the [diffusers library](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/_modeling_parallel.py).
We use "Sequence Parallelism" to align with vLLM-Omni's terminology.
---
###### Non-intrusive `_sp_plan` (Recommended)
The `_sp_plan` mechanism allows SP without modifying `forward()` logic. The framework automatically registers hooks to shard inputs and gather outputs at module boundaries.
**Requirements for `forward()` function:**
- Tensor operations that need sharding/gathering must happen at **`nn.Module` boundaries** (not inline Python operations)
- If your `forward()` contains inline tensor operations (e.g., `torch.cat`, `pad_sequence`) that need sharding, **extract them into a submodule**
**When to create a submodule:**
```python
# ❌ BAD: Inline operations - hooks cannot intercept
def forward(self, x, cap_feats):
unified = torch.cat([x, cap_feats], dim=1) # Cannot be sharded via _sp_plan
...
# ✅ GOOD: Extract into a submodule
class UnifiedPrepare(nn.Module):
def forward(self, x, cap_feats):
return torch.cat([x, cap_feats], dim=1) # Now can be sharded via _sp_plan
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.unified_prepare = UnifiedPrepare()  # Submodule

    def forward(self, x, cap_feats):
        unified = self.unified_prepare(x, cap_feats)  # Hook can intercept here
```
---
###### Defining `_sp_plan`
**Type definitions** (see [diffusers `_modeling_parallel.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/_modeling_parallel.py) for reference):
```python
from vllm_omni.diffusion.distributed.sp_plan import (
SequenceParallelInput, # Corresponds to diffusers' ContextParallelInput
SequenceParallelOutput, # Corresponds to diffusers' ContextParallelOutput
)
```
| Parameter | Description |
|-----------|-------------|
| `split_dim` | Dimension to split/gather (usually `1` for sequence) |
| `expected_dims` | Expected tensor rank for validation (optional) |
| `split_output` | `False`: shard **input** parameters; `True`: shard **output** tensors |
| `auto_pad` | Auto-pad if sequence not divisible by world_size (Ulysses only) |
**Key naming convention:**
| Key | Meaning | Python equivalent |
|-----|---------|-------------------|
| `""` | Root model | `model` |
| `"blocks.0"` | First element of ModuleList | `model.blocks[0]` |
| `"blocks.*"` | All elements of ModuleList | `for b in model.blocks` |
| `"outputs.main"` | ModuleDict entry | `model.outputs["main"]` |
**Dictionary key types:**
| Key type | `split_output` | Description |
|----------|----------------|-------------|
| `"param_name"` (str) | `False` | Shard **input parameter** by name |
| `0`, `1` (int) | `True` | Shard **output tuple** by index |
**Example** (similar to [diffusers `transformer_wan.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_wan.py)):
```python
class MyTransformer(nn.Module):
_sp_plan = {
# Shard rope module OUTPUTS (returns tuple)
"rope": {
0: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True), # cos
1: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True), # sin
},
# Shard transformer block INPUT parameter
"blocks.0": {
"hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
},
# Gather at final projection
"proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
}
```
---
###### Hook flow
```
Input → [SequenceParallelSplitHook: pre_forward] → Module.forward() → [post_forward] → ...
... → [SequenceParallelGatherHook: post_forward] → Output
```
1. **SplitHook** shards tensors before/after the target module
2. **Attention layers** handle Ulysses/Ring communication internally
3. **GatherHook** collects sharded outputs
The framework automatically applies these hooks when `sequence_parallel_size > 1`.
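Conceptually, the split hook performs something like the following before the target module runs; this is an illustrative sketch with generic `torch.distributed` calls, not the framework's actual hook code:

```python
import torch
import torch.distributed as dist


def shard_along_seq(x: torch.Tensor, split_dim: int = 1) -> torch.Tensor:
    # Each rank keeps only its chunk of the sequence dimension; the attention
    # layers later handle cross-rank communication (all-to-all for Ulysses,
    # ring P2P for Ring-Attention), and a gather hook restores the full
    # sequence at the configured output module.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    return torch.chunk(x, world_size, dim=split_dim)[rank].contiguous()
```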
---
###### Method 2: Intrusive modification (For complex cases)
For models with dynamic sharding logic that cannot be expressed via `_sp_plan`:
```python
from vllm_omni.diffusion.distributed.sp_sharding import sp_shard, sp_gather
def forward(self, hidden_states, ...):
if self.parallel_config.sequence_parallel_size > 1:
hidden_states = sp_shard(hidden_states, dim=1)
# ... computation ...
output = sp_gather(output, dim=1)
return output
```
---
###### Choosing the right approach
| Scenario | Approach |
|----------|----------|
| Standard transformer | `_sp_plan` |
| Inline tensor ops need sharding | Extract to submodule + `_sp_plan` |
| Dynamic/conditional sharding | Intrusive modification |
### CFG-Parallel
#### Offline Inference
CFG-Parallel is enabled through `DiffusionParallelConfig(cfg_parallel_size=2)`, which runs one rank for the positive branch and one rank for the negative branch.
An example of offline inference using CFG-Parallel (image-to-image) is shown below:
```python
from PIL import Image

from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
image_path = "path_to_image.png"
omni = Omni(
model="Qwen/Qwen-Image-Edit",
parallel_config=DiffusionParallelConfig(cfg_parallel_size=2),
)
input_image = Image.open(image_path).convert("RGB")
outputs = omni.generate(
{
"prompt": "turn this cat to a dog",
"negative_prompt": "low quality, blurry",
"multi_modal_data": {"image": input_image},
},
OmniDiffusionSamplingParams(
true_cfg_scale=4.0,
num_inference_steps=50,
),
)
```
Notes:
- CFG-Parallel is only effective when a `negative_prompt` is provided AND a guidance scale (or `cfg_scale`) is greater than 1.
See `examples/offline_inference/image_to_image/image_edit.py` for a complete working example.
```bash
cd examples/offline_inference/image_to_image/
python image_edit.py \
--model "Qwen/Qwen-Image-Edit" \
--image "qwen_image_output.png" \
--prompt "turn this cat to a dog" \
--negative_prompt "low quality, blurry" \
--cfg_scale 4.0 \
--output "edited_image.png" \
--cfg_parallel_size 2
```
#### Online Serving
You can enable CFG-Parallel in online serving for diffusion models via `--cfg-parallel-size`:
```bash
vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 --cfg-parallel-size 2
```
#### How to parallelize a pipeline
This section describes how to add CFG-Parallel to a diffusion **pipeline**. We use the Qwen-Image pipeline (`vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py`) as the reference implementation.
In `QwenImagePipeline`, each diffusion step runs two denoiser forward passes sequentially:
- positive (prompt-conditioned)
- negative (negative-prompt-conditioned)
CFG-Parallel assigns these two branches to different ranks in the **CFG group** and synchronizes the results.
vLLM-Omni provides a `CFGParallelMixin` base class that encapsulates the CFG-parallel logic. By inheriting from this mixin and calling its methods, pipelines can implement CFG parallelism without writing repetitive code.
**Key Methods in CFGParallelMixin:**
- `predict_noise_maybe_with_cfg()`: Automatically handles CFG parallel noise prediction
- `scheduler_step_maybe_with_cfg()`: Scheduler step with automatic CFG rank synchronization
**Example Implementation:**
```python
class QwenImageCFGParallelMixin(CFGParallelMixin):
"""
Base Mixin class for Qwen Image pipelines providing shared CFG methods.
"""
def diffuse(
self,
prompt_embeds: torch.Tensor,
prompt_embeds_mask: torch.Tensor,
negative_prompt_embeds: torch.Tensor,
negative_prompt_embeds_mask: torch.Tensor,
latents: torch.Tensor,
img_shapes: torch.Tensor,
txt_seq_lens: torch.Tensor,
negative_txt_seq_lens: torch.Tensor,
timesteps: torch.Tensor,
do_true_cfg: bool,
guidance: torch.Tensor,
true_cfg_scale: float,
image_latents: torch.Tensor | None = None,
cfg_normalize: bool = True,
additional_transformer_kwargs: dict[str, Any] | None = None,
) -> torch.Tensor:
self.transformer.do_true_cfg = do_true_cfg
for i, t in enumerate(timesteps):
timestep = t.expand(latents.shape[0]).to(device=latents.device, dtype=latents.dtype)
# Prepare kwargs for positive (conditional) prediction
positive_kwargs = {
"hidden_states": latents,
"timestep": timestep / 1000,
"guidance": guidance,
"encoder_hidden_states_mask": prompt_embeds_mask,
"encoder_hidden_states": prompt_embeds,
"img_shapes": img_shapes,
"txt_seq_lens": txt_seq_lens,
}
# Prepare kwargs for negative (unconditional) prediction
if do_true_cfg:
negative_kwargs = {
"hidden_states": latents,
"timestep": timestep / 1000,
"guidance": guidance,
"encoder_hidden_states_mask": negative_prompt_embeds_mask,
"encoder_hidden_states": negative_prompt_embeds,
"img_shapes": img_shapes,
"txt_seq_lens": negative_txt_seq_lens,
}
else:
negative_kwargs = None
# Predict noise with automatic CFG parallel handling
# - In CFG parallel mode: rank0 computes positive, rank1 computes negative
# - Automatically gathers results and combines them on rank0
noise_pred = self.predict_noise_maybe_with_cfg(
do_true_cfg=do_true_cfg,
true_cfg_scale=true_cfg_scale,
positive_kwargs=positive_kwargs,
negative_kwargs=negative_kwargs,
cfg_normalize=cfg_normalize,
)
# Step scheduler with automatic CFG synchronization
# - Only rank0 computes the scheduler step
# - Automatically broadcasts updated latents to all ranks
latents = self.scheduler_step_maybe_with_cfg(
noise_pred, t, latents, do_true_cfg
)
return latents
```
**How it works:**
1. Prepare separate `positive_kwargs` and `negative_kwargs` for conditional and unconditional predictions
2. Call `predict_noise_maybe_with_cfg()` which:
- Detects if CFG parallel is enabled (`get_classifier_free_guidance_world_size() > 1`)
- Distributes computation: rank0 processes positive, rank1 processes negative
- Gathers predictions and combines them using `combine_cfg_noise()` on rank0
- Returns combined noise prediction (only valid on rank0)
3. Call `scheduler_step_maybe_with_cfg()` which:
- Only rank0 computes the scheduler step
- Broadcasts the updated latents to all ranks for synchronization
**How to customize**
Some pipelines may need to customize the following functions in `CFGParallelMixin`:
1. You may need to override the `predict_noise` function for custom behavior.
```python
def predict_noise(self, *args, **kwargs):
"""
Forward pass through transformer to predict noise.
Subclasses should override this if they need custom behavior,
but the default implementation calls self.transformer.
"""
return self.transformer(*args, **kwargs)[0]
```
2. The default normalization function after combining the noise predictions from both branches is as follows. You may need to customize it.
```python
def cfg_normalize_function(self, noise_pred, comb_pred):
"""
Normalize the combined noise prediction.
Args:
noise_pred: positive noise prediction
comb_pred: combined noise prediction after CFG
Returns:
Normalized noise prediction tensor
"""
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)
return noise_pred
```
# TeaCache Configuration Guide
TeaCache speeds up diffusion model inference by caching transformer computations when consecutive timesteps are similar. This typically provides **1.5x-2.0x speedup** with minimal quality loss.
## Quick Start
Enable TeaCache by setting `cache_backend` to `"tea_cache"`:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={
"rel_l1_thresh": 0.2 # Optional, defaults to 0.2
}
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
### Using Environment Variable
You can also enable TeaCache via environment variable:
```bash
export DIFFUSION_CACHE_BACKEND=tea_cache
```
Then initialize without explicitly setting `cache_backend`:
```python
from vllm_omni import Omni
omni = Omni(
model="Qwen/Qwen-Image",
cache_config={"rel_l1_thresh": 0.2} # Optional
)
```
## Online Serving (OpenAI-Compatible)
Enable TeaCache for online serving by passing `--cache-backend tea_cache` when starting the server:
```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
--cache-backend tea_cache \
--cache-config '{"rel_l1_thresh": 0.2}'
```
## Configuration Parameters
### `rel_l1_thresh` (float, default: `0.2`)
Controls the balance between speed and quality. Lower values prioritize quality, higher values prioritize speed.
**Recommended values:**
- `0.2` - **~1.5x speedup** with minimal quality loss (recommended)
- `0.4` - **~1.8x speedup** with slight quality loss
- `0.6` - **~2.0x speedup** with noticeable quality loss
- `0.8` - **~2.25x speedup** with significant quality loss
## Examples
### Python API
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={"rel_l1_thresh": 0.2}
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
## Performance Tuning
Start with the default `rel_l1_thresh=0.2` and adjust based on your needs:
- **Maximum quality**: Use `0.1-0.2`
- **Balanced**: Use `0.2-0.4` (recommended)
- **Maximum speed**: Use `0.6-0.8` (may reduce quality)
## Troubleshooting
### Quality Degradation
If you notice quality issues, lower the threshold:
```python
cache_config={"rel_l1_thresh": 0.1} # More conservative caching
```
## Supported Models
### ImageGen
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` |
### VideoGen
No VideoGen models are supported by TeaCache yet.
### Coming Soon
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `FluxPipeline` | Flux | - |
| `CogVideoXPipeline` | CogVideoX | - |
# Diffusion Acceleration Overview
vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods**, which intelligently cache intermediate computations to avoid redundant work across diffusion timesteps, and **parallelism methods**, which distribute the computation across multiple devices.
## Supported Acceleration Methods
vLLM-Omni currently supports two main cache acceleration backends:
1. **[TeaCache](diffusion/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
2. **[Cache-DiT](diffusion/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
- **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
- **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
- **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality.
vLLM-Omni also supports parallelism methods for diffusion models, including:
1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
2. [Ring-Attention](diffusion/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
3. [CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
## Quick Comparison
### Cache Methods
| Method | Configuration | Description | Best For |
|--------|--------------|-------------|----------|
| **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality |
| **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control |
## Supported Models
The following table shows which models are currently supported by each acceleration method:
### ImageGen
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:----------:|:-----------:|:-----------:|:----------------:|:----------------:|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ❌ | ❌ |
| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ✅ | ❌ | ❌ | ✅ |
| **Bagel** | `ByteDance-Seed/BAGEL-7B-MoT` | ✅ | ✅ | ❌ | ❌ | ❌ |
### VideoGen
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention |CFG-Parallel |
|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:----------------:|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ |
## Performance Benchmarks
The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Image-Edit** models generating 1024x1024 images with 50 inference steps:
!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
- Specific model and use case
- Hardware configuration
- Careful parameter tuning
- Different inference settings (e.g., number of steps, image resolution)
For optimal performance in your specific scenario, we recommend experimenting with different parameter configurations as described in the detailed guides below.
| Model | Cache Backend | Cache Config | Generation Time | Speedup | Notes |
|-------|---------------|--------------|----------------|---------|-------|
| **Qwen/Qwen-Image** | None | None | 20.0s | 1.0x | Baseline (diffusers) |
| **Qwen/Qwen-Image** | TeaCache | `rel_l1_thresh=0.2` | 10.47s | **1.91x** | Recommended default setting |
| **Qwen/Qwen-Image** | Cache-DiT | DBCache + TaylorSeer (Fn=1, Bn=0, W=8, TaylorSeer order=1) | 10.8s | **1.85x** | - |
| **Qwen/Qwen-Image** | Cache-DiT | DBCache + TaylorSeer + SCM (Fn=8, Bn=0, W=4, TaylorSeer order=1, SCM fast) | 14.0s | **1.43x** | - |
| **Qwen/Qwen-Image-Edit** | None | No acceleration | 51.5s | 1.0x | Baseline (diffusers) |
| **Qwen/Qwen-Image-Edit** | Cache-DiT | Default (Fn=1, Bn=0, W=4, TaylorSeer disabled, SCM disabled) | 21.6s | **2.38x** | - |
To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA H800 GPUs, and `sdpa` is the attention backend.
| Configuration | Ulysses degree |Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 112.5s | 1.0x |
| Ulysses-SP | 2 | 65.2s | 1.73x |
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |
## Quick Start
### Using TeaCache
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={"rel_l1_thresh": 0.2} # Optional, defaults to 0.2
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
### Using Cache-DiT
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="cache_dit",
cache_config={
"Fn_compute_blocks": 1,
"Bn_compute_blocks": 0,
"max_warmup_steps": 8,
"enable_taylorseer": True,
"taylorseer_order": 1,
}
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(
num_inference_steps=50,
),
)
```
### Using Ulysses-SP
Run text-to-image:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ulysses_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
Run image-to-image:
```python
from PIL import Image
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
# Load the source image to edit (placeholder path; a PIL image is assumed here)
input_image = Image.open("/path/to/image.jpg")
ulysses_degree = 2
omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
)
outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "multi_modal_data": {"image": input_image}
    },
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```
### Using Ring-Attention
Run text-to-image:
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
ring_degree = 2
omni = Omni(
model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ring_degree=ring_degree)
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048),
)
```
### Using CFG-Parallel
CFG-Parallel splits the CFG positive/negative branches across GPUs. Use it when you set a non-trivial `true_cfg_scale`.
Run image-to-image:
```python
from PIL import Image
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.diffusion.data import DiffusionParallelConfig
# Load the source image to edit (placeholder path; a PIL image is assumed here)
input_image = Image.open("/path/to/image.jpg")
cfg_parallel_size = 2
omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(cfg_parallel_size=cfg_parallel_size)
)
outputs = omni.generate(
    {
        "prompt": "turn this cat to a dog",
        "multi_modal_data": {"image": input_image}
    },
    OmniDiffusionSamplingParams(num_inference_steps=50, true_cfg_scale=4.0),
)
```
## Documentation
For detailed information on each acceleration method:
- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on configuring sequence parallelism (Ulysses-SP and Ring-Attention)
- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on configuring CFG-Parallel to run the positive/negative branches across ranks
# BAGEL-7B-MoT
Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel>.
## Set up
Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
## Run examples
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, modify the stage configuration to distribute the model across devices (a sketch follows the configuration tables below).
Change into the bagel folder:
```bash
cd examples/offline_inference/bagel
```
### Modality Control
BAGEL-7B-MoT supports multiple modality modes. You can control the mode using the `--modality` argument:
#### Text to Image (text2img)
- **Pipeline**: Text → Thinker → DiT → VAE Decode → Image
- **Stages Used**: Stage 0 (Thinker) + Stage 1 (DiT)
- **KV Transfer**: Thinker sends KV cache to DiT for conditioned generation
Generate images from text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--prompts "A cute cat"
```
#### Image to Image (img2img)
- **Pipeline**: Image → VAE Encode → DiT → VAE Decode → New Image
- **Stages Used**: Stage 1 (DiT) only
- **Special**: Bypasses the Thinker stage, direct image-to-image transformation
Transform images based on text prompts:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2img \
--image-path /path/to/image.jpg \
--prompts "Let the woman wear a blue dress"
```
#### Image to Text (img2text)
- **Pipeline**: Image → ViT + VAE Encode → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: Uses both VAE latent encoding AND ViT semantic encoding for comprehensive image understanding
Generate text descriptions from images:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality img2text \
--image-path /path/to/image.jpg \
--prompts "Describe this image in detail"
```
#### Text to Text (text2text)
- **Pipeline**: Text → Thinker → Text Output
- **Stages Used**: Stage 0 (Thinker) only
- **Special**: No visual components involved, operates as pure language model
Pure text generation:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--prompts "What is the capital of France?"
# You can load prompts from a text file (one prompt per line):
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2text \
--txt-prompts /path/to/prompts.txt
```
### Inference Steps
Control the number of inference steps for image generation:
```bash
# The default is 50 steps; increasing --steps (e.g., to 100) can improve image quality
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
--modality text2img \
--steps 50 \
--prompts "A cute cat"
```
### Key arguments
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default YAML configuration deploys the Thinker and DiT stages on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
#### 📌 Command Line Arguments (end2end.py)
| Argument | Type | Default | Description |
| :--------------------- | :----- | :---------------------------- | :----------------------------------------------------------- |
| `--model` | string | `ByteDance-Seed/BAGEL-7B-MoT` | Model path or name |
| `--modality` | choice | `text2img` | Modality mode: `text2img`, `img2img`, `img2text`, `text2text` |
| `--prompts` | list | `None` | Input text prompts directly |
| `--txt-prompts` | string | `None` | Path to txt file with one prompt per line |
| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
| `--steps` | int | `50` | Number of inference steps |
| `--stage-configs-path` | string | `None` | Custom stage config file path |
| `--worker-backend` | choice | `process` | Worker backend: `process` or `ray` |
| `--ray-address` | string | `None` | Ray cluster address |
| `--enable-stats` | flag | `False` | Enable statistics logging |
| `--init-sleep-seconds` | int | `20` | Initialization sleep time |
| `--batch-timeout` | int | `5` | Batch timeout |
| `--init-timeout` | int | `300` | Initialization timeout |
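For example, to point `end2end.py` at a custom stage configuration (useful for the dual-GPU setup mentioned above), combine the documented arguments; the YAML path below is a placeholder:
```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --stage-configs-path /path/to/custom_bagel.yaml \
    --prompts "A cute cat"
```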
------
#### ⚙️ Stage Configuration Parameters (bagel.yaml)
**Stage 0 - Thinker (LLM Stage)**
| Parameter | Value | Description |
| :------------------------------- | :------------------------------ | :----------------------- |
| `stage_type` | `llm` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `thinker` | Model stage identifier |
| `model_arch` | `BagelForConditionalGeneration` | Model architecture |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `tensor_parallel_size` | `1` | Tensor parallel size |
| `max_num_batched_tokens` | `32768` | Maximum batched tokens |
| `omni_kv_config.need_send_cache` | `true` | Whether to send KV cache |
------
**Stage 1 - DiT (Diffusion Stage)**
| Parameter | Value | Description |
| :------------------------------- | :---------- | :-------------------------- |
| `stage_type` | `diffusion` | Stage type |
| `devices` | `"0"` | GPU device ID |
| `max_batch_size` | `1` | Maximum batch size |
| `model_stage` | `dit` | Model stage identifier |
| `gpu_memory_utilization` | `0.4` | GPU memory utilization |
| `omni_kv_config.need_recv_cache` | `true` | Whether to receive KV cache |
| `engine_input_source` | `[0]` | Input source from Stage 0 |
------
#### 🔗 Runtime Configuration
| Parameter | Value | Description |
| :-------------------- | :------ | :------------------------------- |
| `window_size` | `-1` | Window size (-1 means unlimited) |
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |
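For dual-GPU setups, the stage configuration can place the Thinker and DiT stages on different devices. Below is a minimal sketch assembled only from the parameters documented above; the actual top-level layout of `bagel.yaml` may differ, so treat it as an illustration rather than a drop-in replacement:
```yaml
# Sketch only: field names are taken from the tables above; the real
# bagel.yaml may nest or name the stage entries differently.
- stage_type: llm
  devices: "0"                      # Thinker on GPU 0
  max_batch_size: 1
  model_stage: thinker
  model_arch: BagelForConditionalGeneration
  gpu_memory_utilization: 0.4
  tensor_parallel_size: 1
  max_num_batched_tokens: 32768
  omni_kv_config:
    need_send_cache: true           # Thinker sends its KV cache to the DiT stage
- stage_type: diffusion
  devices: "1"                      # DiT moved to GPU 1 instead of sharing GPU 0
  max_batch_size: 1
  model_stage: dit
  gpu_memory_utilization: 0.4
  omni_kv_config:
    need_recv_cache: true
  engine_input_source: [0]          # Consumes the output of Stage 0
```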
## FAQ
- If you encounter an error about librosa's audio backend, install ffmpeg with the commands below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you hit an out-of-memory (OOM) error, try decreasing `max_model_len`. Approximate per-stage VRAM usage:
| Stage | VRAM |
| :------------------ | :--------------------------- |
| Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
| Stage-1 (DiT) | **26.50 GiB** |
| Total | **~42 GiB + KV Cache** |