# Cache-DiT Acceleration Guide

This guide explains how to use cache-dit acceleration in vLLM-Omni to speed up diffusion model inference.

## Overview

Cache-dit is a library that accelerates diffusion transformer models through intelligent caching mechanisms. It supports multiple acceleration techniques that can be combined for optimal performance:

- **DBCache**: Dual Block Cache for reducing redundant computations
- **TaylorSeer**: Taylor expansion-based forecasting for faster inference
- **SCM**: Step Computation Masking for selective step computation

## Quick Start

### Basic Usage

Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`. Cache-dit will use its recommended default parameters:

```python
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simplest way: just enable cache-dit with default parameters
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
)

images = omni.generate(
    "a beautiful landscape",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

**Default Parameters**: When `cache_config` is not provided, cache-dit uses optimized default values. See the [Configuration Reference](#configuration-reference) section for a complete list of all parameters and their default values.

### Custom Configuration

To customize cache-dit settings, provide a `cache_config` dictionary, for example:

```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        "Fn_compute_blocks": 1,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.12,
    },
)
```

## Online Serving (OpenAI-Compatible)

Enable Cache-DiT for online serving by passing `--cache-backend cache_dit` when starting the server:

```bash
# Use Cache-DiT default (recommended) parameters
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit
```

To customize Cache-DiT settings for online serving, pass a JSON string via `--cache-config`:

```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend cache_dit \
  --cache-config '{"Fn_compute_blocks": 1, "Bn_compute_blocks": 0, "max_warmup_steps": 4, "residual_diff_threshold": 0.12}'
```

## Acceleration Methods

For a comprehensive overview, see the cache-dit [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/).

### 1. DBCache (Dual Block Cache)

DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality.

**Key Parameters**:

- `Fn_compute_blocks` (int, default: 1): Number of **first n** transformer blocks used to compute stable feature differences. Higher values provide more accurate caching decisions but increase computation.
- `Bn_compute_blocks` (int, default: 0): Number of **last n** transformer blocks used for additional fusion. These blocks act as an auto-scaler for approximate hidden states.
- `max_warmup_steps` (int, default: 4): Number of initial steps during which caching is disabled, so that stable features are available before caching begins. The default is tuned for few-step distilled models.
- `residual_diff_threshold` (float, default: 0.24): Threshold on the residual difference. Higher values give faster inference but may reduce precision; the default of 0.24 is relatively high, favoring more aggressive caching.
- `max_cached_steps` (int, default: -1): Maximum number of cached steps. Set to -1 for unlimited caching.
- `max_continuous_cached_steps` (int, default: 3): Maximum number of consecutive cached steps. Limits consecutive caching to prevent precision degradation.

**Example Configuration**:

```python
cache_config={
    "Fn_compute_blocks": 8,      # Use first 8 blocks for difference computation
    "Bn_compute_blocks": 0,       # No additional fusion blocks
    "max_warmup_steps": 8,        # Cache after 8 warmup steps
    "residual_diff_threshold": 0.12,  # Higher threshold for faster inference
    "max_cached_steps": -1,        # No limit on cached steps
}
```

**Performance Tips**:

- Default `Fn_compute_blocks=1` works well for most cases; increase it to 8-12 for larger models or when more accuracy is needed
- Lower `residual_diff_threshold` (e.g., to 0.12-0.15) for higher quality with less aggressive caching, or keep the default 0.24 (or raise it) for faster inference with a slight quality trade-off; see the sketch after this list
- Default `max_warmup_steps=4` is tuned for few-step distilled models; increase it to 6-8 when running with more inference steps
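
As a rough illustration of these trade-offs, the sketch below contrasts a quality-leaning and a speed-leaning DBCache configuration using only the parameters documented above; the specific values are illustrative starting points, not benchmarked recommendations:

```python
from vllm_omni.entrypoints.omni import Omni

# Quality-leaning DBCache config: more compute blocks, lower threshold,
# fewer consecutive cache hits. Values are illustrative, not tuned.
quality_config = {
    "Fn_compute_blocks": 8,
    "max_warmup_steps": 8,
    "residual_diff_threshold": 0.12,
    "max_continuous_cached_steps": 2,
}

# Speed-leaning DBCache config: default blocks, the higher default threshold,
# and no limit on cached steps.
speed_config = {
    "Fn_compute_blocks": 1,
    "max_warmup_steps": 4,
    "residual_diff_threshold": 0.24,
    "max_cached_steps": -1,
}

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config=quality_config,  # or speed_config
)
```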

### 2. TaylorSeer

TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality.

**Key Parameters**:

- `enable_taylorseer` (bool, default: False): Enable TaylorSeer acceleration
- `taylorseer_order` (int, default: 1): Order of Taylor expansion. Higher orders provide better accuracy but require more computation.

**Example Configuration**:

```python
cache_config={
    "enable_taylorseer": True,
    "taylorseer_order": 1,  # First-order Taylor expansion
}
```

**Performance Tips**:

- Use `taylorseer_order=1` for most cases (good balance of speed and quality)
- Combine with DBCache for maximum acceleration (see the combined sketch after this list)
- Higher orders (2-3) may improve quality but reduce speed gains
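
A minimal sketch of TaylorSeer layered on top of DBCache, using only parameters documented in this guide; the values are illustrative:

```python
# DBCache + TaylorSeer combination (illustrative values, not tuned).
cache_config = {
    # DBCache
    "Fn_compute_blocks": 8,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "residual_diff_threshold": 0.12,
    # TaylorSeer: first-order forecasting of hidden states
    "enable_taylorseer": True,
    "taylorseer_order": 1,
}
```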

### 3. SCM (Step Computation Masking)

SCM allows you to specify which steps must be computed and which can use cached results, similar to LeMiCa/EasyCache style acceleration.

**Key Parameters**:

- `scm_steps_mask_policy` (str | None, default: None): Predefined mask policy. Options:
  - `None`: SCM disabled (default)
  - `"slow"`: More compute steps, higher quality (18 compute steps out of 28)
  - `"medium"`: Balanced (15 compute steps out of 28)
  - `"fast"`: More cache steps, faster inference (11 compute steps out of 28)
  - `"ultra"`: Maximum speed (8 compute steps out of 28)
- `scm_steps_policy` (str, default: "dynamic"): Policy for cached steps:
  - `"dynamic"`: Use dynamic cache for masked steps (recommended)
  - `"static"`: Use static cache for masked steps

**Example Configuration**:

```python
cache_config={
    "scm_steps_mask_policy": "medium",  # Balanced speed/quality
    "scm_steps_policy": "dynamic",      # Use dynamic cache
}
```

**Performance Tips**:

- SCM is disabled by default (`scm_steps_mask_policy=None`). Enable it by setting a policy value if you need additional acceleration (an online-serving example follows this list)
- Start with `"medium"` policy and adjust based on quality requirements
- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised
- `"dynamic"` policy generally provides better quality than `"static"`
- The SCM mask is automatically regenerated when `num_inference_steps` changes during inference
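
For online serving, the same SCM options can be passed as JSON via `--cache-config`, reusing the server flags shown earlier; the parameter values below are illustrative rather than tuned recommendations:

```bash
# SCM ("medium" policy) on top of DBCache defaults for online serving;
# values are illustrative, not tuned recommendations.
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend cache_dit \
  --cache-config '{"scm_steps_mask_policy": "medium", "scm_steps_policy": "dynamic", "Fn_compute_blocks": 1, "residual_diff_threshold": 0.24}'
```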

## Configuration Reference

### DiffusionCacheConfig Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) |
| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion |
| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) |
| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) |
| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) |
| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) |
| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) |
| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) |
| `taylorseer_order` | int | 1 | Taylor expansion order |
| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") |
| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") |
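
For reference, the defaults in the table above correspond to the following explicit `cache_config`; passing this dictionary should be equivalent to omitting `cache_config` entirely (a sketch based only on the values listed above):

```python
# Explicit cache_config mirroring the documented defaults; passing it should
# behave the same as not providing cache_config at all.
default_cache_config = {
    "Fn_compute_blocks": 1,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "max_cached_steps": -1,
    "max_continuous_cached_steps": 3,
    "residual_diff_threshold": 0.24,
    "num_inference_steps": None,
    "enable_taylorseer": False,
    "taylorseer_order": 1,
    "scm_steps_mask_policy": None,
    "scm_steps_policy": "dynamic",
}
```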

## Example: Accelerate Text-to-Image Generation with CacheDiT

See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example with cache-dit acceleration.

```bash
# Enable cache-dit with hybrid acceleration
cd examples/offline_inference/text_to_image
python text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache_backend cache_dit \
    --num_inference_steps 50
```


The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer:

```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        # Scheme: Hybrid DBCache + SCM + TaylorSeer
        # DBCache
        "Fn_compute_blocks": 8,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.12,
        # TaylorSeer
        "enable_taylorseer": True,
        "taylorseer_order": 1,
        # SCM
        "scm_steps_mask_policy": "fast",  # Set to None to disable SCM
        "scm_steps_policy": "dynamic",
    },
)
```

You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements.
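
For instance, dropping TaylorSeer from the hybrid configuration above gives a DBCache + SCM setup (a sketch; the values are taken from the example above):

```python
# DBCache + SCM only: TaylorSeer disabled, otherwise the same values as the
# hybrid example above.
cache_config = {
    # DBCache
    "Fn_compute_blocks": 8,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "residual_diff_threshold": 0.12,
    # TaylorSeer disabled
    "enable_taylorseer": False,
    # SCM
    "scm_steps_mask_policy": "fast",
    "scm_steps_policy": "dynamic",
}
```

Setting `scm_steps_mask_policy` to `None` as well leaves a DBCache-only configuration.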

To test another model, change `--model` to the target model identifier (e.g., `Tongyi-MAI/Z-Image-Turbo`) and update `cache_config` according to the model architecture (e.g., its number of transformer blocks).
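
A hypothetical invocation for such a model might look like the following; the step count is illustrative (few-step distilled models typically need far fewer than 50 steps), and the in-script `cache_config` would still need to be adapted to the model:

```bash
# Hypothetical run with a different model; step count is illustrative.
python text_to_image.py \
    --model Tongyi-MAI/Z-Image-Turbo \
    --prompt "a cup of coffee on the table" \
    --cache_backend cache_dit \
    --num_inference_steps 8
```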


## Additional Resources

- [Cache-DiT User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/)
- [Cache-DiT Benchmark](https://cache-dit.readthedocs.io/en/latest/benchmark/HYBRID_CACHE/)
- [DBCache Technical Details](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)