# Cache-DiT Acceleration Guide

This guide explains how to use cache-dit acceleration in vLLM-Omni to speed up diffusion model inference.

## Overview

Cache-dit is a library that accelerates diffusion transformer models through intelligent caching mechanisms. It supports multiple acceleration techniques that can be combined for optimal performance:

- **DBCache**: Dual Block Cache for reducing redundant computations
- **TaylorSeer**: Taylor expansion-based forecasting for faster inference
- **SCM**: Step Computation Masking for selective step computation

## Quick Start

### Basic Usage

Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`. Cache-dit will use its recommended default parameters:

```python
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simplest way: just enable cache-dit with default parameters
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
)

images = omni.generate(
    "a beautiful landscape",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

**Default Parameters**: When `cache_config` is not provided, cache-dit uses optimized default values. See the [Configuration Reference](#configuration-reference) section for a complete list of all parameters and their default values.

### Custom Configuration

To customize cache-dit settings, provide a `cache_config` dictionary, for example:

```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        "Fn_compute_blocks": 1,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.12,
    },
)
```

## Online Serving (OpenAI-Compatible)

Enable Cache-DiT for online serving by passing `--cache-backend cache_dit` when starting the server:

```bash
# Use Cache-DiT default (recommended) parameters
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit
```

To customize Cache-DiT settings for online serving, pass a JSON string via `--cache-config`:

```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend cache_dit \
  --cache-config '{"Fn_compute_blocks": 1, "Bn_compute_blocks": 0, "max_warmup_steps": 4, "residual_diff_threshold": 0.12}'
```

## Acceleration Methods

For a comprehensive overview, see the cache-dit [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/).

### 1. DBCache (Dual Block Cache)

DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality.

**Key Parameters**:

- `Fn_compute_blocks` (int, default: 1): Number of **first n** transformer blocks used to compute stable feature differences. Higher values provide more accurate caching decisions but increase computation.
- `Bn_compute_blocks` (int, default: 0): Number of **last n** transformer blocks used for additional fusion. These blocks act as an auto-scaler for approximate hidden states.
- `max_warmup_steps` (int, default: 4): Number of initial steps during which caching is disabled, so that stable features are available before caching begins. The default is tuned for few-step distilled models.
- `residual_diff_threshold` (float, default: 0.24): Threshold on the residual difference. Higher values give faster inference but may reduce precision; the default of 0.24 is relatively high, favoring more aggressive caching.
- `max_cached_steps` (int, default: -1): Maximum number of cached steps. Set to -1 for unlimited caching.
- `max_continuous_cached_steps` (int, default: 3): Maximum number of consecutive cached steps. Limits consecutive caching to prevent precision degradation.

**Example Configuration**:

```python
cache_config={
    "Fn_compute_blocks": 8,      # Use first 8 blocks for difference computation
    "Bn_compute_blocks": 0,       # No additional fusion blocks
    "max_warmup_steps": 8,        # Cache after 8 warmup steps
    "residual_diff_threshold": 0.12,  # Higher threshold for faster inference
    "max_cached_steps": -1,        # No limit on cached steps
}
```

**Performance Tips**:

- Default `Fn_compute_blocks=1` works well for most cases; increase it to 8-12 for larger models or when more accuracy is needed
- Lower `residual_diff_threshold` (e.g., to 0.12-0.15) for higher quality with less aggressive caching, or keep the default 0.24 (or raise it) for faster inference with a slight quality trade-off; see the sketch after this list
- Default `max_warmup_steps=4` is tuned for few-step distilled models; increase it to 6-8 when running with more inference steps
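
As a rough illustration of these trade-offs, the sketch below contrasts a quality-leaning and a speed-leaning DBCache configuration using only the parameters documented above; the specific values are illustrative starting points, not benchmarked recommendations:

```python
from vllm_omni.entrypoints.omni import Omni

# Quality-leaning DBCache config: more compute blocks, lower threshold,
# fewer consecutive cache hits. Values are illustrative, not tuned.
quality_config = {
    "Fn_compute_blocks": 8,
    "max_warmup_steps": 8,
    "residual_diff_threshold": 0.12,
    "max_continuous_cached_steps": 2,
}

# Speed-leaning DBCache config: default blocks, the higher default threshold,
# and no limit on cached steps.
speed_config = {
    "Fn_compute_blocks": 1,
    "max_warmup_steps": 4,
    "residual_diff_threshold": 0.24,
    "max_cached_steps": -1,
}

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config=quality_config,  # or speed_config
)
```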

### 2. TaylorSeer

TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality.

**Key Parameters**:

- `enable_taylorseer` (bool, default: False): Enable TaylorSeer acceleration
- `taylorseer_order` (int, default: 1): Order of Taylor expansion. Higher orders provide better accuracy but require more computation.

**Example Configuration**:

```python
cache_config={
    "enable_taylorseer": True,
    "taylorseer_order": 1,  # First-order Taylor expansion
}
```

**Performance Tips**:

- Use `taylorseer_order=1` for most cases (good balance of speed and quality)
- Combine with DBCache for maximum acceleration (see the combined sketch after this list)
- Higher orders (2-3) may improve quality but reduce speed gains
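
A minimal sketch of TaylorSeer layered on top of DBCache, using only parameters documented in this guide; the values are illustrative:

```python
# DBCache + TaylorSeer combination (illustrative values, not tuned).
cache_config = {
    # DBCache
    "Fn_compute_blocks": 8,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "residual_diff_threshold": 0.12,
    # TaylorSeer: first-order forecasting of hidden states
    "enable_taylorseer": True,
    "taylorseer_order": 1,
}
```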

### 3. SCM (Step Computation Masking)

SCM allows you to specify which steps must be computed and which can use cached results, similar to LeMiCa/EasyCache style acceleration.

**Key Parameters**:

- `scm_steps_mask_policy` (str | None, default: None): Predefined mask policy. Options:
  - `None`: SCM disabled (default)
  - `"slow"`: More compute steps, higher quality (18 compute steps out of 28)
  - `"medium"`: Balanced (15 compute steps out of 28)
  - `"fast"`: More cache steps, faster inference (11 compute steps out of 28)
  - `"ultra"`: Maximum speed (8 compute steps out of 28)
- `scm_steps_policy` (str, default: "dynamic"): Policy for cached steps:
  - `"dynamic"`: Use dynamic cache for masked steps (recommended)
  - `"static"`: Use static cache for masked steps

**Example Configuration**:

```python
cache_config={
    "scm_steps_mask_policy": "medium",  # Balanced speed/quality
    "scm_steps_policy": "dynamic",      # Use dynamic cache
}
```

**Performance Tips**:

- SCM is disabled by default (`scm_steps_mask_policy=None`). Enable it by setting a policy value if you need additional acceleration (an online-serving example follows this list)
- Start with `"medium"` policy and adjust based on quality requirements
- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised
- `"dynamic"` policy generally provides better quality than `"static"`
- The SCM mask is automatically regenerated when `num_inference_steps` changes during inference
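
For online serving, the same SCM options can be passed as JSON via `--cache-config`, reusing the server flags shown earlier; the parameter values below are illustrative rather than tuned recommendations:

```bash
# SCM ("medium" policy) on top of DBCache defaults for online serving;
# values are illustrative, not tuned recommendations.
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend cache_dit \
  --cache-config '{"scm_steps_mask_policy": "medium", "scm_steps_policy": "dynamic", "Fn_compute_blocks": 1, "residual_diff_threshold": 0.24}'
```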

## Configuration Reference

### DiffusionCacheConfig Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) |
| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion |
| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) |
| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) |
| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) |
| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) |
| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) |
| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) |
| `taylorseer_order` | int | 1 | Taylor expansion order |
| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") |
| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") |
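
For reference, the defaults in the table above correspond to the following explicit `cache_config`; passing this dictionary should be equivalent to omitting `cache_config` entirely (a sketch based only on the values listed above):

```python
# Explicit cache_config mirroring the documented defaults; passing it should
# behave the same as not providing cache_config at all.
default_cache_config = {
    "Fn_compute_blocks": 1,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "max_cached_steps": -1,
    "max_continuous_cached_steps": 3,
    "residual_diff_threshold": 0.24,
    "num_inference_steps": None,
    "enable_taylorseer": False,
    "taylorseer_order": 1,
    "scm_steps_mask_policy": None,
    "scm_steps_policy": "dynamic",
}
```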

## Example: Accelerate Text-to-Image Generation with CacheDiT

See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example with cache-dit acceleration.

```bash
# Enable cache-dit with hybrid acceleration
cd examples/offline_inference/text_to_image
python text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache_backend cache_dit \
    --num_inference_steps 50
```


The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer:

```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        # Scheme: Hybrid DBCache + SCM + TaylorSeer
        # DBCache
        "Fn_compute_blocks": 8,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.12,
        # TaylorSeer
        "enable_taylorseer": True,
        "taylorseer_order": 1,
        # SCM
        "scm_steps_mask_policy": "fast",  # Set to None to disable SCM
        "scm_steps_policy": "dynamic",
    },
)
```

You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements.
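
For instance, dropping TaylorSeer from the hybrid configuration above gives a DBCache + SCM setup (a sketch; the values are taken from the example above):

```python
# DBCache + SCM only: TaylorSeer disabled, otherwise the same values as the
# hybrid example above.
cache_config = {
    # DBCache
    "Fn_compute_blocks": 8,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "residual_diff_threshold": 0.12,
    # TaylorSeer disabled
    "enable_taylorseer": False,
    # SCM
    "scm_steps_mask_policy": "fast",
    "scm_steps_policy": "dynamic",
}
```

Setting `scm_steps_mask_policy` to `None` as well leaves a DBCache-only configuration.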

To test another model, change `--model` to the target model identifier (e.g., `Tongyi-MAI/Z-Image-Turbo`) and update `cache_config` according to the model architecture (e.g., its number of transformer blocks).
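
A hypothetical invocation for such a model might look like the following; the step count is illustrative (few-step distilled models typically need far fewer than 50 steps), and the in-script `cache_config` would still need to be adapted to the model:

```bash
# Hypothetical run with a different model; step count is illustrative.
python text_to_image.py \
    --model Tongyi-MAI/Z-Image-Turbo \
    --prompt "a cup of coffee on the table" \
    --cache_backend cache_dit \
    --num_inference_steps 8
```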


## Additional Resources

- [Cache-DiT User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/)
- [Cache-DiT Benchmark](https://cache-dit.readthedocs.io/en/latest/benchmark/HYBRID_CACHE/)
- [DBCache Technical Details](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)