# TeaCache Configuration Guide

TeaCache speeds up diffusion model inference by caching transformer computations and reusing them when consecutive timesteps are sufficiently similar. This typically yields a **1.5x-2.0x speedup** with minimal quality loss.
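
The skip decision works roughly as follows: at each denoising step, the cache backend measures how much the transformer's (timestep-modulated) input has changed since the previous step, accumulates that relative L1 distance, and reuses the previously cached output residual as long as the accumulated distance stays below `rel_l1_thresh`. The sketch below is a simplified illustration of that rule, not the library's actual implementation; real TeaCache backends may also rescale the distance with model-specific coefficients.

```python
import torch


class TeaCacheSketch:
    """Simplified illustration of the TeaCache skip/reuse rule.

    This is NOT the vllm_omni implementation; it only illustrates how
    `rel_l1_thresh` gates when transformer blocks are skipped.
    """

    def __init__(self, rel_l1_thresh: float = 0.2):
        self.rel_l1_thresh = rel_l1_thresh
        self.prev_input = None        # modulated input seen at the previous step
        self.cached_residual = None   # output - input from the last computed step
        self.accumulated = 0.0        # relative L1 distance since the last compute

    def step(self, modulated_input: torch.Tensor, run_blocks):
        skip = False
        if self.prev_input is not None and self.cached_residual is not None:
            # Relative L1 change between this step's input and the previous one.
            rel_l1 = ((modulated_input - self.prev_input).abs().mean()
                      / self.prev_input.abs().mean()).item()
            self.accumulated += rel_l1
            skip = self.accumulated < self.rel_l1_thresh
        self.prev_input = modulated_input
        if skip:
            # Inputs barely changed: reuse the cached residual and skip the
            # expensive transformer blocks for this step.
            return modulated_input + self.cached_residual
        # First step, or too much accumulated drift: compute for real.
        self.accumulated = 0.0
        output = run_blocks(modulated_input)
        self.cached_residual = output - modulated_input
        return output
```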

## Quick Start

Enable TeaCache by setting `cache_backend` to `"tea_cache"`:

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={
        "rel_l1_thresh": 0.2  # Optional, defaults to 0.2
    }
)
outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(
        num_inference_steps=50,
    ),
)
```

### Using Environment Variable

You can also enable TeaCache via environment variable:

```bash
export DIFFUSION_CACHE_BACKEND=tea_cache
```

Then initialize without explicitly setting `cache_backend`:

```python
from vllm_omni import Omni

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_config={"rel_l1_thresh": 0.2}  # Optional
)
```

## Online Serving (OpenAI-Compatible)

Enable TeaCache for online serving by passing `--cache-backend tea_cache` when starting the server:

```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend tea_cache \
  --cache-config '{"rel_l1_thresh": 0.2}'
```
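
Once the server is up, clients talk to it over the OpenAI-compatible HTTP API. The snippet below is a minimal client sketch that assumes the server exposes the standard OpenAI image generation route at `/v1/images/generations`; check your server's route listing for the exact path and response schema.

```python
import requests

# Assumes the OpenAI-compatible /v1/images/generations route; verify the
# exact path and response schema against your server's route listing.
resp = requests.post(
    "http://localhost:8091/v1/images/generations",
    json={
        "model": "Qwen/Qwen-Image",
        "prompt": "A cat sitting on a windowsill",
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```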

## Configuration Parameters

### `rel_l1_thresh` (float, default: `0.2`)

Controls the balance between speed and quality. Lower values prioritize quality; higher values prioritize speed.

**Recommended values:**

- `0.2` - **~1.5x speedup** with minimal quality loss (recommended)
- `0.4` - **~1.8x speedup** with slight quality loss
- `0.6` - **~2.0x speedup** with noticeable quality loss
- `0.8` - **~2.25x speedup** with significant quality loss


## Performance Tuning

Start with the default `rel_l1_thresh=0.2` and adjust based on your needs (the sweep sketch after this list can help you measure the impact):

- **Maximum quality**: Use `0.1-0.2`
- **Balanced**: Use `0.2-0.4` (recommended)
- **Maximum speed**: Use `0.6-0.8` (may reduce quality)
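
To measure the trade-off on your own prompts, a rough sweep over thresholds such as the one below can help. It reuses the `Omni` API exactly as shown in the Quick Start and only reports wall-clock time, so quality still has to be judged by inspecting the outputs.

```python
import time

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

prompt = "A cat sitting on a windowsill"
for thresh in (0.1, 0.2, 0.4, 0.6):
    # cache_config is fixed at construction time, so each threshold
    # needs its own engine instance.
    omni = Omni(
        model="Qwen/Qwen-Image",
        cache_backend="tea_cache",
        cache_config={"rel_l1_thresh": thresh},
    )
    start = time.perf_counter()
    omni.generate(
        prompt,
        OmniDiffusionSamplingParams(num_inference_steps=50),
    )
    print(f"rel_l1_thresh={thresh}: {time.perf_counter() - start:.1f}s")
```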

## Troubleshooting

### Quality Degradation

If you notice quality issues, lower the threshold:

```python
cache_config={"rel_l1_thresh": 0.1}  # More conservative caching
```

## Supported Models

### ImageGen

<style>
th {
  white-space: nowrap;
  min-width: 0 !important;
}
</style>

| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` |
| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` |
| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` |
| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` |
| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` |

### VideoGen

No VideoGen models are supported by TeaCache yet.

### Coming Soon


| Architecture | Models | Example HF Models |
|--------------|--------|-------------------|
| `FluxPipeline` | Flux | - |
| `CogVideoXPipeline` | CogVideoX | - |