<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Quickstart

Diffusers is a library for developers and researchers that provides an easy inference API for generating images, videos and audio, as well as the building blocks for implementing new workflows.

Diffusers provides many out-of-the-box optimizations that make it possible to load and run large models on memory-constrained setups and to accelerate inference.

This Quickstart will give you an overview of Diffusers and get you up and generating quickly.

> [!TIP]
> Before you begin, make sure you have a Hugging Face [account](https://huggingface.co/join) in order to use gated models like [Flux](https://huggingface.co/black-forest-labs/FLUX.1-dev).
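If you haven't authenticated yet, you can log in from Python with `huggingface_hub` (installed alongside Diffusers); the minimal sketch below prompts for an access token:

```py
from huggingface_hub import login

# prompts for a token created at https://huggingface.co/settings/tokens
login()
```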

Follow the [Installation](./installation) guide to install Diffusers if it's not already installed.

## DiffusionPipeline

A diffusion model combines multiple components to generate outputs in any modality based on an input, such as a text description, an image, or both.

For a standard text-to-image model:

1. A text encoder turns a prompt into embeddings that guide the denoising process. Some models have more than one text encoder.
2. A scheduler contains the algorithmic specifics for gradually denoising initial random noise into clean outputs. Different schedulers affect generation speed and quality.
3. A UNet or diffusion transformer (DiT) is the workhorse of a diffusion model.

  At each step, it performs the denoising prediction, such as how much noise to remove or the general direction in which to steer the noise toward a better quality output.

  The UNet or DiT repeats this loop for a set number of steps to generate the final output.

4. A variational autoencoder (VAE) encodes and decodes pixels to a spatially compressed latent space. *Latents* are compressed representations of an image and are more efficient to work with. The UNet or DiT operates on latents, and the clean latents at the end are decoded back into images. Each component is inspectable on a loaded pipeline, as shown in the sketch after this list.
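The `components` property on a loaded pipeline maps component names to the underlying modules. A minimal sketch for inspecting them (using the Qwen-Image model featured throughout this guide):

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
  "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)

# print each component's name and class, e.g. text_encoder, scheduler, transformer, vae
for name, component in pipeline.components.items():
    print(name, type(component).__name__)
```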

The [`DiffusionPipeline`] packages all these components into a single class for inference. There are several arguments in [`~DiffusionPipeline.__call__`] you can change, such as `num_inference_steps`, that affect the diffusion process. Try different values and arguments to see how they change generation quality or speed.

Load a model with [`~DiffusionPipeline.from_pretrained`] and describe what you'd like to generate. The example below uses the default argument values.

<hfoptions id="diffusionpipeline">
<hfoption id="text-to-image">

Use `.images[0]` to access the generated image output.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
  "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
```
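The call above uses the default argument values. As a quick sketch of overriding them (the step count here is illustrative, not a tuned recommendation):

```py
# fewer steps is faster at some cost to detail; the output is a PIL image
image = pipeline(prompt, num_inference_steps=30).images[0]
image.save("output.png")
```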

</hfoption>
<hfoption id="text-to-video">

Use `.frames[0]` to access the generated video output and [`~utils.export_to_video`] to save the video.

```py
import torch
from diffusers import AutoencoderKLWan, DiffusionPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
  "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
  subfolder="vae",
  torch_dtype=torch.float32
)
pipeline = DiffusionPipeline.from_pretrained(
  "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
  vae=vae,
  torch_dtype=torch.bfloat16,
  device_map="cuda"
)

prompt = """
Cinematic video of a sleek cat lounging on a colorful inflatable in a crystal-clear turquoise pool in Palm Springs, 
sipping a salt-rimmed margarita through a straw. Golden-hour sunlight glows over mid-century modern homes and swaying palms. 
Shot on a Sony a7S III with moody, glamorous color grading, subtle lens flares, and soft vintage film grain. 
Ripples shimmer as a warm desert breeze stirs the water, blending luxury and playful charm in an epic, gorgeously composed frame.
"""
video = pipeline(prompt=prompt, num_frames=81, num_inference_steps=40).frames[0]
export_to_video(video, "output.mp4", fps=16)
```
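At 16 frames per second, the 81 frames requested above come out to roughly a five-second clip, so `num_frames` and the `fps` argument of [`~utils.export_to_video`] together determine the clip length.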

</hfoption>
</hfoptions>

## LoRA

Adapters insert a small number of trainable parameters into the original base model. Only the inserted parameters are fine-tuned while the rest of the model weights remain frozen. This makes it fast and cheap to fine-tune a model on a new style. Among adapters, [LoRAs](./tutorials/using_peft_for_inference) are the most popular.

Add a LoRA to a pipeline with the [`~loaders.QwenImageLoraLoaderMixin.load_lora_weights`] method. Some LoRAs require a special word to trigger them, such as `Realism` in the example below. Check a LoRA's model card to see if it requires a trigger word.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
  "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipeline.load_lora_weights(
  "flymy-ai/qwen-image-realism-lora",
)

prompt = """
super Realism cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
```
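To switch the adapter off or remove it afterwards, a small sketch using the pipeline loaded above:

```py
# temporarily disable the LoRA, and re-enable it later
pipeline.disable_lora()
pipeline.enable_lora()

# or remove the LoRA weights from the pipeline entirely
pipeline.unload_lora_weights()
```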

Check out the [LoRA](./tutorials/using_peft_for_inference) docs or the Adapters section to learn more.

## Quantization

[Quantization](./quantization/overview) stores data in fewer bits to reduce memory usage. It may also speed up inference because it takes less time to perform calculations with fewer bits.
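As a rough illustration, a 20B-parameter model needs about 40GB for its weights in bfloat16 (2 bytes per parameter) but only about 10GB at 4-bits (0.5 bytes per parameter), before accounting for activations and any components left unquantized.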

Diffusers provides several quantization backends and picking one depends on your use case. For example, [bitsandbytes](./quantization/bitsandbytes) and [torchao](./quantization/torchao) are both simple and easy to use for inference, but torchao supports more [quantization types](./quantization/torchao#supported-quantization-types) like fp8.
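As a sketch of how the backend choice looks in code, the bitsandbytes config in the example below could be swapped for a torchao one (`int8wo` is int8 weight-only quantization):

```py
from diffusers.quantizers import PipelineQuantizationConfig

# torchao variant of the config used in the example below
quant_config = PipelineQuantizationConfig(
  quant_backend="torchao",
  quant_kwargs={"quant_type": "int8wo"},
  components_to_quantize=["transformer"],
)
```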

Configure [`PipelineQuantizationConfig`] with the backend to use, the specific arguments (refer to the [API](./api/quantization) reference for available arguments) for that backend, and which components to quantize. The example below quantizes the model to 4-bits and only uses 14.93GB of memory.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
  quant_backend="bitsandbytes_4bit",
  quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
  components_to_quantize=["transformer", "text_encoder"],
)
pipeline = DiffusionPipeline.from_pretrained(
  "Qwen/Qwen-Image",
  torch_dtype=torch.bfloat16,
  quantization_config=quant_config,
  device_map="cuda"
)

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
Steven Liu's avatar
Steven Liu committed
159
160
```

Take a look at the [Quantization](./quantization/overview) section for more details.

## Optimizations

Modern diffusion models are very large and have billions of parameters. The iterative denoising process is also computationally intensive and slow. Diffusers provides techniques for reducing memory usage and boosting inference speed. These techniques can be combined with quantization to optimize for both memory usage and inference speed.

### Memory usage

The text encoders and UNet or DiT can use up to ~30GB of memory, exceeding the amount available on many free-tier or consumer GPUs.

Offloading keeps weights that aren't currently in use on the CPU and only moves them to the GPU when they're needed. There are a few offloading types, and the example below uses [model offloading](./optimization/memory#model-offloading). This moves an entire model, like a text encoder or transformer, to the CPU when it isn't actively being used.

Call [`~DiffusionPipeline.enable_model_cpu_offload`] to activate it. By combining quantization and offloading, the following example only requires ~12.54GB of memory.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
  quant_backend="bitsandbytes_4bit",
  quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
  components_to_quantize=["transformer", "text_encoder"],
)
pipeline = DiffusionPipeline.from_pretrained(
  "Qwen/Qwen-Image",
  torch_dtype=torch.bfloat16,
  quantization_config=quant_config,
  device_map="cuda"
)
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
Steven Liu's avatar
Steven Liu committed
199
200
```

Refer to the [Reduce memory usage](./optimization/memory) docs to learn more about other memory-reduction techniques.

### Inference speed

The denoising loop performs a lot of computation and can be slow. Methods like [torch.compile](./optimization/fp16#torchcompile) increase inference speed by compiling the computation into optimized kernels. Compilation is slow for the first generation, but successive generations should be much faster.

The example below uses [regional compilation](./optimization/fp16#regional-compilation) to only compile small regions of a model. It reduces cold-start latency while also providing a runtime speedup.

Call [`~ModelMixin.compile_repeated_blocks`] on the model to activate it.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
  "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)

pipeline.transformer.compile_repeated_blocks(
    fullgraph=True,
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
```

Check out the [Accelerate inference](./optimization/fp16) or [Caching](./optimization/cache) docs for more methods that speed up inference.