<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ControlNet

[ControlNet](https://huggingface.co/papers/2302.05543) is an adapter that enables controllable generation, such as generating an image of a cat in a *specific pose* or following the lines in a sketch of a *specific* cat. It works by adding a smaller network of "zero convolution" layers and progressively training these so they don't disrupt the original model. The original model parameters are frozen so they don't need to be retrained.
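
To make the idea concrete, here is a minimal PyTorch sketch of a zero convolution (a hypothetical illustration, not the actual Diffusers implementation): a 1x1 convolution initialized to zero contributes nothing at first, so training can gradually introduce the control signal without disturbing the frozen base model.

```py
import torch
import torch.nn as nn

# hypothetical sketch of a "zero convolution": a 1x1 conv whose weight and
# bias start at zero, so the adapter's residual is zero at initialization
class ZeroConv2d(nn.Conv2d):
    def __init__(self, channels):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

zero_conv = ZeroConv2d(320)
hidden_states = torch.randn(1, 320, 64, 64)

# the residual added to the frozen model's features is all zeros, so the
# base model's behavior is unchanged at the start of training
assert torch.all(zero_conv(hidden_states) == 0)
```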

A ControlNet is conditioned on extra visual information or "structural controls" (canny edge, depth maps, human pose, etc.) that can be combined with text prompts to generate images that are guided by the visual input.

> [!TIP]
> ControlNets are available to many models such as [Flux](../api/pipelines/controlnet_flux), [Hunyuan-DiT](../api/pipelines/controlnet_hunyuandit), [Stable Diffusion 3](../api/pipelines/controlnet_sd3), and more. The examples in this guide use Flux and Stable Diffusion XL.

Load a ControlNet conditioned on a specific control, such as canny edge, and pass it to the pipeline in [`~DiffusionPipeline.from_pretrained`].

<hfoptions id="usage">
<hfoption id="text-to-image">

Generate a canny image with [opencv-python](https://github.com/opencv/opencv-python).

```py
import cv2
import numpy as np
from PIL import Image
from diffusers.utils import load_image

original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
)

image = np.array(original_image)

# pixels between the thresholds are kept only if they connect to a strong edge
low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
# stack the single-channel edge map into a 3-channel RGB image
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
```

Pass the canny image to the pipeline. Use the `controlnet_conditioning_scale` parameter to determine how much weight to assign to the control.

```py
import torch
from diffusers.utils import load_image
from diffusers import FluxControlNetPipeline, FluxControlNetModel

controlnet = FluxControlNetModel.from_pretrained(
    "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16
)

pipeline = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")

prompt = """
A photorealistic overhead image of a cat reclining sideways in a flamingo pool floatie holding a margarita. 
The cat is floating leisurely in the pool and completely relaxed and happy.
"""

pipeline(
    prompt, 
    control_image=canny_image,
    controlnet_conditioning_scale=0.5,
    num_inference_steps=50, 
    guidance_scale=3.5,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png" width="300" alt="Generated image (prompt only)"/>
    <figcaption style="text-align: center;">original image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat.png" width="300" alt="Control image (Canny edges)"/>
    <figcaption style="text-align: center;">canny image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat-generated.png" width="300" alt="Generated image (ControlNet + prompt)"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

</hfoption>
<hfoption id="image-to-image">

Generate a depth map with a depth estimation model from Transformers.

```py
import torch
import numpy as np
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline, AutoencoderKL
from diffusers.utils import load_image


depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to("cuda")
feature_extractor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")

def get_depth_map(image):
    image = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda")
    with torch.no_grad(), torch.autocast("cuda"):
        depth_map = depth_estimator(image).predicted_depth

    depth_map = torch.nn.functional.interpolate(
        depth_map.unsqueeze(1),
        size=(1024, 1024),
        mode="bicubic",
        align_corners=False,
    )
    depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
    depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
    depth_map = (depth_map - depth_min) / (depth_max - depth_min)
    image = torch.cat([depth_map] * 3, dim=1)
    image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
    image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
    return image

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
).resize((1024, 1024))
depth_image = get_depth_map(image)
```

Pass the depth map to the pipeline. Use the `controlnet_conditioning_scale` parameter to determine how much weight to assign to the control.

```py
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0-small",
    torch_dtype=torch.float16,
)

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipeline = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

prompt = """
A photorealistic overhead image of a cat reclining sideways in a flamingo pool floatie holding a margarita. 
The cat is floating leisurely in the pool and completely relaxed and happy.
"""
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
).resize((1024, 1024))
controlnet_conditioning_scale = 0.5 
pipeline(
    prompt,
    image=image,
    control_image=depth_image,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    strength=0.99,
    num_inference_steps=100,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png" width="300" alt="Generated image (prompt only)"/>
    <figcaption style="text-align: center;">original image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_depth_image.png" width="300" alt="Control image (depth map)"/>
    <figcaption style="text-align: center;">depth map</figcaption>
  </figure>
  <figure> 
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_depth_cat.png" width="300" alt="Generated image (ControlNet + prompt)"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

</hfoption>
<hfoption id="inpainting">

Load a mask image to mark the pixels to inpaint, and generate a canny control image from the initial image to guide the inpainted content.

```py
import cv2
import torch
import numpy as np
from PIL import Image
from diffusers.utils import load_image
from diffusers import StableDiffusionXLControlNetInpaintPipeline, ControlNetModel

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"
)
init_image = init_image.resize((1024, 1024))
mask_image = load_image(
    "/content/cat_mask.png"
)
mask_image = mask_image.resize((1024, 1024))

def make_canny_condition(image):
    image = np.array(image)
    image = cv2.Canny(image, 100, 200)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    image = Image.fromarray(image)
    return image

control_image = make_canny_condition(init_image)
```

Pass the mask and control image to the pipeline. Use the `controlnet_conditioning_scale` parameter to determine how much weight to assign to the control.

```py
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)

pipeline = StableDiffusionXLControlNetInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipeline(
    "a cute and fluffy bunny rabbit",
    num_inference_steps=100,
    strength=0.99,
    controlnet_conditioning_scale=0.5,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png" width="300" alt="Generated image (prompt only)"/>
    <figcaption style="text-align: center;">original image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat_mask.png" width="300" alt="Mask image"/>
    <figcaption style="text-align: center;">mask image</figcaption>
  </figure>
  <figure> 
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_rabbit_inpaint.png" width="300" alt="Generated image (ControlNet + prompt)"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

</hfoption>
</hfoptions>

## Multi-ControlNet

You can compose multiple ControlNet conditionings, such as a canny image and a depth map, to create a *MultiControlNet*. For the best results, you should mask the conditionings so they don't overlap and experiment with different `controlnet_conditioning_scale` parameters to adjust how much weight is assigned to each control input.
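
One way to keep the conditionings from overlapping is to zero out each control image outside the region it should govern. The sketch below is a minimal example, reusing the `canny_image` and `depth_image` from the earlier examples, that assigns each control to one half of the frame.

```py
import numpy as np
from PIL import Image

# canny_image and depth_image are assumed to come from the earlier examples
canny = np.array(canny_image)
depth = np.array(depth_image)

# zero out the right half of the canny image and the left half of the depth
# map so each control only applies to its own region
width = canny.shape[1]
canny[:, width // 2 :] = 0
depth[:, : width // 2] = 0

masked_canny_image = Image.fromarray(canny)
masked_depth_image = Image.fromarray(depth)
```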

The example below composes a canny image and depth map.

Pass the ControlNets as a list to the pipeline and resize the images to the expected input size.

```py
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL

controlnets = [
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-depth-sdxl-1.0-small", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16,
    ),
]

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16
).to("cuda")

prompt = """
a relaxed rabbit sitting on a striped towel next to a pool with a tropical drink nearby, 
bright sunny day, vacation scene, 35mm photograph, film, professional, 4k, highly detailed
"""
negative_prompt = "lowres, bad anatomy, worst quality, low quality, deformed, ugly"

# the control images must be in the same order as the ControlNets (depth first)
images = [depth_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]

pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image=images,
    num_inference_steps=100,
    controlnet_conditioning_scale=[0.5, 0.5],
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat.png" width="300" alt="Control image (Canny edges)"/>
    <figcaption style="text-align: center;">canny image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/multicontrolnet_depth.png" width="300" alt="Control image (depth map)"/>
    <figcaption style="text-align: center;">depth map</figcaption>
  </figure>
  <figure> 
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl_multi_controlnet.png" width="300" alt="Generated image (ControlNet + prompt)"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

## guess_mode

[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) generates an image from **only** the control input (canny edge, depth map, pose, etc.) without any guidance from a prompt. It adjusts the scale of the ControlNet's output residuals by a fixed ratio that depends on block depth: the shallowest `DownBlock` residual is only scaled by `0.1`, and the scale increases with depth until the `MidBlock` residual is fully scaled by `1.0`.
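
The per-block scales follow a log-spaced schedule. The snippet below sketches it, assuming 13 residuals (12 down-block outputs plus the mid-block output, as in the Stable Diffusion UNet; the exact count varies by model).

```py
import torch

# one scale per ControlNet residual, log-spaced from 0.1 (shallowest
# DownBlock) to 1.0 (MidBlock); 13 residuals assumed here
scales = torch.logspace(-1, 0, 13)
print(scales)
# tensor([0.1000, 0.1212, 0.1468, ..., 0.8254, 1.0000])
```

The example below generates an image from only a canny control by passing an empty prompt and enabling `guess_mode`.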

```py
import torch
from diffusers.utils import load_image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
  "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)

pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat.png")
pipeline(
    "",  # no prompt; generate from the control input alone
    image=canny_image,
    guess_mode=True
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/canny-cat.png" width="300" alt="Control image (Canny edges)"/>
    <figcaption style="text-align: center;">canny image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guess_mode.png" width="300" alt="Generated image (Guess mode)"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>