<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# IP-Adapter

[IP-Adapter](https://huggingface.co/papers/2308.06721) is a lightweight adapter designed to integrate image-based guidance with text-to-image diffusion models. The adapter uses an image encoder to extract image features that are passed to newly added cross-attention layers in the UNet, which are fine-tuned. The original UNet and the existing cross-attention layers corresponding to text features are frozen. Decoupling the cross-attention for image and text features enables more fine-grained and controllable generation.
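Conceptually, the decoupled cross-attention adds a second attention branch over image features and sums its output with the frozen text branch. The sketch below is a simplified illustration, not the actual Diffusers implementation (the real adapter uses separate key/value projection weights per branch, omitted here); the `scale` factor corresponds to what `set_ip_adapter_scale` controls.

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(query, text_feats, image_feats, scale=1.0):
    """Simplified sketch of IP-Adapter's decoupled cross-attention.

    The text branch is frozen; only the image branch is trained.
    """
    # frozen text cross-attention (keys/values from text features)
    text_out = F.scaled_dot_product_attention(query, text_feats, text_feats)
    # newly added, trainable cross-attention (keys/values from image features)
    image_out = F.scaled_dot_product_attention(query, image_feats, image_feats)
    # the two branches are summed; `scale` weights the image branch
    return text_out + scale * image_out

# toy shapes: (batch, heads, tokens, head_dim)
q = torch.randn(1, 8, 64, 40)
text_feats = torch.randn(1, 8, 77, 40)   # e.g. 77 text tokens
image_feats = torch.randn(1, 8, 4, 40)   # e.g. 4 image tokens
out = decoupled_cross_attention(q, text_feats, image_feats, scale=0.8)
print(out.shape)  # torch.Size([1, 8, 64, 40])
```

With `scale=0.0` the image branch contributes nothing and the output reduces to plain text cross-attention, which is why lowering the scale shifts generation back toward the text prompt.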

IP-Adapter files are typically ~100MB because they only contain the image embeddings. This means you need to load a model first, and then load the IP-Adapter with [`~loaders.IPAdapterMixin.load_ip_adapter`].

> [!TIP]
> IP-Adapters are available for many models such as [Flux](../api/pipelines/flux#ip-adapter) and [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3). The examples in this guide use Stable Diffusion and Stable Diffusion XL.

Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method to scale the influence of the IP-Adapter during generation. A value of `1.0` means the model is only conditioned on the image prompt, while `0.5` typically produces balanced results between the text and image prompts.

```py
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)
pipeline.set_ip_adapter_scale(0.8)
```

Pass an image to `ip_adapter_image` along with a text prompt to generate an image.

```py
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png" width="400" alt="IP-Adapter image"/>
    <figcaption style="text-align: center;">IP-Adapter image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner_2.png" width="400" alt="generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

Take a look at the examples below to learn how to use IP-Adapter for other tasks.

<hfoptions id="usage">
<hfoption id="image-to-image">

```py
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)
pipeline.set_ip_adapter_scale(0.8)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png")
pipeline(
    prompt="best quality, high quality",
    image=image,
    ip_adapter_image=ip_image,
    strength=0.5,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png" width="300" alt="input image"/>
    <figcaption style="text-align: center;">input image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png" width="300" alt="IP-Adapter image"/>
    <figcaption style="text-align: center;">IP-Adapter image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_3.png" width="300" alt="generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

</hfoption>
<hfoption id="inpainting">

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)
pipeline.set_ip_adapter_scale(0.6)

mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png")
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png")
pipeline(
    prompt="a cute gummy bear waving",
    image=image,
    mask_image=mask_image,
    ip_adapter_image=ip_image,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png" width="300" alt="input image"/>
    <figcaption style="text-align: center;">input image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png" width="300" alt="IP-Adapter image"/>
    <figcaption style="text-align: center;">IP-Adapter image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png" width="300" alt="generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

</hfoption>
<hfoption id="video">

The [`~DiffusionPipeline.enable_model_cpu_offload`] method is useful for reducing memory usage, and it should be enabled **after** the IP-Adapter is loaded. Otherwise, the IP-Adapter's image encoder is also offloaded to the CPU and raises an error.

```py
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif
from diffusers.utils import load_image

adapter = MotionAdapter.from_pretrained(
  "guoyww/animatediff-motion-adapter-v1-5-2",
  torch_dtype=torch.float16
)
pipeline = AnimateDiffPipeline.from_pretrained(
  "emilianJR/epiCRealism",
  motion_adapter=adapter,
  torch_dtype=torch.float16
)
scheduler = DDIMScheduler.from_pretrained(
    "emilianJR/epiCRealism",
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipeline.scheduler = scheduler
pipeline.enable_vae_slicing()
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipeline.enable_model_cpu_offload()

ip_adapter_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png")
frames = pipeline(
    prompt="A cute gummy bear waving",
    negative_prompt="bad quality, worse quality, low resolution",
    ip_adapter_image=ip_adapter_image,
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=50,
).frames[0]
export_to_gif(frames, "gummy_bear.gif")
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png" width="400" alt="IP-Adapter image"/>
    <figcaption style="text-align: center;">IP-Adapter image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gummy_bear.gif" width="400" alt="generated video"/>
    <figcaption style="text-align: center;">generated video</figcaption>
  </figure>
</div>

</hfoption>
</hfoptions>

## Model variants

There are two variants of IP-Adapter, Plus and FaceID. The Plus variant uses patch embeddings and the ViT-H image encoder, while the FaceID variant uses face embeddings generated by InsightFace.

<hfoptions id="ipadapter-variants">
<hfoption id="IP-Adapter Plus">

```py
import torch
from diffusers import AutoPipelineForText2Image
from transformers import CLIPVisionModelWithProjection

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter-plus_sdxl_vit-h.safetensors"
)
```

</hfoption>
<hfoption id="IP-Adapter FaceID">

```py
import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

pipeline.load_ip_adapter(
  "h94/IP-Adapter-FaceID",
  subfolder=None,
  weight_name="ip-adapter-faceid_sdxl.bin",
  image_encoder_folder=None
)
```

To use an IP-Adapter FaceID Plus model, load the CLIP image encoder with [`~transformers.CLIPVisionModelWithProjection`].

```py
import torch
from diffusers import AutoPipelineForText2Image
from transformers import CLIPVisionModelWithProjection

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

pipeline.load_ip_adapter(
  "h94/IP-Adapter-FaceID",
  subfolder=None,
  weight_name="ip-adapter-faceid-plus_sd15.bin"
)
```

</hfoption>
</hfoptions>

## Image embeddings

The `prepare_ip_adapter_image_embeds` method generates image embeddings you can reuse when running the pipeline multiple times with the same images. Loading and encoding the images on every run is inefficient, so it's faster to precompute the image embeddings, save them to disk, and load them when you need them.

```py
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")

image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

torch.save(image_embeds, "image_embeds.ipadpt")
```

Reload the image embeddings by passing them to the `ip_adapter_image_embeds` parameter. Set `image_encoder_folder` to `None` because the image encoder is no longer needed once the embeddings are precomputed.

> [!TIP]
> You can also load image embeddings from other sources such as ComfyUI.

```py
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  image_encoder_folder=None,
  weight_name="ip-adapter_sdxl.bin"
)
pipeline.set_ip_adapter_scale(0.8)
image_embeds = torch.load("image_embeds.ipadpt")
pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
).images[0]
```

## Masking

Binary masking assigns an IP-Adapter image to a specific area of the output image, which makes it useful for composing multiple IP-Adapter images. Each IP-Adapter image requires its own binary mask.

Load the [`~image_processor.IPAdapterMaskProcessor`] to preprocess the image masks. For the best results, provide the output `height` and `width` to ensure masks with different aspect ratios are appropriately sized. If the input masks already match the aspect ratio of the generated image, you don't need to set the `height` and `width`.

```py
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")

mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=1024, width=1024)
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask1.png" width="200" alt="mask 1"/>
    <figcaption style="text-align: center;">mask 1</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask2.png" width="200" alt="mask 2"/>
    <figcaption style="text-align: center;">mask 2</figcaption>
  </figure>
</div>

Provide both the IP-Adapter images and their scales as a list. Pass the preprocessed masks to `cross_attention_kwargs` in the pipeline.

```py
face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"]
)
pipeline.set_ip_adapter_scale([[0.7, 0.7]])

ip_images = [[face_image1, face_image2]]
# group both masks under the single loaded IP-Adapter: (1, num_masks, height, width)
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]

pipeline(
  prompt="2 girls",
  ip_adapter_image=ip_images,
  negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
  cross_attention_kwargs={"ip_adapter_masks": masks}
).images[0]
```

<div style="display: flex; flex-direction: column; gap: 10px;">
  <div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
    <figure>
      <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl1.png" width="400" alt="IP-Adapter image 1"/>
      <figcaption style="text-align: center;">IP-Adapter image 1</figcaption>
    </figure>
    <figure>
      <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl2.png" width="400" alt="IP-Adapter image 2"/>
      <figcaption style="text-align: center;">IP-Adapter image 2</figcaption>
    </figure>
  </div>
  <div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
    <figure>
      <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_attention_mask_result_seed_0.png" width="400" alt="Generated image with mask"/>
      <figcaption style="text-align: center;">generated with mask</figcaption>
    </figure>
    <figure>
      <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_no_attention_mask_result_seed_0.png" width="400" alt="Generated image without mask"/>
      <figcaption style="text-align: center;">generated without mask</figcaption>
    </figure>
  </div>
</div>

## Applications

The section below covers some popular applications of IP-Adapter.

### Face models

Generating faces and preserving their details can be challenging. To help generate more accurate faces, there are checkpoints specifically conditioned on images of cropped faces. You can find the face models in the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) repository or the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) repository. The FaceID checkpoints use the FaceID embeddings from [InsightFace](https://github.com/deepinsight/insightface) instead of CLIP image embeddings.

We recommend using the [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.

<hfoptions id="usage">
<hfoption id="h94/IP-Adapter">

```py
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image

pipeline = StableDiffusionPipeline.from_pretrained(
  "stable-diffusion-v1-5/stable-diffusion-v1-5",
  torch_dtype=torch.float16,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="models", 
  weight_name="ip-adapter-full-face_sd15.bin"
)

pipeline.set_ip_adapter_scale(0.5)
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png")

pipeline(
    prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant",
    ip_adapter_image=image,
    negative_prompt="lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png" width="400" alt="IP-Adapter image"/>
    <figcaption style="text-align: center;">IP-Adapter image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein.png" width="400" alt="generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

</hfoption>
<hfoption id="h94/IP-Adapter-FaceID">

For FaceID models, extract the face embeddings and pass them as a list of tensors to `ip_adapter_image_embeds`.

```py
# pip install insightface
import cv2
import numpy as np
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image
from insightface.app import FaceAnalysis

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(
  "h94/IP-Adapter-FaceID",
  subfolder=None,
  weight_name="ip-adapter-faceid_sd15.bin",
  image_encoder_folder=None
)
pipeline.set_ip_adapter_scale(0.6)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")

ref_images_embeds = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
).images[0]
```

The IP-Adapter FaceID Plus and Plus v2 models require CLIP image embeddings. Prepare the face embeddings and then extract and pass the CLIP embeddings to the hidden image projection layers.

```py
clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
  [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
# set to True if using IP-Adapter FaceID Plus v2
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False
```

</hfoption>
</hfoptions>

### Multiple IP-Adapters

Combine multiple IP-Adapters to generate images in more diverse styles. For example, you can use IP-Adapter Face to generate consistent faces and characters and IP-Adapter Plus to generate those faces in specific styles.

Load an image encoder with [`~transformers.CLIPVisionModelWithProjection`].

```py
import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from transformers import CLIPVisionModelWithProjection
from diffusers.utils import load_image

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
)
```

Load a base model, a scheduler, and the following IP-Adapters.

- [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
- [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder but it is conditioned on images of cropped faces

```py
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    image_encoder=image_encoder,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"]
)
pipeline.set_ip_adapter_scale([0.7, 0.3])
# enable_model_cpu_offload to reduce memory usage
pipeline.enable_model_cpu_offload()
```

Load a face image and the style images from a folder.

```py
face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png" width="400" alt="Face image"/>
    <figcaption style="text-align: center;">face image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_style_grid.png" width="400" alt="Style images"/>
    <figcaption style="text-align: center;">style images</figcaption>
  </figure>
</div>

Pass style and face images as a list to `ip_adapter_image`.

```py
generator = torch.Generator(device="cpu").manual_seed(0)

pipeline(
    prompt="wonderwoman",
    ip_adapter_image=[style_images, face_image],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
).images[0]
```

<div style="display: flex; justify-content: center;">
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_multi_out.png" width="400" alt="Generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

### Instant generation

[Latent Consistency Models (LCM)](../api/pipelines/latent_consistency_models) can generate images in 4 steps or fewer, unlike other diffusion models that require many more steps, making generation feel "instantaneous". IP-Adapters are compatible with LCM models for near-instant image-prompted generation.

Load the IP-Adapter weights, then load the LCM-LoRA weights with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`].

```py
import torch
from diffusers import DiffusionPipeline, LCMScheduler
from diffusers.utils import load_image

pipeline = DiffusionPipeline.from_pretrained(
  "sd-dreambooth-library/herge-style",
  torch_dtype=torch.float16
)

pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="models",
  weight_name="ip-adapter_sd15.bin"
)
pipeline.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
# enable_model_cpu_offload to reduce memory usage
pipeline.enable_model_cpu_offload()
```

Try using a lower IP-Adapter scale to condition generation more on the style you want to apply, and remember to include the special token `herge_style` in your prompt to trigger it.

```py
pipeline.set_ip_adapter_scale(0.4)

prompt = "herge_style woman in armor, best quality, high quality"

ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
pipeline(
    prompt=prompt,
    ip_adapter_image=ip_adapter_image,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]
```

<div style="display: flex; justify-content: center;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_herge.png" width="400" alt="Generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

### Structural control

For structural control, combine IP-Adapter with [ControlNet](../api/pipelines/controlnet) conditioned on depth maps, edge maps, pose estimations, and more.

The example below loads a [`ControlNetModel`] checkpoint conditioned on depth maps and combines it with an IP-Adapter.

```py
import torch
from diffusers.utils import load_image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
  "lllyasviel/control_v11f1p_sd15_depth",
  torch_dtype=torch.float16
)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="models",
  weight_name="ip-adapter_sd15.bin"
)
```

Pass the depth map and IP-Adapter image to the pipeline.

```py
# load the conditioning images (shown below)
ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")

pipeline(
  prompt="best quality, high quality",
  image=depth_map,
  ip_adapter_image=ip_adapter_image,
  negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
).images[0]
```
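The depth map used above is a precomputed image. If you are starting from a raw depth array produced by a depth estimator instead, it needs to be converted to an image first; a minimal sketch (hypothetical helper, assuming NumPy and Pillow are installed) of normalizing it to 8-bit grayscale:

```python
import numpy as np
from PIL import Image


def to_depth_image(depth):
    """Normalize a raw depth array to an 8-bit grayscale control image."""
    d = np.asarray(depth, dtype=np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # rescale to [0, 1]
    return Image.fromarray((d * 255).astype(np.uint8))


depth_map = to_depth_image(np.random.rand(512, 512))
```

The resulting PIL image can be passed directly as the `image` argument of the ControlNet pipeline.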

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png" width="300" alt="IP-Adapter image"/>
    <figcaption style="text-align: center;">IP-Adapter image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png" width="300" alt="Depth map"/>
    <figcaption style="text-align: center;">depth map</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ipa-controlnet-out.png" width="300" alt="Generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

### Style and layout control

For style and layout control, combine IP-Adapter with [InstantStyle](https://huggingface.co/papers/2404.02733). InstantStyle separates *style* (color, texture, overall feel) and *content* from each other. It only applies the style in style-specific blocks of the model to prevent it from distorting other areas of an image. This generates images with stronger and more consistent styles and better control over the layout.

The IP-Adapter is only activated for specific parts of the model. Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method to scale the IP-Adapter's influence in different layers. The example below activates the IP-Adapter in the second layer of the model's down `block_2` and up `block_0`. Down `block_2` is where the IP-Adapter injects layout information, and up `block_0` is where it injects style.

```py
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)
```
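Each list in the `scale` dictionary has one entry per attention layer in the named block (two in down `block_2` and three in up `block_0` for SDXL, matching the list lengths above). As a sketch of the dictionary's shape, a hypothetical helper (not part of diffusers) that exposes the layout and style strengths as arguments:

```python
def instantstyle_scale(layout=0.0, style=1.0):
    """Build an InstantStyle-like scale dict for an SDXL IP-Adapter.

    Each list entry scales the IP-Adapter in one attention layer of the
    named block; layers not listed default to 0 (adapter disabled there).
    """
    return {
        "down": {"block_2": [0.0, layout]},
        "up": {"block_0": [0.0, style, 0.0]},
    }


scale = instantstyle_scale(layout=1.0, style=1.0)  # style + layout control
```

Passing `layout=0.0` reproduces the style-only configuration shown later in this section.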

Load the style image and generate an image.

```py
style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")

pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" width="400" alt="Style image"/>
    <figcaption style="text-align: center;">style image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png" width="400" alt="Generated image"/>
    <figcaption style="text-align: center;">generated image</figcaption>
  </figure>
</div>

You can also insert the IP-Adapter in all the model layers, but this tends to generate images that focus too heavily on the image prompt and reduces the diversity of generated images. To avoid that, only activate the IP-Adapter in up `block_0`, the style layer.

> [!TIP]
> You don't need to specify all the layers in the `scale` dictionary. Layers not included are set to 0, which means the IP-Adapter is disabled.

```py
scale = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
).images[0]
```

<div style="display: flex; gap: 10px; justify-content: space-around; align-items: flex-end;">
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_only.png" width="400" alt="Generated image (style only)"/>
    <figcaption style="text-align: center;">style-layer generated image</figcaption>
  </figure>
  <figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_ip_adapter.png" width="400" alt="Generated image (IP-Adapter only)"/>
    <figcaption style="text-align: center;">all layers generated image</figcaption>
  </figure>
</div>