<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Controlled generation

Controlling outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in inputs, both images and text prompts, can drastically change outputs. In an ideal world, we want to be able to control how semantics are preserved and changed.

Most examples of preserving semantics reduce to being able to accurately map a change in input to a change in output. For example, adding an adjective to a subject in a prompt should preserve the entire image, modifying only the changed subject. Or, image variation of a particular subject should preserve the subject's pose.

Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. In general, we would like our outputs to be of good quality, adhere to a particular style, or be realistic.

We will document some of the techniques `diffusers` supports to control the generation of diffusion models. Much of it is cutting-edge research and can be quite nuanced. If something needs clarifying or you have a suggestion, don't hesitate to open a discussion on the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or a [GitHub issue](https://github.com/huggingface/diffusers/issues).

We provide a high-level explanation of how generation can be controlled as well as a snippet of the technical details. For more in-depth explanations of the technical details, the original papers linked from the pipelines are always the best resources.

Depending on the use case, one should choose a technique accordingly. In many cases, these techniques can be combined. For example, one can combine Textual Inversion with SEGA to provide more semantic guidance to the outputs generated using Textual Inversion.

Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.

1. [InstructPix2Pix](#instruct-pix2pix)
2. [Pix2Pix Zero](#pix2pix-zero)
3. [Attend and Excite](#attend-and-excite)
4. [Semantic Guidance](#semantic-guidance-sega)
5. [Self-attention Guidance](#self-attention-guidance-sag)
6. [Depth2Image](#depth2image)
7. [MultiDiffusion Panorama](#multidiffusion-panorama)
8. [DreamBooth](#dreambooth)
9. [Textual Inversion](#textual-inversion)
10. [ControlNet](#controlnet)
11. [Prompt Weighting](#prompt-weighting)
12. [Custom Diffusion](#custom-diffusion)
13. [Model Editing](#model-editing)
14. [DiffEdit](#diffedit)
15. [T2I-Adapter](#t2i-adapter)
16. [FABRIC](#fabric)

For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.

|                     **Method**                      | **Inference only** | **Requires training /<br> fine-tuning** |                                          **Comments**                                           |
| :-------------------------------------------------: | :----------------: | :-------------------------------------: | :---------------------------------------------------------------------------------------------: |
|        [InstructPix2Pix](#instruct-pix2pix)        |         ✅         |                   ❌                    | Can additionally be<br>fine-tuned for better <br>performance on specific <br>edit instructions. |
|            [Pix2Pix Zero](#pix2pix-zero)            |         ✅         |                   ❌                    |                                                                                                 |
|       [Attend and Excite](#attend-and-excite)       |         ✅         |                   ❌                    |                                                                                                 |
|       [Semantic Guidance](#semantic-guidance-sega)       |         ✅         |                   ❌                    |                                                                                                 |
| [Self-attention Guidance](#self-attention-guidance-sag) |         ✅         |                   ❌                    |                                                                                                 |
|             [Depth2Image](#depth2image)             |         ✅         |                   ❌                    |                                                                                                 |
| [MultiDiffusion Panorama](#multidiffusion-panorama) |         ✅         |                   ❌                    |                                                                                                 |
|              [DreamBooth](#dreambooth)              |         ❌         |                   ✅                    |                                                                                                 |
|       [Textual Inversion](#textual-inversion)       |         ❌         |                   ✅                    |                                                                                                 |
|              [ControlNet](#controlnet)              |         ✅         |                   ❌                    |             A ControlNet can be <br>trained/fine-tuned on<br>a custom conditioning.             |
|        [Prompt Weighting](#prompt-weighting)        |         ✅         |                   ❌                    |                                                                                                 |
|        [Custom Diffusion](#custom-diffusion)        |         ❌         |                   ✅                    |                                                                                                 |
|           [Model Editing](#model-editing)           |         ✅         |                   ❌                    |                                                                                                 |
|                [DiffEdit](#diffedit)                |         ✅         |                   ❌                    |                                                                                                 |
|             [T2I-Adapter](#t2i-adapter)             |         ✅         |                   ❌                    |                                                                                                 |
|                [Fabric](#fabric)                    |         ✅         |                   ❌                    |                                                                                                 |
## InstructPix2Pix

[Paper](https://huggingface.co/papers/2211.09800)

[InstructPix2Pix](../api/pipelines/pix2pix) is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
InstructPix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.
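
As a minimal sketch of how this might look with the `StableDiffusionInstructPix2PixPipeline` (the image path below is a placeholder; swap in your own image):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Placeholder path; any RGB image works as the editing target.
image = load_image("path/to/input.png")

# The prompt is an edit instruction, not a description of the desired output.
edited = pipe(
    "turn the sky into a dramatic sunset",
    image=image,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
).images[0]
```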

## Attend and Excite

[Paper](https://huggingface.co/papers/2301.13826)

[Attend and Excite](../api/pipelines/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.

A set of token indices is given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is guaranteed to reach a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.

Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
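
For illustration, a minimal sketch using the `StableDiffusionAttendAndExcitePipeline` (the token indices below assume the shown prompt; `pipe.get_indices` can help map words to indices):

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# print(pipe.get_indices(prompt)) shows which index each token receives.
image = pipe(
    prompt,
    token_indices=[2, 5],  # indices of "cat" and "frog" in the tokenized prompt
    guidance_scale=7.5,
).images[0]
```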

## Semantic Guidance (SEGA)

[Paper](https://huggingface.co/papers/2301.12247)

[SEGA](../api/pipelines/semantic_stable_diffusion) allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. For example, the smile concept can be used to incrementally increase or decrease the smile of a portrait.

Similar to how classifier-free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove its concept depending on whether the guidance is applied positively or negatively.

Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
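
As a rough sketch, the `SemanticStableDiffusionPipeline` exposes this through editing prompts and per-concept guidance parameters (the model choice and values below are illustrative):

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    editing_prompt=["smiling, smile"],    # concept to guide towards
    reverse_editing_direction=[False],    # False applies the concept, True removes it
    edit_guidance_scale=[4.0],            # strength of the concept
    edit_warmup_steps=[10],               # steps before the edit guidance kicks in
)
image = out.images[0]
```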

## Self-attention Guidance (SAG)

[Paper](https://huggingface.co/papers/2210.00939)

[Self-attention Guidance](../api/pipelines/self_attention_guidance) improves the general quality of images.

SAG provides guidance from predictions not conditioned on high-frequency details to fully conditioned images. The high-frequency details are extracted from the UNet self-attention maps.
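
A minimal sketch with the `StableDiffusionSAGPipeline`; `sag_scale` controls how strongly the self-attention guidance is applied (the model and value below are illustrative):

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale=0.0 disables SAG and falls back to plain classifier-free guidance.
image = pipe("a photo of an astronaut riding a horse", sag_scale=0.75).images[0]
```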

## Depth2Image

[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth)

[Depth2Image](../api/pipelines/stable_diffusion/depth2img) is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.

It conditions on a monocular depth estimate of the original image.
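
A minimal sketch with the `StableDiffusionDepth2ImgPipeline` (the image path is a placeholder; the depth map is estimated internally when none is passed):

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("path/to/input.png")  # placeholder path

# A monocular depth estimate of init_image is computed and used as conditioning.
image = pipe(prompt="two tigers", image=init_image, strength=0.7).images[0]
```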

## MultiDiffusion Panorama

[Paper](https://huggingface.co/papers/2302.08113)

[MultiDiffusion Panorama](../api/pipelines/panorama) defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high-quality and diverse images. Results adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
MultiDiffusion Panorama allows generating high-quality images at arbitrary aspect ratios (e.g., panoramas).
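
For illustration, a minimal sketch with the `StableDiffusionPanoramaPipeline` (the model and resolution below are illustrative):

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide output resolution yields a panorama stitched from overlapping diffusion windows.
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
```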

## Fine-tuning your own models

In addition to pre-trained models, Diffusers has training scripts for fine-tuning models on user-provided data.

## DreamBooth

[Project](https://dreambooth.github.io/)

[DreamBooth](../training/dreambooth) fine-tunes a model to teach it about a new subject. For example, a few pictures of a person can be used to generate images of that person in different styles.
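
Once a DreamBooth run has finished, the resulting checkpoint is loaded like any other pipeline. A minimal sketch, assuming a local output directory and the commonly used `sks` identifier token (both placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder path to a checkpoint saved by the DreamBooth training script.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-output", torch_dtype=torch.float16
).to("cuda")

# "sks" stands in for whatever rare identifier token was used during training.
image = pipe("a photo of sks person riding a bicycle").images[0]
```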

## Textual Inversion

[Paper](https://huggingface.co/papers/2208.01618)

[Textual Inversion](../training/text_inversion) fine-tunes a model to teach it about a new concept. For example, a few pictures of a style of artwork can be used to generate images in that style.
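
Learned embeddings can be loaded into an existing pipeline at inference time. A minimal sketch, assuming the publicly shared `sd-concepts-library/cat-toy` embedding (swap in your own):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Loads the learned embedding and registers its placeholder token (<cat-toy> here).
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipe("a <cat-toy> sitting on a beach").images[0]
```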

## ControlNet

[Paper](https://huggingface.co/papers/2302.05543)

[ControlNet](../api/pipelines/controlnet) is an auxiliary network which adds an extra condition.
There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles,
depth maps, and semantic segmentations.
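
A minimal sketch, assuming a Canny-edge ControlNet and a precomputed edge map (the image path is a placeholder):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Placeholder path; the conditioning image must match the ControlNet's training signal (here: Canny edges).
canny_image = load_image("path/to/canny_edges.png")
image = pipe("a futuristic city at night", image=canny_image).images[0]
```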

## Prompt Weighting

[Prompt weighting](../using-diffusers/weighted_prompts) is a simple technique that puts more attention weight on certain parts of the text
input.
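
One way to do this in practice is with the third-party `compel` library (`pip install compel`, an assumption here rather than part of Diffusers itself), which turns a weighted prompt string into embeddings the pipeline accepts. A rough sketch:

```python
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" upweights the preceding word ("ball") relative to the rest of the prompt.
prompt_embeds = compel("a cat playing with a ball++ in the forest")
image = pipe(prompt_embeds=prompt_embeds).images[0]
```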

## Custom Diffusion

[Paper](https://huggingface.co/papers/2212.04488)

[Custom Diffusion](../training/custom_diffusion) only fine-tunes the cross-attention maps of a pre-trained
text-to-image diffusion model. It also allows for additionally performing Textual Inversion. It supports
multi-concept training by design. Like DreamBooth and Textual Inversion, Custom Diffusion is also used to
teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the
concept(s) of interest.
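
After training, the learned cross-attention weights and the new token embedding are loaded into a regular pipeline. A rough sketch, assuming the default file names written by the Custom Diffusion training script; the path, token, and file names below are placeholders and may differ in your setup:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Placeholder output directory of the Custom Diffusion training script.
pipe.unet.load_attn_procs(
    "path/to/custom-diffusion-output", weight_name="pytorch_custom_diffusion_weights.bin"
)
pipe.load_textual_inversion("path/to/custom-diffusion-output", weight_name="<new1>.bin")

image = pipe("a <new1> cat swimming in a pool").images[0]
```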

## DiffEdit

[Paper](https://huggingface.co/papers/2210.11427)

[DiffEdit](../api/pipelines/diffedit) allows semantic editing of input images guided by input prompts, while preserving the original input images as much as possible.
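
A rough sketch of the three stages (mask generation, inversion, and masked denoising) with the `StableDiffusionDiffEditPipeline`; the image path and prompts are placeholders:

```python
import torch
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

init_image = load_image("path/to/fruit_bowl.png")  # placeholder path
source_prompt = "a bowl of fruits"
target_prompt = "a basket of pears"

# 1. Contrast source vs. target predictions to infer which region to edit.
mask = pipe.generate_mask(image=init_image, source_prompt=source_prompt, target_prompt=target_prompt)
# 2. Invert the input image into latents so the unedited region can be reconstructed.
inv_latents = pipe.invert(prompt=source_prompt, image=init_image).latents
# 3. Denoise with the target prompt, editing only inside the mask.
image = pipe(prompt=target_prompt, mask_image=mask, image_latents=inv_latents).images[0]
```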

## T2I-Adapter

[Paper](https://huggingface.co/papers/2302.08453)

[T2I-Adapter](../api/pipelines/stable_diffusion/adapter) is an auxiliary network which adds an extra condition.
There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch,
depth maps, and semantic segmentations.
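
A minimal sketch, assuming a depth-conditioned adapter and a precomputed depth map (the image path is a placeholder):

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_depth_sd15v2", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# Placeholder path; the conditioning image must match the adapter's training signal (here: a depth map).
depth_map = load_image("path/to/depth_map.png")
image = pipe("a modern living room", image=depth_map).images[0]
```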

## Fabric

[Paper](https://huggingface.co/papers/2307.10159)
Shauray Singh's avatar
Shauray Singh committed
173

[Fabric](https://github.com/huggingface/diffusers/tree/442017ccc877279bcf24fbe92f92d3d0def191b6/examples/community#stable-diffusion-fabric-pipeline) is a training-free
approach applicable to a wide range of popular diffusion models, which exploits
the self-attention layer present in the most widely used architectures to condition
the diffusion process on a set of feedback images.