| [Instruct Pix2Pix](#instruct-pix2pix) | ✅ | ❌ | Can additionally be<br>fine-tuned for better <br>performance on specific <br>edit instructions. |
| [Pix2Pix Zero](#pix2pixzero) | ✅ | ❌ | |
| [Attend and Excite](#attend-and-excite) | ✅ | ❌ | |
...
@@ -79,8 +79,9 @@ See [here](../api/pipelines/stable_diffusion/pix2pix) for more information on ho
The denoising process is guided from one conceptual embedding towards another. During denoising, the intermediate latents are optimized to push the attention maps towards reference attention maps obtained from the denoising process of the input image, which encourages semantic preservation.
Pix2Pix Zero can be used to edit both synthetic and real images.
- To edit synthetic images, one first generates an image given a caption.
Next, captions are generated for the concept that shall be edited and for the new target concept; a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) can be used for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image, as sketched in the example after this list.
- To edit a real image, one first generates an image caption using a model like [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies DDIM inversion on the prompt and image to generate "inverse" latents. As before, "mean" prompt embeddings for both the source and target concepts are created, and finally the pix2pix-zero algorithm, in combination with the "inverse" latents, is used to edit the image.
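To make the synthetic-image workflow concrete, here is a minimal sketch assuming the `StableDiffusionPix2PixZeroPipeline` in Diffusers. The checkpoint name, the example captions, and the simple averaging helper are illustrative assumptions rather than the official recipe, so treat it as a starting point and consult the pipeline documentation for the full example.

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPix2PixZeroPipeline

pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)


@torch.no_grad()
def mean_prompt_embeds(captions):
    # Encode a handful of captions with the pipeline's text encoder and
    # average them into a single "mean" concept embedding.
    input_ids = pipe.tokenizer(
        captions,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    return pipe.text_encoder(input_ids).last_hidden_state.mean(dim=0, keepdim=True)


# Captions for the source and target concepts (these could come from Flan-T5).
source_embeds = mean_prompt_embeds(["a photo of a cat", "a cat sitting on a bench"])
target_embeds = mean_prompt_embeds(["a photo of a dog", "a dog sitting on a bench"])

# Generate the synthetic image and edit it from "cat" towards "dog".
image = pipe(
    "a photo of a cat sitting on a bench",
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
).images[0]
image.save("cat_to_dog.png")
```

For real images, the corresponding "inverse" latents from DDIM inversion would be passed in as well.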
<Tip>
...
@@ -176,11 +177,12 @@ See [here](../training/text_inversion) for more information on how to use it.
[Paper](https://arxiv.org/abs/2302.05543)
[ControlNet](../api/pipelines/controlnet) is an auxiliary network which adds an extra conditioning input to a pre-trained text-to-image diffusion model.
There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles,
depth maps, and semantic segmentation maps.
See [here](../api/pipelines/controlnet) for more information on how to use it.
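As a rough illustration, a canny-edge ControlNet can be attached to a Stable Diffusion checkpoint along these lines. This is a minimal sketch: the checkpoint names are common community ones, and the edge map is assumed to have been computed beforehand (the placeholder path is not a real file).

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a pre-computed Canny edge map (placeholder path; any edge image works).
canny_image = load_image("./canny_edge_map.png")

# The ControlNet is loaded separately and plugged into the Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map fixes the layout while the text prompt controls content and style.
image = pipe(
    "a futuristic city at night, highly detailed",
    image=canny_image,
    num_inference_steps=20,
).images[0]
image.save("controlnet_canny.png")
```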
## Prompt Weighting
...
@@ -217,6 +219,7 @@ To know more details, check out the [official doc](../api/pipelines/stable_diffu
input prompts while preserving the original input images as much as possible.
To know more details, check out the [official doc](../api/pipelines/stable_diffusion/model_editing).
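As a minimal sketch of how the model editing pipeline linked above can be used, assuming the `StableDiffusionModelEditingPipeline` API in Diffusers (the checkpoint and the prompt pair are illustrative):

```python
from diffusers import StableDiffusionModelEditingPipeline

pipe = StableDiffusionModelEditingPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4"
).to("cuda")

# Edit the model's implicit assumption about roses: after the edit,
# prompts mentioning roses should tend to produce blue roses.
pipe.edit_model("A pack of roses", "A pack of blue roses")

image = pipe("A field of roses").images[0]
image.save("blue_roses.png")
```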