Unverified Commit 9ced7844 authored by M. Tolga Cangöz, committed by GitHub

[Docs] Fix typos, improve, update at Conceptual Guides page (#5585)



* Fix typos, improve, update

* Update docs/source/en/conceptual/contribution.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/conceptual/contribution.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/conceptual/philosophy.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update philosophy.md

* Update philosophy.md

* Update docs/source/en/conceptual/philosophy.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/controlling_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/controlling_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Remove e.g.; some Grammarly fixes

* Update docs/source/en/conceptual/philosophy.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update contribution.md

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 9723f8a5
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# 🧨 Diffusers’ Ethical Guidelines

## Preamble

@@ -42,7 +54,7 @@ The team works daily to make the technical and non-technical tools available to

- **Encouraging safety in deployment**
  - [**Safe Stable Diffusion**](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_safe): It mitigates the well-known issue that models like Stable Diffusion, which are trained on unfiltered, web-crawled datasets, tend to suffer from inappropriate degeneration. Related paper: [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://arxiv.org/abs/2211.05105).
  - [**Safety Checker**](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py): It checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated. The harmful concepts are intentionally hidden to prevent reverse engineering of the checker. A usage sketch follows below.
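
For illustration, here is a minimal sketch of how the safety checker surfaces its verdict when generating with a Stable Diffusion pipeline (the checkpoint name is only an example; the flag list aligns one-to-one with the returned images):

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint; any Stable Diffusion checkpoint that ships a safety checker behaves the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

output = pipe("a photograph of an astronaut riding a horse")
# The pipeline runs the safety checker on every generated image;
# flagged images are returned blacked out and reported in this list.
print(output.nsfw_content_detected)  # e.g. [False]
```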

@@ -32,7 +32,7 @@ The methods shown in this document can also be used to evaluate different [noise

We cover Diffusion models with the following pipelines:

- Text-guided image generation (such as the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img)).
- Text-guided image generation, additionally conditioned on an input image (such as the [`StableDiffusionImg2ImgPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/img2img) and [`StableDiffusionInstructPix2PixPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix)).
- Class-conditioned image generation models (such as the [`DiTPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit)).

## Qualitative Evaluation

@@ -87,7 +87,7 @@ import torch

```python
seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(sample_prompts, num_images_per_prompt=1, generator=generator).images
```

![parti-prompts-14](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts-14.png)

@@ -141,7 +141,7 @@ prompts = [

```python
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]

images = sd_pipeline(prompts, num_images_per_prompt=1, output_type="np").images
print(images.shape)
# (6, 512, 512, 3)
```

@@ -155,13 +155,11 @@ from functools import partial

```python
clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    images_int = (images * 255).astype("uint8")
    clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(clip_score), 4)

sd_clip_score = calculate_clip_score(images, prompts)
print(f"CLIP score: {sd_clip_score}")
# CLIP score: 35.7038
```

@@ -176,7 +174,7 @@ fixed seed with the [v1-4 Stable Diffusion checkpoint](https://huggingface.co/Co

```python
seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images
```

Then we load the [v1-5 checkpoint](https://huggingface.co/runwayml/stable-diffusion-v1-5) to generate images:

@@ -185,7 +183,7 @@ Then we load the [v1-5 checkpoint](https://huggingface.co/runwayml/stable-diffus

```python
model_ckpt_1_5 = "runwayml/stable-diffusion-v1-5"
sd_pipeline_1_5 = StableDiffusionPipeline.from_pretrained(model_ckpt_1_5, torch_dtype=weight_dtype).to(device)

images_1_5 = sd_pipeline_1_5(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images
```

And finally, we compare their CLIP scores:

@@ -295,12 +293,11 @@ def edit_image(input_image, instruction):

```python
    image = instruct_pix2pix_pipeline(
        instruction,
        image=input_image,
        output_type="np",
        generator=generator,
    ).images[0]
    return image


input_images = []
original_captions = []
modified_captions = []
```

@@ -417,7 +414,7 @@ It should be noted that the `StableDiffusionInstructPix2PixPipeline` exposes t

We can extend the idea of this metric to measure how similar the original image and edited version are. To do that, we can just do `F.cosine_similarity(img_feat_two, img_feat_one)`. For these kinds of edits, we would still want the primary semantics of the images to be preserved as much as possible, i.e., a high similarity score.
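
As a sketch of that extension (assuming `original_image` and `edited_image` are PIL images produced by the earlier steps; the CLIP checkpoint is only an example):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip_id = "openai/clip-vit-large-patch14"  # example checkpoint
model = CLIPModel.from_pretrained(clip_id)
processor = CLIPProcessor.from_pretrained(clip_id)

def image_features(image):
    # Encode an image with CLIP and L2-normalize the features.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

img_feat_one = image_features(original_image)  # assumed PIL inputs from the steps above
img_feat_two = image_features(edited_image)
print(F.cosine_similarity(img_feat_two, img_feat_one))
```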

We can use these metrics for similar pipelines such as the [`StableDiffusionPix2PixZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix_zero#diffusers.StableDiffusionPix2PixZeroPipeline).

<Tip>

@@ -427,7 +424,7 @@ Both CLIP score and CLIP direction similarity rely on the CLIP model, which can

</Tip>

***Extending metrics like IS, FID (discussed later), or KID can be difficult*** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the [LAION-5B dataset](https://laion.ai/blog/laion-5b/)). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so it is not a good candidate here for feature extraction.

***Using the above metrics helps evaluate models that are class-conditioned. For example, [DiT](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit), which was pre-trained conditioned on the ImageNet-1k classes.***

### Class-conditioned image generation

@@ -452,7 +449,6 @@ def download(url, local_filepath):

```python
        f.write(r.content)
    return local_filepath


dummy_dataset_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/sample-imagenet-images.zip"
local_filepath = download(dummy_dataset_url, dummy_dataset_url.split("/")[-1])
```

@@ -470,7 +466,7 @@ image_paths = sorted([os.path.join(dataset_path, x) for x in os.listdir(dataset_

```python
real_images = [np.array(Image.open(path).convert("RGB")) for path in image_paths]
```

These are 10 images from the following ImageNet-1k classes: "cassette_player", "chain_saw" (x2), "church", "gas_pump" (x3), "parachute" (x2), and "tench".

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/real-images.png" alt="real-images"><br>

@@ -488,7 +484,6 @@ def preprocess_image(image):

```python
    image = image.permute(0, 3, 1, 2) / 255.0
    return F.center_crop(image, (256, 256))


real_images = torch.cat([preprocess_image(image) for image in real_images])
print(real_images.shape)
# torch.Size([10, 3, 256, 256])
```

@@ -517,7 +512,7 @@ words = [

```python
    # ... class names collapsed in this excerpt ...
]

class_ids = dit_pipeline.get_label_ids(words)
output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np")

fake_images = output.images
fake_images = torch.tensor(fake_images)
```
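
The excerpt stops here; as a sketch, the FID between the two image sets could then be computed with `torchmetrics` along these lines (assuming `fake_images` is a float NHWC tensor in [0, 1], as produced above):

```python
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(normalize=True)  # normalize=True expects floats in [0, 1]
fid.update(real_images, real=True)
fid.update(fake_images.permute(0, 3, 1, 2), real=False)  # NHWC -> NCHW
print(f"FID: {float(fid.compute())}")
```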

@@ -22,7 +22,7 @@ In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefor

## Usability over Performance

- While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library.
- Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc.). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages.
- Diffusers prefers simple, self-explainable code over condensed, magic code. This means that short-hand code syntaxes such as lambda functions and advanced PyTorch operators are often not desired.

## Simple over easy

@@ -31,13 +31,13 @@ As PyTorch states, **explicit is better than implicit** and **simple is better t

- We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management.
- Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible.
- Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop (a minimal sketch of such a loop follows after this list). However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers.
- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline.
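
As an illustration of what that unrolled loop looks like, here is a minimal unconditional sketch (the DDPM cat checkpoint is just an example; any compatible `UNet2DModel`/scheduler pair works the same way):

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

# Example checkpoint; substitute any compatible model/scheduler pair.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
scheduler.set_timesteps(50)

sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample  # predict the noise residual
    # The scheduler computes the slightly more denoised sample x_{t-1}
    sample = scheduler.step(noise_pred, t, sample).prev_sample
```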

## Tweakable, contributor-friendly over abstraction

For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself).

In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers.

Functions, long code blocks, and even classes can be copied across multiple files, which at first can look like a bad, sloppy design choice that makes the library unmaintainable.

**However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because:

- Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions.

@@ -47,15 +47,15 @@ Functions, long code blocks, and even classes can be copied across multiple file

At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look at [this blog post](https://huggingface.co/blog/transformers-design-philosophy).

In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond).

Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗.

We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would ❤️ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=).

## Design Philosophy in Details

Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).

Let's walk through more in-detail design decisions for each class.

### Pipelines

@@ -83,14 +83,14 @@ The following design principles are followed:

- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformer_2d.py), etc...
- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
- Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
- Models all inherit from `ModelMixin` and `ConfigMixin`.
- Models can be optimized for performance when it doesn’t demand major code changes, keeps backward compatibility, and gives significant memory or compute gain.
- Models should by default have the highest precision and lowest performance setting.
- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

### Schedulers

@@ -99,10 +99,10 @@ Schedulers are responsible to guide the denoising process for inference as well

The following design principles are followed:

- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers.md); a minimal sketch of such a swap follows after this list.
- Every scheduler has to have a `set_num_inference_steps` and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
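
For instance, a sketch of swapping a pipeline's default scheduler for a different algorithm while reusing its configuration (the checkpoint and scheduler choice are only examples):

```python
from diffusers import DiffusionPipeline, EulerDiscreteScheduler

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Instantiate a different scheduler algorithm from the existing scheduler's config.
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
```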

@@ -18,7 +18,7 @@ Most examples of preserving semantics reduce to being able to accurately map a c

Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. I.e. in general, we would like our outputs to be of good quality, adhere to a particular style, or be realistic.

We will document some of the techniques `diffusers` supports to control generation of diffusion models. Much is cutting-edge research and can be quite nuanced. If something needs clarifying or you have a suggestion, don't hesitate to open a discussion on the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or a [GitHub issue](https://github.com/huggingface/diffusers/issues).

We provide a high-level explanation of how the generation can be controlled as well as a snippet of the technicals. For more in-depth explanations on the technicals, the original papers which are linked from the pipelines are always the best resources.

@@ -26,11 +26,11 @@ Depending on the use case, one should choose a technique accordingly. In many ca

Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.

1. [InstructPix2Pix](#instruct-pix2pix)
2. [Pix2Pix Zero](#pix2pix-zero)
3. [Attend and Excite](#attend-and-excite)
4. [Semantic Guidance](#semantic-guidance-sega)
5. [Self-attention Guidance](#self-attention-guidance-sag)
6. [Depth2Image](#depth2image)
7. [MultiDiffusion Panorama](#multidiffusion-panorama)
8. [DreamBooth](#dreambooth)

@@ -47,11 +47,11 @@ For convenience, we provide a table to denote which methods are inference-only a

| **Method** | **Inference only** | **Requires training /<br> fine-tuning** | **Comments** |
| :---: | :---: | :---: | :---: |
| [InstructPix2Pix](#instruct-pix2pix) | ✅ | ❌ | Can additionally be<br>fine-tuned for better <br>performance on specific <br>edit instructions. |
| [Pix2Pix Zero](#pix2pix-zero) | ✅ | ❌ | |
| [Attend and Excite](#attend-and-excite) | ✅ | ❌ | |
| [Semantic Guidance](#semantic-guidance-sega) | ✅ | ❌ | |
| [Self-attention Guidance](#self-attention-guidance-sag) | ✅ | ❌ | |
| [Depth2Image](#depth2image) | ✅ | ❌ | |
| [MultiDiffusion Panorama](#multidiffusion-panorama) | ✅ | ❌ | |
| [DreamBooth](#dreambooth) | ❌ | ✅ | |

@@ -63,14 +63,12 @@ For convenience, we provide a table to denote which methods are inference-only a

| [DiffEdit](#diffedit) | ✅ | ❌ | |
| [T2I-Adapter](#t2i-adapter) | ✅ | ❌ | |
| [Fabric](#fabric) | ✅ | ❌ | |

## InstructPix2Pix

[Paper](https://arxiv.org/abs/2211.09800)

[InstructPix2Pix](../api/pipelines/pix2pix) is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
InstructPix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.
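
A minimal usage sketch (the checkpoint is the one released by the paper authors; the input image path is a stand-in for your own):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("my_photo.png").convert("RGB")  # placeholder for your own image
# image_guidance_scale trades off faithfulness to the input image vs. the edit instruction.
edited = pipe("turn the sky into a sunset", image=image, image_guidance_scale=1.5).images[0]
```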

## Pix2Pix Zero

@@ -84,7 +82,7 @@ Pix2Pix Zero can be used both to edit synthetic images as well as real images.

- To edit synthetic images, one first generates an image given a caption. Next, we generate image captions for the concept that shall be edited and for the new target concept. We can use a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
- To edit a real image, one first generates an image caption using a model like [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies DDIM inversion on the prompt and image to generate "inverse" latents. Similar to before, "mean" prompt embeddings for both source and target concepts are created and finally the pix2pix-zero algorithm in combination with the "inverse" latents is used to edit the image.

<Tip>

@@ -96,7 +94,13 @@ can edit an image in less than a minute on a consumer GPU as shown [here](../api

</Tip>

As mentioned above, Pix2Pix Zero includes optimizing the latents (and not any of the UNet, VAE, or the text encoder) to steer the generation toward a specific concept. This means that the overall pipeline might require more memory than a standard [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).

<Tip>
An important distinction between methods like InstructPix2Pix and Pix2Pix Zero is that the former
involves fine-tuning the pre-trained weights while the latter does not. This means that you can
apply Pix2Pix Zero to any of the available Stable Diffusion models.
</Tip>

## Attend and Excite

@@ -108,20 +112,16 @@ A set of token indices are given as input, corresponding to the subjects in the

Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
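
A usage sketch (the checkpoint and token indices are illustrative; the indices refer to positions in the tokenized prompt):

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# pipe.get_indices(prompt) helps locate the token indices of the subjects to excite.
image = pipe(prompt, token_indices=[2, 5], guidance_scale=7.5).images[0]
```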

## Semantic Guidance (SEGA)

[Paper](https://arxiv.org/abs/2301.12247)

[SEGA](../api/pipelines/semantic_stable_diffusion) allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. I.e. the smile concept can be used to incrementally increase or decrease the smile of a portrait.

Similar to how classifier free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove their concept depending on if the guidance is applied positively or negatively.

Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
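
A sketch of adding a concept positively (the checkpoint, prompt, and edit parameter values are illustrative; each `edit_*` argument takes one entry per conceptual prompt):

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    "a photo of the face of a woman",
    editing_prompt=["smiling, smile"],   # concept to steer toward
    reverse_editing_direction=[False],   # False adds the concept, True removes it
    edit_guidance_scale=[5.0],
    edit_warmup_steps=[10],
)
image = out.images[0]
```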

## Self-attention Guidance (SAG)

[Paper](https://arxiv.org/abs/2210.00939)

@@ -130,34 +130,20 @@ See [here](../api/pipelines/semantic_stable_diffusion) for more information on h

SAG provides guidance from predictions not conditioned on high-frequency details to fully conditioned images. The high frequency details are extracted out of the UNet self-attention maps.
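
A usage sketch (checkpoint and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale controls the strength of the self-attention guidance; 0 disables it.
image = pipe("a photo of an astronaut riding a horse", sag_scale=0.75).images[0]
```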

## Depth2Image

[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth)

[Depth2Image](../api/pipelines/stable_diffusion/depth2img) is fine-tuned from Stable Diffusion to better preserve semantics for text guided image variation.

It conditions on a monocular depth estimate of the original image.
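
A usage sketch (the checkpoint is the released depth-conditioned model; the image path is a stand-in for your own):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("my_room.png").convert("RGB")  # placeholder for your own image
# strength controls how far the result may deviate from the input image.
image = pipe(prompt="a cozy cabin interior", image=init_image, strength=0.7).images[0]
```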

## MultiDiffusion Panorama

[Paper](https://arxiv.org/abs/2302.08113)

[MultiDiffusion Panorama](../api/pipelines/panorama) defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high quality and diverse images. Results adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
MultiDiffusion Panorama allows generating high-quality images at arbitrary aspect ratios (e.g., panoramas).
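
A usage sketch (checkpoint, scheduler choice, and output width are illustrative):

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide output resolution yields a panorama; overlapping views are fused by MultiDiffusion.
image = pipe("a photo of the dolomites", width=3072).images[0]
```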

## Fine-tuning your own models

@@ -165,44 +151,39 @@ In addition to pre-trained models, Diffusers has training scripts for fine-tunin

## DreamBooth

[Project](https://dreambooth.github.io/)

[DreamBooth](../training/dreambooth) fine-tunes a model to teach it about a new subject. I.e. a few pictures of a person can be used to generate images of that person in different styles.

## Textual Inversion

[Paper](https://arxiv.org/abs/2208.01618)

[Textual Inversion](../training/text_inversion) fine-tunes a model to teach it about a new concept. I.e. a few pictures of a style of artwork can be used to generate images in that style.
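
Once trained, the learned embedding can be loaded into a pipeline; a sketch using a community concept from the Hub (the concept repo is an example, and its placeholder token becomes usable in prompts):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Load a learned concept embedding from the Hub.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipe("a <cat-toy> sitting on a bench").images[0]
```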

## ControlNet

[Paper](https://arxiv.org/abs/2302.05543)

[ControlNet](../api/pipelines/controlnet) is an auxiliary network which adds an extra condition.
There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles, depth maps, and semantic segmentations.
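
A sketch using the canonical Canny-edge ControlNet (the conditioning image path is a stand-in for a pre-computed edge map):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = Image.open("canny_edges.png")  # placeholder: a pre-computed Canny edge map
image = pipe("a futuristic city at night", image=canny_image).images[0]
```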

## Prompt Weighting

[Prompt weighting](../using-diffusers/weighted_prompts) is a simple technique that puts more attention weight on certain parts of the text input.
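
In practice this is typically done by passing pre-computed `prompt_embeds` to the pipeline; a sketch using the third-party `compel` library (its `++` syntax up-weights a token):

```python
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" increases the attention weight on "ball".
prompt_embeds = compel_proc("a cat playing with a ball++ in the forest")
image = pipe(prompt_embeds=prompt_embeds).images[0]
```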

## Custom Diffusion

[Paper](https://arxiv.org/abs/2212.04488)

[Custom Diffusion](../training/custom_diffusion) only fine-tunes the cross-attention maps of a pre-trained text-to-image diffusion model. It also allows for additionally performing Textual Inversion. It supports multi-concept training by design. Like DreamBooth and Textual Inversion, Custom Diffusion is also used to teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the concept(s) of interest.

## Model Editing

[Paper](https://arxiv.org/abs/2303.08084)

@@ -211,8 +192,6 @@ The [text-to-image model editing pipeline](../api/pipelines/model_editing) helps

diffusion model might make about the subjects present in the input prompt. For example, if you prompt Stable Diffusion to generate images for "A pack of roses", the roses in the generated images are more likely to be red. This pipeline helps you change that assumption.
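
A sketch of the corresponding pipeline usage, following the rose example above (the checkpoint is illustrative):

```python
from diffusers import StableDiffusionModelEditingPipeline

pipe = StableDiffusionModelEditingPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Edit the implicit assumption "roses are red" -> "roses are blue".
pipe.edit_model("A pack of roses", "A pack of blue roses")
image = pipe("A field of roses").images[0]
```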

## DiffEdit

[Paper](https://arxiv.org/abs/2210.11427)

@@ -220,8 +199,6 @@ To know more details, check out the [official doc](../api/pipelines/model_editin

[DiffEdit](../api/pipelines/diffedit) allows for semantic editing of input images along with input prompts while preserving the original input images as much as possible.

## T2I-Adapter

[Paper](https://arxiv.org/abs/2302.08453)

@@ -230,15 +207,11 @@ To know more details, check out the [official doc](../api/pipelines/diffedit).

There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch, depth maps, and semantic segmentations.
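
A sketch using the Canny adapter (the conditioning image path is a stand-in for a pre-computed edge map):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("canny_edges.png")  # placeholder: a pre-computed edge map
image = pipe("a red sports car", image=edge_map).images[0]
```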

## Fabric

[Paper](https://arxiv.org/abs/2307.10159)

[Fabric](https://github.com/huggingface/diffusers/tree/442017ccc877279bcf24fbe92f92d3d0def191b6/examples/community#stable-diffusion-fabric-pipeline) is a training-free approach applicable to a wide range of popular diffusion models, which exploits the self-attention layer present in the most widely used architectures to condition the diffusion process on a set of feedback images.