Unverified Commit ecbe27a0 authored by M. Tolga Cangöz, committed by GitHub

[`Docs`] Fix typos and update files at API's Pipelines page 2 (#5748)



* Fix typos, update, add Copyright info, and trim trailing whitespace

* Update docs/source/en/api/pipelines/text_to_video_zero.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* 1 second is not a long video, but 6 seconds is

* Update text_to_video_zero.md

* Update text_to_video_zero.md

* Update text_to_video_zero.md

* Update wuerstchen.md

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 3ad4207d
...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
The abstract from the paper is:
*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). ❤️
...@@ -27,16 +27,13 @@ Therefore, it is better to call this pipeline when running on multiple GPUs. Oth
sampling may be even slower than sequential sampling.
The two parameters to play with are `parallel` (batch size) and `tolerance`.
- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 (for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size may not fit in memory, and a lower batch size gives less parallelism.
- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`.
For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`]
by setting `parallel=80` and `tolerance=0.1`.
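The snippet below is a rough sketch of how these knobs are typically wired up; it is not part of this diff, and the checkpoint name, the `batch_per_device` value, and the `DataParallel` wrapping of the UNet are illustrative assumptions.
```py
import torch
from diffusers import DDPMParallelScheduler, StableDiffusionParadigmsPipeline

# a parallel-friendly scheduler is needed so batches of denoising steps can be refined together
scheduler = DDPMParallelScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler", timestep_spacing="trailing"
)
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# spread the batched UNet forward passes over all visible GPUs
ngpu, batch_per_device = torch.cuda.device_count(), 10
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(ngpu)))

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, parallel=ngpu * batch_per_device, tolerance=0.1, num_inference_steps=1000).images[0]
```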
🤗 Diffusers offers [distributed inference support](../../training/distributed_inference) for generating multiple prompts
in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs.
<Tip>
......
...@@ -29,12 +29,11 @@ you wanted to translate from "cat" to "dog". In this case, the edit direction wi
this in the pipeline, you simply have to set the embeddings related to the phrases including "cat" to
`source_embeds` and "dog" to `target_embeds`. Refer to the code example below for more details.
* When you're using this pipeline from a prompt, specify the _source_ concept in the prompt. Taking
the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gogh".
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
  * Swap the `source_embeds` and `target_embeds`.
  * Change the input prompt to include "dog".
* To learn more about how the source and target embeddings are generated, refer to the [original paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings.
* Note that the quality of the outputs generated with this pipeline is dependent on how good the `source_embeds` and `target_embeds` are. Please refer to [this discussion](#generating-source-and-target-embeddings) for some suggestions on the topic.
## Available Pipelines:
...@@ -79,21 +78,20 @@ for url in [src_embs_url, target_embs_url]:
src_embeds = torch.load(src_embs_url.split("/")[-1])
target_embeds = torch.load(target_embs_url.split("/")[-1])

image = pipeline(
    prompt,
    source_embeds=src_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images[0]
image
```
### Based on an input image
When the pipeline is conditioned on an input image, we first obtain an inverted
noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then the inverted noise is used to start the generation process.
First, let's load our pipeline:
...@@ -122,12 +120,12 @@ pipeline.enable_model_cpu_offload()
Then, we load an input image for conditioning and obtain a suitable caption for it:
```py
from diffusers.utils import load_image

img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
raw_image = load_image(img_url).resize((512, 512))

caption = pipeline.generate_caption(raw_image)
caption
```
Then we employ the generated caption and the input image to get the inverted noise:
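The actual inversion call sits in a collapsed hunk of this diff; as a rough sketch, assuming the pipeline's `invert` method returns an object whose `latents` attribute holds the inverted noise, it looks like:
```py
# sketch of the inversion step (collapsed in this diff)
inv_latents = pipeline.invert(caption, image=raw_image).latents
```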
...@@ -159,7 +157,7 @@ image = pipeline(
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
image
```
## Generating source and target embeddings
...@@ -214,6 +212,7 @@ And then we just call it to generate our captions:
```py
source_captions = generate_captions(source_text)
target_captions = generate_captions(target_concept)
print(source_captions, target_captions, sep='\n')
```
We encourage you to play around with the different parameters supported by the
...@@ -268,16 +267,22 @@ from diffusers import DDIMScheduler
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
image = pipeline(
    prompt,
    source_embeds=source_embeddings,
    target_embeds=target_embeddings,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images[0]
image
```
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionPix2PixZeroPipeline
[[autodoc]] StableDiffusionPix2PixZeroPipeline
- __call__
......
...@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# PixArt
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png)
...@@ -24,13 +24,20 @@ You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github
Some notes about this pipeline:
* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit).
* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
* It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py).
* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.
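For orientation, a minimal text-to-image call might look like the sketch below; the checkpoint id is an assumption and not part of this diff, so swap in the PixArt-alpha checkpoint you actually want to use.
```py
import torch
from diffusers import PixArtAlphaPipeline

# assumed checkpoint id, shown only for illustration
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16).to("cuda")
image = pipe("A small cactus with a happy face in the Sahara desert").images[0]
```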
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## PixArtAlphaPipeline
[[autodoc]] PixArtAlphaPipeline
- all
- __call__
\ No newline at end of file
...@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# PNDM
[Pseudo Numerical Methods for Diffusion Models on Manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
The abstract from the paper is:
......
...@@ -12,12 +12,12 @@ specific language governing permissions and limitations under the License.
# Semantic Guidance
Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition.
The abstract from the paper is:
*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.*
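As a rough illustration of what this semantic steering looks like in code (a sketch assuming the `SemanticStableDiffusionPipeline` and its `editing_prompt`/`reverse_editing_direction` call arguments, which are not shown in this diff):
```py
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# steer the image along a semantic direction while keeping the overall composition
image = pipe(
    "a photo of a castle in the mountains",
    editing_prompt=["snowy winter"],
    reverse_editing_direction=[False],  # set True to push away from the concept instead
).images[0]
```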
<Tip>
......
...@@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License.
# Shap-E
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai).
The abstract from the paper is:
......
...@@ -20,7 +20,7 @@ Using the pretrained models we can provide control images (for example, a depth
The abstract of the paper is the following:
*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*
This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ .
...@@ -33,7 +33,7 @@ This model was contributed by the community contributor [HimariO](https://github
## Usage example with the base model of StableDiffusion-1.4/1.5
In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
All adapters use the same pipeline.
1. Images are first converted into the appropriate *control image* format.
...@@ -42,7 +42,7 @@ All adapters use the same pipeline.
Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).
```python
from diffusers.utils import load_image, make_image_grid

image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
```
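The intermediate steps (shrinking the reference image into an 8x8 color palette and building the adapter pipeline) sit in collapsed hunks of this diff. A rough sketch of what they typically look like follows; the resampling choice and dtype here are assumptions rather than part of the diff.
```python
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

# shrink to an 8x8 palette, then upscale with nearest-neighbor so the color blocks stay crisp
color_palette = image.resize((8, 8)).resize((512, 512), resample=Image.Resampling.NEAREST)

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", adapter=adapter, torch_dtype=torch.float16
).to("cuda")
```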
...@@ -83,20 +83,21 @@ Finally, pass the prompt and control image to the pipeline
```py
# fix the random seed, so you will get the same result as the example
generator = torch.Generator("cuda").manual_seed(7)

out_image = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
    generator=generator,
).images[0]
make_image_grid([image, color_palette, out_image], rows=1, cols=3)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)
## Usage example with the base model of StableDiffusion-XL
In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
All adapters use the same pipeline.
1. Images are first downloaded into the appropriate *control image* format.
...@@ -105,7 +106,7 @@ All adapters use the same pipeline.
Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
```python
from diffusers.utils import load_image, make_image_grid

sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
```
...@@ -121,10 +122,9 @@ from diffusers import (
    StableDiffusionXLAdapterPipeline,
    DDPMScheduler
)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
...@@ -147,6 +147,7 @@ sketch_image_out = pipe(
    generator=generator,
    guidance_scale=7.5
).images[0]
make_image_grid([sketch_image, sketch_image_out], rows=1, cols=2)
```
![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)
...@@ -159,7 +160,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | An image with 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
|[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"/></a>|
|[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"/></a>|
|[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"/></a>|
...@@ -181,9 +182,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.
```py
from diffusers.utils import load_image, make_image_grid

cond_keypose = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
...@@ -191,7 +190,7 @@ cond_keypose = load_image(
cond_depth = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
)
cond = [cond_keypose, cond_depth]

prompt = ["A man walking in an office room with a nice view"]
```
...@@ -207,7 +206,8 @@ The two control images look as such:
`adapter_conditioning_scale` balances the relative influence of the different adapters.
```py
import torch
from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter

adapters = MultiAdapter(
    [
...@@ -221,18 +221,19 @@ pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    adapter=adapters,
).to("cuda")

image = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images[0]
make_image_grid([cond_keypose, cond_depth, image], rows=1, cols=3)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_depth_sample_output.png)
## T2I-Adapter vs ControlNet
T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet).
T2I-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process.
However, T2I-Adapter performs slightly worse than ControlNet.
## StableDiffusionAdapterPipeline
......
...@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# Depth-to-image
The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.
<Tip>
......
...@@ -34,7 +34,7 @@ The table below summarizes the available Stable Diffusion pipelines, their suppo
Supported tasks
</th>
<th class="px-4 py-2 font-medium text-gray-900 text-left">
🤗 Space
</th>
</tr>
</thead>
......
...@@ -19,7 +19,7 @@ These models are trained on an aesthetic subset of the [LAION-5B dataset](https:
For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release).
The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out its API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as little as 20 steps.
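A minimal sketch of swapping in that scheduler (assuming a loaded `pipe` and a `prompt`, as in the snippets below):
```py
from diffusers import DPMSolverMultistepScheduler

# replace the default scheduler while keeping the rest of the pipeline unchanged
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe(prompt, num_inference_steps=20).images[0]
```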
Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image:
...@@ -55,30 +55,21 @@ pipe = pipe.to("cuda")
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image
```
## Inpainting
```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import load_image, make_image_grid

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
...@@ -88,17 +79,14 @@ pipe = pipe.to("cuda")
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
```
## Super-resolution
```py
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image, make_image_grid
import torch

# load model and scheduler
...@@ -108,22 +96,19 @@ pipeline = pipeline.to("cuda")
# let's download an image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
low_res_img = load_image(url)
low_res_img = low_res_img.resize((128, 128))
prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2)
```
## Depth-to-image
```py
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image, make_image_grid

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
...@@ -132,8 +117,9 @@ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = load_image(url)
prompt = "two tigers"
negative_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
```
...@@ -23,7 +23,7 @@ The abstract from the paper is:
- Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers:
    - set `use_karras_sigmas=True` or `lu_lambdas=True` to improve image quality
    - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE); a configuration sketch is shown after this list
- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't work well for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders.
- SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters.
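A minimal sketch of applying those scheduler settings; `pipeline_xl` is an assumed, already-loaded SDXL pipeline, and the exact flags come from the PR linked above.
```py
from diffusers import DPMSolverMultistepScheduler

# reconfigure the existing DPM++ scheduler with the recommended stability settings
pipeline_xl.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline_xl.scheduler.config,
    use_karras_sigmas=True,  # or the Lu lambdas option mentioned above
    euler_at_final=True,     # for uniform-step solvers such as DPM++2M / DPM++2M SDE
)
```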
......
...@@ -22,12 +22,10 @@ The abstract from the paper is:
## Tips
Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`).
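As a quick, hedged illustration (assuming an already-loaded Stable unCLIP pipeline named `pipe` that accepts `noise_level` as a call argument):
```python
# more noise on the image embeddings -> more variation between samples
image = pipe(prompt="a photo of a fantasy landscape", noise_level=200).images[0]
```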
### Text-to-Image Generation
Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha):
```python
import torch
...@@ -60,8 +58,8 @@ pipe = StableUnCLIPPipeline.from_pretrained(
pipe = pipe.to("cuda")
wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"
image = pipe(prompt=wave_prompt).images[0]
image
```
<Tip warning={true}>
...@@ -93,9 +91,16 @@ Optionally, you can also pass a prompt to `pipe` such as:
```python
prompt = "A fantasy landscape, trending on artstation"
image = pipe(init_image, prompt=prompt).images[0]
image
```
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableUnCLIPPipeline
[[autodoc]] StableUnCLIPPipeline
...@@ -108,7 +113,6 @@ images[0].save("variation_image_two.png")
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## StableUnCLIPImg2ImgPipeline
[[autodoc]] StableUnCLIPImg2ImgPipeline
......
...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
The abstract from the paper:
*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.*
<Tip>
......