Unverified Commit ecbe27a0 authored by M. Tolga Cangöz, committed by GitHub

[`Docs`] Fix typos and update files at API's Pipelines page 2 (#5748)



* Fix typos, update, add Copyright info, and trim trailing whitespace

* Update docs/source/en/api/pipelines/text_to_video_zero.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* 1 second is not a long video, but 6 seconds is

* Update text_to_video_zero.md

* Update text_to_video_zero.md

* Update text_to_video_zero.md

* Update wuerstchen.md

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 3ad4207d
...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
The abstract from the paper is:
*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). ❤️
...@@ -27,16 +27,13 @@ Therefore, it is better to call this pipeline when running on multiple GPUs. Oth
sampling may be even slower than sequential sampling.
The two parameters to play with are `parallel` (batch size) and `tolerance`.
- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 (for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size may not fit in memory, and a lower batch size gives less parallelism.
- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`.
For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`]
by setting `parallel=80` and `tolerance=0.1`.
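The snippet below is a rough sketch of how these knobs are typically wired up; it is not part of this diff, and the checkpoint name, the `batch_per_device` value, and the `DataParallel` wrapping of the UNet are illustrative assumptions.
```py
import torch
from diffusers import DDPMParallelScheduler, StableDiffusionParadigmsPipeline

# a parallel-friendly scheduler is needed so batches of denoising steps can be refined together
scheduler = DDPMParallelScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler", timestep_spacing="trailing"
)
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# spread the batched UNet forward passes over all visible GPUs
ngpu, batch_per_device = torch.cuda.device_count(), 10
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(ngpu)))

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, parallel=ngpu * batch_per_device, tolerance=0.1, num_inference_steps=1000).images[0]
```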
🤗 Diffusers offers [distributed inference support](../../training/distributed_inference) for generating multiple prompts
in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs.
<Tip>
......
...@@ -29,12 +29,11 @@ you wanted to translate from "cat" to "dog". In this case, the edit direction wi
this in the pipeline, you simply have to set the embeddings related to the phrases including "cat" to
`source_embeds` and "dog" to `target_embeds`. Refer to the code example below for more details.
* When you're using this pipeline from a prompt, specify the _source_ concept in the prompt. Taking
the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gogh".
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
  * Swap the `source_embeds` and `target_embeds`.
  * Change the input prompt to include "dog".
* To learn more about how the source and target embeddings are generated, refer to the [original paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings.
* Note that the quality of the outputs generated with this pipeline is dependent on how good the `source_embeds` and `target_embeds` are. Please refer to [this discussion](#generating-source-and-target-embeddings) for some suggestions on the topic.
## Available Pipelines:
...@@ -79,21 +78,20 @@ for url in [src_embs_url, target_embs_url]:
src_embeds = torch.load(src_embs_url.split("/")[-1])
target_embeds = torch.load(target_embs_url.split("/")[-1])

image = pipeline(
    prompt,
    source_embeds=src_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images[0]
image
```
### Based on an input image
When the pipeline is conditioned on an input image, we first obtain an inverted
noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then the inverted noise is used to start the generation process.
First, let's load our pipeline:
...@@ -122,12 +120,12 @@ pipeline.enable_model_cpu_offload()
Then, we load an input image for conditioning and obtain a suitable caption for it:
```py
from diffusers.utils import load_image

img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
raw_image = load_image(img_url).resize((512, 512))

caption = pipeline.generate_caption(raw_image)
caption
```
Then we employ the generated caption and the input image to get the inverted noise:
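The actual inversion call sits in a collapsed hunk of this diff; as a rough sketch, assuming the pipeline's `invert` method returns an object whose `latents` attribute holds the inverted noise, it looks like:
```py
# sketch of the inversion step (collapsed in this diff)
inv_latents = pipeline.invert(caption, image=raw_image).latents
```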
...@@ -159,7 +157,7 @@ image = pipeline(
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
image
```
## Generating source and target embeddings
...@@ -214,6 +212,7 @@ And then we just call it to generate our captions:
```py
source_captions = generate_captions(source_text)
target_captions = generate_captions(target_concept)
print(source_captions, target_captions, sep='\n')
```
We encourage you to play around with the different parameters supported by the
...@@ -268,16 +267,22 @@ from diffusers import DDIMScheduler
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
image = pipeline(
    prompt,
    source_embeds=source_embeddings,
    target_embeds=target_embeddings,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images[0]
image
```
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionPix2PixZeroPipeline
[[autodoc]] StableDiffusionPix2PixZeroPipeline
- __call__
......
...@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# PixArt
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png)
...@@ -24,13 +24,20 @@ You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github
Some notes about this pipeline:
* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit).
* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
* It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py).
* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.
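For orientation, a minimal text-to-image call might look like the sketch below; the checkpoint id is an assumption and not part of this diff, so swap in the PixArt-alpha checkpoint you actually want to use.
```py
import torch
from diffusers import PixArtAlphaPipeline

# assumed checkpoint id, shown only for illustration
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16).to("cuda")
image = pipe("A small cactus with a happy face in the Sahara desert").images[0]
```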
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## PixArtAlphaPipeline
[[autodoc]] PixArtAlphaPipeline
- all
- __call__
\ No newline at end of file
...@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# PNDM
[Pseudo Numerical Methods for Diffusion Models on Manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
The abstract from the paper is:
......
...@@ -12,12 +12,12 @@ specific language governing permissions and limitations under the License.
# Semantic Guidance
Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition.
The abstract from the paper is:
*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.*
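As a rough illustration of what this semantic steering looks like in code (a sketch assuming the `SemanticStableDiffusionPipeline` and its `editing_prompt`/`reverse_editing_direction` call arguments, which are not shown in this diff):
```py
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# steer the image along a semantic direction while keeping the overall composition
image = pipe(
    "a photo of a castle in the mountains",
    editing_prompt=["snowy winter"],
    reverse_editing_direction=[False],  # set True to push away from the concept instead
).images[0]
```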
<Tip>
......
...@@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License.
# Shap-E
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai).
The abstract from the paper is:
......
...@@ -20,7 +20,7 @@ Using the pretrained models we can provide control images (for example, a depth
The abstract of the paper is the following:
*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*
This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ .
...@@ -33,7 +33,7 @@ This model was contributed by the community contributor [HimariO](https://github
## Usage example with the base model of StableDiffusion-1.4/1.5
In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
All adapters use the same pipeline.
1. Images are first converted into the appropriate *control image* format.
...@@ -42,7 +42,7 @@ All adapters use the same pipeline.
Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).
```python
from diffusers.utils import load_image, make_image_grid

image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
```
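The intermediate steps (shrinking the reference image into an 8x8 color palette and building the adapter pipeline) sit in collapsed hunks of this diff. A rough sketch of what they typically look like follows; the resampling choice and dtype here are assumptions rather than part of the diff.
```python
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

# shrink to an 8x8 palette, then upscale with nearest-neighbor so the color blocks stay crisp
color_palette = image.resize((8, 8)).resize((512, 512), resample=Image.Resampling.NEAREST)

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", adapter=adapter, torch_dtype=torch.float16
).to("cuda")
```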
...@@ -83,20 +83,21 @@ Finally, pass the prompt and control image to the pipeline
```py
# fix the random seed, so you will get the same result as the example
generator = torch.Generator("cuda").manual_seed(7)

out_image = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
    generator=generator,
).images[0]
make_image_grid([image, color_palette, out_image], rows=1, cols=3)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)
## Usage example with the base model of StableDiffusion-XL
In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
All adapters use the same pipeline.
1. Images are first downloaded into the appropriate *control image* format.
...@@ -105,7 +106,7 @@ All adapters use the same pipeline.
Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
```python
from diffusers.utils import load_image, make_image_grid

sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
```
...@@ -121,10 +122,9 @@ from diffusers import (
    StableDiffusionXLAdapterPipeline,
    DDPMScheduler
)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
...@@ -147,6 +147,7 @@ sketch_image_out = pipe(
    generator=generator,
    guidance_scale=7.5
).images[0]
make_image_grid([sketch_image, sketch_image_out], rows=1, cols=2)
```
![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)
...@@ -159,7 +160,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | An image with 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
|[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"/></a>|
|[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"/></a>|
|[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"/></a>|
...@@ -181,9 +182,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.
```py
from diffusers.utils import load_image, make_image_grid

cond_keypose = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
...@@ -191,7 +190,7 @@ cond_keypose = load_image(
cond_depth = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
)
cond = [cond_keypose, cond_depth]

prompt = ["A man walking in an office room with a nice view"]
```
...@@ -207,7 +206,8 @@ The two control images look as such:
`adapter_conditioning_scale` balances the relative influence of the different adapters.
```py
import torch
from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter

adapters = MultiAdapter(
    [
...@@ -221,18 +221,19 @@ pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    adapter=adapters,
).to("cuda")

image = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images[0]
make_image_grid([cond_keypose, cond_depth, image], rows=1, cols=3)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_depth_sample_output.png)
## T2I-Adapter vs ControlNet
T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet).
T2I-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process.
However, T2I-Adapter performs slightly worse than ControlNet.
## StableDiffusionAdapterPipeline
......
...@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# Depth-to-image
The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.
<Tip>
......
...@@ -34,7 +34,7 @@ The table below summarizes the available Stable Diffusion pipelines, their suppo
Supported tasks
</th>
<th class="px-4 py-2 font-medium text-gray-900 text-left">
🤗 Space
</th>
</tr>
</thead>
......
...@@ -19,7 +19,7 @@ These models are trained on an aesthetic subset of the [LAION-5B dataset](https:
For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release).
The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out its API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as little as 20 steps.
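A minimal sketch of swapping in that scheduler (assuming a loaded `pipe` and a `prompt`, as in the snippets below):
```py
from diffusers import DPMSolverMultistepScheduler

# replace the default scheduler while keeping the rest of the pipeline unchanged
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe(prompt, num_inference_steps=20).images[0]
```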
Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image:
...@@ -55,30 +55,21 @@ pipe = pipe.to("cuda")
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image
```
## Inpainting
```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import load_image, make_image_grid

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
...@@ -88,17 +79,14 @@ pipe = pipe.to("cuda")
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
```
## Super-resolution
```py
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image, make_image_grid
import torch

# load model and scheduler
...@@ -108,22 +96,19 @@ pipeline = pipeline.to("cuda")
# let's download an image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
low_res_img = load_image(url)
low_res_img = low_res_img.resize((128, 128))
prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2)
```
## Depth-to-image
```py
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image, make_image_grid

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
...@@ -132,8 +117,9 @@ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = load_image(url)
prompt = "two tigers"
negative_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
```
...@@ -23,7 +23,7 @@ The abstract from the paper is:
- Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers:
    - set `use_karras_sigmas=True` or `lu_lambdas=True` to improve image quality
    - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE); a configuration sketch is shown after this list
- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't work well for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders.
- SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters.
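A minimal sketch of applying those scheduler settings; `pipeline_xl` is an assumed, already-loaded SDXL pipeline, and the exact flags come from the PR linked above.
```py
from diffusers import DPMSolverMultistepScheduler

# reconfigure the existing DPM++ scheduler with the recommended stability settings
pipeline_xl.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline_xl.scheduler.config,
    use_karras_sigmas=True,  # or the Lu lambdas option mentioned above
    euler_at_final=True,     # for uniform-step solvers such as DPM++2M / DPM++2M SDE
)
```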
......
...@@ -22,12 +22,10 @@ The abstract from the paper is:
## Tips
Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`).
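As a quick, hedged illustration (assuming an already-loaded Stable unCLIP pipeline named `pipe` that accepts `noise_level` as a call argument):
```python
# more noise on the image embeddings -> more variation between samples
image = pipe(prompt="a photo of a fantasy landscape", noise_level=200).images[0]
```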
### Text-to-Image Generation
Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha):
```python
import torch
...@@ -60,8 +58,8 @@ pipe = StableUnCLIPPipeline.from_pretrained(
pipe = pipe.to("cuda")
wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"
image = pipe(prompt=wave_prompt).images[0]
image
```
<Tip warning={true}>
...@@ -93,9 +91,16 @@ Optionally, you can also pass a prompt to `pipe` such as:
```python
prompt = "A fantasy landscape, trending on artstation"
image = pipe(init_image, prompt=prompt).images[0]
image
```
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableUnCLIPPipeline
[[autodoc]] StableUnCLIPPipeline
...@@ -108,7 +113,6 @@ images[0].save("variation_image_two.png")
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## StableUnCLIPImg2ImgPipeline
[[autodoc]] StableUnCLIPImg2ImgPipeline
......
...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
The abstract from the paper:
*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.*
<Tip>
......