Unverified commit 3ad4207d authored by M. Tolga Cangöz, committed by GitHub

[`Docs`] Fix typos, update, and add visualizations at Using Diffusers' Pipelines for Inference Page (#5649)

* Fix typos, update, add visualizations

* Update sdxl.md

* Update controlnet.md

* Update docs/source/en/using-diffusers/shap-e.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/shap-e.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update diffedit.md

* Update kandinsky.md

* Update sdxl.md

* Update controlnet.md

* Update docs/source/en/using-diffusers/controlnet.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/controlnet.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update controlnet.md

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

parent 3517fb94
@@ -30,7 +30,6 @@ You should start by creating a `one_step_unet.py` file for your community pipeli
from diffusers import DiffusionPipeline
import torch
class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()

@@ -59,7 +58,6 @@ In the forward pass, which we recommend defining as `__call__`, you have complet
from diffusers import DiffusionPipeline
import torch
class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()
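Both hunks stop right after `super().__init__()`. For orientation, here is a minimal sketch of how this one-step community pipeline is typically completed (the `register_modules` call plus a single-step `__call__`); treat the exact body as an illustration of the pattern rather than text from the commit:

```py
from diffusers import DiffusionPipeline
import torch


class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()
        # register_modules makes unet and scheduler savable/loadable with save_pretrained/from_pretrained
        self.register_modules(unet=unet, scheduler=scheduler)

    def __call__(self):
        # start from random noise shaped like the UNet's expected input
        image = torch.randn(
            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size)
        )
        timestep = 1

        # one UNet forward pass followed by one scheduler step
        model_output = self.unet(image, timestep).sample
        scheduler_output = self.scheduler.step(model_output, timestep, image).prev_sample

        return scheduler_output
```

Instantiated with a `UNet2DModel` and a `DDPMScheduler`, calling the pipeline returns the output of a single denoising step.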
@@ -150,12 +148,12 @@ Sometimes you can't load all the pipeline components weights from an official re
```python
from diffusers import DiffusionPipeline
-from transformers import CLIPFeatureExtractor, CLIPModel
+from transformers import CLIPImageProcessor, CLIPModel
model_id = "CompVis/stable-diffusion-v1-4"
clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
-feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id)
+feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(
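The hunk cuts off at the opening of `DiffusionPipeline.from_pretrained(`. A sketch of how that call is usually completed for this CLIP-guided example, continuing from the variables above; the `custom_pipeline` name and keyword wiring are assumptions based on the surrounding context, not lines from the commit:

```py
# assumes model_id, clip_model, feature_extractor and torch from the snippet above
pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",  # assumed community pipeline name
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    torch_dtype=torch.float16,
    use_safetensors=True,
)
```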
@@ -172,7 +170,7 @@ pipeline = DiffusionPipeline.from_pretrained(
The magic behind community pipelines is contained in the following code. It allows the community pipeline to be loaded from GitHub or the Hub, and it'll be available to all 🧨 Diffusers packages.
```python
-# 2. Load the pipeline class, if using custom module then load it from the hub
+# 2. Load the pipeline class, if using custom module then load it from the Hub
# if we load from explicit class, let's use it
if custom_pipeline is not None:
    pipeline_class = get_class_from_dynamic_module(
...

@@ -16,7 +16,7 @@ ControlNet is a type of model for controlling image diffusion models by conditio
<Tip>
-Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
+Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub.

@@ -35,7 +35,7 @@ Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
-#!pip install diffusers transformers accelerate safetensors opencv-python
+#!pip install -q diffusers transformers accelerate opencv-python
```
## Text-to-image
@@ -45,17 +45,16 @@ For text-to-image, you normally pass a text prompt to the model. But with Contro
Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image:
```py
-from diffusers import StableDiffusionControlNetPipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
-image = load_image(
+original_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
-image = np.array(image)
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200
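The pipeline-setup step between this hunk and the next is collapsed. A sketch of what it typically looks like for the canny example is below; the checkpoint names and scheduler choice are assumptions drawn from the rest of this page rather than committed lines:

```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

# canny-conditioned ControlNet paired with the SD 1.5 base model used throughout this page
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

# a faster scheduler and CPU offload keep memory use manageable
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
```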
@@ -98,6 +97,7 @@ Now pass your prompt and canny image to the pipeline:
output = pipe(
    "the mona lisa", image=canny_image
).images[0]
+make_image_grid([original_image, canny_image, output], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -117,12 +117,11 @@ import torch
import numpy as np
from transformers import pipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
-).resize((768, 768))
+)
def get_depth_map(image, depth_estimator):
    image = depth_estimator(image)["depth"]
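`get_depth_map` is truncated by the hunk boundary. A sketch of how the helper usually continues and is used follows; the exact post-processing steps and the default `depth-estimation` pipeline are assumptions, not part of the diff:

```py
import numpy as np
import torch
from transformers import pipeline

def get_depth_map(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    # replicate the single-channel depth estimate into three channels
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    # channels-first tensor that can be passed as control_image
    depth_map = detected_map.permute(2, 0, 1)
    return depth_map

depth_estimator = pipeline("depth-estimation")
depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
```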
@@ -158,6 +157,7 @@ Now pass your prompt, initial image, and depth map to the pipeline:
output = pipe(
    "lego batman and robin", image=image, control_image=depth_map,
).images[0]
+make_image_grid([image, output], rows=1, cols=2)
```
<div class="flex gap-4">

@@ -171,18 +171,14 @@ output = pipe(
</div>
</div>
## Inpainting
-For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline.
+For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area.
Load an initial image and a mask image:
```py
-from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
-from diffusers.utils import load_image
-import numpy as np
-import torch
+from diffusers.utils import load_image, make_image_grid
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"

@@ -193,11 +189,15 @@ mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
)
mask_image = mask_image.resize((512, 512))
+make_image_grid([init_image, mask_image], rows=1, cols=2)
```
Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold.
```py
+import numpy as np
+import torch
def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
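The function body is cut off here. A sketch of the usual remainder and of building `control_image` from it is below; the 0.5 threshold and tensor layout are assumptions consistent with the description above, not lines from the commit:

```py
import numpy as np
import torch

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[0:1] == image_mask.shape[0:1]
    # mark pixels whose mask value is over the threshold so they are treated as "to be filled"
    image[image_mask > 0.5] = -1.0
    # HWC -> NCHW float tensor
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)
```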
@@ -226,7 +226,6 @@ Load a ControlNet model conditioned on inpainting and pass it to the [`StableDif
```py
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
-import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(

@@ -248,6 +247,7 @@ output = pipe(
    mask_image=mask_image,
    control_image=control_image,
).images[0]
+make_image_grid([init_image, mask_image, output], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -270,14 +270,29 @@ Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.co
```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.utils import load_image, make_image_grid
+import numpy as np
import torch
+from PIL import Image
+import cv2
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
-pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to(
-    "cuda"
-)
+pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda")
+original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")
+image = np.array(original_image)
+low_threshold = 100
+high_threshold = 200
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
-image
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
<div class="flex gap-4">
@@ -293,22 +308,23 @@ image
## ControlNet with Stable Diffusion XL
-There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization!
+There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the [🤗 Diffusers Hub organization](https://huggingface.co/diffusers)!
Let's use a SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and prepare the canny image:
```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
+import torch
-image = load_image(
+original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)
-image = np.array(image)
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200

@@ -317,7 +333,7 @@ image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
-canny_image
+make_image_grid([original_image, canny_image], rows=1, cols=2)
```
<div class="flex gap-4">
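The step that loads the SDXL ControlNet pipeline sits between these hunks. A sketch of what it typically looks like is below; the canny SDXL ControlNet and fp16-fix VAE checkpoints are assumptions based on the rest of this section, not committed lines:

```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
import torch

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
# an fp16-friendly VAE avoids numerical issues when running the SDXL VAE in half precision
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()
```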
@@ -362,13 +378,13 @@ The [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = 'low quality, bad quality, sketches'
-images = pipe(
+image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.5,
).images[0]
-images
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -379,17 +395,16 @@ You can use [`StableDiffusionXLControlNetPipeline`] in guess mode as well by set
```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
import cv2
from PIL import Image
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"
-image = load_image(
+original_image = load_image(
    "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

@@ -402,15 +417,16 @@ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
)
pipe.enable_model_cpu_offload()
-image = np.array(image)
+image = np.array(original_image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
image = pipe(
-    prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
+    prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
).images[0]
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
### MultiControlNet
@@ -431,29 +447,30 @@ In this example, you'll combine a canny image and a human pose estimation image
Prepare the canny image conditioning:
```py
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
from PIL import Image
import numpy as np
import cv2
-canny_image = load_image(
+original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
-canny_image = np.array(canny_image)
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200
-canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
+image = cv2.Canny(image, low_threshold, high_threshold)
# zero out middle columns of image where pose will be overlaid
-zero_start = canny_image.shape[1] // 4
-zero_end = zero_start + canny_image.shape[1] // 2
-canny_image[:, zero_start:zero_end] = 0
-canny_image = canny_image[:, :, None]
-canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
-canny_image = Image.fromarray(canny_image).resize((1024, 1024))
+zero_start = image.shape[1] // 4
+zero_end = zero_start + image.shape[1] // 2
+image[:, zero_start:zero_end] = 0
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+make_image_grid([original_image, canny_image], rows=1, cols=2)
```
<div class="flex gap-4">
@@ -467,18 +484,24 @@ canny_image = Image.fromarray(canny_image).resize((1024, 1024))
</div>
</div>
+For human pose estimation, install [controlnet_aux](https://github.com/patrickvonplaten/controlnet_aux):
+```py
+# uncomment to install the necessary library in Colab
+#!pip install -q controlnet-aux
+```
Prepare the human pose estimation conditioning:
```py
from controlnet_aux import OpenposeDetector
-from diffusers.utils import load_image
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
-openpose_image = load_image(
+original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
-openpose_image = openpose(openpose_image).resize((1024, 1024))
+openpose_image = openpose(original_image)
+make_image_grid([original_image, openpose_image], rows=1, cols=2)
```
<div class="flex gap-4">

@@ -500,7 +523,7 @@ import torch
controlnets = [
    ControlNetModel.from_pretrained(
-        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
+        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
@@ -523,7 +546,7 @@ negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
generator = torch.manual_seed(1)
-images = [openpose_image, canny_image]
+images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]
images = pipe(
    prompt,

@@ -533,7 +556,9 @@ images = pipe(
    negative_prompt=negative_prompt,
    num_images_per_prompt=3,
    controlnet_conditioning_scale=[1.0, 0.8],
-).images[0]
+).images
+make_image_grid([original_image, canny_image, openpose_image,
+                 images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)
```
<div class="flex justify-center">
...
@@ -25,6 +25,8 @@ Community pipelines allow you to get creative and build your own unique pipeline
To load a community pipeline, use the `custom_pipeline` argument in [`DiffusionPipeline`] to specify one of the files in [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community):
```py
+from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True
)

@@ -39,7 +41,6 @@ You can learn more about community pipelines in the how to [load community pipel
The multilingual Stable Diffusion pipeline uses a pretrained [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection) to identify a language and the [mBART-large-50](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) model to handle the translation. This allows you to generate images from text in 20 languages.
```py
-from PIL import Image
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import make_image_grid
@@ -59,15 +60,15 @@ language_detection_pipeline = pipeline("text-classification",
                                       device=device_dict[device])
# add model for language translation
-trans_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
-trans_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)
+translation_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
+translation_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)
diffuser_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="multilingual_stable_diffusion",
    detection_pipeline=language_detection_pipeline,
-    translation_model=trans_model,
-    translation_tokenizer=trans_tokenizer,
+    translation_model=translation_model,
+    translation_tokenizer=translation_tokenizer,
    torch_dtype=torch.float16,
)

@@ -80,8 +81,7 @@ prompt = ["a photograph of an astronaut riding a horse",
          "Un restaurant parisien"]
images = diffuser_pipeline(prompt).images
-grid = make_image_grid(images, rows=2, cols=2)
-grid
+make_image_grid(images, rows=2, cols=2)
```
<div class="flex justify-center">
@@ -94,23 +94,23 @@ grid
```py
from diffusers import DiffusionPipeline, DDIMScheduler
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="magic_mix",
-    scheduler = DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
+    scheduler=DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
).to('cuda')
img = load_image("https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg")
-mix_img = pipeline(img, prompt="bed", kmin = 0.3, kmax = 0.5, mix_factor = 0.5)
+mix_img = pipeline(img, prompt="bed", kmin=0.3, kmax=0.5, mix_factor=0.5)
-mix_img
+make_image_grid([img, mix_img], rows=1, cols=2)
```
<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg" />
-    <figcaption class="mt-2 text-center text-sm text-gray-500">image prompt</figcaption>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://user-images.githubusercontent.com/59410571/209578602-70f323fa-05b7-4dd6-b055-e40683e37914.jpg" />
...
@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
-#!pip install diffusers transformers accelerate safetensors
+#!pip install -q diffusers transformers accelerate
```
The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then:
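The pipeline-loading step sits between this hunk and the next, whose header shows `pipeline.enable_vae_slicing()` as trailing context. A sketch of that setup is below; the Stable Diffusion 2.1 checkpoint and the scheduler pairing are assumptions consistent with the surrounding docs, not committed lines:

```py
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    safety_checker=None,
    use_safetensors=True,
)
# DiffEdit needs a deterministic scheduler plus its inverse for the latent inversion step
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
```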
@@ -59,15 +59,18 @@ pipeline.enable_vae_slicing()
Load the image to edit:
```py
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
-raw_image
+raw_image = load_image(img_url).resize((768, 768))
```
Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:
```py
+from PIL import Image
source_prompt = "a bowl of fruits"
target_prompt = "a basket of pears"
mask_image = pipeline.generate_mask(

@@ -75,6 +78,7 @@ mask_image = pipeline.generate_mask(
    source_prompt=source_prompt,
    target_prompt=target_prompt,
)
+Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
```
Next, create the inverted latents and pass it a caption describing the image:
@@ -86,13 +90,14 @@ inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents
Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`:
```py
-image = pipeline(
+output_image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
).images[0]
-image.save("edited_image.png")
+mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
+make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
```
<div class="flex gap-4">

@@ -116,8 +121,8 @@ Load the Flan-T5 model and tokenizer from the 🤗 Transformers library:
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
-tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
-model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
+tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
+model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
```
Provide some initial text to prompt the model to generate the source and target prompts.
@@ -136,7 +141,7 @@ target_text = f"Provide a caption for images containing a {target_concept}. "
Next, create a utility function to generate the prompts:
```py
-@torch.no_grad
+@torch.no_grad()
def generate_prompts(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
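The utility function is cut off after the tokenization line. A sketch of how it typically continues and is called follows; the sampling settings are illustrative assumptions, not lines from the commit:

```py
@torch.no_grad()
def generate_prompts(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")

    # sample several candidate captions from Flan-T5 for the source/target concepts
    outputs = model.generate(
        input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

source_prompts = generate_prompts(source_text)
target_prompts = generate_prompts(target_text)
```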
@@ -193,33 +198,39 @@ Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_
```diff
from diffusers import DDIMInverseScheduler, DDIMScheduler
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
+from PIL import Image
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
+raw_image = load_image(img_url).resize((768, 768))
mask_image = pipeline.generate_mask(
    image=raw_image,
- source_prompt=source_prompt,
- target_prompt=target_prompt,
+ source_prompt_embeds=source_embeds,
+ target_prompt_embeds=target_embeds,
)
inv_latents = pipeline.invert(
- prompt=source_prompt,
+ prompt_embeds=source_embeds,
    image=raw_image,
).latents
-images = pipeline(
+output_image = pipeline(
    mask_image=mask_image,
    image_latents=inv_latents,
- prompt=target_prompt,
- negative_prompt=source_prompt,
+ prompt_embeds=target_embeds,
+ negative_prompt_embeds=source_embeds,
-).images
-images[0].save("edited_image.png")
+).images[0]
+mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L")
+make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
```
## Generate a caption for inversion
@@ -260,7 +271,7 @@ Load an input image and generate a caption for it using the `generate_caption` f
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
+raw_image = load_image(img_url).resize((768, 768))
caption = generate_caption(raw_image, model, processor)
```
...
@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
-#!pip install transformers accelerate safetensors
+#!pip install -q diffusers transformers accelerate
```
<Tip warning={true}>
@@ -58,6 +58,7 @@ Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate
```py
image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
```
<div class="flex justify-center">

@@ -83,6 +84,7 @@ Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipelin
```py
image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
```
<div class="flex justify-center">
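The prior step that produces the `image_embeds` and `negative_image_embeds` used in the two hunks above is collapsed in this diff. A sketch of that step for Kandinsky 2.1 is below; the checkpoint names and `guidance_scale=1.0` are assumptions drawn from the rest of this page, not committed lines:

```py
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipeline = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

# the prior maps text to CLIP image embeddings that the decoder pipeline consumes
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
```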
@@ -109,7 +111,8 @@ pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
-image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+image
```
</hfoption>

@@ -125,7 +128,8 @@ pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
-image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+image
```
</hfoption>

@@ -133,7 +137,7 @@ image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_
## Image-to-image
-For image-to-image, pass the initial image and text prompt to condition the image with to the pipeline. Start by loading the prior pipeline:
+For image-to-image, pass the initial image and text prompt to condition the image to the pipeline. Start by loading the prior pipeline:
<hfoptions id="image-to-image">
<hfoption id="Kandinsky 2.1">
@@ -163,14 +167,11 @@ pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kand
Download an image to condition on:
```py
-from PIL import Image
-import requests
-from io import BytesIO
+from diffusers.utils import load_image
# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image = load_image(url)
original_image = original_image.resize((768, 512))
```
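The prompt definitions and the prior call that produces the embeddings are collapsed between this hunk and the next. A sketch of that step, assuming the 2.1 prior pipeline loaded earlier in this section (these exact lines are not part of the commit):

```py
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

# the prior pipeline turns the prompts into image embeddings for the img2img decoder
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()
```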
@@ -193,7 +194,10 @@ Now pass the original image, and all the prompts and embeddings to the pipeline
<hfoption id="Kandinsky 2.1">
```py
-image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_emebds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+from diffusers.utils import make_image_grid
+image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">

@@ -204,7 +208,10 @@ image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image,
<hfoption id="Kandinsky 2.2">
```py
-image = pipeline(image=original_image, image_embeds=image_emebds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+from diffusers.utils import make_image_grid
+image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">
@@ -223,11 +230,8 @@ Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipeli
```py
from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
import torch
-import requests
-from io import BytesIO
-from PIL import Image
-import os
pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
pipeline.enable_model_cpu_offload()

@@ -236,12 +240,12 @@ prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image = load_image(url)
original_image.thumbnail((768, 768))
-image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
</hfoption>
@@ -249,11 +253,8 @@ image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
```py
from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
import torch
-import requests
-from io import BytesIO
-from PIL import Image
-import os
pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

@@ -262,12 +263,12 @@ prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image = load_image(url)
original_image.thumbnail((768, 768))
-image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
</hfoption>
@@ -277,7 +278,7 @@ image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
<Tip warning={true}>
-⚠️ The Kandinsky models uses ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:
+⚠️ The Kandinsky models use ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:
```py
# For PIL input

@@ -297,9 +298,10 @@ For inpainting, you'll need the original image, a mask of the area to replace in
```py
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
+from PIL import Image
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
@@ -310,9 +312,10 @@ pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandins
```py
from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
+from PIL import Image
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
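The image/mask preparation and the prior call are collapsed between these hunks. A sketch of that step follows, continuing from the imports and prior pipeline above; the cat image URL and rectangular mask mirror the combined-pipeline hunks further down the page, so treat this as an illustration rather than committed lines:

```py
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")

# white (1.0) marks the region to repaint: a band above the cat's head
mask = np.zeros((768, 768), dtype=np.float32)
mask[:250, 250:-250] = 1

prompt = "a hat"
# prior output (image_embeds / negative_image_embeds) is later unpacked with **prior_output
prior_output = prior_pipeline(prompt)
```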
@@ -343,7 +346,9 @@ Now pass the initial image, mask, and prompt and embeddings to the pipeline to g
<hfoption id="Kandinsky 2.1">
```py
-image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -354,7 +359,9 @@ image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, heig
<hfoption id="Kandinsky 2.2">
```py
-image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
<div class="flex justify-center">
@@ -371,14 +378,23 @@ You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`Kandi
```py
import torch
+import numpy as np
+from PIL import Image
from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
prompt = "a hat"
-image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
</hfoption>
@@ -386,14 +402,23 @@ image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
```py
import torch
+import numpy as np
+from PIL import Image
from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
prompt = "a hat"
-image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+output_image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
</hfoption>
@@ -408,13 +433,13 @@ Interpolation allows you to explore the latent space between the image and text
```py
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
-from diffusers.utils import load_image
-import PIL
+from diffusers.utils import load_image, make_image_grid
import torch
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
+make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
```
</hfoption>

@@ -422,13 +447,13 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser
```py
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
-from diffusers.utils import load_image
-import PIL
+from diffusers.utils import load_image, make_image_grid
import torch
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
+make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
```
</hfoption>
...@@ -448,7 +473,7 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser ...@@ -448,7 +473,7 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser
Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation! Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation!
```py ```py
images_texts = ["a cat", img1, img2] images_texts = ["a cat", img_1, img_2]
weights = [0.3, 0.3, 0.4] weights = [0.3, 0.3, 0.4]
``` ```
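A minimal sketch of the next step (assuming the Kandinsky 2.1 `prior_pipeline` loaded in the tab above; the 2.2 flow is analogous with [`KandinskyV22Pipeline`]) is to call the prior's `interpolate` method and decode the mixed embeddings:

```py
from diffusers import KandinskyPipeline

# the prior mixes the image/text embeddings according to the weights
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# an empty prompt works because the interpolated embeddings already carry the content
image = pipeline("", **prior_out, height=768, width=768).images[0]
image
```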
...@@ -511,6 +536,7 @@ from diffusers.utils import load_image ...@@ -511,6 +536,7 @@ from diffusers.utils import load_image
img = load_image( img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768)) ).resize((768, 768))
img
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -524,8 +550,6 @@ import torch ...@@ -524,8 +550,6 @@ import torch
import numpy as np import numpy as np
from transformers import pipeline from transformers import pipeline
from diffusers.utils import load_image
def make_hint(image, depth_estimator): def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"] image = depth_estimator(image)["depth"]
...@@ -536,7 +560,6 @@ def make_hint(image, depth_estimator): ...@@ -536,7 +560,6 @@ def make_hint(image, depth_estimator):
hint = detected_map.permute(2, 0, 1) hint = detected_map.permute(2, 0, 1)
return hint return hint
depth_estimator = pipeline("depth-estimation") depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
``` ```
...@@ -550,10 +573,10 @@ from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline ...@@ -550,10 +573,10 @@ from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained( prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
)to("cuda") ).to("cuda")
pipeline = KandinskyV22ControlnetPipeline.from_pretrained( pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16, use_safetensors=True "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda") ).to("cuda")
``` ```
...@@ -561,11 +584,11 @@ Generate the image embeddings from a prompt and negative prompt: ...@@ -561,11 +584,11 @@ Generate the image embeddings from a prompt and negative prompt:
```py ```py
prompt = "A robot, 4k photo" prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43) generator = torch.Generator(device="cuda").manual_seed(43)
image_emb, zero_image_emb = pipe_prior(
image_emb, zero_image_emb = prior_pipeline(
prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple() ).to_tuple()
``` ```
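As a rough sketch of the decode step (assuming the `pipeline`, `hint`, and embeddings defined above), pass the image embeddings and the depth hint to the [`KandinskyV22ControlnetPipeline`]:

```py
image = pipeline(
    image_embeds=image_emb,
    negative_image_embeds=zero_image_emb,
    hint=hint,
    num_inference_steps=50,
    generator=generator,
    height=768,
    width=768,
).images[0]
image
```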
...@@ -599,10 +622,9 @@ from diffusers.utils import load_image ...@@ -599,10 +622,9 @@ from diffusers.utils import load_image
from transformers import pipeline from transformers import pipeline
img = load_image( img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png" "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768)) ).resize((768, 768))
def make_hint(image, depth_estimator): def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"] image = depth_estimator(image)["depth"]
image = np.array(image) image = np.array(image)
...@@ -612,7 +634,6 @@ def make_hint(image, depth_estimator): ...@@ -612,7 +634,6 @@ def make_hint(image, depth_estimator):
hint = detected_map.permute(2, 0, 1) hint = detected_map.permute(2, 0, 1)
return hint return hint
depth_estimator = pipeline("depth-estimation") depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
``` ```
...@@ -637,15 +658,15 @@ negative_prior_prompt = "lowres, text, error, cropped, worst quality, low qualit ...@@ -637,15 +658,15 @@ negative_prior_prompt = "lowres, text, error, cropped, worst quality, low qualit
generator = torch.Generator(device="cuda").manual_seed(43) generator = torch.Generator(device="cuda").manual_seed(43)
img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator) img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
``` ```
Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings: Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings:
```py ```py
image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
image make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -656,7 +677,7 @@ image ...@@ -656,7 +677,7 @@ image
Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference. Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference.
1. Enable [xFormers](https://moon-ci-docs.huggingface.co/optimization/xformers) if you're using PyTorch < 2.0: 1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0:
```diff ```diff
from diffusers import DiffusionPipeline from diffusers import DiffusionPipeline
...@@ -666,14 +687,11 @@ Kandinsky is unique because it requires a prior pipeline to generate the mapping ...@@ -666,14 +687,11 @@ Kandinsky is unique because it requires a prior pipeline to generate the mapping
+ pipe.enable_xformers_memory_efficient_attention() + pipe.enable_xformers_memory_efficient_attention()
``` ```
2. Enable `torch.compile` if you're using PyTorch 2.0 to automatically use scaled dot-product attention (SDPA): 2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):
```diff ```diff
pipe.unet.to(memory_format=torch.channels_last) pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduced-overhead", fullgraph=True) + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_xformers_memory_efficient_attention()
``` ```
This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]: This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]:
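A minimal sketch of what that looks like (assuming `pipe` is the Kandinsky pipeline from the previous step):

```py
from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

# explicitly switch the UNet to the SDPA-based attention processor
pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
```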
...@@ -697,7 +715,8 @@ pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0()) ...@@ -697,7 +715,8 @@ pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality: 4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality:
```py ```py
from diffusers import DDPMSCheduler from diffusers import DDPMScheduler
from diffusers import DiffusionPipeline
scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler") scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda") pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
......
...@@ -55,7 +55,7 @@ But if you need to reliably generate the same image, that'll depend on whether y ...@@ -55,7 +55,7 @@ But if you need to reliably generate the same image, that'll depend on whether y
### CPU ### CPU
To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.randn.html) and set a seed: To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed:
```python ```python
import torch import torch
...@@ -83,7 +83,7 @@ If you run this code example on your specific hardware and PyTorch version, you ...@@ -83,7 +83,7 @@ If you run this code example on your specific hardware and PyTorch version, you
💡 It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of 💡 It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of
just integer values representing the seed, but this is the recommended design when dealing with just integer values representing the seed, but this is the recommended design when dealing with
probabilistic models in PyTorch as `Generator`'s are *random states* that can be probabilistic models in PyTorch, as `Generator`s are *random states* that can be
passed to multiple pipelines in a sequence. passed to multiple pipelines in a sequence.
</Tip> </Tip>
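As a small illustration of that design (a sketch using the same `google/ddpm-cifar10-32` checkpoint as this guide), a single `Generator` can be threaded through consecutive calls:

```py
import torch
from diffusers import DDIMPipeline

ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")

# one random state, reused across a sequence of calls
generator = torch.Generator(device="cpu").manual_seed(0)
first = ddim(num_inference_steps=2, output_type="np", generator=generator).images
second = ddim(num_inference_steps=2, output_type="np", generator=generator).images

# the generator advanced between the two calls, so `first` and `second` differ;
# re-seeding with manual_seed(0) before a call reproduces its result exactly
```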
...@@ -159,6 +159,7 @@ PyTorch typically benchmarks multiple algorithms to select the fastest one, but ...@@ -159,6 +159,7 @@ PyTorch typically benchmarks multiple algorithms to select the fastest one, but
```py ```py
import os import os
import torch
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8" os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
...@@ -171,7 +172,6 @@ Now when you run the same pipeline twice, you'll get identical results. ...@@ -171,7 +172,6 @@ Now when you run the same pipeline twice, you'll get identical results.
```py ```py
import torch import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline from diffusers import DDIMScheduler, StableDiffusionPipeline
import numpy as np
model_id = "runwayml/stable-diffusion-v1-5" model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda") pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda")
...@@ -186,6 +186,6 @@ result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type=" ...@@ -186,6 +186,6 @@ result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="
g.manual_seed(0) g.manual_seed(0)
result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images
print("L_inf dist = ", abs(result1 - result2).max()) print("L_inf dist =", abs(result1 - result2).max())
"L_inf dist = tensor(0., device='cuda:0')" "L_inf dist = tensor(0., device='cuda:0')"
``` ```
...@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed: ...@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed:
```py ```py
# uncomment to install the necessary libraries in Colab # uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 #!pip install -q diffusers transformers accelerate omegaconf invisible-watermark>=0.2.0
``` ```
<Tip warning={true}> <Tip warning={true}>
...@@ -84,7 +84,8 @@ pipeline_text2image = AutoPipelineForText2Image.from_pretrained( ...@@ -84,7 +84,8 @@ pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
).to("cuda") ).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt).images[0] image = pipeline_text2image(prompt=prompt).images[0]
image
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -96,16 +97,17 @@ image = pipeline(prompt=prompt).images[0] ...@@ -96,16 +97,17 @@ image = pipeline(prompt=prompt).images[0]
For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with:
```py ```py
from diffusers import AutoPipelineForImg2Img from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image from diffusers.utils import load_image, make_image_grid
# use from_pipe to avoid consuming additional memory when loading a checkpoint # use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png"
init_image = load_image(url).convert("RGB") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle" prompt = "a dog catching a frisbee in the jungle"
image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -118,7 +120,7 @@ For inpainting, you'll need the original image and a mask of what you want to re ...@@ -118,7 +120,7 @@ For inpainting, you'll need the original image and a mask of what you want to re
```py ```py
from diffusers import AutoPipelineForInpainting from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image from diffusers.utils import load_image, make_image_grid
# use from_pipe to avoid consuming additional memory when loading a checkpoint # use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")
...@@ -126,11 +128,12 @@ pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") ...@@ -126,11 +128,12 @@ pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")
img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png"
init_image = load_image(img_url).convert("RGB") init_image = load_image(img_url)
mask_image = load_image(mask_url).convert("RGB") mask_image = load_image(mask_url)
prompt = "A deep sea diver floating" prompt = "A deep sea diver floating"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -141,12 +144,12 @@ image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strengt ...@@ -141,12 +144,12 @@ image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strengt
SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner:
1. use the base and refiner model together to produce a refined image 1. use the base and refiner models together to produce a refined image
2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL is originally trained) 2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained)
### Base + refiner model ### Base + refiner model
When you use the base and refiner model together to generate an image, this is known as an ([*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/)). The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise.
As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model:
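A minimal sketch of that loading step (the same pattern as the base-to-refiner approach later in this guide; the refiner reuses the base model's second text encoder and VAE to save memory):

```py
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")
```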
...@@ -193,12 +196,13 @@ image = refiner( ...@@ -193,12 +196,13 @@ image = refiner(
denoising_start=0.8, denoising_start=0.8,
image=image, image=image,
).images[0] ).images[0]
image
``` ```
<div class="flex gap-4"> <div class="flex gap-4">
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png" alt="generated image of a lion on a rock at night" /> <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png" alt="generated image of a lion on a rock at night" />
<figcaption class="mt-2 text-center text-sm text-gray-500">base model</figcaption> <figcaption class="mt-2 text-center text-sm text-gray-500">default base model</figcaption>
</div> </div>
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png" alt="generated image of a lion on a rock at night in higher quality" /> <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png" alt="generated image of a lion on a rock at night in higher quality" />
...@@ -210,7 +214,8 @@ The refiner model can also be used for inpainting in the [`StableDiffusionXLInpa ...@@ -210,7 +214,8 @@ The refiner model can also be used for inpainting in the [`StableDiffusionXLInpa
```py ```py
from diffusers import StableDiffusionXLInpaintPipeline from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image from diffusers.utils import load_image, make_image_grid
import torch
base = StableDiffusionXLInpaintPipeline.from_pretrained( base = StableDiffusionXLInpaintPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
...@@ -218,8 +223,8 @@ base = StableDiffusionXLInpaintPipeline.from_pretrained( ...@@ -218,8 +223,8 @@ base = StableDiffusionXLInpaintPipeline.from_pretrained(
refiner = StableDiffusionXLInpaintPipeline.from_pretrained( refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0", "stabilityai/stable-diffusion-xl-refiner-1.0",
text_encoder_2=pipe.text_encoder_2, text_encoder_2=base.text_encoder_2,
vae=pipe.vae, vae=base.vae,
torch_dtype=torch.float16, torch_dtype=torch.float16,
use_safetensors=True, use_safetensors=True,
variant="fp16", variant="fp16",
...@@ -228,8 +233,8 @@ refiner = StableDiffusionXLInpaintPipeline.from_pretrained( ...@@ -228,8 +233,8 @@ refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB") init_image = load_image(img_url)
mask_image = load_image(mask_url).convert("RGB") mask_image = load_image(mask_url)
prompt = "A majestic tiger sitting on a bench" prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75 num_inference_steps = 75
...@@ -250,6 +255,7 @@ image = refiner( ...@@ -250,6 +255,7 @@ image = refiner(
num_inference_steps=num_inference_steps, num_inference_steps=num_inference_steps,
denoising_start=high_noise_frac, denoising_start=high_noise_frac,
).images[0] ).images[0]
make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3)
``` ```
This ensemble of expert denoisers method works well for all available schedulers! This ensemble of expert denoisers method works well for all available schedulers!
...@@ -270,8 +276,8 @@ base = DiffusionPipeline.from_pretrained( ...@@ -270,8 +276,8 @@ base = DiffusionPipeline.from_pretrained(
refiner = DiffusionPipeline.from_pretrained( refiner = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0", "stabilityai/stable-diffusion-xl-refiner-1.0",
text_encoder_2=pipe.text_encoder_2, text_encoder_2=base.text_encoder_2,
vae=pipe.vae, vae=base.vae,
torch_dtype=torch.float16, torch_dtype=torch.float16,
use_safetensors=True, use_safetensors=True,
variant="fp16", variant="fp16",
...@@ -303,7 +309,7 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0] ...@@ -303,7 +309,7 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
</div> </div>
</div> </div>
For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner.
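A rough sketch of that flow, reusing `base`, `refiner`, `init_image`, and `mask_image` from the ensemble example above (the step counts and `strength` value are illustrative, not the documented defaults):

```py
image = base(
    prompt="A majestic tiger sitting on a bench",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=75,
).images[0]

# refine the inpainted region with fewer steps; a low `strength` keeps the refiner close to the base output
image = refiner(
    prompt="A majestic tiger sitting on a bench",
    image=image,
    mask_image=mask_image,
    num_inference_steps=30,
    strength=0.3,
).images[0]
```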
## Micro-conditioning ## Micro-conditioning
...@@ -343,7 +349,7 @@ image = pipe( ...@@ -343,7 +349,7 @@ image = pipe(
<div class="flex flex-col justify-center"> <div class="flex flex-col justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png"/> <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png"/>
<figcaption class="text-center">Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).</figcaption> <figcaption class="text-center">Images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).</figcaption>
</div> </div>
### Crop conditioning ### Crop conditioning
...@@ -354,13 +360,13 @@ Images generated by previous Stable Diffusion models may sometimes appear to be ...@@ -354,13 +360,13 @@ Images generated by previous Stable Diffusion models may sometimes appear to be
from diffusers import StableDiffusionXLPipeline from diffusers import StableDiffusionXLPipeline
import torch import torch
pipeline = StableDiffusionXLPipeline.from_pretrained( pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda") ).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0]
image
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -384,11 +390,12 @@ image = pipe( ...@@ -384,11 +390,12 @@ image = pipe(
negative_crops_coords_top_left=(0, 0), negative_crops_coords_top_left=(0, 0),
negative_target_size=(1024, 1024), negative_target_size=(1024, 1024),
).images[0] ).images[0]
image
``` ```
## Use a different prompt for each text-encoder ## Use a different prompt for each text-encoder
SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompts): SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts):
```py ```py
from diffusers import StableDiffusionXLPipeline from diffusers import StableDiffusionXLPipeline
...@@ -403,13 +410,14 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" ...@@ -403,13 +410,14 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# prompt_2 is passed to OpenCLIP-ViT/bigG-14 # prompt_2 is passed to OpenCLIP-ViT/bigG-14
prompt_2 = "Van Gogh painting" prompt_2 = "Van Gogh painting"
image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
image
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-double-prompt.png" alt="generated image of an astronaut in a jungle in the style of a van gogh painting"/> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-double-prompt.png" alt="generated image of an astronaut in a jungle in the style of a van gogh painting"/>
</div> </div>
The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl] section. The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section.
## Optimizations ## Optimizations
...@@ -420,18 +428,18 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y ...@@ -420,18 +428,18 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y
```diff ```diff
- base.to("cuda") - base.to("cuda")
- refiner.to("cuda") - refiner.to("cuda")
+ base.enable_model_cpu_offload + base.enable_model_cpu_offload()
+ refiner.enable_model_cpu_offload + refiner.enable_model_cpu_offload()
``` ```
2. Use `torch.compile` for ~20% speed-up (you need `torch>2.0`): 2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`):
```diff ```diff
+ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) + base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) + refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
``` ```
3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`: 3. Enable [xFormers](../optimization/xformers) to run SDXL if `torch<2.0`:
```diff ```diff
+ base.enable_xformers_memory_efficient_attention() + base.enable_xformers_memory_efficient_attention()
......
...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. ...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps: Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
1. a encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset 1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset
2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications 2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications
This guide will show you how to use Shap-E to start generating your own 3D assets! This guide will show you how to use Shap-E to start generating your own 3D assets!
...@@ -25,7 +25,7 @@ Before you begin, make sure you have the following libraries installed: ...@@ -25,7 +25,7 @@ Before you begin, make sure you have the following libraries installed:
```py ```py
# uncomment to install the necessary libraries in Colab # uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors trimesh #!pip install -q diffusers transformers accelerate trimesh
``` ```
## Text-to-3D ## Text-to-3D
...@@ -38,7 +38,7 @@ from diffusers import ShapEPipeline ...@@ -38,7 +38,7 @@ from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(device) pipe = pipe.to(device)
guidance_scale = 15.0 guidance_scale = 15.0
...@@ -64,11 +64,11 @@ export_to_gif(images[1], "cake_3d.gif") ...@@ -64,11 +64,11 @@ export_to_gif(images[1], "cake_3d.gif")
<div class="flex gap-4"> <div class="flex gap-4">
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/> <img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">firecracker</figcaption> <figcaption class="mt-2 text-center text-sm text-gray-500">prompt = "A firecracker"</figcaption>
</div> </div>
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/> <img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">cupcake</figcaption> <figcaption class="mt-2 text-center text-sm text-gray-500">prompt = "A birthday cupcake"</figcaption>
</div> </div>
</div> </div>
...@@ -99,6 +99,7 @@ Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D represent ...@@ -99,6 +99,7 @@ Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D represent
```py ```py
from PIL import Image from PIL import Image
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif from diffusers.utils import export_to_gif
pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda") pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda")
...@@ -139,7 +140,7 @@ from diffusers import ShapEPipeline ...@@ -139,7 +140,7 @@ from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(device) pipe = pipe.to(device)
guidance_scale = 15.0 guidance_scale = 15.0
...@@ -160,7 +161,7 @@ You can optionally save the mesh output as an `obj` file with the [`~utils.expor ...@@ -160,7 +161,7 @@ You can optionally save the mesh output as an `obj` file with the [`~utils.expor
from diffusers.utils import export_to_ply from diffusers.utils import export_to_ply
ply_path = export_to_ply(images[0], "3d_cake.ply") ply_path = export_to_ply(images[0], "3d_cake.ply")
print(f"saved to folder: {ply_path}") print(f"Saved to folder: {ply_path}")
``` ```
Then you can convert the `ply` file to a `glb` file with the trimesh library: Then you can convert the `ply` file to a `glb` file with the trimesh library:
...@@ -169,7 +170,7 @@ Then you can convert the `ply` file to a `glb` file with the trimesh library: ...@@ -169,7 +170,7 @@ Then you can convert the `ply` file to a `glb` file with the trimesh library:
import trimesh import trimesh
mesh = trimesh.load("3d_cake.ply") mesh = trimesh.load("3d_cake.ply")
mesh.export("3d_cake.glb", file_type="glb") mesh_export = mesh.export("3d_cake.glb", file_type="glb")
``` ```
By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform: By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform:
...@@ -181,7 +182,7 @@ import numpy as np ...@@ -181,7 +182,7 @@ import numpy as np
mesh = trimesh.load("3d_cake.ply") mesh = trimesh.load("3d_cake.ply")
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0]) rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot) mesh = mesh.apply_transform(rot)
mesh.export("3d_cake.glb", file_type="glb") mesh_export = mesh.export("3d_cake.glb", file_type="glb")
``` ```
Upload the mesh file to your dataset repository to visualize it with the Dataset viewer! Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
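One way to do the upload with `huggingface_hub` (the repository id below is a placeholder and assumes you are already logged in):

```py
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="3d_cake.glb",
    path_in_repo="3d_cake.glb",
    repo_id="your-username/3d-assets",  # placeholder dataset repo
    repo_type="dataset",
)
```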
......