Unverified commit 3ad4207d authored by M. Tolga Cangöz, committed by GitHub

[`Docs`] Fix typos, update, and add visualizations at Using Diffusers' Pipelines for Inference Page (#5649)

* Fix typos, update, add visualizations

* Update sdxl.md

* Update controlnet.md

* Update docs/source/en/using-diffusers/shap-e.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/shap-e.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update diffedit.md

* Update kandinsky.md

* Update sdxl.md

* Update controlnet.md

* Update docs/source/en/using-diffusers/controlnet.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/controlnet.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update controlnet.md

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

parent 3517fb94
@@ -30,7 +30,6 @@ You should start by creating a `one_step_unet.py` file for your community pipeli
from diffusers import DiffusionPipeline
import torch
class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()

@@ -59,7 +58,6 @@ In the forward pass, which we recommend defining as `__call__`, you have complet
from diffusers import DiffusionPipeline
import torch
class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()
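Both hunks stop right after `super().__init__()`. For orientation, here is a minimal sketch of how this one-step community pipeline is typically completed (the `register_modules` call plus a single-step `__call__`); treat the exact body as an illustration of the pattern rather than text from the commit:

```py
from diffusers import DiffusionPipeline
import torch


class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()
        # register_modules makes unet and scheduler savable/loadable with save_pretrained/from_pretrained
        self.register_modules(unet=unet, scheduler=scheduler)

    def __call__(self):
        # start from random noise shaped like the UNet's expected input
        image = torch.randn(
            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size)
        )
        timestep = 1

        # one UNet forward pass followed by one scheduler step
        model_output = self.unet(image, timestep).sample
        scheduler_output = self.scheduler.step(model_output, timestep, image).prev_sample

        return scheduler_output
```

Instantiated with a `UNet2DModel` and a `DDPMScheduler`, calling the pipeline returns the output of a single denoising step.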
@@ -150,12 +148,12 @@ Sometimes you can't load all the pipeline components weights from an official re
```python
from diffusers import DiffusionPipeline
-from transformers import CLIPFeatureExtractor, CLIPModel
+from transformers import CLIPImageProcessor, CLIPModel
model_id = "CompVis/stable-diffusion-v1-4"
clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
-feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id)
+feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(
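The hunk cuts off at the opening of `DiffusionPipeline.from_pretrained(`. A sketch of how that call is usually completed for this CLIP-guided example, continuing from the variables above; the `custom_pipeline` name and keyword wiring are assumptions based on the surrounding context, not lines from the commit:

```py
# assumes model_id, clip_model, feature_extractor and torch from the snippet above
pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",  # assumed community pipeline name
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    torch_dtype=torch.float16,
    use_safetensors=True,
)
```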
@@ -172,7 +170,7 @@ pipeline = DiffusionPipeline.from_pretrained(
The magic behind community pipelines is contained in the following code. It allows the community pipeline to be loaded from GitHub or the Hub, and it'll be available to all 🧨 Diffusers packages.
```python
-# 2. Load the pipeline class, if using custom module then load it from the hub
+# 2. Load the pipeline class, if using custom module then load it from the Hub
# if we load from explicit class, let's use it
if custom_pipeline is not None:
    pipeline_class = get_class_from_dynamic_module(
...

@@ -16,7 +16,7 @@ ControlNet is a type of model for controlling image diffusion models by conditio
<Tip>
-Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
+Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub.

@@ -35,7 +35,7 @@ Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
-#!pip install diffusers transformers accelerate safetensors opencv-python
+#!pip install -q diffusers transformers accelerate opencv-python
```
## Text-to-image
@@ -45,17 +45,16 @@ For text-to-image, you normally pass a text prompt to the model. But with Contro
Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image:
```py
-from diffusers import StableDiffusionControlNetPipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
-image = load_image(
+original_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
-image = np.array(image)
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200
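The pipeline-setup step between this hunk and the next is collapsed. A sketch of what it typically looks like for the canny example is below; the checkpoint names and scheduler choice are assumptions drawn from the rest of this page rather than committed lines:

```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

# canny-conditioned ControlNet paired with the SD 1.5 base model used throughout this page
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

# a faster scheduler and CPU offload keep memory use manageable
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
```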
@@ -98,6 +97,7 @@ Now pass your prompt and canny image to the pipeline:
output = pipe(
    "the mona lisa", image=canny_image
).images[0]
+make_image_grid([original_image, canny_image, output], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -117,12 +117,11 @@ import torch
import numpy as np
from transformers import pipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
-).resize((768, 768))
+)
def get_depth_map(image, depth_estimator):
    image = depth_estimator(image)["depth"]
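`get_depth_map` is truncated by the hunk boundary. A sketch of how the helper usually continues and is used follows; the exact post-processing steps and the default `depth-estimation` pipeline are assumptions, not part of the diff:

```py
import numpy as np
import torch
from transformers import pipeline

def get_depth_map(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    # replicate the single-channel depth estimate into three channels
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    # channels-first tensor that can be passed as control_image
    depth_map = detected_map.permute(2, 0, 1)
    return depth_map

depth_estimator = pipeline("depth-estimation")
depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
```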
@@ -158,6 +157,7 @@ Now pass your prompt, initial image, and depth map to the pipeline:
output = pipe(
    "lego batman and robin", image=image, control_image=depth_map,
).images[0]
+make_image_grid([image, output], rows=1, cols=2)
```
<div class="flex gap-4">

@@ -171,18 +171,14 @@ output = pipe(
</div>
</div>
## Inpainting
-For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline.
+For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area.
Load an initial image and a mask image:
```py
-from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
-from diffusers.utils import load_image
-import numpy as np
-import torch
+from diffusers.utils import load_image, make_image_grid
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"

@@ -193,11 +189,15 @@ mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
)
mask_image = mask_image.resize((512, 512))
+make_image_grid([init_image, mask_image], rows=1, cols=2)
```
Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold.
```py
+import numpy as np
+import torch
def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
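The function body is cut off here. A sketch of the usual remainder and of building `control_image` from it is below; the 0.5 threshold and tensor layout are assumptions consistent with the description above, not lines from the commit:

```py
import numpy as np
import torch

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[0:1] == image_mask.shape[0:1]
    # mark pixels whose mask value is over the threshold so they are treated as "to be filled"
    image[image_mask > 0.5] = -1.0
    # HWC -> NCHW float tensor
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)
```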
@@ -226,7 +226,6 @@ Load a ControlNet model conditioned on inpainting and pass it to the [`StableDif
```py
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
-import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(

@@ -248,6 +247,7 @@ output = pipe(
    mask_image=mask_image,
    control_image=control_image,
).images[0]
+make_image_grid([init_image, mask_image, output], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -270,14 +270,29 @@ Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.co
```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.utils import load_image, make_image_grid
+import numpy as np
import torch
+from PIL import Image
+import cv2
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
-pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to(
-    "cuda"
-)
+pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda")
+original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")
+image = np.array(original_image)
+low_threshold = 100
+high_threshold = 200
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
-image
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
<div class="flex gap-4">
@@ -293,22 +308,23 @@ image
## ControlNet with Stable Diffusion XL
-There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization!
+There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the [🤗 Diffusers Hub organization](https://huggingface.co/diffusers)!
Let's use a SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and prepare the canny image:
```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
+import torch
-image = load_image(
+original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)
-image = np.array(image)
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200

@@ -317,7 +333,7 @@ image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
-canny_image
+make_image_grid([original_image, canny_image], rows=1, cols=2)
```
<div class="flex gap-4">
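The step that loads the SDXL ControlNet pipeline sits between these hunks. A sketch of what it typically looks like is below; the canny SDXL ControlNet and fp16-fix VAE checkpoints are assumptions based on the rest of this section, not committed lines:

```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
import torch

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
# an fp16-friendly VAE avoids numerical issues when running the SDXL VAE in half precision
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()
```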
@@ -362,13 +378,13 @@ The [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = 'low quality, bad quality, sketches'
-images = pipe(
+image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.5,
).images[0]
-images
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -379,17 +395,16 @@ You can use [`StableDiffusionXLControlNetPipeline`] in guess mode as well by set
```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
import cv2
from PIL import Image
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"
-image = load_image(
+original_image = load_image(
    "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

@@ -402,15 +417,16 @@ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
)
pipe.enable_model_cpu_offload()
-image = np.array(image)
+image = np.array(original_image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
image = pipe(
-    prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
+    prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
).images[0]
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
```
### MultiControlNet
@@ -431,29 +447,30 @@ In this example, you'll combine a canny image and a human pose estimation image
Prepare the canny image conditioning:
```py
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
from PIL import Image
import numpy as np
import cv2
-canny_image = load_image(
+original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
-canny_image = np.array(canny_image)
+image = np.array(original_image)
low_threshold = 100
high_threshold = 200
-canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
+image = cv2.Canny(image, low_threshold, high_threshold)
# zero out middle columns of image where pose will be overlaid
-zero_start = canny_image.shape[1] // 4
-zero_end = zero_start + canny_image.shape[1] // 2
-canny_image[:, zero_start:zero_end] = 0
-canny_image = canny_image[:, :, None]
-canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
-canny_image = Image.fromarray(canny_image).resize((1024, 1024))
+zero_start = image.shape[1] // 4
+zero_end = zero_start + image.shape[1] // 2
+image[:, zero_start:zero_end] = 0
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+make_image_grid([original_image, canny_image], rows=1, cols=2)
```
<div class="flex gap-4">
@@ -467,18 +484,24 @@ canny_image = Image.fromarray(canny_image).resize((1024, 1024))
</div>
</div>
+For human pose estimation, install [controlnet_aux](https://github.com/patrickvonplaten/controlnet_aux):
+```py
+# uncomment to install the necessary library in Colab
+#!pip install -q controlnet-aux
+```
Prepare the human pose estimation conditioning:
```py
from controlnet_aux import OpenposeDetector
-from diffusers.utils import load_image
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
-openpose_image = load_image(
+original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
-openpose_image = openpose(openpose_image).resize((1024, 1024))
+openpose_image = openpose(original_image)
+make_image_grid([original_image, openpose_image], rows=1, cols=2)
```
<div class="flex gap-4">

@@ -500,7 +523,7 @@ import torch
controlnets = [
    ControlNetModel.from_pretrained(
-        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
+        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
@@ -523,7 +546,7 @@ negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
generator = torch.manual_seed(1)
-images = [openpose_image, canny_image]
+images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]
images = pipe(
    prompt,

@@ -533,7 +556,9 @@ images = pipe(
    negative_prompt=negative_prompt,
    num_images_per_prompt=3,
    controlnet_conditioning_scale=[1.0, 0.8],
-).images[0]
+).images
+make_image_grid([original_image, canny_image, openpose_image,
+                 images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)
```
<div class="flex justify-center">
...
@@ -25,6 +25,8 @@ Community pipelines allow you to get creative and build your own unique pipeline
To load a community pipeline, use the `custom_pipeline` argument in [`DiffusionPipeline`] to specify one of the files in [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community):
```py
+from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True
)

@@ -39,7 +41,6 @@ You can learn more about community pipelines in the how to [load community pipel
The multilingual Stable Diffusion pipeline uses a pretrained [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection) to identify a language and the [mBART-large-50](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) model to handle the translation. This allows you to generate images from text in 20 languages.
```py
-from PIL import Image
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import make_image_grid
@@ -59,15 +60,15 @@ language_detection_pipeline = pipeline("text-classification",
                                       device=device_dict[device])
# add model for language translation
-trans_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
-trans_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)
+translation_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
+translation_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)
diffuser_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="multilingual_stable_diffusion",
    detection_pipeline=language_detection_pipeline,
-    translation_model=trans_model,
-    translation_tokenizer=trans_tokenizer,
+    translation_model=translation_model,
+    translation_tokenizer=translation_tokenizer,
    torch_dtype=torch.float16,
)

@@ -80,8 +81,7 @@ prompt = ["a photograph of an astronaut riding a horse",
          "Un restaurant parisien"]
images = diffuser_pipeline(prompt).images
-grid = make_image_grid(images, rows=2, cols=2)
-grid
+make_image_grid(images, rows=2, cols=2)
```
<div class="flex justify-center">
@@ -94,23 +94,23 @@ grid
```py
from diffusers import DiffusionPipeline, DDIMScheduler
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="magic_mix",
-    scheduler = DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
+    scheduler=DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
).to('cuda')
img = load_image("https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg")
-mix_img = pipeline(img, prompt="bed", kmin = 0.3, kmax = 0.5, mix_factor = 0.5)
+mix_img = pipeline(img, prompt="bed", kmin=0.3, kmax=0.5, mix_factor=0.5)
-mix_img
+make_image_grid([img, mix_img], rows=1, cols=2)
```
<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg" />
-    <figcaption class="mt-2 text-center text-sm text-gray-500">image prompt</figcaption>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://user-images.githubusercontent.com/59410571/209578602-70f323fa-05b7-4dd6-b055-e40683e37914.jpg" />
...
@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
-#!pip install diffusers transformers accelerate safetensors
+#!pip install -q diffusers transformers accelerate
```
The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then:
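The pipeline-loading step sits between this hunk and the next, whose header shows `pipeline.enable_vae_slicing()` as trailing context. A sketch of that setup is below; the Stable Diffusion 2.1 checkpoint and the scheduler pairing are assumptions consistent with the surrounding docs, not committed lines:

```py
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    safety_checker=None,
    use_safetensors=True,
)
# DiffEdit needs a deterministic scheduler plus its inverse for the latent inversion step
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
```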
@@ -59,15 +59,18 @@ pipeline.enable_vae_slicing()
Load the image to edit:
```py
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
-raw_image
+raw_image = load_image(img_url).resize((768, 768))
```
Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:
```py
+from PIL import Image
source_prompt = "a bowl of fruits"
target_prompt = "a basket of pears"
mask_image = pipeline.generate_mask(

@@ -75,6 +78,7 @@ mask_image = pipeline.generate_mask(
    source_prompt=source_prompt,
    target_prompt=target_prompt,
)
+Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
```
Next, create the inverted latents and pass it a caption describing the image:
@@ -86,13 +90,14 @@ inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents
Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`:
```py
-image = pipeline(
+output_image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
).images[0]
-image.save("edited_image.png")
+mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
+make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
```
<div class="flex gap-4">

@@ -116,8 +121,8 @@ Load the Flan-T5 model and tokenizer from the 🤗 Transformers library:
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
-tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
-model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
+tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
+model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
```
Provide some initial text to prompt the model to generate the source and target prompts.
@@ -136,7 +141,7 @@ target_text = f"Provide a caption for images containing a {target_concept}. "
Next, create a utility function to generate the prompts:
```py
-@torch.no_grad
+@torch.no_grad()
def generate_prompts(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
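The utility function is cut off after the tokenization line. A sketch of how it typically continues and is called follows; the sampling settings are illustrative assumptions, not lines from the commit:

```py
@torch.no_grad()
def generate_prompts(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")

    # sample several candidate captions from Flan-T5 for the source/target concepts
    outputs = model.generate(
        input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

source_prompts = generate_prompts(source_text)
target_prompts = generate_prompts(target_text)
```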
@@ -193,33 +198,39 @@ Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_
```diff
from diffusers import DDIMInverseScheduler, DDIMScheduler
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
+from PIL import Image
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
+raw_image = load_image(img_url).resize((768, 768))
mask_image = pipeline.generate_mask(
    image=raw_image,
- source_prompt=source_prompt,
- target_prompt=target_prompt,
+ source_prompt_embeds=source_embeds,
+ target_prompt_embeds=target_embeds,
)
inv_latents = pipeline.invert(
- prompt=source_prompt,
+ prompt_embeds=source_embeds,
    image=raw_image,
).latents
-images = pipeline(
+output_image = pipeline(
    mask_image=mask_image,
    image_latents=inv_latents,
- prompt=target_prompt,
- negative_prompt=source_prompt,
+ prompt_embeds=target_embeds,
+ negative_prompt_embeds=source_embeds,
-).images
-images[0].save("edited_image.png")
+).images[0]
+mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L")
+make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
```
## Generate a caption for inversion
@@ -260,7 +271,7 @@ Load an input image and generate a caption for it using the `generate_caption` f
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
+raw_image = load_image(img_url).resize((768, 768))
caption = generate_caption(raw_image, model, processor)
```
...
@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
-#!pip install transformers accelerate safetensors
+#!pip install -q diffusers transformers accelerate
```
<Tip warning={true}>
@@ -58,6 +58,7 @@ Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate
```py
image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
```
<div class="flex justify-center">

@@ -83,6 +84,7 @@ Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipelin
```py
image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
```
<div class="flex justify-center">
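The prior step that produces the `image_embeds` and `negative_image_embeds` used in the two hunks above is collapsed in this diff. A sketch of that step for Kandinsky 2.1 is below; the checkpoint names and `guidance_scale=1.0` are assumptions drawn from the rest of this page, not committed lines:

```py
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipeline = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

# the prior maps text to CLIP image embeddings that the decoder pipeline consumes
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
```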
@@ -109,7 +111,8 @@ pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
-image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+image
```
</hfoption>

@@ -125,7 +128,8 @@ pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
-image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+image
```
</hfoption>

@@ -133,7 +137,7 @@ image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_
## Image-to-image
-For image-to-image, pass the initial image and text prompt to condition the image with to the pipeline. Start by loading the prior pipeline:
+For image-to-image, pass the initial image and text prompt to condition the image to the pipeline. Start by loading the prior pipeline:
<hfoptions id="image-to-image">
<hfoption id="Kandinsky 2.1">
@@ -163,14 +167,11 @@ pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kand
Download an image to condition on:
```py
-from PIL import Image
-import requests
-from io import BytesIO
+from diffusers.utils import load_image
# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image = load_image(url)
original_image = original_image.resize((768, 512))
```
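The prompt definitions and the prior call that produces the embeddings are collapsed between this hunk and the next. A sketch of that step, assuming the 2.1 prior pipeline loaded earlier in this section (these exact lines are not part of the commit):

```py
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

# the prior pipeline turns the prompts into image embeddings for the img2img decoder
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()
```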
@@ -193,7 +194,10 @@ Now pass the original image, and all the prompts and embeddings to the pipeline
<hfoption id="Kandinsky 2.1">
```py
-image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_emebds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+from diffusers.utils import make_image_grid
+image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">

@@ -204,7 +208,10 @@ image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image,
<hfoption id="Kandinsky 2.2">
```py
-image = pipeline(image=original_image, image_embeds=image_emebds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+from diffusers.utils import make_image_grid
+image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">
@@ -223,11 +230,8 @@ Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipeli
```py
from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
import torch
-import requests
-from io import BytesIO
-from PIL import Image
-import os
pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
pipeline.enable_model_cpu_offload()

@@ -236,12 +240,12 @@ prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image = load_image(url)
original_image.thumbnail((768, 768))
-image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
</hfoption>
@@ -249,11 +253,8 @@ image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
```py
from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
import torch
-import requests
-from io import BytesIO
-from PIL import Image
-import os
pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

@@ -262,12 +263,12 @@ prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image = load_image(url)
original_image.thumbnail((768, 768))
-image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
</hfoption>
@@ -277,7 +278,7 @@ image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0]
<Tip warning={true}>
-⚠️ The Kandinsky models uses ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:
+⚠️ The Kandinsky models use ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:
```py
# For PIL input

@@ -297,9 +298,10 @@ For inpainting, you'll need the original image, a mask of the area to replace in
```py
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
+from PIL import Image
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
@@ -310,9 +312,10 @@ pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandins
```py
from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
-from diffusers.utils import load_image
+from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
+from PIL import Image
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
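The image/mask preparation and the prior call are collapsed between these hunks. A sketch of that step follows, continuing from the imports and prior pipeline above; the cat image URL and rectangular mask mirror the combined-pipeline hunks further down the page, so treat this as an illustration rather than committed lines:

```py
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")

# white (1.0) marks the region to repaint: a band above the cat's head
mask = np.zeros((768, 768), dtype=np.float32)
mask[:250, 250:-250] = 1

prompt = "a hat"
# prior output (image_embeds / negative_image_embeds) is later unpacked with **prior_output
prior_output = prior_pipeline(prompt)
```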
@@ -343,7 +346,9 @@ Now pass the initial image, mask, and prompt and embeddings to the pipeline to g
<hfoption id="Kandinsky 2.1">
```py
-image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
<div class="flex justify-center">

@@ -354,7 +359,9 @@ image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, heig
<hfoption id="Kandinsky 2.2">
```py
-image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
<div class="flex justify-center">
@@ -371,14 +378,23 @@ You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`Kandi
```py
import torch
+import numpy as np
+from PIL import Image
from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
prompt = "a hat"
-image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
</hfoption>
@@ -386,14 +402,23 @@ image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
```py
import torch
+import numpy as np
+from PIL import Image
from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
prompt = "a hat"
-image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+output_image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
</hfoption>
@@ -408,13 +433,13 @@ Interpolation allows you to explore the latent space between the image and text
```py
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
-from diffusers.utils import load_image
-import PIL
+from diffusers.utils import load_image, make_image_grid
import torch
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
+make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
```
</hfoption>

@@ -422,13 +447,13 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser
```py
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
-from diffusers.utils import load_image
-import PIL
+from diffusers.utils import load_image, make_image_grid
import torch
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
+make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
```
</hfoption>
...@@ -448,7 +473,7 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser ...@@ -448,7 +473,7 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser
Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation! Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation!
```py ```py
images_texts = ["a cat", img1, img2] images_texts = ["a cat", img_1, img_2]
weights = [0.3, 0.3, 0.4] weights = [0.3, 0.3, 0.4]
``` ```
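A minimal sketch of the next step (assuming the Kandinsky 2.1 `prior_pipeline` loaded in the tab above; the 2.2 flow is analogous with [`KandinskyV22Pipeline`]) is to call the prior's `interpolate` method and decode the mixed embeddings:

```py
from diffusers import KandinskyPipeline

# the prior mixes the image/text embeddings according to the weights
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# an empty prompt works because the interpolated embeddings already carry the content
image = pipeline("", **prior_out, height=768, width=768).images[0]
image
```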
...@@ -511,6 +536,7 @@ from diffusers.utils import load_image ...@@ -511,6 +536,7 @@ from diffusers.utils import load_image
img = load_image( img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768)) ).resize((768, 768))
img
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -524,8 +550,6 @@ import torch ...@@ -524,8 +550,6 @@ import torch
import numpy as np import numpy as np
from transformers import pipeline from transformers import pipeline
from diffusers.utils import load_image
def make_hint(image, depth_estimator): def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"] image = depth_estimator(image)["depth"]
...@@ -536,7 +560,6 @@ def make_hint(image, depth_estimator): ...@@ -536,7 +560,6 @@ def make_hint(image, depth_estimator):
hint = detected_map.permute(2, 0, 1) hint = detected_map.permute(2, 0, 1)
return hint return hint
depth_estimator = pipeline("depth-estimation") depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
``` ```
...@@ -550,10 +573,10 @@ from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline ...@@ -550,10 +573,10 @@ from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained( prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
)to("cuda") ).to("cuda")
pipeline = KandinskyV22ControlnetPipeline.from_pretrained( pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16, use_safetensors=True "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda") ).to("cuda")
``` ```
...@@ -561,11 +584,11 @@ Generate the image embeddings from a prompt and negative prompt: ...@@ -561,11 +584,11 @@ Generate the image embeddings from a prompt and negative prompt:
```py ```py
prompt = "A robot, 4k photo" prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43) generator = torch.Generator(device="cuda").manual_seed(43)
image_emb, zero_image_emb = pipe_prior(
image_emb, zero_image_emb = prior_pipeline(
prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple() ).to_tuple()
``` ```
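As a rough sketch of the decode step (assuming the `pipeline`, `hint`, and embeddings defined above), pass the image embeddings and the depth hint to the [`KandinskyV22ControlnetPipeline`]:

```py
image = pipeline(
    image_embeds=image_emb,
    negative_image_embeds=zero_image_emb,
    hint=hint,
    num_inference_steps=50,
    generator=generator,
    height=768,
    width=768,
).images[0]
image
```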
...@@ -599,10 +622,9 @@ from diffusers.utils import load_image ...@@ -599,10 +622,9 @@ from diffusers.utils import load_image
from transformers import pipeline from transformers import pipeline
img = load_image( img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png" "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768)) ).resize((768, 768))
def make_hint(image, depth_estimator): def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"] image = depth_estimator(image)["depth"]
image = np.array(image) image = np.array(image)
...@@ -612,7 +634,6 @@ def make_hint(image, depth_estimator): ...@@ -612,7 +634,6 @@ def make_hint(image, depth_estimator):
hint = detected_map.permute(2, 0, 1) hint = detected_map.permute(2, 0, 1)
return hint return hint
depth_estimator = pipeline("depth-estimation") depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
``` ```
...@@ -637,15 +658,15 @@ negative_prior_prompt = "lowres, text, error, cropped, worst quality, low qualit ...@@ -637,15 +658,15 @@ negative_prior_prompt = "lowres, text, error, cropped, worst quality, low qualit
generator = torch.Generator(device="cuda").manual_seed(43) generator = torch.Generator(device="cuda").manual_seed(43)
img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator) img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
``` ```
Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings: Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings:
```py ```py
image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
image make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -656,7 +677,7 @@ image ...@@ -656,7 +677,7 @@ image
Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference. Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference.
1. Enable [xFormers](https://moon-ci-docs.huggingface.co/optimization/xformers) if you're using PyTorch < 2.0: 1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0:
```diff ```diff
from diffusers import DiffusionPipeline from diffusers import DiffusionPipeline
...@@ -666,14 +687,11 @@ Kandinsky is unique because it requires a prior pipeline to generate the mapping ...@@ -666,14 +687,11 @@ Kandinsky is unique because it requires a prior pipeline to generate the mapping
+ pipe.enable_xformers_memory_efficient_attention() + pipe.enable_xformers_memory_efficient_attention()
``` ```
2. Enable `torch.compile` if you're using PyTorch 2.0 to automatically use scaled dot-product attention (SDPA): 2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):
```diff ```diff
pipe.unet.to(memory_format=torch.channels_last) pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduced-overhead", fullgraph=True) + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_xformers_memory_efficient_attention()
``` ```
This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]: This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]:
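A minimal sketch of what that looks like (assuming `pipe` is the Kandinsky pipeline from the previous step):

```py
from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

# explicitly switch the UNet to the SDPA-based attention processor
pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
```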
...@@ -697,7 +715,8 @@ pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0()) ...@@ -697,7 +715,8 @@ pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality: 4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality:
```py ```py
from diffusers import DDPMSCheduler from diffusers import DDPMScheduler
from diffusers import DiffusionPipeline
scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler") scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda") pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
......
...@@ -55,7 +55,7 @@ But if you need to reliably generate the same image, that'll depend on whether y ...@@ -55,7 +55,7 @@ But if you need to reliably generate the same image, that'll depend on whether y
### CPU ### CPU
To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.randn.html) and set a seed: To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed:
```python ```python
import torch import torch
...@@ -83,7 +83,7 @@ If you run this code example on your specific hardware and PyTorch version, you ...@@ -83,7 +83,7 @@ If you run this code example on your specific hardware and PyTorch version, you
💡 It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of 💡 It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of
just integer values representing the seed, but this is the recommended design when dealing with just integer values representing the seed, but this is the recommended design when dealing with
probabilistic models in PyTorch as `Generator`'s are *random states* that can be probabilistic models in PyTorch, as `Generator`s are *random states* that can be
passed to multiple pipelines in a sequence. passed to multiple pipelines in a sequence.
</Tip> </Tip>
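As a small illustration of that design (a sketch using the same `google/ddpm-cifar10-32` checkpoint as this guide), a single `Generator` can be threaded through consecutive calls:

```py
import torch
from diffusers import DDIMPipeline

ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")

# one random state, reused across a sequence of calls
generator = torch.Generator(device="cpu").manual_seed(0)
first = ddim(num_inference_steps=2, output_type="np", generator=generator).images
second = ddim(num_inference_steps=2, output_type="np", generator=generator).images

# the generator advanced between the two calls, so `first` and `second` differ;
# re-seeding with manual_seed(0) before a call reproduces its result exactly
```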
...@@ -159,6 +159,7 @@ PyTorch typically benchmarks multiple algorithms to select the fastest one, but ...@@ -159,6 +159,7 @@ PyTorch typically benchmarks multiple algorithms to select the fastest one, but
```py ```py
import os import os
import torch
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8" os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
...@@ -171,7 +172,6 @@ Now when you run the same pipeline twice, you'll get identical results. ...@@ -171,7 +172,6 @@ Now when you run the same pipeline twice, you'll get identical results.
```py ```py
import torch import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline from diffusers import DDIMScheduler, StableDiffusionPipeline
import numpy as np
model_id = "runwayml/stable-diffusion-v1-5" model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda") pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda")
...@@ -186,6 +186,6 @@ result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type=" ...@@ -186,6 +186,6 @@ result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="
g.manual_seed(0) g.manual_seed(0)
result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images
print("L_inf dist = ", abs(result1 - result2).max()) print("L_inf dist =", abs(result1 - result2).max())
"L_inf dist = tensor(0., device='cuda:0')" "L_inf dist = tensor(0., device='cuda:0')"
``` ```
...@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed: ...@@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed:
```py ```py
# uncomment to install the necessary libraries in Colab # uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 #!pip install -q diffusers transformers accelerate omegaconf invisible-watermark>=0.2.0
``` ```
<Tip warning={true}> <Tip warning={true}>
...@@ -84,7 +84,8 @@ pipeline_text2image = AutoPipelineForText2Image.from_pretrained( ...@@ -84,7 +84,8 @@ pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
).to("cuda") ).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt).images[0] image = pipeline_text2image(prompt=prompt).images[0]
image
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -96,16 +97,17 @@ image = pipeline(prompt=prompt).images[0] ...@@ -96,16 +97,17 @@ image = pipeline(prompt=prompt).images[0]
For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with:
```py ```py
from diffusers import AutoPipelineForImg2Img from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image from diffusers.utils import load_image, make_image_grid
# use from_pipe to avoid consuming additional memory when loading a checkpoint # use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png"
init_image = load_image(url).convert("RGB") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle" prompt = "a dog catching a frisbee in the jungle"
image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -118,7 +120,7 @@ For inpainting, you'll need the original image and a mask of what you want to re ...@@ -118,7 +120,7 @@ For inpainting, you'll need the original image and a mask of what you want to re
```py ```py
from diffusers import AutoPipelineForInpainting from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image from diffusers.utils import load_image, make_image_grid
# use from_pipe to avoid consuming additional memory when loading a checkpoint # use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")
...@@ -126,11 +128,12 @@ pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") ...@@ -126,11 +128,12 @@ pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")
img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png"
init_image = load_image(img_url).convert("RGB") init_image = load_image(img_url)
mask_image = load_image(mask_url).convert("RGB") mask_image = load_image(mask_url)
prompt = "A deep sea diver floating" prompt = "A deep sea diver floating"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -141,12 +144,12 @@ image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strengt ...@@ -141,12 +144,12 @@ image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strengt
SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner:
1. use the base and refiner model together to produce a refined image 1. use the base and refiner models together to produce a refined image
2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL is originally trained) 2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained)
### Base + refiner model ### Base + refiner model
When you use the base and refiner model together to generate an image, this is known as an ([*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/)). The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise.
As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model:
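A minimal sketch of that loading step (the same pattern as the base-to-refiner approach later in this guide; the refiner reuses the base model's second text encoder and VAE to save memory):

```py
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")
```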
...@@ -193,12 +196,13 @@ image = refiner( ...@@ -193,12 +196,13 @@ image = refiner(
denoising_start=0.8, denoising_start=0.8,
image=image, image=image,
).images[0] ).images[0]
image
``` ```
<div class="flex gap-4"> <div class="flex gap-4">
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png" alt="generated image of a lion on a rock at night" /> <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png" alt="generated image of a lion on a rock at night" />
<figcaption class="mt-2 text-center text-sm text-gray-500">base model</figcaption> <figcaption class="mt-2 text-center text-sm text-gray-500">default base model</figcaption>
</div> </div>
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png" alt="generated image of a lion on a rock at night in higher quality" /> <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png" alt="generated image of a lion on a rock at night in higher quality" />
...@@ -210,7 +214,8 @@ The refiner model can also be used for inpainting in the [`StableDiffusionXLInpa ...@@ -210,7 +214,8 @@ The refiner model can also be used for inpainting in the [`StableDiffusionXLInpa
```py ```py
from diffusers import StableDiffusionXLInpaintPipeline from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image from diffusers.utils import load_image, make_image_grid
import torch
base = StableDiffusionXLInpaintPipeline.from_pretrained( base = StableDiffusionXLInpaintPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
...@@ -218,8 +223,8 @@ base = StableDiffusionXLInpaintPipeline.from_pretrained( ...@@ -218,8 +223,8 @@ base = StableDiffusionXLInpaintPipeline.from_pretrained(
refiner = StableDiffusionXLInpaintPipeline.from_pretrained( refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0", "stabilityai/stable-diffusion-xl-refiner-1.0",
text_encoder_2=pipe.text_encoder_2, text_encoder_2=base.text_encoder_2,
vae=pipe.vae, vae=base.vae,
torch_dtype=torch.float16, torch_dtype=torch.float16,
use_safetensors=True, use_safetensors=True,
variant="fp16", variant="fp16",
...@@ -228,8 +233,8 @@ refiner = StableDiffusionXLInpaintPipeline.from_pretrained( ...@@ -228,8 +233,8 @@ refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB") init_image = load_image(img_url)
mask_image = load_image(mask_url).convert("RGB") mask_image = load_image(mask_url)
prompt = "A majestic tiger sitting on a bench" prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75 num_inference_steps = 75
...@@ -250,6 +255,7 @@ image = refiner( ...@@ -250,6 +255,7 @@ image = refiner(
num_inference_steps=num_inference_steps, num_inference_steps=num_inference_steps,
denoising_start=high_noise_frac, denoising_start=high_noise_frac,
).images[0] ).images[0]
make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3)
``` ```
This ensemble of expert denoisers method works well for all available schedulers! This ensemble of expert denoisers method works well for all available schedulers!
...@@ -270,8 +276,8 @@ base = DiffusionPipeline.from_pretrained( ...@@ -270,8 +276,8 @@ base = DiffusionPipeline.from_pretrained(
refiner = DiffusionPipeline.from_pretrained( refiner = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0", "stabilityai/stable-diffusion-xl-refiner-1.0",
text_encoder_2=pipe.text_encoder_2, text_encoder_2=base.text_encoder_2,
vae=pipe.vae, vae=base.vae,
torch_dtype=torch.float16, torch_dtype=torch.float16,
use_safetensors=True, use_safetensors=True,
variant="fp16", variant="fp16",
...@@ -303,7 +309,7 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0] ...@@ -303,7 +309,7 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
</div> </div>
</div> </div>
For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner.
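A rough sketch of that flow, reusing `base`, `refiner`, `init_image`, and `mask_image` from the ensemble example above (the step counts and `strength` value are illustrative, not the documented defaults):

```py
image = base(
    prompt="A majestic tiger sitting on a bench",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=75,
).images[0]

# refine the inpainted region with fewer steps; a low `strength` keeps the refiner close to the base output
image = refiner(
    prompt="A majestic tiger sitting on a bench",
    image=image,
    mask_image=mask_image,
    num_inference_steps=30,
    strength=0.3,
).images[0]
```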
## Micro-conditioning ## Micro-conditioning
...@@ -343,7 +349,7 @@ image = pipe( ...@@ -343,7 +349,7 @@ image = pipe(
<div class="flex flex-col justify-center"> <div class="flex flex-col justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png"/> <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png"/>
<figcaption class="text-center">Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).</figcaption> <figcaption class="text-center">Images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).</figcaption>
</div> </div>
### Crop conditioning ### Crop conditioning
...@@ -354,13 +360,13 @@ Images generated by previous Stable Diffusion models may sometimes appear to be ...@@ -354,13 +360,13 @@ Images generated by previous Stable Diffusion models may sometimes appear to be
from diffusers import StableDiffusionXLPipeline from diffusers import StableDiffusionXLPipeline
import torch import torch
pipeline = StableDiffusionXLPipeline.from_pretrained( pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda") ).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0]
image
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
...@@ -384,11 +390,12 @@ image = pipe( ...@@ -384,11 +390,12 @@ image = pipe(
negative_crops_coords_top_left=(0, 0), negative_crops_coords_top_left=(0, 0),
negative_target_size=(1024, 1024), negative_target_size=(1024, 1024),
).images[0] ).images[0]
image
``` ```
## Use a different prompt for each text-encoder ## Use a different prompt for each text-encoder
SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompts): SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts):
```py ```py
from diffusers import StableDiffusionXLPipeline from diffusers import StableDiffusionXLPipeline
...@@ -403,13 +410,14 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" ...@@ -403,13 +410,14 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# prompt_2 is passed to OpenCLIP-ViT/bigG-14 # prompt_2 is passed to OpenCLIP-ViT/bigG-14
prompt_2 = "Van Gogh painting" prompt_2 = "Van Gogh painting"
image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
image
``` ```
<div class="flex justify-center"> <div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-double-prompt.png" alt="generated image of an astronaut in a jungle in the style of a van gogh painting"/> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-double-prompt.png" alt="generated image of an astronaut in a jungle in the style of a van gogh painting"/>
</div> </div>
The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl] section. The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section.
## Optimizations ## Optimizations
...@@ -420,18 +428,18 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y ...@@ -420,18 +428,18 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y
```diff ```diff
- base.to("cuda") - base.to("cuda")
- refiner.to("cuda") - refiner.to("cuda")
+ base.enable_model_cpu_offload + base.enable_model_cpu_offload()
+ refiner.enable_model_cpu_offload + refiner.enable_model_cpu_offload()
``` ```
2. Use `torch.compile` for ~20% speed-up (you need `torch>2.0`): 2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`):
```diff ```diff
+ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) + base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) + refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
``` ```
3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`: 3. Enable [xFormers](../optimization/xformers) to run SDXL if `torch<2.0`:
```diff ```diff
+ base.enable_xformers_memory_efficient_attention() + base.enable_xformers_memory_efficient_attention()
......
...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. ...@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps: Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
1. a encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset 1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset
2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications 2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications
This guide will show you how to use Shap-E to start generating your own 3D assets! This guide will show you how to use Shap-E to start generating your own 3D assets!
...@@ -25,7 +25,7 @@ Before you begin, make sure you have the following libraries installed: ...@@ -25,7 +25,7 @@ Before you begin, make sure you have the following libraries installed:
```py ```py
# uncomment to install the necessary libraries in Colab # uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors trimesh #!pip install -q diffusers transformers accelerate trimesh
``` ```
## Text-to-3D ## Text-to-3D
...@@ -38,7 +38,7 @@ from diffusers import ShapEPipeline ...@@ -38,7 +38,7 @@ from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(device) pipe = pipe.to(device)
guidance_scale = 15.0 guidance_scale = 15.0
...@@ -64,11 +64,11 @@ export_to_gif(images[1], "cake_3d.gif") ...@@ -64,11 +64,11 @@ export_to_gif(images[1], "cake_3d.gif")
<div class="flex gap-4"> <div class="flex gap-4">
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/> <img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">firecracker</figcaption> <figcaption class="mt-2 text-center text-sm text-gray-500">prompt = "A firecracker"</figcaption>
</div> </div>
<div> <div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/> <img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">cupcake</figcaption> <figcaption class="mt-2 text-center text-sm text-gray-500">prompt = "A birthday cupcake"</figcaption>
</div> </div>
</div> </div>
...@@ -99,6 +99,7 @@ Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D represent ...@@ -99,6 +99,7 @@ Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D represent
```py ```py
from PIL import Image from PIL import Image
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif from diffusers.utils import export_to_gif
pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda") pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda")
...@@ -139,7 +140,7 @@ from diffusers import ShapEPipeline ...@@ -139,7 +140,7 @@ from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(device) pipe = pipe.to(device)
guidance_scale = 15.0 guidance_scale = 15.0
...@@ -160,7 +161,7 @@ You can optionally save the mesh output as an `obj` file with the [`~utils.expor ...@@ -160,7 +161,7 @@ You can optionally save the mesh output as an `obj` file with the [`~utils.expor
from diffusers.utils import export_to_ply from diffusers.utils import export_to_ply
ply_path = export_to_ply(images[0], "3d_cake.ply") ply_path = export_to_ply(images[0], "3d_cake.ply")
print(f"saved to folder: {ply_path}") print(f"Saved to folder: {ply_path}")
``` ```
Then you can convert the `ply` file to a `glb` file with the trimesh library: Then you can convert the `ply` file to a `glb` file with the trimesh library:
...@@ -169,7 +170,7 @@ Then you can convert the `ply` file to a `glb` file with the trimesh library: ...@@ -169,7 +170,7 @@ Then you can convert the `ply` file to a `glb` file with the trimesh library:
import trimesh import trimesh
mesh = trimesh.load("3d_cake.ply") mesh = trimesh.load("3d_cake.ply")
mesh.export("3d_cake.glb", file_type="glb") mesh_export = mesh.export("3d_cake.glb", file_type="glb")
``` ```
By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform: By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform:
...@@ -181,7 +182,7 @@ import numpy as np ...@@ -181,7 +182,7 @@ import numpy as np
mesh = trimesh.load("3d_cake.ply") mesh = trimesh.load("3d_cake.ply")
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0]) rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot) mesh = mesh.apply_transform(rot)
mesh.export("3d_cake.glb", file_type="glb") mesh_export = mesh.export("3d_cake.glb", file_type="glb")
``` ```
Upload the mesh file to your dataset repository to visualize it with the Dataset viewer! Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
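One way to do the upload with `huggingface_hub` (the repository id below is a placeholder and assumes you are already logged in):

```py
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="3d_cake.glb",
    path_in_repo="3d_cake.glb",
    repo_id="your-username/3d-assets",  # placeholder dataset repo
    repo_type="dataset",
)
```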
......