Unverified Commit 53a8439f authored by M. Tolga Cangöz, committed by GitHub

[`Docs`] Fix typos and update files at Optimization Page (#5674)



* Fix typos, update, trim trailing whitespace

* Trim trailing whitespaces

* Update docs/source/en/optimization/memory.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/optimization/memory.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update _toctree.yml

* Update adapt_a_model.md

* Reverse

* Reverse

* Reverse

* Update dreambooth.md

* Update instructpix2pix.md

* Update lora.md

* Update overview.md

* Update t2i_adapters.md

* Update text2image.md

* Update text_inversion.md

* Update create_dataset.md

* Update create_dataset.md

* Update create_dataset.md

* Update create_dataset.md

* Update coreml.md

* Delete docs/source/en/training/create_dataset.md

* Original create_dataset.md

* Update create_dataset.md

* Delete docs/source/en/training/create_dataset.md

* Add original file

* Delete docs/source/en/training/create_dataset.md

* Add original one

* Delete docs/source/en/training/text2image.md

* Delete docs/source/en/training/instructpix2pix.md

* Delete docs/source/en/training/dreambooth.md

* Add original files

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@@ -135,7 +135,7 @@
 - local: optimization/memory
   title: Reduce memory usage
 - local: optimization/torch2.0
-  title: Torch 2.0
+  title: PyTorch 2.0
 - local: optimization/xformers
   title: xFormers
 - local: optimization/tome
@@ -31,7 +31,7 @@ Thankfully, Apple engineers developed [a conversion tool](https://github.com/app
 Before you convert a model, though, take a moment to explore the Hugging Face Hub – chances are the model you're interested in is already available in Core ML format:
 - the [Apple](https://huggingface.co/apple) organization includes Stable Diffusion versions 1.4, 1.5, 2.0 base, and 2.1 base
-- [coreml](https://huggingface.co/coreml) organization includes custom DreamBoothed and finetuned models
+- [coreml community](https://huggingface.co/coreml-community) includes custom finetuned models
 - use this [filter](https://huggingface.co/models?pipeline_tag=text-to-image&library=coreml&p=2&sort=likes) to return all available Core ML checkpoints
 If you can't find the model you're interested in, we recommend you follow the instructions for [Converting Models to Core ML](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) by Apple.
@@ -90,7 +90,6 @@ snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path,
 print(f"Model downloaded at {model_path}")
 ```
 ### Inference[[python-inference]]
 Once you have downloaded a snapshot of the model, you can test it using Apple's Python script.
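Note for readers following along: the hunk above only shows the tail of the download snippet. A minimal sketch of the full step, assuming the `apple/coreml-stable-diffusion-v1-4` checkpoint and the `original/packages` variant folder (both are illustrative choices, swap in the repo and variant you actually want):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

repo_id = "apple/coreml-stable-diffusion-v1-4"  # assumed checkpoint; any Core ML repo on the Hub works
variant = "original/packages"                   # assumed variant folder layout inside the repo

# download only the files for the chosen variant into a local folder
model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path)
print(f"Model downloaded at {model_path}")
```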
@@ -99,7 +98,7 @@ Once you have downloaded a snapshot of the model, you can test it using Apple's
 python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_original_packages -o </path/to/output/image> --compute-unit CPU_AND_GPU --seed 93
 ```
-`<output-mlpackages-directory>` should point to the checkpoint you downloaded in the step above, and `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an optional output path, and a seed for reproducibility.
+Pass the path of the downloaded checkpoint with `-i` flag to the script. `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an optional output path, and a seed for reproducibility.
 The inference script assumes you're using the original version of the Stable Diffusion model, `CompVis/stable-diffusion-v1-4`. If you use another model, you *have* to specify its Hub id in the inference command line, using the `--model-version` option. This works for models already supported and custom models you trained or fine-tuned yourself.
@@ -109,7 +108,7 @@ For example, if you want to use [`runwayml/stable-diffusion-v1-5`](https://huggi
 python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version runwayml/stable-diffusion-v1-5
 ```
 ## Core ML inference in Swift
 Running inference in Swift is slightly faster than in Python because the models are already compiled in the `mlmodelc` format. This is noticeable on app startup when the model is loaded but shouldn’t be noticeable if you run several generations afterward.
@@ -149,7 +147,6 @@ You have to specify in `--resource-path` one of the checkpoints downloaded in th
 For more details, please refer to the [instructions in Apple's repo](https://github.com/apple/ml-stable-diffusion).
 ## Supported Diffusers Features
 The Core ML models and inference code don't support many of the features, options, and flexibility of 🧨 Diffusers. These are some of the limitations to keep in mind:
@@ -160,8 +157,8 @@ The Core ML models and inference code don't support many of the features, option
 Apple's [conversion and inference repo](https://github.com/apple/ml-stable-diffusion) and our own [swift-coreml-diffusers](https://github.com/huggingface/swift-coreml-diffusers) repos are intended as technology demonstrators to enable other developers to build upon.
-If you feel strongly about any missing features, please feel free to open a feature request or, better yet, a contribution PR :)
+If you feel strongly about any missing features, please feel free to open a feature request or, better yet, a contribution PR 🙂.
 ## Native Diffusers Swift app
-One easy way to run Stable Diffusion on your own Apple hardware is to use [our open-source Swift repo](https://github.com/huggingface/swift-coreml-diffusers), based on `diffusers` and Apple's conversion and inference repo. You can study the code, compile it with [Xcode](https://developer.apple.com/xcode/) and adapt it for your own needs. For your convenience, there's also a [standalone Mac app in the App Store](https://apps.apple.com/app/diffusers/id1666309574), so you can play with it without having to deal with the code or IDE. If you are a developer and have determined that Core ML is the best solution to build your Stable Diffusion app, then you can use the rest of this guide to get started with your project. We can't wait to see what you'll build :)
+One easy way to run Stable Diffusion on your own Apple hardware is to use [our open-source Swift repo](https://github.com/huggingface/swift-coreml-diffusers), based on `diffusers` and Apple's conversion and inference repo. You can study the code, compile it with [Xcode](https://developer.apple.com/xcode/) and adapt it for your own needs. For your convenience, there's also a [standalone Mac app in the App Store](https://apps.apple.com/app/diffusers/id1666309574), so you can play with it without having to deal with the code or IDE. If you are a developer and have determined that Core ML is the best solution to build your Stable Diffusion app, then you can use the rest of this guide to get started with your project. We can't wait to see what you'll build 🙂.
@@ -55,8 +55,7 @@ outputs = pipeline(
 )
 ```
-For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.
+For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official GitHub repository.
 ## Benchmark
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
 # Reduce memory usage
 A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage.
@@ -18,10 +30,9 @@ The results below are obtained from generating a single 512x512 image from the p
 | traced UNet | 3.21s | x2.96 |
 | memory-efficient attention | 2.63s | x3.61 |
 ## Sliced VAE
-Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use.
+Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to reduce memory use further if you have xFormers installed.
 To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference:
@@ -38,6 +49,7 @@ pipe = pipe.to("cuda")
 prompt = "a photo of an astronaut riding a horse on mars"
 pipe.enable_vae_slicing()
+#pipe.enable_xformers_memory_efficient_attention()
 images = pipe([prompt] * 32).images
 ```
@@ -45,7 +57,7 @@ You may see a small performance boost in VAE decoding on multi-image batches, an
 ## Tiled VAE
-Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. You should also used tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use.
+Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. You should also use tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to reduce memory use further if you have xFormers installed.
 To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference:
@@ -62,7 +74,7 @@ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
 pipe = pipe.to("cuda")
 prompt = "a beautiful landscape photograph"
 pipe.enable_vae_tiling()
-pipe.enable_xformers_memory_efficient_attention()
+#pipe.enable_xformers_memory_efficient_attention()
 image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]
 ```
@@ -98,24 +110,6 @@ Consider using [model offloading](#model-offloading) if you want to optimize for
 </Tip>
-CPU offloading can also be chained with attention slicing to reduce memory consumption to less than 2GB.
-```Python
-import torch
-from diffusers import StableDiffusionPipeline
-pipe = StableDiffusionPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5",
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-)
-prompt = "a photo of an astronaut riding a horse on mars"
-pipe.enable_sequential_cpu_offload()
-image = pipe(prompt).images[0]
-```
 <Tip warning={true}>
 When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information).
@@ -156,28 +150,9 @@ pipe.enable_model_cpu_offload()
 image = pipe(prompt).images[0]
 ```
-Model offloading can also be combined with attention slicing for additional memory savings.
-```Python
-import torch
-from diffusers import StableDiffusionPipeline
-pipe = StableDiffusionPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5",
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-)
-prompt = "a photo of an astronaut riding a horse on mars"
-pipe.enable_model_cpu_offload()
-image = pipe(prompt).images[0]
-```
 <Tip warning={true}>
-In order to properly offload models after they're called, it is required to run the entire pipeline and models are called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module)
-for more information.
+In order to properly offload models after they're called, it is required to run the entire pipeline and models are called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more information.
 [`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline.
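Editor's note: the removed snippets above advertised chaining offloading with attention slicing but never actually called `enable_attention_slicing`. For reference only (not part of this commit), a minimal sketch of what that combination looks like, assuming 🤗 Accelerate is installed so `enable_model_cpu_offload` is available:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

pipe.enable_model_cpu_offload()  # move each submodel to the GPU only while it is needed
pipe.enable_attention_slicing()  # compute attention in slices to lower peak memory

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```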
@@ -303,7 +278,7 @@ unet_traced = torch.jit.load("unet_traced.pt")
 class TracedUNet(torch.nn.Module):
     def __init__(self):
         super().__init__()
-        self.in_channels = pipe.unet.in_channels
+        self.in_channels = pipe.unet.config.in_channels
         self.device = pipe.unet.device
     def forward(self, latent_model_input, t, encoder_hidden_states):
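The hunk above shows only a fragment of the traced-UNet wrapper. For context, a sketch of how the full wrapper fits together, assuming `unet_traced.pt` was saved earlier with `torch.jit.trace` and that `pipe` is the fp16 Stable Diffusion pipeline the UNet was traced from:

```python
import torch
from dataclasses import dataclass
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# the traced UNet produced beforehand with torch.jit.trace(...).save("unet_traced.pt")
unet_traced = torch.jit.load("unet_traced.pt")

@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor

class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # the pipeline reads these attributes from whatever it uses as a UNet
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

# swap the original UNet for the traced one
pipe.unet = TracedUNet()
```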
@@ -319,7 +294,7 @@ with torch.inference_mode():
 ## Memory-efficient attention
-Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/pdf/2205.14135.pdf) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)).
+Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/abs/2205.14135) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)).
 <Tip>
@@ -354,4 +329,4 @@ with torch.inference_mode():
 # pipe.disable_xformers_memory_efficient_attention()
 ```
-The iteration speed when using `xformers` should match the iteration speed of Torch 2.0 as described [here](torch2.0).
+The iteration speed when using `xformers` should match the iteration speed of PyTorch 2.0 as described [here](torch2.0).
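Since several hunks in this file toggle `enable_xformers_memory_efficient_attention`, a minimal usage sketch for reference (it assumes `pip install xformers` and a CUDA GPU):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
# pipe.disable_xformers_memory_efficient_attention()  # switch back to the default attention processor
```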
@@ -31,6 +31,8 @@ pipe = pipe.to("mps")
 pipe.enable_attention_slicing()
 prompt = "a photo of an astronaut riding a horse on mars"
+image = pipe(prompt).images[0]
+image
 ```
 <Tip warning={true}>
@@ -48,10 +50,10 @@ If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an addit
 pipe.enable_attention_slicing()
 prompt = "a photo of an astronaut riding a horse on mars"
 # First-time "warmup" pass if PyTorch version is 1.13
 + _ = pipe(prompt, num_inference_steps=1)
 # Results match those from the CPU device after the warmup pass.
 image = pipe(prompt).images[0]
 ```
@@ -63,6 +65,7 @@ To prevent this from happening, we recommend *attention slicing* to reduce memor
 ```py
 from diffusers import DiffusionPipeline
+import torch
 pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("mps")
 pipeline.enable_attention_slicing()
@@ -10,13 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
 # ONNX Runtime
 🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install 🤗 Optimum with the following command for ONNX Runtime support:
 ```bash
-pip install optimum["onnxruntime"]
+pip install -q optimum["onnxruntime"]
 ```
 This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime.
@@ -10,14 +10,13 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
 # OpenVINO
-🤗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list]((https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html)) of supported devices).
+🤗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) of supported devices).
 You'll need to install 🤗 Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version:
-```
+```bash
 pip install --upgrade-strategy eager optimum["openvino"]
 ```
@@ -14,18 +14,25 @@ specific language governing permissions and limitations under the License.
 [Token merging](https://huggingface.co/papers/2303.17604) (ToMe) merges redundant tokens/patches progressively in the forward pass of a Transformer-based network which can speed-up the inference latency of [`StableDiffusionPipeline`].
+Install ToMe from `pip`:
+```bash
+pip install tomesd
+```
 You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function:
 ```diff
 from diffusers import StableDiffusionPipeline
-import tomesd
+import torch
+import tomesd
 pipeline = StableDiffusionPipeline.from_pretrained(
     "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
 ).to("cuda")
 + tomesd.apply_patch(pipeline, ratio=0.5)
 image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
 ```
 The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass.
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
-# Torch 2.0
+# PyTorch 2.0
 🤗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include:
@@ -48,7 +48,6 @@ In some cases - such as making the pipeline more deterministic or converting it
 ```diff
 import torch
 from diffusers import DiffusionPipeline
-from diffusers.models.attention_processor import AttnProcessor
 pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
 + pipe.unet.set_default_attn_processor()
@@ -112,15 +111,12 @@ for _ in range(3):
 ```python
 from diffusers import StableDiffusionImg2ImgPipeline
-import requests
+from diffusers.utils import load_image
 import torch
-from PIL import Image
-from io import BytesIO
 url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-init_image = Image.open(BytesIO(response.content)).convert("RGB")
+init_image = load_image(url)
 init_image = init_image.resize((512, 512))
 path = "runwayml/stable-diffusion-v1-5"
@@ -145,23 +141,14 @@ for _ in range(3):
 ```python
 from diffusers import StableDiffusionInpaintPipeline
-import requests
+from diffusers.utils import load_image
 import torch
-from PIL import Image
-from io import BytesIO
-url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-def download_image(url):
-    response = requests.get(url)
-    return Image.open(BytesIO(response.content)).convert("RGB")
 img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
 mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-init_image = download_image(img_url).resize((512, 512))
-mask_image = download_image(mask_url).resize((512, 512))
+init_image = load_image(img_url).resize((512, 512))
+mask_image = load_image(mask_url).resize((512, 512))
 path = "runwayml/stable-diffusion-inpainting"
@@ -185,15 +172,12 @@ for _ in range(3):
 ```python
 from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-import requests
+from diffusers.utils import load_image
 import torch
-from PIL import Image
-from io import BytesIO
 url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-response = requests.get(url)
-init_image = Image.open(BytesIO(response.content)).convert("RGB")
+init_image = load_image(url)
 init_image = init_image.resize((512, 512))
 path = "runwayml/stable-diffusion-v1-5"
@@ -227,20 +211,20 @@ import torch
 run_compile = True  # Set True / False
-pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
-pipe.to("cuda")
+pipe_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
+pipe_1.to("cuda")
 pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
 pipe_2.to("cuda")
 pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True)
 pipe_3.to("cuda")
-pipe.unet.to(memory_format=torch.channels_last)
+pipe_1.unet.to(memory_format=torch.channels_last)
 pipe_2.unet.to(memory_format=torch.channels_last)
 pipe_3.unet.to(memory_format=torch.channels_last)
 if run_compile:
-    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+    pipe_1.unet = torch.compile(pipe_1.unet, mode="reduce-overhead", fullgraph=True)
     pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
     pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)
@@ -250,9 +234,9 @@ prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
 neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
 for _ in range(3):
-    image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
-    image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
-    image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images
+    image_1 = pipe_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
+    image_2 = pipe_2(image=image_1, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
+    image_3 = pipe_3(prompt=prompt, image=image_1, noise_level=100).images
 ```
 </details>
@@ -14,14 +14,14 @@ specific language governing permissions and limitations under the License.
 [[open-in-colab]]
-Most 🤗 Diffusers pipeline now accept a `callback_on_step_end` argument that allows you to change the default behavior of denoising loop with custom defined functions. Here is an example of a callback function we can write to disable classifier free guidance after 40% of inference steps to save compute with minimum tradeoff in performance.
+Most 🤗 Diffusers pipelines now accept a `callback_on_step_end` argument that allows you to change the default behavior of the denoising loop with custom-defined functions. Here is an example of a callback function we can write to disable classifier-free guidance after 40% of the inference steps to save compute with a minimum tradeoff in performance.
 ```python
 def callback_dynamic_cfg(pipe, step_index, timestep, callback_kwargs):
     # adjust the batch_size of prompt_embeds according to guidance_scale
     if step_index == int(pipe.num_timestep * 0.4):
         prompt_embeds = callback_kwargs["prompt_embeds"]
-        prompt_embeds =prompt_embeds.chunk(2)[-1]
+        prompt_embeds = prompt_embeds.chunk(2)[-1]
         # update guidance_scale and prompt_embeds
         pipe._guidance_scale = 0.0
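The hunk above truncates the callback body. For context, a sketch of the complete dynamic-CFG callback as it might look: the callback writes the trimmed embeddings back into `callback_kwargs` and returns the dict, which is how `callback_on_step_end` callbacks are consumed (the `pipe.num_timestep` attribute name is taken from the hunk itself, not independently verified):

```python
def callback_dynamic_cfg(pipe, step_index, timestep, callback_kwargs):
    # adjust the batch_size of prompt_embeds according to guidance_scale
    if step_index == int(pipe.num_timestep * 0.4):
        prompt_embeds = callback_kwargs["prompt_embeds"]
        prompt_embeds = prompt_embeds.chunk(2)[-1]

        # update guidance_scale and prompt_embeds
        pipe._guidance_scale = 0.0
        callback_kwargs["prompt_embeds"] = prompt_embeds
    return callback_kwargs
```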
@@ -36,7 +36,7 @@ Your callback function has below arguments:
 You can pass the callback function as `callback_on_step_end` argument to the pipeline along with `callback_on_step_end_tensor_inputs`.
-```
+```python
 import torch
 from diffusers import StableDiffusionPipeline
@@ -46,7 +46,7 @@ pipe = pipe.to("cuda")
 prompt = "a photo of an astronaut riding a horse on mars"
 generator = torch.Generator(device="cuda").manual_seed(1)
-out= pipe(prompt, generator=generator, callback_on_step_end = callback_custom_cfg, callback_on_step_end_tensor_inputs=['prompt_embeds'])
+out = pipe(prompt, generator=generator, callback_on_step_end=callback_custom_cfg, callback_on_step_end_tensor_inputs=['prompt_embeds'])
 out.images[0].save("out_custom_cfg.png")
 ```
@@ -55,6 +55,6 @@ Your callback function will be executed at the end of each denoising step and mo
 <Tip>
-Currently we only support `callback_on_step_end`. If you have a solid use case and require a callback function with a different execution point, please open an [feature request](https://github.com/huggingface/diffusers/issues/new/choose) so we can add it!
+Currently we only support `callback_on_step_end`. If you have a solid use case and require a callback function with a different execution point, please open a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&projects=&template=feature_request.md&title=) so we can add it!
 </Tip>
@@ -39,24 +39,19 @@ device_type = jax.devices()[0].device_kind
 print(f"Found {num_devices} JAX devices of type {device_type}.")
 assert (
     "TPU" in device_type,
-    "Available device is not a TPU, please select TPU from Edit > Notebook settings > Hardware accelerator"
+    "Available device is not a TPU, please select TPU from Runtime > Change runtime type > Hardware accelerator"
 )
-"Found 8 JAX devices of type Cloud TPU."
+# Found 8 JAX devices of type Cloud TPU.
 ```
 Great, now you can import the rest of the dependencies you'll need:
 ```python
-import numpy as np
 import jax.numpy as jnp
-from pathlib import Path
 from jax import pmap
 from flax.jax_utils import replicate
 from flax.training.common_utils import shard
-from PIL import Image
-from huggingface_hub import notebook_login
 from diffusers import FlaxStableDiffusionPipeline
 ```
@@ -90,7 +85,7 @@ prompt = "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, por
 prompt = [prompt] * jax.device_count()
 prompt_ids = pipeline.prepare_inputs(prompt)
 prompt_ids.shape
-"(8, 77)"
+# (8, 77)
 ```
 Model parameters and inputs have to be replicated across the 8 parallel devices. The parameters dictionary is replicated with [`flax.jax_utils.replicate`](https://flax.readthedocs.io/en/latest/api_reference/flax.jax_utils.html#flax.jax_utils.replicate) which traverses the dictionary and changes the shape of the weights so they are repeated 8 times. Arrays are replicated using `shard`.
@@ -102,7 +97,7 @@ p_params = replicate(params)
 # arrays
 prompt_ids = shard(prompt_ids)
 prompt_ids.shape
-"(8, 1, 77)"
+# (8, 1, 77)
 ```
 This shape means each one of the 8 devices receives as an input a `jnp` array with shape `(1, 77)`, where `1` is the batch size per device. On TPUs with sufficient memory, you could have a batch size larger than `1` if you want to generate multiple images (per chip) at once.
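The calls in the next hunks pass a replicated `rng` that this diff never shows being created. For context, a minimal sketch of the usual per-device PRNG setup (the seed value is arbitrary):

```python
import jax

# one PRNG key per device so every device samples different latents
seed = 0
rng = jax.random.PRNGKey(seed)
rng = jax.random.split(rng, jax.device_count())
```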
@@ -127,7 +122,7 @@ To take advantage of JAX's optimized speed on a TPU, pass `jit=True` to the pipe
 <Tip warning={true}>
-You need to ensure all your inputs have the same shape in subsequent calls, other JAX will need to recompile the code which is slower.
+You need to ensure all your inputs have the same shape in subsequent calls, otherwise JAX will need to recompile the code which is slower.
 </Tip>
@@ -137,18 +132,18 @@ The first inference run takes more time because it needs to compile the code, bu
 %%time
 images = pipeline(prompt_ids, p_params, rng, jit=True)[0]
-"CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s"
-"Wall time: 1min 29s"
+# CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s
+# Wall time: 1min 29s
 ```
 The returned array has shape `(8, 1, 512, 512, 3)` which should be reshaped to remove the second dimension and get 8 images of `512 × 512 × 3`. Then you can use the [`~utils.numpy_to_pil`] function to convert the arrays into images.
 ```python
-from diffusers import make_image_grid
+from diffusers.utils import make_image_grid
 images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
 images = pipeline.numpy_to_pil(images)
-make_image_grid(images, 2, 4)
+make_image_grid(images, rows=2, cols=4)
 ```
 ![img](https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/stable_diffusion_jax_how_to_cell_38_output_0.jpeg)
@@ -181,7 +176,6 @@ make_image_grid(images, 2, 4)
 ![img](https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/stable_diffusion_jax_how_to_cell_43_output_0.jpeg)
 ## How does parallelization work?
 The Flax pipeline in 🤗 Diffusers automatically compiles the model and runs it in parallel on all available devices. Let's take a closer look at how that process works.
@@ -202,7 +196,7 @@ p_generate = pmap(pipeline._generate)
 After calling `pmap`, the prepared function `p_generate` will:
 1. Make a copy of the underlying function, `pipeline._generate`, on each device.
-2. Send each device a different portion of the input arguments (this is why its necessary to call the *shard* function). In this case, `prompt_ids` has shape `(8, 1, 77, 768)` so the array is split into 8 and each copy of `_generate` receives an input with shape `(1, 77, 768)`.
+2. Send each device a different portion of the input arguments (this is why it's necessary to call the *shard* function). In this case, `prompt_ids` has shape `(8, 1, 77, 768)` so the array is split into 8 and each copy of `_generate` receives an input with shape `(1, 77, 768)`.
 The most important thing to pay attention to here is the batch size (1 in this example), and the input dimensions that make sense for your code. You don't have to change anything else to make the code work in parallel.
@@ -212,13 +206,14 @@ The first time you call the pipeline takes more time, but the calls afterward ar
 %%time
 images = p_generate(prompt_ids, p_params, rng)
 images = images.block_until_ready()
-"CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s"
-"Wall time: 1min 15s"
+# CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s
+# Wall time: 1min 15s
 ```
 Check your image dimensions to see if they're correct:
 ```python
 images.shape
-"(8, 1, 512, 512, 3)"
+# (8, 1, 512, 512, 3)
 ```