Diffusion models are known to be slower than their GAN counterparts because of the iterative and sequential reverse diffusion process. Recent works try to address this limitation with:

* progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora))
* model compression (such as [SSD-1B](https://huggingface.co/segmind/SSD-1B))
* reusing adjacent features of the denoiser (such as [DeepCache](https://github.com/horseee/DeepCache))

In this tutorial, we instead focus on leveraging the power of PyTorch 2 to accelerate the inference latency of a text-to-image diffusion pipeline. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) as a case study, but the techniques we discuss should extend to other text-to-image diffusion pipelines.

## Setup

Make sure you're on the latest version of `diffusers`:

```bash
pip install -U diffusers
```
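
If you want to double-check which version you have, a quick way (assuming a standard Python environment) is to print it from Python:

```python
import diffusers

# Print the installed diffusers version.
print(diffusers.__version__)
```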
...

_This tutorial doesn't present the benchmarking code and focuses on how to perform the optimizations, instead._

## Baseline

Let's start with a baseline. Disable the use of reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0):

```python
from diffusers import StableDiffusionXLPipeline
...
```
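
The rest of the baseline snippet is elided here. As a minimal sketch of what such a baseline looks like (the prompt and step count below are illustrative placeholders, not necessarily the ones used for the reported numbers):

```python
from diffusers import StableDiffusionXLPipeline

# Load SDXL in the default full precision (float32) and move it to the GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")

# Switch back to the vanilla attention processors so that
# scaled_dot_product_attention is not used.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```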
...

* The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.
* Furthermore, in our experiments, we found bfloat16 to be much more resilient than float16 when used with quantization.

We have a [dedicated guide](../optimization/fp16) for running inference in reduced precision.
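
As an illustration of the change involved (a sketch, not the full benchmark script), loading the pipeline in bfloat16 only requires passing `torch_dtype` to `from_pretrained`:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline weights in bfloat16 instead of the default float32.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
).to("cuda")
```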

## Running attention efficiently

Attention blocks are intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0), we can run them efficiently.

```python
from diffusers import StableDiffusionXLPipeline
...
```
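
The full snippet is elided above. The key point is that on PyTorch 2, diffusers already routes attention through `scaled_dot_product_attention` by default, so it's enough to load the pipeline in reduced precision and simply not call `set_default_attn_processor()`. A minimal sketch (prompt and step count are placeholders):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
).to("cuda")

# No set_default_attn_processor() here: the attention blocks run
# through scaled_dot_product_attention out of the box on PyTorch 2.
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```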
...

Fusing the attention block's QKV projection matrices with `fuse_qkv_projections()` provides a minor boost, from 2.54 seconds to 2.52 seconds.

<Tip warning={true}>

Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.

</Tip>
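
For reference, a minimal sketch of enabling the fusion on a supported pipeline (continuing the SDXL setup from above) could look like this:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Fuse the query, key, and value projections in the attention blocks
# so they are computed with a single, larger matmul.
pipe.fuse_qkv_projections()
```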