Unverified commit c29d81c3 authored by Pedro Cuenca, committed by GitHub

Docs: fp16 page (#404)



* Initial version of `fp16` page.

* Fix typo in README.

* Change titles of fp16 section in toctree.

* PR suggestion
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* PR suggestion
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Clarify attention slicing is useful even for batches of 1

Explained by @patrickvonplaten after a suggestion by @keturn.
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Do not talk about `batches` in `enable_attention_slicing`.

* Use Tip (just for fun), add link to method.

* Comment about fp16 results looking the same as float32 in practice.

* Style: docstring line wrapping.
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent a127363d
@@ -28,7 +28,7 @@
   title: "Using Diffusers"
 - sections:
   - local: optimization/fp16
-    title: "Torch Float16"
+    title: "Memory and Speed"
   - local: optimization/onnx
     title: "ONNX"
   - local: optimization/open_vino
@@ -10,23 +10,67 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
 specific language governing permissions and limitations under the License.
 -->
 
-# Quicktour
-
-Start using Diffusers🧨 quickly!
-To start, use the [`DiffusionPipeline`] for quick inference and sample generations!
-
-```
-pip install diffusers
-```
-
-## Main classes
-
-### Models
-
-### Schedulers
-
-### Pipeliens
+# Memory and speed
+
+We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.
+
+## CUDA `autocast`
+
+If you use a CUDA GPU, you can take advantage of `torch.autocast` to perform inference roughly twice as fast at the cost of slightly lower precision. All you need to do is put your inference call inside an `autocast` context manager, as this Stable Diffusion text-to-image example shows:
+
+```Python
+from torch import autocast
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+with autocast("cuda"):
+    image = pipe(prompt).images[0]
+```
+
+Despite the precision loss, in our experience the final image results look the same as the `float32` versions. Feel free to experiment and report back!
+
+## Half precision weights
+
+To save more GPU memory, you can load the model weights directly in half precision. This involves loading the float16 version of the weights, which was saved to a branch named `fp16`, and telling PyTorch to use the `float16` type when loading them:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    revision="fp16",
+    torch_dtype=torch.float16,
+    use_auth_token=True
+)
+```
+
+## Sliced attention for additional memory savings
+
+For even more memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.
+
+<Tip>
+Attention slicing is useful even with a batch size of just 1, as long as the model uses more than one attention head. If there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.
+</Tip>
+
+To perform the attention computation sequentially over each head, invoke [`~StableDiffusionPipeline.enable_attention_slicing`] on your pipeline before inference, as shown here:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    revision="fp16",
+    torch_dtype=torch.float16,
+    use_auth_token=True
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_attention_slicing()
+with torch.autocast("cuda"):
+    image = pipe(prompt).images[0]
+```
+
+There's a small performance penalty of about 10% slower inference, but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!
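To illustrate the Tip in the page above, here is a minimal PyTorch sketch (not the diffusers implementation) of computing attention one head at a time: only a single *QK^T* matrix is materialized at any moment, which is why slicing helps even at batch size 1.

```Python
# Minimal sketch, NOT the diffusers implementation: attention computed one
# head at a time, so only one seq_len x seq_len QK^T matrix is alive in
# memory at any moment instead of num_heads of them.
import torch

def per_head_attention(q, k, v):
    # q, k, v: (num_heads, seq_len, head_dim)
    num_heads, _, head_dim = q.shape
    out = torch.empty_like(q)
    for h in range(num_heads):
        attn = torch.softmax(q[h] @ k[h].T / head_dim**0.5, dim=-1)
        out[h] = attn @ v[h]
    return out

q = k = v = torch.randn(8, 4096, 40)  # e.g. 8 heads over a 64x64 latent grid
print(per_head_attention(q, k, v).shape)  # torch.Size([8, 4096, 40])
```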
@@ -67,13 +67,13 @@ class StableDiffusionPipeline(DiffusionPipeline):
         r"""
         Enable sliced attention computation.
 
-        When this option is enabled, the attention module will split the input batch in slices, to compute attention in
-        several steps. This is useful to save some memory in exchange for a small speed decrease.
+        When this option is enabled, the attention module will split the input tensor in slices, to compute attention
+        in several steps. This is useful to save some memory in exchange for a small speed decrease.
 
         Args:
             slice_size (`str` or `int`, *optional*, defaults to `"auto"`):
-                When `"auto"`, halves the input batch to the attention heads, so attention will be computed in two
-                steps. If a number is provided, use as many slices as `attention_head_dim // slice_size`. In this case,
-                `attention_head_dim` must be a multiple of `slice_size`.
+                When `"auto"`, halves the input to the attention heads, so attention will be computed in two steps. If
+                a number is provided, uses as many slices as `attention_head_dim // slice_size`. In this case,
+                `attention_head_dim` must be a multiple of `slice_size`.
         """
         if slice_size == "auto":
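To spell out the `slice_size` semantics documented above, here are some illustrative calls; the values assume a UNet whose `attention_head_dim` is 8 (check your model's config), and `pipe` is a pipeline set up as in the page's examples.

```Python
# Illustrative values only, assuming the UNet's attention_head_dim is 8.
pipe.enable_attention_slicing()              # "auto": halve, compute in 2 steps
pipe.enable_attention_slicing(slice_size=4)  # 8 // 4 = 2 slices
pipe.enable_attention_slicing(slice_size=1)  # 8 slices, most memory saved
```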
@@ -78,13 +78,13 @@ class StableDiffusionImg2ImgPipeline(DiffusionPipeline):
         r"""
         Enable sliced attention computation.
 
-        When this option is enabled, the attention module will split the input batch in slices, to compute attention in
-        several steps. This is useful to save some memory in exchange for a small speed decrease.
+        When this option is enabled, the attention module will split the input tensor in slices, to compute attention
+        in several steps. This is useful to save some memory in exchange for a small speed decrease.
 
         Args:
             slice_size (`str` or `int`, *optional*, defaults to `"auto"`):
-                When `"auto"`, halves the input batch to the attention heads, so attention will be computed in two
-                steps. If a number is provided, use as many slices as `attention_head_dim // slice_size`. In this case,
-                `attention_head_dim` must be a multiple of `slice_size`.
+                When `"auto"`, halves the input to the attention heads, so attention will be computed in two steps. If
+                a number is provided, uses as many slices as `attention_head_dim // slice_size`. In this case,
+                `attention_head_dim` must be a multiple of `slice_size`.
         """
         if slice_size == "auto":
@@ -92,13 +92,13 @@ class StableDiffusionInpaintPipeline(DiffusionPipeline):
         r"""
         Enable sliced attention computation.
 
-        When this option is enabled, the attention module will split the input batch in slices, to compute attention in
-        several steps. This is useful to save some memory in exchange for a small speed decrease.
+        When this option is enabled, the attention module will split the input tensor in slices, to compute attention
+        in several steps. This is useful to save some memory in exchange for a small speed decrease.
 
         Args:
             slice_size (`str` or `int`, *optional*, defaults to `"auto"`):
-                When `"auto"`, halves the input batch to the attention heads, so attention will be computed in two
-                steps. If a number is provided, use as many slices as `attention_head_dim // slice_size`. In this case,
-                `attention_head_dim` must be a multiple of `slice_size`.
+                When `"auto"`, halves the input to the attention heads, so attention will be computed in two steps. If
+                a number is provided, uses as many slices as `attention_head_dim // slice_size`. In this case,
+                `attention_head_dim` must be a multiple of `slice_size`.
         """
         if slice_size == "auto":
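If you want to verify figures like the 3.2 GB number on your own GPU, PyTorch's built-in memory counters make it easy. A small sketch, reusing the `pipe` and `prompt` objects from the examples above:

```Python
# Sketch: measure peak VRAM for one generation. Assumes `pipe` and `prompt`
# are set up as in the examples above on a CUDA device.
import torch

torch.cuda.reset_peak_memory_stats()  # clear the allocation high-water mark
with torch.autocast("cuda"):
    image = pipe(prompt).images[0]
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```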