Unverified Commit 9c094458 authored by Sayak Paul, committed by GitHub

[docs] slight edits to the attention backends docs. (#12394)



* slight edits to the attention backends docs.

* Update docs/source/en/optimization/attention_backends.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 4588bbeb
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
 # Attention backends
-> [!TIP]
+> [!NOTE]
 > The attention dispatcher is an experimental feature. Please open an issue if you have any feedback or encounter any problems.
 Diffusers provides several optimized attention algorithms that are more memory- and compute-efficient through its *attention dispatcher*. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them.
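For context, a minimal sketch of the dispatcher workflow described in the hunk above is shown here. The pipeline class and checkpoint are illustrative assumptions and are not part of this commit; `set_attention_backend` is the `ModelMixin` method referenced in the next hunk.

```py
# Sketch only: FluxPipeline and the FLUX.1-dev checkpoint are illustrative
# assumptions, not taken from this commit.
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Route every attention module in the transformer through a different backend.
pipeline.transformer.set_attention_backend("flash")

image = pipeline("a photo of a cat").images[0]
```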
@@ -33,7 +33,7 @@ The [`~ModelMixin.set_attention_backend`] method iterates through all the modules
 The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernels](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup.
-> [!TIP]
+> [!NOTE]
 > FlashAttention-3 is not supported on non-Hopper architectures; in that case, use FlashAttention with `set_attention_backend("flash")`.
 ```py
@@ -78,10 +78,16 @@ with attention_backend("_flash_3_hub"):
     image = pipeline(prompt).images[0]
 ```
+> [!TIP]
+> Most attention backends support `torch.compile` without graph breaks and can be used to further speed up inference.
 ## Available backends
 Refer to the table below for a complete list of available attention backends and their variants.
+<details>
+<summary>Expand</summary>
 | Backend Name | Family | Description |
 |--------------|--------|-------------|
 | `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's scaled_dot_product_attention |
@@ -104,3 +110,5 @@ Refer to the table below for a complete list of available attention backends and
 | `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
 | `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
 | `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |
+</details>
\ No newline at end of file
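To ground the new `torch.compile` tip and the backend table added in this commit, here is a rough sketch that compiles the transformer and temporarily selects a backend via the `attention_backend` context manager shown in the hunk above. The import path for `attention_backend`, the pipeline class, and the checkpoint are assumptions for illustration.

```py
# Sketch only: the import path for `attention_backend`, the pipeline class, and
# the checkpoint are assumptions, not confirmed by this commit.
import torch
from diffusers import FluxPipeline
from diffusers.models.attention_dispatch import attention_backend

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the transformer; per the new tip, most backends run without graph breaks.
pipeline.transformer.compile(fullgraph=True)

# Temporarily use one of the backends from the table for this call only
# (this particular one requires SageAttention to be installed).
with attention_backend("_sage_qk_int8_pv_fp16_cuda"):
    image = pipeline("a photo of a cat").images[0]
```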