Unverified commit 827fad66, authored by Leo Jiang and committed by GitHub

Improve performance of NPU FA (#12260)

Transpose query/key/value from BSND to BNSD before calling npu_fusion_attention and transpose the output back afterwards; the fused NPU attention kernel performs better with the BNSD input layout.

Co-authored-by: J石页 <jiangshuo9@h-partners.com>
Co-authored-by: Aryan <aryan@huggingface.co>
parent 9b721db2
@@ -955,12 +955,13 @@ def _native_npu_attention(
     dropout_p: float = 0.0,
     scale: Optional[float] = None,
 ) -> torch.Tensor:
-    return npu_fusion_attention(
+    query, key, value = (x.transpose(1, 2).contiguous() for x in (query, key, value))
+    out = npu_fusion_attention(
         query,
         key,
         value,
-        query.size(2),  # num_heads
-        input_layout="BSND",
+        query.size(1),  # num_heads
+        input_layout="BNSD",
         pse=None,
         scale=1.0 / math.sqrt(query.shape[-1]) if scale is None else scale,
         pre_tockens=65536,
@@ -969,6 +970,8 @@ def _native_npu_attention(
         sync=False,
         inner_precise=0,
     )[0]
+    out = out.transpose(1, 2).contiguous()
+    return out
 
 
 # Reference: https://github.com/pytorch/xla/blob/06c5533de6588f6b90aa1655d9850bcf733b90b4/torch_xla/experimental/custom_kernel.py#L853
...
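
For context, here is a minimal sketch of how `_native_npu_attention` reads after this patch. The diff above elides the argument lines between the two hunks; in this sketch `next_tockens` and `keep_prob` are filled in from the public `torch_npu.npu_fusion_attention` signature and should be treated as assumptions, not the file's verbatim contents.

```python
# Sketch of _native_npu_attention after this patch (not the verbatim file).
# Assumes torch_npu is installed and exposes npu_fusion_attention.
import math
from typing import Optional

import torch
from torch_npu import npu_fusion_attention


def _native_npu_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    dropout_p: float = 0.0,
    scale: Optional[float] = None,
) -> torch.Tensor:
    # Inputs arrive as BSND: [batch, seq, num_heads, head_dim]. The fused
    # NPU attention kernel is faster with BNSD, so transpose to
    # [batch, num_heads, seq, head_dim] before the call.
    query, key, value = (x.transpose(1, 2).contiguous() for x in (query, key, value))
    out = npu_fusion_attention(
        query,
        key,
        value,
        query.size(1),  # num_heads: dim 1 after the transpose
        input_layout="BNSD",
        pse=None,
        scale=1.0 / math.sqrt(query.shape[-1]) if scale is None else scale,
        pre_tockens=65536,
        next_tockens=65536,  # assumption: elided between the diff hunks
        keep_prob=1.0 - dropout_p,  # assumption: elided between the diff hunks
        sync=False,
        inner_precise=0,
    )[0]
    # Transpose back so callers still receive the original BSND layout.
    out = out.transpose(1, 2).contiguous()
    return out
```

Note that `query.size(2)` becomes `query.size(1)` in the patch because the heads dimension moves from position 2 to position 1 after the transpose; the two `contiguous()` calls ensure the fused op receives dense BNSD buffers, and the transposes are cheap relative to the attention kernel itself.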