Reduce peak VRAM by releasing large attention tensors as soon as they're no longer needed (#3463)
Release large tensors in attention as soon as they're no longer required. This reduces peak VRAM by nearly 2 GB for 1024x1024 images (even with slicing enabled), and the savings grow with image size.
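A minimal sketch of the technique described above, not the actual diff from this commit: in a standard attention computation, the raw score matrix and the softmax probabilities are the largest intermediates, and dropping each reference as soon as the next tensor has been produced lets the allocator reuse that memory, lowering the peak. The function name and NumPy stand-in (in place of a GPU tensor library) are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim). The (batch, seq_len, seq_len)
    # score matrix dominates memory for large inputs.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

    # Numerically stable softmax over the last axis.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    del scores  # release the raw scores before allocating the output

    out = probs @ v
    del probs   # release the probabilities once the output exists
    return out
```

On a framework with a caching GPU allocator, dropping these references early means the output allocation can reuse the freed blocks instead of growing the peak; the arithmetic is unchanged.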