Unverified Commit da55d247 authored by cyanguwa, committed by GitHub

Disable FAv2.1+ for causal mask in cross attention (#522)



* disable FAv2.1 if causal+cross attn
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove comment and add warning
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* include both causal and padding+causal
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add a space
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
parent 15088217
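For context, here is a minimal PyTorch-only sketch (not part of this commit) of the behavior change the new check guards against. Per the FlashAttention README section linked in the warning below, version 2.1 aligns the causal mask to the bottom-right corner of the attention matrix when the query and key/value lengths differ, whereas earlier versions align it to the top-left, which is the layout Transformer Engine assumes:

```python
# Illustration only: build both causal-mask alignments for a cross-attention shape.
import torch

seqlen_q, seqlen_kv = 2, 5  # cross attention: query shorter than key/value

# Pre-2.1 semantics (top-left aligned): query position i attends to key positions <= i.
top_left = torch.ones(seqlen_q, seqlen_kv, dtype=torch.int64).tril(diagonal=0)

# 2.1+ semantics (bottom-right aligned): the diagonal shifts by seqlen_kv - seqlen_q.
bottom_right = torch.ones(seqlen_q, seqlen_kv, dtype=torch.int64).tril(
    diagonal=seqlen_kv - seqlen_q
)

print(top_left)      # [[1, 0, 0, 0, 0], [1, 1, 0, 0, 0]]
print(bottom_right)  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

Because the two semantics disagree exactly when max_seqlen_q != max_seqlen_kv, the commit falls back to a non-flash backend in that case rather than silently computing a differently masked attention.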
@@ -56,6 +56,7 @@ from transformer_engine.pytorch.jit import jit_fuser
 _flash_attn_version = packaging.version.Version(version("flash-attn"))
 _flash_attn_version_required = packaging.version.Version("1.0.6")
 _flash_attn_2_available = _flash_attn_version >= packaging.version.Version("2")
+_flash_attn_2_1_plus = _flash_attn_version >= packaging.version.Version("2.1")
 if _flash_attn_2_available:
     from flash_attn.flash_attn_interface import flash_attn_varlen_func as flash_attn_forward_func # pylint: disable=no-name-in-module
@@ -2134,6 +2135,16 @@ class DotProductAttention(torch.nn.Module):
         if not _flash_attn_2_available and self.num_gqa_groups != self.num_attention_heads:
             use_flash_attention = False
 
+        if (_flash_attn_2_1_plus
+            and causal_mask
+            and max_seqlen_q != max_seqlen_kv):
+            warnings.warn(
+                "Disabling the use of FlashAttention since version 2.1+ has changed its behavior "
+                "for causal mask in cross attention. See "
+                "https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag"
+            )
+            use_flash_attention = False
+
         if core_attention_bias_type != "no_bias" or core_attention_bias is not None:
             use_flash_attention = False
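For illustration, here is a standalone sketch of the same version gate outside Transformer Engine; the helper name can_use_flash_attention and its arguments are hypothetical, and flash-attn must be installed for the version lookup to succeed:

```python
# Sketch of gating FlashAttention usage on the installed flash-attn version.
import warnings
from importlib.metadata import version

import packaging.version

_flash_attn_version = packaging.version.Version(version("flash-attn"))
_flash_attn_2_1_plus = _flash_attn_version >= packaging.version.Version("2.1")


def can_use_flash_attention(causal_mask: bool, max_seqlen_q: int, max_seqlen_kv: int) -> bool:
    """Return False when FlashAttention 2.1+ would apply bottom-right causal alignment."""
    if _flash_attn_2_1_plus and causal_mask and max_seqlen_q != max_seqlen_kv:
        warnings.warn(
            "Disabling FlashAttention: version 2.1+ changed the causal-mask alignment "
            "for cross attention. See "
            "https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag"
        )
        return False
    return True
```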