"qa/vscode:/vscode.git/clone" did not exist on "43569381ce3779f9bf6084da917556db210d745d"
Unverified commit 5a881a08 authored by Kirthi Shankar Sivamani, committed by GitHub

Catch FA internal error with compute capability 8.6 (#113)

FlashAttention (FA) does not support head_dim > 64 on compute capability 8.6, so disable it in that case instead of hitting an internal error.
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 5f0d3868
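For reference, here is a minimal sketch of what the `get_device_compute_capability` helper used in the diff below could look like. The `torch.cuda`-based body is an assumption for illustration only, not necessarily TransformerEngine's actual implementation.

```python
import torch

def get_device_compute_capability() -> float:
    # Assumed implementation (sketch only): report the current CUDA
    # device's compute capability as a float, e.g. (8, 6) -> 8.6.
    major, minor = torch.cuda.get_device_capability()
    return major + minor / 10
```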
@@ -353,10 +353,11 @@ class DotProductAttention(torch.nn.Module):
         norm_factor = math.sqrt(self.hidden_size_per_attention_head)
+        self.device_compute_capability = get_device_compute_capability()
         self.use_flash_attention = (
             int(os.getenv("NVTE_FLASH_ATTN", "1"))
             and attn_mask_type == "causal"
-            and get_device_compute_capability() >= 8.0
+            and self.device_compute_capability >= 8.0
         )
         attn_kwargs = {
@@ -437,6 +438,7 @@ class DotProductAttention(torch.nn.Module):
         if (query_layer.dtype not in [torch.bfloat16, torch.float16]
             or key_layer.dtype not in [torch.bfloat16, torch.float16]
             or value_layer.dtype not in [torch.bfloat16, torch.float16]
+            or (self.device_compute_capability == 8.6 and key_layer.shape[-1] > 64)
         ):
             use_flash_attention = False
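Taken together, the two hunks amount to the guard restated below. This is a hedged, illustrative restatement, not code from the commit: `should_use_flash_attention` is a hypothetical helper, and it reuses the `get_device_compute_capability` sketch from above.

```python
import os
import torch

def should_use_flash_attention(attn_mask_type: str, key_layer: torch.Tensor) -> bool:
    # Hypothetical helper restating the combined guard from this commit.
    cc = get_device_compute_capability()  # e.g. 8.6 on an sm_86 GPU
    head_dim = key_layer.shape[-1]        # FA's head-dim limit applies here
    if not int(os.getenv("NVTE_FLASH_ATTN", "1")):
        return False  # flash attention disabled via environment variable
    if attn_mask_type != "causal" or cc < 8.0:
        return False  # FA path requires a causal mask and Ampere or newer
    if cc == 8.6 and head_dim > 64:
        return False  # FA does not support head_dim > 64 on compute 8.6
    return True
```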