fix a sync race error of softmax_lse in CP+THD+P2P (#1624)

fix a race error softmax_lse

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
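
Reading the diff below, the race appears to come from stream ordering. `softmax_lse_per_step[i - 1]` is written by attention work queued on `flash_attn_streams[(i - 1) % 2]`, while the old code ran `.transpose(0, 1).contiguous()` outside the `with torch.cuda.stream(...)` block, so the copy was issued on whichever stream was current and was not visibly ordered after the producer. The following is a toy, pure-Python model of that ordering problem, not CUDA or Transformer Engine code. `ToyStream`, `run_step`, and the sample values are hypothetical, and the schedule in which the main queue drains first is only one of the orders two unsynchronized streams may take.

```python
from collections import deque


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: work is queued, then run in FIFO order."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, fn):
        self._queue.append(fn)

    def drain(self):
        while self._queue:
            self._queue.popleft()()


def run_step(postprocess_on_producer_stream):
    main, side = ToyStream(), ToyStream()
    buf = {"lse": None, "packed": None}

    def produce():
        # Stands in for the attention kernel writing softmax_lse on the side stream.
        buf["lse"] = [[1.0, 2.0], [3.0, 4.0]]

    def postprocess():
        # Stands in for transpose(0, 1).contiguous(): reads lse, writes a new buffer.
        src = buf["lse"]
        buf["packed"] = None if src is None else [list(col) for col in zip(*src)]

    side.enqueue(produce)
    target = side if postprocess_on_producer_stream else main
    target.enqueue(postprocess)

    # Assumed schedule: the main queue happens to run before the side queue.
    main.drain()
    side.drain()
    return buf["packed"]


racy = run_step(postprocess_on_producer_stream=False)
fixed = run_step(postprocess_on_producer_stream=True)
```

In this model `racy` is computed before the producer has run, while `fixed` always sees the produced values because both operations share one FIFO queue. On a real GPU a misordered read would return stale or partially written data rather than `None`; the model only illustrates why enqueuing the post-processing on the producer's stream restores the order.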

Commit 76187a5e (unverified), authored Mar 31, 2025 by Xiaowei Ren; committed by GitHub Mar 31, 2025

Showing 1 changed file with 8 additions and 9 deletions

transformer_engine/pytorch/attention.py +8 -9

--- a/transformer_engine/pytorch/attention.py
+++ b/transformer_engine/pytorch/attention.py
@@ -1359,16 +1359,15 @@ class AttnFuncWithCPAndKVP2P(torch.autograd.Function):
                if i > 1:
                    flash_attn_streams[(i - 1) % 2].wait_event(fwd_results_correction_done)
-                if use_fused_attention:
-                    # [b, np, sq, 1] -> [b, np, sq] or
-                    # [t, np, 1] -> [t, np]
-                    softmax_lse_per_step[i - 1].squeeze_(-1)
-                    if softmax_lse_in_packed_format:
-                        softmax_lse_per_step[i - 1] = (
-                            softmax_lse_per_step[i - 1].transpose(0, 1).contiguous()
-                        )
                with torch.cuda.stream(flash_attn_streams[(i - 1) % 2]):
+                    if use_fused_attention:
+                        # [b, np, sq, 1] -> [b, np, sq] or
+                        # [t, np, 1] -> [t, np]
+                        softmax_lse_per_step[i - 1].squeeze_(-1)
+                        if softmax_lse_in_packed_format:
+                            softmax_lse_per_step[i - 1] = (
+                                softmax_lse_per_step[i - 1].transpose(0, 1).contiguous()
+                            )
                    if fp8:
                        out_per_step[i - 1] = out_per_step[i - 1].dequantize(dtype=torch.float32)
                    if i == 1: