Unverified Commit 199e6123 authored by Sergii Dymchenko, committed by GitHub

Use log1p(x) instead of log(1+x) (#1401)

torch.log1p() is more accurate than torch.log() for small input values: https://pytorch.org/docs/stable/generated/torch.log1p.html
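
For illustration, a minimal standalone sketch of the precision difference (not part of this patch), assuming float32 tensors:

import torch

# For tiny x, 1 + x rounds to exactly 1.0 in float32 (machine epsilon is
# about 1.19e-7), so torch.log(1 + x) collapses to 0. torch.log1p(x)
# evaluates log(1 + x) without forming 1 + x and keeps the precision.
x = torch.tensor(1e-10, dtype=torch.float32)
print(torch.log(1 + x))  # tensor(0.)
print(torch.log1p(x))    # tensor(1.0000e-10)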

Found with TorchFix https://github.com/pytorch-labs/torchfix/

Signed-off-by: Sergii Dymchenko <sdym@meta.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
parent 2fce82b7
@@ -1604,7 +1604,7 @@ def flash_attn_fwd_softmax_lse_correction(
"""Merge softmax stats of each step in Attention with context parallelism"""
max_scale = torch.max(softmax_lse, softmax_lse_per_step)
min_scale = torch.min(softmax_lse, softmax_lse_per_step)
new_scale = max_scale + torch.log(1 + torch.exp(min_scale - max_scale))
new_scale = max_scale + torch.log1p(torch.exp(min_scale - max_scale))
softmax_lse.copy_(new_scale)
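
For context, the patched line is a numerically stable log-sum-exp merge: log(exp(a) + exp(b)) = max(a, b) + log1p(exp(min(a, b) - max(a, b))). A hypothetical standalone reproduction follows; merge_lse is illustrative and not a function from the repository:

import torch

# Stable merge of two log-sum-exp statistics, mirroring the patched line.
def merge_lse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    max_scale = torch.max(a, b)
    min_scale = torch.min(a, b)
    return max_scale + torch.log1p(torch.exp(min_scale - max_scale))

a = torch.tensor([100.0, -3.0])
b = torch.tensor([99.0, -3.5])
print(merge_lse(a, b))  # tensor([100.3133,  -2.5260])
# The naive form overflows to inf for large inputs, since exp(100)
# exceeds the float32 maximum:
print(torch.log(torch.exp(a) + torch.exp(b)))  # tensor([inf, -2.5260])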