Fixed CPU overhead when doing DS cast (#1941)

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

Commit 4c7095ca, authored Jul 09, 2025 by Selvaraj Anandaraj, committed by GitHub Jul 09, 2025

Showing 1 changed file with 1 addition and 1 deletion

transformer_engine/pytorch/tensor/utils.py +1 -1

--- a/transformer_engine/pytorch/tensor/utils.py
+++ b/transformer_engine/pytorch/tensor/utils.py
@@ -193,7 +193,7 @@ def _cast_master_weights_to_fp8_delayed_scaling(params, group, use_fsdp_shard_mo
        quantizer.update_quantized(master_weight.view(1, -1), shard_model_weight_fp8)

    if len(amaxes) > 0:
-        dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device=amaxes[0].device)
+        dummy_overflow_buf = torch.zeros(1, dtype=torch.int, device=amaxes[0].device)

        # Reduce amaxes.
        packed_amaxes = torch.empty(len(amaxes), dtype=torch.float32, device=amaxes[0].device)
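
Note on the change: the commit message does not say where the CPU overhead came from. A plausible reading is that `torch.tensor([0], ...)` first builds the tensor from a Python list on the host and then copies it to the target device, whereas `torch.zeros(1, ...)` allocates and fills the buffer directly on that device. The sketch below only checks that the two forms yield the same buffer; the helper names are hypothetical and not part of `transformer_engine`, and it makes no timing claim.

```python
import torch


def overflow_buf_from_list(device):
    # Old form from the diff: tensor built from a Python list.
    # (Helper name is hypothetical, for illustration only.)
    return torch.tensor([0], dtype=torch.int, device=device)


def overflow_buf_zeros(device):
    # New form from the diff: zero-filled buffer created via a factory function.
    # (Helper name is hypothetical, for illustration only.)
    return torch.zeros(1, dtype=torch.int, device=device)


# Fall back to CPU so the check also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

old_buf = overflow_buf_from_list(device)
new_buf = overflow_buf_zeros(device)

# Same shape, dtype, device type, and contents: the change should be behavior-preserving.
assert old_buf.shape == new_buf.shape
assert old_buf.dtype == new_buf.dtype
assert old_buf.device.type == new_buf.device.type
assert torch.equal(old_buf, new_buf)
```

To measure the actual difference on a GPU, one could time both helpers in a loop with `torch.cuda.synchronize()` before and after; results will depend on the hardware and PyTorch version.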