"...models/git@developer.sourcefind.cn:jerrrrry/infinilm.git" did not exist on "bf74389d2fab76be0e7a82b85fbd13ee3f36cef7"
Unverified commit c706ff8d, authored by Selvaraj Anandaraj and committed by GitHub

Returning an empty tensor of param dtype for wgrad (#507)

* Returning an empty tensor of param dtype for wgrad
Signed-off-by: Selvaraj Anandaraj <selvaraja@computelab-frontend-4-ub22.nvidia.com>

* lint
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@computelab-frontend-4-ub22.nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@computelab-frontend-4-ub22.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 50ff8116
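The change is the same in all three autograd functions: when fused wgrad accumulation is active and Megatron-core's custom DDP has attached a main_grad buffer to the weight, the backward sets grad_added_to_main_grad = True and returns a placeholder wgrad with main_grad's shape but the parameter's dtype, instead of None, since the actual gradient has already been accumulated into main_grad. The sketch below is a minimal, self-contained illustration of that pattern, not the TransformerEngine code; ToyLinear and its forward/backward are hypothetical and exist only to show where the placeholder fits.

# Minimal sketch; ToyLinear is hypothetical, not a TransformerEngine class.
import torch


class ToyLinear(torch.autograd.Function):
    """Toy linear op whose backward can fuse wgrad into weight.main_grad."""

    @staticmethod
    def forward(ctx, inp, weight, fuse_wgrad_accumulation):
        ctx.save_for_backward(inp, weight)
        ctx.fuse_wgrad_accumulation = fuse_wgrad_accumulation
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        inp, weight = ctx.saved_tensors
        dgrad = grad_output @ weight

        # Handle custom DDP from mcore: the wrapper owns a main_grad buffer.
        if ctx.fuse_wgrad_accumulation and hasattr(weight, 'grad_added_to_main_grad'):
            weight.main_grad.add_(grad_output.t() @ inp)
            weight.grad_added_to_main_grad = True
            # Placeholder in the *param* dtype (not main_grad's dtype) so code
            # downstream that inspects the returned grad sees the expected
            # dtype and shape; the real values live in weight.main_grad.
            wgrad = torch.empty(weight.main_grad.shape,
                                dtype=weight.dtype,
                                device=weight.device,
                                requires_grad=False)
        elif ctx.fuse_wgrad_accumulation:
            wgrad = None
        else:
            wgrad = grad_output.t() @ inp

        return dgrad, wgrad, None


# Usage sketch: a DDP wrapper (not shown) would normally attach these.
w = torch.nn.Parameter(torch.randn(4, 3, dtype=torch.bfloat16))
w.main_grad = torch.zeros(4, 3, dtype=torch.float32)
w.grad_added_to_main_grad = False
x = torch.randn(2, 3, dtype=torch.bfloat16, requires_grad=True)
ToyLinear.apply(x, w, True).sum().backward()
assert w.grad.dtype == w.dtype      # placeholder carries the param dtype
assert w.main_grad.abs().sum() > 0  # real wgrad accumulated into main_grad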
@@ -525,6 +525,11 @@ class _LayerNormLinear(torch.autograd.Function):
         # Handle custom DDP from mcore.
         if ctx.fuse_wgrad_accumulation and hasattr(weight, 'grad_added_to_main_grad'):
             weight.grad_added_to_main_grad = True
+            wgrad = torch.empty(weight.main_grad.shape,
+                                dtype=weight.dtype,
+                                device=torch.cuda.current_device(),
+                                requires_grad=False
+                                )
         elif ctx.fuse_wgrad_accumulation:
             wgrad = None
         else:
@@ -879,6 +879,11 @@ class _LayerNormMLP(torch.autograd.Function):
         # Handle custom DDP from mcore.
         if ctx.fuse_wgrad_accumulation and hasattr(fc1_weight, 'grad_added_to_main_grad'):
             fc1_weight.grad_added_to_main_grad = True
+            fc1_wgrad = torch.empty(fc1_weight.main_grad.shape,
+                                    dtype=fc1_weight.dtype,
+                                    device=torch.cuda.current_device(),
+                                    requires_grad=False
+                                    )
         elif ctx.fuse_wgrad_accumulation:
             fc1_wgrad = None
         else:
@@ -888,6 +893,11 @@ class _LayerNormMLP(torch.autograd.Function):
         # Handle custom DDP from mcore.
         if ctx.fuse_wgrad_accumulation and hasattr(fc2_weight, 'grad_added_to_main_grad'):
             fc2_weight.grad_added_to_main_grad = True
+            fc2_wgrad = torch.empty(fc2_weight.main_grad.shape,
+                                    dtype=fc2_weight.dtype,
+                                    device=torch.cuda.current_device(),
+                                    requires_grad=False
+                                    )
         elif ctx.fuse_wgrad_accumulation:
             fc2_wgrad = None
         else:
@@ -465,6 +465,11 @@ class _Linear(torch.autograd.Function):
         # Handle custom DDP from mcore.
         if ctx.fuse_wgrad_accumulation and hasattr(weight, 'grad_added_to_main_grad'):
             weight.grad_added_to_main_grad = True
+            wgrad = torch.empty(weight.main_grad.shape,
+                                dtype=weight.dtype,
+                                device=torch.cuda.current_device(),
+                                requires_grad=False
+                                )
         elif ctx.fuse_wgrad_accumulation:
             wgrad = None
         else:
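For context on the "Handle custom DDP from mcore" comment: a Megatron-core-style data-parallel wrapper keeps a persistent (typically fp32) main_grad buffer per parameter and folds each step's .grad into it from a gradient hook, skipping the fold when the module has already accumulated directly and set grad_added_to_main_grad. The hook below is a hypothetical illustration of that consumer side, not Megatron-core's actual implementation; attach_main_grad and the use of register_post_accumulate_grad_hook (PyTorch 2.1+) are assumptions made for the sketch.

# Hypothetical consumer-side sketch, not Megatron-core code: illustrates the
# handshake with grad_added_to_main_grad used in the hunks above.
import torch


def attach_main_grad(param: torch.nn.Parameter) -> None:
    """Give `param` a persistent fp32 accumulation buffer plus a grad hook."""
    param.main_grad = torch.zeros_like(param, dtype=torch.float32)
    param.grad_added_to_main_grad = False

    def _on_grad_ready(*_):
        if param.grad is None:
            return
        if not param.grad_added_to_main_grad:
            # Normal path: fold this step's grad into the fp32 buffer.
            param.main_grad.add_(param.grad.to(torch.float32))
        # Fused path: the module already added into main_grad and returned only
        # a placeholder; because that placeholder uses the param dtype, code
        # that assumes param.grad.dtype == param.dtype still holds here.
        param.grad = None
        param.grad_added_to_main_grad = False

    # Requires PyTorch >= 2.1; older stacks hook the grad accumulation node.
    param.register_post_accumulate_grad_hook(_on_grad_ready)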