Unverified Commit d2945c6a authored by Tong Liu's avatar Tong Liu Committed by GitHub
Browse files

[PyTorch] Use dummy wgrad in GroupedLinear (#2305)



dummy wgrad
Signed-off-by: default avatartongliu <tongliu@nvidia.com>
Signed-off-by: default avatarXin Yao <xiny@nvidia.com>
Co-authored-by: default avatarXin Yao <xiny@nvidia.com>
parent 87cb26c6
...@@ -13,6 +13,7 @@ import transformer_engine_torch as tex ...@@ -13,6 +13,7 @@ import transformer_engine_torch as tex
from transformer_engine.common.recipe import Recipe from transformer_engine.common.recipe import Recipe
from .base import ( from .base import (
get_dummy_wgrad,
get_multi_stream_cublas_workspace, get_multi_stream_cublas_workspace,
TransformerEngineBaseModule, TransformerEngineBaseModule,
_2X_ACC_FPROP, _2X_ACC_FPROP,
...@@ -447,18 +448,15 @@ class _GroupedLinear(torch.autograd.Function): ...@@ -447,18 +448,15 @@ class _GroupedLinear(torch.autograd.Function):
): ):
weight.grad_added_to_main_grad = True weight.grad_added_to_main_grad = True
if getattr(weight, "zero_out_wgrad", False): if getattr(weight, "zero_out_wgrad", False):
wgrad = torch.zeros( wgrad = get_dummy_wgrad(
weight.main_grad.shape, list(weight.main_grad.shape),
dtype=weight.dtype, weight.dtype,
device=torch.cuda.current_device(), zero=True,
requires_grad=False,
) )
else: else:
wgrad = torch.empty( wgrad = get_dummy_wgrad(
weight.main_grad.shape, list(weight.main_grad.shape),
dtype=weight.dtype, weight.dtype,
device=torch.cuda.current_device(),
requires_grad=False,
) )
elif ctx.fuse_wgrad_accumulation: elif ctx.fuse_wgrad_accumulation:
wgrad = None wgrad = None
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment