Unverified Commit 0426feb6 authored by Kirthi Shankar Sivamani, committed by GitHub

Consistent docs for fuse_wgrad_accumulation (#289)


Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent 918a9ad7
@@ -593,7 +593,10 @@ class LayerNormLinear(TransformerEngineBaseModule):
-----------------------
fuse_wgrad_accumulation : bool, default = `False`
if set to `True`, enables fusing of creation and accumulation of
- the weight gradient.
+ the weight gradient. When enabled, it is assumed that the weights
+ have an additional `main_grad` attribute (used instead of the
+ regular `grad`) which is a pre-allocated buffer of the correct
+ size to accumulate gradients in.
return_bias : bool, default = `False`
when set to `True`, this module will not apply the additive bias itself, but
instead return the bias value during the forward pass together with the
@@ -906,7 +906,10 @@ class LayerNormMLP(TransformerEngineBaseModule):
-----------------------
fuse_wgrad_accumulation : bool, default = `False`
if set to `True`, enables fusing of creation and accumulation of
- the weight gradient.
+ the weight gradient. When enabled, it is assumed that the weights
+ have an additional `main_grad` attribute (used instead of the
+ regular `grad`) which is a pre-allocated buffer of the correct
+ size to accumulate gradients in.
return_bias : bool, default = `False`
when set to `True`, this module will not apply the additive bias for FC2, but
instead return the bias value during the forward pass together with the
@@ -161,7 +161,10 @@ class TransformerLayer(torch.nn.Module):
-----------------------
fuse_wgrad_accumulation : bool, default = `False`
if set to `True`, enables fusing of creation and accumulation of
- the weight gradient.
+ the weight gradient. When enabled, it is assumed that the weights
+ have an additional `main_grad` attribute (used instead of the
+ regular `grad`) which is a pre-allocated buffer of the correct
+ size to accumulate gradients in.
params_dtype : torch.dtype, default = `torch.float32`
it controls the type used to allocate the initial parameters. Useful when
the model is trained with lower precision and the original FP32 parameters
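All three hunks document the same contract, so a single illustration suffices. Below is a minimal sketch (not part of this commit) of what the added docstring text describes: with `fuse_wgrad_accumulation=True`, each weight is expected to carry a pre-allocated `main_grad` buffer that weight-gradient accumulation targets instead of the regular `.grad`. The layer sizes, dtype, and toy training step are illustrative, and a CUDA device is assumed.

```python
# Sketch of the fuse_wgrad_accumulation contract described in the docstrings
# above (not taken from this commit); shapes and dtype are illustrative.
import torch
import transformer_engine.pytorch as te

layer = te.LayerNormLinear(1024, 1024, fuse_wgrad_accumulation=True).cuda()

# Pre-allocate the `main_grad` accumulation buffers the docstring asks for.
for param in layer.parameters():
    param.main_grad = torch.zeros_like(param, dtype=torch.float32)

inp = torch.randn(32, 1024, device="cuda", requires_grad=True)
layer(inp).sum().backward()

# Inspect where gradients ended up: the fused weight gradient is accumulated
# into `main_grad` rather than `.grad`.
for name, param in layer.named_parameters():
    print(name, float(param.main_grad.abs().sum()))
```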