Unverified commit df39a7c2 authored by Paweł Gadziński, committed by GitHub

Docs fix (#2301)



* init
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* line lengths
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* subtitle --- fix in many files:
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* a lot of small fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* torch_version() change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add missing module and fix warnings
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* removed trailing whitespace
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/api/pytorch.rst
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Fix import
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix more imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix NumPy docstring parameter spacing and indentation

- Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide
- Fix inconsistent indentation in cpu_offload.py docstring
- Modified 51 Python files across transformer_engine/pytorch
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
parent ca468ebe
......@@ -1016,38 +1016,38 @@ def make_graphed_callables(
Positional arguments to callable(s).
num_warmup_iters: int, default = 3
Number of warmup iterations.
allow_unused_input: bool, default = `False`
allow_unused_input: bool, default = False
Whether to handle case where callable inputs
and outputs are disconnected in compute graph.
sample_kwargs: (tuple of) dict, optional
Keyword arguments to callable(s)
pool: (tuple of) int, default = `None`, optional
pool: (tuple of) int, default = None, optional
An instance returned from function `torch.cuda.graph_pool_handle` that hints
this graph may share memory with the indicated pool.
retain_graph_in_backward: bool, default = `False`
retain_graph_in_backward: bool, default = False
Whether to set retain_graph=True in backward graph capture.
_reuse_graph_input_output_buffers: bool, default = `False`
_reuse_graph_input_output_buffers: bool, default = False
Reduce memory usage by reusing input/output data buffers between
graphs. Only supported with Mcore interleaved pipeline parallelism, i.e.
when `_order` is provided. All callables in `modules` are assumed to have
inputs and outputs with the same dtype and shape.
Quantization related parameters
----------------------
enabled: (tuple of) bool, default = `False`
Quantization parameters
-----------------------
enabled: (tuple of) bool, default = False
whether or not to enable low precision quantization (FP8/FP4).
If tuple, the length must match the number of modules.
calibrating: bool, default = `False`
calibrating: bool, default = False
calibration mode allows collecting statistics such as amax and scale
data of quantized tensors even when executing without quantization enabled.
This is useful for saving an inference ready checkpoint while training
using a higher precision.
recipe: recipe.Recipe, default = `None`
recipe: recipe.Recipe, default = None
recipe used for low precision quantization.
amax_reduction_group: torch._C._distributed_c10d.ProcessGroup, default = `None`
amax_reduction_group: torch._C._distributed_c10d.ProcessGroup, default = None
distributed group over which amaxes for the quantized tensors
are reduced at the end of each training step.
cache_quantized_params: bool, default = `False`
cache_quantized_params: bool, default = False
Whether or not to cache quantized weights across microbatches. if set to `True`,
the `is_first_microbatch` boolean argument must be passed into the forward
method for TransformerEngine modules. When storing primary weights in low precision
......
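The hunk above only touches the docstring, but a short usage sketch may help tie the quantization parameters together. This is a hedged example, not the canonical API: it assumes `make_graphed_callables` accepts the quantization arguments under exactly the names documented above (`enabled`, `recipe`), which may be spelled differently in other releases.

```python
# Hedged sketch: capture a TE module as a CUDA graph with quantization enabled.
# The kwarg names `enabled` and `recipe` follow the docstring above and are an
# assumption about this particular TE version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

module = te.Linear(1024, 1024)
sample_input = torch.randn(32, 1024, device="cuda", requires_grad=True)

graphed = te.make_graphed_callables(
    module,
    (sample_input,),           # positional args used during warmup and capture
    num_warmup_iters=3,
    allow_unused_input=False,
    enabled=True,              # turn on low-precision quantization (FP8/FP4)
    recipe=DelayedScaling(),   # quantization recipe
)

out = graphed(sample_input)    # replays the captured graph
out.sum().backward()
```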
......@@ -8,7 +8,7 @@ from functools import wraps
from typing import Callable, Optional, Tuple
import torch
from . import torch_version
from .torch_version import torch_version
from .export import is_in_onnx_export_mode
from .utils import gpu_autocast_ctx
......
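The import change above (module to function) matters because call sites compare the result of `torch_version()` directly. A hedged sketch of the intended usage, assuming the function returns a comparable version tuple:

```python
# Hedged sketch: guard a feature on the PyTorch version via torch_version().
from transformer_engine.pytorch.torch_version import torch_version

if torch_version() >= (2, 7, 0):
    # e.g. symmetric-memory all-reduce paths require PyTorch >= 2.7.0
    use_symmetric_ar = True
else:
    use_symmetric_ar = False
```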
......@@ -20,7 +20,6 @@ import torch.nn.functional as F
from torch.distributed.tensor import DTensor
import transformer_engine_torch as tex
from transformer_engine.common.recipe import Recipe
from ._common import _ParameterInitMeta, noop_cat
from ..quantization import (
......@@ -104,55 +103,55 @@ def initialize_ub(
) -> None:
r"""
Initialize the Userbuffers communicator for overlapping tensor-parallel communications with
GEMM compute in te.Linear, te.LayerNormLinear and te.LayerNormMLP modules.
GEMM compute in ``te.Linear``, ``te.LayerNormLinear`` and ``te.LayerNormMLP`` modules.
Parameters
----------
shape : list
shape of the communication buffer, typically set to be the same as the global shape of
the input tensor to a te.TransformerLayer forward pass, with the sequence and batch
dimensions collapsed together -- i.e.: `(sequence_length * batch_size, hidden_size)`
the input tensor to a ``te.TransformerLayer`` forward pass, with the sequence and batch
dimensions collapsed together -- i.e.: ``(sequence_length * batch_size, hidden_size)``
tp_size : int
number of GPUs in the tensor-parallel process group
use_fp8 : bool = False
allocate the communication buffer for FP8 GEMM inputs/outputs.
DEPRECATED: Please use `quantization_modes` instead.
DEPRECATED: Please use ``quantization_modes`` instead.
quantization_modes : List[UserBufferQuantizationMode] = None
if a list of UserBufferQuantizationMode is provided, a UB communicator is created for each quantization setting in the list.
falls back to the legacy `use_fp8` parameter if `None` is provided.
falls back to the legacy ``use_fp8`` parameter if ``None`` is provided.
dtype : torch.dtype = torch.bfloat16
non-FP8 data type of the communication buffer when `use_fp8 = False`
ub_cfgs: dict = None
Configuration dictionary with the structure
```
{
<gemm_name> : {
"method": <"ring_exchange" or "pipeline">,
"is_reduce_scatter": bool,
"num_sm": int,
"cga_size": int,
"set_sm_margin": bool,
"num_splits": int,
"aggregate": bool,
"atomic_gemm": bool,
"use_ce": bool,
"fp8_buf": bool,
}
}
```
for `te.TransformerLayer` GEMM layers in `["qkv_fprop", "qkv_dgrad", "qkv_wgrad",
non-FP8 data type of the communication buffer when ``use_fp8 = False``
ub_cfgs : dict = None
Configuration dictionary with the structure::
{
<gemm_name> : {
"method": <"ring_exchange" or "pipeline">,
"is_reduce_scatter": bool,
"num_sm": int,
"cga_size": int,
"set_sm_margin": bool,
"num_splits": int,
"aggregate": bool,
"atomic_gemm": bool,
"use_ce": bool,
"fp8_buf": bool,
}
}
for ``te.TransformerLayer`` GEMM layers in ``["qkv_fprop", "qkv_dgrad", "qkv_wgrad",
"proj_fprop", "proj_dgrad", "proj_wgrad", "fc1_fprop", "fc1_dgrad", "fc2_dgrad",
"fc2_fprop", "fc2_wgrad"]`.
a list may be provided to specify different overlap configurations for different the quantization settings in `quantization_modes`
"fc2_fprop", "fc2_wgrad"]``.
a list may be provided to specify different overlap configurations for the different quantization settings in ``quantization_modes``
bootstrap_backend : str = None
`torch.distributed` communication backend for the all-gather, broadcast and
``torch.distributed`` communication backend for the all-gather, broadcast and
barrier collectives during Userbuffers initialization. Not all backends are
valid for every cluster configuration and distributed launch method even if
they are available in PyTorch. When left unset, the initialization prefers
to use the MPI backend, falling back first on Gloo and then NCCL if MPI is
not available. Setting `NVTE_UB_WITH_MPI=1` when building TE overrides this
not available. Setting ``NVTE_UB_WITH_MPI=1`` when building TE overrides this
option and always initializes Userbuffers with direct MPI calls in C++,
which also requires `MPI_HOME=/path/to/mpi/root` to be set at compile time.
which also requires ``MPI_HOME=/path/to/mpi/root`` to be set at compile time.
"""
if not tex.device_supports_multicast():
assert bool(int(os.getenv("UB_SKIPMC", "0"))), (
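For readers skimming the reworked ``ub_cfgs`` description, here is a hedged sketch of how the configuration dictionary might be assembled and passed in. It assumes a distributed launch with NCCL already set up and uses the ``initialize_ub`` entry point documented above; treat the shapes and per-GEMM settings as illustrative only.

```python
# Hedged sketch of Userbuffers initialization following the docstring above.
# Process-group setup, shapes, and per-GEMM settings are illustrative assumptions.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
tp_size = dist.get_world_size()

seq_len, batch_size, hidden_size = 2048, 2, 4096
ub_cfgs = {
    # keyed by the GEMM names listed in the docstring, e.g. "qkv_fprop"
    "qkv_fprop": {
        "method": "ring_exchange",
        "num_splits": 4,
        "set_sm_margin": True,
    },
}

te.module.base.initialize_ub(
    shape=[seq_len * batch_size, hidden_size],  # sequence and batch dims collapsed
    tp_size=tp_size,
    use_fp8=False,
    dtype=torch.bfloat16,
    ub_cfgs=ub_cfgs,
    bootstrap_backend="nccl",
)
```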
......@@ -951,7 +950,7 @@ class TransformerEngineBaseModule(torch.nn.Module, ABC):
Parameters
----------
tp_group : ProcessGroup, default = `None`
tp_group : ProcessGroup, default = None
tensor parallel process group.
"""
self.tp_group = tp_group
......@@ -1345,7 +1344,7 @@ class TransformerEngineBaseModule(torch.nn.Module, ABC):
workspace is being constructed or updated.
cache_name: str, optional
Key for caching.
update_workspace: bool, default = `True`
update_workspace: bool, default = True
Update workspace with values from `tensor`.
skip_update_flag: torch.Tensor, optional
GPU flag to skip updating the workspace. Take precedence
......
......@@ -537,14 +537,14 @@ class GroupedLinear(TransformerEngineBaseModule):
size of each input sample.
out_features : int
size of each output sample.
bias : bool, default = `True`
if set to `False`, the layer will not learn an additive bias.
init_method : Callable, default = `None`
used for initializing weights in the following way: `init_method(weight)`.
When set to `None`, defaults to `torch.nn.init.normal_(mean=0.0, std=0.023)`.
get_rng_state_tracker : Callable, default = `None`
bias : bool, default = True
if set to ``False``, the layer will not learn an additive bias.
init_method : Callable, default = None
used for initializing weights in the following way: ``init_method(weight)``.
When set to ``None``, defaults to ``torch.nn.init.normal_(mean=0.0, std=0.023)``.
get_rng_state_tracker : Callable, default = None
used to get the random number generator state tracker for initializing weights.
rng_tracker_name : str, default = `None`
rng_tracker_name : str, default = None
the param passed to get_rng_state_tracker to get the specific rng tracker.
device : Union[torch.device, str], default = "cuda"
The device on which the parameters of the model will be allocated. It is the user's
......@@ -553,34 +553,36 @@ class GroupedLinear(TransformerEngineBaseModule):
Optimization parameters
-----------------------
fuse_wgrad_accumulation : bool, default = 'False'
if set to `True`, enables fusing of creation and accumulation of
fuse_wgrad_accumulation : bool, default = False
if set to ``True``, enables fusing of creation and accumulation of
the weight gradient. When enabled, it is assumed that the weights
have an additional `main_grad` attribute (used instead of the
regular `grad`) which is a pre-allocated buffer of the correct
have an additional ``main_grad`` attribute (used instead of the
regular ``grad``) which is a pre-allocated buffer of the correct
size to accumulate gradients in. This argument along with
weight tensor having attribute 'overwrite_main_grad' set to True
will overwrite `main_grad` instead of accumulating.
return_bias : bool, default = `False`
when set to `True`, this module will not apply the additive bias itself, but
will overwrite ``main_grad`` instead of accumulating.
return_bias : bool, default = False
when set to ``True``, this module will not apply the additive bias itself, but
instead return the bias value during the forward pass together with the
output of the linear transformation :math:`y = xA^T`. This is useful when
the bias addition can be fused to subsequent operations.
params_dtype : torch.dtype, default = `torch.get_default_dtype()`
params_dtype : torch.dtype, default = torch.get_default_dtype()
it controls the type used to allocate the initial parameters. Useful when
the model is trained with lower precision and the original FP32 parameters
would not fit in GPU memory.
delay_wgrad_compute : bool, default = `False`
delay_wgrad_compute : bool, default = False
Whether to delay weight gradient computation
save_original_input : bool, default = `False`
If set to `True`, always saves the original input tensor rather than the
save_original_input : bool, default = False
If set to ``True``, always saves the original input tensor rather than the
cast tensor. In some scenarios, the input tensor is used by multiple modules,
and saving the original input tensor may reduce the memory usage.
Cannot work with FP8 DelayedScaling recipe.
Note: GroupedLinear doesn't really handle the TP communications inside. The `tp_size` and
`parallel_mode` are used to determine the shapes of weights and biases.
The TP communication should be handled in the dispatch and combine stages of MoE models.
Notes
-----
GroupedLinear doesn't really handle the TP communications inside. The ``tp_size`` and
``parallel_mode`` are used to determine the shapes of weights and biases.
The TP communication should be handled in the dispatch and combine stages of MoE models.
"""
def __init__(
......
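A short, hedged sketch of the grouped GEMM usage implied by the notes above; the forward-pass signature (input plus per-group token counts, here called ``m_splits``) is an assumption about the API and is not spelled out in this hunk.

```python
# Hedged sketch: one GroupedLinear covering several equally-shaped expert GEMMs.
# The per-group split argument (m_splits) is assumed; verify against your TE version.
import torch
import transformer_engine.pytorch as te

num_gemms, in_features, out_features = 4, 1024, 4096
layer = te.GroupedLinear(num_gemms, in_features, out_features, bias=True)

m_splits = [64, 32, 96, 64]               # token counts chosen by MoE dispatch
inp = torch.randn(sum(m_splits), in_features, device="cuda")

out = layer(inp, m_splits)                # one fused call, one GEMM per group
print(out.shape)                          # torch.Size([256, 4096])
```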
......@@ -28,33 +28,30 @@ class LayerNorm(_LayerNormOp):
Parameters
----------
normalized_shape: int or iterable of int
normalized_shape : int or iterable of int
Inner dimensions of input tensor
eps : float, default = 1e-5
A value added to the denominator of layer normalization for
numerical stability
device: torch.device, default = default CUDA device
device : torch.device, default = default CUDA device
Tensor device
dtype: torch.dtype, default = default dtype
dtype : torch.dtype, default = default dtype
Tensor datatype
zero_centered_gamma : bool, default = 'False'
If `True`, the :math:`\gamma` parameter is initialized to zero
If ``True``, the :math:`\gamma` parameter is initialized to zero
and the calculation changes to
.. math::
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}} * (1 + \gamma) + \beta
sm_margin: int or dict, default = 0
sm_margin : int or dict, default = 0
Number of SMs to exclude when launching CUDA kernels. This
helps overlap with other kernels, e.g. communication kernels.
For more fine-grained control, provide a dict with the SM
margin at each compute stage ("forward", "backward",
"inference").
Legacy
------
sequence_parallel: bool
Set a bool attr named `sequence_parallel` in the parameters.
margin at each compute stage (``"forward"``, ``"backward"``,
``"inference"``).
sequence_parallel : bool
**Legacy parameter.** Set a bool attr named ``sequence_parallel`` in the parameters.
This is custom logic for Megatron-LM integration.
"""
......
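To make the ``zero_centered_gamma`` description above concrete, a minimal sketch: with the weight initialized to zero the layer applies (1 + gamma), so a freshly constructed layer should match plain layer normalization. Module path and defaults are assumed to follow the docstring.

```python
# Minimal sketch of zero_centered_gamma: gamma starts at 0 and the layer applies
# (1 + gamma), so an untrained layer matches standard layer normalization.
import torch
import transformer_engine.pytorch as te

x = torch.randn(8, 1024, device="cuda")

ln = te.LayerNorm(1024, eps=1e-5, zero_centered_gamma=True)
ref = torch.nn.functional.layer_norm(x, (1024,), eps=1e-5)

torch.testing.assert_close(ln(x), ref, rtol=1e-3, atol=1e-3)
```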
......@@ -15,7 +15,7 @@ from torch.nn import init
import transformer_engine_torch as tex
from transformer_engine.common.recipe import Recipe
from transformer_engine.pytorch import torch_version
from transformer_engine.pytorch.torch_version import torch_version
from transformer_engine.pytorch.tensor.utils import is_custom
from .base import (
fill_userbuffers_buffer_for_all_gather,
......@@ -1045,20 +1045,20 @@ class LayerNormLinear(TransformerEngineBaseModule):
size of each output sample.
eps : float, default = 1e-5
a value added to the denominator of layer normalization for numerical stability.
bias : bool, default = `True`
if set to `False`, the layer will not learn an additive bias.
bias : bool, default = True
if set to ``False``, the layer will not learn an additive bias.
normalization : { 'LayerNorm', 'RMSNorm' }, default = 'LayerNorm'
type of normalization applied.
init_method : Callable, default = `None`
used for initializing weights in the following way: `init_method(weight)`.
When set to `None`, defaults to `torch.nn.init.normal_(mean=0.0, std=0.023)`.
return_layernorm_output : bool, default = `False`
if set to `True`, output of layernorm is returned from the forward
init_method : Callable, default = None
used for initializing weights in the following way: ``init_method(weight)``.
When set to ``None``, defaults to ``torch.nn.init.normal_(mean=0.0, std=0.023)``.
return_layernorm_output : bool, default = False
if set to ``True``, output of layernorm is returned from the forward
together with the output of the linear transformation.
Example use case: residual connection for transformer module is
taken post layernorm.
return_layernorm_output_gathered : bool, default = `False`
if set to `True`, output of layernorm is returned after the all
return_layernorm_output_gathered : bool, default = False
if set to ``True``, output of layernorm is returned after the all
gather operation. Ignored if return_layernorm_output is False.
Example use case: with sequence parallel, input to residual connection
for transformer module (e.g. LoRA) will need to be gathered.
......@@ -1069,10 +1069,10 @@ class LayerNormLinear(TransformerEngineBaseModule):
they are used to make the names of equally-sized parameters. If a dict
(preferably an OrderedDict) is provided, the keys are used as names and
values as split sizes along dim 0. The resulting parameters will have
names that end in `_weight` or `_bias`, so trailing underscores are
names that end in ``_weight`` or ``_bias``, so trailing underscores are
stripped from any provided names.
zero_centered_gamma : bool, default = 'False'
if set to 'True', gamma parameter in LayerNorm is initialized to 0 and
if set to ``'True'``, gamma parameter in LayerNorm is initialized to 0 and
the LayerNorm formula changes to
.. math::
......@@ -1082,53 +1082,53 @@ class LayerNormLinear(TransformerEngineBaseModule):
The device on which the parameters of the model will be allocated. It is the user's
responsibility to ensure all parameters are moved to the GPU before running the
forward pass.
name: str, default = `None`
name : str, default = None
name of the module, currently used for debugging purposes.
Parallelism parameters
----------------------
sequence_parallel : bool, default = `False`
if set to `True`, uses sequence parallelism.
tp_group : ProcessGroup, default = `None`
sequence_parallel : bool, default = False
if set to ``True``, uses sequence parallelism.
tp_group : ProcessGroup, default = None
tensor parallel process group.
tp_size : int, default = 1
used as TP (tensor parallel) world size when TP groups are not formed during
initialization. In this case, users must call the
`set_tensor_parallel_group(tp_group)` method on the initialized module before the
``set_tensor_parallel_group(tp_group)`` method on the initialized module before the
forward pass to supply the tensor parallel group needed for tensor and sequence
parallel collectives.
parallel_mode : {None, 'column', 'row'}, default = `None`
parallel_mode : {None, 'column', 'row'}, default = None
used to decide whether this Linear layer is Column Parallel Linear or Row
Parallel Linear as described `here <https://arxiv.org/pdf/1909.08053.pdf>`_.
When set to `None`, no communication is performed.
When set to ``None``, no communication is performed.
Optimization parameters
-----------------------
fuse_wgrad_accumulation : bool, default = 'False'
if set to `True`, enables fusing of creation and accumulation of
if set to ``True``, enables fusing of creation and accumulation of
the weight gradient. When enabled, it is assumed that the weights
have an additional `main_grad` attribute (used instead of the
regular `grad`) which is a pre-allocated buffer of the correct
have an additional ``main_grad`` attribute (used instead of the
regular ``grad``) which is a pre-allocated buffer of the correct
size to accumulate gradients in. This argument along with
weight tensor having attribute 'overwrite_main_grad' set to True
will overwrite `main_grad` instead of accumulating.
return_bias : bool, default = `False`
when set to `True`, this module will not apply the additive bias itself, but
will overwrite ``main_grad`` instead of accumulating.
return_bias : bool, default = False
when set to ``True``, this module will not apply the additive bias itself, but
instead return the bias value during the forward pass together with the
output of the linear transformation :math:`y = xA^T`. This is useful when
the bias addition can be fused to subsequent operations.
params_dtype : torch.dtype, default = `torch.get_default_dtype()`
params_dtype : torch.dtype, default = torch.get_default_dtype()
it controls the type used to allocate the initial parameters. Useful when
the model is trained with lower precision and the original FP32 parameters
would not fit in GPU memory.
delay_wgrad_compute : bool, default = `False`
Whether or not to delay weight gradient computation. If set to `True`,
it's the user's responsibility to call `module.backward_dw` to compute
delay_wgrad_compute : bool, default = False
Whether or not to delay weight gradient computation. If set to ``True``,
it's the user's responsibility to call ``module.backward_dw`` to compute
weight gradients.
symmetric_ar_type : {None, 'multimem_all_reduce', 'two_shot', 'one_shot'}, default = None
Type of symmetric memory all-reduce to use during the forward pass.
This can help in latency bound communication situations.
Requires PyTorch version 2.7.0 or higher. When set to None, standard all-reduce
Requires PyTorch version 2.7.0 or higher. When set to ``None``, standard all-reduce
is used.
"""
......
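As a quick illustration of the ``return_layernorm_output`` flag documented above, a hedged single-GPU sketch (parallelism arguments omitted):

```python
# Hedged sketch: LayerNormLinear returning both the linear output and the
# post-layernorm activation, e.g. for a residual connection taken post-layernorm.
import torch
import transformer_engine.pytorch as te

layer = te.LayerNormLinear(
    1024, 3072,
    bias=True,
    normalization="LayerNorm",
    return_layernorm_output=True,
)

x = torch.randn(32, 1024, device="cuda")
out, ln_out = layer(x)    # linear output and layernorm output
residual = ln_out         # residual branch taken after the normalization
```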
......@@ -16,7 +16,7 @@ from torch.nn import init
import transformer_engine_torch as tex
from transformer_engine.common.recipe import Recipe
from transformer_engine.pytorch import torch_version
from transformer_engine.pytorch.torch_version import torch_version
from transformer_engine.pytorch.tensor.utils import is_custom
from .base import (
fill_userbuffers_buffer_for_all_gather,
......@@ -1661,38 +1661,38 @@ class LayerNormMLP(TransformerEngineBaseModule):
intermediate size to which input samples are projected.
eps : float, default = 1e-5
a value added to the denominator of layer normalization for numerical stability.
bias : bool, default = `True`
if set to `False`, the FC1 and FC2 layers will not learn an additive bias.
bias : bool, default = True
if set to ``False``, the FC1 and FC2 layers will not learn an additive bias.
normalization : { 'LayerNorm', 'RMSNorm' }, default = 'LayerNorm'
type of normalization applied.
activation : str, default = 'gelu'
activation function used.
Options: 'gelu', 'geglu', 'qgelu', 'qgeglu', 'relu', 'reglu', 'srelu', 'sreglu',
'silu', 'swiglu', and 'clamped_swiglu'.
activation_params : dict, default = `None`
Options: ``'gelu'``, ``'geglu'``, ``'qgelu'``, ``'qgeglu'``, ``'relu'``, ``'reglu'``, ``'srelu'``, ``'sreglu'``,
``'silu'``, ``'swiglu'``, and ``'clamped_swiglu'``.
activation_params : dict, default = None
Additional parameters for the activation function.
At the moment, only used for 'clamped_swiglu' activation which
supports 'limit' and 'alpha' parameters.
init_method : Callable, default = `None`
used for initializing FC1 weights in the following way: `init_method(weight)`.
When set to `None`, defaults to `torch.nn.init.normal_(mean=0.0, std=0.023)`.
output_layer_init_method : Callable, default = `None`
At the moment, only used for ``'clamped_swiglu'`` activation which
supports ``'limit'`` and ``'alpha'`` parameters.
init_method : Callable, default = None
used for initializing FC1 weights in the following way: ``init_method(weight)``.
When set to ``None``, defaults to ``torch.nn.init.normal_(mean=0.0, std=0.023)``.
output_layer_init_method : Callable, default = None
used for initializing FC2 weights in the following way:
`output_layer_init_method(weight)`. When set to `None`, defaults to
`torch.nn.init.normal_(mean=0.0, std=0.023)`.
return_layernorm_output : bool, default = `False`
if set to `True`, output of layernorm is returned from the forward
``output_layer_init_method(weight)``. When set to ``None``, defaults to
``torch.nn.init.normal_(mean=0.0, std=0.023)``.
return_layernorm_output : bool, default = False
if set to ``True``, output of layernorm is returned from the :meth:`forward` method
together with the output of the linear transformation.
Example use case: residual connection for transformer module
is taken post layernorm.
return_layernorm_output_gathered : bool, default = `False`
if set to `True`, output of layernorm is returned after the all
gather operation. Ignored if return_layernorm_output is False.
return_layernorm_output_gathered : bool, default = False
if set to ``True``, output of layernorm is returned after the all
gather operation. Ignored if ``return_layernorm_output`` is False.
Example use case: with sequence parallel, input to residual connection
for transformer module (e.g. LoRA) will need to be gathered.
Returning layernorm output gathered will prevent a redundant gather.
zero_centered_gamma : bool, default = 'False'
if set to 'True', gamma parameter in LayerNorm is initialized to 0 and
zero_centered_gamma : bool, default = False
if set to ``True``, gamma parameter in LayerNorm is initialized to 0 and
the LayerNorm formula changes to
.. math::
......@@ -1702,62 +1702,62 @@ class LayerNormMLP(TransformerEngineBaseModule):
The device on which the parameters of the model will be allocated. It is the user's
responsibility to ensure all parameters are moved to the GPU before running the
forward pass.
name: str, default = `None`
name : str, default = None
name of the module, currently used for debugging purposes.
Parallelism parameters
----------------------
set_parallel_mode : bool, default = `False`
if set to `True`, FC1 is used as Column Parallel and FC2 is used as Row
set_parallel_mode : bool, default = False
if set to ``True``, FC1 is used as Column Parallel and FC2 is used as Row
Parallel as described `here <https://arxiv.org/pdf/1909.08053.pdf>`_.
sequence_parallel : bool, default = `False`
if set to `True`, uses sequence parallelism.
tp_group : ProcessGroup, default = `None`
sequence_parallel : bool, default = False
if set to ``True``, uses sequence parallelism.
tp_group : ProcessGroup, default = None
tensor parallel process group.
tp_size : int, default = 1
used as TP (tensor parallel) world size when TP groups are not formed during
initialization. In this case, users must call the
`set_tensor_parallel_group(tp_group)` method on the initialized module before the
``set_tensor_parallel_group(tp_group)`` method on the initialized module before the
forward pass to supply the tensor parallel group needed for tensor and sequence
parallel collectives.
Optimization parameters
-----------------------
fuse_wgrad_accumulation : bool, default = 'False'
if set to `True`, enables fusing of creation and accumulation of
fuse_wgrad_accumulation : bool, default = False
if set to ``True``, enables fusing of creation and accumulation of
the weight gradient. When enabled, it is assumed that the weights
have an additional `main_grad` attribute (used instead of the
regular `grad`) which is a pre-allocated buffer of the correct
have an additional ``main_grad`` attribute (used instead of the
regular ``grad``) which is a pre-allocated buffer of the correct
size to accumulate gradients in. This argument along with
weight tensor having attribute 'overwrite_main_grad' set to True
will overwrite `main_grad` instead of accumulating.
return_bias : bool, default = `False`
when set to `True`, this module will not apply the additive bias for FC2, but
weight tensor having attribute ``'overwrite_main_grad'`` set to True
will overwrite ``main_grad`` instead of accumulating.
return_bias : bool, default = False
when set to ``True``, this module will not apply the additive bias for FC2, but
instead return the bias value during the forward pass together with the
output of the linear transformation :math:`y = xA^T`. This is useful when
the bias addition can be fused to subsequent operations.
params_dtype : torch.dtype, default = `torch.get_default_dtype()`
params_dtype : torch.dtype, default = torch.get_default_dtype()
it controls the type used to allocate the initial parameters. Useful when
the model is trained with lower precision and the original FP32 parameters
would not fit in GPU memory.
seq_length: int
seq_length : int
sequence length of input samples. Needed for JIT Warmup, a technique where jit fused
functions are warmed up before training to ensure same kernels are used for forward
propagation and activation recompute phase.
micro_batch_size: int
micro_batch_size : int
batch size per training step. Needed for JIT Warmup, a technique where jit
fused functions are warmed up before training to ensure same kernels are
used for forward propagation and activation recompute phase.
delay_wgrad_compute : bool, default = `False`
Whether or not to delay weight gradient computation. If set to `True`,
it's the user's responsibility to call `module.backward_dw` to compute
delay_wgrad_compute : bool, default = False
Whether or not to delay weight gradient computation. If set to ``True``,
it's the user's responsibility to call :meth:`backward_dw` to compute
weight gradients.
symmetric_ar_type : {None, 'multimem_all_reduce', 'two_shot', 'one_shot'}, default = None
Type of symmetric memory all-reduce to use during the forward pass.
This can help in latency bound communication situations.
Requires PyTorch version 2.7.0 or higher. When set to None, standard all-reduce
Requires PyTorch version 2.7.0 or higher. When set to ``None``, standard all-reduce
is used.
checkpoint: bool, default = False
checkpoint : bool, default = False
whether to use selective activation checkpointing, where activations are not saved for bwd,
and instead are recomputed (skipping fc2, as it is not needed for backward). Trades compute
for memory. default is false, in which activations are saved in fwd. not supported for onnx forward
......@@ -2235,7 +2235,7 @@ class LayerNormMLP(TransformerEngineBaseModule):
self, inp: torch.Tensor, is_grad_enabled: bool
) -> Union[torch.Tensor, Tuple[torch.Tensor, ...]]:
"""
ONNX-compatible version of the forward function that provides numerical equivalence
ONNX-compatible version of the :meth:`forward` method that provides numerical equivalence
while only using operations that have defined ONNX symbolic translations.
This simplified implementation is designed specifically for inference scenarios.
"""
......
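A hedged single-GPU sketch of ``LayerNormMLP`` with one of the activations listed above; the ``activation_params`` keys (``'limit'``, ``'alpha'``) follow the docstring and may vary between releases.

```python
# Hedged sketch: LayerNormMLP = layernorm -> FC1 -> activation -> FC2.
# clamped_swiglu parameter names follow the docstring above (assumed, not verified).
import torch
import transformer_engine.pytorch as te

mlp = te.LayerNormMLP(
    hidden_size=1024,
    ffn_hidden_size=4096,
    activation="clamped_swiglu",
    activation_params={"limit": 7.0, "alpha": 1.702},
)

x = torch.randn(32, 1024, device="cuda")
y = mlp(x)
print(y.shape)    # torch.Size([32, 1024])
```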
......@@ -13,7 +13,7 @@ import torch
import transformer_engine_torch as tex
from transformer_engine.common.recipe import Recipe
from transformer_engine.pytorch import torch_version
from transformer_engine.pytorch.torch_version import torch_version
from .base import (
fill_userbuffers_buffer_for_all_gather,
......@@ -985,7 +985,7 @@ class _Linear(torch.autograd.Function):
class Linear(TransformerEngineBaseModule):
"""Applies a linear transformation to the incoming data :math:`y = xA^T + b`
On NVIDIA GPUs it is a drop-in replacement for `torch.nn.Linear`.
On NVIDIA GPUs it is a drop-in replacement for ``torch.nn.Linear``.
Parameters
----------
......@@ -993,14 +993,14 @@ class Linear(TransformerEngineBaseModule):
size of each input sample.
out_features : int
size of each output sample.
bias : bool, default = `True`
if set to `False`, the layer will not learn an additive bias.
init_method : Callable, default = `None`
used for initializing weights in the following way: `init_method(weight)`.
When set to `None`, defaults to `torch.nn.init.normal_(mean=0.0, std=0.023)`.
get_rng_state_tracker : Callable, default = `None`
bias : bool, default = True
if set to ``False``, the layer will not learn an additive bias.
init_method : Callable, default = None
used for initializing weights in the following way: ``init_method(weight)``.
When set to ``None``, defaults to ``torch.nn.init.normal_(mean=0.0, std=0.023)``.
get_rng_state_tracker : Callable, default = None
used to get the random number generator state tracker for initializing weights.
rng_tracker_name : str, default = `None`
rng_tracker_name : str, default = None
the param passed to get_rng_state_tracker to get the specific rng tracker.
parameters_split : Optional[Union[Tuple[str, ...], Dict[str, int]]], default = None
Configuration for splitting the weight and bias tensors along dim 0 into
......@@ -1008,62 +1008,62 @@ class Linear(TransformerEngineBaseModule):
they are used to make the names of equally-sized parameters. If a dict
(preferably an OrderedDict) is provided, the keys are used as names and
values as split sizes along dim 0. The resulting parameters will have
names that end in `_weight` or `_bias`, so trailing underscores are
names that end in ``_weight`` or ``_bias``, so trailing underscores are
stripped from any provided names.
device : Union[torch.device, str], default = "cuda"
The device on which the parameters of the model will be allocated. It is the user's
responsibility to ensure all parameters are moved to the GPU before running the
forward pass.
name: str, default = `None`
name : str, default = None
name of the module, currently used for debugging purposes.
Parallelism parameters
----------------------
sequence_parallel : bool, default = `False`
if set to `True`, uses sequence parallelism.
tp_group : ProcessGroup, default = `None`
sequence_parallel : bool, default = False
if set to ``True``, uses sequence parallelism.
tp_group : ProcessGroup, default = None
tensor parallel process group.
tp_size : int, default = 1
used as TP (tensor parallel) world size when TP groups are not formed during
initialization. In this case, users must call the
`set_tensor_parallel_group(tp_group)` method on the initialized module before the
``set_tensor_parallel_group(tp_group)`` method on the initialized module before the
forward pass to supply the tensor parallel group needed for tensor and sequence
parallel collectives.
parallel_mode : {None, 'column', 'row'}, default = `None`
parallel_mode : {None, 'column', 'row'}, default = None
used to decide whether this Linear layer is Column Parallel Linear or Row
Parallel Linear as described `here <https://arxiv.org/pdf/1909.08053.pdf>`_.
When set to `None`, no communication is performed.
When set to ``None``, no communication is performed.
Optimization parameters
-----------------------
fuse_wgrad_accumulation : bool, default = 'False'
if set to `True`, enables fusing of creation and accumulation of
if set to ``True``, enables fusing of creation and accumulation of
the weight gradient. When enabled, it is assumed that the weights
have an additional `main_grad` attribute (used instead of the
regular `grad`) which is a pre-allocated buffer of the correct
have an additional ``main_grad`` attribute (used instead of the
regular ``grad``) which is a pre-allocated buffer of the correct
size to accumulate gradients in. This argument along with
weight tensor having attribute 'overwrite_main_grad' set to True
will overwrite `main_grad` instead of accumulating.
return_bias : bool, default = `False`
when set to `True`, this module will not apply the additive bias itself, but
will overwrite ``main_grad`` instead of accumulating.
return_bias : bool, default = False
when set to ``True``, this module will not apply the additive bias itself, but
instead return the bias value during the forward pass together with the
output of the linear transformation :math:`y = xA^T`. This is useful when
the bias addition can be fused to subsequent operations.
params_dtype : torch.dtype, default = `torch.get_default_dtype()`
params_dtype : torch.dtype, default = torch.get_default_dtype()
it controls the type used to allocate the initial parameters. Useful when
the model is trained with lower precision and the original FP32 parameters
would not fit in GPU memory.
delay_wgrad_compute : bool, default = `False`
Whether or not to delay weight gradient computation. If set to `True`,
it's the user's responsibility to call `module.backward_dw` to compute
delay_wgrad_compute : bool, default = False
Whether or not to delay weight gradient computation. If set to ``True``,
it's the user's responsibility to call ``module.backward_dw`` to compute
weight gradients.
symmetric_ar_type : {None, 'multimem_all_reduce', 'two_shot', 'one_shot'}, default = None
Type of symmetric memory all-reduce to use during the forward pass.
This can help in latency bound communication situations.
Requires PyTorch version 2.7.0 or higher. When set to None, standard all-reduce
Requires PyTorch version 2.7.0 or higher. When set to ``None``, standard all-reduce
is used.
save_original_input : bool, default = `False`
If set to `True`, always saves the original input tensor rather than the
save_original_input : bool, default = False
If set to ``True``, always saves the original input tensor rather than the
cast tensor. In some scenarios, the input tensor is used by multiple modules,
and saving the original input tensor may reduce the memory usage.
Cannot work with FP8 DelayedScaling recipe.
......
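To illustrate the ``parameters_split`` behaviour described above (names ending in ``_weight``/``_bias``, trailing underscores stripped), a hedged sketch of a fused QKV projection:

```python
# Hedged sketch: one fused QKV projection with separately named parameter splits.
import torch
import transformer_engine.pytorch as te

qkv = te.Linear(
    1024, 3 * 1024,
    bias=True,
    parameters_split=("query", "key", "value"),  # yields query_weight, query_bias, ...
)

print(sorted(name for name, _ in qkv.named_parameters()))
# expected to include 'query_weight', 'key_weight', 'value_weight' and matching biases

x = torch.randn(32, 1024, device="cuda")
y = qkv(x)
```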
......@@ -33,32 +33,29 @@ class RMSNorm(_RMSNormOp):
Parameters
----------
normalized_shape: int or iterable of int
normalized_shape : int or iterable of int
Inner dimensions of input tensor
eps : float, default = 1e-5
A value added to the denominator for numerical stability
device: torch.device, default = default CUDA device
device : torch.device, default = default CUDA device
Tensor device
dtype: torch.dtype, default = default dtype
dtype : torch.dtype, default = default dtype
Tensor datatype
zero_centered_gamma : bool, default = 'False'
If `True`, the :math:`\gamma` parameter is initialized to zero
zero_centered_gamma : bool, default = False
If ``True``, the :math:`\gamma` parameter is initialized to zero
and the calculation changes to
.. math::
y = \frac{x}{\sqrt{\mathrm{Var}[x] + \varepsilon}} * (1 + \gamma)
sm_margin: int, default = 0
sm_margin : int, default = 0
Number of SMs to exclude when launching CUDA kernels. This
helps overlap with other kernels, e.g. communication kernels.
For more fine-grained control, provide a dict with the SM
margin at each compute stage ("forward", "backward",
"inference").
Legacy
------
sequence_parallel: bool
Set a bool attr named `sequence_parallel` in the parameters.
margin at each compute stage (``"forward"``, ``"backward"``,
``"inference"``).
sequence_parallel : bool
**Legacy parameter.** Set a bool attr named ``sequence_parallel`` in the parameters.
This is custom logic for Megatron-LM integration.
"""
......
......@@ -10,7 +10,7 @@ from typing import Optional
import torch
from transformer_engine_torch import FP8TensorMeta
from .. import torch_version
from ..torch_version import torch_version
from ..quantization import FP8GlobalStateManager
from ..tensor.float8_tensor import Float8Tensor
from ..quantized_tensor import QuantizedTensorStorage
......
......@@ -53,7 +53,7 @@ class _ActivationOperation(BasicOperation, metaclass=abc.ABCMeta):
Parameters
----------
cache_quantized_input: bool, default = False
cache_quantized_input : bool, default = False
Quantize input tensor when caching for use in the backward
pass. This will typically reduce memory usage but require
extra compute and increase numerical error. This feature is
......@@ -408,11 +408,11 @@ class ClampedSwiGLU(_ActivationOperation):
Parameters
----------
limit: float
limit : float
The clamp limit.
alpha: float
alpha : float
The scaling factor for the sigmoid function used in the activation.
cache_quantized_input: bool, default = False
cache_quantized_input : bool, default = False
Quantize input tensor when caching for use in the backward pass.
"""
......
......@@ -23,7 +23,7 @@ class AllGather(BasicOperation):
Parameters
----------
process_group: torch.distributed.ProcessGroup, default = world group
process_group : torch.distributed.ProcessGroup, default = world group
Process group for communication
"""
......
......@@ -24,7 +24,7 @@ class AllReduce(BasicOperation):
Parameters
----------
process_group: torch.distributed.ProcessGroup, default = world group
process_group : torch.distributed.ProcessGroup, default = world group
Process group for communication
"""
......
......@@ -53,27 +53,27 @@ class BasicLinear(BasicOperation):
Parameters
----------
in_features: int
in_features : int
Inner dimension of input tensor
out_features: int
out_features : int
Inner dimension of output tensor
device: torch.device, default = default CUDA device
device : torch.device, default = default CUDA device
Tensor device
dtype: torch.dtype, default = default dtype
dtype : torch.dtype, default = default dtype
Tensor datatype
tensor_parallel_mode: {`None`, "column", "row"}, default = `None`
tensor_parallel_mode : {`None`, "column", "row"}, default = `None`
Mode for tensor parallelism
tensor_parallel_group: torch.distributed.ProcessGroup, default = world group
tensor_parallel_group : torch.distributed.ProcessGroup, default = world group
Process group for tensor parallelism
sequence_parallel: bool, default = `False`
sequence_parallel : bool, default = `False`
Whether to apply sequence parallelism together with tensor
parallelism, i.e. distributing input or output tensors along
outer dimension (sequence or batch dim) when not distributing
along inner dimension (embedding dim)
rng_state_tracker_function: callable
rng_state_tracker_function : callable
Function that returns `CudaRNGStatesTracker`, which is used
for model-parallel weight initialization
accumulate_into_main_grad: bool, default = `False`
accumulate_into_main_grad : bool, default = `False`
Whether to directly accumulate weight gradients into the
weight's `main_grad` attribute instead of relying on PyTorch
autograd. The weight's `main_grad` must be set externally and
......
......@@ -22,16 +22,16 @@ class Bias(BasicOperation):
Parameters
----------
size: int
size : int
Inner dimension of input tensor
device: torch.device, default = default CUDA device
device : torch.device, default = default CUDA device
Tensor device
dtype: torch.dtype, default = default dtype
dtype : torch.dtype, default = default dtype
Tensor datatype
tensor_parallel: bool, default = `False`
tensor_parallel : bool, default = `False`
Whether to distribute input tensor and bias tensors along
inner dimension
tensor_parallel_group: torch.distributed.ProcessGroup, default = world group
tensor_parallel_group : torch.distributed.ProcessGroup, default = world group
Process group for tensor parallelism
"""
......
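The operation classes touched in the hunks above (``BasicLinear``, ``Bias``, the communication ops) are building blocks of the fusible-ops API. A hedged composition sketch, assuming a ``Sequential`` container is exposed under ``te.ops`` in this version:

```python
# Hedged sketch of composing basic operations; module paths under te.ops are assumed.
import torch
import transformer_engine.pytorch as te

model = te.ops.Sequential(
    te.ops.BasicLinear(1024, 4096),
    te.ops.Bias(4096),
)

x = torch.randn(32, 1024, device="cuda")
y = model(x)    # adjacent ops may be fused where kernels support it
print(y.shape)  # torch.Size([32, 4096])
```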
......@@ -10,7 +10,7 @@ import os
import torch
from ... import torch_version
from ...torch_version import torch_version
from ...cpu_offload import is_cpu_offload_enabled, mark_activation_offload
from ...jit import (
l2normalization_fused,
......@@ -40,11 +40,11 @@ class L2Normalization(BasicOperation):
----------
eps : float, default = 1e-6
A value added to the denominator for numerical stability
seq_length: int, default = None
seq_length : int, default = None
sequence length of input samples. Needed for JIT Warmup, a technique where jit fused
functions are warmed up before training to ensure same kernels are used for forward
propagation and activation recompute phase.
micro_batch_size: int, default = None
micro_batch_size : int, default = None
batch size per training step. Needed for JIT Warmup, a technique where jit
fused functions are warmed up before training to ensure same kernels are
used for forward propagation and activation recompute phase.
......
......@@ -42,14 +42,14 @@ class LayerNorm(BasicOperation):
Parameters
----------
normalized_shape: int or iterable of int
normalized_shape : int or iterable of int
Inner dimensions of input tensor
eps : float, default = 1e-5
A value added to the denominator of layer normalization for
numerical stability
device: torch.device, default = default CUDA device
device : torch.device, default = default CUDA device
Tensor device
dtype: torch.dtype, default = default dtype
dtype : torch.dtype, default = default dtype
Tensor datatype
zero_centered_gamma : bool, default = 'False'
If `True`, the :math:`\gamma` parameter is initialized to zero
......@@ -58,7 +58,7 @@ class LayerNorm(BasicOperation):
.. math::
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}} * (1 + \gamma) + \beta
sm_margin: int or dict, default = 0
sm_margin : int or dict, default = 0
Number of SMs to exclude when launching CUDA kernels. This
helps overlap with other kernels, e.g. communication kernels.
For more fine-grained control, provide a dict with the SM
......
......@@ -23,9 +23,9 @@ class Quantize(BasicOperation):
Parameters
----------
forward: bool, default = `True`
forward : bool, default = `True`
Perform quantization in forward pass
backward: bool, default = `False`
backward : bool, default = `False`
Perform quantization in backward pass
"""
......
......@@ -23,7 +23,7 @@ class ReduceScatter(BasicOperation):
Parameters
----------
process_group: torch.distributed.ProcessGroup, default = world group
process_group : torch.distributed.ProcessGroup, default = world group
Process group for communication
"""
......
......@@ -24,7 +24,7 @@ class Reshape(BasicOperation):
Parameters
----------
shape: iterable of int
shape : iterable of int
Output tensor dimensions. If one dimension is -1, it is
inferred based on input tensor dimensions.
......