Unverified commit c4a5cb85 authored by Paweł Gadziński, committed by GitHub

[PyTorch] Add GroupedLinear to the docs and fix typos (#1206)



* Docs fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
parent 209b8e5a
@@ -9,6 +9,9 @@ pyTorch
 .. autoapiclass:: transformer_engine.pytorch.Linear(in_features, out_features, bias=True, **kwargs)
   :members: forward, set_tensor_parallel_group
+.. autoapiclass:: transformer_engine.pytorch.GroupedLinear(in_features, out_features, bias=True, **kwargs)
+  :members: forward, set_tensor_parallel_group
 .. autoapiclass:: transformer_engine.pytorch.LayerNorm(hidden_size, eps=1e-5, **kwargs)
 .. autoapiclass:: transformer_engine.pytorch.RMSNorm(hidden_size, eps=1e-5, **kwargs)
...
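The newly documented `GroupedLinear` applies a separate linear transform to each of several contiguous splits of the input, as in a grouped GEMM. The following pure-Python sketch (an illustration only, not the Transformer Engine implementation; the `m_splits` name mirrors the module's forward argument) shows the semantics:

```python
# Conceptual sketch of GroupedLinear semantics (NOT the TE implementation):
# the input rows are split according to m_splits, and each split is
# multiplied by its own weight matrix.
def grouped_linear(x, weights, m_splits):
    """x: list of input rows; weights: one (out x in) matrix per group."""
    assert sum(m_splits) == len(x)
    out, start = [], 0
    for w, m in zip(weights, m_splits):
        for row in x[start:start + m]:
            # y = x @ w_g^T for the rows belonging to group g
            out.append([sum(r * c for r, c in zip(row, col)) for col in w])
        start += m
    return out

# two groups: 2 rows through w0, 1 row through w1
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]  # identity for group 0
w1 = [[1.0, 1.0]]              # sums the two features for group 1
print(grouped_linear(x, [w0, w1], [2, 1]))  # → [[1.0, 2.0], [3.0, 4.0], [11.0]]
```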
@@ -7853,7 +7853,7 @@ class MultiheadAttention(torch.nn.Module):
 bias : bool, default = `True`
     if set to `False`, the transformer layer will not learn any additive biases.
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
 qkv_format: str, default = `sbhd`
...
@@ -528,11 +528,11 @@ class GroupedLinear(TransformerEngineBaseModule):
     used for initializing weights in the following way: `init_method(weight)`.
     When set to `None`, defaults to `torch.nn.init.normal_(mean=0.0, std=0.023)`.
 get_rng_state_tracker : Callable, default = `None`
-    used to get the random number generator state tracker for initilizeing weights.
+    used to get the random number generator state tracker for initializing weights.
 rng_tracker_name : str, default = `None`
     the param passed to get_rng_state_tracker to get the specific rng tracker.
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
@@ -548,7 +548,7 @@ class GroupedLinear(TransformerEngineBaseModule):
     `set_tensor_parallel_group(tp_group)` method on the initialized module before the
     forward pass to supply the tensor parallel group needed for tensor and sequence
     parallel collectives.
-parallel_mode : {None, 'Column', 'Row'}, default = `None`
+parallel_mode : {None, 'column', 'row'}, default = `None`
     used to decide whether this GroupedLinear layer is Column Parallel Linear or Row
     Parallel Linear as described `here <https://arxiv.org/pdf/1909.08053.pdf>`_.
     When set to `None`, no communication is performed.
...
@@ -110,7 +110,7 @@ class LayerNorm(torch.nn.Module):
     y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \varepsilon}} *
         (1 + \gamma) + \beta
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
 """
...
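Note that the formula in this docstring scales by `(1 + gamma)` rather than `gamma` (zero-centered gamma, so a zero-initialized weight gives an identity scale). A minimal pure-Python sketch of that formula, not the TE kernel:

```python
import math

def layernorm_zero_centered(x, gamma, beta, eps=1e-5):
    # y = (x - E[x]) / sqrt(Var[x] + eps) * (1 + gamma) + beta
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance, as in LayerNorm
    return [(v - mean) / math.sqrt(var + eps) * (1 + g) + b
            for v, g, b in zip(x, gamma, beta)]

# with gamma = 0 and beta = 0 this reduces to plain normalization
y = layernorm_zero_centered([1.0, 2.0, 3.0], [0.0] * 3, [0.0] * 3)
```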
@@ -816,7 +816,7 @@ class LayerNormLinear(TransformerEngineBaseModule):
     y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \varepsilon}} *
         (1 + \gamma) + \beta
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
@@ -832,7 +832,7 @@ class LayerNormLinear(TransformerEngineBaseModule):
     `set_tensor_parallel_group(tp_group)` method on the initialized module before the
     forward pass to supply the tensor parallel group needed for tensor and sequence
     parallel collectives.
-parallel_mode : {None, 'Column', 'Row'}, default = `None`
+parallel_mode : {None, 'column', 'row'}, default = `None`
     used to decide whether this Linear layer is Column Parallel Linear or Row
     Parallel Linear as described `here <https://arxiv.org/pdf/1909.08053.pdf>`_.
     When set to `None`, no communication is performed.
...
@@ -1193,7 +1193,7 @@ class LayerNormMLP(TransformerEngineBaseModule):
     y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \varepsilon}} *
         (1 + \gamma) + \beta
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
...
@@ -650,7 +650,7 @@ class Linear(TransformerEngineBaseModule):
     used for initializing weights in the following way: `init_method(weight)`.
     When set to `None`, defaults to `torch.nn.init.normal_(mean=0.0, std=0.023)`.
 get_rng_state_tracker : Callable, default = `None`
-    used to get the random number generator state tracker for initilizeing weights.
+    used to get the random number generator state tracker for initializing weights.
 rng_tracker_name : str, default = `None`
     the param passed to get_rng_state_tracker to get the specific rng tracker.
 parameters_split : Optional[Union[Tuple[str, ...], Dict[str, int]]], default = None
@@ -662,7 +662,7 @@ class Linear(TransformerEngineBaseModule):
     names that end in `_weight` or `_bias`, so trailing underscores are
     stripped from any provided names.
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
@@ -678,7 +678,7 @@ class Linear(TransformerEngineBaseModule):
     `set_tensor_parallel_group(tp_group)` method on the initialized module before the
     forward pass to supply the tensor parallel group needed for tensor and sequence
     parallel collectives.
-parallel_mode : {None, 'Column', 'Row'}, default = `None`
+parallel_mode : {None, 'column', 'row'}, default = `None`
     used to decide whether this Linear layer is Column Parallel Linear or Row
     Parallel Linear as described `here <https://arxiv.org/pdf/1909.08053.pdf>`_.
     When set to `None`, no communication is performed.
...
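The `parallel_mode` values being normalized here ('column' vs 'row') correspond to the two ways of sharding a linear layer in Megatron-style tensor parallelism: column parallel shards the output features and concatenates the per-rank results, while row parallel shards the input features and sums the per-rank partial products (the all-reduce). A pure-Python sketch simulating two "ranks" with plain lists (an illustration of the math only, not the TE communication code):

```python
def matmul(x, w):  # w: out_features x in_features; returns x @ w^T
    return [[sum(a * b for a, b in zip(row, col)) for col in w] for row in x]

# Full weight: 4 output features, 2 input features.
w = [[1., 0.], [0., 1.], [1., 1.], [2., 0.]]
x = [[3., 5.]]

# 'column' parallel: each "rank" owns a slice of output features;
# results are concatenated along the feature dimension.
w_col = [w[:2], w[2:]]
col_out = [matmul(x, shard)[0] for shard in w_col]
y_col = col_out[0] + col_out[1]

# 'row' parallel: each "rank" owns a slice of input features;
# partial products are summed (the all-reduce step).
w_row = [[[c[0]] for c in w], [[c[1]] for c in w]]
x_row = [[[3.]], [[5.]]]
partials = [matmul(xs, ws)[0] for xs, ws in zip(x_row, w_row)]
y_row = [a + b for a, b in zip(*partials)]

# both shardings recover the unsharded result
assert y_col == y_row == matmul(x, w)[0]
```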
@@ -120,7 +120,7 @@ class RMSNorm(torch.nn.Module):
 .. math::
     y = \frac{x}{RMS_\varepsilon(x)} * (1 + \gamma)
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
 """
...
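As with LayerNorm above, the RMSNorm docstring uses the zero-centered `(1 + gamma)` scale. A minimal sketch of the formula (illustrative only, not the TE kernel; `eps` placement inside the root follows the formula's RMS_eps notation):

```python
import math

def rmsnorm(x, gamma, eps=1e-5):
    # y = x / RMS_eps(x) * (1 + gamma), with RMS_eps(x) = sqrt(mean(x^2) + eps)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1 + g) for v, g in zip(x, gamma)]

# gamma = 0 leaves only the normalization
y = rmsnorm([3.0, 4.0], [0.0, 0.0])
```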
@@ -173,7 +173,7 @@ class TransformerLayer(torch.nn.Module):
     Type of activation used in MLP block.
     Options are: 'gelu', 'relu', 'reglu', 'geglu', 'swiglu', 'qgelu' and 'srelu'.
 device : Union[torch.device, str], default = "cuda"
-    The device on which the parameters of the model will allocated. It is the user's
+    The device on which the parameters of the model will be allocated. It is the user's
     responsibility to ensure all parameters are moved to the GPU before running the
     forward pass.
 attn_input_format: {'sbhd', 'bshd'}, default = 'sbhd'
...