Commit 31cee8e7 authored by Michael Carilli

Updating documentation

parent 262da9c6
@@ -77,8 +77,8 @@ class FP16_Optimizer(object):
      # Usually, dynamic_loss_args is not necessary.
  Args:
-     init_optimizer (torch.optim.optimizer): Existing optimizer containing initialized fp16 parameters. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters with new fp32 parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy after each step.
+     init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`.
-     static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale fp16 gradients computed by the model. Scaled gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
+     static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
      dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
      dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used.
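As a quick illustration of the constructor arguments above, wrapping an existing optimizer might look like the following sketch; the model, learning rate, and loss scale value are placeholders, not prescriptions:

    # `model` may hold fp16 parameters, fp32 parameters, or a mixture.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # FP16_Optimizer keeps fp32 master copies of any fp16 params and scales the loss by 128.
    optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)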
@@ -275,7 +275,7 @@ class FP16_Optimizer(object):
  Total norm of the current fp32 gradients (viewed as a single vector).
  .. warning::
-     Returns -1 if the most recently computed fp16 gradients overflowed (that is, if self.overflow is True).
+     Returns -1 if the most recently computed fp16 gradients overflowed (that is, if ``self.overflow`` is ``True``).
  """
  if not self.overflow:
      fp32_params = []
@@ -380,12 +380,13 @@ class FP16_Optimizer(object):
          optimizer.zero_grad()
          output = model(input)
          loss = loss_fn(output, target)
+         # loss.backward() becomes:
          optimizer.backward(loss)
          return loss
      optimizer.step(closure)
  .. warning::
-     Currently, calling step with a closure is not compatible with dynamic loss scaling.
+     Currently, calling :attr:`step` with a closure is not compatible with dynamic loss scaling.
  .. _`ordinary Pytorch optimizer use`:
      http://pytorch.org/docs/master/optim.html#optimizer-step-closure
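For comparison, the same iteration without a closure uses the methods documented for this class directly; this is a sketch, and ``model``, ``input``, ``target``, and ``loss_fn`` are placeholders:

    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    optimizer.backward(loss)  # replaces loss.backward()
    optimizer.step()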
@@ -443,19 +444,13 @@ class FP16_Optimizer(object):
  def backward(self, loss, update_master_grads=True):
      """
-     :attr:`backward` performs the following conceptual operations:
-     fp32_loss = loss.float() (see first Note below)
-     scaled_loss = fp32_loss*loss_scale
-     scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the
-     model's leaves (which may be fp16, fp32, or a mixture, depending how your model was defined).
-     fp16 grads are then copied to the master params' .grad attributes (see second Note), which
-     are guaranteed to be fp32.
-     Finally, master grads are divided by loss_scale.
+     :attr:`backward` performs the following conceptual steps:
+     1. fp32_loss = loss.float() (see first Note below)
+     2. scaled_loss = fp32_loss*loss_scale
+     3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending how your model was defined).
+     4. fp16 grads are then copied to the master params' ``.grad`` attributes (see second Note), which are guaranteed to be fp32.
+     5. Finally, master grads are divided by loss_scale.
      In this way, after :attr:`backward`, the master params have fresh gradients,
      and :attr:`step` may be called.
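As a rough illustration of these steps, the sketch below shows what ``optimizer.backward(loss)`` does conceptually. It is not the actual implementation; ``loss_scale``, ``model_params``, and ``master_params`` are placeholder names, and overflow checking and fp32 model params are omitted:

    # Conceptual sketch only -- FP16_Optimizer performs these operations internally.
    scaled_loss = loss.float() * loss_scale        # steps 1-2: upcast and scale the loss
    scaled_loss.backward()                         # step 3: scaled grads accumulate in model leaves
    for model_p, master_p in zip(model_params, master_params):
        master_p.grad = model_p.grad.float()       # step 4: copy (possibly fp16) grads to fp32 masters
        master_p.grad.div_(loss_scale)             # step 5: unscale the master grads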
@@ -468,7 +463,7 @@ class FP16_Optimizer(object):
  compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to
  :attr:`backward`.
- .. note::
+ .. warning::
      The gradients found in a model's leaves after the call to
      :attr:`backward` should not be regarded as valid in general,
      because it's possible
@@ -478,7 +473,6 @@ class FP16_Optimizer(object):
  only the master gradients should be regarded as valid. These can be retrieved via
  :attr:`inspect_master_grad_data()`.
  Args:
      loss: The loss output by the user's model. loss may be either float or half (but see first Note above).
      update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`.
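For the multiple-loss case mentioned under ``update_master_grads``, a usage sketch might look like the following; the loss names are placeholders, while the method names are those documented for this class:

    optimizer.zero_grad()
    # Delay the fp16->fp32 grad copy until every loss has been backpropagated.
    optimizer.backward(loss1, update_master_grads=False)
    optimizer.backward(loss2, update_master_grads=False)
    optimizer.update_master_grads()  # one fp16->fp32 copy for the accumulated grads
    optimizer.step()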
......
@@ -14,8 +14,10 @@ def check_contig_cuda(tensors, names):
  class Fused_Weight_Norm(Function):
      """
-     Implements weight norm along a tensor's slowest dimension using fused kernel launches for
-     the forward and backward pass.
+     Custom autograd function that implements weight norm, as presented in
+     `<https://arxiv.org/abs/1602.07868>`_,
+     along a tensor's slowest or
+     fastest dimension using fused kernel launches for the forward and backward passes.
      Accepts fp32 or fp16 input; the output type will match the input type.
      Within the kernels, all calculations are performed in fp32 for numerical stability, regardless
      of input/output precision.
@@ -24,11 +26,15 @@ class Fused_Weight_Norm(Function):
  @staticmethod
  def forward(ctx, input, g, dim=0):
      """
-     :attr:`input` is assumed to be contiguous.
-     :attr:`input` may be either float or half precision.
-     The precision of :attr:`output` will match the precision of :attr:`input`.
-     A float copy of the L2 norm across each slow dimension
-     is also created and saved for the backward pass.
+     Args:
+         input(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **v** in the paper. ``input`` should be contiguous.
+         g(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **g** in the paper. ``g`` should be the same type as ``input``.
+         dim(int, optional, default=0): Dimension across which to perform weightnorm. Currently, only the first or last dimension of the input tensor is supported.
+     Returns:
+         Output tensor corresponding to **w** in the paper. Output type and precision will match type and precision of ``input``.
      """
      # torch.cuda.nvtx.range_push("FusedNorm.forward, input.size() = {}"
      #     .format(input.size()))
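A minimal usage sketch based on the signature above, assuming the standard ``torch.autograd.Function.apply`` calling convention; the import path and the tensor shapes (in particular the shape of ``g``) are illustrative assumptions:

    import torch
    from apex.fp16_utils import Fused_Weight_Norm  # import path is an assumption

    v = torch.cuda.HalfTensor(1024, 512).normal_()   # raw weight, corresponds to **v**
    g = torch.cuda.HalfTensor(1024).fill_(1.0)       # per-slice scale, corresponds to **g** (shape is an assumption)
    w = Fused_Weight_Norm.apply(v, g, 0)             # normalized weight **w**, same type and precision as v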
@@ -79,9 +85,11 @@ class Fused_Weight_Norm(Function):
  @once_differentiable
  def backward(ctx, grad_output):
      """
-     :attr:`grad_output` is assumed to be contiguous.
-     :attr:`grad_output` may be either float or half precision.
-     The precision of :attr:`grad_input` will match the precision of :attr:`grad_output`.
+     Args:
+         grad_output(torch.cuda.FloatTensor or torch.cuda.HalfTensor): Gradient of loss with respect to output **w**. ``grad_output`` should be contiguous for performance.
+     Returns:
+         Gradients of loss with respect to ``input`` and ``g``. The precision of these gradients will match the precision of ``grad_output``.
      """
      check_contig_cuda((grad_output), ("grad_output"))
......
@@ -53,7 +53,7 @@ class DynamicLossScaler:
  the ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor.
  Loss scaling is designed to combat the problem of underflowing gradients encountered at long
- times when training FP16 networks. Dynamic loss scaling begins by attempting a very high loss
+ times when training fp16 networks. Dynamic loss scaling begins by attempting a very high loss
  scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are
  encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has
  occurred.
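A sketch of enabling dynamic loss scaling through :class:`FP16_Optimizer`, as described above; the keys passed in ``dynamic_loss_args`` are assumptions and must match whatever kwargs :class:`DynamicLossScaler`'s constructor actually accepts:

    # Use DynamicLossScaler with its defaults:
    optimizer = FP16_Optimizer(init_optimizer, dynamic_loss_scale=True)
    # Or forward custom kwargs to the internal DynamicLossScaler
    # (key names here are assumptions -- check the constructor):
    optimizer = FP16_Optimizer(init_optimizer,
                               dynamic_loss_scale=True,
                               dynamic_loss_args={'init_scale': 2**16})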
......