Commit 31cee8e7 authored by Michael Carilli

Updating documentation

parent 262da9c6
@@ -77,8 +77,8 @@ class FP16_Optimizer(object):
# Usually, dynamic_loss_args is not necessary.
Args:
init_optimizer (torch.optim.optimizer): Existing optimizer containing initialized fp16 parameters. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters with new fp32 parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy after each step.
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale fp16 gradients computed by the model. Scaled gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`.
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used.
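In practice, the constructor arguments above translate into a wrapping pattern like the following minimal sketch; the import path and the toy model are illustrative assumptions rather than content from this diff:
import torch
from apex.fp16_utils import FP16_Optimizer  # assumed import path

model = torch.nn.Linear(512, 512).cuda().half()            # fp16 model, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Static loss scaling with a fixed scale factor:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)
# ...or let the wrapper choose and adjust the scale automatically:
# optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)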
@@ -275,7 +275,7 @@ class FP16_Optimizer(object):
Total norm of the current fp32 gradients (viewed as a single vector).
.. warning::
Returns -1 if the most recently computed fp16 gradients overflowed (that is, if self.overflow is True).
Returns -1 if the most recently computed fp16 gradients overflowed (that is, if ``self.overflow`` is ``True``).
"""
if not self.overflow:
fp32_params = []
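# Hedged usage sketch for the method this docstring appears to document
# (clip_master_grads in apex's FP16_Optimizer; the method name and signature are
# assumptions, since the hunk above does not show the def line):
total_norm = optimizer.clip_master_grads(1.0)  # assumed call; returns the fp32 grad norm
if total_norm == -1:
    # The most recent fp16 backward pass overflowed, so the returned norm is not meaningful.
    pass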
@@ -380,12 +380,13 @@ class FP16_Optimizer(object):
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
# loss.backward() becomes:
optimizer.backward(loss)
return loss
optimizer.step(closure)
.. warning::
Currently, calling step with a closure is not compatible with dynamic loss scaling.
Currently, calling :attr:`step` with a closure is not compatible with dynamic loss scaling.
.. _`ordinary Pytorch optimizer use`:
http://pytorch.org/docs/master/optim.html#optimizer-step-closure
@@ -443,19 +444,13 @@ class FP16_Optimizer(object):
def backward(self, loss, update_master_grads=True):
"""
:attr:`backward` performs the following conceptual operations:
fp32_loss = loss.float() (see first Note below)
scaled_loss = fp32_loss*loss_scale
scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the
model's leaves (which may be fp16, fp32, or a mixture, depending how your model was defined).
:attr:`backward` performs the following conceptual steps:
fp16 grads are then copied to the master params' .grad attributes (see second Note), which
are guaranteed to be fp32.
Finally, master grads are divided by loss_scale.
1. fp32_loss = loss.float() (see first Note below)
2. scaled_loss = fp32_loss*loss_scale
3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending how your model was defined).
4. fp16 grads are then copied to the master params' ``.grad`` attributes (see second Note), which are guaranteed to be fp32.
5. Finally, master grads are divided by loss_scale.
In this way, after :attr:`backward`, the master params have fresh gradients,
and :attr:`step` may be called.
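Taken together with :attr:`step`, a complete iteration without a closure looks like this minimal sketch (model, input, target, and loss_fn are placeholders):
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
optimizer.backward(loss)  # scale loss, backprop, copy fp16 grads to fp32 masters, unscale
optimizer.step()          # update the fp32 master params, then refresh the fp16 model params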
@@ -468,7 +463,7 @@ class FP16_Optimizer(object):
compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to
:attr:`backward`.
.. note::
.. warning::
The gradients found in a model's leaves after the call to
:attr:`backward` should not be regarded as valid in general,
because it's possible
@@ -478,7 +473,6 @@ class FP16_Optimizer(object):
only the master gradients should be regarded as valid. These can be retrieved via
:attr:`inspect_master_grad_data()`.
Args:
loss: The loss output by the user's model. loss may be either float or half (but see first Note above).
update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`.
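For example, the delayed-copy pattern described for ``update_master_grads`` looks like the following sketch, where loss1 and loss2 stand in for multiple losses computed in one iteration:
optimizer.backward(loss1, update_master_grads=False)
optimizer.backward(loss2, update_master_grads=False)
optimizer.update_master_grads()  # one fp16 -> fp32 grad copy covering both backward calls
optimizer.step()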
......
@@ -14,8 +14,10 @@ def check_contig_cuda(tensors, names):
class Fused_Weight_Norm(Function):
"""
Implements weight norm along a tensor's slowest dimension using fused kernel launches for
the forward and backward pass.
Custom autograd function that implements weight norm, as presented in
`<https://arxiv.org/abs/1602.07868>`_,
along a tensor's slowest or
fastest dimension using fused kernel launches for the forward and backward passes.
Accepts fp32 or fp16 input; the output type will match the input type.
Within the kernels, all calculations are performed in fp32 for numerical stability, regardless
of input/output precision.
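Since :class:`Fused_Weight_Norm` is an autograd Function, it is invoked through ``apply``; the sketch below is illustrative only (the import path and the expected shape of ``g`` are assumptions, not taken from this diff):
import torch
from apex.fp16_utils import Fused_Weight_Norm  # assumed import path

v = torch.randn(1024, 256, device='cuda').half().contiguous()  # "v" in the paper
g = torch.ones(1024, device='cuda').half()  # "g": one scale per slice along dim 0 (shape assumed)
w = Fused_Weight_Norm.apply(v, g)  # dim defaults to 0; w has the same dtype as v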
@@ -24,11 +26,15 @@ class Fused_Weight_Norm(Function):
@staticmethod
def forward(ctx, input, g, dim=0):
"""
:attr:`input` is assumed to be contiguous.
:attr:`input` may be either float or half precision.
The precision of :attr:`output` will match the precision of :attr:`input`.
A float copy of the L2 norm across each slow dimension
is also created and saved for the backward pass.
Args:
input(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **v** in the paper. ``input`` should be contiguous.
g(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **g** in the paper. ``g`` should be the same type as ``input``.
dim(int, optional, default=0): Dimension across which to perform weight norm. Currently, only the first or last dimension of the input tensor is supported.
Returns:
Output tensor corresponding to **w** in the paper. Output type and precision will match
type and precision of ``input``.
"""
# torch.cuda.nvtx.range_push("FusedNorm.forward, input.size() = {}"
# .format(input.size()))
@@ -79,9 +85,11 @@ class Fused_Weight_Norm(Function):
@once_differentiable
def backward(ctx, grad_output):
"""
:attr:`grad_output` is assumed to be contiguous.
:attr:`grad_output` may be either float or half precision.
The precision of :attr:`grad_input` will match the precision of :attr:`grad_output`.
Args:
grad_output(torch.cuda.FloatTensor or torch.cuda.HalfTensor): Gradient of loss with respect to output **w**. ``grad_output`` should be contiguous for performance.
Returns:
Gradient of loss with respect to ``input`` and ``g``. The precision of these gradients will match the precision of ``grad_output``.
"""
check_contig_cuda((grad_output), ("grad_output"))
......
@@ -53,7 +53,7 @@ class DynamicLossScaler:
the ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor.
Loss scaling is designed to combat the problem of underflowing gradients encountered at long
times when training FP16 networks. Dynamic loss scaling begins by attempting a very high loss
times when training fp16 networks. Dynamic loss scaling begins by attempting a very high loss
scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are
encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has
occurred.
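In user code this simply means constructing :class:`FP16_Optimizer` with dynamic scaling enabled; the key shown for ``dynamic_loss_args`` below is an assumption about :class:`DynamicLossScaler`'s constructor, included only for illustration:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)  # use the scaler's defaults
# Forwarding kwargs to the internal DynamicLossScaler (key name assumed):
# optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True,
#                            dynamic_loss_args={'init_scale': 2**16})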
......