Commit 31cee8e7 authored by Michael Carilli

Updating documentation

parent 262da9c6
......@@ -77,8 +77,8 @@ class FP16_Optimizer(object):
# Usually, dynamic_loss_args is not necessary.
Args:
init_optimizer (torch.optim.optimizer): Existing optimizer containing initialized fp16 parameters. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters with new fp32 parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy after each step.
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale fp16 gradients computed by the model. Scaled gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`.
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used.
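A minimal construction sketch based on the arguments documented above (model and optimizer names are illustrative, and the import path is an assumption, not taken from this commit):

    import torch
    from apex.fp16_utils import FP16_Optimizer  # import path assumed

    model = torch.nn.Linear(512, 512).cuda().half()           # fp16 model parameters
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    # Static loss scaling:
    optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)
    # ...or dynamic loss scaling, which overrides any static_loss_scale:
    # optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)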
......@@ -275,7 +275,7 @@ class FP16_Optimizer(object):
Total norm of the current fp32 gradients (viewed as a single vector).
.. warning::
Returns -1 if the most recently computed fp16 gradients overflowed (that is, if self.overflow is True).
Returns -1 if the most recently computed fp16 gradients overflowed (that is, if ``self.overflow`` is ``True``).
"""
if not self.overflow:
fp32_params = []
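A short usage sketch for the overflow behavior described in this warning; it assumes the ``overflow`` attribute is refreshed by :attr:`backward` and is readable from user code:

    optimizer.backward(loss)
    if optimizer.overflow:
        # The most recently computed fp16 gradients overflowed; the norm described
        # in this docstring would be reported as -1 rather than a true gradient norm.
        print("fp16 gradient overflow detected")
    optimizer.step()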
......@@ -380,12 +380,13 @@ class FP16_Optimizer(object):
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
# loss.backward() becomes:
optimizer.backward(loss)
return loss
optimizer.step(closure)
.. warning::
Currently, calling step with a closure is not compatible with dynamic loss scaling.
Currently, calling :attr:`step` with a closure is not compatible with dynamic loss scaling.
.. _`ordinary Pytorch optimizer use`:
http://pytorch.org/docs/master/optim.html#optimizer-step-closure
......@@ -443,19 +444,13 @@ class FP16_Optimizer(object):
def backward(self, loss, update_master_grads=True):
"""
:attr:`backward` performs the following conceptual operations:
:attr:`backward` performs the following conceptual steps:
fp32_loss = loss.float() (see first Note below)
scaled_loss = fp32_loss*loss_scale
scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the
model's leaves (which may be fp16, fp32, or a mixture, depending how your model was defined).
fp16 grads are then copied to the master params' .grad attributes (see second Note), which
are guaranteed to be fp32.
Finally, master grads are divided by loss_scale.
1. fp32_loss = loss.float() (see first Note below)
2. scaled_loss = fp32_loss*loss_scale
3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending on how your model was defined).
4. fp16 grads are then copied to the master params' ``.grad`` attributes (see second Note), which are guaranteed to be fp32.
5. Finally, master grads are divided by loss_scale.
In this way, after :attr:`backward`, the master params have fresh gradients,
and :attr:`step` may be called.
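Putting the steps above together, a typical iteration (a sketch assuming the usual ``model``/``loss_fn`` setup from the earlier example) looks like:

    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    optimizer.backward(loss)   # replaces loss.backward(): scales, backprops, builds fp32 master grads
    optimizer.step()           # updates master params, then refreshes the fp16 params from them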
......@@ -468,7 +463,7 @@ class FP16_Optimizer(object):
compute the loss criterion (MSE, cross entropy, etc.) in fp32 before supplying it to
:attr:`backward`.
.. note::
.. warning::
The gradients found in a model's leaves after the call to
:attr:`backward` should not be regarded as valid in general,
because it's possible
......@@ -478,7 +473,6 @@ class FP16_Optimizer(object):
only the master gradients should be regarded as valid. These can be retrieved via
:attr:`inspect_master_grad_data()`.
Args:
loss: The loss output by the user's model. loss may be either float or half (but see first Note above).
update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`.
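For the multi-loss case described by ``update_master_grads``, a sketch (loss names illustrative):

    # Accumulate fp16 grads from several losses, then do a single fp16->fp32 copy.
    optimizer.backward(loss1, update_master_grads=False)
    optimizer.backward(loss2, update_master_grads=False)
    optimizer.update_master_grads()   # user's responsibility before step() in this mode
    optimizer.step()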
......
......@@ -14,8 +14,10 @@ def check_contig_cuda(tensors, names):
class Fused_Weight_Norm(Function):
"""
Implements weight norm along a tensor's slowest dimension using fused kernel launches for
the forward and backward pass.
Custom autograd function that implements weight norm, as presented in
`<https://arxiv.org/abs/1602.07868>`_,
along a tensor's slowest or
fastest dimension using fused kernel launches for the forward and backward passes.
Accepts fp32 or fp16 input; the output type will match the input type.
Within the kernels, all calculations are performed in fp32 for numerical stability, regardless
of input/output precision.
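A usage sketch for this function; the import path, tensor shapes, and the shape expected for ``g`` are assumptions, not taken from this commit:

    import torch
    from apex.fp16_utils import Fused_Weight_Norm  # import path is an assumption

    v = torch.randn(64, 128, device="cuda", dtype=torch.half, requires_grad=True)  # **v** in the paper
    g = torch.randn(64, device="cuda", dtype=torch.half, requires_grad=True)        # **g**; shape assumed
    w = Fused_Weight_Norm.apply(v, g, 0)   # dim passed positionally; output w matches input precision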
......@@ -24,11 +26,15 @@ class Fused_Weight_Norm(Function):
@staticmethod
def forward(ctx, input, g, dim=0):
"""
:attr:`input` is assumed to be contiguous.
:attr:`input` may be either float or half precision.
The precision of :attr:`output` will match the precision of :attr:`input`.
A float copy of the L2 norm across each slow dimension
is also created and saved for the backward pass.
Args:
input(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **v** in the paper. ``input`` should be contiguous.
g(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **g** in the paper. ``g`` should be the same type as ``input``.
dim(int, optional, default=0): Dimension across which to perform weight norm. Currently, only the first or last dimension of the input tensor is supported.
Returns:
Output tensor corresponding to **w** in the paper. Output type and precision will match
type and precision of ``input``.
"""
# torch.cuda.nvtx.range_push("FusedNorm.forward, input.size() = {}"
# .format(input.size()))
......@@ -79,9 +85,11 @@ class Fused_Weight_Norm(Function):
@once_differentiable
def backward(ctx, grad_output):
"""
:attr:`grad_output` is assumed to be contiguous.
:attr:`grad_output` may be either float or half precision.
The precision of :attr:`grad_input` will match the precision of :attr:`grad_output`.
Args:
grad_output(torch.cuda.FloatTensor or torch.cuda.HalfTensor): Gradient of loss with respect to output **w**. ``grad_output`` should be contiguous for performance.
Returns:
Gradients of the loss with respect to ``input`` and ``g``. The precision of these gradients will match the precision of ``grad_output``.
"""
check_contig_cuda((grad_output), ("grad_output"))
......
......@@ -53,7 +53,7 @@ class DynamicLossScaler:
the ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor.
Loss scaling is designed to combat the problem of underflowing gradients encountered at long
times when training FP16 networks. Dynamic loss scaling begins by attempting a very high loss
times when training fp16 networks. Dynamic loss scaling begins by attempting a very high loss
scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are
encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has
occurred.
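A sketch of how dynamic loss scaling is requested in practice; the keys shown in ``dynamic_loss_args`` are assumptions and should be checked against :class:`DynamicLossScaler`'s constructor:

    # DynamicLossScaler is not constructed directly; FP16_Optimizer builds it internally.
    optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
    # Optionally tune it (kwarg names assumed; they must match DynamicLossScaler's constructor):
    # optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True,
    #                            dynamic_loss_args={'init_scale': 2**16, 'scale_window': 500})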
......