init_optimizer (torch.optim.optimizer): Existing optimizer containing initialized fp16 parameters. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters with new fp32 parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy after each step.
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale fp16 gradients computed by the model. Scaled gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
init_optimizer (torch.optim.optimizer): Existing optimizer created with the parameters to optimize. Internally, :class:`FP16_Optimizer` replaces the passed optimizer's fp16 parameters, if any, with fp32 master parameters copied from the original ones. :class:`FP16_Optimizer` also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy at the end of each :attr:`step`.
static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale gradients computed by the model. Any fp16 gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so ``static_loss_scale`` should not affect learning rate.
dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any ``static_loss_scale`` option.
dynamic_loss_args (dict, optional, default=None): Dict of kwargs that will be forwarded to the internal :class:`DynamicLossScaler` instance's constructor. Keys of this dict must match kwargs accepted by :class:`DynamicLossScaler`'s constructor. If ``dynamic_loss_args`` is unspecified, :class:`DynamicLossScaler`'s defaults will be used.
...
...
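A minimal construction sketch for the options above, assuming the usual apex import path and a model already converted to half precision; the ``dynamic_loss_args`` keys shown are illustrative :class:`DynamicLossScaler` kwargs::

    import torch
    from apex.fp16_utils import FP16_Optimizer  # assumed import location

    model = torch.nn.Linear(512, 10).cuda().half()
    base_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Static loss scaling: a fixed multiplier applied to the loss before backward.
    optimizer = FP16_Optimizer(base_optimizer, static_loss_scale=128.0)

    # Dynamic loss scaling instead (overrides static_loss_scale):
    # optimizer = FP16_Optimizer(base_optimizer,
    #                            dynamic_loss_scale=True,
    #                            dynamic_loss_args={'init_scale': 2.**16})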
@@ -275,7 +275,7 @@ class FP16_Optimizer(object):
Total norm of the current fp32 gradients (viewed as a single vector).
.. warning::
Returns -1 if the most recently computed fp16 gradients overflowed (that is, if self.overflow is True).
Returns -1 if the most recently computed fp16 gradients overflowed (that is, if ``self.overflow`` is ``True``).
"""
if not self.overflow:
    fp32_params = []
...
...
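The enclosing method's name falls outside this hunk; as a hedged illustration of the -1 sentinel, ``clip_master_grads`` is used below purely as a hypothetical caller that returns this norm::

    # Hypothetical usage: guard against the -1 overflow sentinel before using the norm.
    grad_norm = optimizer.clip_master_grads(5.0)   # assumed method name
    if grad_norm == -1:
        print("fp16 grads overflowed; this step will be skipped")
    else:
        print("total fp32 grad norm:", grad_norm)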
@@ -380,12 +380,13 @@ class FP16_Optimizer(object):
def closure():
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    # loss.backward() becomes:
    optimizer.backward(loss)
    return loss
optimizer.step(closure)
.. warning::
Currently, calling step with a closure is not compatible with dynamic loss scaling.
Currently, calling :attr:`step` with a closure is not compatible with dynamic loss scaling.
@@ -443,19 +444,13 @@ class FP16_Optimizer(object):
def backward(self, loss, update_master_grads=True):
"""
:attr:`backward` performs the following conceptual operations:
:attr:`backward` performs the following conceptual steps:
fp32_loss = loss.float() (see first Note below)
scaled_loss = fp32_loss*loss_scale
scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the
model's leaves (which may be fp16, fp32, or a mixture, depending how your model was defined).
fp16 grads are then copied to the master params' .grad attributes (see second Note), which
are guaranteed to be fp32.
Finally, master grads are divided by loss_scale.
1. fp32_loss = loss.float() (see first Note below)
2. scaled_loss = fp32_loss*loss_scale
3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending on how your model was defined).
4. fp16 grads are then copied to the master params' ``.grad`` attributes (see second Note), which are guaranteed to be fp32.
5. Finally, master grads are divided by loss_scale.
In this way, after :attr:`backward`, the master params have fresh gradients,
and :attr:`step` may be called.
...
...
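A rough, plain-PyTorch rendering of those five steps, for orientation only (``model_params`` and ``master_params`` are illustrative names; the real method walks the optimizer's internal param groups)::

    import torch

    def scaled_backward(loss, loss_scale, model_params, master_params):
        fp32_loss = loss.float()                 # 1. upcast the loss
        scaled_loss = fp32_loss * loss_scale     # 2. apply the loss scale
        scaled_loss.backward()                   # 3. scaled grads land in the model's leaves
        for model_p, master_p in zip(model_params, master_params):
            if master_p.grad is None:
                master_p.grad = torch.empty_like(master_p)
            master_p.grad.copy_(model_p.grad)    # 4. fp16 grads copied to fp32 master grads
            master_p.grad.div_(loss_scale)       # 5. unscale the master grads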
@@ -468,7 +463,7 @@ class FP16_Optimizer(object):
compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to
:attr:`backward`.
.. note::
.. warning::
The gradients found in a model's leaves after the call to
:attr:`backward` should not be regarded as valid in general,
because it's possible
...
...
@@ -478,7 +473,6 @@ class FP16_Optimizer(object):
only the master gradients should be regarded as valid. These can be retrieved via
:attr:`inspect_master_grad_data()`.
Args:
loss: The loss output by the user's model. ``loss`` may be either float or half (but see first Note above).
update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`.
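A short sketch of the multi-loss pattern that ``update_master_grads=False`` is intended for (``loss1`` and ``loss2`` are placeholders)::

    # Delay the fp16 -> fp32 grad copy until both backward passes have run.
    optimizer.zero_grad()
    optimizer.backward(loss1, update_master_grads=False)
    optimizer.backward(loss2, update_master_grads=False)
    optimizer.update_master_grads()   # single consolidated copy into the master grads
    optimizer.step()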
Implements weight norm along a tensor's slowest dimension using fused kernel launches for
the forward and backward pass.
Custom autograd function that implements weight norm, as presented in
`<https://arxiv.org/abs/1602.07868>`_,
along a tensor's slowest or
fastest dimension using fused kernel launches for the forward and backward passes.
Accepts fp32 or fp16 input; the output type will match the input type.
Within the kernels, all calculations are performed in fp32 for numerical stability, regardless
of input/output precision.
...
...
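For orientation, an unfused reference of the dim=0 case (shapes here assume ``g`` carries one scalar per slice along the slow dimension, which is an assumption about its layout)::

    import torch

    def weight_norm_reference(v, g, dim=0):
        # Reference math only: w = g * v / ||v||, with one norm per slice along dim 0.
        assert dim == 0, "sketch covers only the slowest-dimension case"
        norms = v.float().reshape(v.size(0), -1).norm(2, dim=1)           # fp32 norms
        scale = (g.float().reshape(-1) / norms).reshape(-1, *([1] * (v.dim() - 1)))
        return (v.float() * scale).to(v.dtype)                            # match input precision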
@@ -24,11 +26,15 @@ class Fused_Weight_Norm(Function):
@staticmethod
def forward(ctx, input, g, dim=0):
"""
:attr:`input` is assumed to be contiguous.
:attr:`input` may be either float or half precision.
The precision of :attr:`output` will match the precision of :attr:`input`.
A float copy of the L2 norm across each slow dimension
is also created and saved for the backward pass.
Args:
input(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **v** in the paper. ``input`` should be contiguous.
g(torch.cuda.FloatTensor or torch.cuda.HalfTensor): input tensor corresponding to **g** in the paper. ``g`` should be the same type as ``input``.
dim(int, optional, default=0): Dimension across which to perform weight norm. Currently, only the first or last dimension of the input tensor is supported.
Returns:
Output tensor corresponding to **w** in the paper. Output type and precision will match
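A hedged usage sketch; the import path and the shape of ``g`` are assumptions, and autograd routes gradients through the custom :attr:`backward` below::

    import torch
    from apex.fp16_utils import Fused_Weight_Norm   # assumed import location

    v = torch.randn(64, 256, device='cuda', dtype=torch.half, requires_grad=True)
    g = torch.randn(64, device='cuda', dtype=torch.half, requires_grad=True)

    w = Fused_Weight_Norm.apply(v, g, 0)   # dim=0: one norm per slice along the slow dimension
    w.sum().backward()                     # fused backward fills v.grad and g.grad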
@@ -79,9 +85,11 @@ class Fused_Weight_Norm(Function):
@once_differentiable
def backward(ctx, grad_output):
"""
:attr:`grad_output` is assumed to be contiguous.
:attr:`grad_output` may be either float or half precision.
The precision of :attr:`grad_input` will match the precision of :attr:`grad_output`.
Args:
grad_output(torch.cuda.FloatTensor or torch.cuda.HalfTensor): Gradient of loss with respect to output **w**. ``grad_output`` should be contiguous for performance.
Returns:
Gradient of loss with respect to ``input`` and ``g``. The precision of these gradients will match the precision of ``grad_output``.