@@ -136,10 +136,10 @@ class DistributedDataParallel(Module):
allreduce_trigger_params (list, optional, default=None): If supplied, should contain a list of parameters drawn from the model. Allreduces will be kicked off whenever one of these parameters receives its gradient (as opposed to when a bucket of size message_size is full). At the end of backward(), a cleanup allreduce to catch any remaining gradients will also be performed automatically. If allreduce_trigger_params is supplied, the message_size argument will be ignored.
allreduce_always_fp32 (bool, default=False): Convert any FP16 gradients to FP32 before allreducing. This can improve stability for widely scaled-out runs.
gradient_average (bool, default=True): Option to toggle whether or not DDP averages the allreduced gradients over processes. For proper scaling, the default value of True is recommended.
gradient_predivide_factor (float, default=1.0): Allows performing the average of gradients over processes partially before and partially after the allreduce. Before allreduce: ``grads.mul_(1.0/gradient_predivide_factor)``. After allreduce: ``grads.mul_(gradient_predivide_factor/world_size)``. This can reduce the stress on the dynamic range of FP16 allreduces for widely scaled-out runs.
.. warning::
    If ``gradient_average=False``, the pre-allreduce division (``grads.mul_(1.0/gradient_predivide_factor)``) will still be applied, but the post-allreduce gradient averaging (``grads.mul_(gradient_predivide_factor/world_size)``) will be omitted.
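For illustration, a minimal sketch of how the options documented above might be combined when wrapping a model. The process-group setup, the ``build_model`` helper, the choice of ``model.fc.weight`` as a trigger parameter, and the factor of ``8.0`` are assumptions made for the example, not part of the docstring::

    import os
    import torch
    from apex.parallel import DistributedDataParallel as DDP

    # One process per GPU; LOCAL_RANK is assumed to be set by the launcher
    # (e.g. torchrun).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)

    model = build_model().cuda()  # hypothetical model constructor

    model = DDP(
        model,
        # Kick off allreduces when these parameters receive gradients
        # (message_size is ignored when this list is supplied).
        allreduce_trigger_params=[model.fc.weight],  # hypothetical parameter
        # Do the allreduce arithmetic in FP32 for stability at scale.
        allreduce_always_fp32=True,
        # Split the averaging around the allreduce: multiply by 1/8 before
        # the allreduce and by 8/world_size afterwards.
        gradient_predivide_factor=8.0,
    )

With ``gradient_predivide_factor=8.0`` on, say, 64 processes, gradients are scaled by 1/8 before the allreduce and by 8/64 afterwards, so the net result is still the usual 1/64 average, while the values being summed during the allreduce are 8x smaller, easing the pressure on FP16 dynamic range.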