Commit 8124fba2 authored by Michael Carilli

Updating documentation for merged utilities

parent 1fa1a073
@@ -94,14 +94,16 @@ import apex
 ```
 ### CUDA/C++ extension
-To build Apex with CUDA/C++ extension, follow the Linux instruction with the
-`--cuda_ext` option enabled
+Apex contains optional CUDA/C++ extensions, installable via
 ```
-python setup.py install --cuda_ext
+python setup.py install [--cuda_ext] [--cpp_ext]
 ```
-CUDA/C++ extension provides customed synchronized Batch Normalization kernels
-that provides better performance and numerical accuracy.
+Currently, `--cuda_ext` enables
+- Fused kernels that improve the performance and numerical stability of `apex.parallel.SyncBatchNorm`.
+- Fused kernels required to use `apex.optimizers.FusedAdam`.
+`--cpp_ext` enables
+- C++-side flattening and unflattening utilities that reduce the CPU overhead of `apex.parallel.DistributedDataParallel`.
 ### Windows support
 Windows support is experimental, and Linux is recommended. However, since Apex can be installed Python-only, there is a good chance the Python-only features "just work" the same way as on Linux. If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.
...
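For context, a minimal sketch of how the `--cuda_ext` features surface in user code, with `apex.parallel.SyncBatchNorm` used as a drop-in replacement for `torch.nn.BatchNorm2d`. The layer sizes are arbitrary, and cross-process synchronization assumes a distributed process group has already been initialized.

```python
# Minimal sketch (assumes Apex was built with --cuda_ext and a distributed
# process group is already initialized): apex.parallel.SyncBatchNorm is a
# drop-in replacement for torch.nn.BatchNorm2d.
import torch
import torch.nn as nn
from apex.parallel import SyncBatchNorm

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    SyncBatchNorm(64),   # fused kernels come from the --cuda_ext build
    nn.ReLU(),
).cuda()
```

`apex.parallel` also provides `convert_syncbn_model`, which walks an existing model and swaps its `torch.nn` batch norm layers for `SyncBatchNorm`.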
@@ -81,11 +81,9 @@ class Reducer(object):
 Like :class:`DistributedDataParallel`, :class:`Reducer` averages any tensors it allreduces
 over the number of participating processes.
-:class:`Reducer` is designed to work with the launch utility script
-``apex.parallel.multiproc.py`` or the upstream launch utility script
+:class:`Reducer` is designed to work with the upstream launch utility script
 ``torch.distributed.launch`` with ``--nproc_per_node <= number of gpus per node``.
-For forward compatibility, ``torch.distributed.launch`` is recommended.
-When used with these launchers, :class:`Reducer` assumes 1:1 mapping of processes to GPUs.
+When used with this launcher, :class:`Reducer` assumes 1:1 mapping of processes to GPUs.
 It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
 main_reducer.py in https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows example usage.
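A minimal sketch of the manual pattern this docstring describes, assuming one process per GPU launched with `torch.distributed.launch` (which supplies `--local_rank`); the tiny `Linear` model stands in for a real network:

```python
# Sketch: apex.parallel.Reducer leaves the user in charge of when gradients
# are allreduced. Assumes launch via
#   python -m torch.distributed.launch --nproc_per_node=N script.py
import argparse
import torch
from apex.parallel import Reducer

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # 1:1 process-to-GPU mapping
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(128, 128).cuda()  # placeholder model
reducer = Reducer(model)                  # allreduces are NOT automatic

loss = model(torch.randn(32, 128).cuda()).sum()
loss.backward()
reducer.reduce()   # explicitly allreduce (and average) the gradients
```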
@@ -122,11 +120,9 @@ class DistributedDataParallel(Module):
 overlapping communication with computation during ``backward()`` and bucketing smaller gradient
 transfers to reduce the total number of transfers required.
-:class:`DistributedDataParallel` is designed to work with the launch utility script
-``apex.parallel.multiproc.py`` or the upstream launch utility script
+:class:`DistributedDataParallel` is designed to work with the upstream launch utility script
 ``torch.distributed.launch`` with ``--nproc_per_node <= number of gpus per node``.
-For forward compatibility, ``torch.distributed.launch`` is recommended.
-When used with these launchers, :class:`DistributedDataParallel` assumes 1:1 mapping of processes to GPUs.
+When used with this launcher, :class:`DistributedDataParallel` assumes 1:1 mapping of processes to GPUs.
 It also assumes that your script calls ``torch.cuda.set_device(args.rank)`` before creating the model.
 https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
@@ -135,8 +131,11 @@ class DistributedDataParallel(Module):
 Args:
     module: Network definition to be run in multi-gpu/distributed mode.
-    message_size (Default = 1e7): Minimum number of elements in a communication bucket.
-    delay_allreduce (Default = False): Delay all communication to the end of the backward pass. This disables overlapping communication with computation.
+    message_size (int, default=1e7): Minimum number of elements in a communication bucket.
+    delay_allreduce (bool, default=False): Delay all communication to the end of the backward pass. This disables overlapping communication with computation.
+    allreduce_trigger_params (list, optional, default=None): If supplied, should contain a list of parameters drawn from the model. Allreduces will be kicked off whenever one of these parameters receives its gradient (as opposed to when a bucket of size message_size is full). At the end of backward(), a cleanup allreduce to catch any remaining gradients will also be performed automatically. If allreduce_trigger_params is supplied, the message_size argument will be ignored.
+    allreduce_always_fp32 (bool, default=False): Convert any FP16 gradients to FP32 before allreducing. This can improve stability for widely scaled-out runs.
+    gradient_average_split_factor (float, default=1.0): Perform the averaging of gradients over processes partially before and partially after the allreduce. Before allreduce: grads.mul_(1.0/gradient_average_split_factor). After allreduce: grads.mul_(gradient_average_split_factor/world_size). This can reduce the stress on the dynamic range of FP16 allreduces for widely scaled-out runs.
 """
...
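Pulling the new arguments together, a minimal sketch of constructing `apex.parallel.DistributedDataParallel` under the same single-process-per-GPU launch; the values shown are just the documented defaults:

```python
# Sketch: apex.parallel.DistributedDataParallel with the arguments documented
# above (values shown are the defaults). Assumes launch via torch.distributed.launch.
import argparse
import torch
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(1024, 1024).cuda()
model = DistributedDataParallel(
    model,
    message_size=10000000,            # bucket size in elements
    delay_allreduce=False,            # keep communication overlapped with backward()
    allreduce_always_fp32=False,      # set True to allreduce FP16 grads in FP32
    gradient_average_split_factor=1.0,
)

out = model(torch.randn(32, 1024).cuda())
out.sum().backward()   # allreduces fire automatically as buckets fill
```

With `gradient_average_split_factor = f`, gradients are scaled by `1/f` before the allreduce and by `f/world_size` afterward, so the net effect is still the usual `1/world_size` average while keeping intermediate values in a friendlier FP16 range.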
@@ -17,7 +17,7 @@ Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Insta
 git clone https://www.github.com/nvidia/apex
 cd apex
-python setup.py install
+python setup.py install [--cuda_ext] [--cpp_ext]
 .. toctree::
@@ -38,6 +38,12 @@ Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Insta
    parallel
+.. toctree::
+   :maxdepth: 1
+   :caption: Fused Optimizers
+
+   optimizers
 .. reparameterization
 .. RNN
...
.. role:: hidden
    :class: hidden-section

apex.optimizers
===================================

.. automodule:: apex.optimizers
.. currentmodule:: apex.optimizers

.. FusedAdam
    ----------

.. autoclass:: FusedAdam
    :members:
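The class being documented is used like `torch.optim.Adam`; a minimal sketch, assuming Apex was built with `--cuda_ext` and the parameters live on the GPU:

```python
# Sketch: apex.optimizers.FusedAdam as a drop-in for torch.optim.Adam
# (requires the --cuda_ext build; parameters must be CUDA tensors).
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(128, 10).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(64, 128).cuda()).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```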