Commit 5f8c3183 authored by Michael Carilli

More docstring + README updates

parent 82d7a3bf
@@ -54,6 +54,9 @@ optimized for NVIDIA's NCCL communication library.
[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
The [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
mixed precision examples also demonstrate `apex.parallel.DistributedDataParallel`.
# Requirements
Python 3
......
distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process multi-GPU data parallel training, optimized for NVIDIA's NCCL communication library.
distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process multi-GPU data parallel training optimized for NVIDIA's NCCL communication library.
`apex.parallel.DistributedDataParallel` achieves high performance by overlapping communication with
computation in the backward pass and bucketing smaller transfers to reduce the total number of
transfers required.
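For reference, a minimal usage sketch (assuming the process group has already been initialized and this process's device has been set, as in the walkthrough linked below; the model, loss, and hyperparameters are illustrative):

```python
import torch
from apex.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)  # parameters are broadcast across participating processes here

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

x = torch.randn(32, 1024).cuda()
y = torch.randn(32, 1024).cuda()

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()   # gradients are allreduced in buckets, overlapped with the backward computation
optimizer.step()
```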
@@ -9,4 +10,6 @@ multiproc.py contains the source code for `apex.parallel.multiproc`, a launch ut
### [Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
### [Imagenet Example w/Mixed Precision](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
@@ -35,21 +35,33 @@ def flat_dist_call(tensors, call, extra_args=None):
class DistributedDataParallel(Module):
"""
:class:`DistributedDataParallel` is a simpler version of upstream
:class:`DistributedDataParallel` that is optimized for use with NCCL. It is designed
to be used in conjunction with apex.parallel.multiproc.py. It assumes that your run
uses multiprocessing with 1 GPU/process, that the model is on the correct device,
and that torch.cuda.set_device has been used to set the device. Parameters are broadcast
to the other processes on initialization of DistributedDataParallel, and will be
allreduced in buckets during the backward pass.
:class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``.
See https://github.com/NVIDIA/apex/tree/master/examples/distributed for detailed usage.
:class:`DistributedDataParallel` is designed to work with
the launch utility script ``apex.parallel.multiproc.py``.
When used with ``multiproc.py``, :class:`DistributedDataParallel`
assigns 1 process to each of the available (visible) GPUs on the node.
Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
:class:`DistributedDataParallel` assumes that your script accepts the command line
arguments "rank" and "world-size." It also assumes that your script calls
``torch.cuda.set_device(args.rank)`` before creating the model.
https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows another example
that combines :class:`DistributedDataParallel` with mixed precision training.
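A minimal sketch of the assumed script-side setup (the argument names follow the description above; ``init_method`` and the model constructor are illustrative, see the linked examples for a complete script)::

    import argparse
    import torch
    from apex.parallel import DistributedDataParallel

    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, default=0)
    parser.add_argument('--world-size', type=int, default=1)
    args = parser.parse_args()

    torch.cuda.set_device(args.rank)   # must happen before the model is created
    torch.distributed.init_process_group(backend='nccl',
                                         init_method='env://',  # illustrative
                                         world_size=args.world_size,
                                         rank=args.rank)

    model = MyModel().cuda()                # MyModel is a placeholder
    model = DistributedDataParallel(model)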
Args:
module: Network definition to be run in multi-gpu/distributed mode.
message_size (Default = 10e6): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters, this must be true;
it will disable bucketing of parameters, which is necessary to avoid race conditions.
message_size (Default = 1e7): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters, this must be True.
It will disable bucketing of parameters to avoid race conditions.
"""
......
@@ -4,7 +4,7 @@ import subprocess
def docstring_hack():
"""
Multiproc file which will launcch a set of processes locally for multi-gpu
Multiproc file which will launch a set of processes locally for multi-gpu
usage: python -m apex.parallel.multiproc main.py ...
"""
pass
......
@@ -4,23 +4,25 @@
apex.RNN
===================================
This submodule is an in-development API that aims to supply parity with torch.nn.RNN
while being easier to extend. This module is not ready for use and still lacks important
features and validation.
.. automodule:: apex.RNN
.. currentmodule:: apex.RNN
.. RNN
----------
.. autofunction:: LSTM
.. autofunction:: mLSTM
.. autofunction:: GRU
.. autofunction:: ReLU
.. autofunction:: Tanh
Under construction...
.. This submodule is an in-development API that aims to supply parity with torch.nn.RNN
.. while being easier to extend. This module is not ready for use and still lacks important
.. features and validation.
..
.. .. automodule:: apex.RNN
.. .. currentmodule:: apex.RNN
..
.. .. RNN
.. ----------
..
.. .. autofunction:: LSTM
..
.. .. autofunction:: mLSTM
..
.. .. autofunction:: GRU
..
.. .. autofunction:: ReLU
..
.. .. autofunction:: Tanh
@@ -5,16 +5,13 @@
:github_url: https://github.com/nvidia/apex
APEx (A PyTorch Extension)
Apex (A PyTorch Extension)
===================================
This is a repo designed to hold PyTorch modules and utilities that are under active development and experimental. This repo is not designed as a long term solution or a production solution. Things placed in here are intended to be eventually moved to upstream PyTorch.
This site contains the API documentation for Apex (https://github.com/nvidia/apex),
a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. Some of the code here will eventually be included in upstream PyTorch. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
A major focus of this extension is the training of neural networks using 16-bit floating point math, which offers significant performance benefits on the latest NVIDIA GPU architectures. The reduced dynamic range of half precision, however, makes it more vulnerable to numerical overflow/underflow.
Apex is an NVIDIA-maintained repository of utilities, including some that are targeted at improving the accuracy and stability of half precision networks while maintaining high performance. The utilities are designed to be minimally invasive and easy to use.
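For a quick illustration of the reduced range (plain PyTorch, not part of the Apex API)::

    import torch

    x = torch.tensor([1e-8, 1e5], dtype=torch.float32)
    print(x.half())
    # tensor([0., inf], dtype=torch.float16) -- the small value underflows
    # to zero and the large value overflows to inf in half precision.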
Installation requires CUDA9, PyTorch 0.3 or later, and Python 3. Installation can be done by running
Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Installation can be done by running
::
git clone https://www.github.com/nvidia/apex
cd apex
@@ -24,12 +21,18 @@ Installation requires CUDA9, PyTorch 0.3 or later, and Python 3. Installation ca
.. toctree::
:maxdepth: 1
:caption: apex
:caption: FP16/Mixed Precision Training
parallel
reparameterization
RNN
fp16_utils
.. toctree::
:maxdepth: 1
:caption: Distributed Training
parallel
.. reparameterization
.. RNN
Indices and tables
==================
......
@@ -4,11 +4,13 @@
apex.reparameterization
===================================
.. automodule:: apex.reparameterization
.. currentmodule:: apex.reparameterization
Under construction...
.. autoclass:: Reparameterization
:members:
.. autoclass:: WeightNorm
:members:
.. .. automodule:: apex.reparameterization
.. .. currentmodule:: apex.reparameterization
..
.. .. autoclass:: Reparameterization
.. :members:
..
.. .. autoclass:: WeightNorm
.. :members:
# Basic Multiprocess Example based on pytorch/examples/mnist
This example demonstrates how to modify a network to use a simple but effective distributed data parallel module. This parallel method is designed to make multi-GPU runs on a single node easy. It was created because the parallel methods currently integrated into PyTorch can induce significant overhead due to the Python GIL. This method reduces the influence of those overheads and can provide a performance benefit, especially for networks with a significant number of fast-running operations.
main.py demonstrates how to modify a simple model to enable multiprocess distributed data parallel
training using the module wrapper `apex.parallel.DistributedDataParallel`
(similar to `torch.nn.parallel.DistributedDataParallel`).
Multiprocess distributed data parallel training frequently outperforms single-process
data parallel training (such as that offered by `torch.nn.DataParallel`) because each process has its
own Python interpreter. Therefore, driving multiple GPUs with multiple processes reduces
global interpreter lock contention versus having a single process (with a single GIL) drive all GPUs.
`apex.parallel.DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
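Conceptually, the averaging performed during ``backward()`` amounts to summing each gradient across processes and dividing by the world size. A simplified sketch of that idea (not the actual bucketed, overlapped implementation):

```python
import torch.distributed as dist

def average_gradients(model):
    # What DistributedDataParallel does, in spirit: allreduce (sum) each
    # gradient across processes, then divide by the number of processes.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data)  # defaults to a SUM reduction
            param.grad.data /= world_size
```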
[API Documentation](https://nvidia.github.io/apex/parallel.html)
@@ -10,15 +21,31 @@ This example demonstrates how to modify a network to use a simple but effective
Prior to running please run
```pip install -r requirements.txt```
and start a single-process run to allow the dataset to be downloaded (this will not work properly with multiple GPUs; you can stop the job as soon as it starts iterating).
To download the dataset, run
```python main.py```
You can now launch multi-process data-parallel jobs via
```python -m apex.parallel.multiproc main.py ...```
adding any normal option you'd like. Each process will run on one of your system's available GPUs.
without any arguments. Once you have downloaded the dataset, you should not need to do this again.
You can now launch multi-process distributed data parallel jobs via
```python -m apex.parallel.multiproc main.py args...```
adding any args... you'd like. The launch script `apex.parallel.multiproc` will
spawn one process for each of your system's available (visible) GPUs.
Each process will run `python main.py args... --world-size <worldsize> --rank <rank>`
(the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`).
Each `main.py` calls `torch.cuda.set_device()` and `torch.distributed.init_process_group()`
according to the `rank` and `world-size` arguments it receives.
The number of visible GPU devices (and therefore the number of processes
`apex.parallel.multiproc` will spawn) can be controlled by setting the environment variable
`CUDA_VISIBLE_DEVICES`. For example, if you `export CUDA_VISIBLE_DEVICES=0,1` and run
```python -m apex.parallel.multiproc main.py ...```, the launch utility will spawn two processes
which will run on devices 0 and 1. By default, if `CUDA_VISIBLE_DEVICES` is unset,
`apex.parallel.multiproc` will attempt to use every device on the node.
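The launcher's process count follows what PyTorch reports as visible devices; conceptually (a sketch of the idea, not the launcher's exact code):

```python
import torch

# torch.cuda.device_count() respects CUDA_VISIBLE_DEVICES, so after
# `export CUDA_VISIBLE_DEVICES=0,1` it returns 2 and one worker process
# is launched per visible device.
num_workers = torch.cuda.device_count()
print(num_workers)
```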
## Converting your own model
To understand how to convert your own model to use the distributed module included, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
To understand how to convert your own model, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
[Example with Imagenet and mixed precision training](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
## Requirements
PyTorch master branch built from source. This requirement exists in order to use NCCL as the distributed backend.
@@ -34,7 +34,7 @@ python main.py -a alexnet --lr 0.01 /path/to/imagenet/folder
The directory at /path/to/imagenet/directory should contain two subdirectories called "train"
and "val" that contain the training and validation data respectively. Train images are expected to be 256x256 jpegs.
Example commands (note: batch size --b 256 assumes your GPUs have >=16GB of onboard memory).
**Example commands:** (note: batch size `--b 256` assumes your GPUs have >=16GB of onboard memory)
```bash
### Softlink training dataset into current directory
......