Commit 5f8c3183 authored by Michael Carilli

More docstring + README updates

parent 82d7a3bf
@@ -54,6 +54,9 @@ optimized for NVIDIA's NCCL communication library.
[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
The [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
mixed precision examples also demonstrate `apex.parallel.DistributedDataParallel`.
# Requirements
Python 3
......
distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process multi-GPU data parallel training, optimized for NVIDIA's NCCL communication library.
distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process multi-GPU data parallel training optimized for NVIDIA's NCCL communication library.
`apex.parallel.DistributedDataParallel` achieves high performance by overlapping communication with
computation in the backward pass and bucketing smaller transfers to reduce the total number of
transfers required.
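For reference, a minimal usage sketch (assuming the process group has already been initialized and this process's device has been set, as in the walkthrough linked below; the model, loss, and hyperparameters are illustrative):

```python
import torch
from apex.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)  # parameters are broadcast across participating processes here

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

x = torch.randn(32, 1024).cuda()
y = torch.randn(32, 1024).cuda()

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()   # gradients are allreduced in buckets, overlapped with the backward computation
optimizer.step()
```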
@@ -9,4 +10,6 @@ multiproc.py contains the source code for `apex.parallel.multiproc`, a launch ut
### [Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
### [Imagenet Example w/Mixed Precision](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
@@ -35,21 +35,33 @@ def flat_dist_call(tensors, call, extra_args=None):
class DistributedDataParallel(Module):
"""
:class:`DistributedDataParallel` is a simpler version of upstream
:class:`DistributedDataParallel` that is optimized for use with NCCL. It is designed
to be used in conjunction with apex.parallel.multiproc.py. It assumes that your run
uses multiprocessing with 1 GPU/process, that the model is on the correct device,
and that torch.cuda.set_device has been used to set the device. Parameters are broadcast
to the other processes on initialization of DistributedDataParallel, and will be
allreduced in buckets during the backward pass.
:class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``.
See https://github.com/NVIDIA/apex/tree/master/examples/distributed for detailed usage.
:class:`DistributedDataParallel` is designed to work with
the launch utility script ``apex.parallel.multiproc.py``.
When used with ``multiproc.py``, :class:`DistributedDataParallel`
assigns 1 process to each of the available (visible) GPUs on the node.
Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
:class:`DistributedDataParallel` assumes that your script accepts the command line
arguments "rank" and "world-size." It also assumes that your script calls
``torch.cuda.set_device(args.rank)`` before creating the model.
https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows another example
that combines :class:`DistributedDataParallel` with mixed precision training.
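A minimal sketch of the assumed script-side setup (the argument names follow the description above; ``init_method`` and the model constructor are illustrative, see the linked examples for a complete script)::

    import argparse
    import torch
    from apex.parallel import DistributedDataParallel

    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, default=0)
    parser.add_argument('--world-size', type=int, default=1)
    args = parser.parse_args()

    torch.cuda.set_device(args.rank)   # must happen before the model is created
    torch.distributed.init_process_group(backend='nccl',
                                         init_method='env://',  # illustrative
                                         world_size=args.world_size,
                                         rank=args.rank)

    model = MyModel().cuda()                # MyModel is a placeholder
    model = DistributedDataParallel(model)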
Args:
module: Network definition to be run in multi-gpu/distributed mode.
message_size (Default = 10e6): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters, this must be true;
it will disable bucketing of parameters, which is necessary to avoid race conditions.
message_size (Default = 1e7): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters, this must be True.
It will disable bucketing of parameters to avoid race conditions.
"""
......
@@ -4,7 +4,7 @@ import subprocess
def docstring_hack():
"""
Multiproc file which will launcch a set of processes locally for multi-gpu
Multiproc file which will launch a set of processes locally for multi-gpu
usage: python -m apex.parallel.multiproc main.py ...
"""
pass
......
@@ -4,23 +4,25 @@
apex.RNN
===================================
This submodule is an in-development API that aims to supply parity with torch.nn.RNN
while being easier to extend. This module is not ready for use and still lacks important
features and validation.
.. automodule:: apex.RNN
.. currentmodule:: apex.RNN
.. RNN
----------
.. autofunction:: LSTM
.. autofunction:: mLSTM
.. autofunction:: GRU
.. autofunction:: ReLU
.. autofunction:: Tanh
Under construction...
.. This submodule is an in-development API that aims to supply parity with torch.nn.RNN
.. while being easier to extend. This module is not ready for use and still lacks important
.. features and validation.
..
.. .. automodule:: apex.RNN
.. .. currentmodule:: apex.RNN
..
.. .. RNN
.. ----------
..
.. .. autofunction:: LSTM
..
.. .. autofunction:: mLSTM
..
.. .. autofunction:: GRU
..
.. .. autofunction:: ReLU
..
.. .. autofunction:: Tanh
@@ -5,16 +5,13 @@
:github_url: https://github.com/nvidia/apex
APEx (A PyTorch Extension)
Apex (A PyTorch Extension)
===================================
This is a repo designed to hold PyTorch modules and utilities that are under active development and experimental. This repo is not designed as a long term solution or a production solution. Things placed in here are intended to be eventually moved to upstream PyTorch.
This site contains the API documentation for Apex (https://github.com/nvidia/apex),
a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. Some of the code here will eventually be included in upstream PyTorch. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
A major focus of this extension is the training of neural networks using 16-bit floating point math, which offers significant performance benefits on the latest NVIDIA GPU architectures. The reduced dynamic range of half precision, however, makes it more vulnerable to numerical overflow/underflow.
Apex is an NVIDIA-maintained repository of utilities, including some that are targeted at improving the accuracy and stability of half precision networks while maintaining high performance. The utilities are designed to be minimally invasive and easy to use.
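For a quick illustration of the reduced range (plain PyTorch, not part of the Apex API)::

    import torch

    x = torch.tensor([1e-8, 1e5], dtype=torch.float32)
    print(x.half())
    # tensor([0., inf], dtype=torch.float16) -- the small value underflows
    # to zero and the large value overflows to inf in half precision.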
Installation requires CUDA9, PyTorch 0.3 or later, and Python 3. Installation can be done by running
Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Installation can be done by running
::
git clone https://www.github.com/nvidia/apex
cd apex
@@ -24,12 +21,18 @@ Installation requires CUDA9, PyTorch 0.3 or later, and Python 3. Installation ca
.. toctree::
:maxdepth: 1
:caption: apex
:caption: FP16/Mixed Precision Training
parallel
reparameterization
RNN
fp16_utils
.. toctree::
:maxdepth: 1
:caption: Distributed Training
parallel
.. reparameterization
.. RNN
Indices and tables
==================
......
@@ -4,11 +4,13 @@
apex.reparameterization
===================================
.. automodule:: apex.reparameterization
.. currentmodule:: apex.reparameterization
Under construction...
.. autoclass:: Reparameterization
:members:
.. autoclass:: WeightNorm
:members:
.. .. automodule:: apex.reparameterization
.. .. currentmodule:: apex.reparameterization
..
.. .. autoclass:: Reparameterization
.. :members:
..
.. .. autoclass:: WeightNorm
.. :members:
# Basic Multiprocess Example based on pytorch/examples/mnist
This example demonstrates how to modify a network to use a simple but effective distributed data parallel module. This parallel method is designed to make multi-GPU runs on a single node easy. It was created because the parallel methods currently integrated into PyTorch can induce significant overhead due to the Python GIL. This method reduces the influence of those overheads and can provide a performance benefit, especially for networks with a significant number of fast-running operations.
main.py demonstrates how to modify a simple model to enable multiprocess distributed data parallel
training using the module wrapper `apex.parallel.DistributedDataParallel`
(similar to `torch.nn.parallel.DistributedDataParallel`).
Multiprocess distributed data parallel training frequently outperforms single-process
data parallel training (such as that offered by `torch.nn.DataParallel`) because each process has its
own Python interpreter. Therefore, driving multiple GPUs with multiple processes reduces
global interpreter lock contention versus having a single process (with a single GIL) drive all GPUs.
`apex.parallel.DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
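Conceptually, the averaging performed during ``backward()`` amounts to summing each gradient across processes and dividing by the world size. A simplified sketch of that idea (not the actual bucketed, overlapped implementation):

```python
import torch.distributed as dist

def average_gradients(model):
    # What DistributedDataParallel does, in spirit: allreduce (sum) each
    # gradient across processes, then divide by the number of processes.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data)  # defaults to a SUM reduction
            param.grad.data /= world_size
```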
[API Documentation](https://nvidia.github.io/apex/parallel.html)
@@ -10,15 +21,31 @@ This example demonstrates how to modify a network to use a simple but effective
Prior to running please run
```pip install -r requirements.txt```
and start a single-process run to allow the dataset to be downloaded (this will not work properly with multiple GPUs; you can stop the job as soon as it starts iterating).
To download the dataset, run
```python main.py```
You can now launch multi-process data-parallel jobs via
```python -m apex.parallel.multiproc main.py ...```
adding any normal option you'd like. Each process will run on one of your system's available GPUs.
without any arguments. Once you have downloaded the dataset, you should not need to do this again.
You can now launch multi-process distributed data parallel jobs via
```python -m apex.parallel.multiproc main.py args...```
adding any args... you'd like. The launch script `apex.parallel.multiproc` will
spawn one process for each of your system's available (visible) GPUs.
Each process will run `python main.py args... --world-size <worldsize> --rank <rank>`
(the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`).
Each `main.py` calls `torch.cuda.set_device()` and `torch.distributed.init_process_group()`
according to the `rank` and `world-size` arguments it receives.
The number of visible GPU devices (and therefore the number of processes
`apex.parallel.multiproc` will spawn) can be controlled by setting the environment variable
`CUDA_VISIBLE_DEVICES`. For example, if you `export CUDA_VISIBLE_DEVICES=0,1` and run
```python -m apex.parallel.multiproc main.py ...```, the launch utility will spawn two processes
which will run on devices 0 and 1. By default, if `CUDA_VISIBLE_DEVICES` is unset,
`apex.parallel.multiproc` will attempt to use every device on the node.
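The launcher's process count follows what PyTorch reports as visible devices; conceptually (a sketch of the idea, not the launcher's exact code):

```python
import torch

# torch.cuda.device_count() respects CUDA_VISIBLE_DEVICES, so after
# `export CUDA_VISIBLE_DEVICES=0,1` it returns 2 and one worker process
# is launched per visible device.
num_workers = torch.cuda.device_count()
print(num_workers)
```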
## Converting your own model
To understand how to convert your own model to use the distributed module included, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
To understand how to convert your own model, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
[Example with Imagenet and mixed precision training](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
## Requirements
PyTorch master branch built from source. This requirement exists in order to use NCCL as the distributed backend.
@@ -34,7 +34,7 @@ python main.py -a alexnet --lr 0.01 /path/to/imagenet/folder
The directory at /path/to/imagenet/directory should contain two subdirectories called "train"
and "val" that contain the training and validation data respectively. Train images are expected to be 256x256 jpegs.
Example commands (note: batch size --b 256 assumes your GPUs have >=16GB of onboard memory).
**Example commands:** (note: batch size `--b 256` assumes your GPUs have >=16GB of onboard memory)
```bash
### Softlink training dataset into current directory
......