distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process multi-GPU data parallel training optimized for NVIDIA's NCCL communication library.
`apex.parallel.DistributedDataParallel` achieves high performance by overlapping communication with
computation in the backward pass and bucketing smaller transfers to reduce the total number of
transfers required.
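Wrapping a model is a one-line change. A minimal sketch, assuming the process group has already been initialized with the NCCL backend and the current CUDA device has been set for this process's rank:

```python
import torch
from apex.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(1024, 1024).cuda()
# Gradients are all-reduced across processes during backward(), with
# communication overlapped with computation and small tensors bucketed.
model = DDP(model)
```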
...
...
multiproc.py contains the source code for `apex.parallel.multiproc`, a launch utility that spawns one process for each of the node's visible GPUs.
This repo is designed to hold PyTorch modules and utilities that are experimental and under active development. It is not intended as a long-term or production solution; things placed here are intended to be moved to upstream PyTorch eventually.
This site contains the API documentation for Apex (https://github.com/nvidia/apex),
a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. Some of the code here will be included in upstream PyTorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
A major focus of this extension is the training of neural networks using 16-bit floating point math, which offers significant performance benefits on the latest NVIDIA GPU architectures. However, the reduced dynamic range of half precision makes it more vulnerable to numerical overflow and underflow.
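For a concrete sense of that reduced range: FP16's largest finite value is 65504 and its smallest positive subnormal is roughly 6e-8, so values that are unremarkable in FP32 can overflow or underflow after a cast:

```python
import torch

x32 = torch.tensor([70000.0, 1e-8])  # both values are fine in FP32
x16 = x32.half()                     # cast to FP16
print(x16)  # tensor([inf, 0.], dtype=torch.float16) -- overflow and underflow
```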
Apex is an NVIDIA-maintained repository of utilities, including some aimed at improving the accuracy and stability of half precision networks while maintaining high performance. The utilities are designed to be minimally invasive and easy to use.
Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Installation can be done by running
::
git clone https://www.github.com/nvidia/apex
cd apex
...
...
# Basic Multiprocess Example based on pytorch/examples/mnist
This example demonstrates how to modify a network to use a simple but effective distributed data parallel module, designed to make multi-GPU runs on a single node easy. It exists because the data parallel methods built into PyTorch that drive all GPUs from a single process can incur significant overhead from Python's global interpreter lock (GIL). The multiprocess approach reduces that overhead and can improve performance, especially for networks with many fast-running operations.
main.py demonstrates how to modify a simple model to enable multiprocess distributed data parallel
training using the module wrapper `apex.parallel.DistributedDataParallel`
(similar to `torch.nn.parallel.DistributedDataParallel`).
Multiprocess distributed data parallel training frequently outperforms single-process
data parallel training (such as that offered by `torch.nn.DataParallel`) because each process has its
own python interpreter. Therefore, driving multiple GPUs with multiple processes reduces
global interpreter lock contention versus having a single process (with a single GIL) drive all GPUs.
`apex.parallel.DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
Prior to running, please install the dependencies with
```pip install -r requirements.txt```
To download the dataset, start a single-process run first (the download will not work properly in a multi-GPU run; you can stop this job as soon as it starts iterating):
```python main.py```
You can now launch multi-process data-parallel jobs via
```python -m apex.parallel.multiproc main.py args...```
adding any args you'd like. The launch script `apex.parallel.multiproc` will
spawn one process for each of your system's available (visible) GPUs.
Each process will run `python main.py args... --world-size <worldsize> --rank <rank>`
(the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`).
Each `main.py` calls `torch.cuda.set_device()` and `torch.distributed.init_process_group()`
according to the `rank` and `world-size` arguments it receives.
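A sketch of what that per-process initialization can look like (the argument names follow the launch convention above; the rendezvous method and other details of the actual main.py may differ):

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--rank', type=int, default=0)
args = parser.parse_args()

# One process per GPU: bind this process to the device matching its rank.
torch.cuda.set_device(args.rank)
# NCCL backend; the env:// init_method is an assumption -- the example may
# instead rendezvous via a TCP address or a shared file.
dist.init_process_group(backend='nccl',
                        init_method='env://',
                        world_size=args.world_size,
                        rank=args.rank)
```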
The number of visible GPU devices (and therefore the number of processes
`DistributedDataParallel` will spawn) can be controlled by setting the environment variable
`CUDA_VISIBLE_DEVICES`. For example, if you `export CUDA_VISIBLE_DEVICES=0,1` and run
```python -m apex.parallel.multiproc main.py ...```, the launch utility will spawn two processes
which will run on devices 0 and 1. By default, if `CUDA_VISIBLE_DEVICES` is unset,
`apex.parallel.multiproc` will attempt to use every device on the node.
## Converting your own model
To understand how to convert your own model, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
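For orientation, here is a rough sketch of the kinds of additions those sections make, assuming the process group has already been initialized as described above; the model class and loader arguments are illustrative stand-ins rather than a copy of main.py:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
from apex.parallel import DistributedDataParallel as DDP

class Net(torch.nn.Module):
    """Small stand-in for the example's MNIST model."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.fc(x.view(x.size(0), -1))

train_set = datasets.MNIST('./data', train=True, download=False,
                           transform=transforms.ToTensor())
# Give each process a distinct shard of the training data.
sampler = DistributedSampler(train_set)
train_loader = DataLoader(train_set, batch_size=64, sampler=sampler)

model = Net().cuda()
model = DDP(model)  # gradients are all-reduced across processes during backward()
```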
[Example with Imagenet and mixed precision training](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
## Requirements
PyTorch master branch built from source. This is required in order to use NCCL as the distributed backend.