**distributed_data_parallel.py** and **run.sh** show an example using `FP16_Optimizer` with
`apex.parallel.DistributedDataParallel` in conjunction with the legacy Apex
launcher script, `apex.parallel.multiproc`. See
[FP16_Optimizer_simple/distributed_apex](https://github.com/NVIDIA/apex/tree/torch_launcher/examples/FP16_Optimizer_simple/distributed_apex) for a more up-to-date example that uses the Pytorch launcher
script, `torch.distributed.launch`.
The usage of `FP16_Optimizer` with distributed training does not need to change from ordinary single-process usage. Launch via `python -m apex.parallel.multiproc main.py args...`, adding any `args...` you like. The launch script `apex.parallel.multiproc` will spawn one process for each of your system's available (visible) GPUs. Each process will run `python main.py args... --world-size <worldsize> --rank <rank>` (the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`). Each `main.py` calls `torch.cuda.set_device()` and `torch.distributed.init_process_group()` according to the `rank` and `world-size` arguments it receives. The number of visible GPU devices (and therefore the number of processes `apex.parallel.multiproc` will spawn) can be controlled by setting the environment variable `CUDA_VISIBLE_DEVICES`. For example, if you `export CUDA_VISIBLE_DEVICES=0,1` and run `python -m apex.parallel.multiproc main.py ...`, the launch utility will spawn two processes, which will run on devices 0 and 1. By default, if `CUDA_VISIBLE_DEVICES` is unset, `apex.parallel.multiproc` will attempt to use every device on the node.

The up-to-date example uses `torch.distributed.launch` instead, which spawns `N` processes, each of which runs as `python main.py args... --local_rank <rank>`. The `local_rank` argument for each process is determined and appended by `torch.distributed.launch`, and varies between 0 and `N-1`. `torch.distributed.launch` also provides environment variables for each process. Internally, each process calls `set_device` according to its local rank and `init_process_group` with `init_method='env://'` to ingest the provided environment variables. For best performance, set `N` equal to the number of visible CUDA devices on the node.
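As a rough per-process sketch (illustrative, not the literal contents of `main.py`; the single-node rendezvous address is an assumption), the setup for the legacy launcher looks something like this:

```python
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# Appended automatically by apex.parallel.multiproc:
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--rank', type=int, default=0)
args = parser.parse_args()

# On a single node, the global rank doubles as the local device index.
torch.cuda.set_device(args.rank)

# All processes rendezvous at the same address/port.
# With torch.distributed.launch you would instead parse --local_rank
# and use init_method='env://' to ingest the launcher's environment variables.
dist.init_process_group(backend='nccl',
                        init_method='tcp://127.0.0.1:23456',
                        world_size=args.world_size,
                        rank=args.rank)
```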
## Converting your own model
To understand how to convert your own model, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
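As an illustrative sketch (the exact flagged sections in main.py may differ; the `--local_rank` flag and `env://` init method shown here assume the `torch.distributed.launch` workflow), the additions typically look like this:

```python
import argparse

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from apex.parallel import DistributedDataParallel as DDP

#=====START: ADDED FOR DISTRIBUTED======
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # appended by torch.distributed.launch
args = parser.parse_args()

# Bind this process to its GPU, then rendezvous via the launcher's environment variables.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')
#=====END: ADDED FOR DISTRIBUTED======

# Stand-in model and dataset for illustration.
model = torch.nn.Linear(16, 16).cuda()
dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 16))

#=====START: ADDED FOR DISTRIBUTED======
# Wrap the model so gradients are all-reduced across processes after backward().
model = DDP(model)

# Shard the data so each process trains on a distinct subset per epoch.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
#=====END: ADDED FOR DISTRIBUTED======
```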
## Requirements
Pytorch with NCCL available as a distributed backend. Pytorch 0.4+, installed as a pip or conda package, should have this by default. Otherwise, you can build Pytorch from source, in an environment where NCCL is installed and visible.
This example is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
It implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset.
`main.py` with the `--fp16` argument demonstrates mixed precision training with manual management of master parameters and loss scaling.

`main_fp16_optimizer.py` with `--fp16` demonstrates use of `apex.fp16_utils.FP16_Optimizer` to automatically manage master parameters and loss scaling.
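A minimal sketch of the `FP16_Optimizer` pattern; the toy model, learning rate, and dynamic loss scaling choice below are placeholders, not necessarily what `main_fp16_optimizer.py` uses:

```python
import torch
from apex.fp16_utils import FP16_Optimizer

# Toy FP16 model; in main_fp16_optimizer.py the network and optimizer
# are built from the command-line arguments instead.
model = torch.nn.Linear(1024, 1024).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# FP16_Optimizer keeps FP32 master copies of the parameters and
# manages loss scaling (dynamic scaling shown here).
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

data = torch.randn(8, 1024, device='cuda', dtype=torch.half)
loss = model(data).float().sum()

optimizer.zero_grad()
optimizer.backward(loss)   # replaces the usual loss.backward()
optimizer.step()           # copies the FP32 master update back into the FP16 model
```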
The directory at /path/to/imagenet/directory should contain two subdirectories called "train"
and "val" that contain the training and validation data respectively. Train images are expected to be 256x256 jpegs.
**Example commands:** (note: batch size `--b 256` assumes your GPUs have >=16GB of onboard memory)
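For instance, something along these lines; `--fp16` and `--b` are described above, while the architecture flag, worker count, and data path are assumptions that follow the upstream imagenet example:

```bash
# Mixed precision with manual master parameters / loss scaling:
python main.py -a resnet50 --fp16 --b 256 --workers 4 /path/to/imagenet/directory

# Same run, but with FP16_Optimizer managing master parameters and loss scaling:
python main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --workers 4 /path/to/imagenet/directory
```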
## Distributed training
`main.py` and `main_fp16_optimizer.py` have been modified to use the `DistributedDataParallel` module in Apex instead of the one in upstream PyTorch. `apex.parallel.DistributedDataParallel`
is a drop-in replacement for `torch.nn.parallel.DistributedDataParallel` (see our [distributed example](https://github.com/NVIDIA/apex/tree/master/examples/distributed)).
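A minimal sketch of the swap (the `env://` rendezvous assumes the process was started by one of the launchers described in the distributed example):

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel

# Rendezvous via the environment variables set by the launcher.
dist.init_process_group(backend='nccl', init_method='env://')

model = torch.nn.Linear(16, 16).cuda()

# Upstream PyTorch version:
#   model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# Apex drop-in replacement (no device_ids needed; each process owns a single GPU):
model = DistributedDataParallel(model)
```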