# Multiprocess Example based on pytorch/examples/mnist

main.py demonstrates how to modify a simple model to enable multiprocess distributed data parallel training using the module wrapper `apex.parallel.DistributedDataParallel` (similar to `torch.nn.parallel.DistributedDataParallel`).

Multiprocess distributed data parallel training frequently outperforms single-process data parallel training (such as that offered by `torch.nn.DataParallel`) because each process has its own Python interpreter. Therefore, driving multiple GPUs with multiple processes reduces global interpreter lock contention compared with having a single process (with a single GIL) drive all GPUs.

`apex.parallel.DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during `backward()` and by bucketing smaller gradient transfers to reduce the total number of transfers required.

#### [API Documentation](https://nvidia.github.io/apex/parallel.html)

#### [Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)

#### [Another example: Imagenet with mixed precision](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)

#### [Simple example with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple/distributed_apex)

## Getting started

Prior to running, please run ```pip install -r requirements.txt```

To download the dataset, run ```python main.py``` without any arguments. Once you have downloaded the dataset, you should not need to do this again.

You can now launch multiprocess distributed data parallel jobs via
```bash
python -m apex.parallel.multiproc main.py args...
```
adding any `args...` you like. The launch script `apex.parallel.multiproc` will spawn one process for each of your system's available (visible) GPUs. Each process will run `python main.py args... --world-size <world_size> --rank <rank>` (the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`). Each `main.py` process then calls `torch.cuda.set_device()` and `torch.distributed.init_process_group()` according to the `rank` and `world-size` arguments it receives.

The number of visible GPU devices (and therefore the number of processes `apex.parallel.multiproc` will spawn) can be controlled by setting the environment variable `CUDA_VISIBLE_DEVICES`. For example, if you `export CUDA_VISIBLE_DEVICES=0,1` and run ```python -m apex.parallel.multiproc main.py ...```, the launch utility will spawn two processes, which will run on devices 0 and 1. By default, if `CUDA_VISIBLE_DEVICES` is unset, `apex.parallel.multiproc` will attempt to use every device on the node.

## Converting your own model

To understand how to convert your own model, please see all sections of main.py within the ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags. A minimal sketch of this kind of setup is shown at the end of this README.

## Requirements

PyTorch master branch built from source. This requirement exists in order to use NCCL as the distributed backend.
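
## Sketch of the distributed setup

For reference, below is a minimal sketch of the kind of additions marked by the ```#=====START: ADDED FOR DISTRIBUTED======``` flags: bind each process to a GPU by rank, initialize the NCCL process group, and wrap the model with `apex.parallel.DistributedDataParallel`. It is not the exact contents of main.py; the argument parsing and the `tcp://` rendezvous address are assumptions for illustration, so consult main.py for the real code.

```python
# Minimal sketch of the distributed additions described above -- not the exact
# contents of main.py. The --world-size and --rank arguments are appended by
# apex.parallel.multiproc; the init_method address is an assumption for illustration.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--rank', type=int, default=0)
args = parser.parse_args()

# Each process drives one GPU, selected by its rank among the visible devices.
torch.cuda.set_device(args.rank % torch.cuda.device_count())

# NCCL is the backend apex.parallel.DistributedDataParallel is optimized for.
dist.init_process_group(
    backend='nccl',
    init_method='tcp://127.0.0.1:23456',  # assumed single-node rendezvous address
    world_size=args.world_size,
    rank=args.rank,
)

# Any model works here; a single linear layer keeps the sketch self-contained.
model = nn.Linear(784, 10).cuda()

# The wrapper overlaps gradient allreduce with backward() and buckets small transfers.
model = DDP(model)
```

From here, training proceeds as usual: each process feeds its own shard of the data (e.g. via a distributed sampler), and gradients are averaged across processes automatically during `backward()`.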