# Multiprocess Example based on pytorch/examples/mnist

`main.py` demonstrates how to modify a simple model to enable multiprocess distributed data parallel
training using the module wrapper `apex.parallel.DistributedDataParallel`
(similar to `torch.nn.parallel.DistributedDataParallel`).

Multiprocess distributed data parallel training frequently outperforms single-process
data parallel training (such as that offered by `torch.nn.DataParallel`) because each process has its
own Python interpreter.  Therefore, driving multiple GPUs with multiple processes reduces
global interpreter lock contention versus having a single process (with a single GIL) drive all GPUs.

`apex.parallel.DistributedDataParallel` is optimized for use with NCCL.  It achieves high performance by
overlapping communication with computation during `backward()` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
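
From the user's point of view, it is a drop-in module wrapper.  Below is a minimal sketch of typical usage, not code taken from `main.py`; the tiny model, the synthetic batch, and the assumption that the process group has already been initialized are all illustrative:

```python
import torch
import torch.nn as nn
from apex.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(backend="nccl", ...) has already
# been called and this process has been bound to its GPU (see "Getting started").
model = nn.Linear(784, 10).cuda()
model = DDP(model)  # gradients are allreduced across processes during backward()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(64, 784).cuda()
targets = torch.randint(0, 10, (64,)).cuda()

loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()   # bucketed allreduces overlap with gradient computation here
optimizer.step()
```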

#### [API Documentation](https://nvidia.github.io/apex/parallel.html)

#### [Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)

#### [Another example: Imagenet with mixed precision](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)

#### [Simple example with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple/distributed_apex)

## Getting started
Prior to running, please run
```pip install -r requirements.txt```

To download the dataset, run
```python main.py```
without any arguments.  Once you have downloaded the dataset, you should not need to do this again.

`main.py` runs multiprocess distributed data parallel jobs using the PyTorch launcher utility
[torch.distributed.launch](https://pytorch.org/docs/master/distributed.html#launch-utility).
Jobs are launched via
```bash
python -m torch.distributed.launch --nproc_per_node=N main.py args...
```
`torch.distributed.launch` spawns `N` processes, each of which runs as
`python main.py args... --local_rank <rank>`.
The `local_rank` argument for each process is determined and appended by `torch.distributed.launch`,
and ranges from 0 to `N-1`.  `torch.distributed.launch` also provides environment variables
for each process.
Internally, each process calls `torch.cuda.set_device` according to its local
rank and `torch.distributed.init_process_group` with `init_method='env://'` to ingest the provided
environment variables.
For best performance, set `N` equal to the number of visible CUDA devices on the node.
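
For reference, here is a minimal sketch of the per-process setup that this implies (the argument handling is illustrative rather than a copy of `main.py`):

```python
import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch appends --local_rank to each process's arguments.
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU, then join the process group using the
# environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
# exported by torch.distributed.launch.
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")
```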

## Converting your own model

To understand how to convert your own model, please see all sections of `main.py` between the ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END:   ADDED FOR DISTRIBUTED======``` flags.
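
As a rough sketch of the data-loading side of those changes (the placeholder dataset is purely illustrative, and the process-group setup and model wrapping shown in the earlier sections are assumed to have run already):

```python
import torch

# Placeholder dataset purely for illustration.
train_dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 784), torch.randint(0, 10, (1024,)))

# Shard the training data so each process sees a different subset.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=64, shuffle=False, sampler=train_sampler)

for epoch in range(10):
    train_sampler.set_epoch(epoch)  # reshuffle shards across processes each epoch
    for data, target in train_loader:
        ...  # forward/backward/step as usual
```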

## Requirements
PyTorch with NCCL available as a distributed backend.  PyTorch 0.4+, installed as a pip or conda package, should have this by default.  Otherwise, you can build PyTorch from source in an environment where NCCL is installed and visible.
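
If you are unsure whether your installation includes NCCL, one quick check (assuming a PyTorch version recent enough to provide this helper) is:

```python
import torch.distributed as dist

# True indicates the NCCL backend was compiled into this PyTorch build.
print(dist.is_nccl_available())
```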