# Basic Multiprocess Example based on pytorch/examples/mnist

This example demonstrates how to modify a network to use a simple but effective distributed data-parallel module. This parallel method is designed to make multi-GPU runs on a single node easy. It was created because the parallel methods currently integrated into PyTorch can incur significant overhead due to the Python GIL. This method reduces the influence of those overheads and can improve performance, especially for networks with a significant number of fast-running operations.
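
A minimal sketch of what using the module looks like (illustrative only: `nn.Linear` stands in for your own network, and `init_method='env://'` assumes the standard environment-variable rendezvous; main.py may initialize the process group differently):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from apex.parallel import DistributedDataParallel as DDP

# Join the NCCL process group. 'env://' reads MASTER_ADDR, MASTER_PORT,
# RANK, and WORLD_SIZE from the environment -- an assumption; main.py may
# pass rank and world size explicitly instead.
dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Linear(784, 10).cuda()  # stand-in for your own nn.Module
model = DDP(model)                 # gradients are all-reduced across processes

# Training then proceeds as usual; backward() triggers the all-reduce.
```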

[API Documentation](https://nvidia.github.io/apex/parallel.html)

[Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)

## Getting started
Prior to running, please install the requirements:
```pip install -r requirements.txt```

and start a single-process run so the dataset can be downloaded (the download will not work properly in multi-GPU mode; you can stop this job as soon as it starts iterating):
```python main.py```

You can now launch multi-process data-parallel jobs via
```python -m apex.parallel.multiproc main.py ...```
adding any of the script's normal options. Each process will run on one of your system's available GPUs.
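
For example, assuming main.py keeps the usual MNIST example options such as `--epochs` and `--batch-size` (an assumption; check main.py's argument parser):
```python -m apex.parallel.multiproc main.py --epochs 10 --batch-size 64```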

## Converting your own model
To understand how to convert your own model to use the included distributed module, please see all sections of main.py between the ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END:   ADDED FOR DISTRIBUTED======``` flags.
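
As a rough illustration of the kind of changes those flags mark (a sketch only, not a copy of main.py; the dataset and model here are stand-ins, and the process group is assumed to be initialized as shown earlier):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from apex.parallel import DistributedDataParallel as DDP

# Stand-ins so the sketch is self-contained; main.py uses MNIST and a small
# CNN, and dist.init_process_group(...) must already have been called.
train_dataset = TensorDataset(torch.randn(1024, 784),
                              torch.randint(0, 10, (1024,)))
model = nn.Linear(784, 10).cuda()

#=====START: ADDED FOR DISTRIBUTED======
# Shard the dataset so each process trains on a disjoint subset per epoch.
train_sampler = DistributedSampler(train_dataset)
#=====END:   ADDED FOR DISTRIBUTED======

train_loader = DataLoader(train_dataset, batch_size=64,
                          shuffle=False,  # the sampler handles shuffling
                          sampler=train_sampler)

#=====START: ADDED FOR DISTRIBUTED======
model = DDP(model)  # average gradients across processes during backward()
#=====END:   ADDED FOR DISTRIBUTED======
```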

## Requirements
PyTorch master branch built from source. This is required in order to use NCCL as the distributed backend.