# PSA:  Amp 1.0 API coming soon!  
(As introduced in the webinar at https://info.nvidia.com/webinar-mixed-precision-with-pytorch-reg-page.html.  The `amp` and `FP16_Optimizer` tools currently in master are separate prototypes, which will be unified by the Amp 1.0 API.)

Branch `api_refactor` tracks my progress.  Update as of 2/28:  PR-ed in https://github.com/NVIDIA/apex/pull/173.  I'd like to clean up the documentation a bit more before the final merge.

# Introduction

This repository holds NVIDIA-maintained utilities to streamline 
mixed precision and distributed training in PyTorch. 
Some of the code here will be included in upstream PyTorch eventually.
The intention of Apex is to make up-to-date utilities available to 
users as quickly as possible.

## Full API Documentation: [https://nvidia.github.io/apex](https://nvidia.github.io/apex)

# Contents

## 1. Mixed Precision 

### amp:  Automatic Mixed Precision

`apex.amp` is a tool designed for ease of use and maximum safety in FP16 training.  All potentially unsafe ops are performed in FP32 under the hood, while safe ops are performed using faster, Tensor Core-friendly FP16 math.  `amp` also automatically implements dynamic loss scaling. 

The intention of `amp` is to be the "on-ramp" to easy FP16 training: achieve all the numerical stability of full FP32 training, with most of the performance benefits of full FP16 training.

[Python Source and API Documentation](https://github.com/NVIDIA/apex/tree/master/apex/amp)
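
For example, the prototype `amp` API currently in master is used roughly as follows.  This is a minimal sketch: the model, optimizer, and random data are placeholders for illustration, and the API will change with the Amp 1.0 unification.

```
import torch
import torch.nn.functional as F
from apex import amp

# Toy model and optimizer, purely for illustration.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# amp.init() patches PyTorch functions so that numerically unsafe ops run in
# FP32 while Tensor Core-friendly ops run in FP16, and enables dynamic loss scaling.
amp_handle = amp.init()

data = torch.randn(64, 512).cuda()
target = torch.randn(64, 512).cuda()

optimizer.zero_grad()
loss = F.mse_loss(model(data), target)
# scale_loss() scales the loss so small FP16 gradients don't underflow,
# then unscales the gradients before optimizer.step().
with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```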

### FP16_Optimizer

`apex.FP16_Optimizer` wraps an existing Python optimizer and automatically implements master parameters and static or dynamic loss scaling under the hood.

The intention of `FP16_Optimizer` is to be the "highway" for FP16 training: achieve most of the numerical stability of full FP32 training, and almost all the performance benefits of full FP16 training.
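
A minimal sketch of the wrapping pattern, assuming the `apex.fp16_utils.FP16_Optimizer` import path documented below; the toy model and random data are placeholders.

```
import torch
import torch.nn.functional as F
from apex.fp16_utils import FP16_Optimizer

# Toy FP16 model; FP16_Optimizer maintains an FP32 master copy of the
# parameters internally.
model = torch.nn.Linear(512, 512).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Wrap the existing optimizer.  Pass static_loss_scale=128.0 for static
# loss scaling, or dynamic_loss_scale=True for dynamic loss scaling.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

data = torch.randn(64, 512).cuda().half()
target = torch.randn(64, 512).cuda().half()

optimizer.zero_grad()
loss = F.mse_loss(model(data), target)
# Call optimizer.backward(loss) instead of loss.backward() so the wrapper can
# scale the loss and route FP16 gradients into the FP32 master parameters.
optimizer.backward(loss)
optimizer.step()
```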

[API Documentation](https://nvidia.github.io/apex/fp16_utils.html#automatic-management-of-master-params-loss-scaling)

[Python Source](https://github.com/NVIDIA/apex/tree/master/apex/fp16_utils)

[Simple examples with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple)

[Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)

[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)

The Imagenet and word_language_model directories also contain examples that show manual management of master parameters and static loss scaling.  

These manual examples illustrate the sort of operations `amp` and `FP16_Optimizer` perform automatically, sketched below.
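
Roughly, the manual recipe looks like the following simplified sketch with a toy model and a fixed loss scale; it is an illustration of the technique, not the exact code in the example directories.

```
import torch
import torch.nn.functional as F

# FP16 model plus an FP32 "master" copy of its parameters.
model = torch.nn.Linear(512, 512).cuda().half()
master_params = [p.clone().detach().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-3)
loss_scale = 128.0  # static loss scale

data = torch.randn(64, 512).cuda().half()
target = torch.randn(64, 512).cuda().half()

optimizer.zero_grad()
loss = F.mse_loss(model(data), target)
# Scale the loss so small gradients survive FP16's limited dynamic range.
(loss * loss_scale).backward()

# Copy FP16 gradients into the FP32 master params, undoing the scale.
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / loss_scale

optimizer.step()

# Copy the updated FP32 master weights back into the FP16 model.
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
```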

## 2. Distributed Training

`apex.parallel.DistributedDataParallel` is a module wrapper, similar to 
`torch.nn.parallel.DistributedDataParallel`.  It enables convenient multiprocess distributed training,
optimized for NVIDIA's NCCL communication library.
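
A minimal usage sketch, assuming one process per GPU launched with `torch.distributed.launch`; the toy model is a placeholder.

```
import argparse
import torch
import apex

# torch.distributed.launch passes --local_rank to each process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(512, 512).cuda()
# Drop-in wrapper similar to torch.nn.parallel.DistributedDataParallel:
# gradients are all-reduced across processes during backward().
model = apex.parallel.DistributedDataParallel(model)
```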

[API Documentation](https://nvidia.github.io/apex/parallel.html)

[Python Source](https://github.com/NVIDIA/apex/tree/master/apex/parallel)

[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)

The [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet) 
mixed precision examples also demonstrate `apex.parallel.DistributedDataParallel`.

### Synchronized Batch Normalization

`apex.parallel.SyncBatchNorm` extends `torch.nn.modules.batchnorm._BatchNorm` to
support synchronized BN.
It all-reduces batch statistics across processes during multiprocess distributed
data parallel training.
Synchronized batch normalization is useful when only a very small local
mini-batch fits on each GPU.
All-reducing the statistics increases the effective batch size of each sync BN
layer to the sum of the mini-batch sizes across all processes.
It has improved converged accuracy in some of our research models.
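
A minimal sketch of swapping ordinary batch norm layers for synchronized ones, assuming the `convert_syncbn_model` helper in `apex.parallel` and a process group initialized as in the distributed example above; the toy model is a placeholder.

```
import torch
import apex

# Toy model containing ordinary BatchNorm layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()

# Replace every torch.nn.BatchNorm*d layer with apex.parallel.SyncBatchNorm.
# During multiprocess training, each SyncBatchNorm layer all-reduces its
# mean/variance statistics across processes.
model = apex.parallel.convert_syncbn_model(model)
```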

# Requirements

Python 3

CUDA 9 or 10

PyTorch 0.4 or newer.  We recommend using the latest stable release, obtainable from 
[https://pytorch.org/](https://pytorch.org/).  We also test against the latest master branch, obtainable from [https://github.com/pytorch/pytorch](https://github.com/pytorch/pytorch).  
If you have any problems building, please file an issue.

The C++ and CUDA extensions require PyTorch 1.0 or newer.



# Quick Start

### Linux
To build the extension, run
```
python setup.py install
```
in the root directory of the cloned repository.

To use the extension
```
import apex
```

### CUDA/C++ extension
Apex contains optional CUDA/C++ extensions, installable via
```
python setup.py install [--cuda_ext] [--cpp_ext]
```
Currently, `--cuda_ext` enables
- Fused kernels that improve the performance and numerical stability of `apex.parallel.SyncBatchNorm`.
- Fused kernels required to use `apex.optimizers.FusedAdam`.
- Fused kernels required to use `apex.normalization.FusedLayerNorm` (usage sketched below).
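
For example, once Apex is built with `--cuda_ext`, the fused modules are intended as drop-in replacements for their upstream counterparts.  A minimal sketch; the toy model is a placeholder.

```
import torch
import apex

# FusedLayerNorm follows the torch.nn.LayerNorm constructor signature.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    apex.normalization.FusedLayerNorm(512),
).cuda()

# FusedAdam follows the torch.optim.Adam constructor signature.
optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=1e-3)
```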

`--cpp_ext` enables
- C++-side flattening and unflattening utilities that reduce the CPU overhead of `apex.parallel.DistributedDataParallel`.

### Windows support
Windows support is experimental, and Linux is recommended.  However, because Apex can be installed Python-only, there's a good chance the Python-only features will "just work" the same way they do on Linux.  If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.

<!--
reparametrization and RNN API under construction

Current version of apex contains:
3. Reparameterization function that allows you to recursively apply reparameterization to an entire module (including children modules).
4. An experimental and in development flexible RNN API.
-->