# PSA: Unified API for mixed precision tools coming soon!
(as introduced by https://info.nvidia.com/webinar-mixed-precision-with-pytorch-reg-page.html)
Branch `api_refactor` is tracking my progress. Update as of 2/28: PR-ed in https://github.com/NVIDIA/apex/pull/173. I'd like to clean up the documentation a bit more before final merge.
# Introduction
This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
# Contents
## 1. Mixed Precision
### amp: Automatic Mixed Precision
`apex.amp` is a tool designed for ease of use and maximum safety in FP16 training. All potentially unsafe ops are performed in FP32 under the hood, while safe ops are performed using faster, Tensor Core-friendly FP16 math. `amp` also automatically implements dynamic loss scaling.
The intention of `amp` is to be the "on-ramp" to easy FP16 training: achieve all the numerical stability of full FP32 training, with most of the performance benefits of full FP16 training.
[Python Source and API Documentation](https://github.com/NVIDIA/apex/tree/master/apex/amp)
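As a quick illustration, here is a minimal sketch of the unified Amp API; see the documentation linked above for the authoritative description of `amp.initialize`, `amp.scale_loss`, and the available `opt_level` settings. The toy model and training step are purely illustrative.
```
import torch
from apex import amp

# Toy model and optimizer; any standard Pytorch model/optimizer works.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# amp.initialize patches the model and optimizer for mixed precision
# and sets up dynamic loss scaling under the hood.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

data = torch.randn(64, 512).cuda()
target = torch.randn(64, 512).cuda()

loss = torch.nn.functional.mse_loss(model(data), target)
# scale_loss multiplies the loss by the current loss scale so that
# FP16 gradients do not underflow; gradients are unscaled before step().
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```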
### FP16_Optimizer
`apex.FP16_Optimizer` wraps an existing Python optimizer and automatically implements master parameters and static or dynamic loss scaling under the hood.
The intention of `FP16_Optimizer` is to be the "highway" for FP16 training: achieve most of the numerical stability of full FP32 training, and almost all the performance benefits of full FP16 training.
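A minimal sketch of wrapping an existing optimizer, assuming the `apex.fp16_utils` import path and dynamic loss scaling (the toy model and single training step are illustrative; see the FP16_Optimizer documentation for the full set of options):
```
import torch
from apex.fp16_utils import FP16_Optimizer

# Model and data in FP16; FP16_Optimizer keeps FP32 master weights internally.
model = torch.nn.Linear(512, 512).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

data = torch.randn(64, 512).cuda().half()
target = torch.randn(64, 512).cuda().half()

loss = torch.nn.functional.mse_loss(model(data), target)
# FP16_Optimizer replaces loss.backward() with optimizer.backward(loss)
# so it can apply (and later unscale) the loss scale.
optimizer.backward(loss)
optimizer.step()
```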
The [Imagenet example](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
shows use of `apex.parallel.DistributedDataParallel` along with `apex.amp`.
### Synchronized Batch Normalization
`apex.parallel.SyncBatchNorm` extends `torch.nn.modules.batchnorm._BatchNorm` to
support synchronized BN.
It allreduces stats across processes during multiprocess (DistributedDataParallel) training.
Synchronous BN has been used in cases where only a small
local minibatch can fit on each GPU.
Allreduced stats increase the effective batch size for the BN layer to the
global batch size across all processes (which, technically, is the correct
formulation).
Synchronous BN has been observed to improve converged accuracy in some of our research models.
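A minimal sketch of swapping regular batchnorm layers for synchronized BN, assuming the `apex.parallel.convert_syncbn_model` helper and a process group already initialized via `torch.distributed` before training begins:
```
import torch
from apex.parallel import convert_syncbn_model

# Any model containing torch.nn.BatchNorm*d layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()

# convert_syncbn_model walks the module tree and replaces each BatchNorm
# layer with apex.parallel.SyncBatchNorm, which allreduces mean/variance
# statistics across the processes of the distributed job.
model = convert_syncbn_model(model)
```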
# Requirements
Python 3
CUDA 9 or newer
PyTorch 0.4 or newer. The CUDA and C++ extensions require PyTorch 1.0 or newer.
We recommend the latest stable release, obtainable from
[https://pytorch.org/](https://pytorch.org/). We also test against the latest master branch, obtainable from [https://github.com/pytorch/pytorch](https://github.com/pytorch/pytorch).
If you have any problems building, please file an issue.
It's often convenient to use Apex in Docker containers. Compatible options include:
* [NVIDIA Pytorch containers from NGC](https://ngc.nvidia.com/catalog/containers/nvidia%2Fpytorch), which come with Apex preinstalled. To use the latest Amp API, you may need to `pip uninstall apex` then reinstall Apex using the **Quick Start** commands below.
* [official Pytorch -devel Dockerfiles](https://hub.docker.com/r/pytorch/pytorch/tags), e.g. `docker pull pytorch/pytorch:nightly-devel-cuda10.0-cudnn7`, in which you can install Apex using the **Quick Start** commands.
# Quick Start
### Linux
Apex contains optional CUDA/C++ extensions. For performance and full functionality, we recommend installing Apex with these extensions by running
```
python setup.py install [--cuda_ext] [--cpp_ext]
```
in the root directory of the cloned repository.

Apex also supports a Python-only build (required with Pytorch 0.4) via
```
$ pip install -v --no-cache-dir .
```

Currently, `--cuda_ext` enables
- Fused kernels required to use `apex.optimizers.FusedAdam`.
- Fused kernels required to use `apex.normalization.FusedLayerNorm`.
- Fused kernels that improve the performance and numerical stability of `apex.parallel.SyncBatchNorm`.
- Fused kernels that improve the performance of `apex.parallel.DistributedDataParallel` and `apex.amp`.

`--cpp_ext` enables
- C++-side flattening and unflattening utilities that reduce the CPU overhead of `apex.parallel.DistributedDataParallel`.

A Python-only build omits all of the above. `DistributedDataParallel`, `amp`, and `SyncBatchNorm` will still be usable, but they may be slower.
### Windows support
Windows support is experimental, and Linux is recommended. `python setup.py install --cpp_ext --cuda_ext` may work if you were able to build Pytorch from source on your system. `python setup.py install` (without the CUDA/C++ extensions) is more likely to work, since the Python-only features should behave the same way as they do on Linux. If you installed Pytorch in a Conda environment, make sure to install Apex in that same environment.
<!--
reparametrization and RNN API under construction
Current version of apex contains:
3. Reparameterization function that allows you to recursively apply reparameterization to an entire module (including children modules).
4. An experimental and in development flexible RNN API.
-->
# ImageNet example
`main_amp.py` is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
It implements Automatic Mixed Precision (Amp) training of popular model architectures, such as ResNet, AlexNet, and VGG, on the ImageNet dataset, and illustrates use of the new Amp API along with command-line flags (forwarded to `amp.initialize`) to easily manipulate and switch between various pure and mixed precision training modes.
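As a rough sketch of the flag-forwarding pattern described above (the flag names here, such as `--opt-level`, are illustrative; see `main_amp.py` itself for the actual arguments it accepts):
```
import argparse
import torch
from apex import amp

parser = argparse.ArgumentParser()
# Hypothetical flags mirroring the pattern used in main_amp.py.
parser.add_argument("--opt-level", type=str, default="O1")
parser.add_argument("--loss-scale", type=str, default=None)
args = parser.parse_args()

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Command-line choices are forwarded directly to amp.initialize, so switching
# between pure FP32 (O0), mixed precision (O1/O2), and pure FP16 (O3) is a
# matter of changing a flag rather than editing the training script.
model, optimizer = amp.initialize(
    model, optimizer,
    opt_level=args.opt_level,
    loss_scale=args.loss_scale,
)
```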
Note that Apex DDP uses only the current device by default.
The choice of DDP wrapper (Torch or Apex) is orthogonal to the use of Amp and other Apex tools. It is safe to use `apex.amp` with either `torch.nn.parallel.DistributedDataParallel` or `apex.parallel.DistributedDataParallel`. In the future, I may add some features that permit optional tighter integration between `Amp` and `apex.parallel.DistributedDataParallel` for marginal performance benefits, but currently, there's no compelling reason to use Apex DDP versus Torch DDP for most models.
To use DDP with `apex.amp`, the only gotcha is that the call to `amp.initialize` must come before the model is wrapped with DDP.
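A minimal sketch of that ordering, assuming a process group has already been set up via `torch.distributed.init_process_group` (launch details omitted):
```
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 1. Initialize Amp first, so it can patch the model and optimizer...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# 2. ...then wrap the Amp-initialized model with DDP.
model = DDP(model)
```
The same ordering applies if you use `torch.nn.parallel.DistributedDataParallel` instead of the Apex wrapper.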
print("Warning: nvcc is not available. Ignoring --cuda-ext")
raiseRuntimeError("--cuda_ext was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.")