This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. The modules and utilities here are experimental and under active development; the repository is not intended as a long-term or production solution, and some of the code here will eventually be included in upstream PyTorch. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
`apex.amp` is a tool designed for ease of use and maximum safety in FP16 training. All potentially unsafe ops are performed in FP32 under the hood, while safe ops are performed using faster, Tensor Core-friendly FP16 math. `amp` also automatically implements dynamic loss scaling.
The intention of `amp` is to be the "on-ramp" to easy FP16 training: achieve all the numerical stability of full FP32 training, with most of the performance benefits of full FP16 training.
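Below is a minimal sketch of how `amp` is typically wired into a training loop. The handle-based calls follow the `amp` documentation of this era but may differ between Apex versions, so treat them as illustrative; the model, optimizer, and data here are placeholders.
```
import torch
from apex import amp

# Initialize amp once, near the top of the training script (a sketch; see the amp docs).
amp_handle = amp.init()

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = torch.randn(32, 1024, device="cuda")

optimizer.zero_grad()
loss = model(data).sum()

# Replace loss.backward() with a scaled backward pass so dynamic loss
# scaling can protect small FP16 gradients from underflow.
with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```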
`apex.FP16_Optimizer` wraps an existing Python optimizer and automatically implements master parameters and static or dynamic loss scaling under the hood.
The intention of `FP16_Optimizer` is to be the "highway" for FP16 training: achieve most of the numerical stability of full FP32 training, and almost all the performance benefits of full FP16 training.
### Examples:
[Simple examples with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple)
[Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
The Imagenet and word_language_model directories also contain examples that show manual management of master parameters and static loss scaling.
These examples illustrate what sort of operations `amp` and `FP16_Optimizer` are performing automatically.
## 2. Distributed Training
`apex.parallel.DistributedDataParallel` is a module wrapper, similar to
`torch.nn.parallel.DistributedDataParallel`. It enables convenient multiprocess distributed training,
optimized for NVIDIA's NCCL communication library.
`apex.parallel.multiproc` is a launch utility that helps set up arguments for `DistributedDataParallel`.
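As a quick illustration, the sketch below shows the usual pattern of wrapping a model with `apex.parallel.DistributedDataParallel`. The process-group setup (env:// rendezvous with environment variables supplied by a launcher) and the toy model are assumptions made for this example, not part of Apex itself.
```
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# Sketch: assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are set in the
# environment, e.g. by a launch utility such as apex.parallel.multiproc.
dist.init_process_group(backend="nccl", init_method="env://")

# Pin each process to one GPU before building the model.
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)  # gradients are all-reduced across processes during backward()
```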
PyTorch 0.4 or newer. We recommend using the latest stable release, obtainable from [https://pytorch.org/](https://pytorch.org/). We also test against the latest master branch, obtainable from [https://github.com/pytorch/pytorch](https://github.com/pytorch/pytorch).
If you have any problems building, please file an issue.
To build the extension run the following command in the root directory of this project:
```
python setup.py install
```
To use the extension simply run
```
import apex
```
and optionally (if required for your use)
```
import apex_C as apex_backend
```
# What's included
The current version of Apex contains:
1. Mixed precision utilities, documented [here](https://nvidia.github.io/apex/fp16_utils). Examples of using these utilities can be found in the [PyTorch imagenet example](https://github.com/csarofeen/examples/tree/apex/imagenet) and the [PyTorch word language model example](https://github.com/csarofeen/examples/tree/apex/word_language_model).
2. Parallel utilities, documented [here](https://nvidia.github.io/apex/parallel), with an example/walkthrough [here](https://github.com/csarofeen/examples/tree/apex/distributed).
   - apex/parallel/distributed.py contains a simplified implementation of PyTorch's DistributedDataParallel that is optimized for use with NCCL in single-GPU-per-process mode.
   - apex/parallel/multiproc.py is a simple multi-process launcher that can be used on a single node/computer with multiple GPUs.
3. A reparameterization function that allows you to recursively apply reparameterization to an entire module (including child modules). This API is currently under construction.
4. An experimental, still-in-development flexible RNN API.
fp16_optimizer.py contains `FP16_Optimizer`, a Python class designed to wrap an existing PyTorch optimizer and automatically enable master parameters and loss scaling in a manner transparent to the user. To use `FP16_Optimizer`, only two lines of one's Python model need to change, as shown in the sketch below.
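A minimal sketch of the two-line change (the model, data, and hyperparameters here are illustrative placeholders):
```
import torch
from apex.fp16_utils import FP16_Optimizer

model = torch.nn.Linear(1024, 1024).cuda().half()  # an FP16 model (illustrative)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Change 1 of 2: wrap the existing optimizer.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

data = torch.randn(32, 1024, device="cuda", dtype=torch.half)

optimizer.zero_grad()
loss = model(data).float().sum()

# Change 2 of 2: call optimizer.backward(loss) instead of loss.backward().
optimizer.backward(loss)
optimizer.step()
```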
### [FP16_Optimizer API documentation](https://nvidia.github.io/apex/fp16_utils.html#automatic-management-of-master-params-loss-scaling)
[Simple examples with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple)
[Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
fp16_util.py contains a number of utilities to manually manage master parameters and loss scaling, if the user chooses.
In addition to `FP16_Optimizer` examples, the Imagenet and word_language_model directories contain examples that demonstrate manual management of master parameters and static loss scaling.
These examples illustrate what sort of operations `FP16_Optimizer` is performing automatically.
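For reference, the manual pattern roughly follows the sketch below. It uses helper names from `apex.fp16_utils` as described in the fp16_utils documentation; the model, loss scale value, and data are placeholders, and the exact helper signatures should be checked against the docs.
```
import torch
from apex.fp16_utils import (prep_param_lists,
                             model_grads_to_master_grads,
                             master_params_to_model_params)

model = torch.nn.Linear(1024, 1024).cuda().half()
model_params, master_params = prep_param_lists(model)    # FP16 params + FP32 master copies
optimizer = torch.optim.SGD(master_params, lr=1e-3)      # the optimizer updates the FP32 masters

loss_scale = 128.0                                        # static loss scale (illustrative value)
data = torch.randn(32, 1024, device="cuda", dtype=torch.half)

model.zero_grad()
loss = model(data).float().sum()
(loss * loss_scale).backward()                            # scaled backward pass through the FP16 model

model_grads_to_master_grads(model_params, master_params)  # copy FP16 grads onto the FP32 masters
for param in master_params:
    param.grad.data.mul_(1.0 / loss_scale)                # unscale before the update
optimizer.step()
master_params_to_model_params(model_params, master_params)  # copy updated FP32 weights back to FP16
```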
distributed.py contains the source code for `apex.parallel.DistributedDataParallel`, a module wrapper that enables multi-process multi-GPU data parallel training, optimized for NVIDIA's NCCL communication library.
`apex.parallel.DistributedDataParallel` achieves high performance by overlapping communication with
computation in the backward pass and bucketing smaller transfers to reduce the total number of
transfers required.
multiproc.py contains the source code for `apex.parallel.multiproc`, a launch utility that places one process on each of the node's available GPUs.
# Basic Multiprocess Example based on pytorch/examples/mnist
This example requires Apex, which can be installed from https://www.github.com/nvidia/apex. It demonstrates how to modify a network to use a simple but effective distributed data-parallel module. This parallel method is designed to make multi-GPU runs on a single node easy. It was created because the parallel methods currently integrated into PyTorch can induce significant overhead due to the Python GIL; this method reduces the influence of those overheads and can provide a performance benefit, especially for networks with a significant number of fast-running operations.
and start a single-process run to allow the dataset to be downloaded (this will not work properly multi-GPU; you can stop the job as soon as it starts iterating):
```python main.py```
You can now launch multi-process, multi-GPU runs with the `apex.parallel.multiproc` launcher,
adding any normal option you'd like. Each process will run on one of your system's available GPUs.
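A typical invocation might look like the following (a sketch, assuming this example's `main.py` as the training script):
```python -m apex.parallel.multiproc main.py```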
## Converting your own model
To understand how to convert your own model to use the distributed module included, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
## Requirements
PyTorch master branch built from source. This requirement exists in order to use NCCL as a distributed backend.
Apex installed from https://www.github.com/nvidia/apex