The [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet) and [word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model) directories also contain `main.py` files that demonstrate manual management of master parameters and static loss scaling. These examples illustrate the operations `FP16_Optimizer` performs automatically.
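As a point of reference, the manual pattern looks roughly like the sketch below (a toy model and an arbitrary static scale of 128, not the examples' actual code): keep fp32 master copies of the fp16 parameters, scale the loss, copy and unscale the gradients, step in fp32, then copy the updated weights back into the fp16 model.

```python
import torch
import torch.nn.functional as F

# Toy model and data; a static loss scale of 128 is an arbitrary choice here.
model = torch.nn.Linear(1024, 10).cuda().half()
data = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
target = torch.randint(0, 10, (32,), device="cuda")
loss_scale = 128.0

# Keep an fp32 "master" copy of every fp16 parameter and optimize the copies.
master_params = [p.detach().clone().float() for p in model.parameters()]
for p in master_params:
    p.requires_grad = True
optimizer = torch.optim.SGD(master_params, lr=0.1)

# Forward/backward with the loss scaled so small fp16 gradients don't flush to zero.
loss = F.cross_entropy(model(data).float(), target)
(loss * loss_scale).backward()

# Copy fp16 gradients into the fp32 master gradients and unscale them.
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / loss_scale

# Update in fp32, then copy the new weights back into the fp16 model.
optimizer.step()
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
model.zero_grad()
```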
When used with ``multiproc.py``, :class:`DistributedDataParallel`
assigns 1 process to each of the available (visible) GPUs on the node.
Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
:class:`DistributedDataParallel` assumes that your script accepts the command line
arguments "rank" and "world-size." It also assumes that your script calls
``torch.cuda.set_device(args.rank)`` before creating the model.
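A minimal sketch of that script-side contract is shown below. The argument names follow the text above; the explicit ``init_process_group`` call, its backend, and its ``init_method`` are illustrative assumptions rather than part of this docstring.

```python
import argparse

import torch
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, default=0)
parser.add_argument("--world-size", type=int, default=1)
args = parser.parse_args()

# Pin this process to its GPU before any model is created.
torch.cuda.set_device(args.rank)

# Process-group setup is assumed here for completeness; the backend and
# init_method are illustrative, not prescribed by this docstring.
torch.distributed.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:23456",
    world_size=args.world_size,
    rank=args.rank,
)

model = torch.nn.Linear(10, 10).cuda()
model = DistributedDataParallel(model)
```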
...
...
Args:
module: Network definition to be run in multi-gpu/distributed mode.
message_size (Default = 1e7): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters, this must be True. It will disable bucketing of parameters to avoid race conditions.
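For illustration, a wrapped toy module using the arguments documented above might look like the following; the values are placeholders, and the process group is assumed to be initialized as sketched earlier.

```python
import torch
from apex.parallel import DistributedDataParallel

# Assumes torch.cuda.set_device(...) and process-group setup have already
# run in this process, as in the earlier sketch.
net = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).cuda()

# message_size tunes the bucketing threshold; shared_param must be True
# (disabling bucketing) only if the model reuses the same Parameter objects.
net = DistributedDataParallel(net, message_size=int(1e7), shared_param=False)
```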
This site contains the API documentation for Apex (https://github.com/nvidia/apex),
a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. Some of the code here will be included in upstream PyTorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3.

The launch script can be invoked as
```bash
python -m apex.parallel.multiproc main.py args...
```
adding any `args...` you like. `apex.parallel.multiproc` will
spawn one process for each of your system's available (visible) GPUs.
Each process will run `python main.py args... --world-size <worldsize> --rank <rank>`
(the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`).
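Conceptually, the launcher behaves like the sketch below. This only illustrates the behavior described above (one worker per visible GPU, with `--world-size` and `--rank` appended); it is not Apex's actual launcher code.

```python
import subprocess
import sys

import torch

def launch(script, script_args):
    # One worker per visible GPU; device_count() honors CUDA_VISIBLE_DEVICES.
    world_size = torch.cuda.device_count()
    workers = [
        subprocess.Popen([sys.executable, script, *script_args,
                          "--world-size", str(world_size), "--rank", str(rank)])
        for rank in range(world_size)
    ]
    for worker in workers:
        worker.wait()

if __name__ == "__main__":
    launch(sys.argv[1], sys.argv[2:])
```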
...
...
To understand how to convert your own model, please see the sections of `main.py` between the ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
[Example with Imagenet and mixed precision training](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
## Requirements
PyTorch master branch built from source. This is required in order to use NCCL as the distributed backend.
This example is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
It implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset.
`main.py` and `main_fp16_optimizer.py` have been modified to use the `DistributedDataParallel` module in Apex instead of the one in upstream PyTorch. For a description of how this works, please see the distributed example included in this repo.
`main.py` with the `--fp16` argument demonstrates mixed precision training with manual management of master parameters and loss scaling.
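`main_fp16_optimizer.py`, by contrast, hands that bookkeeping to the `FP16_Optimizer` wrapper. A minimal sketch of that usage pattern is below; the `static_loss_scale` keyword and the `optimizer.backward(loss)` call reflect the wrapper's interface as I understand it, so treat this as a sketch rather than canonical example code.

```python
import torch
import torch.nn.functional as F
from apex.fp16_utils import FP16_Optimizer

# Toy fp16 model; FP16_Optimizer keeps the fp32 master weights and applies
# the loss scale internally (interface as I understand it; see its docs).
model = torch.nn.Linear(1024, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)

data = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
target = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = F.cross_entropy(model(data).float(), target)
optimizer.backward(loss)   # replaces the usual loss.backward()
optimizer.step()
```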
...
...
## Requirements
- Apex, which can be installed from https://www.github.com/nvidia/apex
- Install PyTorch from source, master branch of [pytorch on github](https://www.github.com/pytorch/pytorch).
- `pip install -r requirements.txt`
- Download the ImageNet dataset and move validation images to labeled subfolders
  - To do this, you can use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh