Commit 82d7a3bf authored by Michael Carilli

Updating READMEs and examples

parent 2c175a5d
@@ -37,7 +37,8 @@ The intention of `FP16_Optimizer` is to be the "highway" for FP16 training: achi
[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
The Imagenet and word_language_model directories also contain examples that show manual management of master parameters and static loss scaling.
These examples illustrate what sort of operations `amp` and `FP16_Optimizer` are performing automatically.
These manual examples illustrate what sort of operations `amp` and `FP16_Optimizer` are performing automatically.
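For orientation, here is a rough sketch of what manual master-parameter management with static loss scaling involves. This is illustrative only and not the exact code from those example directories:

```python
import torch

# Illustrative sketch, not the exact code from the Imagenet/word_language_model examples.
static_loss_scale = 128.0

model = torch.nn.Linear(1024, 1024).cuda().half()                  # FP16 model params
master_params = [p.detach().clone().float() for p in model.parameters()]
for p in master_params:
    p.requires_grad = True                                          # FP32 master copies
optimizer = torch.optim.SGD(master_params, lr=0.1)                  # optimizer updates FP32 masters

data = torch.randn(32, 1024, device='cuda', dtype=torch.half)
loss = model(data).float().sum()

model.zero_grad()
(loss * static_loss_scale).backward()                               # scale loss so FP16 grads don't underflow

# Copy FP16 grads to FP32 master grads, undoing the loss scale.
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / static_loss_scale

optimizer.step()

# Copy updated FP32 masters back into the FP16 model.
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
```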
## 2. Distributed Training
@@ -49,7 +50,7 @@ optimized for NVIDIA's NCCL communication library.
[API Documentation](https://nvidia.github.io/apex/parallel.html)
[Python Source](https://nvidia.github.io/apex/parallel)
[Python Source](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/distributed)
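A minimal usage sketch (assumes one process per GPU and that the usual rank/world-size environment variables have been set, e.g. by a launcher as in the walkthrough):

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are in the environment.
dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)   # gradients are allreduced across processes during backward()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```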
@@ -43,7 +43,7 @@ class DistributedDataParallel(Module):
to the other processes on initialization of DistributedDataParallel, and will be
allreduced in buckets during the backward pass.
See https://github.com/csarofeen/examples/tree/apex/distributed for detailed usage.
See https://github.com/NVIDIA/apex/tree/master/examples/distributed for detailed usage.
Args:
module: Network definition to be run in multi-gpu/distributed mode.
@@ -3,7 +3,7 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://gitlab-master.nvidia.com/csarofeen/apex
:github_url: https://github.com/nvidia/apex
APEx (A PyTorch Extension)
===================================
# Simple examples of FP16_Optimizer functionality
#### Minimal Working Sample
`minimal.py` shows the basic usage of `FP16_Optimizer` with either static or dynamic loss scaling. Test via
```bash
python minimal.py
```
`minimal.py` shows the basic usage of `FP16_Optimizer` with either static or dynamic loss scaling. Test via `python minimal.py`.
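A condensed sketch of that usage (see `minimal.py` for the full, runnable version):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

model = torch.nn.Linear(1024, 1024).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Wrap with either static or dynamic loss scaling:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)
# optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

data = torch.randn(32, 1024, device='cuda', dtype=torch.half)
loss = model(data).float().sum()

optimizer.zero_grad()
optimizer.backward(loss)   # replaces loss.backward()
optimizer.step()
```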
#### Closures
`FP16_Optimizer` supports closures with the same control flow as ordinary Pytorch optimizers.
`closure.py` shows an example. Test via
```bash
python closure.py
```
`closure.py` shows an example. Test via `python closure.py`.
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.step) for more details.
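A condensed sketch (`closure.py` has the complete version):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

model = torch.nn.Linear(1024, 1024).cuda().half()
optimizer = FP16_Optimizer(torch.optim.SGD(model.parameters(), lr=0.1),
                           static_loss_scale=128.0)
data = torch.randn(32, 1024, device='cuda', dtype=torch.half)

def closure():
    optimizer.zero_grad()
    loss = model(data).float().sum()
    optimizer.backward(loss)   # call optimizer.backward(loss) inside the closure as well
    return loss

loss = optimizer.step(closure)   # same control flow as an ordinary Pytorch optimizer
```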
#### Checkpointing
`FP16_Optimizer` also supports checkpointing with the same control flow as ordinary Pytorch optimizers.
`save_load.py` shows an example. Test via
```bash
python save_load.py
```
`save_load.py` shows an example. Test via `python save_load.py`.
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.load_state_dict) for more details.
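A condensed sketch (`save_load.py` has the complete version):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

model = torch.nn.Linear(1024, 1024).cuda().half()
optimizer = FP16_Optimizer(torch.optim.SGD(model.parameters(), lr=0.1),
                           dynamic_loss_scale=True)

# ... train for a while ...

# Save: state_dict() works as for an ordinary optimizer.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pt')

# Load into freshly constructed objects.
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```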
#### Distributed
**distributed_pytorch** shows an example using `FP16_Optimizer` with Pytorch DistributedDataParallel.
The usage of `FP16_Optimizer` with distributed does not need to change from ordinary single-process
usage. Run via
usage. Test via
```bash
cd distributed_pytorch
bash run.sh
@@ -33,7 +26,7 @@ bash run.sh
**distributed_apex** shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
Again, the usage of `FP16_Optimizer` with distributed does not need to change from ordinary
single-process usage. Run via
single-process usage. Test via
```bash
cd distributed_apex
bash run.sh
@@ -4,7 +4,7 @@ This example demonstrates how to modify a network to use a simple but effective
[API Documentation](https://nvidia.github.io/apex/parallel.html)
[Source Code](https://github.com/csarofeen/examples/tree/apex/distributed)
[Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
## Getting started
Prior to running please run
@@ -22,4 +22,3 @@ To understand how to convert your own model to use the distributed module includ
## Requirements
Pytorch master branch built from source. This is required in order to use NCCL as a distributed backend.
Apex installed from https://www.github.com/nvidia/apex
@@ -23,11 +23,7 @@ adding any normal arguments.
## Training
To train a model, run `main.py` with the desired model architecture and the path to the ImageNet dataset:
```bash
python main.py -a resnet18 [imagenet-folder with train and val folders]
```
To train a model, run `main.py` with the desired model architecture and the path to the ImageNet dataset.
The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs. This is appropriate for ResNet and models with batch normalization, but too high for AlexNet and VGG. Use 0.01 as the initial learning rate for AlexNet or VGG:
@@ -36,7 +32,29 @@ python main.py -a alexnet --lr 0.01 /path/to/imagenet/folder
```
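The decay described above is a plain step schedule. As a sketch (the helper name and exact implementation in `main.py` may differ):

```python
def adjust_learning_rate(optimizer, epoch, base_lr=0.1):
    """Step decay: start at base_lr and divide by 10 every 30 epochs."""
    lr = base_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
```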
The directory at /path/to/imagenet/directory should contain two subdirectories called "train"
and "val" that contain the training and validation data respectively.
and "val" that contain the training and validation data respectively. Train images are expected to be 256x256 jpegs.
Example commands (note: a batch size of `--b 256` assumes your GPUs have >=16GB of onboard memory):
```bash
### Softlink training dataset into current directory
$ ln -sf /data/imagenet/train-jpeg-256x256/ train
### Softlink validation dataset into current directory
$ ln -sf /data/imagenet/val-jpeg/ val
### Single-process training
$ python main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training (uses all visible GPUs on the node)
$ python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training on GPUs 0 and 1 only
$ export CUDA_VISIBLE_DEVICES=0,1
$ python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training with FP16_Optimizer, default loss scale 1.0 (still uses FP32 master params)
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --workers 4 ./
### Multi-process training with FP16_Optimizer, static loss scale
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --static-loss-scale 128.0 --workers 4 ./
### Multi-process training with FP16_Optimizer, dynamic loss scaling
$ python -m apex.parallel.multiproc main_fp16_optimizer.py -a resnet50 --fp16 --b 256 --dynamic-loss-scale --workers 4 ./
```
## Usage for `main.py` and `main_fp16_optimizer.py`
@@ -120,7 +120,8 @@ def main():
if args.fp16:
model = network_to_half(model)
if args.distributed:
model = DDP(model)
# shared_param turns off bucketing in DDP; for lower-latency runs this can improve perf
model = DDP(model, shared_param=True)
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda()
@@ -293,6 +294,7 @@ def train(train_loader, model, criterion, optimizer, epoch):
loss.backward()
optimizer.step()
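# wait for the asynchronous CUDA work launched above so the timing below is accurate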
torch.cuda.synchronize()
# measure elapsed time
batch_time.update(time.time() - end)