# ImageNet training in PyTorch

This example is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
It implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset.

`main.py` with the `--fp16` argument demonstrates mixed precision training with manual management of master parameters and loss scaling.
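
In rough outline, "manual management" means keeping an FP32 master copy of the FP16 weights and scaling the loss yourself. Below is a minimal, illustrative sketch of that pattern, not the actual code in `main.py`; the model, sizes, and hyperparameters are made up.

```python
# Illustrative sketch of manual master parameters and loss scaling (not main.py itself).
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1000).cuda().half()              # FP16 model
master_params = [p.detach().clone().float() for p in model.parameters()]
for master in master_params:
    master.requires_grad = True

optimizer = torch.optim.SGD(master_params, lr=0.1)             # optimizer holds FP32 masters
loss_scale = 128.0

data = torch.randn(32, 1024, device="cuda").half()
target = torch.randint(0, 1000, (32,), device="cuda")

output = model(data)
loss = F.cross_entropy(output.float(), target)

(loss * loss_scale).backward()                                 # backward on the scaled loss

# Copy the FP16 gradients into the FP32 masters and undo the scaling.
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / loss_scale

optimizer.step()

# Copy the updated FP32 masters back into the FP16 model.
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
model.zero_grad()
```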

`main_fp16_optimizer.py` with `--fp16` demonstrates use of `apex.fp16_utils.FP16_Optimizer` to automatically manage master parameters and loss scaling.
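
The wrapper replaces the manual bookkeeping above. A minimal sketch of the intended `FP16_Optimizer` workflow (the model and optimizer here are illustrative; see the Apex docs for the full set of constructor options):

```python
# Illustrative sketch of the FP16_Optimizer workflow.
import torch
import torch.nn.functional as F
from apex.fp16_utils import FP16_Optimizer

model = torch.nn.Linear(1024, 1000).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The wrapper maintains FP32 master weights and applies loss scaling for you.
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)
# or: FP16_Optimizer(optimizer, dynamic_loss_scale=True)

data = torch.randn(32, 1024, device="cuda").half()
target = torch.randint(0, 1000, (32,), device="cuda")

output = model(data)
loss = F.cross_entropy(output.float(), target)

optimizer.zero_grad()
optimizer.backward(loss)     # replaces loss.backward(); scaling happens internally
optimizer.step()
```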

`main_amp.py` with `--fp16` demonstrates use of Amp to automatically perform all FP16-friendly operations in half precision under the hood.  Notice that with Amp:
* you don't need to explicitly convert your model, or the input data, to half().  Conversions occur on the fly internally within the Amp-patched torch functions.
* dynamic loss scaling is always used under the hood.
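
A rough sketch of the Amp workflow used in this example (the exact calls may differ between Apex versions; the model and data below are illustrative):

```python
# Rough sketch of the Amp workflow; exact calls may differ between Apex versions.
import torch
import torch.nn.functional as F
from apex import amp

amp_handle = amp.init(enabled=True)    # patches torch functions for mixed precision

model = torch.nn.Linear(1024, 1000).cuda()               # no .half() needed
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(32, 1024, device="cuda")              # FP32 input; casts happen internally
target = torch.randint(0, 1000, (32,), device="cuda")

output = model(data)
loss = F.cross_entropy(output, target)

optimizer.zero_grad()
with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()             # dynamic loss scaling handled under the hood
optimizer.step()
```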

`main_reducer.py` is identical to `main.py`, except that it shows the use of [apex.parallel.Reducer](https://nvidia.github.io/apex/parallel.html#apex.parallel.Reducer) instead of `DistributedDataParallel`.
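
Roughly, `Reducer` leaves the gradient allreduce to you instead of hooking it into `backward()`. An illustrative sketch of the pattern (the process-group setup and the model are placeholders):

```python
# Rough sketch of manual gradient allreduce with apex.parallel.Reducer.
# Process-group setup and the model are placeholders; run one process per GPU.
import torch
import torch.distributed as dist
import torch.nn.functional as F
from apex.parallel import Reducer

dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(1024, 1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
reducer = Reducer(model)        # unlike DistributedDataParallel, backward() triggers nothing

data = torch.randn(32, 1024, device="cuda")
target = torch.randint(0, 1000, (32,), device="cuda")

output = model(data)
loss = F.cross_entropy(output, target)

optimizer.zero_grad()
loss.backward()
reducer.reduce()                # explicitly allreduce gradients across processes
optimizer.step()
```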

## Requirements

- `pip install -r requirements.txt`
- Download the ImageNet dataset and move validation images to labeled subfolders
    - To do this, you can use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh

## Training

To train a model, run `main.py` with the desired model architecture and the path to the ImageNet dataset.

The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs. This is appropriate for ResNet and models with batch normalization, but too high for AlexNet and VGG. Use 0.01 as the initial learning rate for AlexNet or VGG:

```bash
python main.py -a alexnet --lr 0.01 /path/to/imagenet/folder
```
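
The decay schedule described above boils down to a helper like the following (a sketch of the usual recipe; the actual function in the scripts may differ in details):

```python
def adjust_learning_rate(optimizer, epoch, initial_lr=0.1):
    """Set the learning rate to the initial LR decayed by 10 every 30 epochs."""
    lr = initial_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr
```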

The directory at `/path/to/imagenet/folder` should contain two subdirectories called `train` and `val` that hold the training and validation data respectively.
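
Each of `train` and `val` is expected in the standard `torchvision` `ImageFolder` layout, i.e. one subdirectory per class containing that class's images. A minimal sketch of how such a directory is typically consumed (the transforms and loader settings here are illustrative):

```python
# Illustrative sketch: loading a train/ (or val/) directory in ImageFolder layout.
# Expects /path/to/imagenet/folder/train/<class_name>/<image>.JPEG, and likewise for val/.
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

train_dataset = datasets.ImageFolder(
    "/path/to/imagenet/folder/train",
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=224, shuffle=True, num_workers=4)
```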

## Distributed training

`main.py` and `main_fp16_optimizer.py` have been modified to use the `DistributedDataParallel` module in Apex instead of the one in upstream PyTorch. `apex.parallel.DistributedDataParallel`
is a drop-in replacement for `torch.nn.parallel.DistributedDataParallel` (see our [distributed example](https://github.com/NVIDIA/apex/tree/master/examples/distributed)).
The scripts can interact with 
[torch.distributed.launch](https://pytorch.org/docs/master/distributed.html#launch-utility)
to spawn multiprocess jobs using the following syntax:
```bash
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main.py args...
```
`NUM_GPUS` should be less than or equal to the number of visible GPU devices on the node.
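
Inside the scripts, the per-process setup driven by the launcher looks roughly like this (a sketch; the argument parsing and the model are placeholders, but `torch.distributed.launch` does pass `--local_rank` to each process):

```python
# Rough sketch of the per-process setup driven by torch.distributed.launch.
# Argument parsing and the model are placeholders.
import argparse
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # injected by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                     # one process per GPU
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(1024, 1000).cuda()
model = DDP(model)   # drop-in replacement for torch.nn.parallel.DistributedDataParallel
```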

Optionally, one can run ImageNet training with synchronized batch normalization by adding
`--sync_bn` to the `args...`
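
Under the hood, the flag amounts to converting the model's BatchNorm layers before the model is wrapped for distributed training; a rough, illustrative sketch:

```python
# Rough sketch of what --sync_bn amounts to: convert BatchNorm layers to the
# synchronized version before wrapping the model for distributed training.
import torchvision.models as models
from apex.parallel import convert_syncbn_model

model = models.resnet50()
model = convert_syncbn_model(model)   # swaps in synchronized batch norm layers
model = model.cuda()
```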

## Example commands

(Note: batch size `--b 224` assumes your GPUs have at least 16 GB of onboard memory.)

```bash
### Softlink training dataset into current directory
$ ln -sf /data/imagenet/train-jpeg/ train
### Softlink validation dataset into current directory
$ ln -sf /data/imagenet/val-jpeg/ val
### Single-process training
$ python main.py -a resnet50 --fp16 --b 224 --workers 4 --static-loss-scale 128.0 ./
### Single-process training with Amp.  Amp's casting causes it to use a bit more memory,
### hence the batch size 128.
$ python main_amp.py -a resnet50 --fp16 --b 128 --workers 4 ./
### Multi-process training (uses all visible GPUs on the node)
$ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main.py -a resnet50 --fp16 --b 224 --workers 4 --static-loss-scale 128.0 ./
### Multi-process training on GPUs 0 and 1 only
$ export CUDA_VISIBLE_DEVICES=0,1
$ python -m torch.distributed.launch --nproc_per_node=2 main.py -a resnet50 --fp16 --b 224 --workers 4 ./
### Multi-process training with FP16_Optimizer, static loss scale 128.0 (still uses FP32 master params)
$ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main_fp16_optimizer.py -a resnet50 --fp16 --b 224 --static-loss-scale 128.0 --workers 4 ./
### Multi-process training with FP16_Optimizer, dynamic loss scaling
$ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main_fp16_optimizer.py -a resnet50 --fp16 --b 224 --dynamic-loss-scale --workers 4 ./
```

## Usage for `main.py` and `main_fp16_optimizer.py`

In addition to the arguments accepted by `main.py`, `main_fp16_optimizer.py` also accepts the optional flag
```bash
  --dynamic-loss-scale  Use dynamic loss scaling. If supplied, this argument
                        supersedes --static-loss-scale.
```