# Mixed precision training

In Colossal-AI, we have incorporated three different implementations of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. tensor-parallel amp

The first two rely on the original implementations in [PyTorch](https://pytorch.org/docs/stable/amp.html)
(version 1.6 and above) and [NVIDIA Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible
with tensor parallelism. This is because tensors are split across devices in tensor parallelism, so processes must
communicate with each other to check whether `inf` or `nan` values occur anywhere in the model. For mixed precision
training with tensor parallelism, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
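
The key point is that each rank only sees its own shard of the tensors, so the per-rank overflow flag has to be
combined across processes. The snippet below is a minimal, illustrative sketch of such a check using plain
`torch.distributed`; it assumes an initialized process group and is not the actual Colossal-AI or Megatron-LM code.

```python
import torch
import torch.distributed as dist


def found_overflow(local_grad_shards):
    """Return True if any rank holds an inf/nan value in its gradient shards.

    Illustrative only: each tensor-parallel rank checks the shards it owns,
    then the flags are combined with an all-reduce so that every rank agrees
    on whether the current step overflowed. Assumes dist.init_process_group()
    has already been called.
    """
    flag = torch.zeros(1, device='cuda')
    for shard in local_grad_shards:
        if not torch.isfinite(shard).all():
            flag.fill_(1.0)
            break
    # MAX reduction: the flag becomes 1 everywhere if any rank overflowed
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```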

To use mixed precision training, simply specify the `fp16` field in the config file, as shown in the examples below.
Currently, PyTorch and Apex AMP are not guaranteed to work with tensor and pipeline parallelism, so only the last mode
is recommended if you are using hybrid parallelism.
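
For context, the sketch below shows how such a config file is typically consumed by a training script. It is a minimal
sketch only: it assumes the `colossalai.launch_from_torch` and `colossalai.initialize` entry points and a `./config.py`
containing one of the `fp16` blocks from the following sections; the exact API may differ in the release you are using.

```python
# train.py: minimal sketch; `./config.py` is assumed to contain one of the
# `fp16 = dict(...)` blocks shown in the sections below, and the entry points
# used here (launch_from_torch / initialize) may differ across releases.
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

colossalai.launch_from_torch(config='./config.py')  # reads the fp16 settings

model = nn.Linear(1024, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(64, 1024), torch.randint(0, 10, (64,)))
train_dataloader = DataLoader(dataset, batch_size=8)

# the returned engine applies whichever AMP mode the config requested
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader)

engine.train()
for data, label in train_dataloader:
    data, label = data.cuda(), label.cuda()
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```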

## PyTorch AMP

PyTorch has provided native mixed precision training since version 1.6. It offers an easy way to cast data to `fp16`
while keeping some operations, such as reductions, in `fp32`. You can configure the gradient scaler in the config file.

```python
from colossalai.engine import AMP_TYPE

fp16=dict(
    mode=AMP_TYPE.TORCH,
    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```
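
For reference, the scaler fields in the config correspond to the arguments of `torch.cuda.amp.GradScaler`. A minimal
plain-PyTorch sketch using the same default values (illustrative only, not Colossal-AI internals):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# the scaler fields of the config map onto torch.cuda.amp.GradScaler arguments
scaler = torch.cuda.amp.GradScaler(
    init_scale=2.**16, growth_factor=2.0, backoff_factor=0.5,
    growth_interval=2000, enabled=True)

data = torch.randn(8, 1024).cuda()
label = torch.randint(0, 10, (8,)).cuda()

optimizer.zero_grad()
with torch.cuda.amp.autocast():      # forward pass runs in fp16 where it is safe
    loss = criterion(model(data), label)
scaler.scale(loss).backward()        # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)               # unscales grads and skips the step on inf/nan
scaler.update()                      # grows or backs off the scale for the next step
```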

## Apex AMP

In this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation of mixed precision training. We
support this mode because it allows finer control over the granularity of mixed precision. For example, the `O2` level
(optimization level 2) keeps batch normalization in `fp32`.

The following code block shows a config file for Apex AMP.

```python
from colossalai.engine import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True, 
    opt_level='O1', 
    cast_model_type=None, 
    patch_torch_functions=None, 
    keep_batchnorm_fp32=None, 
    master_weights=None, 
    loss_scale=None, 
    cast_model_outputs=None,
    num_losses=1, 
    verbosity=1, 
    min_loss_scale=None, 
    max_loss_scale=16777216.0
)
```
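
These fields are forwarded to `apex.amp.initialize`. As a rough sketch of the equivalent direct Apex usage (assuming
Apex is installed; illustrative only, not Colossal-AI internals):

```python
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# opt_level='O1' patches torch functions to cast on the fly;
# 'O2' would additionally keep batch norm in fp32 and use fp32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level='O1',
                                  loss_scale=None, verbosity=1)

data = torch.randn(8, 1024).cuda()
label = torch.randint(0, 10, (8,)).cuda()

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(data), label)
with amp.scale_loss(loss, optimizer) as scaled_loss:  # Apex handles loss scaling
    scaled_loss.backward()
optimizer.step()
```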

## Tensor Parallel AMP

We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor 
and pipeline parallelism.

The following code block shows a config file for this mode.

```python
from colossalai.engine import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.PARALLEL,
    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```
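
The fields above control a Megatron-LM-style dynamic loss scaler: the scale is reduced after repeated overflows and
grown again after a run of clean steps. A simplified, framework-free sketch of that behaviour (illustrative only, not
the actual Colossal-AI implementation):

```python
class DynamicLossScaler:
    """Simplified dynamic loss scaler with hysteresis (illustrative only)."""

    def __init__(self, initial_scale=2 ** 32, min_scale=1, growth_factor=2,
                 backoff_factor=0.5, growth_interval=1000, hysteresis=2):
        self.scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._hysteresis_left = hysteresis
        self._clean_steps = 0

    def update(self, found_overflow: bool):
        """Call once per step with the overflow flag (all-reduced across ranks)."""
        if found_overflow:
            self._clean_steps = 0
            self._hysteresis_left -= 1
            # only back off once `hysteresis` overflowing steps have accumulated
            if self._hysteresis_left <= 0:
                self.scale = max(self.scale * self.backoff_factor, self.min_scale)
                self._hysteresis_left = self.hysteresis
        else:
            self._clean_steps += 1
            # grow the scale again after `growth_interval` clean steps
            if self._clean_steps % self.growth_interval == 0:
                self.scale *= self.growth_factor
                self._hysteresis_left = self.hysteresis
```

Because the overflow flag is combined across tensor-parallel ranks before the scaler is updated, every rank takes the
same scaling decision, which is what makes this mode compatible with tensor and pipeline parallelism.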