# Zero Redundancy Optimizer and ZeRO-Offload

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) instead of replicating them. By doing so, memory efficiency is boosted drastically compared to classic data parallelism, while the computational granularity and communication efficiency are retained.

1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the 
first and second momentum estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process 
only stores the gradients corresponding to its partition of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and 
partition them during the forward and backward passes.
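
To make these savings concrete, here is a rough back-of-the-envelope estimate of per-GPU model-state memory at each level, following the formulas in the [ZeRO paper](https://arxiv.org/abs/1910.02054). The function name and figures are illustrative only, assuming mixed-precision Adam, where optimizer states cost about 12 bytes per parameter (fp32 weights plus the two momentum estimates).

```python
def zero_model_state_gb(num_params: float, num_gpus: int, level: int) -> float:
    """Approximate per-GPU model-state memory (GB) under each ZeRO level."""
    P, N, K = num_params, num_gpus, 12  # K: bytes of optimizer state per param
    if level == 0:    # classic data parallelism: everything replicated
        total = (2 + 2 + K) * P
    elif level == 1:  # optimizer states sharded across N processes
        total = (2 + 2) * P + K * P / N
    elif level == 2:  # gradients sharded as well
        total = 2 * P + (2 + K) * P / N
    else:             # level 3: parameters sharded too
        total = (2 + 2 + K) * P / N
    return total / 1e9

# Example: a 7.5B-parameter model on 64 GPUs.
for lvl in range(4):
    print(f"level {lvl}: {zero_model_state_gb(7.5e9, 64, lvl):.1f} GB per GPU")
# level 0: 120.0 GB, level 1: 31.4 GB, level 2: 16.6 GB, level 3: 1.9 GB
```
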
## Getting Started with ZeRO

If you are training models with Colossal-AI, enabling ZeRO data parallelism and offloading is as easy as adding several lines to your configuration file. We support configuration for levels 2 and 3; for level 1, use the [PyTorch native implementation](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html) of the optimizer.
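
That native level-1 wrapper, `ZeroRedundancyOptimizer`, shards the optimizer states across ranks. A minimal sketch, assuming the default process group has already been initialized (e.g., via `torch.distributed.init_process_group`) and using a placeholder model:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

model = torch.nn.Linear(2048, 2048).cuda()  # placeholder model

# Optimizer states are sharded across ranks; each rank updates only its
# shard, then the updated parameters are synchronized across processes.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3,
)
```
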
Below are a few examples of ZeRO-3 configurations.

### Example of ZeRO-3 Configurations

Here we use `Adam` as the initial optimizer.

1. Use ZeRO to partition the optimizer states, gradients (level 2), and parameters (level 3).
    ```python
    zero = dict(
        level=3,
        dynamic_loss_scale=True,
        clip_grad=1.0
    )
    ```

2. Additionally offload the optimizer states and computations to the CPU.
    ```python
    zero = dict(
        level=3,
        offload_optimizer_config=dict(
            device='cpu',
            pin_memory=True,
            fast_init=True
        ),
        ...
    )
    ```
3. Save even more memory by offloading parameters to CPU memory.
    ```python
    zero = dict(
        level=3,
        offload_optimizer_config=dict(
            device='cpu',
            pin_memory=True,
            fast_init=True
        ),
        offload_param_config=dict(
            device='cpu',
            pin_memory=True,
            max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
        ),
        ...
    )
    ```
4. Save even MORE memory by offloading to NVMe (if available on your system):
    ```python
    zero = dict(
        level=3,
        offload_optimizer_config=dict(
            device='nvme',
            pin_memory=True,
            fast_init=True,
            nvme_path='/nvme_data'
        ),
        offload_param_config=dict(
            device='nvme',
            pin_memory=True,
            max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
            nvme_path='/nvme_data'
        ),
        ...
    )
    ```
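
Each offloading step trades throughput for capacity: CPU offloading pays host-device transfer costs, and NVMe offloading adds disk bandwidth constraints on top of that, so prefer the least aggressive configuration that still fits your model.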

Note that `fp16` is automatically enabled when using ZeRO. This relies on `AMP_TYPE.NAIVE` in the Colossal-AI AMP module.
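
In other words, ZeRO applies the equivalent of the naive AMP mode for you, so there is no need to configure it yourself. A sketch of what the explicit setting would look like in the configuration file, assuming the standard `colossalai.amp` config syntax:

```python
from colossalai.amp import AMP_TYPE

# Not needed when ZeRO is enabled -- shown only to illustrate the AMP
# mode that ZeRO relies on.
fp16 = dict(
    mode=AMP_TYPE.NAIVE
)
```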

### Training

Note that if your model is too large to fit in memory when using ZeRO-3, you should use `colossalai.zero.zero3_model_context` to construct your model:

```python
from colossalai.zero import zero3_model_context

with zero3_model_context():
    model = Model()
```

Once you have completed your configuration, just use `colossalai.initialize()` to initialize your training.
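
A minimal sketch of that flow, assuming the engine-based training API from this Colossal-AI release; `model`, `optimizer`, `criterion`, and `train_dataloader` are placeholders you define yourself:

```python
import colossalai

# Launch distributed training and read the config file containing the
# `zero` dictionary shown above.
colossalai.launch_from_torch(config='./config.py')

engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader
)

engine.train()
for img, label in train_dataloader:
    img, label = img.cuda(), label.cuda()
    engine.zero_grad()
    output = engine(img)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```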