"vscode:/vscode.git/clone" did not exist on "a32cdad62208c2fb44f7572006461c9a27d30984"
zero.md 3.02 KB
Newer Older
1
# Zero Redundancy Optimizer and ZeRO Offload

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancy of data parallelism by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. This drastically boosts memory efficiency compared to classic data parallelism while retaining computational granularity and communication efficiency.

ZeRO offers three cumulative optimization levels:

1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the 
first and second momentum estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process 
only stores the gradients corresponding to its partition of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and 
partition them during the forward and backward passes.
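
To get a feel for the savings, the sketch below estimates per-GPU memory for the model states at each level, following the rough analysis in the [ZeRO paper](https://arxiv.org/abs/1910.02054) for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The helper function is purely illustrative and not part of Colossal-AI.

```python
def zero_memory_per_gpu(num_params: float, num_gpus: int, level: int) -> float:
    """Rough per-GPU memory (GB) for model states under mixed-precision Adam."""
    params = 2 * num_params       # fp16 parameters
    grads = 2 * num_params        # fp16 gradients
    opt_states = 12 * num_params  # fp32 master weights + Adam momentum and variance
    if level >= 1:
        opt_states /= num_gpus    # ZeRO-1: partition optimizer states
    if level >= 2:
        grads /= num_gpus         # ZeRO-2: also partition gradients
    if level >= 3:
        params /= num_gpus        # ZeRO-3: also partition parameters
    return (params + grads + opt_states) / 1e9

# Example from the ZeRO paper: a 7.5B-parameter model on 64 GPUs
for lvl in range(4):
    print(f"ZeRO-{lvl}: {zero_memory_per_gpu(7.5e9, 64, lvl):.1f} GB per GPU")
```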

## Getting Started with ZeRO

If you are training models with Colossal-AI, enabling ZeRO-3 offload takes only a few lines in your Colossal-AI configuration file.
Below are a few examples of ZeRO-3 configurations.

### Examples of ZeRO-3 Configurations

Here we use `Adam` as the base optimizer.

1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2), and parameters (level 3).
    ```python
    optimizer = dict(
        type='Adam',        # base optimizer that ZeRO wraps
        lr=0.001,
        weight_decay=0
    )

    zero = dict(
        type='ZeroRedundancyOptimizer_Level_3',  # partition optimizer states, gradients, and parameters
        dynamic_loss_scale=True,                 # dynamic loss scaling for fp16 training
        clip_grad=1.0                            # clip gradients to this norm
    )
    ```
2. Additionally offload the optimizer states and computations to the CPU.
    ```python
    zero = dict(
        offload_optimizer_config=dict(
            device='cpu',      # keep optimizer states on, and run updates from, the CPU
            pin_memory=True,   # use pinned host memory for faster CPU-GPU transfers
            fast_init=True
        ),
        ...
    )
    ```
3. Save even more memory by offloading parameters to CPU memory.
    ```python
    zero = dict(
        offload_optimizer_config=dict(
            device='cpu',
            pin_memory=True,
            fast_init=True
        ),
        offload_param_config=dict(
            device='cpu',
            pin_memory=True,
            max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU  # cap on parameter elements kept in CPU memory
        ),
        ...
    )
    ```
4. Save even MORE memory by offloading to NVMe (if available on your system):
    ```python
    zero = dict(
        offload_optimizer_config=dict(
            device='nvme',
            pin_memory=True,
            fast_init=True,
            nvme_path='/nvme_data'   # mount point of the local NVMe drive
        ),
        offload_param_config=dict(
            device='nvme',
            pin_memory=True,
            max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
            nvme_path='/nvme_data'
        ),
        ...
    )
    ```
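
Putting the pieces together, a complete configuration file could look like the sketch below. The `...` placeholders in the examples above stand for the remaining ZeRO fields (such as `type`), and `OFFLOAD_PARAM_MAX_IN_CPU` is a placeholder you must replace with a value suited to your setup.

```python
# config.py -- a hypothetical complete configuration combining the fragments above
optimizer = dict(
    type='Adam',
    lr=0.001,
    weight_decay=0
)

zero = dict(
    type='ZeroRedundancyOptimizer_Level_3',
    dynamic_loss_scale=True,
    clip_grad=1.0,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    offload_param_config=dict(
        device='cpu',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU  # placeholder: replace with a concrete value
    )
)
```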

Note that `fp16` is automatically enabled when using ZeRO. 

### Training

Once you have completed your configuration, simply call `colossalai.initialize()` to initialize your training; a minimal skeleton is sketched below.
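
The following sketch assumes the configuration file above as well as a `model`, `criterion`, and `train_dataloader` defined elsewhere; the exact signatures of `colossalai.launch_from_torch()` and `colossalai.initialize()` vary across Colossal-AI versions, so treat this as an outline rather than copy-paste code.

```python
import colossalai

# Set up the distributed environment and load the configuration file
colossalai.launch_from_torch(config='./config.py')

# Wrap everything into an engine; ZeRO is applied according to
# the `zero` section of the configuration
engine, train_dataloader, _, _ = colossalai.initialize(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataloader=train_dataloader
)

engine.train()
for inputs, targets in train_dataloader:
    engine.zero_grad()
    outputs = engine(inputs)
    loss = engine.criterion(outputs, targets)
    engine.backward(loss)
    engine.step()
```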