We demonstrate the benefits of ZeRO stage 1 by showing that it enables data parallel training of a 1.5 billion parameter model on eight GPUs.
Training this model without ZeRO fails with an out-of-memory (OOM) error as shown below:
<a href="/assets/images/oom_dp8_1.5B_log.png">
<img src="/assets/images/oom_dp8_1.5B_log.png" alt="OOM_DP8_1.5B_model">
</a>
A key reason why this model does not fit in GPU memory is that the Adam optimizer states for the model consume 18GB, a significant portion of the 32GB of GPU memory. By using ZeRO stage 1 to partition the optimizer states among the eight data parallel ranks, the per-device memory consumption can be reduced to 2.25GB, thus making the model trainable. To enable ZeRO stage 1, we simply update the DeepSpeed JSON config file as below:
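A minimal sketch of the **zero_optimization** entry, using the stage and bucket-size values described in the next paragraph (500M written here as `5e8`):

```
{
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8
    }
}
```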
As seen above, we set two fields in the **zero_optimization** key. Specifically, we set the _stage_ field to 1, and the optional _reduce_bucket_size_ for gradient reduction to 500M. With ZeRO stage 1 enabled, the model can now train smoothly on 8 GPUs without running out of memory. Below we provide some screenshots of the model training:
<a href="/assets/images/zero1_dp8_1.5B_log.png">
<img src="/assets/images/zero1_dp8_1.5B_log.png" alt="ZERO1_DP8_1.5B_LOG">
</a>
<a href="/assets/images/zero1_dp8_1.5B_smi.png">
<img src="/assets/images/zero1_dp8_1.5B_smi.png" alt="ZERO1_DP8_1.5B_SMI">
</a>
From the nvidia-smi screenshot above, we can see that only GPUs 6-7 are being used for training the model. With ZeRO stage 1, we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to increase the model size and/or the batch size. In contrast, such benefits are not possible with data parallelism alone.
In the above changes, we have set the _stage_ field to 2, and configured other optimization options available in ZeRO stage 2.
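For reference, a **zero_optimization** block along the following lines enables stage 2; aside from `"stage": 2`, the knobs shown are illustrative choices rather than the exact values used in the changes above:

```
{
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8
    }
}
```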
Here is a screenshot of the training log:
<a href="/assets/images/zero2_dp32_10B_log.png">
<img src="/assets/images/zero2_dp32_10B_log.png" alt="ZERO2_DP32_10B_LOG">
</a>
Here is a screenshot of nvidia-smi showing GPU activity during training:
<a href="/assets/images/zero2_dp32_10B_smi.png">
<img src="/assets/images/zero2_dp32_10B_smi.png" alt="ZERO2_DP32_10B_SMI">
</a>
Congratulations! You have completed the ZeRO tutorial.