Unverified Commit 9ee66ada authored by Stas Bekman, committed by GitHub

fix anchor (#12620)

parent 0dcc3c86
@@ -43,14 +43,11 @@ The following is the brief description of the main concepts that will be describ
Most users with just 2 GPUs already enjoy the increased training speed delivered by DataParallel (DP) and DistributedDataParallel (DDP), which are almost trivial to use. Both are built-in features of PyTorch.
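For reference, a minimal sketch of the DDP wrapping step (assuming the script is launched with `torchrun`, which sets the usual rank environment variables; the model here is just a stand-in):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun starts one process per GPU and sets RANK/LOCAL_RANK/WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for a real model
model = DDP(model, device_ids=[local_rank])
# From here on, the training loop is unchanged: DDP averages gradients across GPUs during backward().
```

Launched with something like `torchrun --nproc_per_node=2 train.py`, each process sees its own GPU and its own shard of the data.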
## ZeRO Data Parallel
ZeRO-powered data parallelism (ZeRO-DP) is described in the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
![DeepSpeed-Image-1](imgs/parallelism-zero.png)
It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual DataParallel (DP), except that instead of replicating the full model params, gradients and optimizer states, each GPU stores only a slice of them. Then, at run-time, when the full layer params are needed for a given layer, all GPUs synchronize to give each other the parts they are missing - that's all there is to it.
Consider this simple model with 3 layers, where each layer has 3 params:
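As a hedged toy sketch (plain Python, not DeepSpeed's actual implementation; the layer and param names `La`/`a0` etc. are illustrative), this is how the params of such a model could be sliced across 3 GPUs, each GPU gathering the pieces it is missing just before it needs a given layer:

```python
# Toy illustration of ZeRO-DP parameter sharding, not DeepSpeed's real code.
# Model: 3 layers (La, Lb, Lc), each with 3 params.
layers = {
    "La": ["a0", "a1", "a2"],
    "Lb": ["b0", "b1", "b2"],
    "Lc": ["c0", "c1", "c2"],
}

num_gpus = 3

# Each GPU persistently stores only its own slice (here: one param per layer).
shards = {
    gpu: {name: [params[gpu]] for name, params in layers.items()}
    for gpu in range(num_gpus)
}
print(shards[0])  # {'La': ['a0'], 'Lb': ['b0'], 'Lc': ['c0']}

def all_gather(layer_name):
    """At run time, the full params of one layer are reassembled from all shards."""
    return [p for gpu in range(num_gpus) for p in shards[gpu][layer_name]]

# Just before computing layer La, every GPU gathers the slices it is missing,
# uses the full layer, and then drops the borrowed pieces again.
full_la = all_gather("La")  # ['a0', 'a1', 'a2']
```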
@@ -266,7 +263,7 @@ Implementations:
## DP+PP+TP+ZeRO
One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in [ZeRO Data Parallel](#zero-data-parallel). Normally it's a standalone feature that doesn't require PP or TP. But it can be combined with PP and TP.
When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding).
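As a hedged sketch (the batch-size value is a placeholder, not taken from this document), the relevant piece of a DeepSpeed configuration for such a setup would enable just stage 1 under `zero_optimization`:

```python
import json

# Minimal sketch of the ZeRO-related part of a DeepSpeed config when
# combining ZeRO-DP with pipeline parallelism: only stage 1
# (optimizer state sharding) is enabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {
        "stage": 1,  # shard optimizer states only
    },
}

# The dict can be dumped to a JSON file and handed to DeepSpeed as its config.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```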