Unverified Commit 9ee66ada authored by Stas Bekman, committed by GitHub

fix anchor (#12620)

parent 0dcc3c86
@@ -43,14 +43,11 @@ The following is the brief description of the main concepts that will be describ
Most users with just 2 GPUs already enjoy the increased training speed delivered by DataParallel (DP) and DistributedDataParallel (DDP), which are almost trivial to use. Both are built-in features of PyTorch.
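For reference, a minimal sketch of the DDP wrapping step (assuming the script is launched with `torchrun`, which sets the usual rank environment variables; the model here is just a stand-in):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun starts one process per GPU and sets RANK/LOCAL_RANK/WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for a real model
model = DDP(model, device_ids=[local_rank])
# From here on, the training loop is unchanged: DDP averages gradients across GPUs during backward().
```

Launched with something like `torchrun --nproc_per_node=2 train.py`, each process sees its own GPU and its own shard of the data.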
## ZeRO Data Parallel
ZeRO-powered data parallelism (ZeRO-DP) is described in the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
![DeepSpeed-Image-1](imgs/parallelism-zero.png)
It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual DataParallel (DP), except that instead of replicating the full model params, gradients and optimizer states, each GPU stores only a slice of them. Then, at run-time, when the full layer params are needed for a given layer, all GPUs synchronize to give each other the parts they are missing - that's all there is to it.
Consider this simple model with 3 layers, where each layer has 3 params:
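As a hedged toy sketch (plain Python, not DeepSpeed's actual implementation; the layer and param names `La`/`a0` etc. are illustrative), this is how the params of such a model could be sliced across 3 GPUs, each GPU gathering the pieces it is missing just before it needs a given layer:

```python
# Toy illustration of ZeRO-DP parameter sharding, not DeepSpeed's real code.
# Model: 3 layers (La, Lb, Lc), each with 3 params.
layers = {
    "La": ["a0", "a1", "a2"],
    "Lb": ["b0", "b1", "b2"],
    "Lc": ["c0", "c1", "c2"],
}

num_gpus = 3

# Each GPU persistently stores only its own slice (here: one param per layer).
shards = {
    gpu: {name: [params[gpu]] for name, params in layers.items()}
    for gpu in range(num_gpus)
}
print(shards[0])  # {'La': ['a0'], 'Lb': ['b0'], 'Lc': ['c0']}

def all_gather(layer_name):
    """At run time, the full params of one layer are reassembled from all shards."""
    return [p for gpu in range(num_gpus) for p in shards[gpu][layer_name]]

# Just before computing layer La, every GPU gathers the slices it is missing,
# uses the full layer, and then drops the borrowed pieces again.
full_la = all_gather("La")  # ['a0', 'a1', 'a2']
```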
@@ -266,7 +263,7 @@ Implementations:
## DP+PP+TP+ZeRO
One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in [ZeRO Data Parallel](#zero-data-parallel). Normally it's a standalone feature that doesn't require PP or TP. But it can be combined with PP and TP.
When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding).
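As a hedged sketch (the batch-size value is a placeholder, not taken from this document), the relevant piece of a DeepSpeed configuration for such a setup would enable just stage 1 under `zero_optimization`:

```python
import json

# Minimal sketch of the ZeRO-related part of a DeepSpeed config when
# combining ZeRO-DP with pipeline parallelism: only stage 1
# (optimizer state sharding) is enabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {
        "stage": 1,  # shard optimizer states only
    },
}

# The dict can be dumped to a JSON file and handed to DeepSpeed as its config.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```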