OpenDAS / ColossalAI

Commit 50982c0b authored Nov 01, 2021 by ver217

reorder parallelization methods in parallelization documentation

parent 3c7604ba
Showing 1 changed file with 49 additions and 49 deletions

docs/parallelization.md  +49  −49
...
@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
adds the distributed data sampler to the dataloader to shard the dataset.
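For readers who want to see the underlying mechanism, here is a small illustrative snippet (added here, not part of the changed file): sharding the dataset across data-parallel ranks is what PyTorch's `DistributedSampler` does, and Colossal-AI attaches an equivalent sampler to the dataloader for you when the data parallel size is larger than 1.

```python
# Illustration of what "adds the distributed data sampler" means in plain PyTorch;
# Colossal-AI performs an equivalent step automatically.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# DistributedSampler splits the indices across data-parallel ranks, so each rank
# sees a disjoint shard of the dataset. It requires torch.distributed to be
# initialized (e.g. via torch.distributed.init_process_group) before use.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```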
## 1D, 2D, 2.5D and 3D Parallel

To enable hybrid parallelism, we provide an array of tensor parallelism methods, along with the paper that introduces each one. These parallel modes need to work with the distributed layers provided by Colossal-AI. The processor-count constraints that each mode implies are sketched in the snippet right after this list.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
  2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices, where $N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
  Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism that further parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers, where each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
  We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed through optimized load balancing of parameters as well as activations.
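As an illustration of the device counts these formulas imply (this helper is added for illustration and is not a Colossal-AI API; its name and structure are assumptions), the following sketch derives the per-dimension chunk count $N$ from a tensor-parallel size and mode:

```python
import math

def tensor_chunks_per_dim(tensor_size: int, mode: str, depth: int = 1) -> int:
    """Hypothetical helper: derive N from the processor-count formulas above.

    2D:   P = N^2          -> N = sqrt(P)
    2.5D: P = N^2 * depth  -> N = sqrt(P / depth)
    3D:   P = N^3          -> N = cbrt(P)
    (Illustration only; not part of Colossal-AI.)
    """
    if mode == '1d':
        return tensor_size  # a single split dimension
    if mode == '2d':
        n = math.isqrt(tensor_size)
        assert n * n == tensor_size, "2D parallel needs a square number of devices"
        return n
    if mode == '2.5d':
        assert tensor_size % depth == 0, "2.5D parallel needs size divisible by depth"
        n = math.isqrt(tensor_size // depth)
        assert n * n * depth == tensor_size, "2.5D parallel needs size = N^2 * depth"
        return n
    if mode == '3d':
        n = round(tensor_size ** (1 / 3))
        assert n ** 3 == tensor_size, "3D parallel needs a cubic number of devices"
        return n
    raise ValueError(f"unknown tensor parallel mode: {mode}")

# The configs below use size=4 for 1D/2D (N=2), size=8 with depth=2 for 2.5D (N=2),
# and size=8 for 3D (N=2).
print(tensor_chunks_per_dim(4, '2d'))       # 2
print(tensor_chunks_per_dim(8, '2.5d', 2))  # 2
print(tensor_chunks_per_dim(8, '3d'))       # 2
```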
```python
# 1D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='3d')
)
```
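As a hedged usage sketch (added for illustration, not part of the original file): a `parallel` dict like the ones above is normally placed in a configuration file and handed to Colossal-AI when the training process is launched. The `colossalai.launch_from_torch` call and the `config.py` file name below follow Colossal-AI's quick-start examples, but the exact entry point can differ between versions, so treat this as an assumption rather than the documented API for this commit.

```python
# config.py -- choose one of the parallel dicts above, e.g. 2D tensor parallelism
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=4, mode='2d')
)
```

```python
# train.py -- assumed launch flow (entry-point name taken from Colossal-AI's
# quick-start examples; verify against the docs of the version you install)
import colossalai

colossalai.launch_from_torch(config='./config.py')
# ...build the model, optimizer and dataloader, then hand them to
# colossalai.initialize(...) as in Colossal-AI's training examples.
```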
## Pipeline Parallel (experimental)

Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
...
@@ -160,54 +208,6 @@ schedule = dict(
)
```
(The 49 deleted lines in this hunk are the same "## 1D, 2D, 2.5D and 3D Parallel" section shown above, removed from its previous position after the pipeline schedule example.)
## Sequence Parallel (experimental)
...