OpenDAS / ColossalAI

Commit 50982c0b authored Nov 01, 2021 by ver217

reorder parallelization methods in parallelization documentation

parent 3c7604ba
Showing 1 changed file with 49 additions and 49 deletions

docs/parallelization.md  +49  −49
...
@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
adds the distributed data sampler to the dataloader to shard the dataset.
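For readers who want to see the underlying mechanism, here is a small illustrative snippet (added here, not part of the changed file): sharding the dataset across data-parallel ranks is what PyTorch's `DistributedSampler` does, and Colossal-AI attaches an equivalent sampler to the dataloader for you when the data parallel size is larger than 1.

```python
# Illustration of what "adds the distributed data sampler" means in plain PyTorch;
# Colossal-AI performs an equivalent step automatically.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# DistributedSampler splits the indices across data-parallel ranks, so each rank
# sees a disjoint shard of the dataset. It requires torch.distributed to be
# initialized (e.g. via torch.distributed.init_process_group) before use.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```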
## 1D, 2D, 2.5D and 3D Parallel

To enable hybrid parallelism, we provide an array of tensor parallelism methods, along with the paper that introduces each one. These parallel modes need to work with the distributed layers provided by Colossal-AI. The processor-count constraints that each mode implies are sketched in the snippet right after this list.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
  2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices, where $N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
  Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism that further parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers, where each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
  We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed through optimized load balancing of parameters as well as activations.
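As an illustration of the device counts these formulas imply (this helper is added for illustration and is not a Colossal-AI API; its name and structure are assumptions), the following sketch derives the per-dimension chunk count $N$ from a tensor-parallel size and mode:

```python
import math

def tensor_chunks_per_dim(tensor_size: int, mode: str, depth: int = 1) -> int:
    """Hypothetical helper: derive N from the processor-count formulas above.

    2D:   P = N^2          -> N = sqrt(P)
    2.5D: P = N^2 * depth  -> N = sqrt(P / depth)
    3D:   P = N^3          -> N = cbrt(P)
    (Illustration only; not part of Colossal-AI.)
    """
    if mode == '1d':
        return tensor_size  # a single split dimension
    if mode == '2d':
        n = math.isqrt(tensor_size)
        assert n * n == tensor_size, "2D parallel needs a square number of devices"
        return n
    if mode == '2.5d':
        assert tensor_size % depth == 0, "2.5D parallel needs size divisible by depth"
        n = math.isqrt(tensor_size // depth)
        assert n * n * depth == tensor_size, "2.5D parallel needs size = N^2 * depth"
        return n
    if mode == '3d':
        n = round(tensor_size ** (1 / 3))
        assert n ** 3 == tensor_size, "3D parallel needs a cubic number of devices"
        return n
    raise ValueError(f"unknown tensor parallel mode: {mode}")

# The configs below use size=4 for 1D/2D (N=2), size=8 with depth=2 for 2.5D (N=2),
# and size=8 for 3D (N=2).
print(tensor_chunks_per_dim(4, '2d'))       # 2
print(tensor_chunks_per_dim(8, '2.5d', 2))  # 2
print(tensor_chunks_per_dim(8, '3d'))       # 2
```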
```python
# 1D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='3d')
)
```
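As a hedged usage sketch (added for illustration, not part of the original file): a `parallel` dict like the ones above is normally placed in a configuration file and handed to Colossal-AI when the training process is launched. The `colossalai.launch_from_torch` call and the `config.py` file name below follow Colossal-AI's quick-start examples, but the exact entry point can differ between versions, so treat this as an assumption rather than the documented API for this commit.

```python
# config.py -- choose one of the parallel dicts above, e.g. 2D tensor parallelism
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=4, mode='2d')
)
```

```python
# train.py -- assumed launch flow (entry-point name taken from Colossal-AI's
# quick-start examples; verify against the docs of the version you install)
import colossalai

colossalai.launch_from_torch(config='./config.py')
# ...build the model, optimizer and dataloader, then hand them to
# colossalai.initialize(...) as in Colossal-AI's training examples.
```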
## Pipeline Parallel (experimental)

Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
...
@@ -160,54 +208,6 @@ schedule = dict(
)
```
(The 49 deleted lines in this hunk are the same "## 1D, 2D, 2.5D and 3D Parallel" section shown above, removed from its previous position after the pipeline schedule example.)
## Sequence Parallel (experimental)
...