# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
    pipeline=2,
    tensor=4
)

# this is ok as well, as tensor will default to size 1
# and mode None
parallel = dict(
    pipeline=2
)

# this is ok as well, as pipeline will default to size 1
parallel = dict(
    tensor=dict(size=4, mode='2d')
)
```
The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional, and the data, pipeline, and tensor parallel sizes will default to 1. The value of data, pipeline and tensor can be an int representing the size of the specific parallel dimension, or a dictionary with a key "size". The key "mode" represents the mode of tensor parallelism.

**You can choose not to have 'parallel' in your configuration, and both pipeline and tensor will default to size 1.**
## Data Parallel
Data parallelism is the most common way to distribute your training task by splitting the data into several shards and training on a single shard on each device. The configuration for data parallelism is detected automatically and set for you; you do not have to set it explicitly in your configuration. When the data parallel size is larger than 1, Colossal-AI automatically adds a distributed data sampler to the dataloader to shard the dataset. There are two ways to handle the all-reduce in data parallelism in Colossal-AI:

1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers.
2. Otherwise, PyTorch `DistributedDataParallel` will be used.

In most cases, you will use the second mode unless you need complex handling of the gradients; a sketch of the first mode follows below.
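A hedged sketch of the first mode, assuming the dict-style configuration convention used throughout this document; the `gradient_handler` key and the handler name are assumptions and may differ across versions:

```python
# select a gradient handler in the configuration file (assumed key name);
# DataParallelGradientHandler all-reduces gradients across the data-parallel group
gradient_handler = [dict(type='DataParallelGradientHandler')]
```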
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods. Below is the list of papers that describe each tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
...
```python
# 1D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='3d')
)
```
Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed operator. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
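Below is a hedged sketch of dropping such a layer into a model definition; it assumes the '2d' mode has been set in the configuration and that `colossalai.nn.Linear2D` mirrors the `torch.nn.Linear` constructor, which may differ across versions:

```python
import torch.nn as nn
import colossalai.nn as col_nn

class MLP2D(nn.Module):
    # a small MLP whose linear layers are sharded by the 2D tensor parallel mode
    def __init__(self, dim: int = 256, hidden_dim: int = 1024):
        super().__init__()
        # Linear2D is assumed to take (in_features, out_features) like nn.Linear
        self.dense_1 = col_nn.Linear2D(dim, hidden_dim)
        self.activation = nn.GELU()
        self.dense_2 = col_nn.Linear2D(hidden_dim, dim)

    def forward(self, x):
        return self.dense_2(self.activation(self.dense_1(x)))
```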
## Pipeline Parallel (experimental)
Pipeline parallelism is to split the model into several partitions by layer. For example, let's assume we have a simple model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU and the second layer to the second GPU. This example of course wastes the computing resources and is only to demonstrate the idea of pipeline parallelism.

You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI will automatically create the pipeline schedule which defines the forward and backward steps.

```python
parallel = dict(
    pipeline=dict(size=4), # number of pipeline stages
)
```

As PyTorch is based on dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism, you have the following two ways to split your model:

1. Split your model directly. Below is an example of resnet split into two pipeline stages.
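A hedged sketch of such a direct split, using torchvision's `resnet18` and plain `nn.Sequential` partitions; any Colossal-AI-specific stage-assignment API is omitted here as an assumption:

```python
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)

# stage 0: stem and the first half of the residual blocks
stage0 = nn.Sequential(
    model.conv1, model.bn1, model.relu, model.maxpool,
    model.layer1, model.layer2,
)

# stage 1: the second half of the blocks plus the classification head
stage1 = nn.Sequential(
    model.layer3, model.layer4, model.avgpool,
    nn.Flatten(), model.fc,
)

# each stage would then be placed on its own device, e.g. stage0 -> cuda:0
```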
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is tricky for users to write their own process functions for pipeline parallelism. Aiming at offering accessible hybrid parallelism for users, we provide the powerful `Engine` class. This class enables pipeline parallelism and offers a one-forward-one-backward non-interleaving strategy. The `Engine` class is a high-level wrapper of the frequently-used training functions while preserving the PyTorch-like function signature and integrating with our features. Also, you can use the pre-defined learning rate scheduler in the `Engine` class to adjust the learning rate during training.

In order to build your engine, just set the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The following code block provides an example. **The engine is automatically created from the config file for you if you start with `colossalai.initialize`.**
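A minimal sketch of building an engine by hand, assuming the import paths and the keyword-argument constructor implied by the variable names above; the exact signatures may differ across Colossal-AI versions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# assumed import paths for the Engine wrapper and pipeline schedule
from colossalai.engine import Engine
from colossalai.engine.schedule import PipelineSchedule

model = resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    schedule=PipelineSchedule(num_microbatches=4),  # assumed schedule class
)
```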
## Zero Redundancy Optimizer (ZeRO)

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity and communication efficiency are retained.
...
## Getting Started with ZeRO
If you are training models with Colossal-AI, enabling ZeRO DP and offloading is easy: just add several lines to your configuration file. We support configurations for levels 2 and 3. You have to use the [PyTorch native implementation](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html) for the level 1 optimizer, as sketched below.
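A minimal sketch of the native level 1 optimizer from the linked recipe; it requires an initialized process group, and `model` is assumed to be defined already:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

# shard the Adam optimizer states across data-parallel ranks (ZeRO level 1)
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=0.001,
)
```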
Below are a few examples of ZeRO-3 configurations.
### Example of ZeRO-3 Configurations
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states and gradients (level 2), and additionally the parameters (level 3).
```python
optimizer = dict(
    type='Adam',
    lr=0.001,
    weight_decay=0
)

zero = dict(
    level=3,
    dynamic_loss_scale=True,
    clip_grad=1.0
)
```
2. Additionally offload the optimizer states and computations to the CPU.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        ...
    )
)
```
3. Save even more memory by offloading parameters to the CPU memory.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        ...
    )
)
```
4. Save even MORE memory by offloading to NVMe (if available on your system):
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='nvme',
        pin_memory=True,
        ...
    )
)
```
Note that `fp16` is automatically enabled when using ZeRO. This relies on `AMP_TYPE.NAIVE` in the Colossal-AI AMP module.
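For reference, a hedged sketch of the equivalent explicit AMP setting in a configuration file, assuming the `fp16` config key and the `AMP_TYPE` enum exposed by `colossalai.amp`:

```python
from colossalai.amp import AMP_TYPE

# naive AMP: run the model in fp16 while keeping master weights in fp32
fp16 = dict(
    mode=AMP_TYPE.NAIVE
)
```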