Trainer is a higher-level wrapper that allows the user to execute training with fewer lines of code. However, in pursuit of more abstraction, it loses some flexibility compared to the engine. The trainer is designed to execute forward and backward steps to perform model weight updates. It is easy to create a trainer object by passing it the engine object. The trainer has a default value `None` for the argument `schedule`. In most cases, we leave this value as `None` unless we want to use pipeline parallelism. If you wish to explore more about this parameter, you can go to the tutorial on pipeline parallelism.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks

# build components and initialize with colossalai.initialize
...

# create a logger so that trainer can log on the console
logger = get_dist_logger()

# create a trainer object
trainer = Trainer(
    engine=engine,
    logger=logger
)
```
In the trainer, the user can customize some hooks and attach them to the trainer object. A hook object will execute life-cycle methods periodically based on the training scheme. For example, the `LRSchedulerHook` will execute `lr_scheduler.step()` to update the learning rate of the model during either the `after_train_iter` or `after_train_epoch` stage, depending on whether the user wants to update the learning rate after each training iteration or only after the entire training epoch. You can store the hook objects in a list and pass it to the `trainer.fit` method. `trainer.fit` will execute training and testing based on your parameters. If `display_progress` is True, a progress bar will be displayed on your console to show the training progress, as shown in the sketch below.
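For reference, a hedged sketch of a hook list and the `trainer.fit` call is shown below; the hook classes mirror those used in the trainer tutorial, while `lr_scheduler`, the dataloaders, and `NUM_EPOCHS` in the config are assumed to come from your own setup, so verify the exact names against your Colossal-AI version.
```python
from colossalai.core import global_context as gpc
from colossalai.nn.metric import Accuracy
from colossalai.trainer import hooks

# assemble hooks: track loss, update LR per epoch, evaluate accuracy, log metrics per epoch
hook_list = [
    hooks.LossHook(),
    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
    hooks.AccuracyHook(accuracy_func=Accuracy()),
    hooks.LogMetricByEpochHook(logger),
]

# run training and evaluate once per epoch
trainer.fit(
    train_dataloader=train_dataloader,
    epochs=gpc.config.NUM_EPOCHS,
    test_dataloader=test_dataloader,
    test_interval=1,
    hooks=hook_list,
    display_progress=True,
)
```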
If you want to customize your own hook class, you can inherit `hooks.BaseHook` and override the life-cycle methods of your interest. A dummy example to demonstrate how to create a simple log message hook is provided below for your reference.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import hooks


class LogMessageHook(hooks.BaseHook):
    def __init__(self, priority=10):
        self._logger = get_dist_logger()

    def before_train(self, trainer):
        self._logger.info('training starts')

    def after_train(self, trainer):
        self._logger.info('training finished')

...

# then in your training script
hook_list.append(LogMessageHook())
```
In the sections below, I will guide you through the steps required to train a ResNet model with both engine and trainer.
## Explain with ResNet
### Overview
In this section we will cover:
1. Use an engine object to train a ResNet34 model on CIFAR10 dataset
2. Use a trainer object to train a ResNet34 model on CIFAR10 dataset
The project structure will be like:
```bash
-- config.py
-- run_resnet_cifar10_with_engine.py
-- run_resnet_cifar10_with_trainer.py
```
Steps 1-4 below are commonly used regardless of using engine or trainer. Thus, steps 1-4 + step 5 will be your `run_resnet_cifar10_with_engine.py` and steps 1-4 + step 6 will form `run_resnet_cifar10_with_trainer.py`.
### Hands-on Practice
#### Step 1. Create a Config File
In your project folder, create a `config.py`. This file is to specify some features you may want to use to train your model. A sample config file is as below:
```python
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
NUM_EPOCHS = 200

fp16 = dict(
    mode=AMP_TYPE.TORCH
)
```
In this config file, we specify that we want to use batch size 128 per GPU and run for 200 epochs. These two parameters are exposed by `gpc.config`. For example, you can use `gpc.config.BATCH_SIZE` to access the value you store in your config file. The `fp16` configuration tells `colossalai.initialize` to use mixed precision training provided by PyTorch to train the model with better speed and lower memory consumption.
#### Step 2. Initialize Distributed Environment
We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
[launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and the PyTorch launch utility.
```python
import colossalai

# ./config.py refers to the config file we just created in step 1
colossalai.launch_from_torch(config='./config.py')
```
#### Step 3. Build Training Components
Then build your components in the same way as you normally would in your PyTorch scripts. In the script below, we set the root path for the CIFAR10 dataset via an environment variable `DATA`. You can change it to any path you like; for example, you can change `root=Path(os.environ['DATA'])` to `root='./data'` so that there is no need to set the environment variable.
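A condensed sketch of this step is given below; the ResNet-34 constructor, transforms, and optimizer hyperparameters are illustrative choices under the config above rather than the tutorial's verbatim code, and `get_dataloader` from `colossalai.utils` is used to attach a distributed sampler.
```python
import os
from pathlib import Path

import torch
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torchvision.models import resnet34

from colossalai.core import global_context as gpc
from colossalai.utils import get_dataloader

# build the model
model = resnet34(num_classes=10)

# build CIFAR10 datasets rooted at $DATA
train_dataset = CIFAR10(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose([
        transforms.RandomCrop(size=32, padding=4),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]),
)
test_dataset = CIFAR10(
    root=Path(os.environ['DATA']),
    train=False,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]),
)

# build dataloaders with the batch size from config.py
train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True,
                                  batch_size=gpc.config.BATCH_SIZE,
                                  pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset,
                                 batch_size=gpc.config.BATCH_SIZE,
                                 pin_memory=True)

# build loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
```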
#### Step 4. Initialize with Colossal-AI
Next, the essential step is to obtain the engine object by calling `colossalai.initialize`. As stated in `config.py`, we will be using mixed precision training to train the ResNet34 model. `colossalai.initialize` will automatically check your config file and assign relevant features to your training components. In this way, our engine object is already able to train with mixed precision, but you do not have to take care of it explicitly.
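A minimal sketch of this call, assuming the components from the previous step are in scope:
```python
# colossalai.initialize reads gpc.config (e.g. the fp16 block) and wraps the
# components accordingly, returning an engine plus (possibly wrapped) dataloaders
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
)
```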
Lastly, we can invoke the scripts using the distributed launcher provided by PyTorch as we used `launch_from_torch` in Step 2. You need to replace `<num_gpus>` with the number of GPUs available on your machine. This number can be 1 if you only want to use 1 GPU. If you wish to use other launchers, you can refer to the tutorial on How to Launch Colossal-AI.
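For example, with the PyTorch launcher, the invocation may look like the following (the script names follow the project structure above; replace `<num_gpus>` with the number of GPUs you have):
```shell
# train with the engine-based script
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
# or train with the trainer-based script
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
```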
`backend` is optional and the default value is `nccl`.
### Native Launch
To initialize the distributed environment, we provide a general `colossalai.launch` API. The `colossalai.launch` function takes in the parameters listed above and creates a default process group in the communication network. This function is often used with the default parser for convenience.
```python
import colossalai

# parse arguments
args = colossalai.get_default_parser().parse_args()

# launch distributed environment
colossalai.launch(config=<CONFIG>,
                  rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port,
                  backend=args.backend
)
```
### Launch with Colossal-AI CLI
To enable easy launching on both single and multiple nodes, we have implemented a launcher for Colossal-AI. This launcher is
a wrapper of the torch distributed launch utility, enhanced with the capability of launching multi-node jobs easily.
First, we need to set the launch method in our code. As this is a wrapper of the torch distributed launch utility, we will
use `colossalai.launch_from_torch`. The arguments required for the distributed environment, such as rank, world size, host and port, are all set by the PyTorch
launcher and can be read directly from the environment variables.
```python
import colossalai

colossalai.launch_from_torch(
    config=<CONFIG>,
)
```
Next, we can easily start multiple processes with `colossalai run` in your terminal. Below is an example to run the code
on a single node with 4 GPUs. You can change the number of GPUs with `--nproc_per_node` and the default port with `--master_port`.
```shell
# run on the local node with 4 GPUs (default port: 29500)
colossalai run --nproc_per_node 4 train.py
# run on the local node with 4 GPUs with a different port
colossalai run --nproc_per_node 4 --master_port 29505 test.py
```
If you are in a cluster and want to launch multi-node training, the CLI can help you start processes on different nodes
with one simple command. There are two ways you can launch multi-node jobs.
- Run with `--host`
This is suitable when you only have a few nodes. Let's say I have two nodes, namely `host1` and `host2`, I can start
multi-node training with the following command. Compared to single-node training, you must specify the `master_addr`
option, which is auto-set to localhost if running on a single node only.
:::caution
`master_addr` cannot be localhost when running on multiple nodes, it should be the hostname or IP address of a node.
:::
```shell
# run on these two nodes
colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
```
- Run with `--hostfile`
This method is suitable when you have a lot of nodes. The host file is a simple text file listing the available nodes.
The list of nodes is commonly provided by cluster managers such as SLURM and PBS Pro. For example, you can get the list
of nodes allocated to you via the environment variable `SLURM_NODELIST` in SLURM and `PBS_NODEFILE` in PBS Pro.
Just do `echo $SLURM_NODELIST` or `cat $PBS_NODEFILE` to check it out. If you do not have such cluster managers, you can
manually create one for your own use.
The host file given to Colossal-AI launcher must be in the following format where each line is the host name of a node.
```text
host1
host2
```
With the host file ready, we can launch multi-node jobs with the following commands. Just like using `--host`, you also
need to specify the `master_addr` option. Some extra options are provided for `--hostfile` as listed below:
- `--include`: specify the hosts to include for multi-node jobs. For example, if your host file has 8 nodes, but you happen to only want to run on 6 of them, you can add `--include host1,host2,host3,...,host6` so that the job will only be launched on those 6 nodes.
- `--exclude`: specify the hosts to exclude for multi-node jobs. This is useful when some nodes are faulty. For example, if host1's GPU has some problems and you do not wish to run on host1 but on all other nodes, you can add `--exclude host1` so that the job will only be launched on the remaining nodes.
```shell
# run with a hostfile
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 test.py
```
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
**This function is experimental.**
## Introduction
In this tutorial, you will learn how to save and load model checkpoints.
To leverage the power of parallel strategies in Colossal-AI, modifications to models and tensors are needed, so you cannot directly use `torch.save` or `torch.load` to save or load model checkpoints. Therefore, we provide an API that achieves the same thing.
Moreover, when loading, you are not required to use the same parallel strategy as when saving.
## How to use
### Save
There are two ways to train a model in Colossal-AI, by engine or by trainer.
**Be aware that we only save the `state_dict`.** Therefore, when loading the checkpoints, you need to define the model first.
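As a hedged sketch of the idea (the exact helper names and signatures under `colossalai.utils` are assumptions to verify against the checkpoint API reference):
```python
from colossalai.utils import save_checkpoint, load_checkpoint

# save: only the state_dict is written, together with the epoch number
save_checkpoint('./checkpoint.pt', epoch, model)

# load: define/build the model first, then restore its weights into it
model = build_model()   # your own model constructor
load_checkpoint('./checkpoint.pt', model)
```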
As deep learning models grow in size, it is important to shift to a new training paradigm. The traditional training method with no parallelism and optimization has become a thing of the past, and new training methods are the key to making training large-scale models efficient and cost-effective.
Colossal-AI is designed to be a unified system that provides an integrated set of training skills and utilities to the user. You can find common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism techniques including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithms. We also provide different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading are covered in detail in this tutorial documentation as well.
## General Usage
We aim to make Colossal-AI easy to use and non-intrusive to user code. There is a simple general workflow if you want to use Colossal-AI.
<figcaption>Image source: <a href="https://towardsdatascience.com/distributed-training-in-the-cloud-cloud-machine-learning-engine-9e264ddde27f">Towards Data Science</a></figcaption>
</figure>
A distributed system consists of multiple software components which run on multiple machines. For example, the traditional
database runs on a single machine. As the amount of data gets incredibly large, a single machine can no longer deliver desirable
performance to the business, especially in situations such as Black Friday where network traffic can be unexpectedly high.
To handle such pressure, modern high-performance databases are designed to run on multiple machines, which work together to provide
high throughput and low latency to the user.
One important evaluation metric for a distributed system is scalability. For example, when we run an application on 4 machines,
we naturally expect it to run 4 times faster. However, due to communication overhead and differences in
hardware performance, it is difficult to achieve linear speedup. Thus, it is important to consider how to make the application
faster when we implement it. Well-designed algorithms and system optimizations can help deliver good performance. Sometimes
it is even possible to achieve linear or super-linear speedup.
## Why do we need distributed training for machine learning?
Back in 2012, [AlexNet](https://arxiv.org/abs/1404.5997) won the ImageNet competition, and it was trained
on two GTX 580 3GB GPUs.
Today, most models that appear in the top AI conferences are trained on multiple GPUs. Distributed training is definitely
a common practice when researchers and engineers develop AI models. There are several reasons behind this trend.
1. Model size increases rapidly. [ResNet50](https://arxiv.org/abs/1512.03385) had 20 million parameters in 2015, and
[BERT-Large](https://arxiv.org/abs/1810.04805) had 345 million parameters in 2018.
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf)
## Introduction
Tensor parallelism partitions model weights across multiple devices in order to reduce memory load.
An efficient 1D tensor parallelism implementation was introduced by [Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf).
Let's take a linear layer as an example, which consists of a GEMM $Y = XA$. Given 2 processors, we split the columns of $A$ into $[A_1 ~ A_2]$, and calculate $Y_i = XA_i$ on each processor, which then forms $[Y_1 ~ Y_2] = [XA_1 ~ XA_2]$. This is called a column-parallel fashion.
When a second linear layer $Z=YB$ follows the column-parallel one, we split $B$ into $\left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$,
which is called a row-parallel fashion.
To calculate $Z = [Y_1 ~ Y_2] \left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$, we first calculate $Y_iB_i$ on each processor, then use an all-reduce to aggregate the results as $Z=Y_1B_1+Y_2B_2$.
We also need to note that in the backward pass, the column-parallel linear layer needs to aggregate the gradients of the input tensor $X$, because on each processor $i$ we only have $\dot{X_i}=\dot{Y_i}A_i^T$.
Thus, we apply an all-reduce across the processors to get $\dot{X}=\dot{Y}A^T=\dot{Y_1}A_1^T+\dot{Y_2}A_2^T$.
## Efficiency
Given $P$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 1D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
))
```
Then Colossal-AI will automatically apply 1D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0


class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.transpose(0, 1).shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.transpose(0, 1).shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 2 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()

colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([256, 512])
Weight of the second linear layer: torch.Size([512, 256])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the column-parallel partitioning, it becomes `[256, 512]`.
Similarly, the second row-parallel layer partitions the weight `[1024, 256]` into `[512, 256]`.
The activation outputs printed during a forward pass look like:
```shell
Output of the first linear layer: torch.Size([16, 512])
Output of the second linear layer: torch.Size([16, 256])
```
The output of the first linear layer is split into 2 partitions (each has the shape `[16, 512]`), while the second layer has identical outputs across the GPUs.
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/pdf/2104.05343.pdf)
## Introduction
1D tensor parallelism does not partition activations, which can also consume a great amount of memory for large-scale models.
To evenly distribute the computation and memory load, [an efficient 2D tensor parallelism algorithm](https://arxiv.org/pdf/2104.05343.pdf) was introduced based on SUMMA (Scalable Universal Matrix Multiplication Algorithm).
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q\times q$ processors (necessary condition), e.g. $q=2$, we split both the input $X$ and the weight $A$ into $q\times q$ blocks. Following the SUMMA steps of block broadcasts and local multiply-accumulates over $q$ steps, the result is

$$
Y = XA = \left[\begin{matrix} X_{10}A_{00}+X_{11}A_{10} & X_{10}A_{01}+X_{11}A_{11} \\ X_{00}A_{00}+X_{01}A_{10} & X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right].
$$
## Efficiency
Given $P=q\times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=4, mode='2d'),
))
```
Then Colossal-AI will automatically apply 2D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0


class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 4 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()

colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
- [2.5-dimensional distributed model training](https://arxiv.org/pdf/2105.14500.pdf)
## Introduction
Compared with 1D tensor parallelism, 2D parallelism reduces the memory cost, but may introduce more communication.
Therefore, a [2.5D tensor parallelism algorithm](https://arxiv.org/pdf/2105.14500.pdf) was proposed based on 2.5D SUMMA to reduce communication by using more devices.
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times d$ processors (necessary condition), e.g. $q=d=2$, we split the input $X$ into $d\times q$ rows and $q$ columns.
## Efficiency
Given $P=q \times q \times d$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2.5D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=8, mode='2.5d', depth=2),
))
```
Then Colossal-AI will automatically apply 2.5D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0


class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 8 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()

colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2.5D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
- [ColossalAI-Examples - 3D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_3d.py)
**Related Paper**
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/pdf/2105.14450.pdf)
## Introduction
The [3D tensor parallelism](https://arxiv.org/pdf/2105.14450.pdf) approach parallelizes the computation of neural models, aiming to achieve the optimal communication cost.
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times q$ processors (necessary condition), e.g. $q=2$, we split the input $X$ and weight $A$ into blocks $X_{ijl}$ and $A_{lji}$, where each block is stored on processor $(i,j,l)$.
Then we all-gather $X_{ijl}$ across $(i, 0...q,l)$, as well as $A_{lji}$ across $(0...q, j, l)$.
So, we have $X_{il}$ and $A_{lj}$ on each processor $(i,j,l)$ to get $X_{il}A_{lj}$.
Finally, we reduce-scatter the results across $(i, j, 0...q)$ to get $Y_{ijl}$, which forms
$$
Y=
\left[\begin{matrix}
Y_{000} & Y_{001} \\
Y_{010} & Y_{011} \\
Y_{100} & Y_{101} \\
Y_{110} & Y_{111} \end{matrix}
\right].
$$
We also need to note that in the backward pass, we need to all-gather the gradient $\dot{Y_{ijl}}$, and then reduce-scatter the gradient $\dot{X_{il}}=\dot{Y_{ij}}A_{lj}^T$ and $\dot{A_{lj}}=X_{il}^T\dot{Y_{ij}}$.
## Efficiency
Given $P=q \times q \times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 3D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=8, mode='3d'),
))
```
Then Colossal-AI will automatically apply 3D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0


class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 8 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()

colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 256])
Weight of the second linear layer: torch.Size([512, 64])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 3D parallelism, it becomes `[128, 256]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 64]`.
The activation outputs printed during a forward pass look like:
```shell
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 3D parallelism are all split by $q^2$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the second layer's output has the shape `[4, 128]`.
Note that although the weight shapes of 3D parallelism happen to match those of 2.5D parallelism here, the content of each partition is different.
| AMP mode | Tensor parallel | Pipeline parallel | fp16 behavior |
| -------- | --------------- | ----------------- | ------------- |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
The first two rely on the original implementation of PyTorch (version 1.6 and above) and NVIDIA Apex.
The last method is similar to Apex O2 level.
Among these methods, apex AMP is not compatible with tensor parallelism.
This is because tensors are split across devices in tensor parallelism; thus, communication among different processes is required to check whether inf or nan occurs across the whole model's weights.
We modified the torch amp implementation so that it is compatible with tensor parallelism now.
> ❌️ fp16 and zero configuration are not compatible
>
> ⚠️ Pipeline parallelism only supports naive AMP currently
We recommend using torch AMP, as it generally gives better accuracy than naive AMP if no pipeline is used.
## Table of Contents
In this tutorial we will cover:
1. AMP introduction
2. AMP in Colossal-AI
3. Hands-on Practice
## AMP Introduction
Automatic Mixed Precision training is a mixture of FP16 and FP32 training.
The half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency.
Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
available for larger batch sizes and model sizes.
However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That is why we introduce automatic mixed precision, which attempts to match each operation to its appropriate data type, reducing the memory footprint and improving training efficiency.
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.
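Putting these options together, a Torch AMP block in `config.py` may look like the sketch below; the keys mirror the `torch.cuda.amp.GradScaler` constructor, and treating all of them as accepted here is an assumption worth checking against the AMP API reference.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,
    # the remaining keys are forwarded to torch.cuda.amp.GradScaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True,
)
```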
### Apex AMP Configuration
For this mode, we rely on the Apex implementation for mixed precision training.
We support this plugin because it allows for finer control on the granularity of mixed precision.
For example, O2 level (optimization level 2) will keep batch normalization in fp32.
If you look for more details, please refer to [Apex Documentation](https://nvidia.github.io/apex/).
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```
Parameters:
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.
- opt_level(str, optional, default="O1"): Pure or mixed precision optimization level.
Accepted values are “O0”, “O1”, “O2”, and “O3”, explained in detail in the Apex AMP documentation.
- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
losses/backward passes, but use a single global loss scale for all of them.
- verbosity(int, default=1): Set to 0 to suppress Amp-related output.
- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
- max_loss_scale(float, default=2.**24 ): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.
Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
These optional properties can be overridden once opt_level is determined:
- cast_model_type: Casts your model’s parameters and buffers to the desired type.
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
### Naive AMP Configuration
In Naive AMP mode, we achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
This AMP mode will cast all operations into fp16.
The following code block shows the `config.py` file for this mode.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    # below are the default values
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```
The default parameters of Naive AMP:
- log_num_zeros_in_grad(bool): whether to log the number of zeros in the gradients.
- initial_scale(int): initial scale of gradient scaler
- growth_factor(int): the growth rate of loss scale
- backoff_factor(float): the decrease rate of loss scale
- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info
When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
The output model is converted to an AMP model with smaller memory consumption.
If your input model is already too large to fit in a GPU, please instantiate your model weights with `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallel training techniques!
## Hands-on Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.
### Step 1. Create a config file
Create a `config.py` and add the `fp16` configuration.
```python
# in config.py
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 300

fp16 = dict(
    mode=AMP_TYPE.TORCH,
)

clip_grad_norm = 1.0
```
### Step 2. Import libraries in train_with_engine.py
Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running `pip install timm scipy`.
### Step 3. Initialize Distributed Environment
We then need to initialize the distributed environment. For demonstration purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
for other initialization methods.
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()

# launch from torch
colossalai.launch_from_torch(config=args.config)
```
### Step 4. Create training components
Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.
- [Zero Redundancy Optimizer with chunk-based memory management](../features/zero_with_chunk.md)
## Introduction
If a model has `N` parameters, Adam keeps `8N` bytes of optimizer states. For billion-scale models, optimizer states take at least 32 GB of memory. GPU memory limits the scale of models we can train, which is called the GPU memory wall. If we offload optimizer states to disk, we can break through the GPU memory wall.
We implement a user-friendly and efficient asynchronous Tensor I/O library: [TensorNVMe](https://github.com/hpcaitech/TensorNVMe). With this library, we can simply implement NVMe offload.
> This library is compatible with all kinds of disks (HDD, SATA SSD, and NVMe SSD). As the I/O bandwidth of an HDD or SATA SSD is low, it's recommended to use this library only on NVMe disks.
When optimizing a parameter, we can divide the optimization process into three stages: read, compute and offload. We perform the optimization process in a pipelined fashion, which can overlap computation and I/O.
First, please make sure you installed [TensorNVMe](https://github.com/hpcaitech/TensorNVMe):
```shell
pip install packaging
pip install tensornvme
```
We implement NVMe offload of optimizer states for Adam ([CPUAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.cpu_adam.html) and [HybridAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.hybrid_adam.html)).
`nvme_offload_fraction` is the fraction of optimizer states to be offloaded to NVMe. `nvme_offload_dir` is the directory to save NVMe offload files. If `nvme_offload_dir` is `None`, a random temporary directory will be used.
It's compatible with all parallel methods in ColossalAI.
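A minimal sketch of enabling it with `HybridAdam`, using the two keyword arguments described above (verify the exact signature in your version's API reference):
```python
from colossalai.nn.optimizer import HybridAdam

# offload all optimizer states to NVMe files under ./offload
optimizer = HybridAdam(
    model.parameters(),
    lr=1e-3,
    nvme_offload_fraction=1.0,
    nvme_offload_dir='./offload',
)
```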
- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel)
**Related Paper**
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
## Quick introduction
In this tutorial, you will learn how to use pipeline parallelism. In Colossal-AI, we use the 1F1B pipeline introduced by NVIDIA. In this case, ViT and ImageNet are too large to use, so we use ResNet and CIFAR-10 as an example here.
## Table of Contents
In this tutorial we will cover:
1. Introduction of 1F1B pipeline.
2. Usage of non-interleaved and interleaved schedule.
3. Training ResNet with pipeline.
## Introduction of 1F1B pipeline
First of all, we will introduce GPipe to help you better understand 1F1B.
<figcaption>Figure 1: GPipe. This figure is from the <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper.</figcaption>
</figure>
As you can see, for GPipe, the backward passes are executed only after the forward passes of all microbatches in a batch have finished.
In general, 1F1B (one forward pass followed by one backward pass) is more efficient than GPipe (in memory, or in both memory and time). There are two schedules of the 1F1B pipeline, the non-interleaved and the interleaved. The figures are shown below.
<figcaption>Figure 2: This figure is from the <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper. The top part shows the default non-interleaved schedule, and the bottom part shows the interleaved schedule.</figcaption>
</figure>
### Non-interleaved Schedule
The non-interleaved schedule can be divided into three stages. The first stage is the warm-up stage, where workers perform differing numbers of forward passes. At the following stage, workers perform one forward pass followed by one backward pass. Workers will finish backward passes at the last stage.
This mode is more memory-efficient than GPipe. However, it takes roughly the same amount of time as GPipe to finish one round of passes.
### Interleaved Schedule
This schedule requires **the number of microbatches to be an integer multiple of the number of pipeline stages**.
In this schedule, each device can perform computation for multiple subsets of layers (called model chunks) instead of a single contiguous set of layers. For example, before, device 1 had layers 1-4 and device 2 had layers 5-8, and so on; now device 1 has layers 1, 2, 9, 10, device 2 has layers 3, 4, 11, 12, and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages, and each pipeline stage has less computation.
This mode is both memory-efficient and time-efficient.
## Usage of non-interleaved and interleaved schedule
In Colossal-AI, we provide both the non-interleaved (as `PipelineSchedule`) and the interleaved schedule (as `InterleavedPipelineSchedule`).
You just need to set `NUM_MICRO_BATCHES` in the config file, and also set `NUM_CHUNKS` if you want to use the interleaved pipeline schedule. If you know for certain the shape of each pipeline stage's output tensor and the shapes are all the same, you can set `TENSOR_SHAPE` in the config file to further reduce communication. Otherwise, you can just ignore `tensor_shape`, and the shape will be exchanged between pipeline stages automatically. Then we will generate an appropriate schedule for you. A config sketch is shown below.
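As a reference, the pipeline section of `config.py` might look like the sketch below; the concrete numbers and the commented `TENSOR_SHAPE` line are illustrative assumptions.
```python
# pipeline settings in config.py
NUM_MICRO_BATCHES = 4
NUM_CHUNKS = 2      # only needed for InterleavedPipelineSchedule
# TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, 3, 32, 32)   # optional, skips shape exchange

parallel = dict(
    pipeline=2,     # number of pipeline stages
)
```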
## Training ResNet with pipeline
Let's build the `ResNet` model first with Colossal PipelinableContext:
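The original snippet is not reproduced here, so the following is only a hedged sketch of the idea; the module path `colossalai.pipeline.pipelinable` and the `to_layer_list`/`partition` helpers are assumptions that should be checked against the pipeline API reference.
```python
from torchvision.models import resnet50

from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode
from colossalai.pipeline.pipelinable import PipelinableContext

# trace model construction inside the context so it can later be split into stages
pipelinable = PipelinableContext()
with pipelinable:
    model = resnet50(num_classes=10)

# flatten the traced modules and keep one partition per pipeline rank
pipelinable.to_layer_list()
model = pipelinable.partition(1,
                              gpc.pipeline_parallel_size,
                              gpc.get_local_rank(ParallelMode.PIPELINE))
```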
- [Define Your Configuration](../basics/define_your_config.md)
**Example Code**
- [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)
**Related Paper**
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
## Introduction
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism, while the computational granularity
and communication efficiency is retained.
1. **Shard Optimizer States**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights,
and the first and second momentum estimates) are partitioned across the processes, so that each process updates only its partition.
2. **Shard Gradient**: After reduction inside the data parallel process group, gradient tensors are also partitioned such that each process only stores the gradients corresponding to its partition of the optimizer states. Note that Colossal-AI converts gradients to fp32 format to participate in parameter updating.
3. **Shard Parameter**: The 16-bit model parameters are partitioned across the processes of a data parallel group.
4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.
Besides, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.
When using ZeRO, we distribute the model by sharding the parameters. The advantage of this method is that the memory load of each node is balanced. But this approach has two significant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to memory fragmentation problems. Second, using the tensor as the granularity for communication leaves the network bandwidth underutilized. Generally, the longer the transmitted message, the higher the bandwidth utilization.
Using the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store a continuous set of parameters in initialization order into a Chunk (a chunk is a continuous memory space), and each Chunk has the same size. Organizing memory in chunks can lead to efficient use of network bandwidth between PCI-e and GPU-GPU, reduce the number of communications, and avoid potential memory fragmentation.
Before v0.1.8, ZeRO had a high communication cost for parameter communication. If a parameter was used multiple times in several consecutive operators, there would be repeated communication operations, and efficiency was severely degraded. This situation is very common when using the gradient checkpoint technique, where forward propagation is recomputed during backward propagation and parameters have to be gathered again.
Taking GPT as an example, its Checkpoint will be applied to each GPT Block, and each GPT Block contains a Self-Attention layer and an MLP layer. During the backward pass, the forward of the Self-Attention layer and the MLP layer will be computed in turn, and then the backward of the MLP layer and the Self-Attention layer will be computed in turn.
In addition, due to the communication and memory movement of small tensors, the bandwidth of NVLINK and PCI-E cannot be fully utilized, and each communication and memory movement carries kernel launch overhead. With Chunk, many small tensor communications and memory movements are merged into one large tensor communication and memory movement, which not only improves bandwidth utilization but also reduces kernel launch overhead.
We also provide a lightweight chunk search mechanism to help users automatically find the chunk size with the smallest memory fragmentation.
## Usage
### GeminiDDP
We will use `GeminiDDP` to enable ZeRO with chunk-based memory management. This is our new `torch.Module` wrapper that uses ZeRO-DP and Gemini: ZeRO is for parallelism and Gemini is for memory management.
Also, make sure that your model is initialized under the context of `ColoInitContext`.
`hidden_dim` is the hidden dimension of the DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is OK; we will use a default value of 1024. `min_chunk_size_mb` is the minimum chunk size in megabytes. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk. A construction sketch is given after the note below.
> ⚠️ Note: Please do not use `loss.backward()`, the standard way of writing is `optimizer.backward(loss)`.
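The sketch below illustrates the construction and one training step; the import paths, `placement_policy`, and the assumption that `optimizer` is a ZeRO/Gemini-aware optimizer exposing `backward()` should all be verified against your Colossal-AI version.
```python
from colossalai.nn.parallel import GeminiDDP
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

# build the model under ColoInitContext, as required above
with ColoInitContext(device=get_current_device()):
    model = build_gpt_model()   # your own model constructor

# wrap with GeminiDDP: ZeRO-DP for parallelism, Gemini for memory management
model = GeminiDDP(model,
                  device=get_current_device(),
                  placement_policy='auto',
                  pin_memory=True,
                  hidden_dim=1024,
                  min_chunk_size_mb=32)

# `optimizer` is assumed to be a ZeRO/Gemini-aware Colossal-AI optimizer
# that exposes backward(); one training step then looks like:
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)    # note: not loss.backward()
optimizer.step()
optimizer.zero_grad()
```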
### Train GPT
In this example, we use `Hugging Face Transformers`. You have to install `transformers` before running this example. We will take `GPT2 Medium` as an example here.
For simplicity, we just use randomly generated data here.
First, we only need to import `GPT2LMHeadModel` from Hugging Face `transformers` to define our model; users do not need to define or modify the model themselves, which makes it more convenient to use. A model definition sketch is shown below.
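The following is a minimal sketch of such a model definition; the GPT2-Medium hyperparameters (`n_embd=1024`, `n_layer=24`, `n_head=16`) are the standard ones, while the wrapper class itself is illustrative rather than the example's verbatim code.
```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel


class GPTLMModel(torch.nn.Module):
    """GPT2-Medium built from Hugging Face transformers with random initialization."""

    def __init__(self, checkpoint: bool = False):
        super().__init__()
        config = GPT2Config(n_embd=1024, n_layer=24, n_head=16)
        self.model = GPT2LMHeadModel(config)
        if checkpoint:
            # trade compute for memory with activation checkpointing
            self.model.gradient_checkpointing_enable()

    def forward(self, input_ids, attention_mask):
        # return logits only; the language-modeling loss is computed outside
        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False).logits
```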
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation.md) we mentioned before.
The complete example can be found on [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).