# Configure Parallelization
Author: Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
- [Define Your Configuration](./define_your_config.md)
## Introduction
We support multiple parallelization strategies in Colossal-AI. Hybrid parallelism in our codebase refers to the combination
of data parallelism, pipeline parallelism and tensor parallelism (1D, 2D, 2.5D, 3D).
Each parallelism strategy requires a different network topology and thus initializes a different process group.
You can initialize the corresponding process group by setting `parallel` in the config file.
The configuration for `parallel` must obey the following format. Data parallel size will be
inferred automatically based on your inputs to pipeline parallelism and tensor parallelism.
`colossalai.launch` will initialize these distributed process groups automatically based on your configuration.
Some sample configurations are shown below:
```python
# sample format (pseudo-code, not valid Python)
parallel = dict(
    pipeline=dict("size": int),
    tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)

# this is ok
parallel = dict(
    pipeline=dict(size=2),
    tensor=dict(size=4, mode='2d')
)

# this is ok
parallel = dict(
    pipeline=2,
    tensor=dict(size=4, mode='2d')
)

# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
    pipeline=2,
    tensor=4
)

# this is ok as tensor will default to size 1 and mode None
parallel = dict(
    pipeline=2
)

# this is ok as pipeline will default to size 1
parallel = dict(
    tensor=dict(size=4, mode='2d')
)
```
The key `size` refers to the parallel size of that parallelism dimension. For example, pipeline size 2 means there
will be 2 pipeline stages. The key `mode` in the tensor parallel config specifies which tensor parallelism mode will be
initialized.
**You can choose not to include `parallel` in your configuration, in which case both pipeline and tensor parallel size will default to 1.**
**Total number of GPUs must be equal to `data parallel size * tensor parallel size * pipeline parallel size`**
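For example, suppose the configuration below is launched on 16 GPUs; the data parallel size is then inferred automatically from the other two sizes. This is only an illustration of the sizing rule above, not a new configuration option:
```python
# hypothetical launch on 16 GPUs
parallel = dict(
    pipeline=dict(size=2),           # 2 pipeline stages
    tensor=dict(size=4, mode='2d'),  # 4-way 2D tensor parallelism
)
# data parallel size is inferred automatically:
# world size / (pipeline size * tensor size) = 16 / (2 * 4) = 2
```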
## Data Parallel
Data parallelism is the most common way to distribute your training task: the dataset is split into several shards and each device trains on
a single shard. The configuration for data parallelism is detected and set automatically, so you do not
have to set it explicitly in your configuration. There are two ways to handle the all-reduce in data parallel in Colossal-AI.
1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
2. Otherwise, PyTorch DistributedDataParallel will be used
In most cases, you will be using the second mode unless you have complex handling of the gradients.
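Conceptually, a gradient handler performs the all-reduce that PyTorch DistributedDataParallel would otherwise do for you. The sketch below shows the idea in plain `torch.distributed`; it is only an illustration of the concept, not the actual Colossal-AI gradient handler API:
```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module):
    """Average gradients across the data parallel group after backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad)   # sum gradients from all ranks
            param.grad.div_(world_size)   # average them
```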
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods. Below is the list of papers that introduce each
tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of `P = N^2` devices where
`N` is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
further parallelizes 2D tensor parallelism. A total of `P = N^2 * d` processors are arranged into `d` layers, where
each layer performs matrix multiplication operations independently with a dimension `N`.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
achieves the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage
are evenly distributed through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    tensor=dict(size=8, mode='3d')
)
```
Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed
operator. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
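As a minimal sketch (the full pattern is shown in the 1D and 2D tensor parallelism tutorials), a model built from the distributed layers in `colossalai.nn` could look like the snippet below; these layers are partitioned according to the tensor parallel mode set in your configuration:
```python
import torch
import colossalai.nn as col_nn

class TinyMLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # col_nn.Linear is partitioned according to the configured tensor parallel mode
        self.dense_1 = col_nn.Linear(dim, dim * 4)
        self.dense_2 = col_nn.Linear(dim * 4, dim)

    def forward(self, x):
        return self.dense_2(torch.nn.functional.gelu(self.dense_1(x)))
```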
## Pipeline Parallel
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU
and the second layer to the second GPU.
You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI
will automatically create the pipeline schedule which defines the forward and backward steps.
```python
parallel = dict(
    pipeline=dict(size=4), # number of pipeline stages
)
```
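To make the idea concrete, the toy sketch below places the two linear layers of the example model on two GPUs by hand with plain PyTorch; Colossal-AI's pipeline schedule automates this placement and the passing of activations between stages, so this is only an illustration of what partitioning by layer means:
```python
import torch

# toy illustration of pipeline partitioning, not Colossal-AI code
linear1 = torch.nn.Linear(1024, 1024).to('cuda:0')  # stage 0 lives on GPU 0
linear2 = torch.nn.Linear(1024, 1024).to('cuda:1')  # stage 1 lives on GPU 1

x = torch.randn(16, 1024, device='cuda:0')
h = linear1(x)               # computed on GPU 0
y = linear2(h.to('cuda:1'))  # activation moved to GPU 1, then computed there
```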
## Sequence Parallel
Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
You can specify the mode to be `sequence` to initialize its process group.
```python
parallel = dict(
    tensor=dict(size=4, mode='sequence')
)
```
# Define Your Configuration
Author: Guangyang Lu, Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
In Colossal-AI, a configuration file is required to specify the features the system will inject into the training process.
In this tutorial, we will introduce how to construct your configuration file and how this config file will be used.
Using a configuration file has several advantages:
1. You can store your feature configuration and training hyper-parameters in different configuration files
2. New features released in the future can be specified in the configuration without code changes in the training script
In this tutorial, we will cover how to define your configuration file.
## Configuration Definition
In a configuration file, there are two types of variables. One serves as feature specification and the other serves
as hyper-parameters. All feature-related variables are reserved keywords. For example, if you want to use mixed precision
training, you need to use the variable name `fp16` in the config file and follow a pre-defined format.
### Feature Specification
There is an array of features Colossal-AI provides to speed up training. Each feature is defined by a corresponding field
in the config file. In this tutorial, we are not giving the config details for all the features, but rather we are providing
an illustration of how to specify a feature. **The details of each feature can be found in its respective tutorial.**
To illustrate the use of config file, we use mixed precision training as an example here. In order to do so, you need to
follow the steps below.
1. create a configuration file (e.g. `config.py`, the file name can be anything)
2. define the mixed precision configuration in the config file. For example, in order to use mixed precision training
natively provided by PyTorch, you can just write these lines of code below into your config file.
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
    mode=AMP_TYPE.TORCH
)
```
3. Tell Colossal-AI where your config file is when launching the distributed environment. For example, the config file is in
the current directory.
```python
import colossalai
colossalai.launch(config='./config.py', ...)
```
In this way, Colossal-AI knows what features you want to use and will inject this feature during `colossalai.initialize`.
### Global Hyper-parameters
Besides feature specification, the config file can also serve as a place to define your training hyper-parameters. This
comes in handy when you want to perform multiple experiments; the details of each experiment can be put into a separate config file
to avoid confusion. These parameters will be stored in the global parallel context and can be accessed in the training script.
For example, you can specify the batch size in your config file.
```python
BATCH_SIZE = 32
```
After launch, you are able to access your hyper-parameters through global parallel context.
```python
import colossalai
from colossalai.core import global_context as gpc
colossalai.launch(config='./config.py', ...)
# access your parameter
print(gpc.config.BATCH_SIZE)
```
# Use Engine and Trainer in Training
Author: Shenggui Li, Siqi Mai
**Prerequisite:**
- [Initialize Features](./initialize_features.md)
## Introduction
In this tutorial, you will learn how to use the engine and trainer provided in Colossal-AI to train your model.
Before we delve into the details, we would like to first explain the concept of engine and trainer.
### Engine
Engine is essentially a wrapper class for model, optimizer and loss function.
When we call `colossalai.initialize`, an engine object will be returned, and it has already been equipped with
functionalities such as gradient clipping, gradient accumulation and zero optimizer as specified in your configuration file.
An engine object uses APIs similar to those of the native PyTorch training components, so that only minimal changes
to your code are needed.
Below is a table which shows the commonly used APIs for the engine object.
| Component | Function | PyTorch | Colossal-AI |
| ------------------------------------- | --------------------------------------------- | ------------------------------- | -------------------------------------- |
| optimizer | Set all gradients to zero before an iteration | optimizer.zero_grad() | engine.zero_grad() |
| optimizer | Update the parameters | optimizer.step() | engine.step() |
| model | Run a forward pass | outputs = model(inputs) | outputs = engine(inputs) |
| criterion | Calculate the loss value | loss = criterion(output, label) | loss = engine.criterion(output, label) |
| criterion | Execute back-propagation on the model | loss.backward() | engine.backward(loss) |
The reason why we need such an engine class is that we can add more functionalities while hiding the implementations in
the `colossalai.initialize` function.
Imagine we are going to add a new feature: we can manipulate the model, optimizer, dataloader and loss function in the
`colossalai.initialize` function and only expose an engine object to the user.
The user only needs to modify their code to the minimum extent by adapting the normal PyTorch APIs to the Colossal-AI
engine APIs. In this way, they can enjoy more features for efficient training.
A normal training iteration using engine can be:
```python
import colossalai

# build your model, optimizer, criterion, dataloaders
...

engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
                                                                     optimizer,
                                                                     criterion,
                                                                     train_dataloader,
                                                                     test_dataloader)

for img, label in train_dataloader:
    engine.zero_grad()
    output = engine(img)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```
### Trainer
Trainer is a more high-level wrapper for the user to execute training with fewer lines of code. However, in pursuit of more abstraction, it loses some flexibility compared to engine. The trainer is designed to execute a forward and backward step to perform model weight update. It is easy to create a trainer object by passing the engine object. The trainer has a default value `None` for the argument `schedule`. In most cases, we leave this value to `None` unless we want to use pipeline parallelism. If you wish to explore more about this parameter, you can go to the tutorial on pipeline parallelism.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
# build components and initialize with colossalai.initialize
...
# create a logger so that trainer can log on the console
logger = get_dist_logger()
# create a trainer object
trainer = Trainer(
    engine=engine,
    logger=logger
)
```
In trainer, the user can customize some hooks and attach these hooks to the trainer object. A hook object will execute life-cycle methods periodically based on the training scheme. For example, the `LRSchedulerHook` will execute `lr_scheduler.step()` to update the learning rate of the model during either the `after_train_iter` or `after_train_epoch` stage, depending on whether the user wants to update the learning rate after each training iteration or only after the entire training epoch. You can store the hook objects in a list and pass it to the `trainer.fit` method. The `trainer.fit` method will execute training and testing based on your parameters. If `display_progress` is True, a progress bar will be displayed on your console to show the training progress.
```python
# define the hooks to attach to the trainer
hook_list = [
    hooks.LossHook(),
    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
    hooks.AccuracyHook(accuracy_func=Accuracy()),
    hooks.LogMetricByEpochHook(logger),
]

# start training
trainer.fit(
    train_dataloader=train_dataloader,
    epochs=NUM_EPOCHS,
    test_dataloader=test_dataloader,
    test_interval=1,
    hooks=hook_list,
    display_progress=True
)
```
If you want to customize your own hook class, you can inherit `hooks.BaseHook` and override the life-cycle methods of your interest. A dummy example to demonstrate how to create a simple log message hook is provided below for your reference.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import hooks
class LogMessageHook(hooks.BaseHook):
    def __init__(self, priority=10):
        self._logger = get_dist_logger()

    def before_train(self, trainer):
        self._logger.info('training starts')

    def after_train(self, trainer):
        self._logger.info('training finished')

...

# then in your training script
hook_list.append(LogMessageHook())
```
In the sections below, I will guide you through the steps required to train a ResNet model with both engine and trainer.
## Explain with ResNet
### Overview
In this section we will cover:
1. Use an engine object to train a ResNet34 model on CIFAR10 dataset
2. Use a trainer object to train a ResNet34 model on CIFAR10 dataset
The project structure will be like:
```bash
-- config.py
-- run_resnet_cifar10_with_engine.py
-- run_resnet_cifar10_with_trainer.py
```
Steps 1-4 below are commonly used regardless of using engine or trainer. Thus, steps 1-4 + step 5 will be your `run_resnet_cifar10_with_engine.py` and steps 1-4 + step 6 will form `run_resnet_cifar10_with_trainer.py`.
### Hands-on Practice
#### Step 1. Create a Config File
In your project folder, create a `config.py`. This file is to specify some features you may want to use to train your model. A sample config file is as below:
```python
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 128
NUM_EPOCHS = 200
fp16 = dict(
    mode=AMP_TYPE.TORCH
)
```
In this config file, we specify that we want to use batch size 128 per GPU and run for 200 epochs. These two parameters are exposed by `gpc.config`. For example, you can use `gpc.config.BATCH_SIZE` to access the value you store in your config file. The `fp16` configuration tells `colossalai.initialize` to use mixed precision training provided by PyTorch to train the model with better speed and lower memory consumption.
#### Step 2. Initialize Distributed Environment
We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
[launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and the PyTorch launch utility.
```python
import colossalai
# ./config.py refers to the config file we just created in step 1
colossalai.launch_from_torch(config='./config.py')
```
#### Step 3. Create all the training components
In this step, we can create all the components used for training. These components include:
1. Model
2. Optimizer
3. Criterion/loss function
4. Training/Testing dataloaders
5. Learning rate Scheduler
6. Logger
To build these components, you need to import the following modules:
```python
from pathlib import Path
from colossalai.logging import get_dist_logger
import torch
import os
from colossalai.core import global_context as gpc
from colossalai.utils import get_dataloader
from torchvision import transforms
from colossalai.nn.lr_scheduler import CosineAnnealingLR
from torchvision.datasets import CIFAR10
from torchvision.models import resnet34
```
Then build your components in the same way as you would normally do in your PyTorch scripts. In the script below, we set the root path of the CIFAR10 dataset to `./data`. If you prefer, you can point `root` to an environment variable instead, e.g. change `root='./data'` to `root=Path(os.environ['DATA'])`, so that the dataset location can be configured outside the script.
```python
# build logger
logger = get_dist_logger()

# build resnet
model = resnet34(num_classes=10)

# build datasets
train_dataset = CIFAR10(
    root='./data',
    download=True,
    transform=transforms.Compose(
        [
            transforms.RandomCrop(size=32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]),
        ]
    )
)

test_dataset = CIFAR10(
    root='./data',
    train=False,
    transform=transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]),
        ]
    )
)

# build dataloaders
train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True,
                                  batch_size=gpc.config.BATCH_SIZE,
                                  num_workers=1,
                                  pin_memory=True,
                                  )

test_dataloader = get_dataloader(dataset=test_dataset,
                                 add_sampler=False,
                                 batch_size=gpc.config.BATCH_SIZE,
                                 num_workers=1,
                                 pin_memory=True,
                                 )

# build criterion
criterion = torch.nn.CrossEntropyLoss()

# optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# lr_scheduler
lr_scheduler = CosineAnnealingLR(optimizer, total_steps=gpc.config.NUM_EPOCHS)
```
#### Step 4. Initialize with Colossal-AI
Next, the essential step is to obtain the engine class by calling `colossalai.initialize`. As stated in `config.py`, we will be using mixed precision training for training ResNet34 model. `colossalai.initialize` will automatically check your config file and assign relevant features to your training components. In this way, our engine object has already been able to train with mixed precision, but you do not have to explicitly take care of it.
```python
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
                                                                     optimizer,
                                                                     criterion,
                                                                     train_dataloader,
                                                                     test_dataloader,
                                                                     )
```
#### Step 5. Train with engine
With all the training components ready, we can train ResNet34 just as we would in a normal PyTorch training loop.
```python
for epoch in range(gpc.config.NUM_EPOCHS):
    # execute a training iteration
    engine.train()
    for img, label in train_dataloader:
        img = img.cuda()
        label = label.cuda()

        # set gradients to zero
        engine.zero_grad()

        # run forward pass
        output = engine(img)

        # compute loss value and run backward pass
        train_loss = engine.criterion(output, label)
        engine.backward(train_loss)

        # update parameters
        engine.step()

    # update learning rate
    lr_scheduler.step()

    # execute a testing iteration
    engine.eval()
    correct = 0
    total = 0
    for img, label in test_dataloader:
        img = img.cuda()
        label = label.cuda()

        # run prediction without back-propagation
        with torch.no_grad():
            output = engine(img)
            test_loss = engine.criterion(output, label)

        # compute the number of correct prediction
        pred = torch.argmax(output, dim=-1)
        correct += torch.sum(pred == label)
        total += img.size(0)

    logger.info(
        f"Epoch {epoch} - train loss: {train_loss:.5}, test loss: {test_loss:.5}, acc: {correct / total:.5}, lr: {lr_scheduler.get_last_lr()[0]:.5g}", ranks=[0])
```
#### Step 6. Train with trainer
If you wish to train with a trainer object, you can follow the code snippet below:
```python
from colossalai.nn.metric import Accuracy
from colossalai.trainer import Trainer, hooks

# create a trainer object
trainer = Trainer(
    engine=engine,
    logger=logger
)

# define the hooks to attach to the trainer
hook_list = [
    hooks.LossHook(),
    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
    hooks.AccuracyHook(accuracy_func=Accuracy()),
    hooks.LogMetricByEpochHook(logger),
    hooks.LogMemoryByEpochHook(logger)
]

# start training
# run testing every 1 epoch
trainer.fit(
    train_dataloader=train_dataloader,
    epochs=gpc.config.NUM_EPOCHS,
    test_dataloader=test_dataloader,
    test_interval=1,
    hooks=hook_list,
    display_progress=True
)
```
#### Step 7. Start Distributed Training
Lastly, we can invoke the scripts using the distributed launcher provided by PyTorch as we used `launch_from_torch` in Step 2. You need to replace `<num_gpus>` with the number of GPUs available on your machine. This number can be 1 if you only want to use 1 GPU. If you wish to use other launchers, you can refer to the tutorial on How to Launch Colossal-AI.
```bash
# with engine
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
# with trainer
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
```
# Initialize Features
Author: Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
In this tutorial, we will cover the use of `colossalai.initialize` which injects features into your training components
(e.g. model, optimizer, dataloader) seamlessly. Calling `colossalai.initialize` is the standard procedure before you enter
your training loop.
In the section below, I will cover how `colossalai.initialize` works and what we should take note of.
## Usage
In a typical workflow, we launch the distributed environment at the beginning of our training script.
Afterwards, we instantiate our objects such as the model, optimizer, loss function, dataloader, etc. At this point, `colossalai.initialize`
can come in to inject features into these objects. A pseudo-code example is given below:
```python
import colossalai
import torch
...
# launch distributed environment
colossalai.launch(config='./config.py', ...)
# create your objects
model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = MyTrainDataloader()
test_dataloader = MyTestDataloader()
# initialize features
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
                                                                     optimizer,
                                                                     criterion,
                                                                     train_dataloader,
                                                                     test_dataloader)
```
The `colossalai.initialize` function will return an `Engine` object. The engine object is a wrapper
for model, optimizer and loss function. **The engine object will run with features specified in the config file.**
More details about the engine can be found in the [Use Engine and Trainer in Training](./engine_trainer.md).
# Launch Colossal-AI
Author: Chuanrui Wang, Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
As mentioned in the prerequisite tutorials, you need to initialize the distributed environment
for Colossal-AI after your config file is prepared.
We call this process `launch`.
In this tutorial, you will learn how to launch Colossal-AI on your server, be it a small or a large one.
In Colossal-AI, we provide several launch methods to initialize the distributed backend.
In most cases, you can use `colossalai.launch` and `colossalai.get_default_parser` to pass the
parameters via the command line.
If you happen to use launchers such as SLURM, OpenMPI or the PyTorch launch utility,
we also provide several launching helper methods that read the rank and world size directly from the environment variables
set by these launchers for your convenience.
In this tutorial we will cover how to launch Colossal-AI to initialize the distributed backends:
- Launch with `colossalai.launch`
- Launch with Colossal-AI CLI
- Launch with SLURM
- Launch with OpenMPI
## Launch Distributed Environment
In order to launch Colossal-AI, we need two types of arguments:
1. config file
2. distributed settings
The config file is always required regardless of the launch method but distributed settings can vary. The config file
can be a path to the configuration file or a Python dictionary. The distributed settings can be passed via command line
or multi-process launchers.
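For example, the two calls below are equivalent ways of supplying the configuration; `rank`, `world_size`, `host` and `port` stand for the usual distributed settings explained in the next subsection:
```python
import colossalai
from colossalai.amp import AMP_TYPE

# pass a path to a config file
colossalai.launch(config='./config.py',
                  rank=rank, world_size=world_size, host=host, port=port)

# or pass a Python dictionary directly
colossalai.launch(config=dict(fp16=dict(mode=AMP_TYPE.TORCH)),
                  rank=rank, world_size=world_size, host=host, port=port)
```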
### Command Line Parser
Before we jump to `launch`, we first need to understand what parameters we need for initialization.
As stated in the `Basic Concepts in Distributed Training` section of [Distributed Training](../concepts/distributed_training.md),
the important parameters are:
1. host
2. port
3. rank
4. world_size
5. backend
In Colossal-AI, we provide a command line parser which already includes these arguments. You can get this parser by calling
`colossalai.get_default_parser()`. This parser is usually used together with `colossalai.launch`.
```python
# add these lines in your train.py
import colossalai
# get default parser
parser = colossalai.get_default_parser()
# if you want to add your own arguments
parser.add_argument(...)
# parse arguments
args = parser.parse_args()
```
Then in your terminal, you can pass in these arguments:
```shell
python train.py --host <host> --rank <rank> --world_size <world_size> --port <port> --backend <backend>
```
`backend` is optional and the default value is `nccl`.
### Native Launch
To initialize the distributed environment, we provide a general `colossalai.launch` API. The `colossalai.launch` function takes in the parameters
listed above and creates a default process group in the communication network. This function is often used together with the default
parser for convenience.
```python
import colossalai
# parse arguments
args = colossalai.get_default_parser().parse_args()
# launch distributed environment
colossalai.launch(config=<CONFIG>,
                  rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port,
                  backend=args.backend)
```
### Launch with Colossal-AI CLI
To enable easy launching on both single and multiple nodes, we have implemented a launcher for Colossal-AI. This launcher is
a wrapper around the torch distributed launch utility, enhanced with the capability of launching multi-node jobs easily.
First, we need to set the launch method in our code. As this is a wrapper of the torch distributed launch utility, we will
use `colossalai.launch_from_torch`. The arguments required for distributed environment such as rank, world size, host and port are all set by the PyTorch
launcher and can be read from the environment variable directly.
```python
import colossalai
colossalai.launch_from_torch(
    config=<CONFIG>,
)
```
Next, we can easily start multiple processes with `colossalai run` in your terminal. Below is an example to run the code
on a single node with 4 GPUs. You can change the number of GPUs by `nproc_per_node` and the default port by `master_port`.
```shell
# run on the local node with 4 GPUs (default port: 29500)
colossalai run --nproc_per_node 4 train.py
# run on the local node with 4 GPUs with a different port
colossalai run --nproc_per_node 4 --master_port 29505 test.py
```
If you are in a cluster and want to launch multi-node training, the CLI can help you start processes on different nodes
with one simple command. There are two ways you can launch multi-node jobs.
- Run with `--hosts`
This is suitable when you only have a few nodes. Let's say we have two nodes, namely `host1` and `host2`; we can start
multi-node training with the following command. Compared to single-node training, you must specify the `master_addr`
option, which is automatically set to localhost when running on a single node only.
:::caution
`master_addr` cannot be localhost when running on multiple nodes, it should be the hostname or IP address of a node.
:::
```shell
# run on these two nodes
colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
```
- Run with `--hostfile`
This method is suitable when you have a lot of nodes. The host file is a simple text file listing the available nodes.
The list of nodes is commonly provided by cluster managers such as SLURM and PBS Pro. For example, you can get the list
of nodes allocated to you via the environment variable `SLURM_NODELIST` in SLURM and `PBS_NODEFILE` in PBS Pro.
Just do `echo $SLURM_NODELIST` or `cat $PBS_NODEFILE` to check it out. If you do not have such cluster managers, you can
manually create one for your own use.
The host file given to Colossal-AI launcher must be in the following format where each line is the host name of a node.
```text
host1
host2
```
With the host file ready, we can launch multi-node jobs with the following commands. Just like using `--host`, you also
need to specify the `master_addr` option. Some extra options are provided for `--hostfile` as listed below:
- `--include`: specify the hosts to include for multi-node jobs. For example, if your host file has 8 nodes, but you
happen to only want to run on 6 nodes instead, you can add `--include host1,host2,host3,...,host6` so that the job will only
be launched on the 6 nodes.
- `--exclude`: specify the hosts to exclude for multi-node jobs. This is useful when some nodes are faulty. For example,
if host1 GPU has some problems and you do not wish to run on host1 but all other nodes, you can add `--exclude host1` so that
the job will only be launched on the remaining nodes.
```shell
# run with a hostfile
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 test.py
# only include certain hosts to execute commands
# this is used to manually select nodes to run
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --include host1 test.py
# exclude certain hosts to execute commands
# this can be used when certain nodes are faulty
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --exclude host2 test.py
```
### Launch with SLURM
If you are on a system managed by the SLURM scheduler, you can also rely on the `srun` launcher to kickstart your Colossal-AI scripts.
We provided the helper function `launch_from_slurm` for compatibility with the SLURM scheduler.
`launch_from_slurm` will automatically read the rank and world size from the environment variables `SLURM_PROCID` and `SLURM_NPROCS` respectively
and use them to start the distributed backend.
Do this in your training script:
```python
import colossalai
colossalai.launch_from_slurm(
    config=<CONFIG>,
    host=args.host,
    port=args.port
)
```
You can initialize the distributed environment by using this command in terminal.
```bash
srun python train.py --host <master_node> --port 29500
```
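If you prefer to submit the job through a batch script rather than an interactive `srun`, a minimal sketch could look like the one below; the job name, partition, node count and GPU count are placeholders that you should adapt to your own cluster:
```bash
#!/bin/bash
#SBATCH --job-name=colossalai-train
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4      # one task per GPU
#SBATCH --gres=gpu:4

# <master_node> is the hostname of the first allocated node
srun python train.py --host <master_node> --port 29500
```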
### Launch with OpenMPI
If you are more familiar with OpenMPI, you can use `launch_from_openmpi` instead.
`launch_from_openmpi` will automatically read the local rank, global rank and world size from the environment variables
`OMPI_COMM_WORLD_LOCAL_RANK`, `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` respectively and
use them to start the distributed backend.
Do this in your train.py:
```python
colossalai.launch_from_openmpi(
    config=<CONFIG>,
    host=args.host,
    port=args.port
)
```
A sample command to launch multiple processes with OpenMPI would be:
```bash
mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node name or ip> --port 29500
```
- `--hostfile`: use this option to specify a list of hosts on which to run
- `-np`: set the number of processes (GPUs) to launch in total. For example, if `-np 4` is given, 4 Python processes will be initialized to run train.py.
# Model Checkpoint
Author: Guangyang Lu
**Prerequisite:**
- [Launch Colossal-AI](./launch_colossalai.md)
- [Initialize Colossal-AI](./initialize_features.md)
**Example Code:**
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
**This function is experimental.**
## Introduction
In this tutorial, you will learn how to save and load model checkpoints.
To leverage the power of parallel strategies in Colossal-AI, modifications to models and tensors are needed, so you cannot directly use `torch.save` or `torch.load` to save or load model checkpoints. Therefore, we have provided an API to achieve the same thing.
Moreover, when loading, you are not required to use the same parallel strategy as the one used for saving.
## How to use
### Save
There are two ways to train a model in Colossal-AI, by engine or by trainer.
**Be aware that we only save the `state_dict`.** Therefore, when loading the checkpoints, you need to define the model first.
#### Save when using engine
```python
from colossalai.utils import save_checkpoint

model = ...
engine, _, _, _ = colossalai.initialize(model=model, ...)

for epoch in range(num_epochs):
    ...  # do some training
    save_checkpoint('xxx.pt', epoch, model)
```
#### Save when using trainer
```python
from colossalai.trainer import Trainer, hooks

model = ...
engine, _, _, _ = colossalai.initialize(model=model, ...)
trainer = Trainer(engine, ...)

hook_list = [
    hooks.SaveCheckpointHook(1, 'xxx.pt', model),
    ...
]

trainer.fit(...,
            hooks=hook_list)
```
### Load
```python
from colossalai.utils import load_checkpoint
model = ...
load_checkpoint('xxx.pt', model)
... # train or test
```
# Colossal-AI Overview
Author: Shenggui Li, Siqi Mai
## About Colossal-AI
As deep learning models grow in size, it becomes important to shift to a new training paradigm. The traditional training method with no parallelism and optimization is becoming a thing of the past, and new training methods are the key to making the training of large-scale models efficient and cost-effective.
Colossal-AI is designed to be a unified system that provides an integrated set of training skills and utilities to the user. You can find common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism strategies including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithms. We also provide different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading are also described in detail in this tutorial documentation.
## General Usage
We aim to make Colossal-AI easy to use and non-intrusive to user code. There is a simple general workflow if you want to use Colossal-AI.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/ZK7ICWzbMsVuJof.png"/>
<figcaption>Workflow</figcaption>
</figure>
1. Prepare a configuration file that specifies the features you want to use and your parameters.
2. Initialize distributed backend with `colossalai.launch`
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.initialize`.
4. Run training and testing
We will cover the whole workflow in the `basic tutorials` section.
## Future Development
The Colossal-AI system will be expanded to include more training skills, these new developments may include but are not limited to:
1. optimization of distributed operations
2. optimization of training on heterogeneous systems
3. implementation of training utilities to reduce model size and speed up training while preserving model performance
4. expansion of existing parallelism methods
We welcome ideas and contributions from the community, and you can post your ideas for future development in our forum.
# Distributed Training
Author: Shenggui Li, Siqi Mai
## What is a distributed system?
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sE5daHf2ohIy9wX.png"/>
<figcaption>Image source: <a href="https://towardsdatascience.com/distributed-training-in-the-cloud-cloud-machine-learning-engine-9e264ddde27f">Towards Data Science</a></figcaption>
</figure>
A distributed system consists of multiple software components which run on multiple machines. For example, a traditional
database runs on a single machine. As the amount of data grows incredibly large, a single machine can no longer deliver desirable
performance to the business, especially in situations such as Black Friday where network traffic can be unexpectedly high.
To handle such pressure, modern high-performance databases are designed to run on multiple machines, which work together to provide
high throughput and low latency to the user.
One important evaluation metric for a distributed system is scalability. For example, when we run an application on 4 machines,
we naturally expect it to run 4 times faster. However, due to communication overhead and differences in
hardware performance, it is difficult to achieve linear speedup. Thus, it is important to consider how to make the application
faster when we implement it. Well-designed algorithms and system optimization can help to deliver good performance. Sometimes,
it is even possible to achieve linear or super-linear speedup.
## Why do we need distributed training for machine learning?
Back in 2012, [AlexNet](https://arxiv.org/abs/1404.5997) won the ImageNet competition, and it was trained
on two GTX 580 3GB GPUs.
Today, most models that appear in the top AI conferences are trained on multiple GPUs. Distributed training is definitely
a common practice when researchers and engineers develop AI models. There are several reasons behind this trend.
1. Model size increases rapidly. [ResNet50](https://arxiv.org/abs/1512.03385) had 20 million parameters in 2015,
[BERT-Large](https://arxiv.org/abs/1810.04805) had 345 million parameters in 2018,
[GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
had 1.5 billion parameters in 2018, and [GPT-3](https://arxiv.org/abs/2005.14165) has 175 billion parameters in 2020.
It is obvious that the model size grows exponentially with time. The largest models now exceed one trillion
parameters. Super large models generally deliver superior performance compared to their smaller counterparts.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sCyreJ9PF1EdZYf.jpg"/>
<figcaption>Image source: <a href="https://huggingface.co/blog/large-language-models">HuggingFace</a></figcaption>
</figure>
2. Dataset size increases rapidly. For most machine learning developers, MNIST and CIFAR10 are often the first few
datasets on which they train their models. However, these datasets are very small compared to well-known datasets such as ImageNet.
Google even has its own (unpublished) JFT-300M dataset with around 300 million images, which is close to 300 times
larger than the ImageNet-1k dataset.
3. Computing power gets stronger. With the advancement of the semiconductor industry, graphics cards become more and more
powerful. Due to its large number of cores, the GPU is the most common compute platform for deep learning.
From the K10 GPU in 2012 to the A100 GPU in 2020, computing power has increased several hundred times. This allows us to perform
compute-intensive tasks faster, and deep learning is exactly such a task.
Nowadays, a model can be too large to fit into a single GPU, and a dataset can be large enough to take a hundred
days to train on a single GPU. Only by training our models on multiple GPUs with different parallelization techniques are we able
to speed up the training process and obtain results in a reasonable amount of time.
## Basic Concepts in Distributed Training
Distributed training requires multiple machines/GPUs. During training, there will be communication among these devices.
To understand distributed training better, there are several important terms to be made clear.
- host: the host is the main device in the communication network. It is often required as an argument when initializing the
distributed environment.
- port: the port here mainly refers to the master port on the host used for communication.
- rank: the unique ID given to a device in the network.
- world size: the number of devices in the network.
- process group: a process group is a communication network which includes a subset of the devices. There is always a default
process group which contains all the devices. A subset of devices can form a process group so that they only communicate among
the devices within the group.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qnNBKh8AjzgM5sY.png"/>
<figcaption>A distributed system example</figcaption>
</figure>
To illustrate these concepts, let's assume we have 2 machines (also called nodes), and each machine has 4 GPUs. When we
initialize distributed environment over these two machines, we essentially launch 8 processes (4 processes on each machine)
and each process is bound to a GPU.
Before initializing the distributed environment, we need to specify the host (master address) and port (master port). In
this example, we can let host be node 0 and port be a number such as 29500. All the 8 processes will then look for the
address and port and connect to one another.
The default process group will then be created. The default process group has a world size of 8 and details are as follows:
| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 2 |
| 3 | 3 | 0 | 3 |
| 4 | 4 | 1 | 0 |
| 5 | 5 | 1 | 1 |
| 6 | 6 | 1 | 2 |
| 7 | 7 | 1 | 3 |
We can also create a new process group. This new process group can contain any subset of the processes.
For example, we can create one containing only even-number processes, and the details of this new group will be:
| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 2 |
| 4 | 2 | 1 | 0 |
| 6 | 3 | 1 | 2 |
**Please note that rank is relative to the process group and one process can have a different rank in different process
groups. The max rank is always `world size of the process group - 1`.**
In the process group, the processes can communicate in two ways:
1. peer-to-peer: one process sends data to another process
2. collective: a group of processes performs operations such as scatter, gather, all-reduce and broadcast together.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/zTmlxgc3oeAdn97.png"/>
<figcaption>Collective communication, source: <a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html">PyTorch distributed tutorial</a></figcaption>
</figure>
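As a concrete illustration in plain PyTorch (not Colossal-AI specific code), the sketch below creates the even-rank process group from the table above and performs an all-reduce only among its members; this is the same mechanism Colossal-AI relies on when it initializes its own process groups:
```python
import torch
import torch.distributed as dist

# assume the default process group of world size 8 is already initialized

# every process must take part in creating the group, even those not in it
even_group = dist.new_group(ranks=[0, 2, 4, 6])

tensor = torch.ones(1).cuda()
if dist.get_rank() % 2 == 0:
    # only the even-rank processes participate in this collective
    dist.all_reduce(tensor, group=even_group)  # tensor becomes 4 on these ranks
```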
# Paradigms of Parallelism
Author: Shenggui Li, Siqi Mai
## Introduction
With the development of deep learning, there is an increasing demand for parallel training. This is because models
and datasets are getting larger and larger, and training time becomes a nightmare if we stick to single-GPU training. In
this section, we will provide a brief overview of existing methods to parallelize training. If you wish to add on to this
post, you may create a discussion in the [GitHub forum](https://github.com/hpcaitech/ColossalAI/discussions).
## Data Parallel
Data parallelism is the most common form of parallelism due to its simplicity. In data parallel training, the dataset is split
into several shards, and each shard is allocated to a device. This is equivalent to parallelizing the training process along the
batch dimension. Each device holds a full copy of the model replica and trains on the dataset shard allocated to it. After
back-propagation, the gradients of the model are all-reduced so that the model parameters on different devices stay
synchronized.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/WSAensMqjwHdOlR.png"/>
<figcaption>Data parallel illustration</figcaption>
</figure>
## Model Parallel
In data parallel training, one prominent feature is that each GPU holds a copy of the whole model weights. This brings
a redundancy issue. Another paradigm of parallelism is model parallelism, where the model is split and distributed over an array
of devices. There are generally two types of model parallelism: tensor parallelism and pipeline parallelism. Tensor parallelism
parallelizes computation within an operation such as matrix-matrix multiplication, while pipeline parallelism parallelizes
computation between layers. Thus, from another point of view, tensor parallelism can be seen as intra-layer parallelism and
pipeline parallelism can be seen as inter-layer parallelism.
### Tensor Parallel
Tensor parallel training splits a tensor into `N` chunks along a specific dimension, so that each device only holds `1/N`
of the whole tensor while the correctness of the computation graph is preserved. This requires additional communication
to make sure that the result is correct.
Taking a general matrix multiplication as an example, let's say we have C = AB. We can split B along the column dimension
into `[B0 B1 B2 ... Bn]` so that each device holds a column. We then multiply `A` with each column of `B` on each device, and
we get `[AB0 AB1 AB2 ... ABn]`. At this moment, each device still holds a partial result, e.g. device rank 0 holds `AB0`.
To make sure the result is correct, we need to all-gather the partial results and concatenate the tensor along the column
dimension. In this way, we are able to distribute the tensor over devices while making sure the computation flow remains
correct.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/2ZwyPDvXANW4tMG.png"/>
<figcaption>Tensor parallel illustration</figcaption>
</figure>
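This idea can be checked with a small single-process sketch in plain PyTorch, where tensor chunks stand in for the shards held by different devices and the concatenation stands in for the all-gather:
```python
import torch

A = torch.randn(4, 8)
B = torch.randn(8, 6)

# split B into column chunks, one per "device"
B_chunks = torch.chunk(B, chunks=2, dim=1)  # [B0, B1]
partials = [A @ Bi for Bi in B_chunks]      # each device computes A @ Bi
C = torch.cat(partials, dim=1)              # all-gather along the column dimension

assert torch.allclose(C, A @ B, atol=1e-6)
```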
In Colossal-AI, we provide an array of tensor parallelism methods, namely 1D, 2D, 2.5D and 3D tensor parallelism. We will
talk about them in detail in `advanced tutorials`.
Related paper:
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
- [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
### Pipeline Parallel
Pipeline parallelism is generally easy to understand. If you recall your computer architecture course, this indeed exists
in the CPU design.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/at3eDv7kKBusxbd.png"/>
<figcaption>Pipeline parallel illustration</figcaption>
</figure>
The core idea of pipeline parallelism is that the model is split by layer into several chunks, and each chunk is
given to a device. During the forward pass, each device passes the intermediate activation to the next stage. During the backward pass,
each device passes the gradient of the input tensor back to the previous pipeline stage. This allows devices to compute simultaneously,
and increases the training throughput. One drawback of pipeline parallel training is that there will be some bubble time where
some devices sit idle waiting for others, leading to a waste of computational resources.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sDNq51PS3Gxbw7F.png"/>
<figcaption>Source: <a href="https://arxiv.org/abs/1811.06965">GPipe</a></figcaption>
</figure>
Related paper:
- [PipeDream: Fast and Efficient Pipeline Parallel DNN Training](https://arxiv.org/abs/1806.03377)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines](https://arxiv.org/abs/2107.06925)
## Optimizer-Level Parallel
Another paradigm works at the optimizer level, and the current most famous method of this paradigm is ZeRO which stands
for [zero redundancy optimizer](https://arxiv.org/abs/1910.02054). ZeRO works at three levels to remove memory redundancy
(fp16 training is required for ZeRO):
- Level 1: The optimizer states are partitioned across the processes
- Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
only stores the gradients corresponding to its partition of the optimizer states.
- Level 3: The 16-bit model parameters are partitioned across the processes
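To get a rough sense of the savings, the back-of-the-envelope sketch below follows the memory accounting used in the ZeRO paper (2 bytes per fp16 parameter, 2 bytes per fp16 gradient, and about 12 bytes of fp32 optimizer state per parameter for Adam); the exact figures depend on the optimizer and implementation, so treat this only as an approximation:
```python
def zero_memory_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB, 1e9 bytes) for model states under ZeRO."""
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 params + Adam momentum + variance
    if stage == 1:
        total = params + grads + optim / num_gpus
    elif stage == 2:
        total = params + (grads + optim) / num_gpus
    else:  # stage 3
        total = (params + grads + optim) / num_gpus
    return total / 1e9

# e.g. a 7.5B-parameter model on 64 GPUs, the example configuration from the ZeRO paper
for stage in (1, 2, 3):
    print(f"stage {stage}: {zero_memory_per_gpu(7.5e9, 64, stage):.1f} GB per GPU")
```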
Related paper:
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
## Parallelism on Heterogeneous System
The methods mentioned above generally require a large number of GPUs to train a large model. However, it is often neglected
that the CPU has a much larger memory than the GPU. On a typical server, the CPU can easily have several hundred GB of RAM, while each GPU
typically only has 16 or 32 GB of memory. This prompts the community to ask why CPU memory is not utilized for distributed training.
Recent advances rely on CPU and even NVMe disk to train large models. The main idea is to offload tensors back to CPU memory
or NVMe disk when they are not used. By using the heterogeneous system architecture, it is possible to accommodate a huge
model on a single machine.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qLHD5lk97hXQdbv.png"/>
<figcaption>Heterogeneous system illustration</figcaption>
</figure>
Related paper:
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
# 1D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
**Example Code**
- [ColossalAI-Examples 1D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_1d.py)
**Related Paper**
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf)
## Introduction
Tensor parallelism partitions model weights across multiple devices in order to reduce memory load.
An efficient 1D tensor parallelism implementation was introduced by [Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf).
Let's take a linear layer as an example, which consists of a GEMM $Y = XA$. Given 2 processors, we split the columns of $A$ into $[A_1 ~ A_2]$, and calculate $Y_i = XA_i$ on each processor, which then forms $[Y_1 ~ Y_2] = [XA_1 ~ XA_2]$. This is called a column-parallel fashion.
When a second linear layer $Z=YB$ follows the column-parallel one, we split $B$ into $\left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$,
which is called a row-parallel fashion.
To calculate $Z = [Y_1 ~ Y_2] \left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$, we first calculate $Y_iB_i$ on each processor, then use an all-reduce to aggregate the results as $Z=Y_1B_1+Y_2B_2$.
We also need to note that in the backward pass, the column-parallel linear layer needs to aggregate the gradients of the input tensor $X$, because on each processor $i$ we only have $\dot{X_i}=\dot{Y_i}A_i^T$.
Thus, we apply an all-reduce across the processors to get $\dot{X}=\dot{Y}A^T=\dot{Y_1}A_1^T+\dot{Y_2}A_2^T$.
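The column-parallel/row-parallel composition can be checked with a quick single-process sketch in plain PyTorch; here the per-processor results are simulated with tensor chunks, and the final all-reduce simply becomes a sum:
```python
import torch

X = torch.randn(4, 8)
A = torch.randn(8, 6)
B = torch.randn(6, 8)

# column-parallel layer: A = [A1 A2], each "processor" computes Y_i = X A_i
A1, A2 = torch.chunk(A, 2, dim=1)
Y1, Y2 = X @ A1, X @ A2

# row-parallel layer: B is split into [B1; B2] to match the column split of A
B1, B2 = torch.chunk(B, 2, dim=0)

# each processor computes Y_i B_i; the all-reduce corresponds to the sum below
Z = Y1 @ B1 + Y2 @ B2
assert torch.allclose(Z, X @ A @ B, atol=1e-5)
```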
## Efficiency
Given $P$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 1D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/P)$ | $O(1/P)$ | $O(1)$ | $O(2(P-1)/P)$ | $O(2(P-1))$ |
## Usage
To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
))
```
Then Colossal-AI will automatically apply 1D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.transpose(0, 1).shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.transpose(0, 1).shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 2 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([256, 512])
Weight of the second linear layer: torch.Size([512, 256])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the column-parallel partitioning, it becomes `[256, 512]`.
Similarly, the second row-parallel layer partitions the weight `[1024, 256]` into `[512, 256]`.
We can run the model with some random inputs.
```python
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
torch.distributed.broadcast(x, src=0) # synchronize input
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Output of the first linear layer: torch.Size([16, 512])
Output of the second linear layer: torch.Size([16, 256])
```
The output of the first linear layer is split into 2 partitions (each has the shape `[16, 512]`), while the second layer has identical outputs across the GPUs.
# 2D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
**Example Code**
- [ColossalAI-Examples - 2D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2d.py)
**Related Paper**
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/pdf/2104.05343.pdf)
## Introduction
1D tensor parallelism does not partition activations, which can also consume a great amount of memory for large-scale models.
To evenly distribute the computation and memory load, [an efficient 2D tensor parallelism algorithm](https://arxiv.org/pdf/2104.05343.pdf) was introduced based on SUMMA (Scalable Universal Matrix Multiplication Algorithm).
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q\times q$ processors (necessary condition), e.g. $q=2$, we split both the input $X$ and weight $A$ into
$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right]
\text{~and~}
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$
The calculation includes $q$ steps. When $t=1$, $X_{i0}$ is broadcasted in its row, and $A_{0j}$ is broadcasted in its column. So, we have
$$
\left[\begin{matrix} X_{10},A_{00} & X_{10},A_{01} \\ X_{00},A_{00} & X_{00},A_{01} \end{matrix} \right].
$$
Then we multiply $X_{i0}$ and $A_{0j}$ on each processor $(i, j)$ as
$$
\left[\begin{matrix} X_{10}A_{00} & X_{10}A_{01} \\ X_{00}A_{00} & X_{00}A_{01} \end{matrix} \right] (1).
$$
Similarly, when $t=2$, $X_{i1}$ is broadcasted in its row, $A_{1j}$ is broadcasted in its column, and we multiply them as
$$
\left[\begin{matrix} X_{11}A_{10} & X_{11}A_{11} \\ X_{01}A_{10} & X_{01}A_{11} \end{matrix} \right] (2).
$$
By adding $(1)$ and $(2)$ up, we have
$$
Y = XA = \left[\begin{matrix} X_{10}A_{00}+X_{11}A_{10} & X_{10}A_{01}+X_{11}A_{11} \\ X_{00}A_{00}+X_{01}A_{10} & X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right].
$$
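To make the two SUMMA steps above concrete, the following single-process PyTorch sketch multiplies the blocks in exactly this order and checks the result against a plain matrix product. It only simulates the algorithm on one device; `X_blk[i][j]` plays the role of $X_{ij}$ with the usual row/column indexing rather than the coordinate-style indexing used in the matrices above.
```python
import torch

q, dim = 2, 4                        # 2 x 2 processor grid, toy dimension
X = torch.randn(dim, dim)
A = torch.randn(dim, dim)

# split X and A into q x q blocks
X_blk = [list(band.chunk(q, dim=1)) for band in X.chunk(q, dim=0)]
A_blk = [list(band.chunk(q, dim=1)) for band in A.chunk(q, dim=0)]

# SUMMA: at step t, processor (i, j) receives X_{it} (broadcast in its row)
# and A_{tj} (broadcast in its column) and accumulates X_{it} @ A_{tj}
Y_blk = [[torch.zeros(dim // q, dim // q) for _ in range(q)] for _ in range(q)]
for t in range(q):
    for i in range(q):
        for j in range(q):
            Y_blk[i][j] += X_blk[i][t] @ A_blk[t][j]

Y = torch.cat([torch.cat(row, dim=1) for row in Y_blk], dim=0)
assert torch.allclose(Y, X @ A, atol=1e-5)
```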
## Efficiency
Given $P=q\times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^2)$ | $O(1/q^2)$ | $O(1/q^2)$ | $O(6(q-1)/q)$ | $O(6(q-1))$ |
## Usage
To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=4, mode='2d'),
))
```
Then Colossal-AI will automatically apply 2D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 4 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([8, 128])
Output of the first linear layer: torch.Size([8, 512])
Output of the second linear layer: torch.Size([8, 128])
```
The activation tensors in 2D parallelism are all split in both row and column.
E.g. the output of the first linear layer has the shape `[8, 512]`, while the second layer has the output of `[8, 128]`.
# 2.5D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
**Example Code**
- [ColossalAI-Examples - 2.5D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2p5d.py)
**Related Paper**
- [2.5-dimensional distributed model training](https://arxiv.org/pdf/2105.14500.pdf)
## Introduction
Compared with 1D tensor parallelism, 2D parallelism reduces the memory cost, but may introduce more communication.
Therefore, a [2.5D tensor parallelism algorithm](https://arxiv.org/pdf/2105.14500.pdf) was proposed based on 2.5D SUMMA to reduce communication by using more devices.
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times d$ processors (necessary condition), e.g. $q=d=2$, we split the input $X$ into $d\times q$ rows and $q$ columns as
$$
\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \\ X_{10} & X_{11} \\ X_{00} & X_{01}\end{matrix} \right],
$$
which can be reshaped into $d$ layers as
$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right] \text{~and~}\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \end{matrix} \right].
$$
Also, the weight $A$ is split into
$$
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$
For each layer of $X$, we use the SUMMA algorithm to multiply $X$ and $A$.
Then, we have the output
$$
\left[\begin{matrix} Y_{10}=X_{10}A_{00}+X_{11}A_{10} & Y_{11}=X_{10}A_{01}+X_{11}A_{11} \\ Y_{00}=X_{00}A_{00}+X_{01}A_{10} & Y_{01}=X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right]
\text{~and~}
$$
$$
\left[\begin{matrix} Y_{30}=X_{30}A_{00}+X_{31}A_{10} & Y_{31}=X_{30}A_{01}+X_{31}A_{11} \\ Y_{20}=X_{20}A_{00}+X_{21}A_{10} & Y_{21}=X_{20}A_{01}+X_{21}A_{11} \end{matrix} \right].
$$
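To see why the weight partition stays the same as in 2D while only the input gains an extra depth split, the following single-process sketch (a simplification, not the distributed implementation) cuts $X$ into $d$ layers, multiplies each layer by the same weight $A$, and recovers the full product by stacking the results.
```python
import torch

d, rows, dim = 2, 8, 4               # depth d = 2; X has d*q block-rows in 2.5D
X = torch.randn(rows, dim)
A = torch.randn(dim, dim)

# the input is cut into d layers along the row dimension; the weight A is
# replicated across the depth, so every layer runs the same 2D SUMMA against A
layers = X.chunk(d, dim=0)
Y = torch.cat([layer @ A for layer in layers], dim=0)

assert torch.allclose(Y, X @ A, atol=1e-5)
```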
## Efficiency
Given $P=q \times q \times d$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2.5D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/dq^2)$ | $O(1/q^2)$ | $O(1/dq^2)$ | $\small O(3(q-1)(d+1)/dq)$ | $O(6(q-1))$ |
## Usage
To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=8, mode='2.5d', depth=2),
))
```
Then Colossal-AI will automatically apply 2.5D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 8 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2.5D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 2.5D parallelism are all split by $d \times q$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the second layer has the output of `[4, 128]`.
Note that 2.5D parallelism uses the same partition method as 2D parallelism for weights; the difference lies in the partition of the input.
# 3D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
**Example Code**
- [ColossalAI-Examples - 3D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_3d.py)
**Related Paper**
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/pdf/2105.14450.pdf)
## Introduction
The [3D tensor parallelism](https://arxiv.org/pdf/2105.14450.pdf) is an approach to parallelize the computation of neural models, hoping to obtain the optimal communication cost.
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times q$ processors (necessary condition), e.g. $q=2$, we split the input $X$ and weight $A$ into
$$
\left[\begin{matrix}
X_{000} & X_{001} \\
X_{010} & X_{011} \\
X_{100} & X_{101} \\
X_{110} & X_{111} \end{matrix}
\right]
\text{~and~}
\left[\begin{matrix}
A_{000} & A_{001} & A_{010} & A_{011} \\
A_{100} & A_{101} & A_{110} & A_{111} \end{matrix}
\right]
\text{~respectively,}$$
where each $X_{ijl}$ and $A_{lji}$ are stored at processor $(i,j,l)$, as shown in the figure below.
<center>
<img src="https://s2.loli.net/2022/02/17/JevO6SED5z4PFdp.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/qvtwjdfNXMAb4nF.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/WFzm2N4IwKf1jXZ.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/r2dZQ4hKxwTuIv6.png" width = "200" height = "250" />
</center>
Then we all-gather $X_{ijl}$ across $(i, 0...q,l)$, as well as $A_{lji}$ across $(0...q, j, l)$.
So, we have $X_{il}$ and $A_{lj}$ on each processor $(i,j,l)$ to get $X_{il}A_{lj}$.
Finally, we reduce-scatter the results across $(i, j, 0...q)$ to get $Y_{ijl}$, which forms
$$
Y=
\left[\begin{matrix}
Y_{000} & Y_{001} \\
Y_{010} & Y_{011} \\
Y_{100} & Y_{101} \\
Y_{110} & Y_{111} \end{matrix}
\right].
$$
We also need to note that in the backward pass, we need to all-gather the gradient $\dot{Y_{ijl}}$, and then reduce-scatter the gradient $\dot{X_{il}}=\dot{Y_{ij}}A_{lj}^T$ and $\dot{A_{lj}}=X_{il}^T\dot{Y_{ij}}$.
## Efficiency
Given $P=q \times q \times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 3D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^3)$ | $O(1/q^3)$ | $O(1/q^3)$ | $O(6(q-1)/q^3)$ | $O(6(q-1))$ |
## Usage
To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=8, mode='3d'),
))
```
Then Colossal-AI will automatically apply 3D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 8 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 256])
Weight of the second linear layer: torch.Size([512, 64])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 3D parallelism, it becomes `[128, 256]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 64]`.
We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_WEIGHT)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_INPUT)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_OUTPUT)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 3D parallelism are all split by $q^2$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the second layer has the output of `[4, 128]`.
Note that although the results of 3D parallelism happen to have the same shapes as those of 2.5D parallelism here, the content of each partition is different.
# Gradient Accumulation
Author: Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
## Introduction
Gradient accumulation is a common way to enlarge your batch size for training.
When training large-scale models, memory can easily become the bottleneck and the batch size can be very small (e.g. 2),
leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations
and only updating the parameters once every preset number of iterations.
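For intuition, the following plain PyTorch sketch (not Colossal-AI specific; `model`, `criterion`, `optimizer` and `dataloader` are placeholders) shows what accumulating gradients over 4 iterations before a single parameter update looks like.
```python
accum_size = 4
optimizer.zero_grad()
for step, (data, label) in enumerate(dataloader):
    loss = criterion(model(data), label) / accum_size   # scale so the update matches a larger batch
    loss.backward()                                      # gradients are summed into .grad
    if (step + 1) % accum_size == 0:
        optimizer.step()                                 # parameters change only every 4 steps
        optimizer.zero_grad()
```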
## Usage
It is simple to use gradient accumulation in Colossal-AI. Just add this following configuration into your config file.
The integer represents the number of iterations to accumulate gradients.
```python
gradient_accumulation = <int>
```
## Hands-on Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to 4. You can run the script using this command:
```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
```
You will see output similar to the text below. This shows that the gradients are indeed accumulated, as the parameters are not updated
in the first 3 steps and are only updated in the last step.
```text
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
```
# Gradient Clipping
Author: Boxiang Wang, Haichen Huang, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
**Related Paper**
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)
## Introduction
To speed up training and seek a better optimum, more and more learning rate schedulers have been proposed.
They control the learning rate to adjust the descent pace during training, which works best when the gradient
vector keeps a bounded, uniform length at every step. Gradient clipping, a technique that rescales the gradient
vector so that its norm does not exceed a given threshold, is therefore indispensable for stable training and
better model performance.
You do not have to worry about implementing gradient clipping when using Colossal-AI; we support gradient
clipping in a powerful and convenient way. All you need is an additional line in your configuration file.
## Why you should use gradient clipping provided by Colossal-AI
The reason why we do not recommend users to implement gradient clipping by themselves is that naive gradient clipping
may fail when tensor parallelism, pipeline parallelism or MoE is applied.
As the illustration below shows, each GPU only owns a portion of the weight of a linear layer.
To compute the correct norm of the weight's gradient, the squared norms of the gradient shards on every GPU
have to be summed together.
To complicate things further, the bias is distributed differently from the weight, so the sum operation requires
a different communication group.
(PS: this situation is based on an old version of 2D parallelism and differs from the current implementation in the code,
but it is a good example of how difficult it is to unify all the communication required by gradient clipping.)
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
<figcaption>Layout of parameters</figcaption>
</figure>
Do not worry about it, since Colossal-AI has handled it for you.
### Usage
To use gradient clipping, simply add the gradient clipping norm to your configuration file.
```python
clip_grad_norm = 1.0
```
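For intuition only, the sketch below shows what a correct global norm has to do when a weight is sharded across a tensor-parallel group: the squared norms of the local gradient shards are summed with an all-reduce before the scaling factor is computed. The group handle and function name are illustrative assumptions, and the real implementation in Colossal-AI additionally handles parameters (such as biases) that live in different communication groups.
```python
import torch
import torch.distributed as dist

def clip_grad_norm_sharded(local_params, max_norm, tp_group):
    # sum of squared norms over the gradient shards owned by this rank
    local_sq = torch.zeros(1, device=torch.cuda.current_device())
    for p in local_params:
        if p.grad is not None:
            local_sq += p.grad.float().pow(2).sum()
    # aggregate across the tensor-parallel group to obtain the true global norm
    dist.all_reduce(local_sq, group=tp_group)
    total_norm = local_sq.sqrt()
    clip_coef = (max_norm / (total_norm + 1e-6)).item()
    if clip_coef < 1.0:
        for p in local_params:
            if p.grad is not None:
                p.grad.mul_(clip_coef)
    return total_norm
```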
### Hands-On Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
to demonstrate gradient clipping. In this example, we set the gradient clipping vector norm to be 1.0. You can run the script using this command:
```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
```
# Gradient Handler
Author: Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
## Introduction
In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
need to make sure the parameters are updated with the same gradients in different machines so that the resulting parameters
are the same. This is often seen in data parallel as the model is replicated across data parallel ranks.
In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
flexibility in cases such as implementing a new parallelism method.
When gradient handlers are used, PyTorch `DistributedDataParallel` will not be used, as DDP synchronizes gradients automatically and would duplicate the handler's work.
## Customize Your Gradient Handlers
To implement a customized gradient handler, you need to follow these steps.
1. Inherit `BaseGradientHandler` in Colossal-AI.
2. Register the gradient handler into the `GRADIENT_HANDLER` registry.
3. Implement the `handle_gradient` method.
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine.gradient_handler import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class MyGradientHandler(BaseGradientHandler):
def handle_gradient(self):
do_something()
```
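As a more concrete sketch, the handler below all-reduces every local gradient over the data-parallel group and averages it. It assumes `BaseGradientHandler` exposes the wrapped model as `self._model`, as the built-in handlers do; the built-in `DataParallelGradientHandler` performs a similar reduction more efficiently.
```python
import torch.distributed as dist
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.engine.gradient_handler import BaseGradientHandler
from colossalai.registry import GRADIENT_HANDLER

@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):
    def handle_gradient(self):
        group = gpc.get_group(ParallelMode.DATA)
        world_size = gpc.get_world_size(ParallelMode.DATA)
        # average every local gradient across the data-parallel ranks
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, group=group)
                param.grad.div_(world_size)
```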
## Usage
To use a gradient handler, you need to specify your gradient handler in the config file. The gradient handler
will be automatically built and attached to the engine.
```python
gradient_handler = [dict(type='MyGradientHandler')]
```
### Hands-On Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
to demonstrate the use of gradient handler. In this example, we used `DataParallelGradientHandler` instead of PyTorch
`DistributedDataParallel` for data parallel training.
```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
```
# Auto Mixed Precision Training
Author: Chuanrui Wang, Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
## Introduction
AMP stands for automatic mixed precision training.
In Colossal-AI, we have incorporated different implementations of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. naive amp
| Colossal-AI AMP mode | tensor parallel support | pipeline parallel support | fp16 extent |
| ----------- | ----------------------- | ------------------------- | ----------- |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
The first two rely on the original implementation of PyTorch (version 1.6 and above) and NVIDIA Apex.
The last method is similar to Apex O2 level.
Among these methods, apex AMP is not compatible with tensor parallelism.
This is because tensors are split across devices in tensor parallelism, so communication among different processes is required to check whether inf or nan occurs anywhere in the model weights.
We modified the torch AMP implementation so that it is now compatible with tensor parallelism.
> ❌️ fp16 and zero configuration are not compatible
>
> ⚠️ Pipeline parallelism only supports naive AMP currently
We recommend using torch AMP, as it generally gives better accuracy than naive AMP if no pipeline parallelism is used.
## Table of Contents
In this tutorial we will cover:
1. AMP introduction
2. AMP in Colossal-AI
3. Hands-on Practice
## AMP Introduction
Automatic Mixed Precision training is a mixture of FP16 and FP32 training.
Half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency.
Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
available for larger batch sizes and model sizes.
However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That's the reason why we introduce automatic mixed precision, attempting to match each operation to its appropriate data type, which can reduce the memory footprint and augment training efficiency.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
</figure>
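The torch-style AMP that Colossal-AI wraps is built on `torch.cuda.amp`. The plain PyTorch sketch below (with `model`, `criterion`, `optimizer` and `dataloader` as placeholders) shows the underlying mechanism: the forward pass runs under autocast, and the loss is scaled so that small fp16 gradients do not underflow.
```python
import torch

scaler = torch.cuda.amp.GradScaler()
for data, label in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # matmul-heavy ops run in fp16, precision-sensitive ops stay in fp32
        output = model(data.cuda())
        loss = criterion(output, label.cuda())
    scaler.scale(loss).backward()   # scale the loss to keep fp16 gradients representable
    scaler.step(optimizer)          # unscale gradients and skip the step if inf/NaN appeared
    scaler.update()                 # adjust the scale for the next iteration
```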
## AMP in Colossal-AI
We support three AMP training methods and allow users to train with AMP without changing their code. You can simply add the `fp16`
configuration to your configuration file to use AMP.
```python
from colossalai.amp import AMP_TYPE
# use Torch AMP
fp16=dict(
mode = AMP_TYPE.TORCH
)
# use naive AMP
fp16=dict(
mode = AMP_TYPE.NAIVE
)
# use NVIDIA Apex AMP
fp16=dict(
mode = AMP_TYPE.APEX
)
```
> This is the minimum configuration; the full configuration is described in the sections below.
### AMP Modularity
AMP module is designed to be completely modular and can be used independently.
If you wish to only use AMP in your code base without `colossalai.initialize`,
you can use `colossalai.amp.convert_to_amp`.
```python
from colossalai.amp import AMP_TYPE
# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
optimizer,
criterion,
AMP_TYPE.TORCH)
```
### Torch AMP Configuration
```python
from colossalai.amp import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# below are default values for grad scaler
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
With optional arguments:
- init_scale(float, optional, default=2.**16): Initial scale factor
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.
### Apex AMP Configuration
For this mode, we rely on the Apex implementation for mixed precision training.
We support this plugin because it allows for finer control on the granularity of mixed precision.
For example, O2 level (optimization level 2) will keep batch normalization in fp32.
If you look for more details, please refer to [Apex Documentation](https://nvidia.github.io/apex/).
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# below are the default values
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
Parameters:
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.
- opt_level(str, optional, default="O1"): Pure or mixed precision optimization level.
Accepted values are “O0”, “O1”, “O2”, and “O3”, explained in detail in the Apex AMP documentation.
- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
losses/backward passes, but use a single global loss scale for all of them.
- verbosity(int, default=1): Set to 0 to suppress Amp-related output.
- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
- max_loss_scale(float, default=2.**24 ): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.
Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
They are optional properties that can be overridden once opt_level is determined:
- cast_model_type: Casts your model’s parameters and buffers to the desired type.
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
### Naive AMP Configuration
In naive AMP mode, we achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
This AMP mode casts all operations to fp16.
The following code block shows the `config.py` file for this mode.
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.NAIVE,
# below are the default values
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```
The default parameters of Naive AMP:
- log_num_zeros_in_grad(bool): whether to log the number of zeros in the gradients
- initial_scale(int): initial scale of the gradient scaler
- growth_factor(int): the growth rate of the loss scale
- backoff_factor(float): the decrease rate of the loss scale
- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info
When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
The output model is converted to an AMP model with smaller memory consumption.
If your input model is already too large to fit in a single GPU, please instantiate your model weights in `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallelization techniques!
## Hands-on Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.
### Step 1. Create a config file
Create a `config.py` and add the `fp16` configuration.
```python
# in config.py
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 300
fp16 = dict(
mode=AMP_TYPE.TORCH,
)
clip_grad_norm = 1.0
```
### Step 2. Import libraries in train_with_engine.py
Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
`pip install timm scipy`.
```python
import os
import colossalai
import torch
from pathlib import Path
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.utils import get_dataloader
from colossalai.trainer import Trainer, hooks
from colossalai.nn.lr_scheduler import LinearWarmupLR
from timm.models import vit_base_patch16_224
from torchvision import datasets, transforms
```
### Step 3. Initialize Distributed Environment
We then need to initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
for other initialization methods.
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
colossalai.launch_from_torch(config=args.config)
```
### Step 4. Create training components
Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.
```python
# build model
model = vit_base_patch16_224(drop_rate=0.1)
# build dataloader
train_dataset = datasets.Caltech101(
root=Path(os.environ['DATA']),
download=True,
transform=transforms.Compose([
transforms.Resize(256),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
Gray2RGB(),
transforms.Normalize([0.5, 0.5, 0.5],
[0.5, 0.5, 0.5])
]))
train_dataloader = get_dataloader(dataset=train_dataset,
shuffle=True,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
# build optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()
# lr_scheduelr
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
### Step 5. Inject AMP Feature
Call `colossalai.initialize` to convert the training components to be running with FP16.
```python
engine, train_dataloader, _, _ = colossalai.initialize(
model, optimizer, criterion, train_dataloader,
)
```
### Step 6. Train with Engine
Use the engine in a normal training loop.
```python
engine.train()
for epoch in range(gpc.config.NUM_EPOCHS):
    for img, label in train_dataloader:
img = img.cuda()
label = label.cuda()
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
lr_scheduler.step()
```
### Step 7. Invoke Training Scripts
Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```
# NVMe offload
Author: Hongxin Liu
**Prerequisite:**
- [Zero Redundancy Optimizer with chunk-based memory management](../features/zero_with_chunk.md)
## Introduction
If a model has `N` parameters, when using Adam, it has `8N` optimizer states. For billion-scale models, optimizer states take at least 32 GB of memory. GPU memory limits the scale of models we can train, which is called the GPU memory wall. If we offload optimizer states to disk, we can break through the GPU memory wall.
We implement a user-friendly and efficient asynchronous Tensor I/O library: [TensorNVMe](https://github.com/hpcaitech/TensorNVMe). With this library, we can simply implement NVMe offload.
> This library is compatible with all kinds of disks (HDD, SATA SSD, and NVMe SSD). As the I/O bandwidth of an HDD or SATA SSD is low, it's recommended to use this library only with NVMe disks.
When optimizing a parameter, we can divide the optimization process into three stages: read, compute and offload. We perform the optimization process in a pipelined fashion, which can overlap computation and I/O.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/16/CvRnowrsNyB4hza.jpg"/>
<figcaption>Optimization process</figcaption>
</figure>
## Usage
First, please make sure you installed [TensorNVMe](https://github.com/hpcaitech/TensorNVMe):
```shell
pip install packaging
pip install tensornvme
```
We implement NVMe offload of optimizer states for Adam ([CPUAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.cpu_adam.html) and [HybridAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.hybrid_adam.html)).
```python
from colossalai.nn.optimizer import CPUAdam, HybridAdam
optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, nvme_offload_dir='./')
```
`nvme_offload_fraction` is the fraction of optimizer states to be offloaded to NVMe. `nvme_offload_dir` is the directory to save NVMe offload files. If `nvme_offload_dir` is `None`, a random temporary directory will be used.
It's compatible with all parallel methods in ColossalAI.
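A minimal sketch of a training step with the offloaded optimizer (using placeholder `model`, `criterion` and `dataloader` objects): the optimizer is used exactly like a normal one, and the read/compute/offload pipeline described above runs inside `optimizer.step()`.
```python
for data, label in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(data), label)
    loss.backward()
    optimizer.step()   # optimizer states are streamed between NVMe and memory here
```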
# Pipeline Parallel
Author: Guangyang Lu, Hongxin Liu, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
**Example Code**
- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel)
**Related Paper**
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
## Quick introduction
In this tutorial, you will learn how to use pipeline parallelism. In Colossal-AI, we use the 1F1B pipeline introduced by NVIDIA. Since ViT and ImageNet are too large for a quick demo, we use ResNet and CIFAR-10 as the example here.
## Table of Contents
In this tutorial we will cover:
1. Introduction of 1F1B pipeline.
2. Usage of non-interleaved and interleaved schedule.
3. Training ResNet with pipeline.
## Introduction of 1F1B pipeline
First of all, we will introduce GPipe to help you understand 1F1B better.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/OAucPF6mWYynUtV.png"/>
<figcaption>Figure1: GPipe. This figure is from <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper.</figcaption>
</figure>
As you can see, with GPipe the backward passes are executed only after the forward passes of all microbatches in a batch have finished.
In general, 1F1B (one forward pass followed by one backward pass) is more efficient than GPipe (in memory, or in both memory and time). There are two schedules of the 1F1B pipeline, the non-interleaved and the interleaved; the figures are shown below.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/iJrVkp2HLcahjsT.png"/>
<figcaption>Figure2: This figure is from <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper. The top part shows the default non-interleaved schedule. And the bottom part shows the interleaved schedule.</figcaption>
</figure>
### Non-interleaved Schedule
The non-interleaved schedule can be divided into three stages. The first stage is the warm-up stage, where workers perform differing numbers of forward passes. At the following stage, workers perform one forward pass followed by one backward pass. Workers will finish backward passes at the last stage.
This mode is more memory-efficient than GPipe. However, it takes about the same amount of time as GPipe to finish one round of passes.
### Interleaved Schedule
This schedule requires **the number of microbatches to be an integer multiple of the number of pipeline stages**.
In this schedule, each device can perform computation for multiple subsets of layers (called model chunks) instead of a single contiguous set of layers. For example, before, device 1 had layers 1-4 and device 2 had layers 5-8, and so on; now device 1 has layers 1, 2, 9, 10 and device 2 has layers 3, 4, 11, 12, and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages, and each pipeline stage requires less computation.
This mode is both memory-efficient and time-efficient.
## Usage of non-interleaved and interleaved schedule
In Colossal-AI, we provide both the non-interleaved schedule (as `PipelineSchedule`) and the interleaved schedule (as `InterleavedPipelineSchedule`).
You just need to set `NUM_MICRO_BATCHES` in the config file, and also set `NUM_CHUNKS` if you want to use the interleaved pipeline schedule. If you know for certain the shape of each pipeline stage's output tensor and the shapes are all the same, you can set `TENSOR_SHAPE` in the config file to further reduce communication. Otherwise, you can just ignore `tensor_shape`, and the shapes will be exchanged between pipeline stages automatically. We will then generate an appropriate schedule for you, as sketched below.
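For reference, a hedged config sketch with all three knobs might look like this; the commented tensor shape values are placeholders for your own batch and hidden sizes and are not part of this tutorial's example.
```python
# config.py (a sketch; adjust the numbers to your own model)
NUM_MICRO_BATCHES = 4                      # required by the pipeline schedules
NUM_CHUNKS = 2                             # > 1 selects InterleavedPipelineSchedule
# TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, HIDDEN_SIZE)  # optional, reduces communication
parallel = dict(pipeline=2)
```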
## Training ResNet with pipeline
Let's build the `ResNet` model first with Colossal PipelinableContext:
```python
import os
from typing import Callable, List, Optional, Type, Union
import torch
import torch.nn as nn
import colossalai
import colossalai.nn as col_nn
from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import MultiTimer, get_dataloader
from colossalai.context import ParallelMode
from colossalai.pipeline.pipelinable import PipelinableContext
from titans.dataloader.cifar10 import build_cifar
from torchvision.models import resnet50
from torchvision.models.resnet import BasicBlock, Bottleneck, conv1x1
# Define some config
BATCH_SIZE = 64
NUM_EPOCHS = 2
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2))
# Train
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(backend=args.backend, config=CONFIG)
logger = get_dist_logger()
pipelinable = PipelinableContext()
# build model
with pipelinable:
model = resnet50()
```
Define an execution sequence.
```python
exec_seq = [
'conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool',
(lambda x: torch.flatten(x, 1), "behind"), 'fc'
]
pipelinable.to_layer_list(exec_seq)
```
Partition the model into pipeline.
```python
model = pipelinable.partition(NUM_CHUNKS, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE))
```
In this tutorial, we use `Trainer` to train `ResNet`:
```python
# build criterion
criterion = nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# build dataloader
root = os.environ.get('DATA', './data')
train_dataloader, test_dataloader = build_cifar(BATCH_SIZE, root, padding=4, crop=32, resize=32)
lr_scheduler = col_nn.lr_scheduler.LinearWarmupLR(optimizer, NUM_EPOCHS, warmup_steps=1)
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, optimizer, criterion,
train_dataloader, test_dataloader,
lr_scheduler)
timer = MultiTimer()
trainer = Trainer(engine=engine, timer=timer, logger=logger)
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(col_nn.metric.Accuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LRSchedulerHook(lr_scheduler, by_epoch=True)
]
trainer.fit(train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True)
```
We use `2` pipeline stages and the batch will be split into `4` micro-batches.
# Zero Redundancy Optimizer with chunk-based memory management
Author: [Hongxiu Liu](https://github.com/ver217), [Jiarui Fang](https://github.com/feifeibear), [Zijian Ye](https://github.com/ZijianYY)
**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)
**Example Code**
- [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)
**Related Paper**
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
## Introduction
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism, while the computational granularity
and communication efficiency is retained.
1. **Shard Optimizer States**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights,
and the first and second momentum estimates) are partitioned across the processes, so that each process updates only its partition.
2. **Shard Gradient**: After reduction inside the data parallel process group, gradient tensors are also partitioned such that each process only stores the gradients corresponding to its partition of the optimizer states. Note that Colossal-AI converts gradients to fp32 format to participate in parameter updating.
3. **Shard Parameter**: The 16-bit model parameters are partitioned across the processes of a data parallel group.
4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.
Besides, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.
When using ZeRO, we distribute the model by sharding the parameters. The advantage of this method is that the memory load of each node is balanced. But this approach has two significant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, which leads to memory fragmentation. Second, using tensors as the granularity for communication leaves the network bandwidth underutilized; generally, the longer the transmitted message, the higher the bandwidth utilization.
Using the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store consecutive parameters, in initialization order, into a Chunk (a chunk is a contiguous memory space), and each Chunk has the same size. Organizing memory in chunks leads to efficient use of PCI-e and GPU-GPU bandwidth, reduces the number of communications, and avoids potential memory fragmentation.
Before v0.1.8, ZeRO had a high communication cost for parameter communication. If a parameter was used multiple times in several consecutive operators, there would be repeated communication operations, which greatly hurt efficiency. This situation is very common when using the gradient checkpointing technique, where the forward propagation of a parameter is recomputed during backward propagation.
Taking GPT as an example, its Checkpoint will be applied to each GPT Block, and each GPT Block contains a Self-Attention layer and an MLP layer. During the backward pass, the forward of the Self-Attention layer and the MLP layer will be computed in turn, and then the backward of the MLP layer and the Self-Attention layer will be computed in turn.
In addition, due to the communication and memory movement of small Tensors, the bandwidth of NVLINK and PCI-E cannot be fully utilized, and each communication and memory movement has the overhead of kernel launch. After using Chunk, multiple small Tensor communication and memory movement can be changed into one large Tensor communication and memory movement, which not only improves bandwidth utilization but also reduces the overhead of kernel launch.
We also provide a lightweight chunk search mechanism to help users automatically find the chunk size with the smallest memory fragmentation.
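The following is a conceptual sketch of the chunk idea (not the actual `ChunkManager` implementation): parameters are packed, in initialization order, into fixed-size buffers so that one communication or memory move handles a whole chunk instead of many small tensors. It assumes every single parameter fits into one chunk.
```python
import torch

def pack_into_chunks(params, chunk_numel):
    """Pack flattened parameters into fixed-size chunks in initialization order."""
    chunks, current, used = [], torch.zeros(chunk_numel), 0
    for p in params:
        n = p.numel()
        if used + n > chunk_numel:          # current chunk is full, start a new one
            chunks.append(current)
            current, used = torch.zeros(chunk_numel), 0
        current[used:used + n].copy_(p.detach().reshape(-1))
        used += n
    chunks.append(current)
    return chunks

# e.g. pack the parameters of a small model into 1M-element chunks
# chunks = pack_into_chunks(model.parameters(), chunk_numel=2**20)
```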
## Usage
### GeminiDDP
We use `GeminiDDP` to apply ZeRO with chunk-based memory management. This is our new torch.Module wrapper which uses ZeRO-DP and Gemini. ZeRO is for parallelism and Gemini is for memory management.
Also make sure that your model is initialized within the context of ColoInitContext.
```python
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
model = gpt2_medium(checkpoint=True)
```
Define the chunk manager and the Gemini manager as follows:
```python
chunk_manager = init_chunk_manager(model=module,
init_device=device,
hidden_dim=hidden_dim,
search_range_mb=search_range_mb,
min_chunk_size_mb=min_chunk_size_mb)
gemini_manager = GeminiManager(placement_policy, chunk_manager)
```
`hidden_dim` is the hidden dimension of the DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, that is OK; we will use a default value of 1024. `min_chunk_size_mb` is the minimum chunk size in megabytes. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk.
Initialization of the optimizer.
```python
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
```
Training
```python
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
```
> ⚠️ Note: Please do not use `loss.backward()`; the standard way of writing it is `optimizer.backward(loss)`.
### Train GPT
In this example, we use `Hugging Face Transformers`. You have to install `transformers` before running this example. We will take `GPT2 Medium` as an example here.
For simplicity, we just use randomly generated data here.
First, we only need to import `GPT2LMHeadModel` from `Hugging Face transformers` to define our model; users do not need to define or modify the model themselves, which makes it more convenient to use.
```python
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class GPTLMModel(nn.Module):
def __init__(self,
hidden_size=768,
num_layers=12,
num_attention_heads=12,
max_seq_len=1024,
vocab_size=50257,
checkpoint=False):
super().__init__()
self.checkpoint = checkpoint
self.model = GPT2LMHeadModel(
GPT2Config(n_embd=hidden_size,
n_layer=num_layers,
n_head=num_attention_heads,
n_positions=max_seq_len,
n_ctx=max_seq_len,
vocab_size=vocab_size))
if checkpoint:
self.model.gradient_checkpointing_enable()
def forward(self, input_ids, attention_mask):
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
def gpt2_medium(checkpoint=False):
return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)
```
Define our loss function:
```python
class GPTLMLoss(nn.Module):
def __init__(self):
super().__init__()
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, logits, labels):
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
Define tensor parallel and parameter sharding strategies for tensor parallelism:
```python
def tensor_parallelize(model: torch.nn.Module, pg: ProcessGroup):
for mn, module in model.named_modules():
for pn, param in module.named_parameters(recurse=False):
if hasattr(param, 'visited'):
continue
param.set_dist_spec(ReplicaSpec())
if 'mlp.c_fc' in mn:
if 'weight' in pn or 'bias' in pn:
split_param_col_tp1d(param, pg)
param.compute_spec.set_output_replicate(False)
else:
param.set_dist_spec(ReplicaSpec())
elif 'mlp.c_proj' in mn:
if 'weight' in pn:
split_param_row_tp1d(param, pg)
else:
param.set_dist_spec(ReplicaSpec())
elif 'wte' in mn or 'wpe' in mn:
split_param_col_tp1d(param, pg)
elif 'c_attn' in mn or 'c_proj' in mn:
split_param_col_tp1d(param, pg)
else:
param.set_dist_spec(ReplicaSpec())
param.visited = True
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
param.set_tensor_spec(*spec)
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(0, param, pg)
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(-1, param, pg)
```
Define a model which uses Gemini + ZeRO DDP:
```python
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placememt_policy: str = "auto"):
cai_version = colossalai.__version__
if version.parse(cai_version) > version.parse("0.1.10"):
from colossalai.nn.parallel import GeminiDDP
model = GeminiDDP(model,
device=get_current_device(),
placement_policy=placememt_policy,
pin_memory=True,
search_range_mb=32)
elif version.parse(cai_version) <= version.parse("0.1.10") and version.parse(cai_version) >= version.parse("0.1.9"):
from colossalai.gemini import ChunkManager, GeminiManager
chunk_size = ChunkManager.search_chunk_size(model, 64 * 1024**2, 32)
        chunk_manager = ChunkManager(chunk_size,
                                     pg,
                                     enable_distributed_storage=True,
                                     init_device=GeminiManager.get_default_device(placememt_policy))
        gemini_manager = GeminiManager(placememt_policy, chunk_manager)
model = ZeroDDP(model, gemini_manager)
else:
        raise NotImplementedError(f"CAI version {cai_version} is not supported")
return model
```
As we pre-train GPT in this example, we just use a simple language model loss.
Write a function to get random inputs:
```python
def get_data(batch_size, seq_len, vocab_size):
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=torch.cuda.current_device())
attention_mask = torch.ones_like(input_ids)
return input_ids, attention_mask
```
Finally, we can define our training loop:
```python
def main():
args = parse_args()
BATCH_SIZE = 8
SEQ_LEN = 1024
VOCAB_SIZE = 50257
NUM_STEPS = 10
colossalai.launch_from_torch(config={})
# build criterion
criterion = GPTLMLoss()
torch.manual_seed(123)
default_pg = ProcessGroup(tp_degree=args.tp_degree)
default_dist_spec = ShardSpec([-1], [args.tp_degree]) if args.shardinit else None
# build GPT model
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
model = gpt2_medium(checkpoint=True)
pg = default_pg
# Tensor Parallelism (TP)
tensor_parallelize(model, pg)
# Gemini + ZeRO DP, Note it must be used after TP
model = gemini_zero_dpp(model, pg, args.placement)
# build optimizer
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
numel = sum([p.numel() for p in model.parameters()])
get_tflops_func = partial(get_tflops, numel, BATCH_SIZE, SEQ_LEN)
torch.cuda.synchronize()
model.train()
for n in range(NUM_STEPS):
# we just use randomly generated data here
input_ids, attn_mask = get_data(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
torch.cuda.synchronize()
```
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation.md) we mentioned before.
The complete example can be found on [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).