Commit be85a0f3 authored by Frank Lee's avatar Frank Lee Committed by ver217
Browse files

removed tutorial markdown and refreshed rst files for consistency

parent ca4ae52d
colossalai.nn.multi\_tensor\_apply colossalai.utils.multi\_tensor\_apply
================================== =====================================
.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply .. automodule:: colossalai.utils.multi_tensor_apply
:members: :members:
.. toctree::
:maxdepth: 2
colossalai.utils.multi_tensor_apply.multi_tensor_apply
colossalai.utils colossalai.utils
================ ================
.. automodule:: colossalai.utils
:members:
.. toctree::
:maxdepth: 2
colossalai.utils.data_sampler
colossalai.utils.gradient_accumulation
colossalai.utils.multi_tensor_apply
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
...@@ -8,12 +19,5 @@ colossalai.utils ...@@ -8,12 +19,5 @@ colossalai.utils
colossalai.utils.checkpointing colossalai.utils.checkpointing
colossalai.utils.common colossalai.utils.common
colossalai.utils.cuda colossalai.utils.cuda
colossalai.utils.data_sampler
colossalai.utils.gradient_accumulation
colossalai.utils.memory colossalai.utils.memory
colossalai.utils.multi_tensor_apply
colossalai.utils.timer colossalai.utils.timer
.. automodule:: colossalai.utils
:members:
colossalai.zero.loss\_scaler
============================
.. automodule:: colossalai.zero.loss_scaler
:members:
colossalai.zero colossalai.zero
================ ===============
.. automodule:: colossalai.zero .. automodule:: colossalai.zero
:members: :members:
.. toctree::
:maxdepth: 2
colossalai.zero.loss_scaler
colossalai.zero.zero_redundancy_optimizer_level_2
colossalai.zero.zero_redundancy_optimizer_level_3
colossalai.zero.zero\_redundancy\_optimizer\_level\_2
=====================================================
.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_2
:members:
colossalai.zero.zero\_redundancy\_optimizer\_level\_3
=====================================================
.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_3
:members:
# Config file
Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using Colossal-AI:
```python
# optional
# three keys: pipeline, tensor
# data parallel size is inferred
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=4, mode='2d'),
)
# optional
# pipeline or no pipeline schedule
fp16 = dict(
mode=AMP_TYPE.NAIVE,
initial_scale=2 ** 8
)
# optional
# configuration for zero
# you can refer to the Zero Redundancy optimizer and zero offload section for details
# https://www.colossalai.org/zero.html
zero = dict(
level=<int>,
...
)
# optional
# if you are using complex gradient handling
# otherwise, you do not need this in your config file
# default gradient_handlers = None
gradient_handlers = [dict(type='MyHandler', arg1=1, arg=2), ...]
# optional
# specific gradient accumulation size
# if your batch size is not large enough
gradient_accumulation = <int>
# optional
# add gradient clipping to your engine
# this config is not compatible with zero and AMP_TYPE.NAIVE
# but works with AMP_TYPE.TORCH and AMP_TYPE.APEX
# defautl clip_grad_norm = 0.0
clip_grad_norm = <float>
# optional
# cudnn setting
# default is like below
cudnn_benchmark = False,
cudnn_deterministic=True,
```
\ No newline at end of file
# 配置文件
下方代码块中的示例展示了如何在CIFAR10数据集上使用Colossal-AI训练ViT模型。
```python
# build train_dataset and train_dataloader from this dictionary
# It is not compulsory in Config File, instead, you can input this dictionary as an argument into colossalai.initialize()
train_data = dict(
# dictionary for building Dataset
dataset=dict(
# the type CIFAR10Dataset has to be registered
type='CIFAR10Dataset',
root='/path/to/data',
# transform pipeline
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='RandomCrop', size=IMG_SIZE, padding=4),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]),
]
),
# dictionary for building Dataloader
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
shuffle=True,
)
)
# build test_dataset and test_dataloader from this dictionary
test_data = dict(
dataset=dict(
type='CIFAR10Dataset',
root='/path/to/data',
train=False,
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]
),
]
),
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
)
)
# compulsory
# build optimizer from this dictionary
optimizer = dict(
# Avaluable types: 'ZeroRedundancyOptimizer_Level_1', 'ZeroRedundancyOptimizer_Level_2', 'ZeroRedundancyOptimizer_Level_3'
# 'Adam', 'Lamb', 'SGD', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'FP16Optimizer'
type='Adam',
lr=0.001,
weight_decay=0
)
# compulsory
# build loss function from this dictionary
loss = dict(
# Avaluable types:
# 'CrossEntropyLoss2D', 'CrossEntropyLoss2p5D', 'CrossEntropyLoss3D'
type='CrossEntropyLoss2D',
)
# compulsory
# build model from this dictionary
model = dict(
# types avaluable: 'PretrainBERT', 'VanillaResNet', 'VisionTransformerFromConfig'
type='VisionTransformerFromConfig',
# each key-value pair above refers to a layer
# input data pass through these layers recursively
tensor_splitting_cfg=dict(
type='ViTInputSplitter2D',
),
embedding_cfg=dict(
type='ViTPatchEmbedding2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
),
token_fusion_cfg=dict(
type='ViTTokenFuser2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
drop_rate=0.1
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
block_cfg=dict(
# ViTBlock is a submodule
type='ViTBlock',
attention_cfg=dict(
type='ViTSelfAttention2D',
hidden_size=DIM,
num_attention_heads=NUM_ATTENTION_HEADS,
attention_dropout_prob=0.,
hidden_dropout_prob=0.1,
checkpoint=True
),
droppath_cfg=dict(
type='VanillaViTDropPath',
),
mlp_cfg=dict(
type='ViTMLP2D',
in_features=DIM,
dropout_prob=0.1,
mlp_ratio=4,
checkpoint=True
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
),
head_cfg=dict(
type='ViTHead2D',
hidden_size=DIM,
num_classes=NUM_CLASSES,
),
embed_dim=DIM,
depth=DEPTH,
drop_path_rate=0.,
)
# hooks are built when initializing trainer
# possible hooks: 'BaseHook', 'MetricHook','LoadCheckpointHook'
# 'SaveCheckpointHook','LossHook', 'AccuracyHook', 'Accuracy2DHook'
# 'LogMetricByEpochHook', 'TensorboardHook','LogTimingByEpochHook', 'LogMemoryByEpochHook'
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='Accuracy2DHook'),
dict(type='LossHook'),
# dict(type='TensorboardHook', log_dir='./tfb_logs'),
# dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
# dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
# three keys: pipeline, tensor, data
# if data=dict(size=1), which means no data parallelization, then there is no need to define it
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=4, mode='2d'),
)
# not compulsory
# pipeline or no pipeline schedule
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
initial_scale=2 ** 8
)
# not compulsory
# build learning rate scheduler
lr_scheduler = dict(
type='LinearWarmupLR',
warmup_epochs=5
)
schedule = dict(
num_microbatches=8
)
# training stopping criterion
# you can give num_steps or num_epochs
num_epochs = 60
# config logging path
logging = dict(
root_path='./logs'
)
```
\ No newline at end of file
...@@ -3,30 +3,8 @@ ...@@ -3,30 +3,8 @@
You can adapt this file completely to your liking, but it should at least You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive. contain the root `toctree` directive.
Colossal-AI documentation Colossal-AI API documentation
====================================== ======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
installation.md
run_demo.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization.md
model.md
trainer_engine.md
amp.md
zero.md
add_your_parallel.md
config.md
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
:caption: API REFERENCE :caption: API REFERENCE
......
.. Colossal-AI documentation master file, created by
sphinx-quickstart on Mon Oct 11 17:05:05 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
夸父AI系统(Colossal-AI)开发文档
======================================
.. toctree::
:maxdepth: 1
:caption: 快速上手指南
installation_zh.md
run_demo_zh.md
.. toctree::
:maxdepth: 1
:caption: 个性化您的训练
parallelization_zh.md
model_zh.md
trainer_engine_zh.md
amp_zh.md
zero_zh.md
add_your_parallel_zh.md
config_zh.md
.. toctree::
:maxdepth: 2
:caption: API REFERENCE
colossalai/colossalai
Indices and tables
==================
* :ref:`genindex`
\ No newline at end of file
# Setup
### PyPI
```bash
pip install colossalai
```
### Install From Source (Recommended)
> We **recommend** you to install from source as the Colossal-AI is updating frequently in the early versions. The documentation will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
```shell
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
Install and enable CUDA kernel fusion (compulsory installation when using fused optimizer)
```shell
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
# 快速安装
## 使用pip安装
```bash
pip install colossalai
```
## 使用源代码安装
```shell
git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
安装并支持内核融合(使用融合优化器时必须执行下面的代码)
```
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
# Define your own parallel model
Let's say that you have a huge MLP model with billions of parameters and its extremely large hidden layer size makes it
impossible to fit into a single GPU directly. Don't worry, ColossalAI is here to help you sort things out. With the help of ColossalAI,
you can write your model in the familiar way in which you used to write models for a single GPU, while ColossalAI automatically
splits your model weights and fit them perfectly into a set of GPUs. We give a simple example showing how to write a simple
2D parallel model in the Colossal-AI context.
## Write a simple 2D parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
def forward(self, x):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use pre-defined model
For the sake of your convenience, we kindly provide you in our Model Zoo with some prevalent models such as *BERT*, *VIT*,
and *MLP-Mixer*. Feel free to customize them into different sizes to fit into your special needs.
# 定义符合您需求的并行模型
如果您在训练一个拥有数亿级参数的巨大MLP模型,那么该模型一定无法在单个GPU上直接进行训练,不用担心,Colossal-AI可以帮您解决这一问题。您仍旧可以像写单GPU模型那样来写您的模型,Colossal-AI会按照您的并行设置自动将模型参数进行切割,并将它们均匀地存入一组GPU中。下面是一个简单的例子,来向您展示如何在Colossal-AI环境下写一个2D张量并行的模型。
## 简单的2D张量并行模型
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
def forward(self, x):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## 使用事先定义好的模型
为了您使用的方便,我们事先在我们的Model Zoo中定义好了一些现在流行的模型,比如*BERT**VIT*以及*MLP-Mixer*等,您可以根据您的需求来自定义这些模型的规模。
# Parallelization
## Configure the Combination of Parallelization
We support multiple parallelization in our library.
Hybrid parallelism in our codebase refers to namely the combination of data parallelism, pipeline parallelism
and tensor parallelism (1D, 2D, 2.5D, 3D). Each parallelism requires different network topology and thus
different initializers for distributed process group. You can initialize the corresponding process group by
setting `parallel` in our config. The parallel configuration can be easily deployed by a dictionary in
configuration file. The configuration dictionary must obey the following format. Data parallel size will be
inferred automatically based on your inputs to pipeline parallelism and tensor parallelism. The distributed
environment will set up by `colossalai.launch`.
```python
# sampler format
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
# this is ok
parallel = dict(
pipeline=dict(size=2),
tensor=dict(size=4, mode='2d')
)
# this is ok
parallel = dict(
pipeline=2,
tensor=dict(size=4, mode='2d')
)
# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
pipeline=2,
tensor=4
)
# this is ok as well as tensor will be default to size 1
# and mode None
parallel = dict(
pipeline=2
)
# this is ok as well as pipeline will default to size 1
parallel = dict(
tensor=dict(size=4, mode='2d')
)
```
The name of the dictionary variable should be **parallel**. All the arguments even **parallel** itself are optional and
data, pipeline, tensor parallel size will be set to defaulted value 1. The value of data, pipeline and tensor can be a
int representing the size of specific parallel dimension or a dictionary with a key called "size". The key "mode"
represents the way of tensor parallelism.
**You can choose to not have 'parallel' in your configuration and both pipelineand tensor will default to size 1.**
## Data Parallel
Data parallel is the most common way to distribute your training task by splitting data into several shards and train on
a single shard on each device. The configuration for data parallel is detected automatically and set for you. You do not
have to explicitly set them in your configurations. There are two ways to handle the all-reduce in data parallel in Colossal-AI.
1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
2. Otherwise, PyTorch DistributedDataParallel will be used
In most cases, you will be using the second mode unless you have complex handling of the gradients.
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism. We provide the list of papers which match each
tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices where
$N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
further parallelizes 2D tensor parallelism. An amount of $P = N^2 ∗ d$ processors are arranged into $d$ layers, where
each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
achieves the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage
are evenly distributed through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
tensor=dict(size=8, mode='3d')
)
```
Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed
operator. For example, if you mode is '2d', you can use `colossalai.nn.Linear2D` in you model construction.
## Pipeline Parallel (experimental)
Pipeline parallelism is to split the model into several partitions by layer. For example, let's assume we have a simple
model which consists of two linear layer. We have two GPUs, and we can allocate the first linear layer to the first GPU
and the second layer to the second GPU.
You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, Colossal-AI
will automatically creates the pipeline schedule which defines the forward and backward step.
```python
parallel = dict(
pipeline=dict(size=4), # number of pipeline stages
)
```
As PyTorch is based on dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism, you have the following two ways to split your model,
1. Split your model directly. Below is an exmaple of resnet split into two pipeline stages.
```python
from torchvision.models import resnet18
from colossalai.core import global_context as gpc
model = resnet18(num_classes=10)
if gpc.get_local_rank(ParallelMode.PIPELINE) == 0:
model = nn.Sequential(
model.conv1,
model.bn1,
model.relu,
model.maxpool,
model.layer1,
model.layer2
)
elif gpc.get_local_rank(ParallelMode.PIPELINE) == 1:
from functools import partial
class Flatten(nn.Module):
def forward(self, x):
return torch.flatten(x, 1)
model = nn.Sequential(
model.layer3,
model.layer4,
model.avgpool,
Flatten(),
model.fc
)
```
2. Make sure your model inherit `colossalai.nn.model.ModelFromConfig` and registered into the
`MODELS` registry. Define the `self.layers_cfg` attribute.
Pass in a dict/Config object which specifies the parameters of your model.
Use `colossalai.builder.pipeline.build_pipeline_model_from_cfg` to partition the layers.
```python
from colossalai.builder import build_pipeline_model_from_cfg
from colossalai.nn.model import ModelFromConfig
from colossalai.registry import MODELS
@MODELS.register_module
class MyModel(ModelFromConfig):
def __init__(self, arg1, arg2, ...):
...
self.layers_cfg = [
dict(type='Linear', in_features=3, out_features=512),
dict(type='Linear', in_features=512, out_features=512),
...
]
model_cfg = dict(
type='MyModel',
arg1=1,
arg2=2
...
)
# from config
model = build_pipeline_model_from_cfg(model_cfg, num_chunks=1)
# from torch.nn.Sequential
# model = build_pipeline_model(sequential_model, num_model_chunks)
```
When your model is split into partitions, you can use PipelineSchedule to execute training.
```python
import colossalai
from colossalai.engine.schedule import PipelineSchedule
engine, train_dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, train_dataloader)
schedule = PipelineSchedule(num_microbatches=4)
# interleaved pipeline
# schedule = InterleavedPipelineSchedule(num_microbatches=4, num_model_chunks=2)
# execute a training epoch
data_iter = iter(train_dataloader)
for i in range(len(train_dataloader)):
output, label, loss = schedule.forward_backward_step(engine,
data_iter,
forward_only=False,
)
```
This feature is still in development and is only experimental for now.
## Sequence Parallel (experimental)
Sequence parallel is to support long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.
# 并行技术
## 配置并行技术组合
Colossal-AI支持多种并行技术,包括数据并行、张量并行(1D、2D、2.5D、3D)、流水线并行以及序列并行。您可以通过更改配置文件中的`parallel`字典变量来初始化分布式系统中的进程组,配置文件中的`parallel`字典变量必须满足下面的格式。数据并行的规模可以通过`parallel`中流水线并行的规模和张量并行的规模计算得出。
```python
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
```
注意该字典变量的名称必须为**parallel**。该变量中所有的参数,包括`parallel`本身都是非必需的,如果您的代码中没有提供该变量,则所有并行规模都将被设定为默认值1,即不使用任何并行技术的情况。`parallel`中data、pipeline以及tensor的值分别代表了数据并行、流水线并行、以及张量并行的规模,而`mode`的值代表了张量并行的模式。
## 数据并行
数据并行是一种最常见的并行技术,可以将数据分成几个不同的部分,并对每一个部分在一台设备上进行训练。Colossal-AI可以自动检测数据并行设置并为您设置好环境,您不需要在您的环境配置中显式地设置。当数据并行规模大于1时,Colossal-AI会自动为数据读取器增加分布式数据采样器,以此来达到切分数据集的目的。
## 1D、2D、2.5D与3D张量并行
为了方便混合并行技术,我们提供了一系列的张量并行技术,同时下面罗列了每一种张量并行技术对应的论文,这些张量并行技术需要Colossal-AI提供的分布式层结构的支持。
- 1D:[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D:[An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2维张量并行依赖SUMMA矩阵乘法技术,其在两个不同的维度上对于输入数据进行切分。切分后的张量分布在一个的2维网格上,使用的总设备数量为$P = N^2$,其中$N$为一个维度上的切分张量数量。
- 2.5D:[2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
2.5维并行技术受到了2.5D矩阵乘法的启发,其对于2维张量并行的结果进行进一步切分,在$d$层上面安排$P = N^2 ∗ d$个处理器,相应地,矩阵乘法操作也被切分为$d$份在不同的层上面进行。
- 3D:[Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
我们还引入了3维张量并行技术,该技术在一个3维处理器立方体中对神经网络参数进行并行化。使用$P$个处理器时,该并行技术可以在付出$O(P^{1/3})$的通信开销的情况下达到最优表现,且计算资源和内存使用都可以在$P$个处理器上达到平均分配。
使用上述几种张量并行的`parallel`字典变量示例参见下方代码。
```python
# 1D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='3d')
)
```
## 流水线并行(开发中)
流水线并行指的是在将深度学习模型按照层切分为几个不同的部分。例如,对于一个由两个线性层组成的简单模型,我们可以使用两个GPU,并把第一个线性层的工作分配给一个GPU,把第二个线性层的工作分配给另一个GPU。当然这个例子只是为了说明流水线并行的工作方式,没有实际意义。
由于PyTorch的计算基于动态计算图,所以在执行前无法确定计算流。为了支持PyTorch中的流水线并行,您需要为您的模型类加入一个额外的特征`layers_cfg`,使Colossal-AI清楚具体的计算流程,`colossalai.nn.VanillaResNet`给出了一个您可以参考的示例。
```python
from colossalai.nn import BaseModel
import torch
class VanillaResNet(BaseModel):
def __init__(
self,
num_cls: int,
block_type: str,
layers: List[int],
norm_layer_type: str = 'BatchNorm2d',
in_channels: int = 3,
groups: int = 1,
width_per_group: int = 64,
zero_init_residual: bool = False,
replace_stride_with_dilation: Optional[List[bool]] = None,
dilations=(1, 1, 1, 1)
) -> None:
super().__init__()
... # some model params
self.layers_cfg = [
# conv1
dict(type='Conv2d',
in_channels=in_channels,
out_channels=self.inplanes,
kernel_size=7,
stride=2,
padding=3,
bias=False),
# bn1
dict(
type=norm_layer_type,
num_features=self.inplanes
),
# relu
dict(
type='ReLU',
inplace=True
),
# maxpool
dict(
type='MaxPool2d',
kernel_size=3,
stride=2,
padding=1
),
# layer 1
dict(
inplanes=self.inplanes,
planes=64,
blocks=self.blocks[0],
dilation=self.dilations[0],
**self.reslayer_common_cfg
),
# layer 2
dict(
inplanes=64 * self.block_expansion,
planes=128,
blocks=self.blocks[1],
stride=2,
dilate=replace_stride_with_dilation[0],
dilation=self.dilations[1],
**self.reslayer_common_cfg
),
# layer 3
dict(
inplanes=128 * self.block_expansion,
planes=256,
blocks=layers[2],
stride=2,
dilate=replace_stride_with_dilation[1],
dilation=self.dilations[2],
**self.reslayer_common_cfg
),
# layer 4
dict(
inplanes=256 * self.block_expansion,
planes=512,
blocks=layers[3], stride=2,
dilate=replace_stride_with_dilation[2],
dilation=self.dilations[3],
**self.reslayer_common_cfg
),
# avg pool
dict(
type='AdaptiveAvgPool2d',
output_size=(1, 1)
),
# flatten
dict(
type='LambdaWrapper',
func=lambda mod, x: torch.flatten(x, 1)
),
# linear
dict(
type='Linear',
in_features=512 * self.block_expansion,
out_features=num_cls
)
]
```
您可以在配置文件中手动设置流水线并行的级数,当流水线的并行级数大于1时,Colossal-AI将会自动创建定义前向传播和后向传播的流水线调度程序。同时,您还可以在配置文件中的`schedule`字典变量来定义每一个步骤中训练的微批次数量。下面的代码给出了一个配置流水线并行的例子。
```python
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=1, mode=None)
)
schedule = dict(
num_microbatches = 4 # set the number of microbatches per step
)
```
目前该并行技术仍处于实验开发阶段。
## 序列并行(开发中)
序列并行是为了支持对于长序列数据的建模,这类数据包括文档级别的文本理解以及医学影像分析,该并行技术由论文[Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120)提出。目前该并行技术仍处于实验开发阶段。
# Quick demo
Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can
accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system
can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performances. We provided an example to train ResNet on CIFAR10 data with only one GPU. You can find this example in
`examples\resnet_cifar10_data_parallel` in the repository. Detailed instructions can be found in its `README.md`.
## Multiple GPUs
Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
training process drastically by applying efficient parallelization techiniques, which will be elaborated in
the [Parallelization](parallelization.md) section below.
You can turn the resnet example mentioned above into a multi-GPU training by setting `--nproc_per_node` to be the number of
GPUs you have on your system. We also provide an example of Vision Transformer which relies on
training with more GPUs. You can visit this example in `examples\vit_b16_imagenet_data_parallel`. It has a detailed instructional
`README.md` for you too.
## Sample Training Script
Below is a typical way of how you train the model using
```python
import colossalai
from colossalai.amp import AMP_TYPE
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import get_dataloader
CONFIG = dict(
parallel=dict(
pipeline=1,
tensor=1, mode=None
),
fp16 = dict(
mode=AMP_TYPE.TORCH
),
gradient_accumulation=4,
clip_grad_norm=1.0
)
def run_trainer():
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
host=args.host,
port=args.port,
backend=args.backend)
logger = get_dist_logger()
# instantiate your compoentns
model = MyModel()
optimizer = MyOptimizer(model.parameters(), ...)
train_dataset = TrainDataset()
test_dataset = TestDataset()
train_dataloader = get_dataloader(train_dataset, ...)
test_dataloader = get_dataloader(test_dataset, ...)
lr_scheduler = MyScheduler()
logger.info("components are built")
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader,
lr_scheduler)
trainer = Trainer(engine=engine,
verbose=True)
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
hooks.AccuracyHook(),
hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
hooks.LogMetricByEpochHook(logger),
hooks.LogMemoryByEpochHook(logger),
hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
]
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
epochs=NUM_EPOCH,
hooks=hook_list,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
Zoo. The detailed substitution process is elaborated [here](model.md).
## Features
Colossal-AI provides a collection of parallel training components for you. We aim to support you with your development
of distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools
to kickstart distributed training in a few lines.
- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
- [Friendly trainer and engine](trainer_engine.md)
- [Extensible for new parallelism](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)
# 快速上手
Colossal-AI是一个大规模深度学习系统,其中包含高效的并行技术。该系统可以在多GPU的分布式系统上使用并行技术有效地加速模型训练,同时该系统也可以运行在带有GPU的非分布式系统上。下面是ColossalAI的快速上手指南。
## 单GPU系统
在带有GPU的非分布式系统上进行模型训练时,Colossal-AI可以达到当前的基线效率。[这里](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS)我们给出一个Google
Colab示例展现如何使用Colossal-AI与CIFAR10数据集在非分布式系统上训练一个LeNet模型。
## 多GPU系统
在多GPU的分布式系统上训练深度学习模型时,Colossal-AI可以使用高效的并行技术来显著地加速训练过程,这些技术将在下面的[并行技术](parallelization.md)
章节中被详述。下面的代码将在拥有四个GPU的分布式系统上训练一个ViT模型,其中`HOST`
变量为您分布式系统的IP地址。请注意下面的代码使用了[Slurm](https://slurm.schedmd.com/documentation.html)作业调度系统。
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py`是一个[配置文件](config.md)
,Colossal-AI使用配置文件来定义训练过程中需要用到的参数,比如模型类型、数据集、以及优化器、学习率调度器等。您可以通过编写配置文件的方式来训练不同的模型。`./examples/run_trainer.py`
是一个标准的训练脚本,具体代码已经附在下面。该脚本可以读入配置文件中的训练参数并训练模型。
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer
def run_trainer():
engine, train_dataloader, test_dataloader = colossalai.initialize()
logger = get_dist_logger()
logger.info("engine is built", ranks=[0])
trainer = Trainer(engine=engine,
verbose=True)
logger.info("trainer is built", ranks=[0])
logger.info("start training", ranks=[0])
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
epochs=gpc.config.num_epochs,
hooks_cfg=gpc.config.hooks,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
上面代码中的`model`变量可以被替换为一个自定义的模型或者`Model Zoo`中一个事先定义的模型,以此来达到训练不同模型的目的,[这里](model.md)详述了如何进行这样的替换。
## 系统功能
Colossal-AI提供了一系列并行组件来加速您的模型训练,我们在下面的章节提供了关于这些并行组件的介绍。我们的目标是使您的分布式深度学习模型开发像单卡深度学习模型开发那样方便。
- [数据并行](parallelization.md)
- [1D、2D、2.5D、3D张量并行以及序列并行](parallelization.md)
- [流水线并行](parallelization.md)
- [训练器以及引擎](trainer_engine.md)
- [自定义您的并行模式](add_your_parallel.md)
- [混合精度训练](amp.md)
- [ZeRO优化器](zero.md)
# Colossal-AI Engine & Customize Your Trainer
## Colossal-AI engine
To better understand how `Engine` class works, let's start from the conception of the process function in common
engines. The process function usually controls the behavior over a batch of a dataset, `Engine` class just controls the
process function. Here we give a standard process function in the following code block.
```python
def process_function(dataloader, model, criterion, optim):
optim.zero_grad()
data, label = next(dataloader)
output = model(data)
loss = criterion(output, label)
loss.backward()
optim.setp()
```
The engine class is a high-level wrapper of these frequently-used functions while preserving the PyTorch-like function signature and integrating with our features.
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine
from torchvision.datasets import CIFAR10
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
dataset = CIFAR10(...)
dataloader = colossalai.utils.get_dataloader(dataset)
engine, dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, dataloader)
# exmaple of a training iteratio
for img, label in dataloader:
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
```
More information regarding the class can be found in the API references.
## Customize your trainer
### Overview
To learn how to customize a trainer which meets your needs, let's first give a look at the `Trainer` class. We highly
recommend that you read *Get Started*
section and *Colossal-AI engine* first.
The `Trainer` class enables researchers and engineers to use our system more conveniently. Instead of having to write
your own scripts, you can simply construct your own trainer by calling the `Trainer` class, just like what we did in the
following code block.
```python
trainer = Trainer(engine)
```
After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more
powerful, we incorporate a set of handy tools to the class. For example, you can monitor or record the running states
and metrics which indicate the current performance of the model. These functions are realized by hooks. The `BasicHook`
class allows you to execute your hook functions at specified time. We have already created some practical hooks for you,
as listed below. What you need to do is just picking the right ones which suit your needs. Detailed descriptions of the
class can be found in the API references.
These hook functions will record metrics, elapsed time and memory usage and write them to log after each epoch. Besides,
they print the current loss and accuracy to let users monitor the performance of the model.
```python
import colossalai
from colossalai.trainer import hooks, Trainer
from colossalai.utils import MultiTimer
from colossalai.logging import get_dist_logger
... = colossalai.initialize(...)
timer = MultiTimer()
logger = get_dist_logger()
# if you want to save log to file
logger.log_to_file('./logs/')
trainer = Trainer(
engine=engine,
timer=timer,
logger=logger
)
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
hooks.AccuracyHook(),
hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
hooks.LogMetricByEpochHook(logger),
hooks.LogMemoryByEpochHook(logger),
hooks.LogTimingByEpochHook(timer, logger),
hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
]
trainer.fit(
train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True
)
```
### Hook
If you have your specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook`
class to write a metric collector. These hook functions can be called at different stage in the trainer's life cycle.
Besides, you can define the priorities of all hooks to arrange the execution order of them. More information can be
found in the API references.
### Metric
You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When your
write your own metric hooks, please set the priority carefully and make sure the hook is called before other hooks which
might require the results of the metric hook.
We've already provided some metric hooks and we store metric objects in `runner.states['metrics']`. It is a dictionary
and metrics can be accessed by their names.
# 引擎与训练器
## 引擎
为了更好的理解我们的`Engine`类是如何工作的,我们首先需要了解常见引擎中进程函数的概念。进程函数控制数据集中一个批次的行为,`Engine`类控制的正是该进程函数。我们在下方的代码块中给出了一个标准的进程函数例子。
```python
def process_function(dataloader, model, criterion, optim):
optim.zero_grad()
data, label = next(dataloader)
output = model(data)
loss = criterion(output, label)
loss.backward()
optim.setp()
```
`ignite.engine``keras.engine`中,进程函数需要由用户提供,然而,用户很难为流水线并行编写进程函数。为了向用户提供方便的混合并行,我们提供了具备强大功能的`Engine`
类,该类支持流水线并行,并提供前向传播后向传播不交织的策略。同时,您可以在`Engine`类中使用您事先定义好的学习率调度器来在训练过程中调整学习率。
您在构造引擎时只需要定义`model``criterion``optimizer``lr_scheduler``schedule`等变量即可,下面的代码块给出了一个这样的例子。
**如果你使用`colossalai.initialize`的话,engine会从config文件里自动构建。**
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model)
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.NonPipelineSchedule()
MyEngine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
step_schedule=schedule
)
```
更多该类的相关信息可以在API信息中找到。
## 训练器
要了解如何个性化适应您需求的训练器,首先需要了解我们的`Trainer`类。
`Trainer`类旨在让科研工作者和工程师更加方便地使用我们的系统,您不需要自己写脚本,只需要调用`Trainer`类来构造您的训练器即可,就像下面的代码块中所做的。
```python
MyTrainer = Trainer(my_trainer)
```
在此之后,您可以使用`fit`方法来训练或调用您的模型。除此之外,为了让我们的`Trainer`
类拥有更强大的功能,我们加入了一系列方便您使用的工具。例如,您可以在训练过程中持续监测并记录模型目前的运行状态和表现,这些功能都是通过钩子函数来实现的。我们提供的`BasicHook`
类让您可以在指定时间执行您的钩子函数。如下方的代码块所示,我们事先为您定义好了一些实用的钩子函数,您需要做的就是找到符合您需求的钩子函数。更多该类的相关信息可以在API信息中找到。
```python
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='AccuracyHook'),
dict(type='LossHook'),
dict(type='TensorboardHook', log_dir='./tfb_logs'),
dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```
上面这些钩子函数可以记录模型性能指标,训练时间,显存使用等信息,并在每一个epoch结束后将这些信息写入到日志中。除此之外,这些钩子函数还可以即时输出当前的损失以及准确率,让用户可以监测模型的性能。
### 钩子函数
如果您有个性化需求,您可以继承我们的`BaseHook`类并添加您的钩子函数,或者继承我们的`MetricHook`来编写您需要的度量标准。这些钩子函数可以在`Trainer`
生命周期的12个时间点被执行。更多该类的相关信息可以在API信息中找到。
### 度量标准
您可以通过继承我们的`Metric`类来提供您需要的度量标准,该类需要与`MetricHook`类一同使用。当您编写您的度量标准钩子函数时,请用心设置您的优先级来确保该钩子函数的优先级高于那些需要度量结果的钩子函数。
我们已经为您定义好了一些度量标准钩子函数在`runner.states['metrics']`供您参考。
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment