removed tutorial markdown and refreshed rst files for consistency

be85a0f3 · Frank Lee · ver217 · ca4ae52d · be85a0f3 · be85a0f3
Commit be85a0f3 authored Jan 19, 2022 by Frank Lee Committed by ver217 Jan 19, 2022
20 changed files
--- a/docs/colossalai/colossalai.utils.multi_tensor_apply.rst
+++ b/docs/colossalai/colossalai.utils.multi_tensor_apply.rst
-colossalai.nn.multi\_tensor\_apply
+colossalai.utils.multi\_tensor\_apply
-==================================
+=====================================
-.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply
+.. automodule:: colossalai.utils.multi_tensor_apply
   :members:
+.. toctree::
+   :maxdepth: 2
+   colossalai.utils.multi_tensor_apply.multi_tensor_apply
--- a/docs/colossalai/colossalai.utils.rst
+++ b/docs/colossalai/colossalai.utils.rst
 colossalai.utils
 ================
+.. automodule:: colossalai.utils
+   :members:
+.. toctree::
+   :maxdepth: 2
+   colossalai.utils.data_sampler
+   colossalai.utils.gradient_accumulation
+   colossalai.utils.multi_tensor_apply
 .. toctree::
   :maxdepth: 2
@@ -8,12 +19,5 @@ colossalai.utils
   colossalai.utils.checkpointing
   colossalai.utils.common
   colossalai.utils.cuda
-   colossalai.utils.data_sampler
-   colossalai.utils.gradient_accumulation
   colossalai.utils.memory
-   colossalai.utils.multi_tensor_apply
   colossalai.utils.timer
-.. automodule:: colossalai.utils
-   :members:
--- a/docs/colossalai/colossalai.zero.loss_scaler.rst
+++ b/docs/colossalai/colossalai.zero.loss_scaler.rst
+colossalai.zero.loss\_scaler
+============================
+.. automodule:: colossalai.zero.loss_scaler
+   :members:
--- a/docs/colossalai/colossalai.zero.rst
+++ b/docs/colossalai/colossalai.zero.rst
 colossalai.zero
-================
+===============
 .. automodule:: colossalai.zero
   :members:
+.. toctree::
+   :maxdepth: 2
+   colossalai.zero.loss_scaler
+   colossalai.zero.zero_redundancy_optimizer_level_2
+   colossalai.zero.zero_redundancy_optimizer_level_3
--- a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst
+++ b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst
+colossalai.zero.zero\_redundancy\_optimizer\_level\_2
+=====================================================
+.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_2
+   :members:
--- a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst
+++ b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst
+colossalai.zero.zero\_redundancy\_optimizer\_level\_3
+=====================================================
+.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_3
+   :members:
--- a/docs/config.md
+++ b/docs/config.md
-# Config file
-Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using Colossal-AI:
-```python
-# optional
-# three keys: pipeline, tensor
-# data parallel size is inferred
-parallel = dict(
-    pipeline=dict(size=1),
-    tensor=dict(size=4, mode='2d'),
-)
-# optional
-# pipeline or no pipeline schedule
-fp16 = dict(
-    mode=AMP_TYPE.NAIVE,
-    initial_scale=2 ** 8
-)
-# optional
-# configuration for zero
-# you can refer to the Zero Redundancy optimizer and zero offload section for details
-# https://www.colossalai.org/zero.html
-zero = dict(
-    level=<int>,
-    ...
-)
-# optional
-# if you are using complex gradient handling
-# otherwise, you do not need this in your config file
-# default gradient_handlers = None
-gradient_handlers = [dict(type='MyHandler', arg1=1, arg=2), ...]
-# optional
-# specific gradient accumulation size
-# if your batch size is not large enough
-gradient_accumulation = <int>
-# optional
-# add gradient clipping to your engine
-# this config is not compatible with zero and AMP_TYPE.NAIVE
-# but works with AMP_TYPE.TORCH and AMP_TYPE.APEX
-# defautl clip_grad_norm = 0.0
-clip_grad_norm = <float>
-# optional
-# cudnn setting
-# default is like below
-cudnn_benchmark = False,
-cudnn_deterministic=True,
-```
\ No newline at end of file
--- a/docs/config_zh.md
+++ b/docs/config_zh.md
-# 配置文件
-下方代码块中的示例展示了如何在CIFAR10数据集上使用Colossal-AI训练ViT模型。
-```python
-# build train_dataset and train_dataloader from this dictionary
-# It is not compulsory in Config File, instead, you can input this dictionary as an argument into colossalai.initialize() 
-train_data = dict(
-    # dictionary for building Dataset
-    dataset=dict(
-        # the type CIFAR10Dataset has to be registered
-        type='CIFAR10Dataset',
-        root='/path/to/data',
-        # transform pipeline
-        transform_pipeline=[
-            dict(type='Resize', size=IMG_SIZE),
-            dict(type='RandomCrop', size=IMG_SIZE, padding=4),
-            dict(type='RandomHorizontalFlip'),
-            dict(type='ToTensor'),
-            dict(type='Normalize',
-                 mean=[0.4914, 0.4822, 0.4465],
-                 std=[0.2023, 0.1994, 0.2010]),
-        ]
-    ),
-    # dictionary for building Dataloader
-    dataloader=dict(
-        batch_size=BATCH_SIZE,
-        pin_memory=True,
-        # num_workers=1,
-        shuffle=True,
-    )
-)
-# build test_dataset and test_dataloader from this dictionary
-test_data = dict(
-    dataset=dict(
-        type='CIFAR10Dataset',
-        root='/path/to/data',
-        train=False,
-        transform_pipeline=[
-            dict(type='Resize', size=IMG_SIZE),
-            dict(type='ToTensor'),
-            dict(type='Normalize',
-                 mean=[0.4914, 0.4822, 0.4465],
-                 std=[0.2023, 0.1994, 0.2010]
-                 ),
-        ]
-    ),
-    dataloader=dict(
-        batch_size=BATCH_SIZE,
-        pin_memory=True,
-        # num_workers=1,
-    )
-)
-# compulsory
-# build optimizer from this dictionary
-optimizer = dict(
-    # Avaluable types: 'ZeroRedundancyOptimizer_Level_1', 'ZeroRedundancyOptimizer_Level_2', 'ZeroRedundancyOptimizer_Level_3'
-    # 'Adam', 'Lamb', 'SGD', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'FP16Optimizer'
-    type='Adam',
-    lr=0.001,
-    weight_decay=0
-)
-# compulsory
-# build loss function from this dictionary
-loss = dict(
-    # Avaluable types:
-    # 'CrossEntropyLoss2D', 'CrossEntropyLoss2p5D', 'CrossEntropyLoss3D'
-    type='CrossEntropyLoss2D',
-)
-# compulsory
-# build model from this dictionary
-model = dict(
-    # types avaluable: 'PretrainBERT', 'VanillaResNet', 'VisionTransformerFromConfig'
-    type='VisionTransformerFromConfig',
-    # each key-value pair above refers to a layer
-    # input data pass through these layers recursively
-    tensor_splitting_cfg=dict(
-        type='ViTInputSplitter2D',
-    ),
-    embedding_cfg=dict(
-        type='ViTPatchEmbedding2D',
-        img_size=IMG_SIZE,
-        patch_size=PATCH_SIZE,
-        embed_dim=DIM,
-    ),
-    token_fusion_cfg=dict(
-        type='ViTTokenFuser2D',
-        img_size=IMG_SIZE,
-        patch_size=PATCH_SIZE,
-        embed_dim=DIM,
-        drop_rate=0.1
-    ),
-    norm_cfg=dict(
-        type='LayerNorm2D',
-        normalized_shape=DIM,
-        eps=1e-6,
-    ),
-    block_cfg=dict(
-        # ViTBlock is a submodule
-        type='ViTBlock',
-        attention_cfg=dict(
-            type='ViTSelfAttention2D',
-            hidden_size=DIM,
-            num_attention_heads=NUM_ATTENTION_HEADS,
-            attention_dropout_prob=0.,
-            hidden_dropout_prob=0.1,
-            checkpoint=True
-        ),
-        droppath_cfg=dict(
-            type='VanillaViTDropPath',
-        ),
-        mlp_cfg=dict(
-            type='ViTMLP2D',
-            in_features=DIM,
-            dropout_prob=0.1,
-            mlp_ratio=4,
-            checkpoint=True
-        ),
-        norm_cfg=dict(
-            type='LayerNorm2D',
-            normalized_shape=DIM,
-            eps=1e-6,
-        ),
-    ),
-    head_cfg=dict(
-        type='ViTHead2D',
-        hidden_size=DIM,
-        num_classes=NUM_CLASSES,
-    ),
-    embed_dim=DIM,
-    depth=DEPTH,
-    drop_path_rate=0.,
-)
-# hooks are built when initializing trainer
-# possible hooks: 'BaseHook', 'MetricHook','LoadCheckpointHook'
-# 'SaveCheckpointHook','LossHook', 'AccuracyHook', 'Accuracy2DHook'
-# 'LogMetricByEpochHook', 'TensorboardHook','LogTimingByEpochHook', 'LogMemoryByEpochHook' 
-hooks = [
-    dict(type='LogMetricByEpochHook'),
-    dict(type='LogTimingByEpochHook'),
-    dict(type='LogMemoryByEpochHook'),
-    dict(type='Accuracy2DHook'),
-    dict(type='LossHook'),
-    # dict(type='TensorboardHook', log_dir='./tfb_logs'),
-    # dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
-    # dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
-]
-# three keys: pipeline, tensor, data
-# if data=dict(size=1), which means no data parallelization, then there is no need to define it
-parallel = dict(
-    pipeline=dict(size=1),
-    tensor=dict(size=4, mode='2d'),
-)
-# not compulsory
-# pipeline or no pipeline schedule
-fp16 = dict(
-    mode=AMP_TYPE.PARALLEL,
-    initial_scale=2 ** 8
-)
-# not compulsory
-# build learning rate scheduler
-lr_scheduler = dict(
-    type='LinearWarmupLR',
-    warmup_epochs=5
-)
-schedule = dict(
-    num_microbatches=8
-)
-# training stopping criterion
-# you can give num_steps or num_epochs
-num_epochs = 60
-# config logging path
-logging = dict(
-    root_path='./logs'
-)
-```
\ No newline at end of file
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -3,30 +3,8 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
-Colossal-AI documentation
+Colossal-AI API documentation
 ======================================
-.. toctree::
-   :maxdepth: 1
-   :caption: GETTING STARTED
-   installation.md
-   run_demo.md
-.. toctree::
-   :maxdepth: 1
-   :caption: CUSTOMIZE YOUR TRAINING
-   parallelization.md
-   model.md
-   trainer_engine.md
-   amp.md
-   zero.md
-   add_your_parallel.md
-   config.md
 .. toctree::
   :maxdepth: 2
   :caption: API REFERENCE

--- a/docs/index_zh.rst
+++ b/docs/index_zh.rst
-.. Colossal-AI documentation master file, created by
-   sphinx-quickstart on Mon Oct 11 17:05:05 2021.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-夸父AI系统（Colossal-AI）开发文档
-======================================
-.. toctree::
-   :maxdepth: 1
-   :caption: 快速上手指南
-   installation_zh.md
-   run_demo_zh.md
-.. toctree::
-   :maxdepth: 1
-   :caption: 个性化您的训练
-   parallelization_zh.md
-   model_zh.md
-   trainer_engine_zh.md
-   amp_zh.md
-   zero_zh.md
-   add_your_parallel_zh.md
-   config_zh.md
-.. toctree::
-   :maxdepth: 2
-   :caption: API REFERENCE
-   colossalai/colossalai
-Indices and tables
-==================
-* :ref:`genindex`
\ No newline at end of file
--- a/docs/installation.md
+++ b/docs/installation.md
-# Setup
-### PyPI
-```bash
-pip install colossalai
-```
-### Install From Source (Recommended)
-> We **recommend** you to install from source as the Colossal-AI is updating frequently in the early versions. The documentation will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
-```shell
-git clone https://github.com/hpcaitech/ColossalAI.git
-cd ColossalAI
-# install dependency
-pip install -r requirements/requirements.txt
-# install colossalai
-pip install .
-```
-Install and enable CUDA kernel fusion (compulsory installation when using fused optimizer)
-```shell
-pip install -v --no-cache-dir --global-option="--cuda_ext" .
-```
--- a/docs/installation_zh.md
+++ b/docs/installation_zh.md
-# 快速安装
-## 使用pip安装
-```bash
-pip install colossalai
-```
-## 使用源代码安装
-```shell
-git clone git@github.com:hpcaitech/ColossalAI.git
-cd ColossalAI
-# install dependency
-pip install -r requirements/requirements.txt
-# install colossalai
-pip install .
-```
-安装并支持内核融合（使用融合优化器时必须执行下面的代码）
-```
-pip install -v --no-cache-dir --global-option="--cuda_ext" .
-```
--- a/docs/model.md
+++ b/docs/model.md
-# Define your own parallel model
-Let's say that you have a huge MLP model with billions of parameters and its extremely large hidden layer size makes it
-impossible to fit into a single GPU directly. Don't worry, ColossalAI is here to help you sort things out. With the help of ColossalAI, 
-you can write your model in the familiar way in which you used to write models for a single GPU, while ColossalAI automatically 
-splits your model weights and fit them perfectly into a set of GPUs. We give a simple example showing how to write a simple 
-2D parallel model in the Colossal-AI context.
-## Write a simple 2D parallel model
-```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-class MLP_2D(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
-        self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-    def forward(self, x):
-        x = self.linear_1(x)
-        x = self.linear_2(x)
-        return x
-```
-## Use pre-defined model
-For the sake of your convenience, we kindly provide you in our Model Zoo with some prevalent models such as *BERT*, *VIT*, 
-and *MLP-Mixer*. Feel free to customize them into different sizes to fit into your special needs.
--- a/docs/model_zh.md
+++ b/docs/model_zh.md
-# 定义符合您需求的并行模型
-如果您在训练一个拥有数亿级参数的巨大MLP模型，那么该模型一定无法在单个GPU上直接进行训练，不用担心，Colossal-AI可以帮您解决这一问题。您仍旧可以像写单GPU模型那样来写您的模型，Colossal-AI会按照您的并行设置自动将模型参数进行切割，并将它们均匀地存入一组GPU中。下面是一个简单的例子，来向您展示如何在Colossal-AI环境下写一个2D张量并行的模型。
-## 简单的2D张量并行模型
-```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-class MLP_2D(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
-        self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-    def forward(self, x):
-        x = self.linear_1(x)
-        x = self.linear_2(x)
-        return x
-```
-## 使用事先定义好的模型
-为了您使用的方便，我们事先在我们的Model Zoo中定义好了一些现在流行的模型，比如*BERT*、*VIT*以及*MLP-Mixer*等，您可以根据您的需求来自定义这些模型的规模。
--- a/docs/parallelization.md
+++ b/docs/parallelization.md
-# Parallelization
-## Configure the Combination of Parallelization
-We support multiple parallelization in our library.
-Hybrid parallelism in our codebase refers to namely the combination of data parallelism, pipeline parallelism 
-and tensor parallelism (1D, 2D, 2.5D, 3D). Each parallelism requires different network topology and thus 
-different initializers for distributed process group. You can initialize the corresponding process group by 
-setting `parallel` in our config. The parallel configuration can be easily deployed by a dictionary in 
-configuration file. The configuration dictionary must obey the following format. Data parallel size will be 
-inferred automatically based on your inputs to pipeline parallelism and tensor parallelism. The distributed 
-environment will set up by `colossalai.launch`.
-```python
-# sampler format
-parallel = dict(
-    pipeline=dict("size": int),
-    tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
-)
-# this is ok
-parallel = dict(
-    pipeline=dict(size=2),
-    tensor=dict(size=4, mode='2d')
-)
-# this is ok
-parallel = dict(
-    pipeline=2,
-    tensor=dict(size=4, mode='2d')
-)
-# this is not ok
-# as you need to specify the mode for tensor parallelism
-parallel = dict(
-    pipeline=2,
-    tensor=4
-)
-# this is ok as well as tensor will be default to size 1 
-# and mode None
-parallel = dict(
-    pipeline=2
-)
-# this is ok as well as pipeline will default to size 1
-parallel = dict(
-    tensor=dict(size=4, mode='2d')
-)
-```
-The name of the dictionary variable should be **parallel**. All the arguments even **parallel** itself are optional and
-data, pipeline, tensor parallel size will be set to defaulted value 1. The value of data, pipeline and tensor can be a
-int representing the size of specific parallel dimension or a dictionary with a key called "size". The key "mode"
-represents the way of tensor parallelism. 
-**You can choose to not have 'parallel' in your configuration and both pipelineand tensor will default to size 1.**
-## Data Parallel
-Data parallel is the most common way to distribute your training task by splitting data into several shards and train on
-a single shard on each device. The configuration for data parallel is detected automatically and set for you. You do not
-have to explicitly set them in your configurations. There are two ways to handle the all-reduce in data parallel in Colossal-AI.
-1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
-2. Otherwise, PyTorch DistributedDataParallel will be used
-In most cases, you will be using the second mode unless you have complex handling of the gradients.
-## 1D, 2D, 2.5D and 3D Parallel
-To enable hybrid parallelism, we provide an array of tensor parallelism. We provide the list of papers which match each
-tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)  
-  2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
-  outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices where
-  $N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)  
-  Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
-  further parallelizes 2D tensor parallelism. An amount of $P = N^2 ∗ d$ processors are arranged into $d$ layers, where
-  each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)  
-  We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
-  achieves the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage
-  are evenly distributed through optimized load balancing of parameters as well as activations.
-```python
-# 1D parallel
-parallel = dict(
-    tensor=dict(size=4, mode='1d')
-)
-# 2D parallel
-parallel = dict(
-    tensor=dict(size=4, mode='2d')
-)
-# 2.5D parallel
-parallel = dict(
-    tensor=dict(size=8, mode='2.5d', depth=2)
-)
-# 3D parallel
-parallel = dict(
-    tensor=dict(size=8, mode='3d')
-)
-```
-Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed 
-operator. For example, if you mode is '2d', you can use `colossalai.nn.Linear2D` in you model construction.
-## Pipeline Parallel (experimental)
-Pipeline parallelism is to split the model into several partitions by layer. For example, let's assume we have a simple
-model which consists of two linear layer. We have two GPUs, and we can allocate the first linear layer to the first GPU
-and the second layer to the second GPU. 
-You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, Colossal-AI
-will automatically creates the pipeline schedule which defines the forward and backward step. 
-```python
-parallel = dict(
-    pipeline=dict(size=4), # number of pipeline stages
-)
-```
-As PyTorch is based on dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism, you have the following two ways to split your model,
-1. Split your model directly. Below is an exmaple of resnet split into two pipeline stages.
-```python
-from torchvision.models import resnet18
-from colossalai.core import global_context as gpc
-model = resnet18(num_classes=10)
-if gpc.get_local_rank(ParallelMode.PIPELINE) == 0:
-    model = nn.Sequential(
-        model.conv1,
-        model.bn1,
-        model.relu,
-        model.maxpool,
-        model.layer1,
-        model.layer2
-    )
-elif gpc.get_local_rank(ParallelMode.PIPELINE) == 1:
-    from functools import partial
-    class Flatten(nn.Module):
-        def forward(self, x):
-            return torch.flatten(x, 1)
-    model = nn.Sequential(
-        model.layer3,
-        model.layer4,
-        model.avgpool,
-        Flatten(),
-        model.fc
-    )
-```
-2. Make sure your model inherit `colossalai.nn.model.ModelFromConfig` and registered into the
-`MODELS` registry. Define the `self.layers_cfg` attribute. 
-Pass in a dict/Config object which specifies the parameters of your model. 
-Use `colossalai.builder.pipeline.build_pipeline_model_from_cfg` to partition the layers.
-```python
-from colossalai.builder import build_pipeline_model_from_cfg
-from colossalai.nn.model import ModelFromConfig
-from colossalai.registry import MODELS
-@MODELS.register_module
-class MyModel(ModelFromConfig):
-    def __init__(self, arg1, arg2, ...):
-        ...
-        self.layers_cfg = [
-            dict(type='Linear', in_features=3, out_features=512),
-            dict(type='Linear', in_features=512, out_features=512),
-            ...
-        ]
-model_cfg = dict(
-    type='MyModel',
-    arg1=1,
-    arg2=2
-    ...
-)
-# from config 
-model = build_pipeline_model_from_cfg(model_cfg, num_chunks=1)
-# from torch.nn.Sequential
-# model = build_pipeline_model(sequential_model, num_model_chunks)
-```
-When your model is split into partitions, you can use PipelineSchedule to execute training.
-```python
-import colossalai
-from colossalai.engine.schedule import PipelineSchedule
-engine, train_dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, train_dataloader) 
-schedule = PipelineSchedule(num_microbatches=4)
-# interleaved pipeline
-# schedule = InterleavedPipelineSchedule(num_microbatches=4, num_model_chunks=2)
-# execute a training epoch
-data_iter = iter(train_dataloader)
-for i in range(len(train_dataloader)):
-    output, label, loss = schedule.forward_backward_step(engine,
-                                                        data_iter,
-                                                        forward_only=False,
-                                                        )
-```
-This feature is still in development and is only experimental for now.
-## Sequence Parallel (experimental)
-Sequence parallel is to support long-sequence modelling such as document-level text understanding and medical imaging.
-This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
-This feature is still in development and is only experimental for now.
--- a/docs/parallelization_zh.md
+++ b/docs/parallelization_zh.md
-# 并行技术
-## 配置并行技术组合
-Colossal-AI支持多种并行技术，包括数据并行、张量并行（1D、2D、2.5D、3D）、流水线并行以及序列并行。您可以通过更改配置文件中的`parallel`字典变量来初始化分布式系统中的进程组，配置文件中的`parallel`字典变量必须满足下面的格式。数据并行的规模可以通过`parallel`中流水线并行的规模和张量并行的规模计算得出。
-```python
-parallel = dict(
-    pipeline=dict("size": int),
-    tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
-)
-```
-注意该字典变量的名称必须为**parallel**。该变量中所有的参数，包括`parallel`本身都是非必需的，如果您的代码中没有提供该变量，则所有并行规模都将被设定为默认值1，即不使用任何并行技术的情况。`parallel`中data、pipeline以及tensor的值分别代表了数据并行、流水线并行、以及张量并行的规模，而`mode`的值代表了张量并行的模式。
-## 数据并行
-数据并行是一种最常见的并行技术，可以将数据分成几个不同的部分，并对每一个部分在一台设备上进行训练。Colossal-AI可以自动检测数据并行设置并为您设置好环境，您不需要在您的环境配置中显式地设置。当数据并行规模大于1时，Colossal-AI会自动为数据读取器增加分布式数据采样器，以此来达到切分数据集的目的。
-## 1D、2D、2.5D与3D张量并行
-为了方便混合并行技术，我们提供了一系列的张量并行技术，同时下面罗列了每一种张量并行技术对应的论文，这些张量并行技术需要Colossal-AI提供的分布式层结构的支持。
- 1D：[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D：[An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
-2维张量并行依赖SUMMA矩阵乘法技术，其在两个不同的维度上对于输入数据进行切分。切分后的张量分布在一个的2维网格上，使用的总设备数量为$P = N^2$，其中$N$为一个维度上的切分张量数量。
- 2.5D：[2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
-2.5维并行技术受到了2.5D矩阵乘法的启发，其对于2维张量并行的结果进行进一步切分，在$d$层上面安排$P = N^2 ∗ d$个处理器，相应地，矩阵乘法操作也被切分为$d$份在不同的层上面进行。
- 3D：[Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
-我们还引入了3维张量并行技术，该技术在一个3维处理器立方体中对神经网络参数进行并行化。使用$P$个处理器时，该并行技术可以在付出$O(P^{1/3})$的通信开销的情况下达到最优表现，且计算资源和内存使用都可以在$P$个处理器上达到平均分配。
-使用上述几种张量并行的`parallel`字典变量示例参见下方代码。
-```python
-# 1D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=4, mode='1d')
-)
-# 2D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=4, mode='2d')
-)
-# 2.5D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=8, mode='2.5d', depth=2)
-)
-# 3D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=8, mode='3d')
-)
-```
-## 流水线并行（开发中）
-流水线并行指的是在将深度学习模型按照层切分为几个不同的部分。例如，对于一个由两个线性层组成的简单模型，我们可以使用两个GPU，并把第一个线性层的工作分配给一个GPU，把第二个线性层的工作分配给另一个GPU。当然这个例子只是为了说明流水线并行的工作方式，没有实际意义。
-由于PyTorch的计算基于动态计算图，所以在执行前无法确定计算流。为了支持PyTorch中的流水线并行，您需要为您的模型类加入一个额外的特征`layers_cfg`，使Colossal-AI清楚具体的计算流程，`colossalai.nn.VanillaResNet`给出了一个您可以参考的示例。
-```python
-from colossalai.nn import BaseModel
-import torch
-class VanillaResNet(BaseModel):
-    def __init__(
-            self,
-            num_cls: int,
-            block_type: str,
-            layers: List[int],
-            norm_layer_type: str = 'BatchNorm2d',
-            in_channels: int = 3,
-            groups: int = 1,
-            width_per_group: int = 64,
-            zero_init_residual: bool = False,
-            replace_stride_with_dilation: Optional[List[bool]] = None,
-            dilations=(1, 1, 1, 1)
-    ) -> None:
-        super().__init__()
-        ... # some model params
-        self.layers_cfg = [
-            # conv1
-            dict(type='Conv2d',
-                 in_channels=in_channels,
-                 out_channels=self.inplanes,
-                 kernel_size=7,
-                 stride=2,
-                 padding=3,
-                 bias=False),
-            # bn1
-            dict(
-                type=norm_layer_type,
-                num_features=self.inplanes
-            ),
-            # relu
-            dict(
-                type='ReLU',
-                inplace=True
-            ),
-            # maxpool
-            dict(
-                type='MaxPool2d',
-                kernel_size=3,
-                stride=2,
-                padding=1
-            ),
-            # layer 1
-            dict(
-                inplanes=self.inplanes,
-                planes=64,
-                blocks=self.blocks[0],
-                dilation=self.dilations[0],
-                **self.reslayer_common_cfg
-            ),
-            # layer 2
-            dict(
-                inplanes=64 * self.block_expansion,
-                planes=128,
-                blocks=self.blocks[1],
-                stride=2,
-                dilate=replace_stride_with_dilation[0],
-                dilation=self.dilations[1],
-                **self.reslayer_common_cfg
-            ),
-            # layer  3
-            dict(
-                inplanes=128 * self.block_expansion,
-                planes=256,
-                blocks=layers[2],
-                stride=2,
-                dilate=replace_stride_with_dilation[1],
-                dilation=self.dilations[2],
-                **self.reslayer_common_cfg
-            ),
-            # layer 4
-            dict(
-                inplanes=256 * self.block_expansion,
-                planes=512,
-                blocks=layers[3], stride=2,
-                dilate=replace_stride_with_dilation[2],
-                dilation=self.dilations[3],
-                **self.reslayer_common_cfg
-            ),
-            # avg pool
-            dict(
-                type='AdaptiveAvgPool2d',
-                output_size=(1, 1)
-            ),
-            # flatten
-            dict(
-                type='LambdaWrapper',
-                func=lambda mod, x: torch.flatten(x, 1)
-            ),
-            # linear
-            dict(
-                type='Linear',
-                in_features=512 * self.block_expansion,
-                out_features=num_cls
-            )
-        ]
-```
-您可以在配置文件中手动设置流水线并行的级数，当流水线的并行级数大于1时，Colossal-AI将会自动创建定义前向传播和后向传播的流水线调度程序。同时，您还可以在配置文件中的`schedule`字典变量来定义每一个步骤中训练的微批次数量。下面的代码给出了一个配置流水线并行的例子。
-```python
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=1, mode=None)
-)
-schedule = dict(
-    num_microbatches = 4 # set the number of microbatches per step
-)
-```
-目前该并行技术仍处于实验开发阶段。
-## 序列并行（开发中）
-序列并行是为了支持对于长序列数据的建模，这类数据包括文档级别的文本理解以及医学影像分析，该并行技术由论文[Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120)提出。目前该并行技术仍处于实验开发阶段。
--- a/docs/run_demo.md
+++ b/docs/run_demo.md
-# Quick demo
-Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can
-accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system
-can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.
-## Single GPU
-Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
-performances. We provided an example to train ResNet on CIFAR10 data with only one GPU. You can find this example in 
-`examples\resnet_cifar10_data_parallel` in the repository. Detailed instructions can be found in its `README.md`.
-## Multiple GPUs
-Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
-training process drastically by applying efficient parallelization techiniques, which will be elaborated in
-the [Parallelization](parallelization.md) section below. 
-You can turn the resnet example mentioned above into a multi-GPU training by setting `--nproc_per_node` to be the number of 
-GPUs you have on your system. We also provide an example of Vision Transformer which relies on
-training with more GPUs. You can visit this example in `examples\vit_b16_imagenet_data_parallel`. It has a detailed instructional 
-`README.md` for you too.
-## Sample Training Script
-Below is a typical way of how you train the model using 
-```python
-import colossalai
-from colossalai.amp import AMP_TYPE
-from colossalai.logging import get_dist_logger
-from colossalai.trainer import Trainer, hooks
-from colossalai.utils import get_dataloader
-CONFIG = dict(
-    parallel=dict(
-        pipeline=1,
-        tensor=1, mode=None
-    ),
-    fp16 = dict(
-        mode=AMP_TYPE.TORCH
-    ),
-    gradient_accumulation=4,
-    clip_grad_norm=1.0
-)
-def run_trainer():
-    parser = colossalai.get_default_parser()
-    args = parser.parse_args()
-    colossalai.launch(config=CONFIG,
-                      rank=args.rank,
-                      world_size=args.world_size,
-                      host=args.host,
-                      port=args.port,
-                      backend=args.backend)
-    logger = get_dist_logger()
-    # instantiate your compoentns
-    model = MyModel()
-    optimizer = MyOptimizer(model.parameters(), ...)
-    train_dataset = TrainDataset()
-    test_dataset = TestDataset()
-    train_dataloader = get_dataloader(train_dataset, ...)
-    test_dataloader = get_dataloader(test_dataset, ...)
-    lr_scheduler = MyScheduler()
-    logger.info("components are built")
-    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, 
-                                                                                    optimizer, 
-                                                                                    criterion, 
-                                                                                    train_dataloader, 
-                                                                                    test_dataloader, 
-                                                                                    lr_scheduler)
-    trainer = Trainer(engine=engine,
-                      verbose=True)
-    hook_list = [
-        hooks.LossHook(),
-        hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
-        hooks.AccuracyHook(),
-        hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
-        hooks.LogMetricByEpochHook(logger),
-        hooks.LogMemoryByEpochHook(logger),
-        hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
-    ]
-    trainer.fit(
-        train_dataloader=train_dataloader,
-        test_dataloader=test_dataloader,
-        epochs=NUM_EPOCH,
-        hooks=hook_list,
-        display_progress=True,
-        test_interval=2
-    )
-if __name__ == '__main__':
-    run_trainer()
-```
-Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
-Zoo. The detailed substitution process is elaborated [here](model.md).
-## Features
-Colossal-AI provides a collection of parallel training components for you. We aim to support you with your development
-of distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools
-to kickstart distributed training in a few lines.
- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
- [Friendly trainer and engine](trainer_engine.md)
- [Extensible for new parallelism](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)
--- a/docs/run_demo_zh.md
+++ b/docs/run_demo_zh.md
-# 快速上手
-Colossal-AI是一个大规模深度学习系统，其中包含高效的并行技术。该系统可以在多GPU的分布式系统上使用并行技术有效地加速模型训练，同时该系统也可以运行在带有GPU的非分布式系统上。下面是ColossalAI的快速上手指南。
-## 单GPU系统
-在带有GPU的非分布式系统上进行模型训练时，Colossal-AI可以达到当前的基线效率。[这里](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS)我们给出一个Google
-Colab示例展现如何使用Colossal-AI与CIFAR10数据集在非分布式系统上训练一个LeNet模型。
-## 多GPU系统
-在多GPU的分布式系统上训练深度学习模型时，Colossal-AI可以使用高效的并行技术来显著地加速训练过程，这些技术将在下面的[并行技术](parallelization.md)
-章节中被详述。下面的代码将在拥有四个GPU的分布式系统上训练一个ViT模型，其中`HOST`
-变量为您分布式系统的IP地址。请注意下面的代码使用了[Slurm](https://slurm.schedmd.com/documentation.html)作业调度系统。
-```bash
-HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
-```
-`./configs/vit/vit_2d.py`是一个[配置文件](config.md)
-，Colossal-AI使用配置文件来定义训练过程中需要用到的参数，比如模型类型、数据集、以及优化器、学习率调度器等。您可以通过编写配置文件的方式来训练不同的模型。`./examples/run_trainer.py`
-是一个标准的训练脚本，具体代码已经附在下面。该脚本可以读入配置文件中的训练参数并训练模型。
-```python
-import colossalai
-from colossalai.core import global_context as gpc
-from colossalai.logging import get_dist_logger
-from colossalai.trainer import Trainer
-def run_trainer():
-    engine, train_dataloader, test_dataloader = colossalai.initialize()
-    logger = get_dist_logger()
-    logger.info("engine is built", ranks=[0])
-    trainer = Trainer(engine=engine,
-                      verbose=True)
-    logger.info("trainer is built", ranks=[0])
-    logger.info("start training", ranks=[0])
-    trainer.fit(
-        train_dataloader=train_dataloader,
-        test_dataloader=test_dataloader,
-        epochs=gpc.config.num_epochs,
-        hooks_cfg=gpc.config.hooks,
-        display_progress=True,
-        test_interval=2
-    )
-if __name__ == '__main__':
-    run_trainer()
-```
-上面代码中的`model`变量可以被替换为一个自定义的模型或者`Model Zoo`中一个事先定义的模型，以此来达到训练不同模型的目的，[这里](model.md)详述了如何进行这样的替换。
-## 系统功能
-Colossal-AI提供了一系列并行组件来加速您的模型训练，我们在下面的章节提供了关于这些并行组件的介绍。我们的目标是使您的分布式深度学习模型开发像单卡深度学习模型开发那样方便。
- [数据并行](parallelization.md)
- [1D、2D、2.5D、3D张量并行以及序列并行](parallelization.md)
- [流水线并行](parallelization.md)
- [训练器以及引擎](trainer_engine.md)
- [自定义您的并行模式](add_your_parallel.md)
- [混合精度训练](amp.md)
- [ZeRO优化器](zero.md)
--- a/docs/trainer_engine.md
+++ b/docs/trainer_engine.md
-# Colossal-AI Engine & Customize Your Trainer
-## Colossal-AI engine
-To better understand how `Engine` class works, let's start from the conception of the process function in common
-engines. The process function usually controls the behavior over a batch of a dataset, `Engine` class just controls the
-process function. Here we give a standard process function in the following code block.
-```python
-def process_function(dataloader, model, criterion, optim):
-    optim.zero_grad()
-    data, label = next(dataloader)
-    output = model(data)
-    loss = criterion(output, label)
-    loss.backward()
-    optim.setp()
-```
-The engine class is a high-level wrapper of these frequently-used functions while preserving the PyTorch-like function signature and integrating with our features.
-```python
-import torch
-import torch.nn as nn
-import torchvision.models as models
-import colossalai
-from colossalai.engine import Engine
-from torchvision.datasets import CIFAR10
-model = models.resnet18()
-criterion = nn.CrossEntropyLoss()
-optimizer = torch.optim.Adam(model.parameters())
-dataset = CIFAR10(...)
-dataloader = colossalai.utils.get_dataloader(dataset)
-engine, dataloader, _,  _ = colossalai.initialize(model, optimizer, criterion, dataloader)
-# exmaple of a training iteratio
-for img, label in dataloader:
-    engine.zero_grad()
-    output = engine(img)
-    loss = engine.criterion(output, label)
-    engine.backward(loss)
-    engine.step()
-```
-More information regarding the class can be found in the API references.
-## Customize your trainer
-### Overview
-To learn how to customize a trainer which meets your needs, let's first give a look at the `Trainer` class. We highly
-recommend that you read *Get Started*
-section and *Colossal-AI engine* first.
-The `Trainer` class enables researchers and engineers to use our system more conveniently. Instead of having to write
-your own scripts, you can simply construct your own trainer by calling the `Trainer` class, just like what we did in the
-following code block.
-```python
-trainer = Trainer(engine)
-```
-After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more
-powerful, we incorporate a set of handy tools to the class. For example, you can monitor or record the running states
-and metrics which indicate the current performance of the model. These functions are realized by hooks. The `BasicHook`
-class allows you to execute your hook functions at specified time. We have already created some practical hooks for you,
-as listed below. What you need to do is just picking the right ones which suit your needs. Detailed descriptions of the
-class can be found in the API references.
-These hook functions will record metrics, elapsed time and memory usage and write them to log after each epoch. Besides,
-they print the current loss and accuracy to let users monitor the performance of the model.
-```python
-import colossalai
-from colossalai.trainer import hooks, Trainer
-from colossalai.utils import MultiTimer
-from colossalai.logging import get_dist_logger
-... = colossalai.initialize(...)
-timer = MultiTimer()
-logger = get_dist_logger()
-# if you want to save log to file
-logger.log_to_file('./logs/')
-trainer = Trainer(
-    engine=engine,
-    timer=timer,
-    logger=logger
-)
-hook_list = [
-    hooks.LossHook(),
-    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
-    hooks.AccuracyHook(),
-    hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
-    hooks.LogMetricByEpochHook(logger),
-    hooks.LogMemoryByEpochHook(logger),
-    hooks.LogTimingByEpochHook(timer, logger),
-    hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
-]
-trainer.fit(
-    train_dataloader=train_dataloader,
-    epochs=NUM_EPOCHS,
-    test_dataloader=test_dataloader,
-    test_interval=1,
-    hooks=hook_list,
-    display_progress=True
-)
-```
-### Hook
-If you have your specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook`
-class to write a metric collector. These hook functions can be called at different stage in the trainer's life cycle.
-Besides, you can define the priorities of all hooks to arrange the execution order of them. More information can be
-found in the API references.
-### Metric
-You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When your
-write your own metric hooks, please set the priority carefully and make sure the hook is called before other hooks which
-might require the results of the metric hook.
-We've already provided some metric hooks and we store metric objects in `runner.states['metrics']`. It is a dictionary
-and metrics can be accessed by their names.
--- a/docs/trainer_engine_zh.md
+++ b/docs/trainer_engine_zh.md
-# 引擎与训练器
-## 引擎
-为了更好的理解我们的`Engine`类是如何工作的，我们首先需要了解常见引擎中进程函数的概念。进程函数控制数据集中一个批次的行为，`Engine`类控制的正是该进程函数。我们在下方的代码块中给出了一个标准的进程函数例子。
-```python
-def process_function(dataloader, model, criterion, optim):
-    optim.zero_grad()
-    data, label = next(dataloader)
-    output = model(data)
-    loss = criterion(output, label)
-    loss.backward()
-    optim.setp()
-```
-在`ignite.engine`与`keras.engine`中，进程函数需要由用户提供，然而，用户很难为流水线并行编写进程函数。为了向用户提供方便的混合并行，我们提供了具备强大功能的`Engine`
-类，该类支持流水线并行，并提供前向传播后向传播不交织的策略。同时，您可以在`Engine`类中使用您事先定义好的学习率调度器来在训练过程中调整学习率。
-您在构造引擎时只需要定义`model`、`criterion`、`optimizer`、`lr_scheduler`与`schedule`等变量即可，下面的代码块给出了一个这样的例子。
-**如果你使用`colossalai.initialize`的话，engine会从config文件里自动构建。**
-```python
-import torch
-import torch.nn as nn
-import torchvision.models as models
-import colossalai
-from colossalai.engine import Engine
-model = models.resnet18()
-criterion = nn.CrossEntropyLoss()
-optimizer = torch.optim.Adam(model)
-lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
-schedule = colossalai.engine.NonPipelineSchedule()
-MyEngine = Engine(
-    model=model,
-    criterion=criterion,
-    optimizer=optimizer,
-    step_schedule=schedule
-)
-```
-更多该类的相关信息可以在API信息中找到。
-## 训练器
-要了解如何个性化适应您需求的训练器，首先需要了解我们的`Trainer`类。
-`Trainer`类旨在让科研工作者和工程师更加方便地使用我们的系统，您不需要自己写脚本，只需要调用`Trainer`类来构造您的训练器即可，就像下面的代码块中所做的。
-```python
-MyTrainer = Trainer(my_trainer)
-```
-在此之后，您可以使用`fit`方法来训练或调用您的模型。除此之外，为了让我们的`Trainer`
-类拥有更强大的功能，我们加入了一系列方便您使用的工具。例如，您可以在训练过程中持续监测并记录模型目前的运行状态和表现，这些功能都是通过钩子函数来实现的。我们提供的`BasicHook`
-类让您可以在指定时间执行您的钩子函数。如下方的代码块所示，我们事先为您定义好了一些实用的钩子函数，您需要做的就是找到符合您需求的钩子函数。更多该类的相关信息可以在API信息中找到。
-```python
-hooks = [
-    dict(type='LogMetricByEpochHook'),
-    dict(type='LogTimingByEpochHook'),
-    dict(type='LogMemoryByEpochHook'),
-    dict(type='AccuracyHook'),
-    dict(type='LossHook'),
-    dict(type='TensorboardHook', log_dir='./tfb_logs'),
-    dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
-    dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
-]
-```
-上面这些钩子函数可以记录模型性能指标，训练时间，显存使用等信息，并在每一个epoch结束后将这些信息写入到日志中。除此之外，这些钩子函数还可以即时输出当前的损失以及准确率，让用户可以监测模型的性能。
-### 钩子函数
-如果您有个性化需求，您可以继承我们的`BaseHook`类并添加您的钩子函数，或者继承我们的`MetricHook`来编写您需要的度量标准。这些钩子函数可以在`Trainer`
-生命周期的12个时间点被执行。更多该类的相关信息可以在API信息中找到。
-### 度量标准
-您可以通过继承我们的`Metric`类来提供您需要的度量标准，该类需要与`MetricHook`类一同使用。当您编写您的度量标准钩子函数时，请用心设置您的优先级来确保该钩子函数的优先级高于那些需要度量结果的钩子函数。
-我们已经为您定义好了一些度量标准钩子函数在`runner.states['metrics']`供您参考。