Commit be85a0f3 authored by Frank Lee, committed by ver217

removed tutorial markdown and refreshed rst files for consistency

parent ca4ae52d
# Colossal-AI
![logo](./docs/images/Colossal-AI_logo.png)
<div align="center">
 <h3> <a href="https://arxiv.org/abs/2110.14883"> Paper </a> | <a href="https://www.colossalai.org/"> Documentation </a> | <a href="https://github.com/hpcaitech/ColossalAI/discussions"> Forum </a> | <a href="https://medium.com/@hpcaitech"> Blog </a></h3>
</div>
@@ -33,7 +35,6 @@ Install and enable CUDA kernel fusion (compulsory installation when using fused
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
## Use Docker
Run the following command to build a docker image from the Dockerfile provided.
@@ -71,18 +72,18 @@ colossalai.launch(
)
# build your model
model = ...
# build your dataset; the dataloader will have a distributed data
# sampler by default
train_dataset = ...
train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True
                                  )
# build your optimizer
optimizer = ...
# build your loss function
criterion = ...
@@ -137,13 +138,15 @@ Colossal-AI provides a collection of parallel training components for you. We ai
distributed deep learning models just like how you write your single-GPU model. We provide friendly tools to kickstart
distributed training in a few lines.
- Data Parallelism
- Pipeline Parallelism
- 1D, 2D, 2.5D, 3D and sequence parallelism
- Friendly trainer and engine
- Extensible for new parallelism
- Mixed Precision Training
- Zero Redundancy Optimizer (ZeRO)
Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
## Cite Us
...
# Add your own parallelism
## Overview
To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
with less effort, we have decoupled the various components of the training lifecycle. You can implement your own
parallelism by simply inheriting from the base class.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
## Process Group Initializer
Parallelism is often managed by process groups, where processes involved in the same parallel algorithm are placed in the same
process group. Different parallel algorithms require different process groups. Colossal-AI provides a
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create a new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...
    NEW_MODE = 'new_mode'  # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed; `ParallelContext` will pass them in for you. If you need to set other
arguments, add them after these six, like `arg1` and `arg2` in the example below. Lastly, register your initializer in the
registry by adding the `@DIST_GROUP_INITIALIZER.register_module` decorator.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups here
        pass
```
Then, you can insert your new initializer into the current mode-to-initializer mapping
in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows
the `ParallelContext` to create your initializer and initialize your desired process groups.
```python
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
)
```
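Once the context has been initialized, the groups created by your initializer can be queried through the global context. The snippet below is a usage sketch rather than a required step; it assumes `colossalai.launch` has already run with the config above, that the global context exposes the helpers shown (`is_initialized`, `get_group`, `get_local_rank`, `get_world_size`), and it uses the `NEW_MODE` member defined in step 1.
```python
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

# query the process group created by MyParallelInitializer
if gpc.is_initialized(ParallelMode.NEW_MODE):
    group = gpc.get_group(ParallelMode.NEW_MODE)            # torch.distributed group
    local_rank = gpc.get_local_rank(ParallelMode.NEW_MODE)  # rank within the group
    world_size = gpc.get_world_size(ParallelMode.NEW_MODE)  # size of the group
```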
## Gradient Handler
Gradient handlers are objects that execute all-reduce operations on parameters' gradients. As different all-reduce
strategies may be needed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own strategies. Currently, the library
uses the normal data-parallel gradient handler, which all-reduces the gradients across data-parallel ranks. The data-parallel
gradient handler is added to the engine automatically if data parallelism is detected. You can add your own
gradient handler as shown below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # implement your all-reduce strategy here
        do_something()
```
Afterwards, you can specify the gradient handler you want to use in your configuration file.
```python
gradient_handlers = [
    dict(type='YourGradientHandler'),
]
```
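For illustration, here is a minimal sketch of a concrete handler that averages gradients across all ranks. The class name is hypothetical, and the sketch assumes the base class stores the model as `self._model` (as the built-in data-parallel handler does) and that `torch.distributed` has been initialized; treat it as a starting point rather than the canonical implementation.
```python
import torch.distributed as dist

from colossalai.engine import BaseGradientHandler
from colossalai.registry import GRADIENT_HANDLER


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # sum each parameter gradient over all ranks, then average
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad.data)
                param.grad.data.div_(dist.get_world_size())
```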
## Schedule
A schedule specifies how the forward and backward passes are executed. Currently, Colossal-AI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_back_step` function, as sketched below.
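As a sketch, a non-pipelined custom schedule might look like the following. The class name and the method signature shown here are assumptions for illustration; check `colossalai.engine.schedule.BaseSchedule` for the actual interface.
```python
from colossalai.engine.schedule import BaseSchedule


class MySchedule(BaseSchedule):

    def forward_back_step(self, engine, data, label, forward_only=False):
        # one plain forward pass, followed by an optional backward pass
        output = engine(data)
        loss = engine.criterion(output, label)
        if not forward_only:
            engine.backward(loss)
        return output, label, loss
```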
# Add your own parallelism

To make it easier for researchers and engineers to extend our system to new large-scale distributed training algorithms, we have decoupled several components of the training lifecycle. You can implement a new parallel technique by inheriting from the base classes.

The main components are:

1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`

## Process Group Initializer

Parallelism is generally managed through process groups: processes that take part in the same parallel algorithm are placed in the same process group, and if several different parallel techniques are used in the system, multiple process groups need to be created. Colossal-AI provides a global context for users to manage their process groups conveniently. If you want to add a new process group, you can define a new class and set it in your configuration file. The code blocks below show how to add a new parallel technique to the system and initialize it.

1. Add your new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...
    NEW_MODE = 'new_mode'  # define your mode here
```
2. Create a subclass of `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`. If you need to set new arguments, you can replace `arg1` and `arg2` in the example below with your own. Lastly, register your initializer in our registry with the `@DIST_GROUP_INITIALIZER.register_module` decorator.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups here
        pass
```
After that, you can insert your new initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`. You can modify that file, or insert the new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in the configuration file. You can pass in your own arguments if your initializer needs any. This allows the `ParallelContext` to create your initializer and initialize the process groups you need.
```python
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
)
```
## Gradient Handler

Gradient handlers perform all-reduce operations on the gradients of model parameters. Since different parallel techniques may require different all-reduce operations, you can inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement your own strategy. Currently, Colossal-AI uses the ordinary data-parallel gradient handler, which all-reduces gradients across all data-parallel ranks; it is created automatically when Colossal-AI detects that data parallelism is in use. You can add a custom gradient handler with the code below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # implement your all-reduce strategy here
        do_something()
```
After that, you can specify the gradient handler you want to use in your configuration file.
```python
gradient_handlers = [
    dict(type='YourGradientHandler'),
]
```
## Schedule

A schedule specifies which operations to execute during the forward and backward passes. Colossal-AI provides pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.engine.schedule.BaseSchedule` and implement your desired behavior. You can also add your schedule to our engine before training the model.
# Mixed precision training

In Colossal-AI, we have incorporated different implementations of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp

The first two rely on the original implementations of [PyTorch](https://pytorch.org/docs/stable/amp.html)
(version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). The last method is similar to Apex O2 level.
Among these methods, apex.amp is not compatible with tensor parallelism. This is because tensors are split across devices
in tensor parallelism, so communication among different processes is required to check whether `inf` or `nan` occurs anywhere in the
model weights. **We modified the torch amp implementation so that it is now compatible with tensor parallelism.**

To use mixed precision training, simply configure the `fp16` field in the config file. Currently, PyTorch and
Apex amp cannot be guaranteed to work with tensor and pipeline parallelism. We recommend torch amp, as it generally
gives better accuracy than naive amp.
The AMP module is designed to be completely modular and can be used independently of other colossalai modules.
If you wish to use only amp in your code base without `colossalai.initialize`, you can use `colossalai.amp.convert_to_amp`.
```python
from colossalai.amp import AMP_TYPE

# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)
```
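The returned objects can then be used much like their plain PyTorch counterparts. Below is a minimal training-step sketch, assuming the wrapped optimizer exposes a `backward(loss)` helper that applies loss scaling (fall back to `loss.backward()` otherwise):
```python
for data, label in train_dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, label)
    optimizer.backward(loss)  # scaled backward instead of loss.backward()
    optimizer.step()
```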
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to `fp16` format
while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,
    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```
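Such a config is consumed through the normal launch path. The sketch below is illustrative only: it assumes the `fp16` dict above lives in `./config.py`, that `model`, `optimizer`, `criterion`, `train_dataloader` and the distributed launch arguments (`rank`, `world_size`, `host`, `port`) are already defined, and that `colossalai.initialize` returns the engine plus dataloaders; the exact return signature may differ by version.
```python
import colossalai

colossalai.launch(config='./config.py',
                  rank=rank,
                  world_size=world_size,
                  host=host,
                  port=port)

# colossalai.initialize reads the fp16 field and wraps the given objects
engine, train_dataloader, _, _ = colossalai.initialize(model,
                                                       optimizer,
                                                       criterion,
                                                       train_dataloader)
```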
## Apex AMP
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
this plugin because it allows finer control over the granularity of mixed precision. For example, the `O2` level (optimization level 2)
will keep batch normalization in `fp32`.
The following code block shows a config file for Apex AMP.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```
## Naive AMP
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
and pipeline parallelism. This AMP mode will cast all operations into fp16.
The following code block shows a config file for this mode.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```
# Mixed precision training

Colossal-AI supports three different ways of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp

The first two rely on the original implementations of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible with tensor parallelism: tensor parallelism splits tensors and stores them on different devices, so making mixed precision work with it requires communication between processes to check whether `inf` or `nan` appears anywhere in the model parameters. We therefore adopted the implementation approach of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

To use mixed precision training, simply configure the `fp16` field in the configuration file. Currently, PyTorch and Apex amp are not guaranteed to be compatible with tensor and pipeline parallelism, so we recommend the last method.
## PyTorch AMP

PyTorch provides mixed precision training in version 1.6 and above. It converts data to `fp16` while keeping some operations in `fp32`. You can configure it in the config file.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,
    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```
## Apex AMP
We rely on [Apex](https://nvidia.github.io/apex/) for this mode of mixed precision training, as it provides fine-grained control over mixed precision. For example, the `O2` level (optimization level 2) keeps batch normalization in `fp32`. The code block below shows a config file for Apex AMP.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```
## Naive AMP

We borrowed the mixed precision training implementation from Megatron-LM; this implementation is compatible with tensor parallelism and pipeline parallelism. The code block below shows a config file for this mode.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```
colossalai.amp.amp\_type
========================
.. automodule:: colossalai.amp.amp_type
   :members:
colossalai.amp.apex\_amp.apex\_amp
==================================
.. automodule:: colossalai.amp.apex_amp.apex_amp
   :members:
colossalai.amp.apex\_amp
========================

.. automodule:: colossalai.amp.apex_amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.apex_amp.apex_amp
colossalai.amp.naive\_amp.naive\_amp
====================================
.. automodule:: colossalai.amp.naive_amp.naive_amp
   :members:
colossalai.amp.naive\_amp
=========================

.. automodule:: colossalai.amp.naive_amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.naive_amp.naive_amp
colossalai.amp
==============

.. automodule:: colossalai.amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.apex_amp
   colossalai.amp.naive_amp
   colossalai.amp.torch_amp

.. toctree::
   :maxdepth: 2

   colossalai.amp.amp_type
colossalai.amp.torch\_amp
=========================

.. automodule:: colossalai.amp.torch_amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.torch_amp.torch_amp
colossalai.amp.torch\_amp.torch\_amp
====================================
.. automodule:: colossalai.amp.torch_amp.torch_amp
   :members:
colossalai.builder
==================

.. automodule:: colossalai.builder
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.builder.builder
   colossalai.builder.pipeline
colossalai.communication
========================

.. automodule:: colossalai.communication
   :members:

.. toctree::
   :maxdepth: 2

@@ -8,7 +12,3 @@ colossalai.communication

   colossalai.communication.p2p
   colossalai.communication.ring
   colossalai.communication.utils
colossalai.constants
====================
.. automodule:: colossalai.constants
   :members:
colossalai.context.process\_group\_initializer.initializer\_model
=================================================================
.. automodule:: colossalai.context.process_group_initializer.initializer_model
   :members:
colossalai.context.process\_group\_initializer.initializer\_moe
===============================================================
.. automodule:: colossalai.context.process_group_initializer.initializer_moe
   :members:
@@ -13,6 +13,8 @@ colossalai.context.process\_group\_initializer

   colossalai.context.process_group_initializer.initializer_2p5d
   colossalai.context.process_group_initializer.initializer_3d
   colossalai.context.process_group_initializer.initializer_data
   colossalai.context.process_group_initializer.initializer_model
   colossalai.context.process_group_initializer.initializer_moe
   colossalai.context.process_group_initializer.initializer_pipeline
   colossalai.context.process_group_initializer.initializer_sequence
   colossalai.context.process_group_initializer.initializer_tensor
...
colossalai.context.random
=========================

.. automodule:: colossalai.context.random
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.context.random.seed_manager