Commit be85a0f3 authored by Frank Lee, committed by ver217

removed tutorial markdown and refreshed rst files for consistency

parent ca4ae52d
# Colossal-AI
![logo](./docs/images/Colossal-AI_logo.png)
<div align="center">
 <h3> <a href="https://arxiv.org/abs/2110.14883"> Paper </a> | <a href="https://www.colossalai.org/"> Documentation </a> | <a href="https://github.com/hpcaitech/ColossalAI/discussions"> Forum </a> | <a href="https://medium.com/@hpcaitech"> Blog </a></h3>
</div>
@@ -33,7 +35,6 @@ Install and enable CUDA kernel fusion (compulsory installation when using fused
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
## Use Docker
Run the following command to build a docker image from the Dockerfile provided.
@@ -71,18 +72,18 @@ colossalai.launch(
)
# build your model
model = ...
# build your dataset; the dataloader will have a distributed data
# sampler by default
train_dataset = ...
train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True
                                  )
# build your optimizer
optimizer = ...
# build your loss function
criterion = ...
@@ -137,13 +138,15 @@ Colossal-AI provides a collection of parallel training components for you. We ai
distributed deep learning models just like how you write your single-GPU model. We provide friendly tools to kickstart
distributed training in a few lines.
- Data Parallelism
- Pipeline Parallelism
- 1D, 2D, 2.5D, 3D and sequence parallelism
- Friendly trainer and engine
- Extensible for new parallelism
- Mixed Precision Training
- Zero Redundancy Optimizer (ZeRO)
Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
## Cite Us
...
# Add your own parallelism
## Overview
To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
with less effort, we have decoupled the various components of the training lifecycle. You can implement your own
parallelism by simply inheriting from the base class.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
## Process Group Initializer
Parallelism is often managed by process groups, where processes involved in the same parallel algorithm are placed in the same
process group. Different parallel algorithms require different process groups. Colossal-AI provides a
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create a new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...
    NEW_MODE = 'new_mode'  # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed; `ParallelContext` will pass them in for you. If you need to set other
arguments, add them after these six, like `arg1` and `arg2` in the example below. Lastly, register your initializer in the
registry by adding the `@DIST_GROUP_INITIALIZER.register_module` decorator.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups here
        pass
```
Then, you can insert your new initializer into the current mode-to-initializer mapping
in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows
the `ParallelContext` to create your initializer and initialize your desired process groups.
```python
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
)
```
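Once the context has been initialized, the groups created by your initializer can be queried through the global context. The snippet below is a usage sketch rather than a required step; it assumes `colossalai.launch` has already run with the config above, that the global context exposes the helpers shown (`is_initialized`, `get_group`, `get_local_rank`, `get_world_size`), and it uses the `NEW_MODE` member defined in step 1.
```python
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

# query the process group created by MyParallelInitializer
if gpc.is_initialized(ParallelMode.NEW_MODE):
    group = gpc.get_group(ParallelMode.NEW_MODE)            # torch.distributed group
    local_rank = gpc.get_local_rank(ParallelMode.NEW_MODE)  # rank within the group
    world_size = gpc.get_world_size(ParallelMode.NEW_MODE)  # size of the group
```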
## Gradient Handler
Gradient handlers are objects that execute all-reduce operations on parameters' gradients. As different all-reduce
strategies may be needed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own strategies. Currently, the library
uses the normal data-parallel gradient handler, which all-reduces the gradients across data-parallel ranks. The data-parallel
gradient handler is added to the engine automatically if data parallelism is detected. You can add your own
gradient handler as shown below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # implement your all-reduce strategy here
        do_something()
```
Afterwards, you can specify the gradient handler you want to use in your configuration file.
```python
gradient_handlers = [
    dict(type='YourGradientHandler'),
]
```
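For illustration, here is a minimal sketch of a concrete handler that averages gradients across all ranks. The class name is hypothetical, and the sketch assumes the base class stores the model as `self._model` (as the built-in data-parallel handler does) and that `torch.distributed` has been initialized; treat it as a starting point rather than the canonical implementation.
```python
import torch.distributed as dist

from colossalai.engine import BaseGradientHandler
from colossalai.registry import GRADIENT_HANDLER


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # sum each parameter gradient over all ranks, then average
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad.data)
                param.grad.data.div_(dist.get_world_size())
```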
## Schedule
A schedule specifies how the forward and backward passes are executed. Currently, Colossal-AI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_back_step` function, as sketched below.
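As a sketch, a non-pipelined custom schedule might look like the following. The class name and the method signature shown here are assumptions for illustration; check `colossalai.engine.schedule.BaseSchedule` for the actual interface.
```python
from colossalai.engine.schedule import BaseSchedule


class MySchedule(BaseSchedule):

    def forward_back_step(self, engine, data, label, forward_only=False):
        # one plain forward pass, followed by an optional backward pass
        output = engine(data)
        loss = engine.criterion(output, label)
        if not forward_only:
            engine.backward(loss)
        return output, label, loss
```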
# Add your own parallelism

To make it easier for researchers and engineers to extend our system to new large-scale distributed training algorithms, we have decoupled several components of the training lifecycle. You can implement a new parallel technique by inheriting from the base classes.

The main components are:

1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`

## Process Group Initializer

Parallelism is generally managed through process groups: processes that take part in the same parallel algorithm are placed in the same process group, and if several different parallel techniques are used in the system, multiple process groups need to be created. Colossal-AI provides a global context for users to manage their process groups conveniently. If you want to add a new process group, you can define a new class and set it in your configuration file. The code blocks below show how to add a new parallel technique to the system and initialize it.

1. Add your new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...
    NEW_MODE = 'new_mode'  # define your mode here
```
2. Create a subclass of `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`. If you need to set new arguments, you can replace `arg1` and `arg2` in the example below with your own. Lastly, register your initializer in our registry with the `@DIST_GROUP_INITIALIZER.register_module` decorator.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups here
        pass
```
After that, you can insert your new initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`. You can modify that file, or insert the new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in the configuration file. You can pass in your own arguments if your initializer needs any. This allows the `ParallelContext` to create your initializer and initialize the process groups you need.
```python
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
)
```
## Gradient Handler

Gradient handlers perform all-reduce operations on the gradients of model parameters. Since different parallel techniques may require different all-reduce operations, you can inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement your own strategy. Currently, Colossal-AI uses the ordinary data-parallel gradient handler, which all-reduces gradients across all data-parallel ranks; it is created automatically when Colossal-AI detects that data parallelism is in use. You can add a custom gradient handler with the code below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # implement your all-reduce strategy here
        do_something()
```
After that, you can specify the gradient handler you want to use in your configuration file.
```python
gradient_handlers = [
    dict(type='YourGradientHandler'),
]
```
## Schedule

A schedule specifies which operations to execute during the forward and backward passes. Colossal-AI provides pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.engine.schedule.BaseSchedule` and implement your desired behavior. You can also add your schedule to our engine before training the model.
# Mixed precision training

In Colossal-AI, we have incorporated different implementations of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp

The first two rely on the original implementations of [PyTorch](https://pytorch.org/docs/stable/amp.html)
(version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). The last method is similar to Apex O2 level.
Among these methods, apex.amp is not compatible with tensor parallelism. This is because tensors are split across devices
in tensor parallelism, so communication among different processes is required to check whether `inf` or `nan` occurs anywhere in the
model weights. **We modified the torch amp implementation so that it is now compatible with tensor parallelism.**

To use mixed precision training, simply configure the `fp16` field in the config file. Currently, PyTorch and
Apex amp cannot be guaranteed to work with tensor and pipeline parallelism. We recommend torch amp, as it generally
gives better accuracy than naive amp.
The AMP module is designed to be completely modular and can be used independently of other colossalai modules.
If you wish to use only amp in your code base without `colossalai.initialize`, you can use `colossalai.amp.convert_to_amp`.
```python
from colossalai.amp import AMP_TYPE

# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)
```
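The returned objects can then be used much like their plain PyTorch counterparts. Below is a minimal training-step sketch, assuming the wrapped optimizer exposes a `backward(loss)` helper that applies loss scaling (fall back to `loss.backward()` otherwise):
```python
for data, label in train_dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, label)
    optimizer.backward(loss)  # scaled backward instead of loss.backward()
    optimizer.step()
```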
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to `fp16` format
while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,
    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```
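Such a config is consumed through the normal launch path. The sketch below is illustrative only: it assumes the `fp16` dict above lives in `./config.py`, that `model`, `optimizer`, `criterion`, `train_dataloader` and the distributed launch arguments (`rank`, `world_size`, `host`, `port`) are already defined, and that `colossalai.initialize` returns the engine plus dataloaders; the exact return signature may differ by version.
```python
import colossalai

colossalai.launch(config='./config.py',
                  rank=rank,
                  world_size=world_size,
                  host=host,
                  port=port)

# colossalai.initialize reads the fp16 field and wraps the given objects
engine, train_dataloader, _, _ = colossalai.initialize(model,
                                                       optimizer,
                                                       criterion,
                                                       train_dataloader)
```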
## Apex AMP
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
this plugin because it allows finer control over the granularity of mixed precision. For example, the `O2` level (optimization level 2)
will keep batch normalization in `fp32`.
The following code block shows a config file for Apex AMP.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```
## Naive AMP
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
and pipeline parallelism. This AMP mode will cast all operations into fp16.
The following code block shows a config file for this mode.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```
# Mixed precision training

Colossal-AI supports three different ways of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp

The first two rely on the original implementations of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible with tensor parallelism: tensor parallelism splits tensors and stores them on different devices, so making mixed precision work with it requires communication between processes to check whether `inf` or `nan` appears anywhere in the model parameters. We therefore adopted the implementation approach of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

To use mixed precision training, simply configure the `fp16` field in the configuration file. Currently, PyTorch and Apex amp are not guaranteed to be compatible with tensor and pipeline parallelism, so we recommend the last method.
## PyTorch AMP

PyTorch provides mixed precision training in version 1.6 and above. It converts data to `fp16` while keeping some operations in `fp32`. You can configure it in the config file.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,
    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```
## Apex AMP
We rely on [Apex](https://nvidia.github.io/apex/) for this mode of mixed precision training, as it provides fine-grained control over mixed precision. For example, the `O2` level (optimization level 2) keeps batch normalization in `fp32`. The code block below shows a config file for Apex AMP.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```
## Naive AMP

We borrowed the mixed precision training implementation from Megatron-LM; this implementation is compatible with tensor parallelism and pipeline parallelism. The code block below shows a config file for this mode.
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```
colossalai.amp.amp\_type
========================
.. automodule:: colossalai.amp.amp_type
   :members:
colossalai.amp.apex\_amp.apex\_amp
==================================
.. automodule:: colossalai.amp.apex_amp.apex_amp
   :members:
colossalai.amp.apex\_amp
========================

.. automodule:: colossalai.amp.apex_amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.apex_amp.apex_amp
colossalai.amp.naive\_amp.naive\_amp
====================================
.. automodule:: colossalai.amp.naive_amp.naive_amp
   :members:
colossalai.amp.naive\_amp
=========================

.. automodule:: colossalai.amp.naive_amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.naive_amp.naive_amp
colossalai.amp
==============

.. automodule:: colossalai.amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.apex_amp
   colossalai.amp.naive_amp
   colossalai.amp.torch_amp

.. toctree::
   :maxdepth: 2

   colossalai.amp.amp_type
colossalai.amp.torch\_amp
=========================

.. automodule:: colossalai.amp.torch_amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.amp.torch_amp.torch_amp
colossalai.amp.torch\_amp.torch\_amp
====================================
.. automodule:: colossalai.amp.torch_amp.torch_amp
   :members:
colossalai.builder
==================

.. automodule:: colossalai.builder
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.builder.builder
   colossalai.builder.pipeline
colossalai.communication
========================

.. automodule:: colossalai.communication
   :members:

.. toctree::
   :maxdepth: 2

@@ -8,7 +12,3 @@ colossalai.communication

   colossalai.communication.p2p
   colossalai.communication.ring
   colossalai.communication.utils
colossalai.constants
====================
.. automodule:: colossalai.constants
   :members:
colossalai.context.process\_group\_initializer.initializer\_model
=================================================================
.. automodule:: colossalai.context.process_group_initializer.initializer_model
   :members:
colossalai.context.process\_group\_initializer.initializer\_moe
===============================================================
.. automodule:: colossalai.context.process_group_initializer.initializer_moe
   :members:
@@ -13,6 +13,8 @@ colossalai.context.process\_group\_initializer

   colossalai.context.process_group_initializer.initializer_2p5d
   colossalai.context.process_group_initializer.initializer_3d
   colossalai.context.process_group_initializer.initializer_data
   colossalai.context.process_group_initializer.initializer_model
   colossalai.context.process_group_initializer.initializer_moe
   colossalai.context.process_group_initializer.initializer_pipeline
   colossalai.context.process_group_initializer.initializer_sequence
   colossalai.context.process_group_initializer.initializer_tensor
...
colossalai.context.random
=========================

.. automodule:: colossalai.context.random
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.context.random.seed_manager