If you want to build PyTorch extensions during installation, you can use the command below. Otherwise, the PyTorch extensions will be built during runtime.
```shell
CUDA_EXT=1 pip install colossalai
```
## Download From Source
> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
Colossal-AI provides a collection of parallel training components for you. We aim to support you with your development
of distributed deep learning models just like how you write single-GPU deep learning models. ColossalAI provides easy-to-use
APIs to help you kickstart your training process. To better how ColossalAI works, we recommend you to read this documentation
in the following order.
- If you are not familiar with distributed system or have never used Colossal-AI, you should first jump into the `Concepts`
section to get a sense of what we are trying to achieve. This section can provide you with some background knowledge on
distributed training as well.
- Next, you can follow the `basics` tutorials. This section will cover the details about how to use Colossal-AI.
- Afterwards, you can try out the features provided in Colossal-AI by reading `features` section. We will provide a codebase for each tutorial. These tutorials will cover the
basic usage of Colossal-AI to realize simple functions such as data parallel and mixed precision training.
- Lastly, if you wish to apply more complicated techniques such as how to run hybrid parallel on GPT-3, the
`advanced tutorials` section is the place to go!
**We always welcome suggestions and discussions from the community, and we would be more than willing to help you if you
encounter any issue. You can raise an [issue](https://github.com/hpcaitech/ColossalAI/issues) here or create a discussion
topic in the [forum](https://github.com/hpcaitech/ColossalAI/discussions).**
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
-[Go Wider Instead of Deeper](https://arxiv.org/abs/2107.11817)
(中文版教程将会在近期提供)
## Introduction
Since the advent of Switch Transformer, the AI community has found Mixture of Experts (MoE) a useful technique to enlarge the capacity of deep learning models.
Colossal-AI provides an early access version of parallelism specifically designed for MoE models.
The most prominent advantage of MoE in Colossal-AI is convenience.
We aim to help our users to easily combine MoE with model parallelism and data parallelism.
However, the current implementation has two main drawbacks now.
The first drawback is its poor efficiency in large batch size and long sequence length training.
The second drawback is incompatibility with tensor parallelism.
We are working on system optimization to overcome the training efficiency problem.
The compatibility problem with tensor parallelism requires more adaptation, and we will tackle this issue in the future.
Here, we will introduce how to use MoE with model parallelism and data parallelism.
## Table of Content
In this tutorial we will cover:
1. Set up MoE running environment
2. Create MoE layer
3. Train your model
We provided the [example code](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) for this tutorial in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples).
This example uses [WideNet](https://arxiv.org/abs/2107.11817) as an example of MoE-based model.
## Set up MoE running environment
In your project folder, create a `config.py`.
This file is to specify some features you may want to use to train your model.
In order to enable MoE, you need to add a dict called parallel and specify the value of key moe.
You can assign a value for the key size of moe, which represents the model parallel size of experts (i.e. the number of experts in one group to parallelize training).
For example, if the size is 4, 4 processes will be assigned to 4 consecutive GPUs and these 4 processes form a moe model parallel group.
Each process on the 4 GPUs will only get a portion of experts. Increasing the model parallel size will reduce communication cost, but increase computation cost in each GPU and activation cost in memory.
The total data parallel size is auto-detected and set as the number of GPUs by default.
```python
MOE_MODEL_PARALLEL_SIZE=...
parallel=dict(
moe=dict(size=MOE_MODEL_PARALLEL_SIZE)
)
```
If `MOE_MODEL_PARALLEL_SIZE = E` and set the number of experts as `E` where `E` is a constant number, the process flow of forward pass of a transformer encoder in a model parallel group is shown below.
Inside the initialization of Experts, the local expert number of each GPU will be calculated automatically. You just need to specify the class of each expert and its parameters used in its initialization. As for routers, we have provided top1 router and top2 router. You can find them in colossalai.nn.layer.moe. After creating the instance of experts and router, the only thing initialized in Moelayer is gate module. More definitions of each class can be found in our API document and code.
## Train Your Model
Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine.
We handle the back-propagation of MoE models for you.
In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
You can find more information about the handler `MoeGradientHandler` in colossal directory.
The loss criterion should be wrapped by `Moeloss` to add auxiliary loss of MoE. Example is like this.
```python
criterion=MoeLoss(
aux_weight=0.01,
loss_fn=nn.CrossEntropyLoss,
label_smoothing=0.1
)
```
Finally, just use trainer or engine in `colossalai` to do your training.
Otherwise, you should take care of gradient by yourself.
在GPU数量不足情况下,想要增加模型规模,异构训练是最有效的手段。它通过在 CPU 和 GPU 中容纳模型数据,并仅在必要时将数据移动到当前设备,可以同时利用 GPU 内存、CPU 内存(由 CPU DRAM 或 NVMe SSD内存组成)来突破单GPU内存墙的限制。并行,在大规模训练下,其他方案如数据并行、模型并行、流水线并行都可以在异构训练基础上进一步扩展GPU规模。这篇文章描述ColossalAI的异构内存空间管理模块Gemini的设计细节,它的思想来源于[PatrickStar](https://arxiv.org/abs/2108.05818),ColossalAI根据自身情况进行了重新实现。
你可以在配置文件中设置流水并行的大小。`NUM_CHUNKS` 在使用交错流水线时很有用 (更多细节见 [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) )。