[doc] clean up outdated docs (#4765)

* [doc] clean up outdated docs
* [doc] fix linking
* [doc] fix linking

Unverified Commit 66f39260 authored Sep 21, 2023 by Hongxin Liu Committed by GitHub Sep 21, 2023
20 changed files
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@@ -29,13 +29,7 @@
        "basics/launch_colossalai",
        "basics/booster_api",
        "basics/booster_plugins",
-        "basics/booster_checkpoint",
+        "basics/booster_checkpoint"
-        "basics/define_your_config",
-        "basics/initialize_features",
-        "basics/engine_trainer",
-        "basics/configure_parallelization",
-        "basics/model_checkpoint",
-        "basics/colotensor_concept"
      ]
    },
    {
@@ -44,12 +38,8 @@
      "collapsed": true,
      "items": [
        "features/mixed_precision_training_with_booster",
-        "features/mixed_precision_training",
        "features/gradient_accumulation_with_booster",
-        "features/gradient_accumulation",
        "features/gradient_clipping_with_booster",
-        "features/gradient_clipping",
-        "features/gradient_handler",
        "features/zero_with_chunk",
        {
          "type": "category",
@@ -75,10 +65,7 @@
        "advanced_tutorials/train_vit_using_pipeline_parallelism",
        "advanced_tutorials/train_vit_with_hybrid_parallelism",
        "advanced_tutorials/train_gpt_using_hybrid_parallelism",
-        "advanced_tutorials/define_your_own_parallel_model",
-        "advanced_tutorials/add_your_parallel",
        "advanced_tutorials/meet_gemini",
-        "advanced_tutorials/parallelize_your_training_like_Megatron",
        "advanced_tutorials/integrate_mixture_of_experts_into_your_model",
        "advanced_tutorials/opt_service"
      ]

--- a/docs/source/en/advanced_tutorials/add_your_parallel.md
+++ b/docs/source/en/advanced_tutorials/add_your_parallel.md
-# Add Your Own Parallel Mode
-Author: Shenggui Li, Yongbin Li
-**Prerequisite:**
-- [Define Your Configuration](../basics/define_your_config.md)
-- [Configure Parallelization](../basics/configure_parallelization.md)
-## Introduction
-To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
-with less effort, we have decoupled various components in the training lifecycle. You can implement your own
-parallelism by simply inheriting from the base class.
-The main components are:
-1. `ProcessGroupInitializer`
-2. `GradientHandler`
-3. `Schedule`
-**This currently requires some changes to the source code, so we recommend that you install from source with the `-e` flag.
-The `-e` flag makes the installation editable, so your code changes are reflected in your Python runtime.
-We will work on removing the need to change the source code in future releases.**
-## Process Group Initializer
-Parallelism is often managed by process groups where processes involved in the same parallel algorithm are placed in the same
-process group. For different parallel algorithms, different process groups need to be created. Colossal-AI provides a
-global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
-define a new class and set it in your configuration file. To define your own way of creating process groups, you can
-follow the steps below to create a new distributed initialization.
-1. Add your parallel mode in `colossalai.legacy.context.parallel_mode.ParallelMode`.
-    ```python
-    class ParallelMode(Enum):
-        GLOBAL = 'global'
-        DATA = 'data'
-        PIPELINE = 'pipe'
-        ...
-        NEW_MODE = 'new_mode'  # define your mode here
-    ```
-2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The
-   first six arguments are fixed. `ParallelContext` will pass in these arguments for you. If you need to set other
-   arguments, you can append them after the fixed ones, like `arg1, arg2` in the example below. Lastly, register your initializer in the
-   registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
-    ```python
-    # sample initializer class
-    @DIST_GROUP_INITIALIZER.register_module
-    class MyParallelInitializer(ProcessGroupInitializer):
-        def __init__(self,
-                    rank: int,
-                    world_size: int,
-                    config: Config,
-                    data_parallel_size: int,
-                    pipeline_parallel_size: int,
-                    tensor_parallel_size: int,
-                    arg1,
-                    arg2):
-            super().__init__(rank, world_size, config)
-            self.arg1 = arg1
-            self.arg2 = arg2
-            # ... your variable init
-        def init_parallel_groups(self):
-            # initialize your process groups
-            pass
-    ```
-    Then, you can insert your new initializer into the current mode-to-initializer mapping
-    in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
-    ```python
-    colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
-    ```
-3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows
-   the `ParallelContext` to create your initializer and initialize your desired process groups.
-    ```python
-    parallel = dict(
-        pipeline=dict(size=1),
-        tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
-    )
-    ```
-## Gradient Handler
-Gradient handlers are objects which execute the all-reduce operations on parameters' gradients. As different all-reduce
-strategies may be executed for different kinds of parallelism, users can
-inherit `colossalai.legacy.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
-uses the normal data parallel gradient handler which all-reduces the gradients across data parallel ranks. The data
-parallel gradient handler is added to the engine automatically if data parallel is detected. You can add your own
-gradient handler like below:
-```python
-from colossalai.legacy.registry import GRADIENT_HANDLER
-from colossalai.legacy.engine import BaseGradientHandler
-@GRADIENT_HANDLER.register_module
-class YourGradientHandler(BaseGradientHandler):
-    def handle_gradient(self):
-        do_something()
-```
-Afterwards, you can specify the gradient handler you want to use in your configuration file.
-```python
-gradient_handlers = [
-    dict(type='YourGradientHandler'),
-]
-```
-## Schedule
-A schedule defines how to execute a forward and backward pass. Currently, Colossal-AI provides pipeline and non-pipeline
-schedules. If you want to modify how the forward and backward passes are executed, you can
-inherit `colossalai.legacy.engine.schedule.BaseSchedule` and implement the `forward_back_step` function.
-<!-- doc-test-command: echo  -->
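Reviewer note: the removed `add_your_parallel.md` relies on a registration decorator (`register_module`) so that a config entry such as `dict(type='YourGradientHandler')` can later be resolved to a class. A minimal sketch of that pattern in plain Python follows. The `Registry` class, its `build` method, and `SumGradientHandler` are hypothetical stand-ins written for illustration; they are not Colossal-AI's actual implementation.

```python
class Registry:
    """Minimal name-to-class registry (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, cls):
        # Used as a decorator: store the class under its own name and return it unchanged.
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # Resolve a config dict such as dict(type='SumGradientHandler', scale=0.5).
        kwargs = dict(cfg)
        cls = self._modules[kwargs.pop("type")]
        return cls(**kwargs)


GRADIENT_HANDLER = Registry("gradient_handler")


@GRADIENT_HANDLER.register_module
class SumGradientHandler:
    """Toy handler: stands in for an all-reduce by summing per-rank gradients."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def handle_gradient(self, per_rank_grads):
        return self.scale * sum(per_rank_grads)


# The config only names the class; the registry performs the lookup.
handler = GRADIENT_HANDLER.build(dict(type="SumGradientHandler", scale=0.5))
reduced = handler.handle_gradient([1.0, 2.0, 3.0])
print(reduced)
```

Keeping the lookup keyed by class name is what lets a configuration file stay free of imports, at the cost of requiring every handler module to be imported before the config is resolved.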
--- a/docs/source/en/advanced_tutorials/define_your_own_parallel_model.md
+++ b/docs/source/en/advanced_tutorials/define_your_own_parallel_model.md
-# Define your own parallel model
-Author: Zhengda Bian, Yongbin Li
-> ⚠️ We are working on this documentation to make it more detailed. We will introduce the mechanism of different parallelism
-> and how to use them to write a model.
-Let's say that you have a huge MLP model with billions of parameters and its extremely large hidden layer size makes it
-impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out. With the help of Colossal-AI,
-you can write your model in the familiar way in which you used to write models for a single GPU, while Colossal-AI automatically
-splits your model weights and fits them perfectly into a set of GPUs. We give a simple example showing how to write a simple
-2D parallel model in the Colossal-AI context.
-## Write a simple 2D parallel model
-```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-class MLP_2D(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
-        self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-    def forward(self, x):
-        x = self.linear_1(x)
-        x = self.linear_2(x)
-        return x
-```
-## Use pre-defined model
-For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *ViT*, *MoE*,
-and *GPT*. Feel free to customize them into different sizes to fit your needs.
--- a/docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md
+++ b/docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md
-# Parallelize Your Training like Megatron-LM via ColoTensor
-Author: [Haichen Huang](https://github.com/1SAA) and [Jiarui Fang](https://github.com/feifeibear)
-**Prerequisite:**
-- [ColoTensor Concepts](../basics/colotensor_concept.md)
-## Introduction
-Thanks to the convenience provided by ColoTensor, users can apply parallelism with minimal edits to their serial code.
-In this tutorial, we will illustrate how to modify the training model so that the code automatically adapts to parallel training in the style of Megatron-LM.
-We take the GPT-2 model offered by HuggingFace as an example and provide a way for you to pre-train the GPT-2 model on a single GPU.
-Megatron-LM provides an influential paradigm for parallelizing large transformer language models.
-However, in order to train large transformer language models at scale, users have to build their models with the modules provided by Megatron.
-This imposes extra work on users, such as loading the weights from pre-trained models and constructing the parallelized models.
-To reduce this burden, we offer ColoTensor to enable tensor model parallelism automatically.
-## Definitions of the model and the loss function
-First, we define `GPTLMModel`, a thin wrapper around `GPT2LMHeadModel` from the HuggingFace library, and `GPTLMLoss`, a cross-entropy loss over shifted tokens.
-```python
-import torch
-import torch.nn as nn
-from transformers import GPT2Config, GPT2LMHeadModel
-class GPTLMModel(nn.Module):
-    def __init__(self, hidden_size=768, num_layers=12, num_attention_heads=12, max_seq_len=1024, vocab_size=50257, checkpoint=False):
-        super().__init__()
-        self.checkpoint = checkpoint
-        self.model = GPT2LMHeadModel(GPT2Config(n_embd=hidden_size, n_layer=num_layers,
-                                     n_head=num_attention_heads, n_positions=max_seq_len, n_ctx=max_seq_len, vocab_size=vocab_size))
-        if checkpoint:
-            self.model.gradient_checkpointing_enable()
-    def forward(self, input_ids, attention_mask):
-        # Only return lm_logits
-        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
-class GPTLMLoss(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.loss_fn = nn.CrossEntropyLoss()
-    def forward(self, logits, labels):
-        shift_logits = logits[..., :-1, :].contiguous()
-        shift_labels = labels[..., 1:].contiguous()
-        # Flatten the tokens
-        return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
-```
-## Brief Review of GPT-2
-Now, we recall the structure of each GPT-2 model.
-Every GPT-2 model can be represented as a DAG.
-As shown in the pictures below, each circle represents an operator and each square represents a weight.
-An arrow indicates the flow of the input data, and the notation alongside the arrow gives the shape of the input data.
-Now let's take a closer look at this GPT-2 model. It consists of three parts.
-They are the **embedding module**, **transformer layers**, and the **classification head**.
-The embedding module contains two weights, token embedding weight and position embedding weight.
-After the forward operation of the embedding module, each word in all sequences of the raw input data will be embedded into a hidden state.
-<figure style={{textAlign: "center"}}>
-<img src="https://s2.loli.net/2022/08/17/omfkIEN6ui5jcL3.png"/>
-<figcaption>The embedding module</figcaption>
-</figure>
-Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer perceptron is located in the second block.
-<figure style={{textAlign: "center"}}>
-<img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>
-<figcaption>The transformer layer</figcaption>
-</figure>
-In the end, the classification head is just a linear module without bias, which only has a weight inside.
-## Applied with ColoTensor
-Two steps adapt your serial code to the Megatron-LM tensor parallel style.
-1. Initialize the model in the context of ColoInitContext.
-2. Set ColoTensorSpec for each parameter.
-### Initialize with ColoInitContext
-We should build the model in the ColoInitContext.
-In this context, any parameter that is initialized is transformed into a ColoParameter and moved to the corresponding device automatically.
-```python
-from colossalai.utils.model.colo_init_context import ColoInitContext
-with ColoInitContext(device=torch.device('cpu')):
-    model = GPTLMModel()
-```
-### Setting ColoTensorSpec for each parameter
-After the creation of the model, we establish the distributed environment through ProcessGroup.
-Here, we specify the degree of the tensor parallelism as the same as the number of all GPUs, which means the degree of data parallelism is 1.
-```python
-import torch.distributed as dist
-from colossalai.tensor import ProcessGroup
-pg = ProcessGroup(tp_degree=dist.get_world_size())
-```
-Now, some auxiliary functions are necessary for the next step. We define two functions to split a parameter.
-Megatron-LM-like tensor parallelism requires splitting a parameter tensor along its first dimension or its last dimension.
-```python
-from colossalai.tensor import ShardSpec, ComputeSpec, ComputePattern, ColoParameter, ProcessGroup
-def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
-    spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
-    if param.process_group.tp_world_size() == 1:
-        param.set_process_group(pg)
-    param.set_tensor_spec(*spec)
-def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
-    split_param_single_dim_tp1d(0, param, pg)
-def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
-    split_param_single_dim_tp1d(-1, param, pg)
-```
-Then we adapt the model to the tensor parallelism.
-Following the tensor parallelism applied in Megatron, we shard the following tensors along their last dimension: the weights of the token embedding and position embedding, all linear weights and biases in the self-attention blocks, and the first linear weight and bias in each MLP.
-The second linear weight in each MLP is sharded along its first dimension.
-```python
-for mn, module in model.named_modules():
-    for pn, param in module.named_parameters(recurse=False):
-        # set process group for all parameters
-        param.set_process_group(pg)
-        if 'mlp.c_fc' in mn:
-            if 'weight' in pn or 'bias' in pn:
-                split_param_col_tp1d(param, pg)  # column slice
-                # keep the shape of the output from c_fc
-                param.compute_spec.set_output_replicate(False)
-        elif 'mlp.c_proj' in mn:
-            if 'weight' in pn:
-                split_param_row_tp1d(param, pg)  # row slice
-        elif 'wte' in mn or 'wpe' in mn:
-            split_param_col_tp1d(param, pg)  # column slice
-        elif 'c_attn' in mn or 'c_proj' in mn:
-            split_param_col_tp1d(param, pg)  # column slice
-```
-The modified model is illustrated below.
-The embedding module:
-<figure style={{textAlign: "center"}}>
-<img src="https://s2.loli.net/2022/08/17/Yu2xzXEabHV7pwe.png"/>
-<figcaption>The modified embedding module</figcaption>
-</figure>
-The transformer layers:
-<figure style={{textAlign: "center"}}>
-<img src="https://s2.loli.net/2022/08/17/4HWsA2xz51IhPFO.png"/>
-<figcaption>The modified transformer layer</figcaption>
-</figure>
-Once users have specified the distributed pattern of each parameter, ColoTensor is capable of inferring the computation patterns of all operators, including matrix multiplication, the linear function, other elementwise functions in torch.nn.functional, etc.
-In this way, users can train their models as usual.
-In our latest example, a Gemini + ZeRO DDP model is also defined to reduce overhead and improve efficiency. For the details of this part, please refer to [ZeRO](../features/zero_with_chunk.md). You can combine these two parts to understand our entire training process:
-```python
-def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
-    from colossalai.zero import GeminiDDP
-    model = GeminiDDP(model,
-                        device=get_current_device(),
-                        placement_policy=placement_policy,
-                        pin_memory=True,
-                        search_range_m=32)
-    return model
-```
-## Pretrain GPT-2 On Single GPU
-The above optimization we made allows us to pretrain the GPT-2 model on a single GPU. We only need to set the parameter `GPUNUM`=1 in `run.sh`, and then we can complete the model training on a single GPU when running the file.
-The GPT-2 example is accessible at [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).
-<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 parallelize_your_training_like_Megatron.py  -->
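Reviewer note: the removed tutorial shards the first MLP weight along its last dimension and the second along its first dimension. The sketch below, in plain Python with small hand-picked matrices, illustrates why that pairing reproduces the serial result once the per-rank partial outputs are summed. It assumes weights stored as `(in_features, out_features)`, omits biases, and uses ReLU as a stand-in elementwise activation; the ranks are simulated in a loop and no real communication takes place.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise activation; it commutes with a split along the last dimension."""
    return [[max(0.0, v) for v in row] for row in m]


def split_cols(m, parts):
    """Shard a matrix along its last dimension into `parts` equal pieces."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Shard a matrix along its first dimension into `parts` equal pieces."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, -2.0], [0.5, 3.0]]                                # (batch=2, hidden=2)
w_fc = [[1.0, -1.0, 2.0, 0.0], [0.0, 1.0, -1.0, 3.0]]        # (hidden=2, ffn=4)
w_proj = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0], [1.0, 1.0]]   # (ffn=4, hidden=2)

# Serial reference: y = relu(x @ w_fc) @ w_proj
serial = matmul(relu(matmul(x, w_fc)), w_proj)

# Two simulated tensor-parallel ranks, each holding one shard of both weights.
tp = 2
partials = [
    matmul(relu(matmul(x, fc_shard)), proj_shard)
    for fc_shard, proj_shard in zip(split_cols(w_fc, tp), split_rows(w_proj, tp))
]

# Summing the partial outputs plays the role of the all-reduce.
parallel = partials[0]
for partial in partials[1:]:
    parallel = add(parallel, partial)

print(serial == parallel)
```

Because the activation is elementwise, each rank can apply it to its own slice of the hidden activations, so only one reduction is needed per MLP in the forward pass.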
--- a/docs/source/en/basics/colotensor_concept.md
+++ b/docs/source/en/basics/colotensor_concept.md
-# ColoTensor Concepts
-Author: [Jiarui Fang](https://github.com/feifeibear), [Hongxin Liu](https://github.com/ver217) and [Haichen Huang](https://github.com/1SAA)
-> ⚠️ The information on this page is outdated and will be deprecated.
-**Prerequisite:**
-- [Colossal-AI Overview](../concepts/colossalai_overview.md)
-- [Distributed Training](../concepts/distributed_training.md)
-- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
-## Introduction
-After ColossalAI version 0.1.8, [ColoTensor](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ColoTensor) became the basic data structure for tensors in ColossalAI. It is a subclass of torch.Tensor and can be used as a PyTorch Tensor. Additionally, some unique features make it possible to represent a Global Tensor with a payload distributed across multiple GPU devices. With the help of ColoTensor, users can write distributed DNN training programs in much the same way as serial ones.
-ColoTensor contains extra attributes encapsulated in a [ColoTensorSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.tensor_spec.html#colossalai.tensor.tensor_spec.ColoTensorSpec) instance to describe the tensor's payload distribution and computing pattern.
-- ProcessGroup: how processes are organized as communication groups.
-- Distributed Spec: how the tensor is distributed among process groups.
-- Compute Spec: how the tensor is used during computation.
-We elaborate on them one by one.
-## ProcessGroup
-An instance of class [ProcessGroup](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ProcessGroup) describes how processes are organized in process groups. Processes in a process group can participate in the same collective communication operations together, such as allgather, allreduce, etc. The way the process group is organized is determined by the tensor's parallelism strategy. For example, if the user defines the tensor parallel (TP) and data parallel (DP) modes of a tensor, then the process organization of the process group will be automatically deduced. The process group settings can vary among different tensors. Therefore, it enables us to support more complicated hybrid parallelism. The pipeline parallel (PP) definition is not part of ProcessGroup; it needs another set of mechanisms. We will supplement the related content of ColoTensor applied to PP in the future.
-Currently, a process group of ColoTensor is defined by two configurations, i.e. tp_degree and dp_degree. In the case of DP+TP hybrid parallelism, the device can be viewed as a 2D mesh. We place TP communication groups on the leading low dimension of the device mesh and then place the data parallel groups along the high dimension of the device mesh. The reason is that tensor parallelism has a larger communication overhead than data parallelism. Neighboring devices are placed inside a TP process group and are often placed in the same node.
-Suppose 8 processes are configured with tp_degree=4 and dp_degree=2; the layout is shown below. Process group tp0 contains GPUs 0, 1, 2 and 3. Process group dp1 contains GPUs 1 and 5.
-<figure style={{textAlign: "center"}}>
-<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ColoTensor_layout_demo.PNG"/>
-<figcaption>Process Group using tp_degree=4, dp_degree=2</figcaption>
-</figure>
-## Distributed Spec
-An instance of [Distributed Spec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html) describes how a ColoTensor is distributed among the ProcessGroup.
-How tensors are distributed among DP process groups is automatically derived and does not need to be manually specified by the user. If the tensor is a model parameter, it is replicated within the DP process group. If it is an activation tensor, it is split along its highest dimension and its payload is evenly distributed among the processes in the DP process group.
-Therefore, when using Distributed Spec, we only need to describe the way that the tensor is distributed among TP process groups. There are currently two ways to distribute a tensor among TP process groups, i.e. [ShardSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ShardSpec) and [ReplicaSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ReplicaSpec). ShardSpec needs to specify the dimension index dim of the partition and the number of partitions num_partitions. Currently, we only support splitting along a single dim. Different dist specs on the TP process groups can be converted to each other through the set_dist_spec() interface. The spec conversions are recorded by the autograd mechanism and trigger the corresponding reverse operations during backward propagation.
-## Compute Spec
-An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a ColoTensor is used in DNN training. Currently, we set the correct Compute Pattern for ColoTensors that serve as the parameters of a module. The specific application scenarios will be shown in the next document.
-## ColoParameter
-[ColoParameter](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.colo_parameter.html#colossalai.tensor.colo_parameter.ColoParameter) is a subclass of ColoTensor, used to define a Global Parameter tensor. Its relationship with ColoTensor mirrors that of torch.nn.Parameter with torch.Tensor: it allows the tensor to appear in the return values of the module's parameters() and named_parameters() methods.
-## Example
-Let's see an example. A ColoTensor is initialized and sharded on 4 GPUs using tp_degree=2, dp_degree=2. The tensor is first sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
-```python
-import torch
-import torch.multiprocessing as mp
-from colossalai.utils import print_rank_0
-from functools import partial
-import colossalai
-from colossalai.tensor import ProcessGroup, ColoTensor, ColoTensorSpec, ShardSpec, ComputeSpec, ComputePattern
-from colossalai.testing import spawn
-import torch
-def run_dist_tests(rank, world_size, port):
-    colossalai.launch(config={}, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl')
-    pg = ProcessGroup(tp_degree=2, dp_degree=2)
-    torch.manual_seed(0)
-    local_tensor = torch.randn(2, 3, 1).cuda()
-    print_rank_0(f"shape {local_tensor.shape}, {local_tensor.data}")
-    spec = ColoTensorSpec(pg, ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
-    t1 = ColoTensor.from_torch_tensor(local_tensor, spec)
-    t1 = t1.to_replicate()
-    print_rank_0(f"shape {t1.shape}, {t1.data}")
-    spec2 = ShardSpec([0], [pg.tp_world_size()])
-    t1.set_dist_spec(spec2)
-    print_rank_0(f"shape {t1.shape}, {t1.data}")
-def test_dist_cases(world_size):
-    spawn(run_dist_tests, world_size)
-if __name__ == '__main__':
-    test_dist_cases(4)
-```
-:::caution
-The ColoTensor is an experimental feature and may be updated.
-:::
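Reviewer note: the removed `colotensor_concept.md` states that with `tp_degree=4` and `dp_degree=2` on 8 processes, group tp0 contains GPUs 0 to 3 and group dp1 contains GPUs 1 and 5. The sketch below reproduces that layout in plain Python. The helper `build_groups` is hypothetical and assumes a row-major mesh with TP along the fast-changing dimension; it only computes rank lists and does not create real communication groups.

```python
def build_groups(world_size, tp_degree, dp_degree):
    """Return (tp_groups, dp_groups) as lists of rank lists.

    Assumed layout: ranks form a (dp_degree x tp_degree) mesh in row-major order,
    so TP groups hold consecutive ranks and DP groups hold ranks tp_degree apart.
    """
    if tp_degree * dp_degree != world_size:
        raise ValueError("tp_degree * dp_degree must equal world_size")
    tp_groups = [list(range(d * tp_degree, (d + 1) * tp_degree)) for d in range(dp_degree)]
    dp_groups = [list(range(t, world_size, tp_degree)) for t in range(tp_degree)]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_groups(world_size=8, tp_degree=4, dp_degree=2)
print("tp0:", tp_groups[0])
print("dp1:", dp_groups[1])
```

Placing consecutive ranks in the same TP group matches the rationale given in the removed text: neighboring devices usually sit in the same node, where the heavier tensor-parallel traffic is cheaper.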
--- a/docs/source/en/basics/configure_parallelization.md
+++ b/docs/source/en/basics/configure_parallelization.md
-# Configure Parallelization
-Author: Shenggui Li, Siqi Mai
-> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster Plugins](../basics/booster_plugins.md) for more information.
-**Prerequisite:**
-- [Distributed Training](../concepts/distributed_training.md)
-- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
-- [Define Your Configuration](./define_your_config.md)
-## Introduction
-We support multiple parallelization methods in Colossal-AI. Hybrid parallelism in our codebase refers to the combination
-of data parallelism, pipeline parallelism and tensor parallelism (1D, 2D, 2.5D, 3D).
-Each parallelism requires a different network topology and thus initializes different process groups.
-You can initialize the corresponding process group by setting `parallel` in the config file.
-The configuration for `parallel` must obey the following format. Data parallel size will be
-inferred automatically based on your inputs to pipeline parallelism and tensor parallelism.
-`colossalai.launch` will initialize these distributed process groups automatically based on your configuration.
-Some sample configurations are shown below:
-```python
-# sample format (schema only, not runnable code)
-# parallel = dict(
-#     pipeline=dict(size=<int>),
-#     tensor=dict(size=<int>, mode=<'1d' | '2d' | '2.5d' | '3d'>, <mode-specific kwargs, e.g. depth>)
-# )
-# this is ok
-parallel = dict(
-    pipeline=dict(size=2),
-    tensor=dict(size=4, mode='2d')
-)
-# this is ok
-parallel = dict(
-    pipeline=2,
-    tensor=dict(size=4, mode='2d')
-)
-# this is not ok
-# as you need to specify the mode for tensor parallelism
-parallel = dict(
-    pipeline=2,
-    tensor=4
-)
-# this is ok as well, as tensor will default to size 1
-# and mode None
-parallel = dict(
-    pipeline=2
-)
-# this is ok as well, as pipeline will default to size 1
-parallel = dict(
-    tensor=dict(size=4, mode='2d')
-)
-```
-The key name `size` refers to the parallel size of the parallelism dimension. For example, pipeline size 2 means there
-will be 2 pipeline stages. The key name `mode` in tensor parallel config means the corresponding tensor parallelism
-will be initialized.
-**You can choose to not have 'parallel' in your configuration and both pipeline and tensor will default to size 1.**
-**Total number of GPUs must be equal to `data parallel size * tensor parallel size * pipeline parallel size`**
-## Data Parallel
-Data parallel is the most common way to distribute your training task: the data is split into several shards and each device trains on
-a single shard. The configuration for data parallel is detected automatically and set for you. You do not
-have to set it explicitly in your configuration. There are two ways to handle the all-reduce in data parallel in Colossal-AI.
-1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
-2. Otherwise, PyTorch DistributedDataParallel will be used
-In most cases, you will be using the second mode unless you have complex handling of the gradients.
-## 1D, 2D, 2.5D and 3D Parallel
-To enable hybrid parallelism, we provide an array of tensor parallelism. We provide the list of papers which match each
-tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
-- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
-- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
-  2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
-  outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of `P = N^2` devices where
-  `N` is the number of tensor chunks in a single dimension.
-- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
-  Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
-  further parallelizes 2D tensor parallelism. A total of `P = N^2 ∗ d` processors are arranged into `d` layers, where
-  each layer performs matrix multiplication operations independently with a dimension `N`.
-- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
-  We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
-  achieves the optimal `O(P^{1/3})` communication overhead on `P` processors, while both computation and memory usage
-  are evenly distributed through optimized load balancing of parameters as well as activations.
-```python
-# 1D parallel
-parallel = dict(
-    tensor=dict(size=4, mode='1d')
-)
-# 2D parallel
-parallel = dict(
-    tensor=dict(size=4, mode='2d')
-)
-# 2.5D parallel
-parallel = dict(
-    tensor=dict(size=8, mode='2.5d', depth=2)
-)
-# 3D parallel
-parallel = dict(
-    tensor=dict(size=8, mode='3d')
-)
-```
-Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed
-operator. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
-## Pipeline Parallel
-Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
-model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU
-and the second layer to the second GPU.
-You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, Colossal-AI
-will automatically create the pipeline schedule which defines the forward and backward step.
-```python
-parallel = dict(
-    pipeline=dict(size=4), # number of pipeline stages
-)
-```
-## Sequence Parallel
-Sequence parallel is to support long-sequence modelling such as document-level text understanding and medical imaging.
-This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
-You can specify the mode to be `sequence` to initialize its process group.
-```python
-parallel = dict(
-    tensor=dict(size=4, mode='sequence')
-)
-```
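Reviewer note: the removed `configure_parallelization.md` states two arithmetic rules: the total GPU count equals `data parallel size * tensor parallel size * pipeline parallel size`, and the tensor size must fit the chosen mode (`P = N^2` for 2D, `P = N^2 * d` for 2.5D). The sketch below checks both rules in plain Python. The helpers `infer_data_parallel_size` and `check_tensor_size` are hypothetical, and the `P = N^3` rule for 3D is an assumption read from the phrase "3D processor cube"; none of this is Colossal-AI code.

```python
import math


def _size_of(entry):
    # A section may be absent (size 1), a bare int, or a dict with a 'size' key.
    if entry is None:
        return 1
    if isinstance(entry, int):
        return entry
    return entry["size"]


def infer_data_parallel_size(num_gpus, parallel=None):
    """Infer the data parallel size from a `parallel` config dict."""
    parallel = parallel or {}
    tensor = parallel.get("tensor")
    if isinstance(tensor, int):
        # Mirrors the "not ok" sample: a tensor size without a mode is rejected.
        raise ValueError("tensor parallelism needs a dict with 'size' and 'mode'")
    model_parallel = _size_of(parallel.get("pipeline")) * _size_of(tensor)
    if num_gpus % model_parallel != 0:
        raise ValueError("GPU count must be divisible by pipeline size * tensor size")
    return num_gpus // model_parallel


def check_tensor_size(size, mode, depth=1):
    """Return True if `size` devices can form the mesh that `mode` requires."""
    if mode == "2d":
        n = math.isqrt(size)
        return n * n == size
    if mode == "2.5d":
        if size % depth != 0:
            return False
        n = math.isqrt(size // depth)
        return n * n * depth == size
    if mode == "3d":
        n = round(size ** (1 / 3))  # assumption: a cube of N^3 devices
        return n ** 3 == size
    return True  # 1D and sequence modes place no shape constraint in this sketch


cfg = dict(pipeline=2, tensor=dict(size=4, mode="2d"))
dp_size = infer_data_parallel_size(16, cfg)
print(dp_size)
```

With 16 GPUs, pipeline size 2 and tensor size 4, the remaining factor of 2 becomes the data parallel size, which is why the removed page never asks users to set it explicitly.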
--- a/docs/source/en/basics/define_your_config.md
+++ b/docs/source/en/basics/define_your_config.md
-# Define Your Configuration
-Author: Guangyang Lu, Shenggui Li, Siqi Mai
-> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.
-**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
-## Introduction
-In Colossal-AI, a configuration file is required to specify the features the system will inject into the training process.
-In this tutorial, we will show you how to construct your configuration file and how this config file will be used.
-Using a configuration file has several advantages:
-1. You can store your feature configuration and training hyper-parameters in different configuration files
-2. New features released in the future can be specified in the configuration without code change in the training script
-In this tutorial, we will cover how to define your configuration file.
-## Configuration Definition
-In a configuration file, there are two types of variables. One serves as feature specification and the other serves
-as hyper-parameters. All feature-related variables are reserved keywords. For example, if you want to use mixed precision
-training, you need to use the variable name `fp16` in the config file and follow a pre-defined format.
-### Feature Specification
-There is an array of features Colossal-AI provides to speed up training. Each feature is defined by a corresponding field
-in the config file. In this tutorial, we are not giving the config details for all the features, but rather we are providing
-an illustration of how to specify a feature. **The details of each feature can be found in its respective tutorial.**
-To illustrate the use of a config file, we use mixed precision training as an example here. In order to do so, you need to
-follow the steps below.
-1. create a configuration file (e.g. `config.py`, the file name can be anything)
-2. define the mixed precision configuration in the config file. For example, in order to use mixed precision training
-natively provided by PyTorch, you can just write these lines of code below into your config file.
-   ```python
-   from colossalai.amp import AMP_TYPE
-   fp16 = dict(
-     mode=AMP_TYPE.TORCH
-   )
-   ```
-3. Tell Colossal-AI where your config file is when launching the distributed environment. For example, the config file is in
-the current directory.
-   ```python
-   import colossalai
-   colossalai.launch(config='./config.py', ...)
-   ```
-In this way, Colossal-AI knows what features you want to use and will inject this feature during `colossalai.initialize`.
-### Global Hyper-parameters
-Besides feature specification, the config file can also serve as a place to define your training hyper-parameters. This
-comes in handy when you want to perform multiple experiments: the details of each experiment can be put into a single config file
-to avoid confusion. These parameters will be stored in the global parallel context and can be accessed in the training script.
-For example, you can specify the batch size in your config file.
-```python
-BATCH_SIZE = 32
-```
-After launch, you are able to access your hyper-parameters through the global parallel context.
-```python
-import colossalai
-from colossalai.core import global_context as gpc
-colossalai.launch(config='./config.py', ...)
-# access your parameter
-print(gpc.config.BATCH_SIZE)
-```
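To make the idea concrete, here is a minimal sketch of how a Python config file can be turned into an attribute-style object similar to `gpc.config`. It only uses the standard library; `load_config` is a hypothetical helper and not how Colossal-AI implements its loader, and the string `'torch'` stands in for `AMP_TYPE.TORCH` so that the example runs without Colossal-AI installed.

```python
import runpy
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_config(path):
    """Execute a Python config file and expose its public variables as attributes.

    Hypothetical helper for illustration only.
    """
    namespace = runpy.run_path(str(path))
    # Drop module internals such as __name__ and __file__.
    public = {key: value for key, value in namespace.items() if not key.startswith("_")}
    return SimpleNamespace(**public)


# Write a small config file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.py"
    config_path.write_text("BATCH_SIZE = 32\nfp16 = dict(mode='torch')\n")
    config = load_config(config_path)

print(config.BATCH_SIZE)  # hyper-parameter
print(config.fp16)        # feature specification
```

Because the config is ordinary Python, feature fields and hyper-parameters live side by side and are read the same way.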
--- a/docs/source/en/basics/engine_trainer.md
+++ b/docs/source/en/basics/engine_trainer.md
-# Use Engine and Trainer in Training
-Author: Shenggui Li, Siqi Mai
-> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.
-**Prerequisite:**
- [Initialize Features](./initialize_features.md)
-## Introduction
-In this tutorial, you will learn how to use the engine and trainer provided in Colossal-AI to train your model.
-Before we delve into the details, we would like to first explain the concept of engine and trainer.
-### Engine
-Engine is essentially a wrapper class for model, optimizer and loss function.
-When we call `colossalai.initialize`, an engine object will be returned, and it has already been equipped with
-functionalities such as gradient clipping, gradient accumulation and zero optimizer as specified in your configuration file.
-An engine object exposes APIs similar to those of the PyTorch training components, so that users need to make only minimal changes
-to their code.
-Below is a table which shows the commonly used APIs for the engine object.
-| Component                             | Function                                      | PyTorch                         | Colossal-AI                            |
-| ------------------------------------- | --------------------------------------------- | ------------------------------- | -------------------------------------- |
-| optimizer                             | Set all gradients to zero before an iteration | optimizer.zero_grad()           | engine.zero_grad()                     |
-| optimizer                             | Update the parameters                         | optimizer.step()                | engine.step()                          |
-| model                                 | Run a forward pass                            | outputs = model(inputs)         | outputs = engine(inputs)               |
-| criterion                             | Calculate the loss value                      | loss = criterion(output, label) | loss = engine.criterion(output, label) |
-| criterion                             | Execute back-propagation on the model         | loss.backward()                 | engine.backward(loss)                  |
-The reason why we need such an engine class is that we can add more functionalities while hiding the implementations in
-the `colossalai.initialize` function.
-Imagine we are going to add a new feature: we can manipulate the model, optimizer, dataloader and loss function in the
-`colossalai.initialize` function and only expose an engine object to the user.
-The user only needs to modify their code to the minimum extent by adapting the normal PyTorch APIs to the Colossal-AI
-engine APIs. In this way, they can enjoy more features for efficient training.
-A normal training iteration using engine can be:
-```python
-import colossalai
-# build your model, optimizer, criterion, dataloaders
-...
-engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
-                                                                    optimizer,
-                                                                    criterion,
-                                                                    train_dataloader,
-                                                                    test_dataloader)
-for img, label in train_dataloader:
-    engine.zero_grad()
-    output = engine(img)
-    loss = engine.criterion(output, label)
-    engine.backward(loss)
-    engine.step()
-```
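The wrapper idea behind the engine can be sketched in a few lines of plain Python. `MiniEngine` below is a toy stand-in, not the Colossal-AI `Engine` class: it only forwards calls to the wrapped objects, and the comments mark where a real engine could hook in features such as loss scaling or gradient clipping.

```python
class MiniEngine:
    """Toy wrapper bundling a model, an optimizer and a loss function."""

    def __init__(self, model, optimizer, criterion):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion

    def __call__(self, *args, **kwargs):
        # Forward pass: delegate to the wrapped model.
        return self.model(*args, **kwargs)

    def zero_grad(self):
        self.optimizer.zero_grad()

    def backward(self, loss):
        # A real engine could scale the loss or accumulate gradients here.
        loss.backward()

    def step(self):
        # A real engine could clip gradients before updating the parameters.
        self.optimizer.step()
```

Since the user only talks to the wrapper, new behaviour can be added inside `backward` or `step` without changing the training loop.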
-### Trainer
-Trainer is a higher-level wrapper that lets the user execute training with fewer lines of code. However, in pursuit of more abstraction, it loses some flexibility compared to engine. The trainer is designed to execute a forward and backward step to perform model weight update. It is easy to create a trainer object by passing the engine object. The trainer has a default value `None` for the argument `schedule`. In most cases, we leave this value as `None` unless we want to use pipeline parallelism. If you wish to explore more about this parameter, you can go to the tutorial on pipeline parallelism.
-```python
-from colossalai.logging import get_dist_logger
-from colossalai.legacy.trainer import Trainer, hooks
-# build components and initialize with colossalai.initialize
-...
-# create a logger so that trainer can log on the console
-logger = get_dist_logger()
-# create a trainer object
-trainer = Trainer(
-    engine=engine,
-    logger=logger
-)
-```
-In trainer, the user can customize some hooks and attach these hooks to the trainer object. A hook object will execute life-cycle methods periodically based on the training scheme. For example, the `LRSchedulerHook` will execute `lr_scheduler.step()` to update the learning rate of the model during either the `after_train_iter` or the `after_train_epoch` stage, depending on whether the user wants to update the learning rate after each training iteration or only after the entire training epoch. You can store the hook objects in a list and pass it to the `trainer.fit` method. The `trainer.fit` method will execute training and testing based on your parameters. If `display_progress` is True, a progress bar will be displayed on your console to show the training process.
-```python
-# define the hooks to attach to the trainer
-hook_list = [
-    hooks.LossHook(),
-    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
-    hooks.AccuracyHook(accuracy_func=Accuracy()),
-    hooks.LogMetricByEpochHook(logger),
-]
-# start training
-trainer.fit(
-    train_dataloader=train_dataloader,
-    epochs=NUM_EPOCHS,
-    test_dataloader=test_dataloader,
-    test_interval=1,
-    hooks=hook_list,
-    display_progress=True
-)
-```
-If you want to customize your own hook class, you can inherit `hooks.BaseHook` and override the life-cycle methods of your interest. A dummy example to demonstrate how to create a simple log message hook is provided below for your reference.
-```python
-from colossalai.logging import get_dist_logger
-from colossalai.legacy.trainer import hooks
-class LogMessageHook(hooks.BaseHook):
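-    def __init__(self, priority=10):
-        # pass the priority to the base class so the trainer can order the hooks
-        super().__init__(priority)
-        self._logger = get_dist_logger()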
-    def before_train(self, trainer):
-        self._logger.info('training starts')
-    def after_train(self, trainer):
-        self._logger.info('training finished')
-...
-# then in your training script
-hook_list.append(LogMessageHook())
-```
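How a trainer drives such life-cycle methods can be illustrated with a small, self-contained sketch. The classes below are stand-ins written for this example and are not the Colossal-AI implementation; in particular, running hooks with a smaller `priority` value first is an assumption made here.

```python
class BaseHook:
    """Stand-in base class with no-op life-cycle methods."""

    def __init__(self, priority):
        self.priority = priority

    def before_train(self, trainer):
        pass

    def after_train_epoch(self, trainer):
        pass

    def after_train(self, trainer):
        pass


class RecordingHook(BaseHook):
    """Records which life-cycle methods were called and in which order."""

    def __init__(self, name, priority, events):
        super().__init__(priority)
        self.name = name
        self.events = events

    def before_train(self, trainer):
        self.events.append(f"{self.name}: before_train")

    def after_train_epoch(self, trainer):
        self.events.append(f"{self.name}: after_train_epoch {trainer.epoch}")


class MiniTrainer:
    def __init__(self):
        self.epoch = 0

    def _call_hooks(self, hooks, method):
        for hook in hooks:
            getattr(hook, method)(self)

    def fit(self, epochs, hooks):
        # Assumption: hooks with a smaller priority value run first.
        hooks = sorted(hooks, key=lambda hook: hook.priority)
        self._call_hooks(hooks, "before_train")
        for epoch in range(epochs):
            self.epoch = epoch
            # ... one epoch of training would run here ...
            self._call_hooks(hooks, "after_train_epoch")
        self._call_hooks(hooks, "after_train")


events = []
MiniTrainer().fit(
    epochs=2,
    hooks=[RecordingHook("b", 10, events), RecordingHook("a", 0, events)],
)
for event in events:
    print(event)
```

Hooks only override the stages they care about; the base class supplies no-ops for the rest.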
-In the sections below, I will guide you through the steps required to train a ResNet model with both engine and trainer.
-## Explain with ResNet
-### Overview
-In this section we will cover:
-1. Use an engine object to train a ResNet34 model on CIFAR10 dataset
-2. Use a trainer object to train a ResNet34 model on CIFAR10 dataset
-The project structure will be like:
-```bash
-- config.py
-- run_resnet_cifar10_with_engine.py
-- run_resnet_cifar10_with_trainer.py
-```
-Steps 1-4 below are commonly used regardless of using engine or trainer. Thus, steps 1-4 + step 5 will be your `run_resnet_cifar10_with_engine.py` and steps 1-4 + step 6 will form `run_resnet_cifar10_with_trainer.py`.
-### Hands-on Practice
-#### Step 1. Create a Config File
-In your project folder, create a `config.py`. This file is to specify some features you may want to use to train your model. A sample config file is as below:
-```python
-from colossalai.amp import AMP_TYPE
-BATCH_SIZE = 128
-NUM_EPOCHS = 200
-fp16=dict(
-    mode=AMP_TYPE.TORCH
-)
-```
-In this config file, we specify that we want to use batch size 128 per GPU and run for 200 epochs. These two parameters are exposed by `gpc.config`. For example, you can use `gpc.config.BATCH_SIZE` to access the value you store in your config file. The `fp16` configuration tells `colossalai.initialize` to use mixed precision training provided by PyTorch to train the model with better speed and lower memory consumption.
-#### Step 2. Initialize Distributed Environment
-We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
-[launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and PyTorch launch utility.
-```python
-import colossalai
-# ./config.py refers to the config file we just created in step 1
-colossalai.launch_from_torch(config='./config.py')
-```
-#### Step 3. Create all the training components
-In this step, we can create all the components used for training. These components include:
-1. Model
-2. Optimizer
-3. Criterion/loss function
-4. Training/Testing dataloaders
-5. Learning rate Scheduler
-6. Logger
-To build these components, you need to import the following modules:
-```python
-from pathlib import Path
-from colossalai.logging import get_dist_logger
-import torch
-import os
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_dataloader
-from torchvision import transforms
-from colossalai.nn.lr_scheduler import CosineAnnealingLR
-from torchvision.datasets import CIFAR10
-from torchvision.models import resnet34
-```
-Then build your components the same way you would normally build them in your PyTorch scripts. In the script below, the root path for the CIFAR10 dataset is set to `./data`. You can change it to any path you like; for example, you can change `root='./data'` to `root=Path(os.environ['DATA'])` to read the path from an environment variable `DATA`.
-```python
-# build logger
-logger = get_dist_logger()
-# build resnet
-model = resnet34(num_classes=10)
-# build datasets
-train_dataset = CIFAR10(
-    root='./data',
-    download=True,
-    transform=transforms.Compose(
-        [
-            transforms.RandomCrop(size=32, padding=4),
-            transforms.RandomHorizontalFlip(),
-            transforms.ToTensor(),
-            transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
-                0.2023, 0.1994, 0.2010]),
-        ]
-    )
-)
-test_dataset = CIFAR10(
-    root='./data',
-    train=False,
-    transform=transforms.Compose(
-        [
-            transforms.ToTensor(),
-            transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
-                0.2023, 0.1994, 0.2010]),
-        ]
-    )
-)
-# build dataloaders
-train_dataloader = get_dataloader(dataset=train_dataset,
-                                  shuffle=True,
-                                  batch_size=gpc.config.BATCH_SIZE,
-                                  num_workers=1,
-                                  pin_memory=True,
-                                  )
-test_dataloader = get_dataloader(dataset=test_dataset,
-                                 add_sampler=False,
-                                 batch_size=gpc.config.BATCH_SIZE,
-                                 num_workers=1,
-                                 pin_memory=True,
-                                 )
-# build criterion
-criterion = torch.nn.CrossEntropyLoss()
-# optimizer
-optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
-# lr_scheduler
-lr_scheduler = CosineAnnealingLR(optimizer, total_steps=gpc.config.NUM_EPOCHS)
-```
-#### Step 4. Initialize with Colossal-AI
-Next, the essential step is to obtain the engine object by calling `colossalai.initialize`. As stated in `config.py`, we will be using mixed precision training for training the ResNet34 model. `colossalai.initialize` will automatically check your config file and assign relevant features to your training components. In this way, the engine object is already set up to train with mixed precision, and you do not have to take care of it explicitly.
-```python
-engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
-                                                                     optimizer,
-                                                                     criterion,
-                                                                     train_dataloader,
-                                                                     test_dataloader,
-                                                                     )
-```
-#### Step 5. Train with engine
-With all the training components ready, we can train ResNet34 just as in a normal PyTorch training loop.
-```python
-for epoch in range(gpc.config.NUM_EPOCHS):
-    # execute a training iteration
-    engine.train()
-    for img, label in train_dataloader:
-        img = img.cuda()
-        label = label.cuda()
-        # set gradients to zero
-        engine.zero_grad()
-        # run forward pass
-        output = engine(img)
-        # compute loss value and run backward pass
-        train_loss = engine.criterion(output, label)
-        engine.backward(train_loss)
-        # update parameters
-        engine.step()
-    # update learning rate
-    lr_scheduler.step()
-    # execute a testing iteration
-    engine.eval()
-    correct = 0
-    total = 0
-    for img, label in test_dataloader:
-        img = img.cuda()
-        label = label.cuda()
-        # run prediction without back-propagation
-        with torch.no_grad():
-            output = engine(img)
-            test_loss = engine.criterion(output, label)
-        # compute the number of correct prediction
-        pred = torch.argmax(output, dim=-1)
-        correct += torch.sum(pred == label)
-        total += img.size(0)
-    logger.info(
-        f"Epoch {epoch} - train loss: {train_loss:.5}, test loss: {test_loss:.5}, acc: {correct / total:.5}, lr: {lr_scheduler.get_last_lr()[0]:.5g}", ranks=[0])
-```
-#### Step 6. Train with trainer
-If you wish to train with a trainer object, you can follow the code snippet below:
-```python
-from colossalai.legacy.nn.metric import Accuracy
-from colossalai.legacy.trainer import Trainer, hooks
-# create a trainer object
-trainer = Trainer(
-    engine=engine,
-    logger=logger
-)
-# define the hooks to attach to the trainer
-hook_list = [
-    hooks.LossHook(),
-    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
-    hooks.AccuracyHook(accuracy_func=Accuracy()),
-    hooks.LogMetricByEpochHook(logger),
-    hooks.LogMemoryByEpochHook(logger)
-]
-# start training
-# run testing every 1 epoch
-trainer.fit(
-    train_dataloader=train_dataloader,
-    epochs=gpc.config.NUM_EPOCHS,
-    test_dataloader=test_dataloader,
-    test_interval=1,
-    hooks=hook_list,
-    display_progress=True
-)
-```
-#### Step 7. Start Distributed Training
-Lastly, we can invoke the scripts using the distributed launcher provided by PyTorch as we used `launch_from_torch` in Step 2. You need to replace `<num_gpus>` with the number of GPUs available on your machine. This number can be 1 if you only want to use 1 GPU. If you wish to use other launchers, you can refer to the tutorial on How to Launch Colossal-AI.
-```bash
-# with engine
-python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
-# with trainer
-python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
-```
-<!-- doc-test-command: echo  -->
--- a/docs/source/en/basics/initialize_features.md
+++ b/docs/source/en/basics/initialize_features.md
-# Initialize Features
-Author: Shenggui Li, Siqi Mai
-> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.
-**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
-## Introduction
-In this tutorial, we will cover the use of `colossalai.initialize` which injects features into your training components
-(e.g. model, optimizer, dataloader) seamlessly. Calling `colossalai.initialize` is the standard procedure before you enter
-your training loop.
-In the section below, I will cover how `colossalai.initialize` works and what we should take note of.
-## Usage
-In a typical workflow, we will launch the distributed environment at the beginning of our training script.
-Afterwards, we will instantiate our objects such as model, optimizer, loss function, dataloader etc. At this moment, `colossalai.initialize`
-can come in to inject features into these objects. A pseudo-code example is shown below:
-```python
-import colossalai
-import torch
-...
-# launch distributed environment
-colossalai.launch(config='./config.py', ...)
-# create your objects
-model = MyModel()
-optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
-criterion = torch.nn.CrossEntropyLoss()
-train_dataloader = MyTrainDataloader()
-test_dataloader = MyTestDataloader()
-# initialize features
-engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
-                                                                     optimizer,
-                                                                     criterion,
-                                                                     train_dataloader,
-                                                                     test_dataloader)
-```
-The `colossalai.initialize` function will return an `Engine` object. The engine object is a wrapper
-for model, optimizer and loss function. **The engine object will run with features specified in the config file.**
-More details about the engine can be found in [Use Engine and Trainer in Training](./engine_trainer.md).
--- a/docs/source/en/basics/model_checkpoint.md
+++ b/docs/source/en/basics/model_checkpoint.md
-# Model Checkpoint
-Author: Guangyang Lu
-> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster Checkpoint](../basics/booster_checkpoint.md) for more information.
-**Prerequisite:**
- [Launch Colossal-AI](./launch_colossalai.md)
- [Initialize Colossal-AI](./initialize_features.md)
-**Example Code:**
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
-**This function is experimental.**
-## Introduction
-In this tutorial, you will learn how to save and load model checkpoints.
-To leverage the power of parallel strategies in Colossal-AI, modifications to models and tensors are needed, so you cannot directly use `torch.save` or `torch.load` to save or load model checkpoints. Therefore, we provide an API that achieves the same thing.
-Moreover, when loading, you are not required to use the same parallel strategy as when saving.
-## How to use
-### Save
-There are two ways to train a model in Colossal-AI, by engine or by trainer.
-**Be aware that we only save the `state_dict`.** Therefore, when loading the checkpoints, you need to define the model first.
-#### Save when using engine
-```python
-import colossalai
-from colossalai.utils import save_checkpoint
-model = ...
-engine, _, _, _ = colossalai.initialize(model=model, ...)
-for epoch in range(num_epochs):
-    ... # do some training
-    save_checkpoint('xxx.pt', epoch, model)
-```
-#### Save when using trainer
-```python
-from colossalai.legacy.trainer import Trainer, hooks
-model = ...
-engine, _, _, _ = colossalai.initialize(model=model, ...)
-trainer = Trainer(engine, ...)
-hook_list = [
-            hooks.SaveCheckpointHook(1, 'xxx.pt', model),
-            ...]
-trainer.fit(...,
-            hooks=hook_list)
-```
-### Load
-```python
-from colossalai.utils import load_checkpoint
-model = ...
-load_checkpoint('xxx.pt', model)
-... # train or test
-```
-<!-- doc-test-command: echo  -->
--- a/docs/source/en/features/1D_tensor_parallel.md
+++ b/docs/source/en/features/1D_tensor_parallel.md
@@ -2,10 +2,6 @@
 Author: Zhengda Bian, Yongbin Li
-**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
 **Example Code**
 - [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples)

--- a/docs/source/en/features/2D_tensor_parallel.md
+++ b/docs/source/en/features/2D_tensor_parallel.md
@@ -3,8 +3,6 @@
 Author: Zhengda Bian, Yongbin Li
 **Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
 - [1D Tensor Parallelism](./1D_tensor_parallel.md)
 **Example Code**

--- a/docs/source/en/features/2p5D_tensor_parallel.md
+++ b/docs/source/en/features/2p5D_tensor_parallel.md
@@ -3,8 +3,6 @@
 Author: Zhengda Bian, Yongbin Li
 **Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
 - [1D Tensor Parallelism](./1D_tensor_parallel.md)
 - [2D Tensor Parallelism](./2D_tensor_parallel.md)

--- a/docs/source/en/features/3D_tensor_parallel.md
+++ b/docs/source/en/features/3D_tensor_parallel.md
@@ -3,8 +3,6 @@
 Author: Zhengda Bian, Yongbin Li
 **Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
 - [1D Tensor Parallelism](./1D_tensor_parallel.md)
 - [2D Tensor Parallelism](./2D_tensor_parallel.md)

--- a/docs/source/en/features/gradient_accumulation.md
+++ b/docs/source/en/features/gradient_accumulation.md
-# Gradient Accumulation (Outdated)
-Author: Shenggui Li, Yongbin Li
-**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
-**Example Code**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
-## Introduction
-Gradient accumulation is a common way to enlarge your batch size for training.
-When training large-scale models, memory can easily become the bottleneck and the batch size can be very small (e.g. 2),
-leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations
-and updating the parameters only once every preset number of iterations.
-## Usage
-It is simple to use gradient accumulation in Colossal-AI. Just add the following configuration to your config file.
-The integer represents the number of iterations to accumulate gradients.
-```python
-gradient_accumulation = <int>
-```
-## Hands-on Practice
-We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
-to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:
-```shell
-python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500  run_resnet_cifar10_with_engine.py
-```
-You will see output similar to the text below. This shows that the gradient is indeed accumulated, as the parameter is not updated
-in the first 3 steps, but only updated in the last step.
-```text
-iteration 0, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
-iteration 1, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
-iteration 2, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
-iteration 3, first 10 elements of param: tensor([-0.0141,  0.0464,  0.0507,  0.0321,  0.0356, -0.0150,  0.0172, -0.0118, 0.0222,  0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
-```
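The behaviour in the log above can be reproduced with a toy, framework-free sketch. It tracks a single scalar parameter with made-up gradient values and plain SGD; `train_with_accumulation` is a hypothetical name, and averaging the accumulated gradients is one common choice rather than a statement about what Colossal-AI does internally.

```python
def train_with_accumulation(grads, accumulation_steps, lr=0.1, param=1.0):
    """Apply SGD to one scalar parameter, updating every `accumulation_steps` iterations."""
    history = []
    accumulated = 0.0
    for iteration, grad in enumerate(grads):
        # Accumulate an averaged gradient instead of updating right away.
        accumulated += grad / accumulation_steps
        if (iteration + 1) % accumulation_steps == 0:
            # Only now is the parameter updated and the buffer reset.
            param -= lr * accumulated
            accumulated = 0.0
        history.append(param)
    return history


# Made-up gradients for four iterations with an accumulation size of 4.
history = train_with_accumulation([0.4, 0.8, 1.2, 1.6], accumulation_steps=4)
for iteration, value in enumerate(history):
    print(f"iteration {iteration}, param: {value:.4f}")
```

As in the log, the parameter keeps its initial value for the first three iterations and changes only on the fourth.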
-<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation.py  -->
--- a/docs/source/en/features/gradient_accumulation_with_booster.md
+++ b/docs/source/en/features/gradient_accumulation_with_booster.md
-# Gradient Accumulation (Latest)
+# Gradient Accumulation
 Author: [Mingyan Jiang](https://github.com/jiangmingyan)
 **Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
 - [Training Booster](../basics/booster_api.md)
 ## Introduction

--- a/docs/source/en/features/gradient_clipping.md
+++ b/docs/source/en/features/gradient_clipping.md
-# Gradient Clipping (Outdated)
-Author: Boxiang Wang, Haichen Huang, Yongbin Li
-**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
-**Example Code**
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
-**Related Paper**
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)
-## Introduction
-To speed up training and find better optima, more and more learning rate schedulers have been proposed.
-They control the learning rate to adjust the descent pace during training, which works best when the gradient
-vector has a uniform length at every step, so that the descent pace can be controlled as expected.
-As a result, gradient clipping, a technique that rescales the gradient vector so that its length stays
-within a fixed bound, has become indispensable for those who want better performance
-from their models.
-You do not have to worry about implementing gradient clipping when using Colossal-AI: we support gradient
-clipping in a powerful and convenient way. All you need is one additional line in your configuration
-file.
-## Why you should use gradient clipping provided by Colossal-AI
-We do not recommend writing gradient clipping yourself because naive gradient clipping
-may fail when applying tensor parallelism, pipeline parallelism or MoE.
-As the illustration below shows, each GPU only owns a portion of the weight parameters of a linear layer.
-To get the correct norm of the gradient of the weight, the norm contributions from every GPU must be combined
-(for the L2 norm, the squared norms are summed before taking the square root).
-To complicate matters, the bias is distributed differently from the weight,
-so the sum operation uses a different communication group.
-(PS: This illustration shows an old version of 2D parallelism; the implementation in the code is not the same.
-But it is a good example of how difficult it is to unify all communication in gradient clipping.)
-<figure style={{textAlign: "center"}}>
-<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
-<figcaption>Layout of parameters</figcaption>
-</figure>
-Do not worry about it, since Colossal-AI has handled it for you.
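To see why a per-GPU norm is not enough, consider this plain-Python sketch of L2-norm clipping over sharded gradients. Lists stand in for the gradient shards held by each GPU and a Python `sum` stands in for the all-reduce; `clip_sharded_grads` is a hypothetical helper, not the Colossal-AI implementation.

```python
import math


def clip_sharded_grads(shards, max_norm, eps=1e-6):
    """Clip gradient shards by their global L2 norm.

    `shards` is a list of lists, one inner list per simulated GPU.
    """
    # Each worker computes the squared L2 norm of its local shard.
    local_squared_norms = [sum(g * g for g in shard) for shard in shards]
    # Summing across workers stands in for an all-reduce over the process group.
    total_norm = math.sqrt(sum(local_squared_norms))
    # Every worker applies the same scaling factor to its own shard.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in shard] for shard in shards]
    return clipped, total_norm


shards = [[3.0, 0.0], [0.0, 4.0]]  # local norms are 3 and 4
clipped, total_norm = clip_sharded_grads(shards, max_norm=1.0)
print(total_norm)
print(clipped)
```

Clipping each shard against its own local norm (3 and 4 here) would scale the two shards by different factors and change the direction of the full gradient, which is why the global norm is needed.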
-### Usage
-To use gradient clipping, simply add the gradient clipping norm to your configuration file.
-```python
-clip_grad_norm = 1.0
-```
-### Hands-On Practice
-We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
-to demonstrate gradient clipping. In this example, we set the gradient clipping vector norm to be 1.0. You can run the script using this command:
-```shell
-python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500  train_with_engine.py
-```
-<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_clipping.py  -->
--- a/docs/source/en/features/gradient_clipping_with_booster.md
+++ b/docs/source/en/features/gradient_clipping_with_booster.md
-# Gradient Clipping (Latest)
+# Gradient Clipping
 Author: [Mingyan Jiang](https://github.com/jiangmingyan)
 **Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
 - [Training Booster](../basics/booster_api.md)
 **Related Paper**

--- a/docs/source/en/features/gradient_handler.md
+++ b/docs/source/en/features/gradient_handler.md
-# Gradient Handler
-Author: Shenggui Li, Yongbin Li
-**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
-**Example Code**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
-## Introduction
-In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
-need to make sure the parameters are updated with the same gradients in different machines so that the resulting parameters
-are the same. This is often seen in data parallel as the model is replicated across data parallel ranks.
-In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
-flexibility in cases such as implementing a new parallelism method.
-When gradient handlers are used, PyTorch `DistributedDataParallel` is not used, because it would synchronize gradients automatically on its own.
-## Customize Your Gradient Handlers
-To implement a customized gradient handler, follow these steps.
-1. Inherit from `BaseGradientHandler` in Colossal-AI.
-2. Register the gradient handler in `GRADIENT_HANDLER`.
-3. Implement the `handle_gradient` method.
-```python
-from colossalai.legacy.registry import GRADIENT_HANDLER
-from colossalai.legacy.engine.gradient_handler import BaseGradientHandler
-@GRADIENT_HANDLER.register_module
-class MyGradientHandler(BaseGradientHandler):
-    def handle_gradient(self):
-        do_something()
-```
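To make the idea of gradient synchronization concrete, here is a minimal plain-Python sketch of what a data-parallel handler conceptually computes: every rank ends up with the mean of all ranks' gradients. The ranks are simulated with a list, and `average_gradients` is a hypothetical name, not part of Colossal-AI; a real handler would use collective communication instead.

```python
def average_gradients(per_rank_grads):
    """Return the element-wise mean of the gradient vectors held by each rank.

    Hypothetical helper: `per_rank_grads` is a list with one gradient vector per rank.
    """
    world_size = len(per_rank_grads)
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return [s / world_size for s in summed]


# Two simulated ranks computed different gradients for the same parameters.
rank_grads = [[1.0, 2.0], [3.0, 6.0]]
synced = average_gradients(rank_grads)
```

After this step each rank would apply the same `synced` gradient, which keeps the replicated parameters identical.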
-## Usage
-To use a gradient handler, you need to specify your gradient handler in the config file. The gradient handler
-will be automatically built and attached to the engine.
-```python
-gradient_handler = [dict(type='MyGradientHandler')]
-```
-### Hands-On Practice
-We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
-to demonstrate the use of a gradient handler. In this example, we use `DataParallelGradientHandler` instead of PyTorch
-`DistributedDataParallel` for data parallel training.
-```shell
-python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500  train_with_engine.py
-```
-<!-- doc-test-command: echo  -->
--- a/docs/source/en/features/mixed_precision_training.md
+++ b/docs/source/en/features/mixed_precision_training.md
-# Auto Mixed Precision Training (Outdated)
-Author: Chuanrui Wang, Shenggui Li, Yongbin Li
-**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
-**Example Code**
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
-**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
-## Introduction
-AMP stands for automatic mixed precision training.
-In Colossal-AI, we have incorporated different implementations of mixed precision training:
-1. torch.cuda.amp
-2. apex.amp
-3. naive amp
-| Colossal-AI | support tensor parallel | support pipeline parallel | fp16 extent |
-| ----------- | ----------------------- | ------------------------- | ----------- |
-| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
-| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
-| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
-The first two rely on the original implementation of PyTorch (version 1.6 and above) and NVIDIA Apex.
-The last method is similar to Apex O2 level.
-Among these methods, apex AMP is not compatible with tensor parallelism.
-This is because tensors are split across devices in tensor parallelism, so the processes must communicate to check whether inf or NaN occurs anywhere in the model weights.
-We modified the torch AMP implementation so that it is now compatible with tensor parallelism.
-> ❌️ fp16 and zero configuration are not compatible
->
-> ⚠️ Pipeline parallelism currently supports only naive AMP
-We recommend using torch AMP, as it generally gives better accuracy than naive AMP when pipeline parallelism is not used.
-## Table of Contents
-In this tutorial we will cover:
-1. AMP introduction
-2. AMP in Colossal-AI
-3. Hands-on Practice
-## AMP Introduction
-Automatic Mixed Precision training is a mixture of FP16 and FP32 training.
-The half-precision floating-point format (FP16) has lower arithmetic complexity and higher compute efficiency.
-Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
-available for large batch size and model size.
-However, other operations, such as reductions, require the dynamic range of fp32 to avoid numeric overflow or underflow. This is why we use automatic mixed precision, which attempts to match each operation to its appropriate data type and thereby reduces the memory footprint and improves training efficiency.
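The limited dynamic range of fp16 can be seen without any deep learning library. The sketch below uses Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision; `to_fp16` is a hypothetical helper name used only here.

```python
import struct


def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Underflow: 1e-8 is below the smallest fp16 subnormal and becomes zero.
small = to_fp16(1e-8)

# Overflow: 70000 exceeds the largest finite fp16 value (65504),
# and struct refuses to pack it.
try:
    to_fp16(70000.0)
    overflowed = False
except OverflowError:
    overflowed = True
```

Small gradients vanishing in this way is the reason loss scaling is used alongside fp16 training.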
-<figure style={{textAlign: "center"}}>
-<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
-<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
-</figure>
-## AMP in Colossal-AI
-Colossal-AI supports three AMP training methods and lets you train with AMP without changing your code. Simply add the `fp16`
-configuration to your configuration file to use AMP.
-```python
-from colossalai.amp import AMP_TYPE
-# use Torch AMP
-fp16=dict(
-    mode = AMP_TYPE.TORCH
-)
-# use naive AMP
-fp16=dict(
-    mode = AMP_TYPE.NAIVE
-)
-# use NVIDIA Apex AMP
-fp16=dict(
-    mode = AMP_TYPE.APEX
-)
-```
-> These are the minimum configurations; the full configurations are described in later sections
-### AMP Modularity
-The AMP module is designed to be completely modular and can be used independently.
-If you only want to use AMP in your code base without `colossalai.initialize`,
-you can use `colossalai.amp.convert_to_amp`.
-```python
-import colossalai
-from colossalai.amp import AMP_TYPE
-# example of using torch amp
-model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
-                                                            optimizer,
-                                                            criterion,
-                                                            AMP_TYPE.TORCH)
-```
-### Torch AMP Configuration
-```python
-from colossalai.amp import AMP_TYPE
-fp16=dict(
-    mode=AMP_TYPE.TORCH,
-    # below are default values for grad scaler
-    init_scale=2.**16,
-    growth_factor=2.0,
-    backoff_factor=0.5,
-    growth_interval=2000,
-    enabled=True
-)
-```
-With optional arguments:
- init_scale(float, optional, default=2.**16): Initial scale factor
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.
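The interaction of `growth_factor`, `backoff_factor` and `growth_interval` described above can be sketched in a few lines of plain Python. This is a simplified model of dynamic loss scaling under the stated defaults, not the torch or Colossal-AI implementation; the class name `DynamicLossScaler` and the `found_inf` argument are assumptions made for this example.

```python
class DynamicLossScaler:
    """Simplified model of a dynamic loss scale (illustrative only)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive iterations without inf/NaN gradients

    def update(self, found_inf):
        if found_inf:
            # inf/NaN gradients: shrink the scale and restart the counter
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # enough clean iterations in a row: grow the scale
                self.scale *= self.growth_factor
                self._good_steps = 0


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
scaler.update(found_inf=True)        # scale shrinks from 1024 to 512
scale_after_overflow = scaler.scale
for _ in range(3):
    scaler.update(found_inf=False)   # three clean steps grow it back to 1024
```

A short `growth_interval` is used here only to keep the example small; the default of 2000 makes the scale grow much more cautiously.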
-### Apex AMP Configuration
-For this mode, we rely on the Apex implementation for mixed precision training.
-We support this plugin because it allows for finer control on the granularity of mixed precision.
-For example, O2 level (optimization level 2) will keep batch normalization in fp32.
-For more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).
-```python
-from colossalai.amp import AMP_TYPE
-fp16 = dict(
-    mode=AMP_TYPE.APEX,
-    # below are the default values
-    enabled=True,
-    opt_level='O1',
-    cast_model_type=None,
-    patch_torch_functions=None,
-    keep_batchnorm_fp32=None,
-    master_weights=None,
-    loss_scale=None,
-    cast_model_outputs=None,
-    num_losses=1,
-    verbosity=1,
-    min_loss_scale=None,
-    max_loss_scale=16777216.0
-)
-```
-Parameters:
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.
- opt_level(str, optional, default="O1" ): Pure or mixed precision optimization level.
-Accepted values are “O0”, “O1”, “O2”, and “O3”, explained in detail in the Apex AMP documentation linked above.
- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
-When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
-loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
-losses/backward passes, but use a single global loss scale for all of them.
- verbosity(int, default=1): Set to 0 to suppress Amp-related output.
- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
-The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
- max_loss_scale(float, default=2.**24 ): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
-scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.
-Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
-cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
-They are optional overrides that take effect once opt_level is determined:
- cast_model_type: Casts your model’s parameters and buffers to the desired type.
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
-### Naive AMP Configuration
-In Naive AMP mode, we achieved mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
-This AMP mode will cast all operations into fp16.
-The following code block shows the `config.py` file for this mode.
-```python
-from colossalai.amp import AMP_TYPE
-fp16 = dict(
-    mode=AMP_TYPE.NAIVE,
-    # below are the default values
-    log_num_zeros_in_grad=False,
-    initial_scale=2 ** 32,
-    min_scale=1,
-    growth_factor=2,
-    backoff_factor=0.5,
-    growth_interval=1000,
-    hysteresis=2
-)
-```
-The default parameters of Naive AMP:
- log_num_zeros_in_grad(bool): return number of zeros in the gradients.
- initial_scale(int): initial scale of gradient scaler
- growth_factor(int): the growth rate of loss scale
- backoff_factor(float): the decrease rate of loss scale
- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info
-When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
-The output model is converted to AMP model of smaller memory consumption.
-If your input model is already too large to fit in a GPU, please instantiate your model weights in `dtype=torch.float16`.
-Otherwise, try smaller models or check out other parallel training techniques!
-## Hands-on Practice
-We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
-the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.
-### Step 1. Create a config file
-Create a `config.py` and add the `fp16` configuration.
-```python
-# in config.py
-from colossalai.amp import AMP_TYPE
-BATCH_SIZE = 128
-DROP_RATE = 0.1
-NUM_EPOCHS = 300
-fp16 = dict(
-    mode=AMP_TYPE.TORCH,
-)
-clip_grad_norm = 1.0
-```
-### Step 2. Import libraries in train_with_engine.py
-Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
-`pip install timm scipy`.
-```python
-import os
-import colossalai
-import torch
-from pathlib import Path
-from colossalai.core import global_context as gpc
-from colossalai.logging import get_dist_logger
-from colossalai.utils import get_dataloader
-from colossalai.legacy.trainer import Trainer, hooks
-from colossalai.nn.lr_scheduler import LinearWarmupLR
-from timm.models import vit_base_patch16_224
-from torchvision import datasets, transforms
-```
-### Step 3. Initialize Distributed Environment
-We then need to initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
-for other initialization methods.
-```python
-# initialize distributed setting
-parser = colossalai.get_default_parser()
-args = parser.parse_args()
-# launch from torch
-colossalai.launch_from_torch(config=args.config)
-```
-### Step 4. Create training components
-Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
-obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
-to a path on your machine. Data will be automatically downloaded to the root path.
-```python
-# build model
-model = vit_base_patch16_224(drop_rate=0.1)
-# build dataloader
-train_dataset = datasets.Caltech101(
-    root=Path(os.environ['DATA']),
-    download=True,
-    transform=transforms.Compose([
-        transforms.Resize(256),
-        transforms.RandomResizedCrop(224),
-        transforms.RandomHorizontalFlip(),
-        transforms.ToTensor(),
-        # Gray2RGB is not defined in this snippet; it is assumed to be a custom
-        # transform from the example script that expands grayscale images to 3 channels
-        Gray2RGB(),
-        transforms.Normalize([0.5, 0.5, 0.5],
-                             [0.5, 0.5, 0.5])
-    ]))
-train_dataloader = get_dataloader(dataset=train_dataset,
-                                  shuffle=True,
-                                  batch_size=gpc.config.BATCH_SIZE,
-                                  num_workers=1,
-                                  pin_memory=True,
-                                  )
-# build optimizer
-optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
-# build loss
-criterion = torch.nn.CrossEntropyLoss()
-# lr_scheduler
-lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
-```
-### Step 5. Inject AMP Feature
-Call `colossalai.initialize` to convert the training components so that they run with FP16.
-```python
-engine, train_dataloader, _, _ = colossalai.initialize(
-        model, optimizer, criterion, train_dataloader,
-    )
-```
-### Step 6. Train with Engine
-Use the engine in a normal training loop.
-```python
-engine.train()
-for epoch in range(gpc.config.NUM_EPOCHS):
-    for img, label in train_dataloader:
-        img = img.cuda()
-        label = label.cuda()
-        engine.zero_grad()
-        output = engine(img)
-        loss = engine.criterion(output, label)
-        engine.backward(loss)
-        engine.step()
-        lr_scheduler.step()
-```
-### Step 7. Invoke Training Scripts
-Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
-```shell
-python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
-```
-<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py  -->