We list some common problems encountered by users and the corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others solve them.
## Training
- "Loss goes NaN or very large"
1. Check if the dataset annotations are valid. Mask must be `{0, 1}` where `1` for tokens that are **not masked** and `0` for tokens that are **masked**.
2. Check `initializer_range` in the config file. It can be safely set to `0.02` in most cases. If the model size is very large, decreasing `initializer_range` is a good choice. For example, `initializer_range` can be set to `0.006` when training the 175-billion-parameter GPT-3 configuration (see the config sketch at the end of this section).
- "AMP enabled goes NaN"
Set `ONEFLOW_DEBUG_KERNEL_SYNC_CHECK_NUMERICS=1` to check what triggers an overflow of the value range in fp16.
- "GPU out of memory when validation"
Decrease `test_micro_batch_size` and use `--fast-dev-run` to quickly run through training and evaluation and check whether memory is sufficient.
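As a minimal sketch (using the LazyConfig conventions shown later in this document; the exact import paths are illustrative and depend on your project layout), both overrides can be applied in your own config file:
```python
# my_config.py -- a minimal sketch, reusing the `bert_model` config shown later in this document
from bert import bert_model as model
from libai.config import get_config

train = get_config("common/train.py").train   # default training config

model.initializer_range = 0.006    # smaller init scale for very large models
train.test_micro_batch_size = 8    # smaller per-GPU eval batch size to avoid OOM
```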
## Model
- "`apply_query_key_layer_scaling` in MultiheadAttention"
As the number of attention heads increases, some of the GEMMs inside the self-attention layer become smaller, and the number of elements in the self-attention softmax also increases; scaling the query-key scores per layer (as Megatron-LM does) helps keep these values numerically stable in fp16.
- "QKV implementation is not consistent with Hugging Face in self attention"
In tensor parallelism, the `chunk` dimension and the `flow.sbp.split` dimension would be the same in Huggingface's implementation, which can cause unexpected behaviors (i.e., changing the tensor's SBP unexpectedly).
We also provide a tutorial about how to load Huggingface weights correctly. Please refer to [How to use Huggingface's pretrained weights in LiBai](https://libai.readthedocs.io/en/latest/notes/How_to_implement_huggingface%27s_weights_in_LiBai.html) for more details.
- "the order of layer normalization and the residual connection"
This is critical to enable the scaling of BERT-style models beyond BERT-Large. The architecture with `apply_residual_post_layernorm=False` eliminates the instabilities observed with the original BERT architecture (`apply_residual_post_layernorm=True`) and also has a lower training loss according to [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf).
If you run into problems that are hard to understand, feel free to open an issue in [OneFlow](https://github.com/Oneflow-Inc/oneflow) to collect feedback.
# Detailed instruction on building Vision Transformer models in LiBai
It's easy for users to build transformer-based models using LiBai's built-in [layers](https://libai.readthedocs.io/en/latest/modules/libai.layers.html). Let's take a deep dive into the process of building a Vision Transformer model in LiBai.
## Model Architecture
**Vision Transformer** was released in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
A **Vision Transformer** model contains three parts: `Patch Embedding` + `Transformer Block` + `Linear Classification Head`, which can be summarized in the following picture:

## A simple Torch implementation of Vision Transformer
The PyTorch implementation of the ViT model shown in this tutorial is modified from [timm.models.vision_transformer](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py).
We have further decoupled the forward function into `forward_features` and `forward_head`:
- `forward_features`: extract the image features using the `patch_embed` layer and a stack of `transformer` blocks
- `forward_head`: take the `cls_token` of each sample and use `nn.Linear` for classification
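The full timm-style code is not reproduced here; the snippet below is only a rough sketch of this decoupled forward pass, assuming the usual `patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm` and `head` attributes have been built in `__init__`:
```python
import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    # __init__ (omitted) builds self.patch_embed, self.cls_token, self.pos_embed,
    # self.blocks (a stack of transformer Blocks), self.norm and self.head

    def forward_features(self, x):
        x = self.patch_embed(x)                                 # [B, N, C] patch tokens
        cls_token = self.cls_token.expand(x.shape[0], -1, -1)   # [B, 1, C]
        x = torch.cat((cls_token, x), dim=1) + self.pos_embed
        x = self.blocks(x)
        return self.norm(x)

    def forward_head(self, x):
        return self.head(x[:, 0])    # classify on the cls_token

    def forward(self, x):
        return self.forward_head(self.forward_features(x))
```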
## Implement 3D parallel Vision Transformer in LiBai
In this section, we will show users how to use [libai.layers](https://libai.readthedocs.io/en/latest/modules/libai.layers.html) to build a 3D parallel Vision Transformer model with only 100+ lines of code, which is modified from [libai.models.vision_transformer](https://github.com/Oneflow-Inc/libai/blob/main/libai/models/vision_transformer.py).
To get the LiBai implementation of the Vision Transformer model, users only need to replace the PyTorch modules with the corresponding `libai.layers`, as detailed below.
## Details about LiBai's implementation of the Vision Transformer model
**1. Replace nn.Module with libai.layers**
LiBai has already implemented `PatchEmbedding`, `TransformerLayer`, `Linear`, `LayerNorm` layers, and users only need to replace the module in Torch Vision Transformer models to convert a Torch model into LiBai's style:
- `Block` -> `libai.layers.TransformerLayer`
- `nn.Linear` -> `libai.layers.Linear`
- `nn.LayerNorm` -> `libai.layers.LayerNorm`
- `PatchEmbed` -> `libai.layers.PatchEmbedding`
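After the replacement, the skeleton of the model might look roughly like this (only a sketch; the argument names and sizes are illustrative, and the real implementation lives in [libai.models.vision_transformer](https://github.com/Oneflow-Inc/libai/blob/main/libai/models/vision_transformer.py)):
```python
import oneflow.nn as nn
from libai.layers import LayerNorm, Linear, PatchEmbedding, TransformerLayer

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=192,
                 depth=12, num_heads=3, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, embed_dim=embed_dim)
        # every TransformerLayer carries a layer_idx so it can later be mapped to a pipeline stage
        self.blocks = nn.ModuleList(
            [
                TransformerLayer(embed_dim, 4 * embed_dim, num_heads, layer_idx=i)
                for i in range(depth)
            ]
        )
        self.norm = LayerNorm(embed_dim, layer_idx=-1)
        self.head = Linear(embed_dim, num_classes, layer_idx=-1)
```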
**2. Manually set the SBP signature of `cls_token` and `pos_embed`**
In order to fit different parallel modes in LiBai, users must manually set the [SBP signature](https://docs.oneflow.org/en/master/parallelism/02_sbp.html#spb-signature) for all the parameters and buffers of those layers not implemented in LiBai, like `cls_token` and `pos_embed` in Vision Transformer:
- The SBP signature returned by `dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])` means to broadcast `cls_token` and `pos_embed` across each GPU group.
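For example, in the LiBai-style ViT these two parameters can be created directly as global tensors (a sketch of the `__init__` code; `embed_dim` and `num_patches` are assumed to be defined earlier):
```python
import oneflow as flow
import oneflow.nn as nn
from libai.utils import distributed as dist

# inside VisionTransformer.__init__
self.cls_token = nn.Parameter(
    flow.zeros(
        1, 1, embed_dim,
        sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
        placement=dist.get_layer_placement(0),  # place on the first pipeline stage
    )
)
self.pos_embed = nn.Parameter(
    flow.zeros(
        1, num_patches + 1, embed_dim,
        sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
        placement=dist.get_layer_placement(0),
    )
)
```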
**3. Use the `to_global()` function to update the SBP signature of `cls_token` and `pos_embed` during forward function**
In the forward function, `cls_token` and `pos_embed` will be expanded to fit the input size. For efficiency, we can use the `to_global()` function to match the SBP signatures of `cls_token` and `pos_embed` with the input's SBP signature like this:
```python
def forward_features(self, x):
    cls_token = self.cls_token.expand(
        x.shape[0], -1, -1
    )
    # use to_global to update the sbp signature of cls_token to match the input x
    cls_token = cls_token.to_global(sbp=x.sbp, placement=cls_token.placement)
    ...
```
**4. Manually set the stage id for pipeline parallel training**
Most of the built-in layers in LiBai have an argument named `layer_idx` for pipeline-parallel settings. To configure a **1F1B pipeline parallel** model, users should manually set the stage id for each layer in the model, which will automatically assign different layers to different stages and insert buffers in the forward & backward computation for 1F1B pipeline-parallel training. With the help of `layer_idx`, we can simply get a pipeline-parallel Vision Transformer model like:
```python
import libai.utils.distributed as dist

@staticmethod
def set_pipeline_stage_id(model):
    """
    This is a staticmethod for classes inherited from nn.Module,
    used to assign a pipeline stage to each sub-module.
    """
    dist_utils = dist.get_dist_util()
    # Set pipeline parallelism stage_id
    for module_block in model.modules():
        # module_block.to(nn.Module) can get the original module
        ...
```
- Automatically assign the stage for each `TransformerLayer` according to its `layer_idx` arg
- `cls_token`, `pos_embed`, `pos_drop` should be on the first stage
- `norm`, `head` and `loss_func` should be on the last stage
Please see [Write your own pipeline parallel model](https://libai.readthedocs.io/en/latest/tutorials/advanced_tutorials/customize_parallel.html#write-your-own-pipeline-parallel-model) for more details about the settings of pipeline parallel training in LiBai.
## Model Loader for HuggingFace
If you want to define your own HuggingFace model loader, you can inherit the base `ModelLoaderHuggerFace` class in `libai.models.utils.model_utils.base_loader`.
Then you need to overwrite the `_convert_state_dict` and `_load_config_from_json` methods to load HuggingFace's pretrained model in LiBai.
Finally, you need to set the `base_model_prefix_1` and `base_model_prefix_2` attributes, which represent the base model name for HuggingFace and LiBai respectively.
The following sketch shows how a custom `ModelLoaderHuggerFace` might look (the class wrapper and constructor signature are illustrative):
"""NOTE: base_model_prefix_1 is ToyModel's prefix in Transformers.
base_model_prefix_2 is ToyModel's prefix in LiBai."""
self.base_model_prefix_1="toy_model"
self.base_model_prefix_2="toy_model"
def_convert_state_dict(self,flow_state_dict,cfg):
"""Convert state_dict's keys to match model.
Args:
flow_state_dict (OrderedDict): model state dict.
cfg (dict): model's default config dict in LiBai.
Returns:
OrderedDict: flow state dict.
"""
...
def_load_config_from_json(self,config_file):
"""load config from `config.json`, and update default config.
Args:
config_file (str): Path of config file.
"""
...
```
## Model Loader for LiBai
If you want to define your own LiBai model loader, you can inherit the base `ModelLoaderLiBai` class in `libai.models.utils.model_utils.base_loader`.
You only need to set the `base_model_prefix_2` attribute to load LiBai's pretrained model.
The following code shows how to use a custom `ModelLoaderLiBai`:
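The snippet below is only a minimal sketch; the constructor signature is assumed to mirror the HuggingFace loader above:
```python
from libai.models.utils.model_utils.base_loader import ModelLoaderLiBai

class ToyModelLoaderLiBai(ModelLoaderLiBai):
    def __init__(self, model, libai_cfg, pretrained_model_path, **kwargs):
        super().__init__(model, libai_cfg, pretrained_model_path, **kwargs)
        # base_model_prefix_2 is ToyModel's prefix in LiBai
        self.base_model_prefix_2 = "toy_model"
```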
# Detailed instruction for using distributed inference in LiBai
If you want to use distributed inference in LiBai with a pretrained `pytorch` model, you can refer to the [DALLE2 inference doc](https://github.com/Oneflow-Inc/libai/blob/main/docs/source/notes/How_to_use_model_parallel_in_LiBai.md). A [Chinese doc for distributed inference](https://github.com/Oneflow-Inc/libai/discussions/386) is also available.
Here we introduce how to use distributed inference in LiBai:
## Check `model.py`
Check your `model.py` first:
1. Ensure there are `libai.layers` in your `model.py`:
```python
# NOTE: you don't need to import all layers from libai; if you only use libai.layers.Linear
# in your `model.py`, your model will run model/pipeline parallel only in the `Linear` layers
from libai.layers import (
    Linear,
    LayerNorm,
    ...
)
```
2. If you want to run pipeline parallel in LiBai, you should additionally insert code `x = x.to_global(placement=target_tensor.placement)` in your `model.forward()`.
It is equivalent to the torch code `x.to(cuda_device)`, which moves a tensor from one GPU to another. There are many examples in LiBai: [example1](https://github.com/Oneflow-Inc/libai/blob/92dbe7c1b1496290e32e595f8473f9288ea1886e/projects/MT5/layers/attention_layer.py#L220), [example2](https://github.com/Oneflow-Inc/libai/blob/92dbe7c1b1496290e32e595f8473f9288ea1886e/projects/MT5/layers/attention_layer.py#L156) ...
If you don't know where to insert the code, you can run your code first; it will raise an error at the line that needs `to_global`.
For example:
```shell
File "libai/libai/layers/layer_norm.py", line 129, in forward
return flow._C.rms_layer_norm(hidden_states, self.weight, self.l2norm_epsilon) RuntimeError: return flow._C.rms_layer_norm(hidden_states, self.weight, self.l2norm_epsilon)RuntimeErrorExpected all tensors to be on the same placement, but found at least two placements, oneflow.placement(type="cuda", ranks=[0, 1]) (positional 0) and oneflow.placement(type="cuda", ranks=[2, 3]) (positional 1)!
```
## Build `config.py`
If your model is trained with LiBai, you can use the same `config.py` as for training. Refer to [Couplets](https://github.com/Oneflow-Inc/libai/tree/main/projects/Couplets#inference) for more details.
If your model is trained with another framework, you should build your own `inference_config.py`. You can refer to [`dalle2_config.py`](https://github.com/Oneflow-Inc/libai/blob/main/projects/DALLE2/configs/dalle2_config.py) and [`t5_inference_config.py`](https://github.com/Oneflow-Inc/libai/blob/main/projects/MT5/configs/t5_inference.py).
## Refine `pipeline_inference.py`
The base class [libai/inference/basic.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/inference/basic.py) is already provided in LiBai;
users only need to override the functions they need. Refer to [text_generation.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/inference/text_generation.py).
If your model is trained with LiBai, it will be easy to use; you can refer to [distribute_infer.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/Couplets/distribute_infer.py) in [Couplets](https://github.com/Oneflow-Inc/libai/tree/main/projects/Couplets).
If your model is trained with another framework, you need to build your own model loader to load your model weights in LiBai; refer to [model_loader](https://libai.readthedocs.io/en/latest/notes/How_to_load_huggingface%27s_pretrained_model_in_libai.html) for more details.
To give a simple example, here is a function overridden in LiBai:
```python
from libai.inference.basic import BasePipeline
from libai.utils import distributed as dist

class MyPipeline(BasePipeline):
    def _parse_parameters(self, **pipeline_parameters):
        # By overloading this function, the input parameters in MyPipeline.__call__()
        # are handed out to the preprocess/forward/postprocess stages of inference.
        ...
```
# How to use Huggingface's pretrained weights in LiBai
The built-in layers in [LiBai](https://github.com/Oneflow-Inc/libai) adopt a structure that is more suitable for parallel training, so the implementation in LiBai may differ slightly from that in Huggingface. In this tutorial, we will show users how to correctly load Huggingface's pretrained weights into a LiBai model. Let's take BERT as an example.
## LiBai Transformer vs Huggingface Transformer
There are subtle differences in the BERT structure as shown in the following figure (left: LiBai, right: Huggingface), which can be summarized as:
- Location of layernorm: The location of layernorm is different, but the calculation order is the same.
- A different slicing way to get the `query`, `key` and `value` matrix.
- LiBai follows [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for the default order of the layernorm and the residual connections. Megatron-LM shows that this structure eliminates instabilities and brings a lower training loss. LiBai can also support the original BERT architecture mentioned in the [paper](https://arxiv.org/pdf/1810.04805.pdf) by setting `apply_residual_post_layernorm=True`.

## QKV slicing logic
LiBai's QKV slicing logic is different from that in Huggingface.
- For detailed examples and a verification script, please refer to [load-huggingface-bert](https://github.com/Oneflow-Inc/libai/tree/test_bert_load_huggingface_weight/projects/test_bert_load_huggingface_weight). The sketch below illustrates the difference.
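As a rough illustration only (the per-head interleaved layout shown for LiBai is an assumption of this sketch; see the project above for the exact conversion), the fused QKV weight is laid out differently in the two implementations, so HuggingFace's q/k/v weights must be re-ordered before being loaded into LiBai's fused weight:
```python
import numpy as np

hidden_size, num_heads = 8, 2
head_size = hidden_size // num_heads
q = np.zeros((hidden_size, hidden_size))
k = np.ones((hidden_size, hidden_size))
v = np.full((hidden_size, hidden_size), 2.0)

# Huggingface-style fusion: concatenate [q; k; v] along the output dimension,
# so `chunk(3, dim=0)` recovers the three matrices directly.
hf_qkv = np.concatenate([q, k, v], axis=0)

# Assumed LiBai/Megatron-style fusion: q/k/v are interleaved per attention head,
# i.e. [q_head0; k_head0; v_head0; q_head1; k_head1; v_head1; ...]
libai_qkv = np.concatenate(
    [
        np.concatenate([w[i * head_size : (i + 1) * head_size] for w in (q, k, v)], axis=0)
        for i in range(num_heads)
    ],
    axis=0,
)

# The two layouts differ, which is why the weights cannot be copied one-to-one.
print(np.array_equal(hf_qkv, libai_qkv))  # False
```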
# Detailed instruction on using model parallel in LiBai
This document is a tutorial for users to learn how to transfer a pytorch model to oneflow and use model parallelism in LiBai for inference. We will first take the DALLE2 model as an example, and then we will show how to use model parallelism, which can be done easily in LiBai.
**Note**: the code of DALLE2 is adapted from [this repo](https://github.com/lucidrains/DALLE2-pytorch), which is an unofficial implementation. The final result may differ from the original generated images in the [paper](https://arxiv.org/abs/2204.06125). You can also try the model in [google colab](https://colab.research.google.com/github/LAION-AI/dalle2-laion/blob/main/notebooks/dalle2_laion_alpha.ipynb).
## Transfer a pytorch model to oneflow
It's easy for users to transfer a pytorch model into oneflow, since most of oneflow's API is consistent with pytorch. First we change `import torch` to `import oneflow as flow`, and then we can replace all `torch` in the code with `flow`. If the model works correctly in the original
pytorch code, it's likely to work correctly in oneflow. Sometimes the program may raise an error like
```
AttributeError: module 'oneflow' has no attribute 'xxx'
```
Try installing the latest version of oneflow, which might help; you can find more details [here](https://github.com/Oneflow-Inc/oneflow#install-oneflow).
**1. Download the pytorch DALLE2 model**:
As shown in the [google colab](https://colab.research.google.com/github/LAION-AI/dalle2-laion/blob/main/notebooks/dalle2_laion_alpha.ipynb), we will use version 0.15.4;
the pretrained model weights can be found on huggingface: [the prior weight](https://huggingface.co/zenglishuci/conditioned-prior/resolve/main/vit-l-14/prior_aes_finetune.pth) and [the decoder weight](https://huggingface.co/laion/DALLE2-PyTorch/resolve/main/decoder/1.5B_laion2B/latest.pth).
text=["a dolphin in an astronaut suit on saturn, artstation"]
images=gen_text_and_img_emb(text)
save_images(images)
if__name__=="__main__":
main()
```
Run `python inference_dalle2.py`; this should work.
## 2. Change the deep learning framework to oneflow
As mentioned above, we replace all the `torch` symbols with `flow` by first changing `import torch` to `import oneflow as flow` in all python files.
It should be noted that the original pytorch code also imports other python packages that use the pytorch backend, such as [einops](https://github.com/arogozhnikov/einops), [einops_ext](https://github.com/lucidrains/einops-exts) and [kornia](https://github.com/kornia/kornia), which should also be modified at the same time.
Fortunately, only a few APIs of these packages are used, so we can take the relevant code out of the github repos and merge it into a separate file.
For example, we can simply create an einops_ext.py file adapted from [here](https://github.com/lucidrains/einops-exts/blob/main/einops_exts/einops_exts.py); then we can import einops_ext from the oneflow-based python file instead of from the torch-based package.
[LiBai](https://github.com/Oneflow-Inc/libai) is a large-scale open-source model training toolbox based on OneFlow.
LiBai provides many efficient APIs which can be easily used for distributed training and evaluation. It also supports some popular models under the projects folder, such as [CLIP](https://github.com/Oneflow-Inc/libai/tree/main/projects/CLIP). To avoid duplication of work, we directly use the CLIP model implemented in LiBai. The relevant code in the original pytorch code is the `OpenAIClipAdapter` class.
[DiffusionPrior](https://github.com/lucidrains/DALLE2-pytorch/blob/v0.15.4/dalle2_pytorch/dalle2_pytorch.py#L873) and [Decoder](https://github.com/lucidrains/DALLE2-pytorch/blob/v0.15.4/dalle2_pytorch/dalle2_pytorch.py#L1802) follow their original implementation.
**Using libai.layers**
LiBai provides multiple parallelisms such as Data Parallelism, Tensor Parallelism, and Pipeline Parallelism. To experience these features, we will use libai.layers like Linear and LayerNorm:
```python
from libai.layers import Linear, LayerNorm
```
The `nn.Linear` layers will be replaced with `libai.layers.Linear`.
**Compare the outputs** To make sure the code is correctly modified from `torch` to `flow`, it's necessary to compare the outputs to see if they are the same after the change. A notable point here is that in the sampling stage, the noise is randomly generated, like
```python
noise = flow.randn(shape)
# or noise = torch.randn(shape) in torch code
```
torch and oneflow will generate different numbers here even if they are set to the same random seed. An alternative is to make a transition through numpy:
```python
import numpy as np

np.random.seed(6666)
noise = flow.tensor(np.random.randn(*shape))
# or noise = torch.tensor(np.random.randn(*shape)) in torch code
```
When the model is fed the same input text, the output images from the oneflow and torch code should be the same.
**LazyConfig and LazyCall**
LiBai provides the LazyConfig system for more flexible syntax and no predefined structures; find more [here](https://libai.readthedocs.io/en/latest/tutorials/basics/Config_System.html). For DALLE2, the config file can be written as:
```python
# dalle2_config.py (sketch): the prior definition and the remaining unet1 arguments are omitted here
unet1 = LazyCall(Unet)(
    # ...
    cond_on_text_encodings=True,  # set to True for any unets that need to be conditioned on text encodings
    self_attn=[False, True, True, True],
)
decoder = LazyCall(Decoder)(
    unet=(unet1,),
    image_sizes=[64,],
    clip=None,
    channels=3,
    timesteps=1000,
    loss_type="l2",
    beta_schedule=["cosine"],
    learned_variance=True,
)
dalle2_model = LazyCall(DALLE2)(
    prior=prior,
    decoder=decoder,
    prior_weight_path='',
    decoder_weight_path='',
)
```
## 4. Model parallelism in LiBai
In order to achieve model-parallel inference in LiBai, we should set the parallel mode according to our needs. The default value of the `parallel` argument in `libai.layers.Linear` is `data`, which means data parallel. To achieve model parallelism, we need to change `parallel` to `col` or `row`. The most efficient way is to set the Linear layers in the col -> row -> col order.
A transformer block contains an attention and a feedforward submodule, and each submodule contains exactly 2 Linear layers.
The attention module contains the qkv projection and the out projection, so we set the qkv projection as `col` and the out projection as `row`:
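A minimal sketch of this wiring (the module names and layer sizes here are illustrative, not taken from the DALLE2 code):
```python
import oneflow as flow
from libai.layers import Linear

class ToyAttention(flow.nn.Module):
    def __init__(self, hidden_size=1024):
        super().__init__()
        # qkv projection: column-parallel, so the fused output is split across GPUs
        self.qkv = Linear(hidden_size, 3 * hidden_size, parallel="col")
        # out projection: row-parallel, so its input matches the column-parallel output
        self.out = Linear(hidden_size, hidden_size, parallel="row")

class ToyFeedForward(flow.nn.Module):
    def __init__(self, hidden_size=1024, ffn_size=4096):
        super().__init__()
        self.fc1 = Linear(hidden_size, ffn_size, parallel="col")
        self.fc2 = Linear(ffn_size, hidden_size, parallel="row")
```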
# Dataloaders in LiBai
Dataloader is the component that provides data to models. A dataloader usually (but not necessarily) takes raw information from a dataset and processes it into the format needed by the model; see [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html).
## How the Existing Dataloader Works
LiBai contains a built-in data loading pipeline. It's beneficial to understand how it works, in case you need to write a custom one.
LiBai provides some functions [build_{image,nlp}_{train,test}_loader](https://libai.readthedocs.io/en/latest/modules/libai.data.html#libai.data.build.build_nlp_train_loader) that create a default dataloader from a given config. Here is how `build_{image,nlp}_{train,test}_loader` work:
1. It instantiates the `list[flow.utils.Dataset]` (e.g., `BertDataset`) by loading some dataset items in a lightweight format. These dataset items are not yet ready to be used by the model (e.g., images are not loaded into memory, random augmentations have not been applied, etc.).
2. The output format of dataset (`__getitem__(...)`) must be a dict whose keys must be consistent with argument names of the dataloader's consumer (usually the `model.forward(...)`). The role of the process is to transform the lightweight representation of a dataset item into a format that is ready for the model to consume (including, e.g., read images, perform random data augmentation and convert to oneflow Tensors). If you would like to perform custom transformations to data, you often want to rewrite it. Details about the dataset format can be found in [write dataloaders](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Dataloaders.html).
3. The outputs of the dataset are simply batched with the following function.
```python
def trivial_batch_collator(batch):
    assert isinstance(batch[0], Instance), "batch[0] must be `Instance` for trivial batch collator"
    batch = Instance.stack(batch)
    return batch
```
4. This batched data is the output of the dataloader. Typically, it's also the input of `get_batch`. After `get_batch(...)`, it becomes the input of `model.forward()`. `get_batch` simply changes the local tensors to global tensors with the given `sbp` and `placement` meta information.
```python
@classmethod
def get_batch(cls, data, mixup_func=None):
    ...
    ret_dict = {}
    for key, value in data.get_fields().items():
        value.to_global()
        ret_dict[key] = value.tensor
    return ret_dict
```
## Use Custom Dataloader
If you use `DefaultTrainer`, you can overwrite its `build_train_loader` method to use your own dataloader which can be implemented with any tools you like. But you need to make sure that each rank is reading the data correctly under different parallelism circumstances.
Then you need to overwrite the `get_batch` method. The `data` argument in `get_batch` is the output of your dataloader. You need to change the local tensors to global tensors manually, which means you should set the `sbp` and `placement` correctly.
Here is an example: the process of rank0 gets all the data and redistributes it to the other ranks.
```python
@classmethod
def get_batch(cls, data, mixup_func=None):
    if data is None:
        # not rank0, set placeholders for data
        # Note: make sure imgs and labels have the same shape and dtype on all ranks
        ...
```
# Customize Parallelism
Common parallelisms have already been implemented in LiBai, such as data parallel, tensor parallel and pipeline parallel, but there is also a need for user-customized parallelism. In this tutorial, we will show you how to customize your own parallelism.
## Define your own Parallel Model with LiBai.layers
### Large-scale FC
Suppose you have a huge fully-connected layer for large-scale classification (e.g., 10 million classes), which makes it impossible to fit into a single GPU.
Don't worry: with the help of `LiBai.layers`, you can write the model in the same familiar way you would for a single GPU. Here is a simple example showing how to write a tensor-parallel fully-connected layer on 2 GPUs.
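The snippet below is only a sketch of this idea (the sizes are illustrative, and the exact distributed setup call may differ slightly in your LiBai version):
```python
import oneflow as flow
import libai.utils.distributed as dist
from libai.layers import Linear
from omegaconf import DictConfig

# configure a pure tensor-parallel setup on 2 GPUs (assumed setup call)
dist.setup_dist_util(
    DictConfig(dict(data_parallel_size=1, tensor_parallel_size=2, pipeline_parallel_size=1))
)

# column-parallel fully-connected layer: the huge output dimension is split across the 2 GPUs
fc = Linear(1024, 100000, parallel="col")

x = flow.rand(16, 1024, sbp=flow.sbp.broadcast, placement=dist.get_layer_placement(0))
y = fc(x)
print(y.sbp)  # y is split along axis=1 across the 2 GPUs
```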
In the result, you can find that `y` has been split along `axis=1` across the 2 GPUs.
### Large MLP models
Suppose you have a huge MLP model which is very popular in transformer-based models, with a huge hidden size that makes it difficult to fit into a single GPU.
You can then split the model weights across GPUs in a hybrid parallel mode while you can still write your model in a familiar way.
Here is a simple example of a 2D-parallel MLP in the LiBai context.
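Again, only a sketch (4 GPUs: 2-way data parallel × 2-way tensor parallel; the setup call and sizes are illustrative):
```python
import oneflow as flow
import libai.utils.distributed as dist
from libai.layers import Linear
from omegaconf import DictConfig

# 2 x 2 hybrid parallelism on 4 GPUs (assumed setup call)
dist.setup_dist_util(
    DictConfig(dict(data_parallel_size=2, tensor_parallel_size=2, pipeline_parallel_size=1))
)

class ParallelMLP(flow.nn.Module):
    def __init__(self, hidden_size=1024, ffn_size=4096):
        super().__init__()
        self.fc1 = Linear(hidden_size, ffn_size, parallel="col")   # weight split across the tensor-parallel group
        self.fc2 = Linear(ffn_size, hidden_size, parallel="row")
        self.activation = flow.nn.GELU()

    def forward(self, x):
        return self.fc2(self.activation(self.fc1(x)))

mlp = ParallelMLP()
# the batch is split across the data-parallel group, the weights across the tensor-parallel group
x = flow.rand(
    32, 1024,
    sbp=dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.broadcast]),
    placement=dist.get_layer_placement(0),
)
y = mlp(x)
```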
From the above, you can see that the data is split into 2 groups for data parallelism, and the weights are split into 2 groups for tensor model parallelism, so this simple example implements 2D parallelism.
For your convenience, we provide some prevalent models such as BERT, GPT-2, and ViT in the Model Zoo. Feel free to customize them into different sizes to fit your special needs.
## Write your own Pipeline Parallel Model
This tutorial describes how to use pipeline parallel in your own model. LiBai has two pipeline-parallel modes: naive pipeline parallel and (similar) 1F1B pipeline parallel introduced by [Megatron-LM](https://arxiv.org/abs/1909.08053).
### Introduction of Naive Pipeline Parallel
In LiBai, naive pipeline parallel can be implemented by setting the `placement` of layers and parameters.
You can easily configure their `placement` with `dist.get_layer_placement(idx)`.
After configuring the model's placement, add the input placement transition across different stages. LiBai sets a `layer_idx` attribute in each `nn.Module`, so you can simply add `to_global` in `forward` to implement the input placement transition.
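A toy sketch of this pattern (the module and sizes are made up for illustration):
```python
import oneflow as flow
import libai.utils.distributed as dist
from libai.layers import Linear

class ToyPipelineModel(flow.nn.Module):
    def __init__(self):
        super().__init__()
        # layer_idx decides which pipeline stage (and therefore which placement) each layer uses
        self.linear1 = Linear(1024, 4096, layer_idx=0)
        self.linear2 = Linear(4096, 1024, layer_idx=1)

    def forward(self, x):
        x = self.linear1(x)
        # move the activation to the placement of the next stage before using it there
        x = x.to_global(placement=dist.get_layer_placement(1))
        return self.linear2(x)
```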
After configuring models and data placement, you only need to set the distributed configuration before training.
```python
# set pipeline stages to 2
train.dist.pipeline_parallel_size = 2
# set model layers for pipeline
train.dist.pipeline_num_layers = hidden_layers
```
### Introduction of 1F1B Pipeline Parallel
First, we will introduce GPipe to help you get a better understanding of pipeline parallelism. In GPipe, the backward passes are executed only after the forward passes of all microbatches have finished (as shown below).

1F1B performs one forward pass followed by one backward pass. Finally, at the end of a batch, complete backward passes for all remaining in-flight microbatches. In general, 1F1B is more efficient than GPipe.
There are two schedules of 1F1B pipeline: the non-interleaved and the interleaved. The figures are shown below.

In LiBai, the non-interleaved schedule is supported currently, and this mode is more memory-efficient than GPipe.
Besides the placement configuration used in naive pipeline parallel, you only need to set the models' stage ids; the stage id helps create stashed buffers for activations.
This example shows how to configure the bert model's stage ids:
```python
class BertForPreTraining(nn.Module):
    def __init__(self, ...):
        ...

    def forward(self, ...):
        ...

    @staticmethod
    def set_pipeline_stage_id(model):
        dist_utils = dist.get_dist_util()
        # Set pipeline parallelism stage_id
        for module_block in model.modules():
            # module_block.to(nn.Module) can get the original module
            ...
```
In `set_pipeline_stage_id`, `BertEmbeddings` and `BertExtendedAttnMask` are placed on the first stage, then each `TransformerLayer` is uniformly placed across the stages. At last, `BertPooler` and `BertPreTrainingHeads` are placed on the last stage. Don't forget to also place the last `layernorm` in `BertEncoder`, which does not belong to any `TransformerLayer`, on the last stage.
After adding the `set_pipeline_stage_id` function to a pre-defined `nn.Module`, `GraphBase` will invoke it automatically.
LiBai supports **auto-parallel training**, which means LiBai will automatically find **an efficient parallel training strategy** for a specific model during training. Users can try out auto-parallel training with only a few configuration changes.
This is a basic guide to building new projects based on LiBai. The advantages of using LiBai to start a new project (such as paper reproduction or a finetune task) are as follows:
- Avoid redundant work. Developers can directly inherit many built-in modules from LiBai.
- Easily reproduce the experiments already run, because LiBai will save the configuration file automatically.
- Automatically output useful information during training time, such as remaining training time, current iter, throughput, loss information and current learning rate, etc.
- Set a few config params to enjoy distributed training techniques.
## Introduction
Take the [Bert Finetune](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP) task as an example to introduce LiBai.
The complete file structure of the project is:
```
projects/my_project
├── configs
│ └── config_custom.py
│ └── ...
├── dataset
│ ├── custom_dataset.py
│ └── ...
├── modeling
│ ├── custom_model.py
│ └── ...
├── README.md
```
To start a new project based on LiBai step by step:
Step 1. Prepare an independent config file (such as [config.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/configs/config_qqp.py)) which contains:
- The relevant parameters of the task.
- The pre-defined related Class, such as `Model`, `Optimizer`, `Scheduler`, `Dataset`.
- You can inherit the default config in `configs/common` and rewrite it, which can greatly reduce the workload.
- Related classes are defined with LazyCall, which returns a dict instead of calling the object.
Step 2. Prepare a model file (such as [model.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/modeling/model.py)) :
- Build related models in this file. The construction method is similar to OneFlow.
- Because LiBai sets up a static graph by default, the calculation of the loss needs to be inside the model.
- The function `forward` must return a dict (see the model sketch after this list).
- When defining a tensor in the model, you need to use `to_global` to turn the tensor into a global tensor.
- When defining layers, you can import them directly from `libai.layers`, because it has already pre-defined the SBP signature.
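A toy sketch of such a model file (the layer sizes and loss are illustrative only):
```python
import oneflow as flow
from libai.layers import Linear

class ToyClassifier(flow.nn.Module):
    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.head = Linear(hidden_size, num_classes)
        self.loss_func = flow.nn.CrossEntropyLoss()

    def forward(self, features, labels=None):
        logits = self.head(features)
        if labels is not None:
            # the loss is computed inside the model and returned as a dict
            return {"loss": self.loss_func(logits, labels)}
        return {"prediction_scores": logits}
```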
Step 3. Prepare a dataset file (such as [dataset.py](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP/dataset)) :
- Build `Dataset` in this file. The construction method is similar to OneFlow.
- The difference is that you need to use `DistTensorData` and `Instance` (see the dataset sketch after this list).
- The shape of each batch must be global.
- In `__getitem__` function, the `key` returned by the method must be consistent with the parameter name of the `forward` function in the `model`.
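A minimal sketch of such a dataset (the import path for `DistTensorData` and `Instance` is assumed, and the raw samples are illustrative):
```python
import oneflow as flow
from libai.data.structures import DistTensorData, Instance

class ToyDataset(flow.utils.data.Dataset):
    def __init__(self, samples):
        # samples: list of (features, labels) pairs, e.g. numpy arrays
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        features, labels = self.samples[idx]
        # keys must match the parameter names of model.forward(...)
        return Instance(
            features=DistTensorData(flow.tensor(features, dtype=flow.float32)),
            labels=DistTensorData(flow.tensor(labels, dtype=flow.int64)),
        )
```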
## Main Function Entry
[tools/train_net.py](https://github.com/Oneflow-Inc/libai/blob/main/tools/train_net.py) is the default main function entry provided in LiBai.
## Build Config
The `config.py` in LiBai is special: it takes the form of a LazyConfig and will be saved as `.yaml` at runtime. The config has several necessary fields, such as `train`, `model`, `optim`, `lr_scheduler`, `graph`. For more information, please refer to [Config_System.md](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html).
> All imported modules must take LiBai as the root directory. Otherwise, the saved `yaml` file cannot save the correct path of the module, resulting in an error when reading `yaml`, and the experiment cannot be reproduced.
After building the `config.py`, if you want to get the corresponding fields in the project, you need to access them like `cfg.my_cfg.***`.
## Start Training
The `train.sh` file contains some parameters, such as `GPUS`, `NODE`, etc.
Given that the traditional yacs-based config system and python argparse command-line options fail to provide enough flexibility for the development of new projects, we borrowed the [lazy config system](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) design from detectron2, which forms the non-intrusive config system for LiBai.
You can refer to the [d2 tutorial](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) for more details about the syntax and basic usage of lazy config. This section shows some examples of usage in LiBai.
## Configs in LiBai
LiBai defines a standard set of config namespaces for later use. This set of namespaces must be kept if you want to perform the complete training and evaluation process of LiBai.
In summary, this set of namespaces is `model, graph, train, optim, dataloader, tokenization(optional)`. The details are as follows.
### model
This is the configuration for model definition. You can refer to `configs/common/models` for more examples.
A model config file can be loaded like this:
```python
# bert.py:
from libai.config import LazyCall
from libai.models import BertModel

# define a model with lazycall
bert_model = LazyCall(BertModel)(
    vocab_size=30522,
    hidden_size=768,
    hidden_layers=24,
    num_attention_heads=12,
    intermediate_size=4096,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    num_tokentypes=2,
    add_pooling_layer=True,
    initializer_range=0.02,
    layernorm_eps=1e-5,
    bias_gelu_fusion=True,
    bias_dropout_fusion=True,
    scale_mask_softmax_fusion=False,
    apply_query_key_layer_scaling=True,
    add_binary_head=True,
    amp_enabled=False,
)

# my_config.py:
from bert import bert_model as model

assert model.hidden_size == 768
model.hidden_layers = 12  # change hidden layers
```
After defining the model config in a python file, you can `import` it in the global scope of the config file. Note that you need to rename it as `model` regardless of the name used in the model config.
You can access and change all keys in the model config after import.
### graph
This is the configuration for static `nn.Graph` mode. For more information about the static graph mode, refer to the official [nn.Graph docs](https://docs.oneflow.org/master/basics/08_nn_graph.html).
LiBai has already defined a `GraphBase` class for almost all models to use. You can simply turn on this option to convert eager mode to graph mode.
The graph config can be found in [graph.py](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/models/graph.py), and two useful options are shown as follows:
```python
# Turn on graph mode; if set to `False`, eager mode will be used.
graph.enabled = True
# Set graph debug level: -1 means no debug info, and 0, 1, 2, 3 can be
# set for different debug levels.
# More information can be found in the nn.Graph documents.
graph.debug = -1
```
### train
This is the configuration for training and evaluation. The default training config can be found in `configs/common/train.py`.
The convention of training / test specific parameters is as follows:
```python
from libai.config import LazyCall
from libai.evaluation import ClsEvaluator
from libai.scheduler import WarmupCosineLR

train = dict(
# Directory where output files are written
output_dir="./output",
# `train_micro_batch_size` is number of samples per batch on each GPU.
# The total training epochs, will be scaled to training iterations automatically.
# The actual total training iterations will be calculated by the
# formula `max(train_iter, train_epoch * iter_per_epoch)`.
train_epoch=0,
consumed_train_samples=0,
consumed_valid_samples=0,
train_samples=None,
# Fraction of lr-warmup-iters to use for warmup (as a float)
warmup_ratio=0,
# The start iteration, usually needn't set it manually.
# It can be computed automatically when resuming training.
start_iter=0,
# Enable automatic mixed precision for training which does not
# change model's inference behavior.
amp=dict(enabled=False),
# Enable activation checkpointing to allow for training
# with larger models, sequences, and batch sizes.
# If enabled, checkpoint the input activations of each transformer layers by default.
activation_checkpoint=dict(enabled=False),
# NCCL fusion threshold megabytes, set to 0 to
# compatible with previous version of OneFlow.
nccl_fusion_threshold_mb=16,
# Maximum number of ops of NCCL fusion, set to 0 to
# compatible with previous version of OneFlow.
nccl_fusion_max_ops=24,
# Enable ZeRO Optimization to allow for training with larger models.
# This optimization will reduce optimizer stages memory consumption
# as described in ZeRO https://arxiv.org/abs/1910.02054.
zero_optimization=dict(
enabled=False,
stage=1,
),
# Save a model checkpoint after every this number of iterations,
# and maximum number of checkpoint will be kept.
checkpointer=dict(period=5000,max_to_keep=100),
# Options for evaluation
# `test_micro_batch_size` is number of samples per batch on each GPU for testing.
# If we use 8 GPUs for data parallel groups and `test_micro_batch_size = 2`, then
# total 16 samples will be used per iteration across all GPUs.
test_micro_batch_size=32,
# Enabled evaluation during training, after every `eval_period` number of iterations
# will perform the evaluation process.
# You can set the maximum evaluation iterations to run for validation/test.
# You can also set a customized evaluator for use.
evaluation=dict(
enabled=True,
# evaluator for calculating top-k acc
evaluator=LazyCall(ClsEvaluator)(topk=(1,5)),
eval_period=5000,
eval_iter=1e9,# running steps for validation/test
# Metrics to be used for best model checkpoint.
eval_metric="Acc@1",
eval_mode="max",
),
# Path to a checkpoint file to be loaded to the model for training or evaluation.
load_weight="",
# Output log to console after every this number of iterations.
log_period=20,
# lr_scheduler arguments
# See libai/scheduler/lr_scheduler.py for definition.
scheduler=LazyCall(WarmupCosineLR)(
# In DefaultTrainer we will automatically set `max_iter`
# and `warmup_iter` by the given train cfg.
warmup_factor=0.001,
alpha=0.01,
warmup_method="linear",
),
# Distributed arguments
# See https://libai.readthedocs.io/en/latest/tutorials/basics/Distributed_Configuration.html for more details.
dist=dict(
data_parallel_size=1,
tensor_parallel_size=1,
pipeline_parallel_size=1,
# users must set the `pipeline_num_layers` attribute when `pipeline_parallel_size > 1`
pipeline_num_layers=None,
# users could customize the number of layers in different stages
# by setting the `custom_pipeline_stage_id` attribute which is used for
# manually balance calculation between stages when running pipeline parallelism
# e.g. you can set `custom_pipeline_stage_id=[0, 0, 0, 1]`
# for `pipeline_num_layers=4 and pipeline_parallel_size=2`
# which means the first 3 layers will be placed on stage0 and
# the last layer will be placed on stage1
# NOTE: if it is None, LiBai will automatically set pipeline_stage_id
# `auto_pipeline_stage_id` and `actual_pipeline_stage_id` will be saved in `config.yaml`
custom_pipeline_stage_id=None,
),
# the device type of input tensors for model, defaults to "cuda".
# if you want to accelerate the model training when pipeline_parallel > 1
# you can set `input_placement_device="cpu"` then call input_tensor.to_global()
# inside your model.forward() method
# see `libai/models/bert_model.py` as reference
input_placement_device="cuda",
# set to `True` to enable rdma for improving speed of pipeline_parallel
rdma_enabled=True,
# Set seed to positive to use a fixed seed. Note that a fixed seed increases
# reproducibility but does not guarantee fully deterministic behavior.
# Disabling all parallelism further increases reproducibility.
seed=1234,
)
```
**Note:** `warmup_ratio` is the ratio of warmup iterations to the total training iterations, and the real warmup iterations will be calculated by `warmup_ratio * train_iter` automatically.
**Example:** If you need to train 300 epochs with 5 warmup epochs, update the config as follows:
```python
# config.py
train.train_epoch = 300
train.warmup_ratio = 5 / 300
```
If you need to train 1000 iters with 200 warmup iters, set the training config like this:
```python
# config.py
train.train_iter = 1000
train.warmup_ratio = 200 / 1000
```
### optim
This is the configuration for optimizer. The default configuration can be found in `configs/common/optim.py`.
LiBai utilizes the function `get_default_optimizer_params`, which needs the `nn.Module` as the argument and returns the parameter groups.
With `LazyConfig`, you can set other arguments in advance and pass the `model` argument later. For more details, refer to [API docs of libai optim](../libai.optim.html#libai.optim.get_default_optimizer_params).
```python
# optim.py:
import oneflow as flow
from libai.config import LazyCall
from libai.optim import get_default_optimizer_params

optim = LazyCall(flow.optim.AdamW)(
    params=LazyCall(get_default_optimizer_params)(
        # params.model is meant to be set to the model object,
        # before instantiating the optimizer.
        clip_grad_max_norm=1.0,
        clip_grad_norm_type=2.0,
        weight_decay_norm=0.0,
        weight_decay_bias=0.0,
    ),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
    eps=1e-8,
    do_bias_correction=True,
)

# my_config.py:
import oneflow as flow

optim._target_ = flow.optim.SGD
# Remove the incompatible arguments in optim
del optim.do_bias_correction
# Set the needed arguments
optim.momentum = 0.9
```
### dataloader
This is the configuration for dataset/dataloader. This component provides data to the model. A dataloader usually takes raw information and processes it into the format required by the model.
See example datasets in `configs/common/data/`, including `cifar100`, `imagenet`, `bert_dataset` and so on. You can also define your customized dataset config as you like.
LiBai provides two functions, `build_nlp_train_val_test_loader` and `build_image_train_loader`, to create a default train data loader from a given config. They take a list of `dataset_class` (e.g., `BertDataset`) and combine them using `flow.utils.data.dataset.ConcatDataset`.
It is recommended to check out [API docs of libai.data](../libai.data.html#libai.data.build.build_nlp_train_loader) to learn more about the APIs of `build_nlp_train_val_test_loader`.
### tokenization (optional)
You need to configure a tokenizer if you want to train an NLP task. Each NLP dataset has its own tokenizer config in the corresponding data config file.
Here we use:
```python
# bert_dataset.py:
from libai.config import LazyCall
from omegaconf import OmegaConf
from libai.tokenizer import BertTokenizer

tokenization = OmegaConf.create()
tokenization.tokenizer = LazyCall(BertTokenizer)(
    vocab_file="bert-base-chinese-vocab.txt",
    do_lower_case=True,
    do_chinese_wwm=True,
)
tokenization.append_eod = False
tokenization.make_vocab_size_divisible_by = 128

# my_config.py:
tokenization.tokenizer.do_lower_case = False
```
The tokenization config must contain a tokenizer (e.g., `BertTokenizer`). `append_eod` and `make_vocab_size_divisible_by` are not necessary.
`make_vocab_size_divisible_by` is used for padding the vocab size to be divisible by this value. This is added for computational efficiency for tensor parallelism.
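For example, with `make_vocab_size_divisible_by=128`, a raw vocab of 21,128 tokens (the size of the `bert-base-chinese` vocab) would be padded up to 21,248, the next multiple of 128.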
## Get the Default Config
You don't need to rewrite all the contents in the config every time. You can import a config file as a python file or use the function [`get_config`](../libai.config.html#libai.config.get_config).
If you build LiBai from source, you can get all default config files in `configs/*`. Then you can import the config files as follows: