This is a basic guide to building new projects based on LiBai. The advantages of using LiBai to start a new project (such as paper reproduction or a fine-tuning task) are as follows:
- Avoid redundant work. Developers can directly inherit many built-in modules from LiBai.
- Easily reproduce the experiments already run, because LiBai will save the configuration file automatically.
- Automatically output useful information during training, such as the remaining training time, current iteration, throughput, loss values, and current learning rate.
- Set just a few config parameters to enable distributed training techniques.
## Introduction
Take the [Bert Finetune](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP) task as an example to introduce LiBai.
The complete file structure of the project is:
```
projects/my_project
├── configs
│   ├── config_custom.py
│   └── ...
├── dataset
│   ├── custom_dataset.py
│   └── ...
├── modeling
│   ├── custom_model.py
│   └── ...
└── README.md
```
To start a new project based on LiBai step by step:
Step 1. Prepare an independent config file (such as [config.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/configs/config_qqp.py)) which contains:
- The relevant parameters of the task.
- The pre-defined related classes, such as `Model`, `Optimizer`, `Scheduler`, and `Dataset`.
- You can inherit the default config in `configs/common` and rewrite it, which can greatly reduce the workload.
- Related classes defined with `LazyCall`, which returns a dict-like config instead of instantiating the object.
Step 2. Prepare a model file (such as [model.py](https://github.com/Oneflow-Inc/libai/blob/main/projects/QQP/modeling/model.py)) :
- Build related models in this file. The construction method is similar to OneFlow.
- Because LiBai uses static graph mode (`nn.Graph`) by default, the loss computation needs to be inside the model.
- The function `forward` must return a dict.
- When defining a tensor in the model, you need to call `to_global` to convert it into a global tensor.
- When defining layers, you can import them directly from `libai.layers`, because they already have pre-defined SBP signatures.
Step 3. Prepare a dataset file (such as [dataset.py](https://github.com/Oneflow-Inc/libai/tree/main/projects/QQP/dataset)) :
- Build `Dataset` in this file. The construction method is similar to OneFlow.
- The difference is that you need to use `DistTensorData` and `Instance`.
- The shape of each batch must be global.
- In the `__getitem__` function, the keys returned must be consistent with the parameter names of the model's `forward` function.
## Main Function Entry
[tools/train_net.py](https://github.com/Oneflow-Inc/libai/blob/main/tools/train_net.py) is the default main function entry provided in LiBai.
## Build Config
The `config.py` in LiBai is special: it takes the form of a LazyConfig and will be saved as a `.yaml` file at runtime. The config has several necessary fields, such as `train`, `model`, `optim`, `lr_scheduler`, and `graph`. For more information, please refer to [Config_System.md](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html).
> All imported modules must take LiBai as the root directory. Otherwise, the saved `yaml` file will not record the correct module paths, which causes an error when the `yaml` is read and makes the experiment irreproducible.
After building the `config.py`, if you want to access the corresponding fields in the project, access them as `cfg.my_cfg.***`.
## Start Training
The `train.sh` file contains some parameters, such as `GPUS`, `NODE`, etc.
Because the traditional yacs-based config system and python argparse command-line options do not provide enough flexibility for developing new projects, we borrowed the [lazy config system](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) design from detectron2, which forms the non-intrusive config system of LiBai.
You can refer to the [d2 tutorial](https://detectron2.readthedocs.io/en/latest/tutorials/lazyconfigs.html) for more details about the syntax and basic usage of lazy config. This section shows some examples of usage in LiBai.
## Configs in LiBai
LiBai defines a standard set of config namespaces for later use. This set of namespaces must be kept if you want to perform the complete training and evaluation process of LiBai.
In summary, this set of namespaces is `model, graph, train, optim, dataloader, tokenization(optional)`. The details are as follows.
### model
This is the configuration for model definition. You can refer to `configs/common/models` for more examples.
A model config file can be loaded like this:
```python
# bert.py:
from libai.config import LazyCall
from libai.models import BertModel

# define a model with LazyCall
bert_model = LazyCall(BertModel)(
    vocab_size=30522,
    hidden_size=768,
    hidden_layers=24,
    num_attention_heads=12,
    intermediate_size=4096,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    num_tokentypes=2,
    add_pooling_layer=True,
    initializer_range=0.02,
    layernorm_eps=1e-5,
    bias_gelu_fusion=True,
    bias_dropout_fusion=True,
    scale_mask_softmax_fusion=False,
    apply_query_key_layer_scaling=True,
    add_binary_head=True,
    amp_enabled=False,
)

# my_config.py:
from bert import bert_model as model

assert model.hidden_size == 768
model.hidden_layers = 12  # change hidden layers
```
After defining the model config in a python file, you can `import` it in the global scope of the config file. Note that you need to rename it as `model` regardless of the name used in the model config.
You can access and change all keys in the model config after import.
### graph
This is the configuration for static `nn.Graph` mode. For more information about the static graph mode, refer to the official [nn.Graph docs](https://docs.oneflow.org/master/basics/08_nn_graph.html).
LiBai has already defined a `GraphBase` class for almost all models to use. You can simply turn on this option to convert eager mode to graph mode.
The graph config can be found in [graph.py](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/models/graph.py), and two useful options are shown as follows:
```python
# Turn on graph mode; if set to `False`, eager mode will be used.
graph.enabled = True

# Set the graph debug level: -1 means no debug info, and 0, 1, 2, 3 can be
# set for different debug levels.
# More information can be found in the nn.Graph documents.
graph.debug = -1
```
### train
This is the configuration for training and evaluation. The default training config can be found in `configs/common/train.py`.
The convention of training / test specific parameters is as follows:
```python
from libai.config import LazyCall

train = dict(
    # Directory where output files are written
    output_dir="./output",

    # `train_micro_batch_size` is the number of samples per batch on each GPU.
    train_micro_batch_size=32,

    # The total training epochs, will be scaled to training iterations automatically.
    # The actual total training iterations will be calculated by the
    # formula `max(train_iter, train_epoch * iter_per_epoch)`.
    train_epoch=0,
    consumed_train_samples=0,
    consumed_valid_samples=0,
    train_samples=None,

    # Fraction of lr-warmup-iters to use for warmup (as a float)
    warmup_ratio=0,

    # The start iteration, usually you needn't set it manually.
    # It can be computed automatically when resuming training.
    start_iter=0,

    # Enable automatic mixed precision for training, which does not
    # change the model's inference behavior.
    amp=dict(enabled=False),

    # Enable activation checkpointing to allow for training
    # with larger models, sequences, and batch sizes.
    # If enabled, checkpoint the input activations of each transformer layer by default.
    activation_checkpoint=dict(enabled=False),

    # NCCL fusion threshold in megabytes, set to 0 to be
    # compatible with previous versions of OneFlow.
    nccl_fusion_threshold_mb=16,

    # Maximum number of ops for NCCL fusion, set to 0 to be
    # compatible with previous versions of OneFlow.
    nccl_fusion_max_ops=24,

    # Enable ZeRO optimization to allow for training with larger models.
    # This optimization reduces the memory consumption of optimizer stages,
    # as described in ZeRO: https://arxiv.org/abs/1910.02054.
    zero_optimization=dict(
        enabled=False,
        stage=1,
    ),

    # Save a model checkpoint after every `period` iterations,
    # and keep at most `max_to_keep` checkpoints.
    checkpointer=dict(period=5000, max_to_keep=100),

    # Options for evaluation.
    # `test_micro_batch_size` is the number of samples per batch on each GPU for testing.
    # If we use 8 GPUs for data parallel groups and `test_micro_batch_size = 2`, then
    # 16 samples in total will be used per iteration across all GPUs.
    test_micro_batch_size=32,

    # Enable evaluation during training; the evaluation process will run
    # after every `eval_period` iterations.
    # You can set the maximum evaluation iterations to run for validation/test.
    # You can also set a customized evaluator for use.
    evaluation=dict(
        enabled=True,
        # evaluator for calculating top-k acc
        evaluator=LazyCall(ClsEvaluator)(topk=(1, 5)),
        eval_period=5000,
        eval_iter=1e9,  # running steps for validation/test
        # Metric to be used for selecting the best model checkpoint.
        eval_metric="Acc@1",
        eval_mode="max",
    ),

    # Path to a checkpoint file to be loaded into the model for training or evaluation.
    load_weight="",

    # Output a log to the console after every `log_period` iterations.
    log_period=20,

    # lr_scheduler arguments
    # See libai/scheduler/lr_scheduler.py for definition.
    scheduler=LazyCall(WarmupCosineLR)(
        # In DefaultTrainer we will automatically set `max_iter`
        # and `warmup_iter` by the given train cfg.
        warmup_factor=0.001,
        alpha=0.01,
        warmup_method="linear",
    ),

    # Distributed arguments
    # See https://libai.readthedocs.io/en/latest/tutorials/basics/Distributed_Configuration.html for more details.
    dist=dict(
        data_parallel_size=1,
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        # users must set the `pipeline_num_layers` attribute when `pipeline_parallel_size > 1`
        pipeline_num_layers=None,
        # users can customize the number of layers in different stages
        # by setting the `custom_pipeline_stage_id` attribute, which is used to
        # manually balance the computation between stages when running pipeline parallelism.
        # e.g. you can set `custom_pipeline_stage_id=[0, 0, 0, 1]`
        # for `pipeline_num_layers=4` and `pipeline_parallel_size=2`,
        # which means the first 3 layers will be placed on stage 0 and
        # the last layer will be placed on stage 1.
        # NOTE: if it is None, LiBai will automatically set the pipeline stage ids;
        # `auto_pipeline_stage_id` and `actual_pipeline_stage_id` will be saved in `config.yaml`
        custom_pipeline_stage_id=None,
    ),

    # The device type of input tensors for the model, defaults to "cuda".
    # If you want to accelerate model training when pipeline_parallel > 1,
    # you can set `input_placement_device="cpu"` and then call input_tensor.to_global()
    # inside your model.forward() method.
    # See `libai/models/bert_model.py` as a reference.
    input_placement_device="cuda",

    # Set to `True` to enable RDMA for improving the speed of pipeline parallelism.
    rdma_enabled=True,

    # Set seed to positive to use a fixed seed. Note that a fixed seed increases
    # reproducibility but does not guarantee fully deterministic behavior.
    # Disabling all parallelism further increases reproducibility.
    seed=1234,
)
```
**Note:** ``warmup_ratio`` is the ratio of warmup iterations to the total training iterations, and the real ``warmup iterations`` will be calculated by ``warmup_ratio * train_iter`` automatically.
**Example:** If you need to train 300 epochs with 5 warmup epochs, update the config as follows:
```python
# config.py
train.train_epoch=300
train.warmup_ratio=5/300
```
If you need to train 1000 iters with 200 warmup iters, set the training config like this:
```python
# config.py
train.train_iter=1000
train.warmup_ratio=200/1000
```
### optim
This is the configuration for optimizer. The default configuration can be found in `configs/common/optim.py`.
LiBai utilizes the function `get_default_optimizer_params`, which needs the `nn.Module` as the argument and returns the parameter groups.
With `LazyConfig`, you can set other arguments in advance and pass the `model` argument later. For more details, refer to [API docs of libai optim](../libai.optim.html#libai.optim.get_default_optimizer_params).
```python
# optim.py:
import oneflow as flow
from libai.config import LazyCall
from libai.optim import get_default_optimizer_params

optim = LazyCall(flow.optim.AdamW)(
    params=LazyCall(get_default_optimizer_params)(
        # params.model is meant to be set to the model object,
        # before instantiating the optimizer.
        clip_grad_max_norm=1.0,
        clip_grad_norm_type=2.0,
        weight_decay_norm=0.0,
        weight_decay_bias=0.0,
    ),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
    eps=1e-8,
    do_bias_correction=True,
)

# my_config.py:
import oneflow as flow

optim._target_ = flow.optim.SGD
# Remove the incompatible arguments in optim
del optim.do_bias_correction
# Set the needed arguments
optim.momentum = 0.9
```
### dataloader
This is the configuration for dataset/dataloader. This component provides data to the model. A dataloader usually takes raw information and processes it into the format required by the model.
See example datasets in `configs/common/data/`, including `cifar100`, `imagenet`, `bert_dataset` and so on. You can also define your customized dataset config as you like.
LiBai provides two functions, `build_nlp_train_val_test_loader` and `build_image_train_loader`, to create a default train data loader from a given config. They take a list of `dataset_class` entries (e.g., `BertDataset`) and combine them using `flow.utils.data.dataset.ConcatDataset`.
It is recommended to check out [API docs of libai.data](../libai.data.html#libai.data.build.build_nlp_train_loader) to learn more about the APIs of `build_nlp_train_val_test_loader`.
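For instance, a training dataloader config for a vision task might look like the following sketch. `MyDataset` and its import path are placeholders for your own dataset class, and the keyword arguments are illustrative and should be checked against the examples in `configs/common/data/`:
```python
# my_data_config.py (a sketch; `MyDataset` and its arguments are placeholders)
from omegaconf import OmegaConf

from libai.config import LazyCall
from libai.data.build import build_image_train_loader
from projects.my_project.dataset.custom_dataset import MyDataset  # hypothetical path

dataloader = OmegaConf.create()
dataloader.train = LazyCall(build_image_train_loader)(
    dataset=[
        # several datasets can be listed here; they will be concatenated
        LazyCall(MyDataset)(root="./data", train=True),
    ],
    num_workers=4,
)
```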
### tokenization (optional)
You need to configure a tokenizer if you want to train a NLP task. Each NLP dataset has its own tokenizer config in the corresponding data config file.
Here we use:
```python
# bert_dataset.py:
from libai.config import LazyCall
from omegaconf import OmegaConf
from libai.tokenizer import BertTokenizer

tokenization = OmegaConf.create()
tokenization.tokenizer = LazyCall(BertTokenizer)(
    vocab_file="bert-base-chinese-vocab.txt",
    do_lower_case=True,
    do_chinese_wwm=True,
)
tokenization.append_eod = False
tokenization.make_vocab_size_divisible_by = 128

# my_config.py:
tokenization.tokenizer.do_lower_case = False
```
The tokenization config must contain a tokenizer (e.g., `BertTokenizer`). `append_eod` and `make_vocab_size_divisible_by` are optional.
`make_vocab_size_divisible_by` is used for padding the vocab size to be divisible by this value. This is added for computational efficiency for tensor parallelism.
## Get the Default Config
You don't need to rewrite all contents in config every time. You can import a config file as a python file or use function [`get_config`](../libai.config.html#libai.config.get_config).
If you build LiBai from source, you can get all default config files in `configs/*`. Then you can import the config files as follows:
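A minimal sketch (the relative paths for `train` and `optim` are the ones used later in this document; the graph path is inferred from `configs/common/models/graph.py`):
```python
# my_config.py
from libai.config import get_config

train = get_config("common/train.py").train
optim = get_config("common/optim.py").optim
graph = get_config("common/models/graph.py").graph
```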
In LiBai, you can try out different parallel strategies by simply changing the distributed config in [training config file](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/train.py).
```python
# Distributed arguments
dist=dict(
    data_parallel_size=1,
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    # users must set the `pipeline_num_layers` attribute when `pipeline_parallel_size > 1`
    pipeline_num_layers=None,
    # users can customize the number of layers in different stages
    # by setting the `custom_pipeline_stage_id` attribute, which is used to
    # manually balance the computation between stages when running pipeline parallelism.
    # e.g. you can set `custom_pipeline_stage_id=[0, 0, 0, 1]`
    # for `pipeline_num_layers=4` and `pipeline_parallel_size=2`,
    # which means the first 3 layers will be placed on stage 0 and
    # the last layer will be placed on stage 1.
    # NOTE: if it is None, LiBai will automatically set the pipeline stage ids;
    # `auto_pipeline_stage_id` and `actual_pipeline_stage_id` will be saved in `config.yaml`
    custom_pipeline_stage_id=None,
)
```
For example, you can set `data_parallel_size=2` which will automatically split the input data into two groups for data parallel training.
## Distributed Setting Example
Here are some simple examples for you to understand the basic configuration of LiBai's distributed settings. LiBai's **BERT** model supports three parallelism techniques (**data parallel training**, **tensor parallel training** and **pipeline parallel training**). Take 1 node with 8 GPUs as an example. If you do not change any default settings, LiBai will execute **data parallel training** by default. You can try out different combinations of parallelism training techniques by updating [bert config file](https://github.com/Oneflow-Inc/libai/blob/main/configs/bert_large_pretrain.py) as follows:
#### **Pure Data Parallel Training on 8 GPUs**
In this example, the model is replicated on 8 GPUs, and each replica handles only part of the input data during iteration.
```python
from .common.train import train
...
train.dist.data_parallel_size = 8
```
#### **Pure Tensor Parallel Training on 8 GPUs**
In this example, the weight of the layers in the model will be split into 8 parts for tensor parallel training on 8 GPUs.
```python
from .common.train import train
...
train.dist.tensor_parallel_size = 8
```
**Note:** This only works for models built with ``libai.layers``.
#### **Pure Pipeline Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into 8 stages, and different layers of the model will be put on different stages automatically for pipeline parallel training.
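Following the pattern of the examples above, the config might look like this (the value of `pipeline_num_layers` is illustrative and must match your model's depth):
```python
from .common.train import train
...
train.dist.pipeline_parallel_size = 8
# must match the number of layers in your model, e.g. 24 for BERT-large
train.dist.pipeline_num_layers = 24
```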
- `train.dist.pipeline_num_layers` must be set to match the number of layers in your model. If it is left unset, the default value `1000` will be used, which might trigger unexpected behavior.
- For models that have already been configured with pipeline parallelism (e.g., BERT, GPT-2, T5 and ViT), you can simply update the distributed config to execute pipeline parallel training on them. If you need to train your own model with a pipeline parallel strategy, please refer to [Write Models](https://libai.readthedocs.io/en/latest/tutorials/basics/Write_Models.html) for more details about configuring your own model with pipeline parallelism.
#### **Data Parallel + Tensor Parallel for 2D Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into **2 groups**, and each group contains **4 GPUs**. The input data will be split into 2 parts by chunking along the batch dimension for data parallel training between the 2 groups. The model is replicated between the **2 data-parallel groups**. Within each group, the weights of each layer will be split across **4 GPUs** for tensor parallel training.
```python
from .common.train import train
...
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 4
```
Here is a specific example to make this concrete. We number the 8 GPUs from 0 to 7, i.e., ``[0, 1, 2, 3, 4, 5, 6, 7]``. For ``data parallel + tensor parallel``, the 8 GPUs will be split into 2 groups as ``[[0, 1, 2, 3], [4, 5, 6, 7]]``, with ``GPU: [0, 1, 2, 3]`` as group-0 and ``GPU: [4, 5, 6, 7]`` as group-1. The model is replicated between group-0 and group-1. In group-0, the model executes tensor parallelism across ``GPU: [0, 1, 2, 3]``; in group-1, across ``GPU: [4, 5, 6, 7]``. Each group only handles a portion of the input data for data parallel training.
#### **Data Parallel + Pipeline Parallel for 2D Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into **4 stages**. Each stage contains **2 GPUs** which will be split into **2 data-parallel groups**. Each stage only contains a portion of the model. The weight of the layers put on the specific stage is replicated on **2 data-parallel groups**. Each group handles a portion of the input data.
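A sketch of the corresponding config (the `pipeline_num_layers` value is illustrative and must match your model's depth):
```python
from .common.train import train
...
train.dist.data_parallel_size = 2
train.dist.pipeline_parallel_size = 4
# must match the number of layers in your model
train.dist.pipeline_num_layers = 24
```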
#### **Tensor Parallel + Pipeline Parallel for 2D Parallel Training on 8 GPUs**
In this example, 8 GPUs will be split into **4 stages**, each stage contains **2 GPUs** as a **group**. And different layers in the model will be put on different stages automatically for pipeline parallel training. The weight of the layers put on the specific stage will be split into 2 parts for tensor parallel training within the group.
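A sketch of the corresponding config (again, `pipeline_num_layers` is illustrative):
```python
from .common.train import train
...
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 4
# must match the number of layers in your model
train.dist.pipeline_num_layers = 24
```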
#### **Data Parallel + Tensor Parallel + Pipeline Parallel for 3D Parallel Training on 8 GPUs**
In this example, 8 GPUs will also be split into **2 stages**, and different layers in the model will be put on different stages for pipeline parallel training. Each stage only contains a portion of the whole model, and each stage will be split into **2 groups**. In each stage, the model will be replicated between **2 data-parallel groups**, and each **data-parallel group** contains **2 GPUs**. The input data will be split into 2 parts by chunking in the batch dimension for data-parallel training between **2 data-parallel groups**. Within each group, the weight of each layer will be split across **2 GPUs** for tensor parallel training.
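A sketch of the corresponding config (values illustrative, matching the 2 x 2 x 2 layout described above):
```python
from .common.train import train
...
train.dist.data_parallel_size = 2
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
# must match the number of layers in your model
train.dist.pipeline_num_layers = 24
```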
**Note:**`train.dist.data_parallel_size` will be automatically calculated by `(gpu_nums / (tensor_parallel_size * pipeline_parallel_size))` if only `train.dist.tensor_parallel_size` and `train.dist.pipeline_parallel_size` are set. For example:
```python
from .common.train import train
...
# only set tensor_parallel_size and pipeline_parallel_size
train.dist.tensor_parallel_size = 2
train.dist.pipeline_parallel_size = 2
```
The `data_parallel_size` will then be automatically set to `(8 / (2 * 2)) = 2`.
#### **Set `custom_pipeline_stage_id` for Load Balance**
In most cases, the transformer layers of common models have the same computational overhead, so there is no need to set `custom_pipeline_stage_id`.
But when the transformer layers have unbalanced computational overhead, you can set `custom_pipeline_stage_id` to manually balance the computation between stages in pipeline parallelism.
Evaluation is a process that takes a number of input/output pairs and aggregates them into metrics. You can always use the model directly and parse its inputs/outputs manually to perform evaluation. Alternatively, evaluation can be implemented in LiBai using the `DatasetEvaluator` interface.
LiBai includes a few `DatasetEvaluator` implementations that compute metrics like top-N accuracy, PPL (perplexity), etc. You can also implement your own `DatasetEvaluator` that performs some other job using the input/output pairs. For example, to count how many instances are detected on the validation set:
```python
class Counter(DatasetEvaluator):
def reset(self):
self.count = 0
def process(self, inputs, outputs):
for output in outputs:
self.count += len(output["instances"])
def evaluate(self):
# save self.count somewhere, or print it, or return it.
return {"count": self.count}
```
## Customize Evaluator using DatasetEvaluator
`DatasetEvaluator` is the base class for dataset evaluators. This class accumulates information about the inputs/outputs (via `process`) after every batch of inference, and produces the evaluation results at the end (via `evaluate`). The inputs come from `trainer.get_batch()`, which converts the outputs of `dataset.__getitem__()` into a dict. The outputs come from the dict returned by `model.forward()`.
First, declare a new evaluator class that inherits from `DatasetEvaluator` and overwrites its `process` and `evaluate` methods to satisfy your needs.
For example, declare a `MyEvaluator` class in `libai/evaluator/myevaluator.py`:
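A minimal sketch of such an evaluator. The accuracy computation and the `prediction_scores`/`labels` keys are placeholders that must match your own model and dataset, and the import assumes `DatasetEvaluator` is exposed by `libai.evaluation`:
```python
# libai/evaluator/myevaluator.py (a sketch)
from libai.evaluation import DatasetEvaluator


class MyEvaluator(DatasetEvaluator):
    def reset(self):
        self._correct = 0
        self._total = 0

    def process(self, inputs, outputs):
        # accumulate statistics from every batch of inputs/outputs
        preds = outputs["prediction_scores"].argmax(dim=-1)
        labels = inputs["labels"]
        self._correct += (preds == labels).sum().item()
        self._total += labels.numel()

    def evaluate(self):
        # produce the final metrics at the end
        return {"my_acc": self._correct / max(self._total, 1)}
```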
LiBai provides many features out of the box. This section shows how to configure them step by step.
## Automatic Mixed Precision Training
AMP stands for automatic mixed precision training. To enable AMP in LiBai, set `train.amp.enabled=True` in your configuration file.
### Usage
```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable amp
train.amp.enabled = True

# disable amp
train.amp.enabled = False
```
## Gradient Clipping
Gradient clipping is a technique that tackles exploding gradients. The idea of gradient clipping is very simple: the gradient will be rescaled down if it gets too large.
LiBai supports gradient clipping in a convenient way, and you don't have to implement it by yourself. You just need to add some settings to your configuration file to enable gradient clipping.
**Note:** We do not recommend writing gradient clipping by yourself, because naive gradient clipping may fail when using tensor parallel or pipeline parallel.
### Usage
```python
# import config
from .common.optim import optim

# get config
from libai.config import get_config
optim = get_config("common/optim.py").optim

# enable gradient clipping
optim.params.clip_grad_max_norm = 1.0
optim.params.clip_grad_norm_type = 2.0

# disable gradient clipping
optim.params.clip_grad_max_norm = None
optim.params.clip_grad_norm_type = None
```
`clip_grad_max_norm` and `clip_grad_norm_type` can be checked in [API docs of oneflow](https://oneflow.readthedocs.io/en/master/nn.html#oneflow.nn.utils.clip_grad_norm_).
## Gradient Accumulation
Gradient accumulation is a common strategy to train large-scale models when memory becomes the bottleneck. This technique splits the mini-batch into several micro-batches, then performs normal forward and backward operations. Models will only be updated after accumulating the gradients of all these micro-batches.
Besides, when training with pipeline parallelism, gradient accumulation allows different stages to execute different micro-batches in parallel, so the computation of the stages can be overlapped.
### Usage
```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable grad accumulation for 4 steps
train.num_accumulation_steps = 4

# disable grad accumulation
train.num_accumulation_steps = None
```
## Activation Checkpointing
To reduce GPU memory usage and deploy a large model to a training system, LiBai supports activation checkpointing. LiBai uses a Transformer layer as the unit of checkpointing, because the activation size bloats in the middle of a Transformer layer, so checkpointing the input of a Transformer layer is storage-efficient.
LiBai supports [activation checkpointing](https://arxiv.org/abs/1604.06174) by `set_activation_checkpoint` in `GraphBase`. So models using `libai.layers.TransformerLayer` support activation checkpointing by default. If you want to set activation checkpointing for customized layers, you need to override this function.
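Like the other features on this page, activation checkpointing is toggled through the training config; a minimal sketch:
```python
# import config
from .common.train import train

# enable activation checkpointing
train.activation_checkpoint.enabled = True
```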
## Zero Redundancy Optimizer (ZeRO)
Unlike normal data parallelism, where model states and gradients are replicated across data-parallel processes, the Zero Redundancy Optimizer (ZeRO) partitions them across data-parallel processes, which can reduce the memory footprint significantly.
- Level 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first and second moment estimates) are partitioned across the processes so that each process will only update its own partition.
- Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned so that each process retains only the gradients corresponding to its portion of the optimizer states.
> **Note:** ZeRO only supports data parallel and pipeline parallel, or the combination of them. If you use tensor parallel in your training, make sure ZeRO is disabled.
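ZeRO is enabled through the `zero_optimization` field of the training config shown earlier; a minimal sketch:
```python
# import config
from .common.train import train

# enable ZeRO with stage 1 (optimizer state partitioning)
train.zero_optimization.enabled = True
train.zero_optimization.stage = 1
```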
Instead of directly using [`flow.save()`](https://oneflow.readthedocs.io/en/master/oneflow.html?highlight=save#oneflow.save) and [`flow.load()`](https://oneflow.readthedocs.io/en/master/oneflow.html?highlight=oneflow.load#oneflow.load), LiBai provides the [`checkpoint module`](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.checkpoint) to deal with the complex situations when saving/loading model.
Typically, you don't need to write the relative code to load/save weights trained from LiBai when using LiBai's `DefaultTrainer` and `LazyConfig`. For more details, see [Training & Evaluation in Command Line](https://libai.readthedocs.io/en/latest/tutorials/basics/Train_and_Eval_Command_Line.html) which introduces `weight load` and `resume training` settings in `config.py` or in command line for standard training.
Here we introduce how to load & save weights according to your custom needs. Suppose you have a model trained with LiBai.
```shell
# your model directory
output/finetune_qqp
├── config.yaml
├── last_checkpoint
├── log.txt
├── log.txt.rank1
├── log.txt.rank2
├── log.txt.rank3
├── metrics.json
├── model_0000009
│ ├── graph
│ ├── lr_scheduler
│ └── model
├── model_0000019
│ ├── graph
│ ├── lr_scheduler
│ └── model
├── model_best
│ ├── graph
│ ├── lr_scheduler
│ └── model
└── model_final
├── graph
├── lr_scheduler
└── model
```
The following code shows how to load/save model weights:
```python
from libai.utils.checkpoint import Checkpointer
from path.to.your.build_model import build_model

model = build_model(cfg.model)

# load model weights
checkpointer = Checkpointer(model)
checkpointer.load(path_to_model)  # path_to_model should be "output/finetune_qqp/model_final"

# save model weights
checkpointer.save("model_999")  # save to output/model_999
```
You can also save other information (e.g. `optim`, `scheduler`) besides the model weights by using `checkpointer`. See [libai.utils.checkpoint](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.checkpoint) for more details.
You can set the `--json-keys` argument to select specific fields of each sample; the other keys will not be used.
Then, process the JSON file into a binary format for training. To convert the JSON into the mmap, cached index file, or lazy loader format, use `tools/preprocess_data.py` and set the `--dataset-impl` flag to `mmap`, `cached`, or `lazy`, respectively, to prepare your own dataset for training BERT.
Further command line arguments are described in the source file [`preprocess_data.py`](https://github.com/Oneflow-Inc/libai/blob/main/tools/preprocess_data.py).
LiBai provides multiple arguments covering a variety of situations.
## Training
LiBai provides `tools/train.sh` and `tools/train_net.py` for launching training & evaluation tasks.
You can modify `tools/train_net.py` according to your own needs.
### Training & Evaluation
To run complete training and evaluation, run:
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 # number of gpus
```
### Training & Partial Evaluation
If the evaluation process is time-consuming, you can set the parameter `train.evaluation.eval_iter` in your `config.py` to a smaller number such as 20, which makes evaluation faster by using only part of the test set. You can also set the parameter directly in the command line:
**Note:** the eval metric will then be calculated on only part of the test dataset.
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
train.evaluation.eval_iter=20 # set eval_iter for testing
```
### Training & No Evaluation
To train without evaluation, set `train.evaluation.enabled=False` in your `config.py` or in the command line:
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
train.evaluation.enabled=False # set no evaluation
```
### Resume Training
To resume training, add `--resume` in the command line, and set `train.output_dir` in your `config.py` or in the command line.
For example, if your training was interrupted unexpectedly and your latest model path is `output/demo/model_0000019/`, set `train.output_dir=output/demo` to resume training:
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
--resume \ # set resume training
train.output_dir=path/task # set resume path, it should be parent directory of model path
```
## Evaluation
To evaluate your model without training, set `--eval-only` in your command line, and set `train.load_weight`.
Besides, `train.evaluation.eval_iter=20` also takes effect with `--eval-only` if you set it. You can set `eval_iter` according to your own needs.
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
--eval-only \ # set eval without train
train.load_weight=path/task/model_final # set model path
```
## Quickly check in the respective loop
To find out whether there are any bugs in your program, pass `--fast-dev-run` to the command line, which will change config settings to:
```python
train.train_epoch=0
train.train_iter=20
train.evaluation.eval_period=10
train.log_period=1
```
Besides, `train.evaluation.eval_iter=20` also takes effect with `--fast-dev-run` if you set it. You can set `eval_iter` according to your own needs.
```shell
bash tools/train.sh \
tools/train_net.py \
path_to_your_config.py \ # config.py for your task
4 \ # number of gpus
--fast-dev-run # quickly check the training/evaluation loop
```
To run training, we highly recommend using the standardized `trainer` in LiBai.
## Trainer Abstraction
LiBai provides a standardized `trainer` abstraction with a hook system to help simplify the standard training behavior.
`DefaultTrainer` is initialized from the lazy config system and is used by `tools/train_net.py` and many scripts. It includes many standard default behaviors that you might want to opt into, including default configurations for the optimizer, learning rate scheduler, logging, evaluation, model checkpointing, etc.
For simple customizations (e.g. change optimizer, evaluator, LR scheduler, data loader, etc.), you can just modify the corresponding configuration in `config.py` according to your own needs (refer to [Config_System](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html#configs-in-libai)).
## Customize a DefaultTrainer
For complicated customizations, we recommend overwriting the functions in [DefaultTrainer](https://github.com/Oneflow-Inc/libai/blob/main/libai/engine/default.py).
In `DefaultTrainer`, the training process consists of `run_step in trainer` and `hooks` which can be modified according to your own needs.
The following code indicates how `run_step` and `hooks` work during training:
```python
class DefaultTrainer(TrainerBase):
    def train(self, start_iter: int, max_iter: int):
        ...
        with EventStorage(self.start_iter) as self.storage:
            try:
                self.before_train()  # in hooks
                for self.iter in range(start_iter, max_iter):
                    self.before_step()  # in hooks
                    self.run_step()  # in self._trainer
                    self.after_step()  # in hooks
                self.iter += 1
            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()  # in hooks
```
Refer to `tools/train_net.py` to rewrite `tools/my_train_net.py` with your modified `_trainer` and `hooks`. The next subsection will introduce how to modify them.
There will always be some non-standard behaviors that the ``trainer & hook`` system cannot easily support, especially in research. Therefore, we intentionally keep the ``trainer & hook`` system minimal rather than powerful.
### Customize Hooks in Trainer
You can customize your own hooks for some extra tasks during training.
[HookBase](https://github.com/Oneflow-Inc/libai/blob/ffe5ca0e46544d1cbb4fbe88d9185f96c0dc2c95/libai/engine/trainer.py#L28) in `libai/engine/trainer.py` provides a standard interface for writing hooks. You can overwrite its functions according to your own needs. Please refer to [libai/engine/hooks.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/engine/hooks.py) for more details.
```python
class HookBase:
    def before_train(self):
        """
        Called before the first iteration.
        """

    def after_train(self):
        """
        Called after the last iteration.
        """

    def before_step(self):
        """
        Called before each iteration.
        """

    def after_step(self):
        """
        Called after each iteration.
        """
```
Depending on the functionality of the hook, you can specify what the hook will do at each stage of the training in ``before_train``, ``after_train``, ``before_step``, ``after_step``. For example, to print `iter` in trainer during training:
```python
class InfoHook(HookBase):
    def before_train(self):
        logger.info(f"start training at {self.trainer.iter}")

    def after_train(self):
        logger.info(f"end training at {self.trainer.iter}")

    def after_step(self):
        if self.trainer.iter % 100 == 0:
            logger.info(f"iteration {self.trainer.iter}!")
```
Then you can import your `hook` in `tools/my_train_net.py`.
### Modify train_step in Trainer
LiBai provides `EagerTrainer` and `GraphTrainer` in `libai/engine/trainer.py` by default. `EagerTrainer` is used in `eager` mode, while `GraphTrainer` is used in `graph` mode, and the mode is determined by the `graph.enabled` parameter in your `config.py`.
> For more details about `eager` and `graph` mode, please refer to [oneflow doc](https://docs.oneflow.org/en/master/basics/08_nn_graph.html).
For example, using a temp variable to keep the model's output in run_step:
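A rough, hypothetical sketch of the idea, not LiBai's actual implementation: the argument and attribute names below (`get_batch`, `self._data_loader_iter`, `self.model`, `self.optimizer`) mirror a typical eager training step and should be checked against `EagerTrainer` in `libai/engine/trainer.py`:
```python
from libai.engine.trainer import EagerTrainer


class MyEagerTrainer(EagerTrainer):
    """Keeps the model's latest output dict in a temp variable."""

    def run_step(self, get_batch):
        # fetch a batch the way the default eager trainer does (hypothetical names)
        data = get_batch(self._data_loader_iter)
        loss_dict = self.model(**data)
        # the temp variable keeping the model's output for later inspection (e.g. by hooks)
        self.last_output = loss_dict
        # standard optimization step
        losses = sum(loss_dict.values())
        self.optimizer.zero_grad()
        losses.backward()
        self.optimizer.step()
```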
Then you can set your `MyEagerTrainer` as `self.trainer` in `tools/my_train_net.py`.
## Logging of Metrics
During training, the trainer puts metrics into a centralized [EventStorage](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.events). The following code can be used to access it and log metrics to it:
```python
from libai.utils.events import get_event_storage

# inside the model:
if self.training:
    value = ...  # compute the value from inputs
    storage = get_event_storage()
    storage.put_scalar("some_accuracy", value)
```
See [EventStorage](https://libai.readthedocs.io/en/latest/modules/libai.utils.html#module-libai.utils.events) for more details.
Metrics are then written to various destinations with EventWriter. Metrics information will be written to `{cfg.train.output_dir}/metrics.json`. DefaultTrainer enables a few EventWriter with default configurations. See above for how to customize them.
This tutorial explains how the dataset APIs work, and how to customize your own datasets with them.
## Build Common Dataloaders
To build dataloaders in LiBai, we highly recommend users to use the default `build_nlp_train_val_test_loader`, `build_nlp_train_loader`, `build_nlp_test_loader`, `build_image_train_loader` and `build_image_test_loader` which are defined in [`libai/data/build.py`](https://github.com/Oneflow-Inc/libai/blob/main/libai/data/build.py) for most of the common cases.
The only thing you need to do is write a PyTorch-style `Dataset` and return the `Instance` structure in `__getitem__`. The `Instance` structure stores the attributes of an instance (e.g., image, tokens) as "fields", and the `DistTensorData` structure provides a standard `to_global()` function (called in `get_batch()`) to convert local tensors to global tensors.
The `Instance` returned by the `__getitem__` function must contain the same keys as the arguments of the model's `forward` function.
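For example, a minimal sketch of such a dataset. The field names `input_ids` and `labels` are placeholders that must match your model's `forward` arguments, and the import assumes `DistTensorData` and `Instance` live in `libai.data.structures`:
```python
import oneflow as flow
from oneflow.utils.data import Dataset

from libai.data.structures import DistTensorData, Instance


class MyDataset(Dataset):
    def __init__(self, samples):
        # `samples` is a list of (token_ids, label) pairs -- purely illustrative
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        token_ids, label = self.samples[idx]
        return Instance(
            # key names must match the argument names of model.forward()
            input_ids=DistTensorData(flow.tensor(token_ids, dtype=flow.long)),
            # placement_idx=-1 because `labels` is only used by the loss function
            labels=DistTensorData(flow.tensor(label, dtype=flow.long), placement_idx=-1),
        )
```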
**NOTE:** Set `placement_idx=-1` in `DistTensorData` when the `tensor` is **only** used in the `loss_function`; this is used for pipeline parallel training.
In particular, if you need to generate your own `attention_mask`, its values can only be `0` or `1`, because LiBai already processes the `attention_mask` in [`libai/layers/attention.py`](https://github.com/Oneflow-Inc/libai/blob/main/libai/layers/attention.py).
After finishing your `MyDataset`, set `dataloader` in your `config.py` depending on your needs. If you have only one training dataset for an NLP task and want to split it into `train`, `valid` and `test` datasets automatically, you can choose `build_nlp_train_val_test_loader`; evaluation will then be calculated on the `valid` and `test` datasets.
Otherwise, you can choose `build_nlp_train_loader` && `build_nlp_test_loader` or `build_image_train_loader` && `build_image_test_loader` in `config.py` according to your own needs.
See [`libai/data/build.py`](https://github.com/Oneflow-Inc/libai/blob/main/libai/data/build.py) for more details.
This section introduces how to implement a new model entirely from scratch and make it compatible with LiBai.
## Construct Models in LiBai
LiBai uses [LazyConfig](https://libai.readthedocs.io/en/latest/tutorials/Config_System.html) for a more flexible config system, which means you can simply import your own model in your config and train it under LiBai.
For an image classification task, the input data is usually a batch of images and labels. The following code shows how to build a toy model for this task:
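A minimal sketch (the backbone layers here are illustrative, not LiBai's original example; `Linear` comes from `libai.layers` so that it carries the pre-defined SBP signatures):
```python
import oneflow.nn as nn

from libai.layers import Linear


class ToyModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # a tiny backbone; any feature extractor works here
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = Linear(16, num_classes)
        self.loss_func = nn.CrossEntropyLoss()

    def forward(self, images, labels=None):
        x = self.features(images)
        prediction_scores = self.head(x)
        if labels is not None and self.training:
            # the loss is computed inside the model and returned as a dict
            losses = self.loss_func(prediction_scores, labels)
            return {"losses": losses}
        return {"prediction_scores": prediction_scores}
```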
- For classification models, the ``forward`` function must have ``images`` and ``labels`` as arguments, which correspond to the output in ``__getitem__`` of LiBai's built-in datasets. Please refer to [imagenet.py](https://github.com/Oneflow-Inc/libai/blob/main/libai/data/datasets/imagenet.py) for more details about the dataset.
- **This toy model** will return ``losses`` during training and ``prediction_scores`` during inference, and both of them should be of type ``dict``, which means you should implement the ``loss function`` in your model, like ``self.loss_func = nn.CrossEntropyLoss()`` as the ToyModel above shows.
## Import the model in config
With ``LazyConfig System``, you can simply import the model in your config file. The following code shows how to use ``ToyModel`` in your config file:
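A sketch (adjust the import path to wherever your `ToyModel` lives):
```python
# config.py
from libai.config import LazyCall
from path.to.toy_model import ToyModel

model = LazyCall(ToyModel)(num_classes=1000)
```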
Here we provide the benchmark speed test results of LiBai's models compared with the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) implementations. In LiBai v0.2.0, we only benchmark speed on up to 32 GPUs across 4 nodes, and all of the experiments were conducted under the same settings for a fair comparison.
## Settings
### Environments
- The commit of LiBai for comparison: [commit](https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222)
- The commit of OneFlow for comparison: [commit](https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f)
- The commit of Megatron-LM for comparison: [commit](https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840)
### Model Hyper-parameters
- **BERT Model**
```python
num_layers=24/48
num_attention_heads=16
hidden_size=1024
seq_length=512
```
- **GPT-2 Model**
```python
num_layers=24/48
num_attention_heads=16
hidden_size=1024
seq_length=1024
```
## Main Results
Here we explain the evaluation indicators in the following tables:
- **fp16**: mixed precision training
- **nl**: num layers (when pipeline parallel size = 8, we adjust the num layers from 24 to 48 so that each stage gets a comparable number of layers to compute)
- **ac**: enable activation checkpointing
- **mb**: micro-batch size per GPU
- **gb**: global batch size in total
- **d x m x p**:
  - d: data-parallel-size
  - m: tensor-model-parallel-size
  - p: pipeline-model-parallel-size
- **1n1g**: 1 node, 1 GPU
- **2n8g**: 2 nodes, 8 GPUs per node, 16 GPUs in total
- **4n8g**: 4 nodes, 8 GPUs per node, 32 GPUs in total
- Create a conda virtual environment and activate it:
```bash
conda create -n libai python=3.8 -y
conda activate libai
```
- Install the stable release of OneFlow with `CUDA` support. See the [OneFlow installation guide](https://github.com/Oneflow-Inc/oneflow#install-with-pip-package). To use the **latest** LiBai (branch `main`), we highly recommend installing the **nightly** version of OneFlow.
A collection of parallel training strategies is supported in LiBai:
- **Data Parallel Training**
- **Tensor Parallel Training**
- **Pipeline Parallel Training**
You can refer to OneFlow official [tutorial](https://docs.oneflow.org/en/master/parallelism/01_introduction.html) to better understand the basic conception of parallelization techniques.
## Supported Models in LiBai
For more details about the supported parallelism training on different models, please refer to the following table:
✔ means you can train this model under the corresponding parallelism technique, or combine two or three techniques marked with ✔ for 2D or 3D parallelism training.
## Baselines
Here is the collection of baselines trained with LiBai. Due to our resource constraints, we will gradually release the training results in the future.
### Main Results on ImageNet with Pretrained Models
- Download the dataset and move the data file to the folder. The file structure should be like:
```bash
$ tree data
path/to/bert_data
├── bert-base-chinese-vocab.txt
├── loss_compara_content_sentence.bin
└── loss_compara_content_sentence.idx
```
### How to Train Bert_large Model with Parallelism
We provide `train.sh` for executing training. Before invoking the script, perform the following steps.
**Step 1. Set data path and vocab path**
- Update the data path and vocab path in the [bert_large_pretrain](https://github.com/Oneflow-Inc/libai/blob/main/configs/bert_large_pretrain.py) config file.

**Step 2. Configure the training parameters**
- In the provided [`configs/bert_large_pretrain.py`](https://github.com/Oneflow-Inc/libai/blob/main/configs/bert_large_pretrain.py), a set of parameters is defined, including the training scheme, model, etc.
- You can also modify the parameter settings. For example, if you want to use 8 GPUs for training, you can refer to the file [`configs/common/train.py`](https://github.com/Oneflow-Inc/libai/blob/main/configs/common/train.py). If you want to train the model with 2D mesh hybrid parallelism (4 groups for data parallelism and 2 groups for tensor parallelism), you can set the parameters as follows:
```python
train.dist.data_parallel_size=4
train.dist.tensor_parallel_size=2
```
**Step 3. Invoke parallel training**
- To train `BertForPreTraining` model on a single node with 8 GPUs, run:
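Following the `train.sh` usage shown earlier (script, config file, number of GPUs), the command would look like this:
```shell
bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```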
- The default ViT model in LiBai is set to `vit_tiny_patch16_224`. To train other ViT models, update the [vit_imagenet](https://github.com/Oneflow-Inc/libai/blob/main/configs/vit_imagenet.py) config file by importing other ViT models as follows:
```python
# from .common.models.vit.vit_tiny_patch16_224 import model