Layer integration (#83)

* integrated parallel layers for ease of building models
* integrated 2.5d layers
* cleaned codes and unit tests
* added log metric by step hook; updated imagenet benchmark; fixed some bugs
* reworked initialization; cleaned codes

Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>

Commit 0fedef4f, authored Dec 27, 2021 by アマデウス; committed by GitHub Dec 27, 2021
20 changed files
--- a/benchmark/README.md
+++ b/benchmark/README.md
+# Benchmark for Tuning Accuracy and Efficiency
+
+## Overview
+
+This benchmark collects our efforts to train different tasks with Colossal-AI and achieve SOTA results.
+We are interested in both validation accuracy and training speed, and prefer larger batch sizes to take advantage of more GPU devices.
+For example, we trained a vision transformer with batch size 512 on CIFAR10 and 4096 on ImageNet1k, batch sizes that are rarely used in existing work.
+Some of the benchmark results, obtained with 8x A100 GPUs, are shown below.
+
+| Task       | Model        | Training Time | Top-1 Accuracy |
+| ---------- | ------------ | ------------- | -------------- |
+| CIFAR10    | [ViT-Lite-7/4](https://arxiv.org/pdf/2104.05704.pdf) | ~ 16 min      | ~ 90.5%        |
+| ImageNet1k | ViT-S/16     | ~ 16.5 h      | ~ 74.5%        |
+
+The `train.py` script in each task runs training with the specific configuration script in `configs/` for different parallelisms.
+Supported parallelisms include data parallel only (ends with `vanilla`), 1D (ends with `1d`), 2D (ends with `2d`), 2.5D (ends with `2p5d`), 3D (ends with `3d`).
+
+Each configuration script includes the following elements, taking the ImageNet1k task as an example:
+```
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+# data parallel only
+TENSOR_PARALLEL_SIZE = 1
+TENSOR_PARALLEL_MODE = None
+
+# parallelism setting
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, ) # amp setting
+
+gradient_accumulation = 2 # accumulate 2 steps for gradient update
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation # actual batch size for dataloader
+
+clip_grad_norm = 1.0 # clip gradient with norm 1.0
+```
+Upper-case elements are what `train.py` needs, and lower-case elements are what Colossal-AI needs to initialize training.
+
+## Usage
+
+To start training, use the following command to run each worker:
+```
+$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
+                                        --rank=RANK \
+                                        --local_rank=LOCAL_RANK \
+                                        --host=MASTER_IP_ADDRESS \
+                                        --port=MASTER_PORT \
+                                        --config=CONFIG_FILE
+```
+It is also recommended to start training with `torchrun` as:
+```
+$ DATA=/path/to/dataset torchrun --nproc_per_node=NUM_GPUS_PER_NODE \
+                                 --nnodes=NUM_NODES \
+                                 --node_rank=NODE_RANK \
+                                 --master_addr=MASTER_IP_ADDRESS \
+                                 --master_port=MASTER_PORT \
+                                 train.py --config=CONFIG_FILE
+```
\ No newline at end of file
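Note on the batch-size settings in the README above: the sketch below traces how `TOTAL_BATCH_SIZE`, `gradient_accumulation` and the parallel sizes could translate into the per-rank batch size that the training scripts pass to their dataloader builders (`gpc.config.BATCH_SIZE // gpc.data_parallel_size`). The helper name `per_rank_batch_size` and the 8-GPU world size are assumptions for illustration only, and the sketch assumes the data-parallel size equals the world size divided by the product of the tensor and pipeline sizes.

```python
# Sketch only: `per_rank_batch_size` is a hypothetical helper, not part of Colossal-AI.
def per_rank_batch_size(total_batch_size,
                        gradient_accumulation,
                        world_size,
                        tensor_parallel_size=1,
                        pipeline_parallel_size=1):
    """Return (BATCH_SIZE, data-parallel size, per-rank dataloader batch size)."""
    # BATCH_SIZE in the config: samples per forward/backward pass across all ranks
    batch_size = total_batch_size // gradient_accumulation
    # Assumption: ranks not used for tensor/pipeline parallelism form the data-parallel group
    data_parallel_size = world_size // (tensor_parallel_size * pipeline_parallel_size)
    # Mirrors `gpc.config.BATCH_SIZE // gpc.data_parallel_size` in the training scripts
    return batch_size, data_parallel_size, batch_size // data_parallel_size


if __name__ == "__main__":
    # Data parallel only (vanilla) on an assumed 8-GPU node
    print(per_rank_batch_size(4096, 2, world_size=8))
    # 1D tensor parallelism of size 4 on the same node
    print(per_rank_batch_size(4096, 2, world_size=8, tensor_parallel_size=4))
```

With tensor parallelism enabled, fewer ranks remain for data parallelism, so each data-parallel rank loads a larger share of `BATCH_SIZE`.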
--- a/benchmark/cifar/configs/vit_1d.py
+++ b/benchmark/cifar/configs/vit_1d.py
+BATCH_SIZE = 512
+LEARNING_RATE = 2e-3
+WEIGHT_DECAY = 3e-2
+
+TENSOR_PARALLEL_SIZE = 4
+TENSOR_PARALLEL_MODE = '1d'
+
+NUM_EPOCHS = 200
+WARMUP_EPOCHS = 40
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+seed = 42
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_cifar10_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}/"
--- a/benchmark/cifar/configs/vit_2d.py
+++ b/benchmark/cifar/configs/vit_2d.py
+BATCH_SIZE = 512
+LEARNING_RATE = 2e-3
+WEIGHT_DECAY = 3e-2
+
+TENSOR_PARALLEL_SIZE = 4
+TENSOR_PARALLEL_MODE = '2d'
+
+NUM_EPOCHS = 200
+WARMUP_EPOCHS = 40
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+seed = 42
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_cifar10_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}/"
--- a/benchmark/cifar/configs/vit_2p5d.py
+++ b/benchmark/cifar/configs/vit_2p5d.py
+BATCH_SIZE = 512
+LEARNING_RATE = 2e-3
+WEIGHT_DECAY = 3e-2
+
+TENSOR_PARALLEL_SIZE = 4
+DEPTH = 1
+TENSOR_PARALLEL_MODE = '2.5d'
+
+NUM_EPOCHS = 200
+WARMUP_EPOCHS = 40
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE, depth=DEPTH),
+)
+
+seed = 42
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_cifar10_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}/"
--- a/benchmark/cifar/configs/vit_3d.py
+++ b/benchmark/cifar/configs/vit_3d.py
+BATCH_SIZE = 512
+LEARNING_RATE = 2e-3
+WEIGHT_DECAY = 3e-2
+
+TENSOR_PARALLEL_SIZE = 8
+TENSOR_PARALLEL_MODE = '3d'
+
+NUM_EPOCHS = 200
+WARMUP_EPOCHS = 40
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+seed = 42
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_cifar10_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}/"
--- a/benchmark/cifar/configs/vit_vanilla.py
+++ b/benchmark/cifar/configs/vit_vanilla.py
+BATCH_SIZE = 512
+LEARNING_RATE = 2e-3
+WEIGHT_DECAY = 3e-2
+
+TENSOR_PARALLEL_SIZE = 1
+TENSOR_PARALLEL_MODE = None
+
+NUM_EPOCHS = 200
+WARMUP_EPOCHS = 40
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+seed = 42
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_cifar10_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}/"
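Note on the tensor-parallel sizes in the configs above: size 4 for `'2d'`, size 4 with `DEPTH = 1` for `'2.5d'`, and size 8 for `'3d'` match the process grids these modes are usually described with (q x q, depth x q x q, and q x q x q). The sketch below checks a size against those shapes. `is_valid_tensor_parallel_size` is a hypothetical helper, not a Colossal-AI API, and the library's own validation may differ.

```python
import math


def is_valid_tensor_parallel_size(mode, size, depth=1):
    """Check whether `size` fits the process grid assumed for `mode`."""
    if mode is None:
        # Data parallel only: no tensor parallelism
        return size == 1
    if mode == '1d':
        return size >= 1
    if mode == '2d':
        # Assumed q x q grid
        q = math.isqrt(size)
        return q * q == size
    if mode == '2.5d':
        # Assumed depth x q x q grid
        if depth < 1 or size % depth != 0:
            return False
        q = math.isqrt(size // depth)
        return depth * q * q == size
    if mode == '3d':
        # Assumed q x q x q grid
        q = round(size ** (1 / 3))
        return q ** 3 == size
    raise ValueError(f"unknown tensor parallel mode: {mode!r}")


if __name__ == "__main__":
    for mode, size, depth in [(None, 1, 1), ('1d', 4, 1), ('2d', 4, 1), ('2.5d', 4, 1), ('3d', 8, 1)]:
        print(mode, size, depth, is_valid_tensor_parallel_size(mode, size, depth))
```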
--- a/benchmark/cifar/train.py
+++ b/benchmark/cifar/train.py
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+
+import os
+
+import colossalai
+import torch
+import torchvision
+from colossalai.builder import *
+from colossalai.core import global_context as gpc
+from colossalai.logging import get_dist_logger
+from colossalai.nn import Accuracy, CrossEntropyLoss
+from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
+from colossalai.trainer import Trainer
+from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
+                                      LogMetricByEpochHook,
+                                      LogMetricByStepHook,
+                                      LogTimingByEpochHook, LossHook,
+                                      LRSchedulerHook, ThroughputHook)
+from colossalai.utils import MultiTimer, get_dataloader
+from model_zoo.vit import vit_lite_depth7_patch4_32
+from torchvision import transforms
+
+DATASET_PATH = str(os.environ['DATA'])
+
+
+def build_cifar(batch_size):
+    transform_train = transforms.Compose([
+        transforms.RandomCrop(32, padding=4),
+        transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
+        transforms.ToTensor(),
+        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
+    ])
+    transform_test = transforms.Compose([
+        transforms.Resize(32),
+        transforms.ToTensor(),
+        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
+    ])
+
+    train_dataset = torchvision.datasets.CIFAR10(root=DATASET_PATH,
+                                                 train=True,
+                                                 download=True,
+                                                 transform=transform_train)
+    test_dataset = torchvision.datasets.CIFAR10(root=DATASET_PATH, train=False, transform=transform_test)
+    train_dataloader = get_dataloader(dataset=train_dataset,
+                                      shuffle=True,
+                                      batch_size=batch_size,
+                                      num_workers=4,
+                                      pin_memory=True)
+    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=4, pin_memory=True)
+    return train_dataloader, test_dataloader
+
+
+def train_cifar():
+    args = colossalai.get_default_parser().parse_args()
+    # standard launch
+    # colossalai.launch(config=args.config,
+    #                   rank=args.rank,
+    #                   world_size=args.world_size,
+    #                   local_rank=args.local_rank,
+    #                   host=args.host,
+    #                   port=args.port)
+
+    # launch from torchrun
+    colossalai.launch_from_torch(config=args.config)
+
+    logger = get_dist_logger()
+    if hasattr(gpc.config, 'LOG_PATH'):
+        if gpc.get_global_rank() == 0:
+            log_path = gpc.config.LOG_PATH
+            if not os.path.exists(log_path):
+                os.mkdir(log_path)
+            logger.log_to_file(log_path)
+
+    tp = gpc.config.parallel.tensor.mode
+
+    model = vit_lite_depth7_patch4_32(tensor_parallel=tp)
+
+    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)
+
+    criterion = CrossEntropyLoss(label_smoothing=0.1, tensor_parallel=tp)
+
+    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
+
+    steps_per_epoch = len(train_dataloader)
+
+    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
+                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
+                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)
+
+    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
+                                                                                    optimizer=optimizer,
+                                                                                    criterion=criterion,
+                                                                                    train_dataloader=train_dataloader,
+                                                                                    test_dataloader=test_dataloader,
+                                                                                    lr_scheduler=lr_scheduler)
+
+    logger.info("Engine is built", ranks=[0])
+
+    timer = MultiTimer()
+
+    trainer = Trainer(engine=engine, logger=logger, timer=timer)
+    logger.info("Trainer is built", ranks=[0])
+
+    hooks = [
+        LogMetricByEpochHook(logger=logger),
+        LogMetricByStepHook(),
+        # LogTimingByEpochHook(timer=timer, logger=logger),
+        # LogMemoryByEpochHook(logger=logger),
+        AccuracyHook(accuracy_func=Accuracy(tensor_parallel=tp)),
+        LossHook(),
+        ThroughputHook(),
+        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
+    ]
+
+    logger.info("Train start", ranks=[0])
+    trainer.fit(train_dataloader=train_dataloader,
+                test_dataloader=test_dataloader,
+                epochs=gpc.config.NUM_EPOCHS,
+                hooks=hooks,
+                display_progress=True,
+                test_interval=1)
+
+
+if __name__ == '__main__':
+    train_cifar()
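Note on the learning-rate schedule in `train_cifar()` above: the scheduler is stepped per iteration (`by_epoch=False`), so `total_steps` and `warmup_steps` are the epoch counts multiplied by `steps_per_epoch`. The sketch below shows a generic linear-warmup plus cosine-annealing schedule in plain Python. It is an illustration under assumptions, not Colossal-AI's `CosineAnnealingWarmupLR` implementation, and the function name, the `eta_min` parameter and the value `steps_per_epoch = 100` are hypothetical.

```python
import math


def warmup_cosine_lr(step, base_lr, total_steps, warmup_steps, eta_min=0.0):
    """Learning rate at `step` for linear warmup followed by cosine annealing."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps_per_epoch = 100  # hypothetical; the script uses len(train_dataloader)
    total_steps = 200 * steps_per_epoch  # NUM_EPOCHS * steps_per_epoch
    warmup_steps = 40 * steps_per_epoch  # WARMUP_EPOCHS * steps_per_epoch
    for step in (0, warmup_steps - 1, total_steps // 2, total_steps):
        print(step, warmup_cosine_lr(step, 2e-3, total_steps, warmup_steps))
```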
--- a/benchmark/imagenet100/configs/vit_1d.py
+++ b/benchmark/imagenet100/configs/vit_1d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 4
+TENSOR_PARALLEL_MODE = '1d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet100_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet100/configs/vit_2d.py
+++ b/benchmark/imagenet100/configs/vit_2d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 4
+TENSOR_PARALLEL_MODE = '2d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet100_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet100/configs/vit_2p5d.py
+++ b/benchmark/imagenet100/configs/vit_2p5d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 4
+DEPTH = 1
+TENSOR_PARALLEL_MODE = '2.5d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE, depth=DEPTH),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet100_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet100/configs/vit_3d.py
+++ b/benchmark/imagenet100/configs/vit_3d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 8
+TENSOR_PARALLEL_MODE = '3d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet100_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet100/configs/vit_vanilla.py
+++ b/benchmark/imagenet100/configs/vit_vanilla.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 1
+TENSOR_PARALLEL_MODE = None
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet100_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet100/train.py
+++ b/benchmark/imagenet100/train.py
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+
+import glob
+import os
+
+import colossalai
+import nvidia.dali.fn as fn
+import nvidia.dali.tfrecord as tfrec
+import torch
+from colossalai.builder import *
+from colossalai.context import ParallelMode
+from colossalai.core import global_context as gpc
+from colossalai.logging import get_dist_logger
+from colossalai.nn import Accuracy, CrossEntropyLoss
+from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
+from colossalai.trainer import Trainer
+from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook, LogMetricByEpochHook, LogMetricByStepHook,
+                                      LogTimingByEpochHook, LossHook, LRSchedulerHook, ThroughputHook)
+from colossalai.utils import MultiTimer
+from model_zoo.vit import vit_small_patch16_224
+from nvidia.dali import types
+from nvidia.dali.pipeline import Pipeline
+from nvidia.dali.plugin.pytorch import DALIClassificationIterator
+
+DATASET_PATH = str(os.environ['DATA'])
+
+TRAIN_RECS = DATASET_PATH + '/train/*'
+VAL_RECS = DATASET_PATH + '/validation/*'
+TRAIN_IDX = DATASET_PATH + '/idx_files/train/*'
+VAL_IDX = DATASET_PATH + '/idx_files/validation/*'
+
+
+class DaliDataloader(DALIClassificationIterator):
+    def __init__(self,
+                 tfrec_filenames,
+                 tfrec_idx_filenames,
+                 shard_id=0,
+                 num_shards=1,
+                 batch_size=128,
+                 num_threads=4,
+                 resize=256,
+                 crop=224,
+                 prefetch=2,
+                 training=True,
+                 gpu_aug=False,
+                 cuda=True):
+        pipe = Pipeline(batch_size=batch_size,
+                        num_threads=num_threads,
+                        device_id=torch.cuda.current_device() if cuda else None,
+                        seed=1024)
+        with pipe:
+            inputs = fn.readers.tfrecord(path=tfrec_filenames,
+                                         index_path=tfrec_idx_filenames,
+                                         random_shuffle=training,
+                                         shard_id=shard_id,
+                                         num_shards=num_shards,
+                                         initial_fill=10000,
+                                         read_ahead=True,
+                                         prefetch_queue_depth=prefetch,
+                                         name='Reader',
+                                         features={
+                                             'image/encoded': tfrec.FixedLenFeature((), tfrec.string, ""),
+                                             'image/class/label': tfrec.FixedLenFeature([1], tfrec.int64, -1),
+                                         })
+            images = inputs["image/encoded"]
+
+            if training:
+                images = fn.decoders.image(images, device='mixed' if gpu_aug else 'cpu', output_type=types.RGB)
+                images = fn.random_resized_crop(images, size=crop, device='gpu' if gpu_aug else 'cpu')
+                flip_lr = fn.random.coin_flip(probability=0.5)
+            else:
+                # decode jpeg and resize
+                images = fn.decoders.image(images, device='mixed' if gpu_aug else 'cpu', output_type=types.RGB)
+                images = fn.resize(images,
+                                   device='gpu' if gpu_aug else 'cpu',
+                                   resize_x=resize,
+                                   resize_y=resize,
+                                   dtype=types.FLOAT,
+                                   interp_type=types.INTERP_TRIANGULAR)
+                flip_lr = False
+
+            # center crop and normalise
+            images = fn.crop_mirror_normalize(images,
+                                              dtype=types.FLOAT,
+                                              crop=(crop, crop),
+                                              mean=[127.5],
+                                              std=[127.5],
+                                              mirror=flip_lr)
+            label = inputs["image/class/label"] - 1  # shift 1-based labels to start at 0
+            # LSG: element_extract will raise exception, let's flatten outside
+            # label = fn.element_extract(label, element_map=0)  # Flatten
+            if cuda:  # transfer data to gpu
+                pipe.set_outputs(images.gpu(), label.gpu())
+            else:
+                pipe.set_outputs(images, label)
+
+        pipe.build()
+        last_batch_policy = 'DROP' if training else 'PARTIAL'
+        super().__init__(pipe, reader_name="Reader", auto_reset=True, last_batch_policy=last_batch_policy)
+
+    def __iter__(self):
+        # if not reset (after an epoch), reset; if just initialize, ignore
+        if self._counter >= self._size or self._size < 0:
+            self.reset()
+        return self
+
+    def __next__(self):
+        data = super().__next__()
+        img, label = data[0]['data'], data[0]['label']
+        label = label.squeeze()
+        return (img, ), (label, )
+
+
+def build_dali_train(batch_size):
+    return DaliDataloader(
+        sorted(glob.glob(TRAIN_RECS)),
+        sorted(glob.glob(TRAIN_IDX)),
+        batch_size=batch_size,
+        shard_id=gpc.get_local_rank(ParallelMode.DATA),
+        num_shards=gpc.get_world_size(ParallelMode.DATA),
+        training=True,
+        gpu_aug=True,
+        cuda=True,
+    )
+
+
+def build_dali_test(batch_size):
+    return DaliDataloader(
+        sorted(glob.glob(VAL_RECS)),
+        sorted(glob.glob(VAL_IDX)),
+        batch_size=batch_size,
+        shard_id=gpc.get_local_rank(ParallelMode.DATA),
+        num_shards=gpc.get_world_size(ParallelMode.DATA),
+        training=False,
+        gpu_aug=True,
+        cuda=True,
+    )
+
+
+def train_imagenet():
+    args = colossalai.get_default_parser().parse_args()
+    # standard launch
+    # colossalai.launch(config=args.config,
+    #                   rank=args.rank,
+    #                   world_size=args.world_size,
+    #                   local_rank=args.local_rank,
+    #                   host=args.host,
+    #                   port=args.port)
+
+    # launch from torchrun
+    colossalai.launch_from_torch(config=args.config)
+
+    logger = get_dist_logger()
+    if hasattr(gpc.config, 'LOG_PATH'):
+        if gpc.get_global_rank() == 0:
+            log_path = gpc.config.LOG_PATH
+            if not os.path.exists(log_path):
+                os.mkdir(log_path)
+            logger.log_to_file(log_path)
+
+    tp = gpc.config.parallel.tensor.mode
+
+    model = vit_small_patch16_224(tensor_parallel=tp, num_classes=100, init_method='jax')
+
+    train_dataloader = build_dali_train(gpc.config.BATCH_SIZE // gpc.data_parallel_size)
+    test_dataloader = build_dali_test(gpc.config.BATCH_SIZE // gpc.data_parallel_size)
+
+    criterion = CrossEntropyLoss(label_smoothing=0.1, tensor_parallel=tp)
+
+    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
+
+    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
+                                           total_steps=gpc.config.NUM_EPOCHS,
+                                           warmup_steps=gpc.config.WARMUP_EPOCHS)
+
+    engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
+                                                                         optimizer=optimizer,
+                                                                         criterion=criterion,
+                                                                         train_dataloader=train_dataloader,
+                                                                         test_dataloader=test_dataloader)
+
+    logger.info("Engine is built", ranks=[0])
+
+    timer = MultiTimer()
+
+    trainer = Trainer(engine=engine, logger=logger, timer=timer)
+    logger.info("Trainer is built", ranks=[0])
+
+    hooks = [
+        LogMetricByEpochHook(logger=logger),
+        LogMetricByStepHook(),
+        # LogTimingByEpochHook(timer=timer, logger=logger),
+        # LogMemoryByEpochHook(logger=logger),
+        AccuracyHook(accuracy_func=Accuracy(tensor_parallel=tp)),
+        LossHook(),
+        ThroughputHook(),
+        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True)
+    ]
+
+    logger.info("Train start", ranks=[0])
+    trainer.fit(train_dataloader=train_dataloader,
+                test_dataloader=test_dataloader,
+                epochs=gpc.config.NUM_EPOCHS,
+                hooks=hooks,
+                display_progress=True,
+                test_interval=1)
+
+
+if __name__ == '__main__':
+    train_imagenet()
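Note on the normalization in `DaliDataloader` above: with `mean=[127.5]` and `std=[127.5]`, and assuming the operator computes `(x - mean) / std` per element, 8-bit pixel values in [0, 255] are mapped to [-1, 1]. The sketch below only reproduces that arithmetic. `normalize_pixel` is a hypothetical helper and does not call DALI.

```python
def normalize_pixel(value, mean=127.5, std=127.5):
    """Apply (value - mean) / std, the assumed per-element normalization."""
    return (value - mean) / std


if __name__ == "__main__":
    # Darkest, middle and brightest 8-bit values
    for value in (0, 127.5, 255):
        print(value, normalize_pixel(value))
```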
--- a/benchmark/imagenet1k/configs/vit_1d.py
+++ b/benchmark/imagenet1k/configs/vit_1d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 4
+TENSOR_PARALLEL_MODE = '1d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet1k_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet1k/configs/vit_2d.py
+++ b/benchmark/imagenet1k/configs/vit_2d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 4
+TENSOR_PARALLEL_MODE = '2d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet1k_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet1k/configs/vit_2p5d.py
+++ b/benchmark/imagenet1k/configs/vit_2p5d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 4
+DEPTH = 1
+TENSOR_PARALLEL_MODE = '2.5d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE, depth=DEPTH),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet1k_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet1k/configs/vit_3d.py
+++ b/benchmark/imagenet1k/configs/vit_3d.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 8
+TENSOR_PARALLEL_MODE = '3d'
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet1k_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet1k/configs/vit_vanilla.py
+++ b/benchmark/imagenet1k/configs/vit_vanilla.py
+from colossalai.amp import AMP_TYPE
+
+TOTAL_BATCH_SIZE = 4096
+LEARNING_RATE = 3e-3
+WEIGHT_DECAY = 0.3
+
+TENSOR_PARALLEL_SIZE = 1
+TENSOR_PARALLEL_MODE = None
+
+NUM_EPOCHS = 300
+WARMUP_EPOCHS = 32
+
+parallel = dict(
+    pipeline=1,
+    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
+)
+
+fp16 = dict(mode=AMP_TYPE.TORCH, )
+
+gradient_accumulation = 2
+
+BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation
+
+clip_grad_norm = 1.0
+
+LOG_PATH = f"./vit_{TENSOR_PARALLEL_MODE}_imagenet1k_tp{TENSOR_PARALLEL_SIZE}_bs{BATCH_SIZE}_lr{LEARNING_RATE}_{fp16['mode']}_clip_grad{clip_grad_norm}/"
--- a/benchmark/imagenet1k/train.py
+++ b/benchmark/imagenet1k/train.py
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+
+import glob
+import os
+
+import colossalai
+import nvidia.dali.fn as fn
+import nvidia.dali.tfrecord as tfrec
+import torch
+from colossalai.builder import *
+from colossalai.context import ParallelMode
+from colossalai.core import global_context as gpc
+from colossalai.logging import get_dist_logger
+from colossalai.nn import Accuracy, CrossEntropyLoss
+from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
+from colossalai.trainer import Trainer
+from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook, LogMetricByEpochHook, LogMetricByStepHook,
+                                      LogTimingByEpochHook, LossHook, LRSchedulerHook, ThroughputHook)
+from colossalai.utils import MultiTimer
+from model_zoo.vit import vit_small_patch16_224
+from nvidia.dali import types
+from nvidia.dali.pipeline import Pipeline
+from nvidia.dali.plugin.pytorch import DALIClassificationIterator
+
+DATASET_PATH = str(os.environ['DATA'])
+
+TRAIN_RECS = DATASET_PATH + '/train/*'
+VAL_RECS = DATASET_PATH + '/validation/*'
+TRAIN_IDX = DATASET_PATH + '/idx_files/train/*'
+VAL_IDX = DATASET_PATH + '/idx_files/validation/*'
+
+
+class DaliDataloader(DALIClassificationIterator):
+    def __init__(self,
+                 tfrec_filenames,
+                 tfrec_idx_filenames,
+                 shard_id=0,
+                 num_shards=1,
+                 batch_size=128,
+                 num_threads=4,
+                 resize=256,
+                 crop=224,
+                 prefetch=2,
+                 training=True,
+                 gpu_aug=False,
+                 cuda=True):
+        pipe = Pipeline(batch_size=batch_size,
+                        num_threads=num_threads,
+                        device_id=torch.cuda.current_device() if cuda else None,
+                        seed=1024)
+        with pipe:
+            inputs = fn.readers.tfrecord(path=tfrec_filenames,
+                                         index_path=tfrec_idx_filenames,
+                                         random_shuffle=training,
+                                         shard_id=shard_id,
+                                         num_shards=num_shards,
+                                         initial_fill=10000,
+                                         read_ahead=True,
+                                         prefetch_queue_depth=prefetch,
+                                         name='Reader',
+                                         features={
+                                             'image/encoded': tfrec.FixedLenFeature((), tfrec.string, ""),
+                                             'image/class/label': tfrec.FixedLenFeature([1], tfrec.int64, -1),
+                                         })
+            images = inputs["image/encoded"]
+
+            if training:
+                images = fn.decoders.image(images, device='mixed' if gpu_aug else 'cpu', output_type=types.RGB)
+                images = fn.random_resized_crop(images, size=crop, device='gpu' if gpu_aug else 'cpu')
+                flip_lr = fn.random.coin_flip(probability=0.5)
+            else:
+                # decode jpeg and resize
+                images = fn.decoders.image(images, device='mixed' if gpu_aug else 'cpu', output_type=types.RGB)
+                images = fn.resize(images,
+                                   device='gpu' if gpu_aug else 'cpu',
+                                   resize_x=resize,
+                                   resize_y=resize,
+                                   dtype=types.FLOAT,
+                                   interp_type=types.INTERP_TRIANGULAR)
+                flip_lr = False
+
+            # center crop and normalise
+            images = fn.crop_mirror_normalize(images,
+                                              dtype=types.FLOAT,
+                                              crop=(crop, crop),
+                                              mean=[127.5],
+                                              std=[127.5],
+                                              mirror=flip_lr)
+            label = inputs["image/class/label"] - 1  # 0-999
+            # LSG: element_extract will raise exception, let's flatten outside
+            # label = fn.element_extract(label, element_map=0)  # Flatten
+            if cuda:  # transfer data to gpu
+                pipe.set_outputs(images.gpu(), label.gpu())
+            else:
+                pipe.set_outputs(images, label)
+
+        pipe.build()
+        last_batch_policy = 'DROP' if training else 'PARTIAL'
+        super().__init__(pipe, reader_name="Reader", auto_reset=True, last_batch_policy=last_batch_policy)
+
+    def __iter__(self):
+        # reset if the previous epoch was exhausted; a freshly initialized iterator needs no reset
+        if self._counter >= self._size or self._size < 0:
+            self.reset()
+        return self
+
+    def __next__(self):
+        data = super().__next__()
+        img, label = data[0]['data'], data[0]['label']
+        label = label.squeeze()
+        return (img, ), (label, )
+
+
+def build_dali_train(batch_size):
+    return DaliDataloader(
+        sorted(glob.glob(TRAIN_RECS)),
+        sorted(glob.glob(TRAIN_IDX)),
+        batch_size=batch_size,
+        shard_id=gpc.get_local_rank(ParallelMode.DATA),
+        num_shards=gpc.get_world_size(ParallelMode.DATA),
+        training=True,
+        gpu_aug=True,
+        cuda=True,
+    )
+
+
+def build_dali_test(batch_size):
+    return DaliDataloader(
+        sorted(glob.glob(VAL_RECS)),
+        sorted(glob.glob(VAL_IDX)),
+        batch_size=batch_size,
+        shard_id=gpc.get_local_rank(ParallelMode.DATA),
+        num_shards=gpc.get_world_size(ParallelMode.DATA),
+        training=False,
+        gpu_aug=True,
+        cuda=True,
+    )
+
+
+def train_imagenet():
+    args = colossalai.get_default_parser().parse_args()
+    # standard launch
+    # colossalai.launch(config=args.config,
+    #                   rank=args.rank,
+    #                   world_size=args.world_size,
+    #                   local_rank=args.local_rank,
+    #                   host=args.host,
+    #                   port=args.port)
+
+    # launch from torchrun
+    colossalai.launch_from_torch(config=args.config)
+
+    logger = get_dist_logger()
+    if hasattr(gpc.config, 'LOG_PATH'):
+        if gpc.get_global_rank() == 0:
+            log_path = gpc.config.LOG_PATH
+            # create the log directory, including any missing parents, if it does not exist
+            os.makedirs(log_path, exist_ok=True)
+            logger.log_to_file(log_path)
+
+    tp = gpc.config.parallel.tensor.mode
+
+    model = vit_small_patch16_224(tensor_parallel=tp, num_classes=1000, init_method='jax')
+
+    train_dataloader = build_dali_train(gpc.config.BATCH_SIZE // gpc.data_parallel_size)
+    test_dataloader = build_dali_test(gpc.config.BATCH_SIZE // gpc.data_parallel_size)
+
+    criterion = CrossEntropyLoss(label_smoothing=0.1, tensor_parallel=tp)
+
+    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
+
+    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
+                                           total_steps=gpc.config.NUM_EPOCHS,
+                                           warmup_steps=gpc.config.WARMUP_EPOCHS)
+
+    engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
+                                                                         optimizer=optimizer,
+                                                                         criterion=criterion,
+                                                                         train_dataloader=train_dataloader,
+                                                                         test_dataloader=test_dataloader)
+
+    logger.info("Engine is built", ranks=[0])
+
+    timer = MultiTimer()
+
+    trainer = Trainer(engine=engine, logger=logger, timer=timer)
+    logger.info("Trainer is built", ranks=[0])
+
+    hooks = [
+        LogMetricByEpochHook(logger=logger),
+        LogMetricByStepHook(),
+        # LogTimingByEpochHook(timer=timer, logger=logger),
+        # LogMemoryByEpochHook(logger=logger),
+        AccuracyHook(accuracy_func=Accuracy(tensor_parallel=tp)),
+        LossHook(),
+        ThroughputHook(),
+        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True)
+    ]
+
+    logger.info("Train start", ranks=[0])
+    trainer.fit(train_dataloader=train_dataloader,
+                test_dataloader=test_dataloader,
+                epochs=gpc.config.NUM_EPOCHS,
+                hooks=hooks,
+                display_progress=True,
+                test_interval=1)
+
+
+if __name__ == '__main__':
+    train_imagenet()
--- a/colossalai/communication/__init__.py
+++ b/colossalai/communication/__init__.py
-from .collective import all_gather, reduce_scatter, all_reduce
-from .p2p import (send_forward, send_forward_recv_forward, send_backward_recv_forward,
-                  send_backward, send_backward_recv_backward, send_forward_recv_backward,
-                  send_forward_backward_recv_forward_backward, recv_forward, recv_backward)
+from .collective import all_gather, reduce_scatter, all_reduce, broadcast, reduce
+from .p2p import (send_forward, send_forward_recv_forward,
+                  send_backward_recv_forward, send_backward,
+                  send_backward_recv_backward, send_forward_recv_backward,
+                  send_forward_backward_recv_forward_backward, recv_forward,
+                  recv_backward)
 from .ring import ring_forward
 from .utils import send_tensor_meta, recv_tensor_meta

 __all__ = [
-    'all_gather', 'reduce_scatter', 'all_reduce',
-    'send_forward', 'send_forward_recv_forward', 'send_forward_backward_recv_forward_backward',
-    'send_backward', 'send_backward_recv_backward', 'send_backward_recv_forward',
+    'all_gather', 'reduce_scatter', 'all_reduce', 'broadcast', 'reduce',
+    'send_forward', 'send_forward_recv_forward',
+    'send_forward_backward_recv_forward_backward', 'send_backward',
+    'send_backward_recv_backward', 'send_backward_recv_forward',
    'send_forward_recv_backward', 'recv_backward', 'recv_forward',
    'ring_forward', 'send_tensor_meta', 'recv_tensor_meta'
 ]
\ No newline at end of file