diff --git a/CHANGE_LOG.md b/CHANGE_LOG.md deleted file mode 100644 index ea69b301df7f6114b563a99fdf5bef3d51e3bb22..0000000000000000000000000000000000000000 --- a/CHANGE_LOG.md +++ /dev/null @@ -1,36 +0,0 @@ -# Change Log - -All notable changes to this project will be documented in this file. - -## v0.0.2 | 2022-02 - -### Added - -- Unified distributed layers -- MoE support -- DevOps tools such as GitHub Actions, code review automation, etc. -- New official project website - -### Changed - -- Refactored the APIs for usability, flexibility and modularity -- Adapted PyTorch AMP for tensor parallelism -- Refactored utilities for tensor parallelism and pipeline parallelism -- Separated benchmarks and examples into independent repositories -- Updated pipeline parallelism to support both non-interleaved and interleaved schedules -- Refactored installation scripts for convenience - -### Fixed - -- ZeRO level 3 runtime error -- Incorrect calculation in gradient clipping - - -## v0.0.1 beta | 2021-10 - -The first beta version of Colossal-AI. Thanks to all contributors for their effort in implementing the system. - -### Added - -- Initial architecture of the system -- Features such as tensor parallelism, gradient clipping, gradient accumulation \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index cf045226364ea8c4f309ddeee5ce1c5ff3bd5d6c..0000000000000000000000000000000000000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1,141 +0,0 @@ -# Contributing - -Colossal-AI welcomes any constructive contribution from the community, and the team is more than willing to work on problems you have encountered to make it a better project. - -## Environment Setup - -To contribute to Colossal-AI, we would like to first guide you through setting up a proper development environment so that you can implement your code more easily. It is best to install the system from source with the `editable` flag (`-e`, for development mode) so that your changes to the source code are reflected at runtime without repeated uninstallation and reinstallation. Here are the steps to set up the development environment. - -1. Uninstall any existing Colossal-AI distribution. - -```shell -pip uninstall colossalai -``` - -2. Clone the repository to your local workspace. - -```shell -git clone https://github.com/hpcaitech/ColossalAI.git -cd ColossalAI -``` - -3. The *Get Started* section of the [official documentation](https://colossalai.org) provides instructions for building from source. Follow those instructions, **but replace the last `pip install` statement with the command below, which adds the `-e` flag.** - -```shell -pip install -e . -``` - -## Coding Standards - -### Unit Tests -We use [PyTest](https://docs.pytest.org/en/latest/) to execute tests. You can install pytest with `pip install pytest`. As some of the tests require initialization of the distributed backend, GPUs are needed to execute them. - -If you only want to run CPU tests, you can run - -```bash -pytest -m cpu tests/ -``` - -If you have 8 GPUs on your machine, you can run the full test suite - -```bash -pytest tests/ -``` - -If you do not have 8 GPUs on your machine, do not worry. Unit tests will be run automatically when you open a pull request against the main branch. - - -### Code Style - -We run several static checks when you commit your code changes; please make sure all of them pass and that your coding style meets our requirements. We use pre-commit hooks to keep the code aligned with our writing standard.
To set up code style checking, follow the steps below. - -```shell -# these commands are executed under the Colossal-AI directory -pip install pre-commit -pre-commit install -``` - -Code format checking will be executed automatically when you commit your changes. - - -## Contribution Guide - -You need to follow the steps below to contribute to the main repository via pull request. You can learn more about pull requests [here](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests). We follow the [Gitflow workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow) during development. Thus, we work on the `develop` branch instead of the `main` branch most of the time; the `main` branch is mainly reserved for version releases. - -### 1. Fork the Official Repository - -First, visit the [Colossal-AI repository](https://github.com/hpcaitech/ColossalAI) and fork it into your own account. The `fork` button is at the top right corner of the web page, alongside buttons such as `watch` and `star`. - -Now, you can clone your own forked repository into your local environment. - -```shell -git clone https://github.com/<your-username>/ColossalAI.git -``` - -### 2. Configure Git - -You need to set the official repository as your upstream so that you can synchronize with the latest updates in the official repository. You can learn about upstreams [here](https://www.atlassian.com/git/tutorials/git-forks-and-upstreams). - -Then, add the original repository as upstream: - -```shell -cd ColossalAI -git remote add upstream https://github.com/hpcaitech/ColossalAI.git -``` - -You can use the following command to verify that the remote is set. You should see both `origin` and `upstream` in the output. - -```shell -git remote -v -``` - -### 3. Synchronize with the Official Repository - -Before you make changes to the codebase, it is always good to fetch the latest updates from the official repository. To do so, you can use the commands below. - -```shell -git fetch upstream -git checkout develop -git merge upstream/develop -git push origin develop -``` - -Alternatively, you can click the `Fetch upstream` button on the GitHub page of your forked repository's `develop` branch. Then, use these commands to sync. - -```shell -git checkout develop -git pull origin develop -``` - -### 4. Choose/Create an Issue for Your Pull Request - -Generally, your code change should target only one problem. Stacking multiple commits for different problems into one pull request makes code review painful and makes the system prone to new bugs, as the reviewer may not understand the code logic correctly. Thus, you should choose an existing issue or [create your own issue](https://github.com/hpcaitech/ColossalAI/issues) as your pull request target. If you wish to create a new issue, use an appropriate title and description and add related labels. - - -### 5. Create a New Branch - -You should not make changes to the `main` or `develop` branch of your forked repository, as this might make upstream synchronization difficult. Instead, create a new branch with an appropriate name. Branch names should generally start with `hotfix/` or `feature/`: `hotfix` is for bug fixes and `feature` is for the addition of new features. - - -```shell -git checkout -b <new-branch-name> -``` - -### 6. Implementation and Code Commit - -Now you can implement your code change in the source code.
Remember that you installed the system in development mode, so you do not need to uninstall and reinstall it for your changes to take effect; the code change will be reflected in every new Python execution. -You can commit and push the changes to your local repository. The changes should be kept logical, modular and atomic. - -```shell -git add -A -git commit -m "<commit-message>" -git push -u origin <new-branch-name> -``` - -### 7. Open a Pull Request - -You can now create a pull request on the GitHub webpage of your repository. The source branch is `<new-branch-name>` of your repository and the target branch should be `develop` of `hpcaitech/ColossalAI`. After creating this pull request, you should be able to see it [here](https://github.com/hpcaitech/ColossalAI/pulls). - -Write the description of your pull request clearly and [link the pull request to your target issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue). This will automatically close the issue when the pull request is merged. - -In case of code conflicts, you should rebase your branch and resolve the conflicts manually. \ No newline at end of file diff --git a/LICENSE b/LICENSE deleted file mode 100644 index 261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64..0000000000000000000000000000000000000000 --- a/LICENSE +++ /dev/null @@ -1,201 +0,0 @@ - Apache License - Version 2.0, January 2004 - http://www.apache.org/licenses/ - - TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION - - 1. Definitions. - - "License" shall mean the terms and conditions for use, reproduction, - and distribution as defined by Sections 1 through 9 of this document. - - "Licensor" shall mean the copyright owner or entity authorized by - the copyright owner that is granting the License. - - "Legal Entity" shall mean the union of the acting entity and all - other entities that control, are controlled by, or are under common - control with that entity. For the purposes of this definition, - "control" means (i) the power, direct or indirect, to cause the - direction or management of such entity, whether by contract or - otherwise, or (ii) ownership of fifty percent (50%) or more of the - outstanding shares, or (iii) beneficial ownership of such entity. - - "You" (or "Your") shall mean an individual or Legal Entity - exercising permissions granted by this License. - - "Source" form shall mean the preferred form for making modifications, - including but not limited to software source code, documentation - source, and configuration files. - - "Object" form shall mean any form resulting from mechanical - transformation or translation of a Source form, including but - not limited to compiled object code, generated documentation, - and conversions to other media types. - - "Work" shall mean the work of authorship, whether in Source or - Object form, made available under the License, as indicated by a - copyright notice that is included in or attached to the work - (an example is provided in the Appendix below). - - "Derivative Works" shall mean any work, whether in Source or Object - form, that is based on (or derived from) the Work and for which the - editorial revisions, annotations, elaborations, or other modifications - represent, as a whole, an original work of authorship. For the purposes - of this License, Derivative Works shall not include works that remain - separable from, or merely link (or bind by name) to the interfaces of, - the Work and Derivative Works thereof.
- - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." - - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. 
You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. - - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. 
In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. - - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. diff --git a/MANIFEST.in b/MANIFEST.in deleted file mode 100644 index 48a44e0b47c11aaf9bf4538bc734d9abf634c60c..0000000000000000000000000000000000000000 --- a/MANIFEST.in +++ /dev/null @@ -1,3 +0,0 @@ -include *.txt README.md -recursive-include requirements *.txt -recursive-include colossalai *.cpp *.h *.cu *.tr *.cuh *.cc \ No newline at end of file diff --git a/README.md b/README.md deleted file mode 100644 index 93282185f16bef4bce30b2e68aa808f32bb97c66..0000000000000000000000000000000000000000 --- a/README.md +++ /dev/null @@ -1,185 +0,0 @@ -# Colossal-AI - -[![logo](./docs/images/Colossal-AI_logo.png)](https://www.colossalai.org/) - -
-
- Paper | Documentation | Examples | Forum | Blog
-
- - [![Build](https://github.com/hpcaitech/ColossalAI/actions/workflows/PR_CI.yml/badge.svg)](https://github.com/hpcaitech/ColossalAI/actions/workflows/PR_CI.yml) - [![Documentation](https://readthedocs.org/projects/colossalai/badge/?version=latest)](https://colossalai.readthedocs.io/en/latest/?badge=latest) - [![codebeat badge](https://codebeat.co/badges/bfe8f98b-5d61-4256-8ad2-ccd34d9cc156)](https://codebeat.co/projects/github-com-hpcaitech-colossalai-main) -
-An integrated large-scale model training system with efficient parallelization techniques. - -## Installation - -### PyPI - -```bash -pip install colossalai -``` -This command will install the CUDA extension if you have installed CUDA, NVCC and torch. - -If you don't want to install the CUDA extension, you should add `--global-option="--no_cuda_ext"`, like: -```bash -pip install colossalai --global-option="--no_cuda_ext" -``` - -If you want to use `ZeRO`, you can run: -```bash -pip install colossalai[zero] -``` - -### Install From Source - -> The documentation will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problems. :) - -```shell -git clone https://github.com/hpcaitech/ColossalAI.git -cd ColossalAI -# install dependency -pip install -r requirements/requirements.txt - -# install colossalai -pip install . -``` - -If you don't want to install and enable CUDA kernel fusion (required when using fused optimizers): - -```shell -pip install --global-option="--no_cuda_ext" . -``` - -## Use Docker - -Run the following command to build a docker image from the Dockerfile provided. - -```bash -cd ColossalAI -docker build -t colossalai ./docker -``` - -Run the following command to start the docker container in interactive mode. - -```bash -docker run -ti --gpus all --rm --ipc=host colossalai bash -``` - -## Contributing - -If you wish to contribute to this project, you can follow the guidelines in [Contributing](./CONTRIBUTING.md). - - -## Quick View - -### Start Distributed Training in Lines - -```python -import colossalai -from colossalai.utils import get_dataloader - - -# my_config can be a path to a config file or a dictionary object -# 'localhost' is only for single node; you need to specify -# the node name if using multiple nodes -colossalai.launch( - config=my_config, - rank=rank, - world_size=world_size, - backend='nccl', - port=29500, - host='localhost' -) - -# build your model -model = ... - -# build your dataset; the dataloader will have a distributed data -# sampler by default -train_dataset = ... -train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True - ) - - -# build your optimizer -optimizer = ... - -# build your loss function -criterion = ... - -# initialize the engine -engine, train_dataloader, _, _ = colossalai.initialize( - model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader -) - -# start training -engine.train() -for epoch in range(NUM_EPOCHS): - for data, label in train_dataloader: - engine.zero_grad() - output = engine(data) - loss = engine.criterion(output, label) - engine.backward(loss) - engine.step() - -``` - -### Write a Simple 2D Parallel Model - -Let's say we have a huge MLP model whose very large hidden size makes it difficult to fit into a single GPU. We can -then distribute the model weights across GPUs in a 2D mesh while still writing the model in a familiar way. - -```python -from colossalai.nn import Linear2D -import torch.nn as nn - - -class MLP_2D(nn.Module): - - def __init__(self): - super().__init__() - self.linear_1 = Linear2D(in_features=1024, out_features=16384) - self.linear_2 = Linear2D(in_features=16384, out_features=1024) - - def forward(self, x): - x = self.linear_1(x) - x = self.linear_2(x) - return x - -``` - -## Features - -Colossal-AI provides a collection of parallel training components for you. We aim to support you in writing -distributed deep learning models just like you write your single-GPU model.
We provide friendly tools to kickstart -distributed training in a few lines. - -- Data Parallelism -- Pipeline Parallelism -- 1D, 2D, 2.5D, 3D and sequence parallelism -- Friendly trainer and engine -- Extensible for new parallelism -- Mixed Precision Training -- Zero Redundancy Optimizer (ZeRO) - -Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details. - -## Cite Us - -``` -@article{bian2021colossal, - title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training}, - author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang}, - journal={arXiv preprint arXiv:2110.14883}, - year={2021} -} -``` diff --git a/cifar-10-python.tar.gz b/cifar-10-python.tar.gz deleted file mode 100644 index 90c5365492dea3b3c855b2375f1de8588ac1bda4..0000000000000000000000000000000000000000 Binary files a/cifar-10-python.tar.gz and /dev/null differ diff --git a/cifar_datase/cifar-10-python.tar.gz b/cifar_datase/cifar-10-python.tar.gz deleted file mode 100644 index 637effc14b52f111c127d0573408d5fc05524056..0000000000000000000000000000000000000000 Binary files a/cifar_datase/cifar-10-python.tar.gz and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/batches.meta b/cifar_dataset/cifar-10-batches-py/batches.meta deleted file mode 100644 index 4467a6ec2e886a9f14f25e31776fb0152d8ac64a..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/batches.meta and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/data_batch_1 b/cifar_dataset/cifar-10-batches-py/data_batch_1 deleted file mode 100644 index ab404a5ac32492b807a5c6cd02b83dc4dd5ff980..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/data_batch_1 and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/data_batch_2 b/cifar_dataset/cifar-10-batches-py/data_batch_2 deleted file mode 100644 index 6bf1369a6cacadfdbd2f8c61e354cc7d0c17bbae..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/data_batch_2 and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/data_batch_3 b/cifar_dataset/cifar-10-batches-py/data_batch_3 deleted file mode 100644 index 66a0d630a7eb736563b1861ce716bdc489f2113b..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/data_batch_3 and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/data_batch_4 b/cifar_dataset/cifar-10-batches-py/data_batch_4 deleted file mode 100644 index cf8d03d1e80e6d9e440d1764faa85aedd1d6b960..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/data_batch_4 and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/data_batch_5 b/cifar_dataset/cifar-10-batches-py/data_batch_5 deleted file mode 100644 index 468b2aa538c551bc9f590f213b19d96915b85062..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/data_batch_5 and /dev/null differ diff --git a/cifar_dataset/cifar-10-batches-py/readme.html b/cifar_dataset/cifar-10-batches-py/readme.html deleted file mode 100644 index e377adef45c85dc91051edf2dee72c1d4d57732c..0000000000000000000000000000000000000000 --- a/cifar_dataset/cifar-10-batches-py/readme.html +++ /dev/null @@ -1 +0,0 @@ - diff --git a/cifar_dataset/cifar-10-batches-py/test_batch b/cifar_dataset/cifar-10-batches-py/test_batch deleted file mode 100644 index 
3e03f1fc5261d102600fc1c130454f1f5cda567b..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-batches-py/test_batch and /dev/null differ diff --git a/cifar_dataset/cifar-10-python.tar.gz b/cifar_dataset/cifar-10-python.tar.gz deleted file mode 100644 index 90c5365492dea3b3c855b2375f1de8588ac1bda4..0000000000000000000000000000000000000000 Binary files a/cifar_dataset/cifar-10-python.tar.gz and /dev/null differ diff --git a/colossalai.egg-info/PKG-INFO b/colossalai.egg-info/PKG-INFO deleted file mode 100644 index eb5bf15a75ac444808804cdc3fe4dbc6f604e706..0000000000000000000000000000000000000000 --- a/colossalai.egg-info/PKG-INFO +++ /dev/null @@ -1,9 +0,0 @@ -Metadata-Version: 2.1 -Name: colossalai -Version: 0.0.2 -Summary: An integrated large-scale model training system with efficient parallelization techniques -Home-page: UNKNOWN -License: UNKNOWN -Description: UNKNOWN -Platform: UNKNOWN -Provides-Extra: zero diff --git a/colossalai.egg-info/SOURCES.txt b/colossalai.egg-info/SOURCES.txt deleted file mode 100644 index 3005844a0fe5e59efeb4eb902baf4e2a42f59b99..0000000000000000000000000000000000000000 --- a/colossalai.egg-info/SOURCES.txt +++ /dev/null @@ -1,272 +0,0 @@ -MANIFEST.in -README.md -setup.py -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/layer_norm_hip_kernel.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_adam.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax_hip.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax_hip.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/cublas_wrappers.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/dropout_kernels.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/general_kernels.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/hip_util.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/normalize_kernels.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/softmax_kernels.hip -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/transform_kernels.hip -colossalai/__init__.py -colossalai/constants.py -colossalai/core.py -colossalai/global_variables.py -colossalai/initialize.py -colossalai.egg-info/PKG-INFO -colossalai.egg-info/SOURCES.txt 
-colossalai.egg-info/dependency_links.txt -colossalai.egg-info/requires.txt -colossalai.egg-info/top_level.txt -colossalai/amp/__init__.py -colossalai/amp/amp_type.py -colossalai/amp/apex_amp/__init__.py -colossalai/amp/apex_amp/apex_amp.py -colossalai/amp/naive_amp/__init__.py -colossalai/amp/naive_amp/_fp16_optimizer.py -colossalai/amp/naive_amp/naive_amp.py -colossalai/amp/torch_amp/__init__.py -colossalai/amp/torch_amp/_grad_scaler.py -colossalai/amp/torch_amp/torch_amp.py -colossalai/builder/__init__.py -colossalai/builder/builder.py -colossalai/builder/pipeline.py -colossalai/communication/__init__.py -colossalai/communication/collective.py -colossalai/communication/p2p.py -colossalai/communication/ring.py -colossalai/communication/utils.py -colossalai/context/__init__.py -colossalai/context/config.py -colossalai/context/parallel_context.py -colossalai/context/parallel_mode.py -colossalai/context/process_group_initializer/__init__.py -colossalai/context/process_group_initializer/initializer_1d.py -colossalai/context/process_group_initializer/initializer_2d.py -colossalai/context/process_group_initializer/initializer_2p5d.py -colossalai/context/process_group_initializer/initializer_3d.py -colossalai/context/process_group_initializer/initializer_data.py -colossalai/context/process_group_initializer/initializer_model.py -colossalai/context/process_group_initializer/initializer_moe.py -colossalai/context/process_group_initializer/initializer_pipeline.py -colossalai/context/process_group_initializer/initializer_sequence.py -colossalai/context/process_group_initializer/initializer_tensor.py -colossalai/context/process_group_initializer/process_group_initializer.py -colossalai/context/random/__init__.py -colossalai/context/random/_helper.py -colossalai/context/random/seed_manager.py -colossalai/engine/__init__.py -colossalai/engine/_base_engine.py -colossalai/engine/gradient_handler/__init__.py -colossalai/engine/gradient_handler/_base_gradient_handler.py -colossalai/engine/gradient_handler/_data_parallel_gradient_handler.py -colossalai/engine/gradient_handler/_moe_gradient_handler.py -colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py -colossalai/engine/gradient_handler/_sequence_parallel_gradient_handler.py -colossalai/engine/gradient_handler/_zero_gradient_handler.py -colossalai/engine/ophooks/__init__.py -colossalai/engine/ophooks/_base_ophook.py -colossalai/engine/ophooks/_memtracer_ophook.py -colossalai/engine/schedule/__init__.py -colossalai/engine/schedule/_base_schedule.py -colossalai/engine/schedule/_non_pipeline_schedule.py -colossalai/engine/schedule/_pipeline_schedule.py -colossalai/kernel/__init__.py -colossalai/kernel/cuda_native/__init__.py -colossalai/kernel/cuda_native/layer_norm.py -colossalai/kernel/cuda_native/multihead_attention.py -colossalai/kernel/cuda_native/scaled_softmax.py -colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp -colossalai/kernel/cuda_native/csrc/compat.h -colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp -colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh -colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp 
-colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h -colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.cpp -colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h -colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu -colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp -colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h -colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu -colossalai/kernel/cuda_native/csrc/type_shim.h -colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu -colossalai/kernel/cuda_native/csrc/kernels/cublas_wrappers.cu -colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu -colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu -colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu -colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu -colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu -colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu -colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h -colossalai/kernel/cuda_native/csrc/kernels/include/context.h -colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h -colossalai/kernel/cuda_native/csrc/kernels/include/cublas_wrappers.h -colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h -colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h -colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h -colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h -colossalai/kernel/cuda_native/csrc/kernels/include/ls_cub.cuh -colossalai/kernel/cuda_native/csrc/kernels/include/normalize_layer.h -colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h -colossalai/kernel/cuda_native/csrc/kernels/include/strided_batch_gemm.h -colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp -colossalai/kernel/hip_native/csrc/compat.h -colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp -colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh -colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp -colossalai/kernel/hip_native/csrc/multihead_attention_1d.h -colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp -colossalai/kernel/hip_native/csrc/scaled_masked_softmax.h -colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp -colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h -colossalai/kernel/hip_native/csrc/type_shim.h -colossalai/kernel/hip_native/csrc/kernels/include/block_reduce.h -colossalai/kernel/hip_native/csrc/kernels/include/context.h -colossalai/kernel/hip_native/csrc/kernels/include/cross_entropy_layer.h -colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h -colossalai/kernel/hip_native/csrc/kernels/include/dropout.h -colossalai/kernel/hip_native/csrc/kernels/include/feed_forward.h -colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h -colossalai/kernel/hip_native/csrc/kernels/include/kernels.h -colossalai/kernel/hip_native/csrc/kernels/include/ls_cub.cuh -colossalai/kernel/hip_native/csrc/kernels/include/normalize_layer.h -colossalai/kernel/hip_native/csrc/kernels/include/softmax.h -colossalai/kernel/hip_native/csrc/kernels/include/strided_batch_gemm.h -colossalai/kernel/jit/__init__.py -colossalai/kernel/jit/bias_dropout_add.py -colossalai/kernel/jit/bias_gelu.py -colossalai/kernel/jit/option.py -colossalai/logging/__init__.py -colossalai/logging/logging.py -colossalai/nn/__init__.py -colossalai/nn/init.py -colossalai/nn/layer/__init__.py 
-colossalai/nn/layer/base_layer.py -colossalai/nn/layer/colossalai_layer/__init__.py -colossalai/nn/layer/colossalai_layer/_utils.py -colossalai/nn/layer/colossalai_layer/dropout.py -colossalai/nn/layer/colossalai_layer/embedding.py -colossalai/nn/layer/colossalai_layer/linear.py -colossalai/nn/layer/colossalai_layer/normalization.py -colossalai/nn/layer/moe/__init__.py -colossalai/nn/layer/moe/_operation.py -colossalai/nn/layer/moe/layers.py -colossalai/nn/layer/parallel_1d/__init__.py -colossalai/nn/layer/parallel_1d/_operation.py -colossalai/nn/layer/parallel_1d/_utils.py -colossalai/nn/layer/parallel_1d/layers.py -colossalai/nn/layer/parallel_2d/__init__.py -colossalai/nn/layer/parallel_2d/_operation.py -colossalai/nn/layer/parallel_2d/_utils.py -colossalai/nn/layer/parallel_2d/layers.py -colossalai/nn/layer/parallel_2p5d/__init__.py -colossalai/nn/layer/parallel_2p5d/_operation.py -colossalai/nn/layer/parallel_2p5d/_utils.py -colossalai/nn/layer/parallel_2p5d/layers.py -colossalai/nn/layer/parallel_3d/__init__.py -colossalai/nn/layer/parallel_3d/_operation.py -colossalai/nn/layer/parallel_3d/_utils.py -colossalai/nn/layer/parallel_3d/layers.py -colossalai/nn/layer/parallel_sequence/__init__.py -colossalai/nn/layer/parallel_sequence/_operation.py -colossalai/nn/layer/parallel_sequence/_utils.py -colossalai/nn/layer/parallel_sequence/layers.py -colossalai/nn/layer/utils/__init__.py -colossalai/nn/layer/utils/common.py -colossalai/nn/layer/vanilla/__init__.py -colossalai/nn/layer/vanilla/layers.py -colossalai/nn/layer/wrapper/__init__.py -colossalai/nn/layer/wrapper/lambda_wrapper.py -colossalai/nn/layer/wrapper/pipeline_wrapper.py -colossalai/nn/loss/__init__.py -colossalai/nn/loss/loss_1d.py -colossalai/nn/loss/loss_2d.py -colossalai/nn/loss/loss_2p5d.py -colossalai/nn/loss/loss_3d.py -colossalai/nn/loss/loss_moe.py -colossalai/nn/lr_scheduler/__init__.py -colossalai/nn/lr_scheduler/cosine.py -colossalai/nn/lr_scheduler/delayed.py -colossalai/nn/lr_scheduler/linear.py -colossalai/nn/lr_scheduler/multistep.py -colossalai/nn/lr_scheduler/onecycle.py -colossalai/nn/lr_scheduler/poly.py -colossalai/nn/lr_scheduler/torch.py -colossalai/nn/metric/__init__.py -colossalai/nn/metric/_utils.py -colossalai/nn/metric/accuracy_2d.py -colossalai/nn/metric/accuracy_2p5d.py -colossalai/nn/metric/accuracy_3d.py -colossalai/nn/model/__init__.py -colossalai/nn/model/model_from_config.py -colossalai/nn/optimizer/__init__.py -colossalai/nn/optimizer/colossalai_optimizer.py -colossalai/nn/optimizer/fused_adam.py -colossalai/nn/optimizer/fused_lamb.py -colossalai/nn/optimizer/fused_sgd.py -colossalai/nn/optimizer/lamb.py -colossalai/nn/optimizer/lars.py -colossalai/registry/__init__.py -colossalai/registry/registry.py -colossalai/trainer/__init__.py -colossalai/trainer/_trainer.py -colossalai/trainer/hooks/__init__.py -colossalai/trainer/hooks/_base_hook.py -colossalai/trainer/hooks/_checkpoint_hook.py -colossalai/trainer/hooks/_log_hook.py -colossalai/trainer/hooks/_lr_scheduler_hook.py -colossalai/trainer/hooks/_metric_hook.py -colossalai/utils/__init__.py -colossalai/utils/activation_checkpoint.py -colossalai/utils/checkpointing.py -colossalai/utils/common.py -colossalai/utils/cuda.py -colossalai/utils/memory.py -colossalai/utils/timer.py -colossalai/utils/data_sampler/__init__.py -colossalai/utils/data_sampler/base_sampler.py -colossalai/utils/data_sampler/data_parallel_sampler.py -colossalai/utils/gradient_accumulation/__init__.py -colossalai/utils/gradient_accumulation/_gradient_accumulation.py 
-colossalai/utils/multi_tensor_apply/__init__.py -colossalai/utils/multi_tensor_apply/multi_tensor_apply.py -colossalai/zero/__init__.py -colossalai/zero/loss_scaler.py -colossalai/zero/zero_redundancy_optimizer_level_2.py -colossalai/zero/zero_redundancy_optimizer_level_3.py -model_zoo/__init__.py -model_zoo/helper.py -model_zoo/bert/__init__.py -model_zoo/gpt/__init__.py -model_zoo/gpt/gpt.py -model_zoo/mlp_mixer/__init__.py -model_zoo/mlp_mixer/parallel_3d/__init__.py -model_zoo/mlp_mixer/parallel_3d/mlp_mixer.py -model_zoo/moe/__init__.py -model_zoo/moe/models.py -model_zoo/moe/util.py -model_zoo/vit/__init__.py -model_zoo/vit/vision_transformer_from_config.py -model_zoo/vit/vit.py -requirements/requirements-test.txt -requirements/requirements-zero.txt -requirements/requirements.txt \ No newline at end of file diff --git a/colossalai.egg-info/dependency_links.txt b/colossalai.egg-info/dependency_links.txt deleted file mode 100644 index 8b137891791fe96927ad78e64b0aad7bded08bdc..0000000000000000000000000000000000000000 --- a/colossalai.egg-info/dependency_links.txt +++ /dev/null @@ -1 +0,0 @@ - diff --git a/colossalai.egg-info/requires.txt b/colossalai.egg-info/requires.txt deleted file mode 100644 index 78c9596d4f0cc5817c6577fba189bcfe997f15a4..0000000000000000000000000000000000000000 --- a/colossalai.egg-info/requires.txt +++ /dev/null @@ -1,11 +0,0 @@ -torch>=1.8 -torchvision>=0.9 -numpy -tqdm -psutil -tensorboard -packaging -pre-commit - -[zero] -deepspeed diff --git a/colossalai.egg-info/top_level.txt b/colossalai.egg-info/top_level.txt deleted file mode 100644 index 728433db0446e4f27ef1bd9e2b0a78633d419367..0000000000000000000000000000000000000000 --- a/colossalai.egg-info/top_level.txt +++ /dev/null @@ -1,7 +0,0 @@ -colossal_C -colossal_layer_norm_cuda -colossal_multihead_attention -colossal_scaled_masked_softmax -colossal_scaled_upper_triang_masked_softmax -colossalai -model_zoo diff --git a/colossalai/__init__.py b/colossalai/__init__.py deleted file mode 100644 index e7ea7d65a4319162c7e1366d3e710edc946c3ebe..0000000000000000000000000000000000000000 --- a/colossalai/__init__.py +++ /dev/null @@ -1,4 +0,0 @@ -from .initialize import (initialize, launch, launch_from_openmpi, - launch_from_slurm, launch_from_torch, get_default_parser) - -__version__ = '0.0.1' diff --git a/colossalai/__pycache__/__init__.cpython-36.pyc b/colossalai/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index f43baadd777552836d10d8a06020470c6c68ee87..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/__pycache__/__init__.cpython-37.pyc b/colossalai/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 1817453af62f65951517f539547d1b1b3925d0c1..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/__pycache__/constants.cpython-36.pyc b/colossalai/__pycache__/constants.cpython-36.pyc deleted file mode 100644 index 3520f319b3f5366cfbb91e34ea0450fb414699cd..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/constants.cpython-36.pyc and /dev/null differ diff --git a/colossalai/__pycache__/constants.cpython-37.pyc b/colossalai/__pycache__/constants.cpython-37.pyc deleted file mode 100644 index fa3a3ec3e75c147f25fb03c5eb72ddb7a218d455..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/constants.cpython-37.pyc and /dev/null differ diff --git 
a/colossalai/__pycache__/core.cpython-36.pyc b/colossalai/__pycache__/core.cpython-36.pyc deleted file mode 100644 index a89e9dda842e833d16485b35a603502f2b37836e..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/core.cpython-36.pyc and /dev/null differ diff --git a/colossalai/__pycache__/core.cpython-37.pyc b/colossalai/__pycache__/core.cpython-37.pyc deleted file mode 100644 index de16fec841bfbb27d8934a39558b4eb6617940c5..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/core.cpython-37.pyc and /dev/null differ diff --git a/colossalai/__pycache__/global_variables.cpython-36.pyc b/colossalai/__pycache__/global_variables.cpython-36.pyc deleted file mode 100644 index c72607ea712ab3f089587054c03b7ebb28a525b1..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/global_variables.cpython-36.pyc and /dev/null differ diff --git a/colossalai/__pycache__/global_variables.cpython-37.pyc b/colossalai/__pycache__/global_variables.cpython-37.pyc deleted file mode 100644 index dc594ceba0db2677c7a6241698ae74b131eecfe7..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/global_variables.cpython-37.pyc and /dev/null differ diff --git a/colossalai/__pycache__/initialize.cpython-36.pyc b/colossalai/__pycache__/initialize.cpython-36.pyc deleted file mode 100644 index 786f7f1975cbedb84d728a0bd44c4d5f2a6396bb..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/initialize.cpython-36.pyc and /dev/null differ diff --git a/colossalai/__pycache__/initialize.cpython-37.pyc b/colossalai/__pycache__/initialize.cpython-37.pyc deleted file mode 100644 index b52abe35a4efbc82dfe0d0e78545050ca1269f60..0000000000000000000000000000000000000000 Binary files a/colossalai/__pycache__/initialize.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/__init__.py b/colossalai/amp/__init__.py deleted file mode 100644 index 5a30e67fbd882d896864b07ae376b9dc63d043b3..0000000000000000000000000000000000000000 --- a/colossalai/amp/__init__.py +++ /dev/null @@ -1,48 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from .amp_type import AMP_TYPE -from colossalai.context import Config -import torch.nn as nn -from torch.optim import Optimizer -from torch.nn.modules.loss import _Loss -from .torch_amp import convert_to_torch_amp -from .apex_amp import convert_to_apex_amp -from .naive_amp import convert_to_naive_amp - - -def convert_to_amp(model: nn.Module, - optimizer: Optimizer, - criterion: _Loss, - mode: AMP_TYPE, - amp_config: Config = None): - """A helper function to wrap training components with the AMP modules of the chosen mode - - :param model: your model object - :type model: :class:`torch.nn.Module` - :param optimizer: your optimizer object - :type optimizer: :class:`torch.optim.Optimizer` - :param criterion: your loss function object - :type criterion: :class:`torch.nn.modules.loss._Loss` - :param mode: amp mode - :type mode: :class:`colossalai.amp.AMP_TYPE` - :param amp_config: configuration for different amp modes - :type amp_config: :class:`colossalai.context.Config` or dict - - :return: (model, optimizer, criterion) - :rtype: Tuple - """ - assert isinstance(mode, AMP_TYPE), \ - f'expected the argument mode to be AMP_TYPE, but got {type(mode)}' - - if amp_config is None: - amp_config = Config() - - if mode == AMP_TYPE.TORCH: - model, optimizer, criterion = convert_to_torch_amp(model, optimizer, criterion, amp_config) - elif mode == AMP_TYPE.APEX: - model, optimizer = 
convert_to_apex_amp(model, optimizer, amp_config) - elif mode == AMP_TYPE.NAIVE: - model, optimizer = convert_to_naive_amp(model, optimizer, amp_config) - - return model, optimizer, criterion diff --git a/colossalai/amp/__pycache__/__init__.cpython-36.pyc b/colossalai/amp/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 14ee23f5accbb80ac38c0a6c469875426986f6df..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/__pycache__/__init__.cpython-37.pyc b/colossalai/amp/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 69ed3b413c9bbcdb60b726e945c71b004af74478..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/__pycache__/amp_type.cpython-36.pyc b/colossalai/amp/__pycache__/amp_type.cpython-36.pyc deleted file mode 100644 index 0ac51649dbd64801f3af275eca35aaef4b4eb25e..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/__pycache__/amp_type.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/__pycache__/amp_type.cpython-37.pyc b/colossalai/amp/__pycache__/amp_type.cpython-37.pyc deleted file mode 100644 index f98bf6428b9a0d53312b0a9c82f8b3c07cc0bfae..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/__pycache__/amp_type.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/amp_type.py b/colossalai/amp/amp_type.py deleted file mode 100644 index 6f322f866cfc813e66e54b0c1006d62ef949e96e..0000000000000000000000000000000000000000 --- a/colossalai/amp/amp_type.py +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from enum import Enum - - -class AMP_TYPE(Enum): - APEX = 'apex' - TORCH = 'torch' - NAIVE = 'naive' diff --git a/colossalai/amp/apex_amp/__init__.py b/colossalai/amp/apex_amp/__init__.py deleted file mode 100644 index 23585ede7d1a4dee33212a5d0ff6a7f5b0e91615..0000000000000000000000000000000000000000 --- a/colossalai/amp/apex_amp/__init__.py +++ /dev/null @@ -1,27 +0,0 @@ -from .apex_amp import ApexAMPOptimizer -import torch.nn as nn -from torch.optim import Optimizer - - -def convert_to_apex_amp(model: nn.Module, - optimizer: Optimizer, - amp_config): - """A helper function to wrap training components with Apex AMP modules - - :param model: your model object - :type model: :class:`torch.nn.Module` - :param optimizer: your optimizer object - :type optimizer: :class:`torch.optim.Optimizer` - :param amp_config: configuration for nvidia apex - :type amp_config: :class:`colossalai.context.Config` or dict - - :return: (model, optimizer) - :rtype: Tuple - """ - import apex.amp as apex_amp - model, optimizer = apex_amp.initialize(model, optimizer, **amp_config) - optimizer = ApexAMPOptimizer(optimizer) - return model, optimizer - - -__all__ = ['convert_to_apex_amp', 'ApexAMPOptimizer'] diff --git a/colossalai/amp/apex_amp/__pycache__/__init__.cpython-36.pyc b/colossalai/amp/apex_amp/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index cf6acf71a6fb92198703107a74c72d10064452a7..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/apex_amp/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/apex_amp/__pycache__/__init__.cpython-37.pyc b/colossalai/amp/apex_amp/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index f0ddfbc8928da51f53f3b71589781dadef89c6cf..0000000000000000000000000000000000000000 
Binary files a/colossalai/amp/apex_amp/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/apex_amp/__pycache__/apex_amp.cpython-36.pyc b/colossalai/amp/apex_amp/__pycache__/apex_amp.cpython-36.pyc deleted file mode 100644 index e9f04c1a31ac53c828eb428b3ed5abbf5fdfe86e..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/apex_amp/__pycache__/apex_amp.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/apex_amp/__pycache__/apex_amp.cpython-37.pyc b/colossalai/amp/apex_amp/__pycache__/apex_amp.cpython-37.pyc deleted file mode 100644 index cedb2754d85e948733927f9c39db9b065f31b7bf..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/apex_amp/__pycache__/apex_amp.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/apex_amp/apex_amp.py b/colossalai/amp/apex_amp/apex_amp.py deleted file mode 100644 index 6d7196b334664989d6713a40a0c978f32c337922..0000000000000000000000000000000000000000 --- a/colossalai/amp/apex_amp/apex_amp.py +++ /dev/null @@ -1,39 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn -try: - import apex.amp as apex_amp -except ImportError: - pass - -from torch import Tensor - -from colossalai.nn.optimizer import ColossalaiOptimizer -from colossalai.utils import clip_grad_norm_fp32 - - -class ApexAMPOptimizer(ColossalaiOptimizer): - """A wrapper class for the APEX optimizer that implements apex-specific backward and clip_grad_norm - methods - """ - - def backward(self, loss: Tensor): - """Backward pass to get all gradients - - :param loss: Loss computed by a loss function - :type loss: torch.Tensor - """ - with apex_amp.scale_loss(loss, self.optim) as scaled_loss: - scaled_loss.backward() - - def clip_grad_norm(self, model: nn.Module, max_norm: float): - """Clip gradients' norm - - :param model: Your model object - :type model: torch.nn.Module - :param max_norm: The max norm value for gradient clipping - :type max_norm: float - """ - if max_norm > 0: - clip_grad_norm_fp32(apex_amp.master_params(self.optim), max_norm) diff --git a/colossalai/amp/naive_amp/__init__.py b/colossalai/amp/naive_amp/__init__.py deleted file mode 100644 index 32ea3469af0f4167073da2211338d4dd76720cc8..0000000000000000000000000000000000000000 --- a/colossalai/amp/naive_amp/__init__.py +++ /dev/null @@ -1,38 +0,0 @@ -import torch.nn as nn -from torch.optim import Optimizer -from colossalai.utils import is_no_pp_or_last_stage - -from .naive_amp import NaiveAMPOptimizer, NaiveAMPModel - - -def convert_to_naive_amp(model: nn.Module, - optimizer: Optimizer, - amp_config): - """A helper function to wrap training components with naive AMP modules - - :param model: your model object - :type model: :class:`torch.nn.Module` - :param optimizer: your optimizer object - :type optimizer: :class:`torch.optim.Optimizer` - :param amp_config: configuration for naive mode amp - :type amp_config: :class:`colossalai.context.Config` or dict - - :return: (model, optimizer) - :rtype: Tuple - """ - if isinstance(model, nn.ModuleList): - # interleaved pipeline - module_list = [] - for chunk, m in enumerate(model): - output_to_fp32 = is_no_pp_or_last_stage() and chunk == len(model) - 1 - module_list.append(NaiveAMPModel(m, output_to_fp32=output_to_fp32)) - model = nn.ModuleList(module_list) - else: - output_to_fp32 = is_no_pp_or_last_stage() - model = NaiveAMPModel(model, output_to_fp32=output_to_fp32) - - optimizer = NaiveAMPOptimizer(optimizer, **amp_config) - return model, optimizer - - -__all__ = 
['convert_to_naive_amp', 'NaiveAMPOptimizer'] diff --git a/colossalai/amp/naive_amp/__pycache__/__init__.cpython-36.pyc b/colossalai/amp/naive_amp/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 16875c64267b045ba1c2016b60f4019ea108996d..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/naive_amp/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/naive_amp/__pycache__/__init__.cpython-37.pyc b/colossalai/amp/naive_amp/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index bffe520d88cbcf5d72f5de094696ef2aa26e4969..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/naive_amp/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/naive_amp/__pycache__/_fp16_optimizer.cpython-36.pyc b/colossalai/amp/naive_amp/__pycache__/_fp16_optimizer.cpython-36.pyc deleted file mode 100644 index aa48023df0fcc242a7ff69a476f2c45fe5029350..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/naive_amp/__pycache__/_fp16_optimizer.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/naive_amp/__pycache__/_fp16_optimizer.cpython-37.pyc b/colossalai/amp/naive_amp/__pycache__/_fp16_optimizer.cpython-37.pyc deleted file mode 100644 index 7f5ac2caa1668377b1adb76f523a1a276d714201..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/naive_amp/__pycache__/_fp16_optimizer.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/naive_amp/__pycache__/naive_amp.cpython-36.pyc b/colossalai/amp/naive_amp/__pycache__/naive_amp.cpython-36.pyc deleted file mode 100644 index c017e1d867e9b9197350339c967165a782c46ef9..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/naive_amp/__pycache__/naive_amp.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/naive_amp/__pycache__/naive_amp.cpython-37.pyc b/colossalai/amp/naive_amp/__pycache__/naive_amp.cpython-37.pyc deleted file mode 100644 index 6eae86307191838f90699c446553cbc9a59de2e2..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/naive_amp/__pycache__/naive_amp.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/naive_amp/_fp16_optimizer.py b/colossalai/amp/naive_amp/_fp16_optimizer.py deleted file mode 100644 index b1fc621c211c79f176ef0e03ff42823f2bb274ca..0000000000000000000000000000000000000000 --- a/colossalai/amp/naive_amp/_fp16_optimizer.py +++ /dev/null @@ -1,508 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - -try: - import colossal_C -except ImportError: - print('Colossal-AI should be built with the CUDA extension to use the FP16 optimizer') - -from torch.optim import Optimizer - -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import (print_rank_0, copy_tensor_parallel_attributes, - clip_grad_norm_fp32, count_zeros_fp32, multi_tensor_applier, is_using_pp) - - -def _zero_grad_group_helper(group, set_to_none):
 - """Zero out the gradient for a group of parameters. - Note: copied from torch.optim.optimizer.""" - for param in group: - if param.grad is not None: - if set_to_none: - param.grad = None - else: - if param.grad.grad_fn is not None: - param.grad.detach_() - else: - param.grad.requires_grad_(False) - param.grad.zero_() - - -def _multi_tensor_copy_this_to_that(this, that, overflow_buf=None): - """Use multi-tensor-applier to copy values from one list to another.
- We don't have a bfloat16 implementation so for now if the overflow_buf - is not provided, we default back to simple loop copy to be compatible - with bfloat16.""" - if overflow_buf: - overflow_buf.fill_(0) - # Scaling with factor `1.0` is equivalent to copy. - multi_tensor_applier(colossal_C.multi_tensor_scale, - overflow_buf, - [this, that], - 1.0) - else: - for this_, that_ in zip(this, that): - that_.copy_(this_) - - -class DynamicGradScaler: - - def __init__(self, - initial_scale, - min_scale, - growth_factor, - backoff_factor, - growth_interval, - hysteresis, - max_scale: int = None, - verbose: bool = False): - """Grad scaler with dynamic scale that gets adjusted - during training.""" - assert initial_scale > 0.0 - self._scale = torch.cuda.FloatTensor([initial_scale]) - - # Lower bound on the scale. - assert min_scale > 0.0 - assert min_scale <= initial_scale - self.min_scale = torch.cuda.FloatTensor([min_scale]) - # Growth and backoff factors for the scale. - assert growth_factor > 1.0 - self.growth_factor = torch.cuda.FloatTensor([growth_factor]) - assert backoff_factor < 1.0 - assert backoff_factor > 0.0 - self.backoff_factor = torch.cuda.FloatTensor([backoff_factor]) - # Interval over which if we don't see any inf/nan, - # we will scale the grad scale by the growth factor. - assert growth_interval > 0 - self.growth_interval = growth_interval - # Number of inf/nans we should see before scaling down - # the grad scale by the backoff factor. - assert hysteresis > 0 - self.hysteresis = hysteresis - if max_scale is not None: - assert max_scale > 1 and initial_scale <= max_scale - self._max_scale = max_scale - - # Trackers. - self._growth_tracker = 0 - self._hysteresis_tracker = self.hysteresis - - self._logger = get_dist_logger() - self.verbose = verbose - - @property - def scale(self): - return self._scale - - @property - def inv_scale(self): - return self._scale.double().reciprocal().float() - - def update(self, found_inf): - - # If we have an inf/nan, growth tracker is set to 0 - # and hysteresis tracker is reduced by 1. - if found_inf: - self._growth_tracker = 0 - self._hysteresis_tracker -= 1 - # Now if we are out of hysteresis count, scale down the loss. - if self._hysteresis_tracker <= 0: - self._scale = torch.max(self._scale * self.backoff_factor, - self.min_scale) - if self.verbose: - self._logger.info(f'overflow occurs, loss scale is adjusted to {self._scale}', ranks=[0]) - else: - # If there is no nan/inf, increment the growth tracker. - self._growth_tracker += 1 - # If we have had enough consecutive intervals with no nan/inf: - if self._growth_tracker == self.growth_interval: - # Reset the tracker and hysteresis trackers, - self._growth_tracker = 0 - self._hysteresis_tracker = self.hysteresis - # and scale up the loss scale. 
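To make the grow/backoff rules above concrete, here is a dependency-free simulation of the same bookkeeping (an editorial sketch in plain Python; the real class stores the scale in CUDA tensors and additionally applies the `max_scale` cap handled just below):

```python
def simulate(scale, overflow_steps, growth_factor=2.0, backoff_factor=0.5,
             growth_interval=1000, hysteresis=2, min_scale=1.0):
    """overflow_steps is an iterable of booleans: True where inf/nan was found."""
    growth_tracker, hysteresis_tracker = 0, hysteresis
    for found_inf in overflow_steps:
        if found_inf:
            growth_tracker = 0
            hysteresis_tracker -= 1
            if hysteresis_tracker <= 0:             # out of hysteresis: back off
                scale = max(scale * backoff_factor, min_scale)
        else:
            growth_tracker += 1
            if growth_tracker == growth_interval:   # long clean run: grow
                growth_tracker = 0
                hysteresis_tracker = hysteresis
                scale *= growth_factor
    return scale

assert simulate(2.0 ** 16, [False] * 1000) == 2.0 ** 17  # 1000 clean steps double it
assert simulate(2.0 ** 16, [True, True]) == 2.0 ** 15    # two overflows halve it
```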
- if self._max_scale is not None and self._scale >= self._max_scale: - if self.verbose: - self._logger.info( - f'Current loss scale {self._scale} has reached the max scale {self._max_scale} allowed', ranks=[0]) - else: - self._scale = self._scale * self.growth_factor - if self.verbose: - self._logger.info( - f'no consecutive overflow, loss scale is adjusted to {self._scale}', ranks=[0]) - - def state_dict(self): - state_dict = {} - state_dict['max_scale'] = self._max_scale - state_dict['scale'] = self._scale - state_dict['growth_tracker'] = self._growth_tracker - state_dict['hysteresis_tracker'] = self._hysteresis_tracker - return state_dict - - def load_state_dict(self, state_dict): - self._scale = state_dict['scale'].cuda(torch.cuda.current_device()) - self._growth_tracker = state_dict['growth_tracker'] - self._hysteresis_tracker = state_dict['hysteresis_tracker'] - self._max_scale = state_dict['max_scale'] - - -class FP16Optimizer(Optimizer): - """Float16 optimizer for fp16 and bf16 data types. - - :param optimizer: base optimizer such as Adam or SGD - :type optimizer: torch.optim.Optimizer - :param clip_grad: clip gradients with this global L2 norm. Note that clipping is ignored if clip_grad == 0 - :type clip_grad: float - :param log_num_zeros_in_grad: return number of zeros in the gradients. - :type log_num_zeros_in_grad: bool - :param initial_scale: initial scale of gradient scaler - :type initial_scale: int - :param growth_factor: the growth rate of loss scale - :type growth_factor: int - :param backoff_factor: the decrease rate of loss scale - :type backoff_factor: float - :param hysteresis: delay shift in dynamic loss scaling - :type hysteresis: int - :param max_scale: maximum loss scale allowed - :type max_scale: int - :param verbose: if set to `True`, will print debug info - :type verbose: bool - """ - - def __init__(self, - optimizer, - clip_grad=0, - log_num_zeros_in_grad=False, - initial_scale=2 ** 32, - min_scale=1, - growth_factor=2, - backoff_factor=0.5, - growth_interval=1000, - hysteresis=2, - max_scale: int = 2 ** 32, - verbose: bool = False): - # default args for compatibility - bf16 = False - params_have_main_grad = False - - # keep a defaults attribute for compatibility with pytorch optim - self.defaults = optimizer.defaults - - # log config - self._logger = get_dist_logger() - if verbose: - self._logger.info(f"\n========= FP16 Optimizer Config =========\n" - f"Optimizer: {optimizer.__class__.__name__}\n" - f"clip_grad = {clip_grad}\n" - f"log_num_zeros_in_grad = {log_num_zeros_in_grad}\n" - f"initial_scale = {initial_scale}\n" - f"min_scale = {min_scale}\n" - f"growth_factor = {growth_factor}\n" - f"backoff_factor = {backoff_factor}\n" - f"growth_interval = {growth_interval}\n" - f"hysteresis = {hysteresis}\n" - f"==========================================", ranks=[0]) - - """Input optimizer is the base optimizer for example Adam.""" - self.optimizer = optimizer - assert self.optimizer, 'no optimizer is provided.' - # Set gradient clipping and logging params. - self.clip_grad = clip_grad - self.log_num_zeros_in_grad = log_num_zeros_in_grad - self.params_have_main_grad = params_have_main_grad - - self.bf16 = bf16 - self.grad_scaler = DynamicGradScaler( - initial_scale=initial_scale, - min_scale=min_scale, - growth_factor=growth_factor, - backoff_factor=backoff_factor, - growth_interval=growth_interval, - hysteresis=hysteresis, - max_scale=max_scale, - verbose=verbose - ) - - # None grad scaler is only supported for bf16. 
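A hedged construction sketch for the wrapper being defined here; it assumes a CUDA device and a Colossal-AI build with the `colossal_C` extension, and the layer sizes are arbitrary:

```python
import torch
from colossalai.amp.naive_amp._fp16_optimizer import FP16Optimizer

model = torch.nn.Linear(512, 512).cuda().half()      # fp16 model parameters
base_optim = torch.optim.Adam(model.parameters(), lr=1e-4)
optim = FP16Optimizer(base_optim,
                      clip_grad=1.0,          # 0 would disable clipping
                      initial_scale=2 ** 16,
                      growth_interval=1000,
                      hysteresis=2)
```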
- if self.grad_scaler is None: - assert self.bf16, 'fp16 expects a grad scaler.' - - # Tensor used to determine if a nan/if has happend. - # Any non-zero value indicates inf/nan. - # Note that we keep this for the cases that grad scaler is none. - # We still record nan/inf if we have a bfloat16 with a grad scaler. - if self.grad_scaler: - self.found_inf = torch.cuda.FloatTensor([0.0]) - - # Dummy tensor needed for apex multi-apply tensor. - # For bfloat, we don't have multi-tensor apply and for now - # we set it to none so the multi-tensor apply gets ignored. - if bf16: - self._dummy_overflow_buf = None - else: - self._dummy_overflow_buf = torch.cuda.IntTensor([0]) - - # In case grad scaler is not passed, define the unity scale. - if self.grad_scaler is None: - self._scale_one = torch.cuda.FloatTensor([1.0]) - - # ====================== - # main parameter stuff - # ====================== - - # Three groups of parameters: - # float16_groups: original float16 parameters - # fp32_from_float16_groups: fp32 copy of float16 parameters - # fp32_from_fp32_groups: original fp32 parameters - self.float16_groups = [] - self.fp32_from_float16_groups = [] - self.fp32_from_fp32_groups = [] - - # For all the groups in the original optimizer: - for param_group in self.optimizer.param_groups: - float16_params_this_group = [] - fp32_params_this_group = [] - fp32_from_float16_params_this_group = [] - # For all the parameters in this group: - for i, param in enumerate(param_group['params']): - if param.requires_grad: - # float16 params: - if param.type() in ['torch.cuda.HalfTensor', - 'torch.cuda.BFloat16Tensor']: - float16_params_this_group.append(param) - # Create a copy - main_param = param.detach().clone().float() - # Copy tensor model parallel attributes. - copy_tensor_parallel_attributes(param, main_param) - - # if hasattr(param, 'shared'): - # main_param.shared = param.shared - - # Replace the optimizer params with the new fp32 copy. - param_group['params'][i] = main_param - fp32_from_float16_params_this_group.append(main_param) - # Reset existing state dict key to the new main param. - if param in self.optimizer.state: - self.optimizer.state[main_param] \ - = self.optimizer.state.pop(param) - - # fp32 params. - elif param.type() == 'torch.cuda.FloatTensor': - fp32_params_this_group.append(param) - param_group['params'][i] = param - else: - raise TypeError('Wrapped parameters must be one of ' - 'torch.cuda.FloatTensor, ' - 'torch.cuda.HalfTensor, or ' - 'torch.cuda.BFloat16Tensor. ' - 'Received {}'.format(param.type())) - - self.float16_groups.append(float16_params_this_group) - self.fp32_from_float16_groups.append( - fp32_from_float16_params_this_group) - self.fp32_from_fp32_groups.append(fp32_params_this_group) - - # Leverage state_dict() and load_state_dict() to - # recast preexisting per-param state tensors - self.optimizer.load_state_dict(self.optimizer.state_dict()) - - def zero_grad(self, set_to_none=False): - """We only need to zero the model related parameters, i.e., - float16_groups & fp32_from_fp32_groups.""" - for group in self.float16_groups: - _zero_grad_group_helper(group, set_to_none) - for group in self.fp32_from_fp32_groups: - _zero_grad_group_helper(group, set_to_none) - - def get_loss_scale(self): - if self.grad_scaler is None: - return self._scale_one - return self.grad_scaler.scale - - def _copy_model_grads_to_main_grads(self): - # This only needs to be done for the float16 group. 
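The effect of this bookkeeping can be checked directly. Continuing the sketch above: the wrapped optimizer now steps on fp32 master copies while the model keeps its original fp16 parameters:

```python
model_param = next(model.parameters())                       # fp16, in the model
master_param = optim.optimizer.param_groups[0]['params'][0]  # fp32 master copy
assert model_param.dtype == torch.float16
assert master_param.dtype == torch.float32
assert torch.equal(master_param.data, model_param.data.float())
```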
- for model_group, main_group in zip(self.float16_groups, - self.fp32_from_float16_groups): - for model_param, main_param in zip(model_group, main_group): - if self.params_have_main_grad: - main_param.grad = model_param.main_grad.float() - else: - if model_param.grad is not None: - main_param.grad = model_param.grad.float() - - # For fp32 grads, we need to reset the grads to main grad. - if self.params_have_main_grad: - for model_group in self.fp32_from_fp32_groups: - for model_param in model_group: - model_param.grad = model_param.main_grad - - def _unscale_main_grads_and_check_for_nan(self): - main_grads = [] - # fp32 params fromm float16 ones. - for main_group in self.fp32_from_float16_groups: - for main_param in main_group: - if main_param.grad is not None: - main_grads.append(main_param.grad.data) - # Append fp32 parameters. - for main_group in self.fp32_from_fp32_groups: - for main_param in main_group: - if main_param.grad is not None: - main_grads.append(main_param.grad.data) - # Reset found inf. - self.found_inf.fill_(0.0) - # Unscale and set found inf/nan - torch._amp_foreach_non_finite_check_and_unscale_( - main_grads, self.found_inf, self.grad_scaler.inv_scale) - # Update across all model parallel instances. - torch.distributed.all_reduce(self.found_inf, - op=torch.distributed.ReduceOp.MAX, - group=gpc.get_group(ParallelMode.MODEL)) - - # Check for nan. - found_inf_flag = (self.found_inf.item() > 0) - return found_inf_flag - - def _get_model_and_main_params_data_float16(self): - model_data = [] - main_data = [] - for model_group, main_group in zip(self.float16_groups, - self.fp32_from_float16_groups): - for model_param, main_param in zip(model_group, main_group): - model_data.append(model_param.data) - main_data.append(main_param.data) - return model_data, main_data - - def _copy_main_params_to_model_params(self): - # Only needed for the float16 params. - model_data, main_data = self._get_model_and_main_params_data_float16() - _multi_tensor_copy_this_to_that(this=main_data, that=model_data, - overflow_buf=self._dummy_overflow_buf) - - def _copy_model_params_to_main_params(self): - # Only needed for the float16 params. - model_data, main_data = self._get_model_and_main_params_data_float16() - _multi_tensor_copy_this_to_that(this=model_data, that=main_data, - overflow_buf=self._dummy_overflow_buf) - - def reload_model_params(self): - self._copy_model_params_to_main_params() - - @torch.no_grad() - def step(self): - # Copy gradients from model params to main params. - self._copy_model_grads_to_main_grads() - - # Do unscale, check for inf, and update grad scaler only for - # the case that grad scaler is provided. - if self.grad_scaler: - - # Unscale and check for inf/nan. - found_inf_flag = self._unscale_main_grads_and_check_for_nan() - - # We are done with scaling gradients - # so we can update the loss scale. - self.grad_scaler.update(found_inf_flag) - - # If we found inf/nan, skip the update. - if found_inf_flag: - return False, None, None - - # Clip the main gradients. - grad_norm = None - if self.clip_grad > 0.0: - grad_norm = self.clip_grad_norm(self.clip_grad) - - # count the zeros in the grads - num_zeros_in_grad = self.count_zeros() if \ - self.log_num_zeros_in_grad else None - - # Step the optimizer. - self.optimizer.step() - - # Update params from main params. - self._copy_main_params_to_model_params() - - # Successful update. 
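Putting the pieces together, one training step with this wrapper looks roughly as follows (a sketch continuing the example above; it assumes an initialized Colossal-AI context, since the overflow check all-reduces across the model-parallel group):

```python
x = torch.randn(8, 512, device='cuda', dtype=torch.float16)
y = torch.randn(8, 512, device='cuda', dtype=torch.float16)

optim.zero_grad()
loss = torch.nn.functional.mse_loss(model(x).float(), y.float())
optim.scale_loss(loss).backward()        # scale the loss before backward
ok, grad_norm, num_zeros = optim.step()  # ok is False when inf/nan was found
if not ok:
    print('overflow detected: update skipped, loss scale backed off')
```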
- return True, grad_norm, num_zeros_in_grad - - def state_dict(self): - state_dict = {} - state_dict['optimizer'] = self.optimizer.state_dict() - if self.grad_scaler: - state_dict['grad_scaler'] = self.grad_scaler.state_dict() - state_dict['fp32_from_fp16_params'] = self.fp32_from_float16_groups - return state_dict - - def load_state_dict(self, state_dict): - # Optimizer. - optimizer_key = 'optimizer' - if optimizer_key not in state_dict: - optimizer_key = 'optimizer_state_dict' - print_rank_0('***WARNING*** loading optimizer from ' - 'an old checkpoint ...') - self.optimizer.load_state_dict(state_dict[optimizer_key]) - - # Grad scaler. - if 'grad_scaler' not in state_dict: - print_rank_0('***WARNING*** found an old checkpoint, will not ' - 'load grad scaler ...') - else: - if self.grad_scaler: - self.grad_scaler.load_state_dict(state_dict['grad_scaler']) - else: - print_rank_0('***WARNING*** fould the grad scaler in the ' - 'checkpoint but it is None in the class. ' - 'Skipping loading grad scaler ...') - - # Copy data for the main params. - fp32_from_float16_params_key = 'fp32_from_fp16_params' - if fp32_from_float16_params_key not in state_dict: - fp32_from_float16_params_key = 'fp32_from_fp16' - for current_group, saved_group in zip( - self.fp32_from_float16_groups, - state_dict[fp32_from_float16_params_key]): - for current_param, saved_param in zip(current_group, saved_group): - current_param.data.copy_(saved_param.data) - - def get_parameters(self): - params = [] - for param_group in self.optimizer.param_groups: - for param in param_group['params']: - params.append(param) - return params - - def clip_grad_norm(self, clip_grad): - params = self.get_parameters() - return clip_grad_norm_fp32(params, clip_grad) - - def count_zeros(self): - params = self.get_parameters() - return count_zeros_fp32(params) - - def scale_loss(self, loss): - """Simple scaling.""" - return self.get_loss_scale() * loss - - # Promote state so it can be retrieved or set via - # "optimizer_instance.state" - def _get_state(self): - return self.optimizer.state - - def _set_state(self, value): - self.optimizer.state = value - - state = property(_get_state, _set_state) - - # Promote param_groups so it can be retrieved or set via - # "optimizer_instance.param_groups" - # (for example, to adjust the learning rate) - def _get_param_groups(self): - return self.optimizer.param_groups - - def _set_param_groups(self, value): - self.optimizer.param_groups = value - - param_groups = property(_get_param_groups, _set_param_groups) diff --git a/colossalai/amp/naive_amp/naive_amp.py b/colossalai/amp/naive_amp/naive_amp.py deleted file mode 100644 index 62a6b9ff2c19fe155a356b80dad8c73a56321c93..0000000000000000000000000000000000000000 --- a/colossalai/amp/naive_amp/naive_amp.py +++ /dev/null @@ -1,81 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -import torch.nn as nn -from torch import Tensor -from typing import Union, List, Any, Dict -from torch.optim import Optimizer -import torch.cuda.amp as torch_amp - -from colossalai.nn.optimizer import ColossalaiOptimizer -from ._fp16_optimizer import FP16Optimizer - - -class NaiveAMPOptimizer(ColossalaiOptimizer): - """A wrapper class for optimizer to cast all parameters to fp16 - - :param optim: A normal optimizer like Adam or SGD - :param args: Args used to initialize FP16 optimizer - :param kwargs: Kwargs used to initialize FP16 optimizer - - :type optim: torch.optim.Optimizer - """ - - def __init__(self, optim: Optimizer, *args, **kwargs): - optim = 
FP16Optimizer(optimizer=optim, *args, **kwargs) - super().__init__(optim) - - def backward(self, loss: Tensor): - """Backward with gradient scaler - - :param loss: loss computed by a loss function - :type loss: torch.Tensor - """ - loss = self.optim.scale_loss(loss) - loss.backward() - - def step(self): - return self.optim.step() - - def clip_grad_norm(self, model: nn.Module, max_norm: float): - pass - - -class NaiveAMPModel(nn.Module): - """A wrapper class for model to cast the model into fp16 and - automatically cast the input and output - """ - - def __init__(self, - model: nn.Module, - output_to_fp32: bool = True): - super().__init__() - self.model = model.half() - self._output_to_fp32 = output_to_fp32 - - def _convert_to_fp16(self, input_: Any): - if isinstance(input_, Tensor) and input_.dtype == torch.float32: - input_ = input_.half() - return input_ - - def _convert_to_fp32(self, input_: Any): - if isinstance(input_, Tensor) and input_.dtype == torch.float16: - input_ = input_.float() - return input_ - - def forward(self, *args, **kwargs): - if args: - args = [self._convert_to_fp16(arg) for arg in args] - if kwargs: - for k, v in kwargs.items(): - kwargs[k] = self._convert_to_fp16(v) - - out = self.model(*args, **kwargs) - - if self._output_to_fp32: - if isinstance(out, Tensor): - out = self._convert_to_fp32(out) - elif isinstance(out, (tuple, list)): - out = [self._convert_to_fp32(val) for val in out] - return out diff --git a/colossalai/amp/torch_amp/__init__.py b/colossalai/amp/torch_amp/__init__.py deleted file mode 100644 index af8d349045389c7c5d59263f9c98abba063911db..0000000000000000000000000000000000000000 --- a/colossalai/amp/torch_amp/__init__.py +++ /dev/null @@ -1,32 +0,0 @@ -import torch.nn as nn -from torch.optim import Optimizer -from torch.nn.modules.loss import _Loss -from colossalai.context import Config -from .torch_amp import TorchAMPOptimizer, TorchAMPModel, TorchAMPLoss - - -def convert_to_torch_amp(model: nn.Module, - optimizer: Optimizer, - criterion: _Loss, - amp_config: Config): - """A helper function to wrap training components with Torch AMP modules - - :param model: your model object - :type model: :class:`torch.nn.Module` - :param optimizer: your optimizer object - :type optimizer: :class:`torch.optim.Optimzer` - :param criterion: your loss function object - :type criterion: :class:`torch.nn.modules.loss._Loss` - :param amp_config: configuration for different amp modes - :type amp_config: :class:`colossalai.context.Config` or dict - - :return: (model, optimizer, criterion) - :rtype: Tuple - """ - model = TorchAMPModel(model) - optimizer = TorchAMPOptimizer(optimizer, **amp_config) - criterion = TorchAMPLoss(criterion) - return model, optimizer, criterion - - -__all__ = ['convert_to_torch_amp', 'TorchAMPModel', 'TorchAMPLoss', 'TorchAMPOptimizer'] diff --git a/colossalai/amp/torch_amp/__pycache__/__init__.cpython-36.pyc b/colossalai/amp/torch_amp/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 0e612c1206de50eaf1b276769d237c88283c9078..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/torch_amp/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/torch_amp/__pycache__/__init__.cpython-37.pyc b/colossalai/amp/torch_amp/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index ff9a3724d281a9c35720d9dfb3b08c9e170b7dc4..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/torch_amp/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git 
a/colossalai/amp/torch_amp/__pycache__/_grad_scaler.cpython-36.pyc b/colossalai/amp/torch_amp/__pycache__/_grad_scaler.cpython-36.pyc deleted file mode 100644 index 1f23b6cba6a37dedff451eacd2577dd474e3d044..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/torch_amp/__pycache__/_grad_scaler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/torch_amp/__pycache__/_grad_scaler.cpython-37.pyc b/colossalai/amp/torch_amp/__pycache__/_grad_scaler.cpython-37.pyc deleted file mode 100644 index d35e8b43a57c87cf536543c095f649753e141a39..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/torch_amp/__pycache__/_grad_scaler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/torch_amp/__pycache__/torch_amp.cpython-36.pyc b/colossalai/amp/torch_amp/__pycache__/torch_amp.cpython-36.pyc deleted file mode 100644 index 448e0a00a534e87f7a6c9282cc0cacb516b2450a..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/torch_amp/__pycache__/torch_amp.cpython-36.pyc and /dev/null differ diff --git a/colossalai/amp/torch_amp/__pycache__/torch_amp.cpython-37.pyc b/colossalai/amp/torch_amp/__pycache__/torch_amp.cpython-37.pyc deleted file mode 100644 index 85fc2c583c6e525095a0d3d5ba92b1a5e86fa642..0000000000000000000000000000000000000000 Binary files a/colossalai/amp/torch_amp/__pycache__/torch_amp.cpython-37.pyc and /dev/null differ diff --git a/colossalai/amp/torch_amp/_grad_scaler.py b/colossalai/amp/torch_amp/_grad_scaler.py deleted file mode 100644 index b3ad5c084bae44a3c53f050a4768f8df3adabf47..0000000000000000000000000000000000000000 --- a/colossalai/amp/torch_amp/_grad_scaler.py +++ /dev/null @@ -1,586 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -# modified from https://github.com/pytorch/pytorch/blob/master/torch/cuda/amp/grad_scaler.py -# to support tensor parallel - -import torch -from collections import defaultdict, abc -import warnings -from enum import Enum -from typing import Any, Dict, List, Optional, Tuple -from colossalai.context import ParallelMode -import torch.distributed as dist -from colossalai.core import global_context as gpc -from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors - - -class _MultiDeviceReplicator(object): - """ - Lazily serves copies of a tensor to requested devices. Copies are cached per-device. - """ - - def __init__(self, master_tensor: torch.Tensor) -> None: - assert master_tensor.is_cuda or master_tensor.device.type == 'xla' - self.master = master_tensor - self._per_device_tensors: Dict[torch.device, torch.Tensor] = {} - - def get(self, device) -> torch.Tensor: - retval = self._per_device_tensors.get(device, None) - if retval is None: - retval = self.master.to( - device=device, non_blocking=True, copy=True) - self._per_device_tensors[device] = retval - return retval - - -# Defines default_factory for GradScaler's _per_optimizer_states defaultdict, -# as well as associated "enum" values. Prefers defining these at top level because -# - Lambdas can't be pickled, so we don't want to supply a lambda as the factory. -# - Defining READY, UNSCALED, STEPPED and _refresh_per_optimizer_state within GradScaler -# causes a circular reference, which we'd rather avoid. 
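As a quick single-GPU illustration of the per-device caching implemented by `_MultiDeviceReplicator` above (the values are illustrative only):

```python
import torch

master = torch.full((1,), 2.0 ** 16, device='cuda:0')
rep = _MultiDeviceReplicator(master)

first = rep.get(torch.device('cuda:0'))   # first call copies the master tensor
second = rep.get(torch.device('cuda:0'))  # later calls hit the per-device cache
assert first is second
```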
-class OptState(Enum): - READY = 0 - UNSCALED = 1 - STEPPED = 2 - - -def _refresh_per_optimizer_state(): - return {"stage": OptState.READY, "found_inf_per_device": {}} - - -class GradScaler(object): - _scale: Optional[torch.Tensor] - _grows_tracker: Optional[torch.Tensor] - _per_optimizer_states: Dict[int, Dict[str, Any]] - """ - An instance ``scaler`` of :class:`GradScaler` helps perform the steps of gradient scaling - conveniently. - - * ``scaler.scale(loss)`` multiplies a given loss by ``scaler``'s current scale factor. - * ``scaler.step(optimizer)`` safely unscales gradients and calls ``optimizer.step()``. - * ``scaler.update()`` updates ``scaler``'s scale factor. - - Example:: - - # Creates a GradScaler once at the beginning of training. - scaler = GradScaler() - - for epoch in epochs: - for input, target in data: - optimizer.zero_grad() - output = model(input) - loss = loss_fn(output, target) - - # Scales loss. Calls backward() on scaled loss to create scaled gradients. - scaler.scale(loss).backward() - - # scaler.step() first unscales gradients of the optimizer's params. - # If gradients don't contain infs/NaNs, optimizer.step() is then called, - # otherwise, optimizer.step() is skipped. - scaler.step(optimizer) - - # Updates the scale for next iteration. - scaler.update() - - See the :ref:`Automatic Mixed Precision examples` for usage - (along with autocasting) in more complex cases like gradient clipping, gradient accumulation, gradient penalty, - and multiple losses/optimizers. - - ``scaler`` dynamically estimates the scale factor each iteration. To minimize gradient underflow, - a large scale factor should be used. However, ``float16`` values can "overflow" (become inf or NaN) if - the scale factor is too large. Therefore, the optimal scale factor is the largest factor that can be used - without incurring inf or NaN gradient values. - ``scaler`` approximates the optimal scale factor over time by checking the gradients for infs and NaNs during every - ``scaler.step(optimizer)`` (or optional separate ``scaler.unscale_(optimizer)``, see :meth:`unscale_`). - - * If infs/NaNs are found, ``scaler.step(optimizer)`` skips the underlying ``optimizer.step()`` (so the params - themselves remain uncorrupted) and ``update()`` multiplies the scale by ``backoff_factor``. - - * If no infs/NaNs are found, ``scaler.step(optimizer)`` runs the underlying ``optimizer.step()`` as usual. - If ``growth_interval`` unskipped iterations occur consecutively, ``update()`` multiplies the scale by - ``growth_factor``. - - The scale factor often causes infs/NaNs to appear in gradients for the first few iterations as its - value calibrates. ``scaler.step`` will skip the underlying ``optimizer.step()`` for these - iterations. After that, step skipping should occur rarely (once every few hundred or thousand iterations). - - Args: - init_scale (float, optional, default=2.**16): Initial scale factor. - growth_factor (float, optional, default=2.0): Factor by which the scale is multiplied during - :meth:`update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations. - backoff_factor (float, optional, default=0.5): Factor by which the scale is multiplied during - :meth:`update` if inf/NaN gradients occur in an iteration. - growth_interval (int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients - that must occur for the scale to be multiplied by ``growth_factor``. - enabled (bool, optional, default=True): If ``False``, disables gradient scaling. 
:meth:`step` simply - invokes the underlying ``optimizer.step()``, and other methods become no-ops. - """ - - def __init__(self, - init_scale=2.**16, - growth_factor=2.0, - backoff_factor=0.5, - growth_interval=2000, - enabled=True): - if enabled and not torch.cuda.is_available(): - warnings.warn( - "torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.") - self._enabled = False - else: - self._enabled = enabled - - if self._enabled: - assert growth_factor > 1.0, "The growth factor must be > 1.0." - assert backoff_factor < 1.0, "The backoff factor must be < 1.0." - - self._init_scale = init_scale - # self._scale will be lazily initialized during the first call to scale() - self._scale = None - self._growth_factor = growth_factor - self._backoff_factor = backoff_factor - self._growth_interval = growth_interval - self._init_growth_tracker = 0 - # self._growth_tracker will be lazily initialized during the first call to scale() - self._growth_tracker = None - self._per_optimizer_states = defaultdict( - _refresh_per_optimizer_state) - - def _check_scale_growth_tracker(self, funcname) -> Tuple[torch.Tensor, torch.Tensor]: - fix = "This may indicate your script did not use scaler.scale(loss or outputs) earlier in the iteration." - assert self._scale is not None, "Attempted {} but _scale is None. ".format( - funcname) + fix - assert self._growth_tracker is not None, "Attempted {} but _growth_tracker is None. ".format( - funcname) + fix - return (self._scale, self._growth_tracker) - - def _lazy_init_scale_growth_tracker(self, dev): - assert self._growth_tracker is None, "_growth_tracker initialized before _scale" - self._scale = torch.full( - (1,), self._init_scale, dtype=torch.float32, device=dev) - self._growth_tracker = torch.full( - (1,), self._init_growth_tracker, dtype=torch.int32, device=dev) - - def scale(self, outputs): - """ - Multiplies ('scales') a tensor or list of tensors by the scale factor. - - Returns scaled outputs. If this instance of :class:`GradScaler` is not enabled, outputs are returned - unmodified. - - Args: - outputs (Tensor or iterable of Tensors): Outputs to scale. - """ - if not self._enabled: - return outputs - - # Short-circuit for the common case. - if isinstance(outputs, torch.Tensor): - assert outputs.is_cuda or outputs.device.type == 'xla' - if self._scale is None: - self._lazy_init_scale_growth_tracker(outputs.device) - assert self._scale is not None - return outputs * self._scale.to(device=outputs.device, non_blocking=True) - - # Invoke the more complex machinery only if we're treating multiple outputs. 
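The container branch that follows lets several losses be scaled in a single call; a hedged sketch (the tensors must live on CUDA, as the assertions above require):

```python
scaler = GradScaler()
loss_a = torch.randn((), device='cuda', requires_grad=True) ** 2
loss_b = torch.randn((), device='cuda', requires_grad=True).abs()

# A tuple in, a tuple of scaled tensors out: the container type is preserved.
scaled_a, scaled_b = scaler.scale((loss_a, loss_b))
(scaled_a + scaled_b).backward()
```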
- # holds a reference that can be overwritten by apply_scale - stash: List[_MultiDeviceReplicator] = [] - - def apply_scale(val): - if isinstance(val, torch.Tensor): - assert val.is_cuda or val.device.type == 'xla' - if len(stash) == 0: - if self._scale is None: - self._lazy_init_scale_growth_tracker(val.device) - assert self._scale is not None - stash.append(_MultiDeviceReplicator(self._scale)) - return val * stash[0].get(val.device) - elif isinstance(val, abc.Iterable): - iterable = map(apply_scale, val) - if isinstance(val, list) or isinstance(val, tuple): - return type(val)(iterable) - else: - return iterable - else: - raise ValueError( - "outputs must be a Tensor or an iterable of Tensors") - - return apply_scale(outputs) - - def _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16): - per_device_inv_scale = _MultiDeviceReplicator(inv_scale) - per_device_found_inf = _MultiDeviceReplicator(found_inf) - - # To set up _amp_foreach_non_finite_check_and_unscale_, split grads by device and dtype. - # There could be hundreds of grads, so we'd like to iterate through them just once. - # However, we don't know their devices or dtypes in advance. - - # https://stackoverflow.com/questions/5029934/defaultdict-of-defaultdict - # Google says mypy struggles with defaultdicts type annotations. - per_device_and_dtype_grads = defaultdict( - lambda: defaultdict(list)) # type: ignore[var-annotated] - with torch.no_grad(): - for group in optimizer.param_groups: - for param in group["params"]: - if param.grad is None: - continue - if (not allow_fp16) and param.grad.dtype == torch.float16: - raise ValueError( - "Attempting to unscale FP16 gradients.") - if param.grad.is_sparse: - # is_coalesced() == False means the sparse grad has values with duplicate indices. - # coalesce() deduplicates indices and adds all values that have the same index. - # For scaled fp16 values, there's a good chance coalescing will cause overflow, - # so we should check the coalesced _values(). - if param.grad.dtype is torch.float16: - param.grad = param.grad.coalesce() - to_unscale = param.grad._values() - else: - to_unscale = param.grad - - # TODO: is there a way to split by device and dtype without appending in the inner loop? - per_device_and_dtype_grads[to_unscale.device][to_unscale.dtype].append( - to_unscale) - - for device, per_dtype_grads in per_device_and_dtype_grads.items(): - for grads in per_dtype_grads.values(): - torch._amp_foreach_non_finite_check_and_unscale_(grads, - per_device_found_inf.get( - device), - per_device_inv_scale.get(device)) - # For tensor parallel paramters it should be all-reduced over tensor parallel process group - if gpc.is_initialized(ParallelMode.MODEL) and gpc.get_world_size(ParallelMode.MODEL) > 1: - vals = [val for val in per_device_found_inf._per_device_tensors.values()] - coalesced = _flatten_dense_tensors(vals) - dist.all_reduce(coalesced, - op=dist.ReduceOp.MAX, - group=gpc.get_group(ParallelMode.MODEL)) - for buf, synced in zip(vals, _unflatten_dense_tensors(coalesced, vals)): - buf.copy_(synced) - return per_device_found_inf._per_device_tensors - - def unscale_(self, optimizer): - """ - Divides ("unscales") the optimizer's gradient tensors by the scale factor. - - :meth:`unscale_` is optional, serving cases where you need to - :ref:`modify or inspect gradients` - between the backward pass(es) and :meth:`step`. - If :meth:`unscale_` is not called explicitly, gradients will be unscaled automatically during :meth:`step`. 
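The tensor-parallel extension above, which stock PyTorch lacks, amounts to a max-all-reduce of the per-device overflow flags so that every rank agrees on whether to skip the step. A condensed sketch of that synchronization, assuming an initialized process group:

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def sync_found_inf(per_device_found_inf, model_parallel_group):
    """Make all tensor-parallel ranks agree on whether inf/nan was seen."""
    vals = list(per_device_found_inf.values())
    coalesced = _flatten_dense_tensors(vals)
    dist.all_reduce(coalesced, op=dist.ReduceOp.MAX, group=model_parallel_group)
    for buf, synced in zip(vals, _unflatten_dense_tensors(coalesced, vals)):
        buf.copy_(synced)
```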
- - Simple example, using :meth:`unscale_` to enable clipping of unscaled gradients:: - - ... - scaler.scale(loss).backward() - scaler.unscale_(optimizer) - torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) - scaler.step(optimizer) - scaler.update() - - Args: - optimizer (torch.optim.Optimizer): Optimizer that owns the gradients to be unscaled. - - .. note:: - :meth:`unscale_` does not incur a CPU-GPU sync. - - .. warning:: - :meth:`unscale_` should only be called once per optimizer per :meth:`step` call, - and only after all gradients for that optimizer's assigned parameters have been accumulated. - Calling :meth:`unscale_` twice for a given optimizer between each :meth:`step` triggers a RuntimeError. - - .. warning:: - :meth:`unscale_` may unscale sparse gradients out of place, replacing the ``.grad`` attribute. - """ - if not self._enabled: - return - - self._check_scale_growth_tracker("unscale_") - - optimizer_state = self._per_optimizer_states[id(optimizer)] - - if optimizer_state["stage"] is OptState.UNSCALED: - raise RuntimeError( - "unscale_() has already been called on this optimizer since the last update().") - elif optimizer_state["stage"] is OptState.STEPPED: - raise RuntimeError("unscale_() is being called after step().") - - # FP32 division can be imprecise for certain compile options, so we carry out the reciprocal in FP64. - assert self._scale is not None - inv_scale = self._scale.double().reciprocal().float() - found_inf = torch.full( - (1,), 0.0, dtype=torch.float32, device=self._scale.device) - - optimizer_state["found_inf_per_device"] = self._unscale_grads_( - optimizer, inv_scale, found_inf, False) - optimizer_state["stage"] = OptState.UNSCALED - - def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs): - retval = None - if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()): - retval = optimizer.step(*args, **kwargs) - return retval - - def step(self, optimizer, *args, **kwargs): - """ - :meth:`step` carries out the following two operations: - - 1. Internally invokes ``unscale_(optimizer)`` (unless :meth:`unscale_` was explicitly called for ``optimizer`` - earlier in the iteration). As part of the :meth:`unscale_`, gradients are checked for infs/NaNs. - 2. If no inf/NaN gradients are found, invokes ``optimizer.step()`` using the unscaled - gradients. Otherwise, ``optimizer.step()`` is skipped to avoid corrupting the params. - - ``*args`` and ``**kwargs`` are forwarded to ``optimizer.step()``. - - Returns the return value of ``optimizer.step(*args, **kwargs)``. - - Args: - optimizer (torch.optim.Optimizer): Optimizer that applies the gradients. - args: Any arguments. - kwargs: Any keyword arguments. - - .. warning:: - Closure use is not currently supported. - """ - if (not self._enabled): - return optimizer.step(*args, **kwargs) - - if "closure" in kwargs: - raise RuntimeError( - "Closure use is not currently supported if GradScaler is enabled.") - - self._check_scale_growth_tracker("step") - - optimizer_state = self._per_optimizer_states[id(optimizer)] - - if optimizer_state["stage"] is OptState.STEPPED: - raise RuntimeError( - "step() has already been called since the last update().") - - retval = None - - if (hasattr(optimizer, "_step_supports_amp_scaling") and optimizer._step_supports_amp_scaling): - # This optimizer has customized scale-handling logic, so we can call optimizer.step() directly. - # The contract with custom optimizers is that their step() should accept an additional, - # optional grad_scaler kwarg. 
We append self to the kwargs so the custom optimizer has full information: - # it can query its own state, invoke unscale_ on itself, etc - retval = optimizer.step(*args, **dict(kwargs, grad_scaler=self)) - optimizer_state["stage"] = OptState.STEPPED - return retval - - if optimizer_state["stage"] is OptState.READY: - self.unscale_(optimizer) - - assert len(optimizer_state["found_inf_per_device"] - ) > 0, "No inf checks were recorded for this optimizer." - - retval = self._maybe_opt_step( - optimizer, optimizer_state, *args, **kwargs) - - optimizer_state["stage"] = OptState.STEPPED - - return retval - - def update(self, new_scale=None): - """ - Updates the scale factor. - - If any optimizer steps were skipped the scale is multiplied by ``backoff_factor`` - to reduce it. If ``growth_interval`` unskipped iterations occurred consecutively, - the scale is multiplied by ``growth_factor`` to increase it. - - Passing ``new_scale`` sets the new scale value manually. (``new_scale`` is not - used directly, it's used to fill GradScaler's internal scale tensor. So if - ``new_scale`` was a tensor, later in-place changes to that tensor will not further - affect the scale GradScaler uses internally.) - - Args: - new_scale (float or :class:`torch.cuda.FloatTensor`, optional, default=None): New scale factor. - - .. warning:: - :meth:`update` should only be called at the end of the iteration, after ``scaler.step(optimizer)`` has - been invoked for all optimizers used this iteration. - """ - if not self._enabled: - return - - _scale, _growth_tracker = self._check_scale_growth_tracker("update") - - if new_scale is not None: - # Accept a new user-defined scale. - if isinstance(new_scale, float): - self._scale.fill_(new_scale) # type: ignore[union-attr] - else: - reason = "new_scale should be a float or a 1-element torch.cuda.FloatTensor with requires_grad=False." - # type: ignore[attr-defined] - assert isinstance(new_scale, torch.cuda.FloatTensor), reason - assert new_scale.numel() == 1, reason - assert new_scale.requires_grad is False, reason - self._scale.copy_(new_scale) # type: ignore[union-attr] - else: - # Consume shared inf/nan data collected from optimizers to update the scale. - # If all found_inf tensors are on the same device as self._scale, this operation is asynchronous. - found_infs = [found_inf.to(device=_scale.device, non_blocking=True) - for state in self._per_optimizer_states.values() - for found_inf in state["found_inf_per_device"].values()] - - assert len( - found_infs) > 0, "No inf checks were recorded prior to update." - - found_inf_combined = found_infs[0] - if len(found_infs) > 1: - for i in range(1, len(found_infs)): - found_inf_combined += found_infs[i] - - torch._amp_update_scale_(_scale, - _growth_tracker, - found_inf_combined, - self._growth_factor, - self._backoff_factor, - self._growth_interval) - - # To prepare for next iteration, clear the data collected from optimizers this iteration. - self._per_optimizer_states = defaultdict(_refresh_per_optimizer_state) - - def _get_scale_async(self): - return self._scale - - def get_scale(self): - """ - Returns a Python float containing the current scale, or 1.0 if scaling is disabled. - - .. warning:: - :meth:`get_scale` incurs a CPU-GPU sync. - """ - if self._enabled: - return self._init_scale if self._scale is None else self._get_scale_async().item() - else: - return 1.0 - - def get_growth_factor(self): - r""" - Returns a Python float containing the scale growth factor. 
- """ - return self._growth_factor - - def set_growth_factor(self, new_factor): - r""" - Args: - new_scale (float): Value to use as the new scale growth factor. - """ - self._growth_factor = new_factor - - def get_backoff_factor(self): - r""" - Returns a Python float containing the scale backoff factor. - """ - return self._backoff_factor - - def set_backoff_factor(self, new_factor): - r""" - Args: - new_scale (float): Value to use as the new scale backoff factor. - """ - self._backoff_factor = new_factor - - def get_growth_interval(self): - r""" - Returns a Python int containing the growth interval. - """ - return self._growth_interval - - def set_growth_interval(self, new_interval): - r""" - Args: - new_interval (int): Value to use as the new growth interval. - """ - self._growth_interval = new_interval - - def _get_growth_tracker(self): - if self._enabled: - return self._init_growth_tracker if self._growth_tracker is None else self._growth_tracker.item() - else: - return 0 - - def is_enabled(self): - r""" - Returns a bool indicating whether this instance is enabled. - """ - return self._enabled - - def state_dict(self): - r""" - Returns the state of the scaler as a :class:`dict`. It contains five entries: - - * ``"scale"`` - a Python float containing the current scale - * ``"growth_factor"`` - a Python float containing the current growth factor - * ``"backoff_factor"`` - a Python float containing the current backoff factor - * ``"growth_interval"`` - a Python int containing the current growth interval - * ``"_growth_tracker"`` - a Python int containing the number of recent consecutive unskipped steps. - - If this instance is not enabled, returns an empty dict. - - .. note:: - If you wish to checkpoint the scaler's state after a particular iteration, :meth:`state_dict` - should be called after :meth:`update`. - """ - return {"scale": self.get_scale(), - "growth_factor": self._growth_factor, - "backoff_factor": self._backoff_factor, - "growth_interval": self._growth_interval, - "_growth_tracker": self._get_growth_tracker()} if self._enabled else {} - - def load_state_dict(self, state_dict): - r""" - Loads the scaler state. If this instance is disabled, :meth:`load_state_dict` is a no-op. - - Args: - state_dict(dict): scaler state. Should be an object returned from a call to :meth:`state_dict`. - """ - if not self._enabled: - return - - if len(state_dict) == 0: - raise RuntimeError("The source state dict is empty, possibly because it was saved " - "from a disabled instance of GradScaler.") - - self._init_scale = state_dict["scale"] - if self._scale is not None: - self._scale.fill_(state_dict["scale"]) - self._growth_factor = state_dict["growth_factor"] - self._backoff_factor = state_dict["backoff_factor"] - self._growth_interval = state_dict["growth_interval"] - self._init_growth_tracker = state_dict["_growth_tracker"] - if self._growth_tracker is not None: - self._growth_tracker.fill_(state_dict["_growth_tracker"]) - - def __getstate__(self): - state = self.__dict__.copy() - if self._enabled: - assert len(self._per_optimizer_states) == 0, "A GradScaler instance may only be pickled at the beginning "\ - "of an iteration, or at the end after scaler.update()." - # Pickling _scale and _growth_tracker Tensors directly triggers - # "warnings.warn("pickle support for Storage will be removed in 1.5..." - # so instead, we set the unpickled instance up to reinitialize them lazily. 
- state['_init_scale'] = self.get_scale() - state['_init_growth_tracker'] = self._get_growth_tracker() - state['_scale'] = None - state['_growth_tracker'] = None - return state - - def __setstate__(self, state): - self.__dict__.update(state) - - def _check_inf_per_device(self, optimizer): - _scale, _ = self._check_scale_growth_tracker("_check_inf_per_device") - - dummy_inv_scale = torch.full( - (1,), 1.0, dtype=torch.float32, device=_scale.device) - found_inf = torch.full( - (1,), 0.0, dtype=torch.float32, device=_scale.device) - - self._per_optimizer_states[id(optimizer)]["found_inf_per_device"] = \ - self._unscale_grads_(optimizer, dummy_inv_scale, found_inf, True) - - return self._per_optimizer_states[id(optimizer)]["found_inf_per_device"] - - def _found_inf_per_device(self, optimizer): - return self._per_optimizer_states[id(optimizer)]["found_inf_per_device"] diff --git a/colossalai/amp/torch_amp/torch_amp.py b/colossalai/amp/torch_amp/torch_amp.py deleted file mode 100644 index d7b2c61c9a2bb8834524fa548b91584f08a7ea52..0000000000000000000000000000000000000000 --- a/colossalai/amp/torch_amp/torch_amp.py +++ /dev/null @@ -1,84 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn -import torch.cuda.amp as torch_amp - -from torch import Tensor -from torch.nn.modules.loss import _Loss -from torch.optim import Optimizer -from ._grad_scaler import GradScaler - -from colossalai.nn.optimizer import ColossalaiOptimizer -from colossalai.utils import clip_grad_norm_fp32 - - -class TorchAMPOptimizer(ColossalaiOptimizer): - """A wrapper class which integrate pytorch amp with an optimizer - - :param optim: A normal optimizer like Adam or SGD - :param args: Args used to initialize gradient scaler - :param kwargs: Kwargs used to initialize gradient scaler - - :type optim: torch.optim.Optimizer - """ - - def __init__(self, optim: Optimizer, *args, **kwargs): - super().__init__(optim) - self.scaler = GradScaler(*args, **kwargs) - - def backward(self, loss: Tensor): - """Backward with torch amp gradient scaler - - :param loss: Loss computed by a loss function - :type loss: torch.Tensor - """ - self.scaler.scale(loss).backward() - - def step(self): - """Update the parameters of the model - """ - self.scaler.step(self.optim) - self.scaler.update() - - def clip_grad_norm(self, model: nn.Module, max_norm: float): - """Apply gradient clipping to the model parameters - - :param model: Your model object - :type model: torch.nn.Module - :param max_norm: Max norm value for gradient clipping - :type max_norm: float - """ - if max_norm > 0.0: - self.scaler.unscale_(self.optim) - clip_grad_norm_fp32(model.parameters(), max_norm) - - -class TorchAMPModel(nn.Module): - """A wrapper class for a model object which executes forward with values automatically - cast to fp16 - """ - - def __init__(self, model: nn.Module) -> None: - super().__init__() - self.model = model - - @torch_amp.autocast() - def forward(self, *args, **kwargs): - return self.model(*args, **kwargs) - - -class TorchAMPLoss(nn.Module): - """A wrapper class for a criterion object which computes the loss in mixed-precision context - - :param loss: A loss function object - :type loss: torch.nn.modules.loss._Loss - """ - - def __init__(self, loss: _Loss): - super().__init__() - self.loss = loss - - @torch_amp.autocast() - def forward(self, *args, **kwargs): - return self.loss(*args, **kwargs) diff --git a/colossalai/builder/__init__.py b/colossalai/builder/__init__.py deleted file mode 100644 index 
c4840c24a530fcf748956e3c5aba9374275d3d33..0000000000000000000000000000000000000000 --- a/colossalai/builder/__init__.py +++ /dev/null @@ -1,12 +0,0 @@ -from .builder import (build_schedule, build_lr_scheduler, build_model, - build_optimizer, build_layer, build_loss, build_hooks, - build_dataset, build_transform, build_data_sampler, - build_gradient_handler, build_ophooks) -from .pipeline import build_pipeline_model, build_pipeline_model_from_cfg - -__all__ = [ - 'build_schedule', 'build_lr_scheduler', 'build_model', 'build_optimizer', - 'build_layer', 'build_loss', 'build_hooks', 'build_dataset', - 'build_transform', 'build_data_sampler', 'build_gradient_handler', - 'build_pipeline_model', 'build_pipeline_model_from_cfg', 'build_ophooks' -] diff --git a/colossalai/builder/__pycache__/__init__.cpython-36.pyc b/colossalai/builder/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 7a844190b25e976b9fd37723e79e6fe7fe5ca639..0000000000000000000000000000000000000000 Binary files a/colossalai/builder/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/builder/__pycache__/__init__.cpython-37.pyc b/colossalai/builder/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 5503cde17e73ba1183087280048367b4964a645e..0000000000000000000000000000000000000000 Binary files a/colossalai/builder/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/builder/__pycache__/builder.cpython-36.pyc b/colossalai/builder/__pycache__/builder.cpython-36.pyc deleted file mode 100644 index d001b9883f1ed91e913c0f6209e65fc78aaefe71..0000000000000000000000000000000000000000 Binary files a/colossalai/builder/__pycache__/builder.cpython-36.pyc and /dev/null differ diff --git a/colossalai/builder/__pycache__/builder.cpython-37.pyc b/colossalai/builder/__pycache__/builder.cpython-37.pyc deleted file mode 100644 index 95e0eaeed025122494ec8b4f7462b16f733d6d22..0000000000000000000000000000000000000000 Binary files a/colossalai/builder/__pycache__/builder.cpython-37.pyc and /dev/null differ diff --git a/colossalai/builder/__pycache__/pipeline.cpython-36.pyc b/colossalai/builder/__pycache__/pipeline.cpython-36.pyc deleted file mode 100644 index 2500132ba5e71a5ecfa65096bff4f00315dc5707..0000000000000000000000000000000000000000 Binary files a/colossalai/builder/__pycache__/pipeline.cpython-36.pyc and /dev/null differ diff --git a/colossalai/builder/__pycache__/pipeline.cpython-37.pyc b/colossalai/builder/__pycache__/pipeline.cpython-37.pyc deleted file mode 100644 index c7dd1aacbf7c68732ad65b3fded6df7b95d2b087..0000000000000000000000000000000000000000 Binary files a/colossalai/builder/__pycache__/pipeline.cpython-37.pyc and /dev/null differ diff --git a/colossalai/builder/builder.py b/colossalai/builder/builder.py deleted file mode 100644 index 2c7eea999d2b9e270aff1b6356b984a11047808b..0000000000000000000000000000000000000000 --- a/colossalai/builder/builder.py +++ /dev/null @@ -1,234 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import inspect -from collections.abc import Iterable - -from colossalai.registry import * - - -def build_from_config(module, config: dict): - """Returns an object of :class:`module` constructed from `config`. 
- - :param module: A python or user-defined class - :type module: class - :param config: A python dict containing information used in the construction - of the return object - :type config: dict - :raises AssertionError: Raises an AssertionError if `module` is not a class - :return: An object of interest - :rtype: Object - """ - assert inspect.isclass(module), 'module must be a class' - return module(**config) - - -def build_from_registry(config, registry: Registry): - """Returns an object constructed from `config`, the type of the object - is specified by `registry`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.colossalai.context.Config` - :param registry: A registry specifying the type of the return object - :type registry: :class:`Registry` - :raises AssertionError: Raises an AssertionError if `registry` is not an object - of :class:`Registry` or `mod_type` in `config` is not found in `registry` - :raises Exception: Raises an Exception if an error occurred when building - from registry - :return: An object specified by `registry` - :rtype: Python object specified by `registry` - """ - config_ = config.copy() # keep the original config untouched - assert isinstance( - registry, Registry), f'Expected type Registry but got {type(registry)}' - - mod_type = config_.pop('type') - assert registry.has( - mod_type), f'{mod_type} is not found in registry {registry.name}' - try: - obj = registry.get_module(mod_type)(**config_) - except Exception as e: - print( - f'An error occurred when building {mod_type} from registry {registry.name}', - flush=True) - raise e - - return obj - - -def build_layer(config): - """Returns a layer object of :class:`nn.Module` constructed from `config`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :return: An object of :class:`torch.nn.Module` - :rtype: :class:`torch.nn.Module` - """ - return build_from_registry(config, LAYERS) - - -def build_loss(config): - """Returns a loss function object of :class:`torch.autograd.Function` constructed - from `config`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :return: An object of :class:`torch.nn.modules.loss._Loss` - :rtype: :class:`torch.nn.modules.loss._Loss` - """ - return build_from_registry(config, LOSSES) - - -def build_model(config): - """Returns a model object of :class:`nn.Module` constructed from `config`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :return: An object of :class:`torch.nn.Module` - :rtype: :class:`torch.nn.Module` - """ - return build_from_registry(config, MODELS) - - -def build_dataset(config): - """Returns a dataset object of :class:`torch.utils.data.Dataset` constructed - from `config`. 
- - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :return: An object of :class:`torch.utils.data.Dataset` - :rtype: :class:`torch.utils.data.Dataset` - """ - return build_from_registry(config, DATASETS) - - -def build_optimizer(config, model): - """Returns an optimizer object of :class:`torch.optim.Optimizer` constructed from `config`, - 'model' and 'params'. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :param model: A model containing parameters for the optimizer - :type model: :class:`nn.Module` - :return: An object of :class:`torch.optim.Optimizer` - :rtype: :class:`torch.optim.Optimizer` - """ - config_ = config.copy() - config_['params'] = model.parameters() - return build_from_registry(config_, OPTIMIZERS) - - -def build_gradient_handler(config, model, optimizer): - """Returns a gradient handler object of :class:`BaseGradientHandler` constructed from `config`, - `model` and `optimizer`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :param model: A model containing parameters for the gradient handler - :type model: :class:`nn.Module` - :param optimizer: An optimizer object containing parameters for the gradient handler - :type optimizer: :class:`torch.optim.Optimizer` - :return: An object of :class:`colossalai.engine.BaseGradientHandler` - :rtype: :class:`colossalai.engine.BaseGradientHandler` - """ - config_ = config.copy() - config_['model'] = model - config_['optimizer'] = optimizer - return build_from_registry(config_, GRADIENT_HANDLER) - - -def build_hooks(config, trainer): - """Returns a hook object of :class:`BaseHook` constructed from `config` and `trainer`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :param trainer: A :class:`Trainer` object containing parameters for the hook - :type trainer: :class:`Trainer` - :return: An object of :class:`colossalai.trainer.hooks.BaseHook` - :rtype: :class:`colossalai.trainer.hooks.BaseHook` - """ - config_ = config.copy() - config_['trainer'] = trainer - return build_from_registry(config_, HOOKS) - - -def build_ophooks(config): - """Returns a hook object of :class:`BaseOpHook` constructed from `config`. - - :param config: A python dict or a :class:`colossalai.context.Config` object - containing information used in the construction of the return object - :type config: dict or :class:`colossalai.context.Config` - :return: An object of :class:`colossalai.trainer.hooks.BaseOpHook` - :rtype: :class:`colossalai.trainer.hooks.BaseOpHook` - """ - config_ = config.copy() - return build_from_registry(config_, OPHOOKS) - - -def build_transform(config): - """Returns a transformation object of :class:`torchvision.transforms` constructed - from `config`. 
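All of these builders follow one pattern: copy the config, inject the runtime objects, and dispatch on the `type` key through a registry. For instance (assuming an optimizer named `Adam` is registered under `OPTIMIZERS`, which depends on the installed version):

```python
optimizer_cfg = dict(type='Adam', lr=1e-3, weight_decay=0.0)
optimizer = build_optimizer(optimizer_cfg, model)  # injects model.parameters()
```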
-
-
-def build_transform(config):
-    """Returns a transformation object of :class:`torchvision.transforms` constructed
-    from `config`.
-
-    :param config: A python dict or a :class:`colossalai.context.Config` object
-        containing information used in the construction of the return object
-    :type config: dict or :class:`colossalai.context.Config`
-    :return: An object of :class:`torchvision.transforms`
-    :rtype: :class:`torchvision.transforms`
-    """
-    return build_from_registry(config, TRANSFORMS)
-
-
-def build_data_sampler(config, dataset):
-    """Returns a data sampler object of :class:`colossalai.utils.data_sampler.BaseSampler`
-    constructed from `config` and `dataset`.
-
-    :param config: A python dict or a :class:`colossalai.context.Config` object
-        containing information used in the construction of the return object
-    :type config: dict or :class:`colossalai.context.Config`
-    :param dataset: An object of :class:`torch.utils.data.Dataset` containing information
-        used in the construction of the return object
-    :type dataset: :class:`torch.utils.data.Dataset`
-    :return: An object of :class:`colossalai.utils.data_sampler.BaseSampler`
-    :rtype: :class:`colossalai.utils.data_sampler.BaseSampler`
-    """
-    config_ = config.copy()
-    config_['dataset'] = dataset
-    return build_from_registry(config_, DATA_SAMPLERS)
-
-
-def build_lr_scheduler(config, optimizer):
-    """Returns a learning rate scheduler object of :class:`torch.optim.lr_scheduler`
-    constructed from `config` and `optimizer`.
-
-    :param config: A python dict or a :class:`colossalai.context.Config` object
-        containing information used in the construction of the return object
-    :type config: dict or :class:`colossalai.context.Config`
-    :param optimizer: An optimizer object containing parameters for the learning rate
-        scheduler
-    :type optimizer: :class:`torch.optim.Optimizer`
-    :return: An object of :class:`torch.optim.lr_scheduler`
-    :rtype: :class:`torch.optim.lr_scheduler`
-    """
-    config_ = config.copy()
-    config_['optimizer'] = optimizer
-    return build_from_registry(config_, LR_SCHEDULERS)
-
-
-def build_schedule(config):
-    """Returns a schedule of :class:`colossalai.engine.schedule.BaseSchedule`.
-
-    :param config: A python dict or a :class:`colossalai.context.Config` object
-        containing information used in the construction of the return object
-    :type config: dict or :class:`colossalai.context.Config`
-    :return: An object of :class:`colossalai.engine.schedule.BaseSchedule`
-    :rtype: :class:`colossalai.engine.schedule.BaseSchedule`
-    """
-    return build_from_registry(config, SCHEDULE)
diff --git a/colossalai/builder/pipeline.py b/colossalai/builder/pipeline.py
deleted file mode 100644
index a3312f6ccebc16fddf6203e43e4829558464451a..0000000000000000000000000000000000000000
--- a/colossalai/builder/pipeline.py
+++ /dev/null
@@ -1,266 +0,0 @@
-import copy
-import heapq
-
-
-from colossalai.builder import build_model, build_layer
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.logging import get_dist_logger
-import torch.nn as nn
-
-
-def _binary_partition(weights, st, ed):
-    """Returns the binary partition position of `weights`, given the start
-    position `st` and the end position `ed`.
-
-    :param weights: A python list to be binary partitioned
-    :type weights: list
-    :param st: the start position of the binary partition
-    :type st: int
-    :param ed: the end position of the binary partition
-    :type ed: int
-    :return: the binary partition position of `weights`
-    :rtype: int
-    """
-    w_sum = weights[ed - 1]
-    prefix = 0
-    if st > 0:
-        w_sum -= weights[st - 1]
-        prefix = weights[st - 1]
-    minimum = float("inf")
-    for idx in range(st + 1, ed):
-        front = weights[idx - 1] - prefix
-        diff = abs(w_sum - 2 * front)
-        if diff < minimum:
-            pos = idx
-            minimum = diff
-
-    return st, pos, ed
-
-
-def _heap_addition(weights, intervals, add_cnt):
-    """Adds `add_cnt` more blocks by repeatedly bisecting the heaviest interval.
-    """
-    def _heap_push(heap, st, ed):
-        value = weights[ed - 1]
-        if st > 0:
-            value -= weights[st - 1]
-        heapq.heappush(heap, (-value, st, ed))
-
-    ret_intervals = []
-    heap = []
-
-    for st, ed in intervals:
-        _heap_push(heap, st, ed)
-
-    while add_cnt > 0:
-        _, st, ed = heapq.heappop(heap)
-        if ed - st == 1:
-            ret_intervals.append((st, ed))
-        else:
-            l, m, r = _binary_partition(weights, st, ed)
-            _heap_push(heap, l, m)
-            _heap_push(heap, m, r)
-            add_cnt -= 1
-
-    while heap:
-        _, st, ed = heapq.heappop(heap)
-        ret_intervals.append((st, ed))
-
-    ret_intervals.sort()
-    return ret_intervals
-
-
-def _calc_partitions(weights, value):
-    prev = 0
-    prefix = 0
-    num_block = 0
-    intervals = []
-
-    for idx, w in enumerate(weights):
-        if weights[idx] - prefix > value:
-            intervals.append((prev, idx))
-            prev = idx
-            prefix = weights[idx - 1]
-            num_block += 1
-
-    intervals.append((prev, len(weights)))
-    return num_block + 1, intervals
-
-
-def _binary_search(weights, num):
-    length = len(weights)
-    prefix = [1 if w == 0 else w for w in weights]
-    for i in range(1, length):
-        prefix[i] += prefix[i - 1]
-
-    lower_bound = max(weights)
-    upper_bound = prefix[length - 1]
-
-    while upper_bound > lower_bound:
-        mid = (upper_bound + lower_bound) // 2
-        number, _ = _calc_partitions(prefix, mid)
-        if number <= num:
-            upper_bound = mid
-        else:
-            lower_bound = mid + 1
-
-    num_block, intervals = _calc_partitions(prefix, upper_bound)
-    if num_block < num:
-        intervals = _heap_addition(prefix, intervals, num - num_block)
-
-    return intervals
-
-
-def partition_uniform(num_items, pipeline_parallel_size, num_chunks):
-    assert num_items % num_chunks == 0, \
-        "The number of layers should be divisible by the number of chunks, otherwise the 'parameter' partition method is recommended"
-
-    logger = get_dist_logger()
-    parts = [[] for _ in range(pipeline_parallel_size)]
-    partition_items = num_items // num_chunks
-    for idx in range(num_chunks):
-        base_idx = idx * partition_items
-        chunk_size = partition_items // pipeline_parallel_size
-        left = pipeline_parallel_size - partition_items % pipeline_parallel_size
-        if chunk_size == 0:
-            logger.warning("Some stages in the pipeline have no layers assigned")
-
-        for p in range(pipeline_parallel_size):
-            st = base_idx
-            base_idx += chunk_size + (p >= left)
-            parts[p].append((st, base_idx))
-
-    return parts
-
-
-def partition_balanced(weights, pipeline_parallel_size, num_chunks):
-    num_total = pipeline_parallel_size * num_chunks
-    num_items = len(weights)
-    if num_items <= num_total:
-        return partition_uniform(num_items, pipeline_parallel_size, num_chunks)
-
-    intervals = _binary_search(weights, num_total)
-
-    current = 0
-    parts = [[] for _ in range(pipeline_parallel_size)]
-    for inter in intervals:
-        parts[current].append(inter)
-        current = (current + 1) % pipeline_parallel_size
-
-    return parts
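The balanced partition performs a binary search over the largest allowed per-block weight: the lower bound is the heaviest single layer, the upper bound is the total weight, and each probe greedily packs layers until the budget is exceeded. A compact standalone re-implementation of that idea (illustrative only; the code above additionally pads short partitions via the heap step):

```python
def balanced_intervals(weights, num_blocks):
    """Split `weights` into at most `num_blocks` contiguous intervals,
    minimising the weight of the heaviest interval (binary search on the budget)."""
    def blocks_needed(budget):
        count, current = 1, 0
        for w in weights:
            if current + w > budget:
                count, current = count + 1, 0
            current += w
        return count

    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if blocks_needed(mid) <= num_blocks:
            hi = mid
        else:
            lo = mid + 1

    # materialise the intervals for the chosen budget
    intervals, start, current = [], 0, 0
    for idx, w in enumerate(weights):
        if current + w > hi:
            intervals.append((start, idx))
            start, current = idx, 0
        current += w
    intervals.append((start, len(weights)))
    return intervals


print(balanced_intervals([4, 1, 1, 4, 2, 2], 3))  # -> [(0, 2), (2, 4), (4, 6)]
```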
-
-
-def count_layer_params(layers):
-    """Count the number of parameters in each layer.
-    """
-    param_counts = [0] * len(layers)
-    for idx, cfg in enumerate(layers):
-        layer = build_layer(cfg)
-        params = filter(lambda p: p.requires_grad, layer.parameters())
-        param_counts[idx] = sum(p.numel() for p in params)
-
-    return param_counts
-
-
-def build_pipeline_model_from_cfg(config, num_chunks: int = 1, partition_method: str = 'parameter', verbose: bool = False):
-    """An initializer to split the model into different stages for pipeline parallelism.
-
-    An example for the model config is shown below. The class VisionTransformerFromConfig should
-    inherit colossalai.nn.model.ModelFromConfig to allow this initializer to build the model from a
-    sequence of layer configurations.
-
-    model_config = dict(
-        type='VisionTransformerFromConfig',
-        embedding_cfg=dict(...),
-        ...
-    )
-
-    :param config: Configuration of the model
-    :type config: dict
-    :param num_chunks: The number of chunks you want to have on the current stage. This value should be 1
-        in most cases unless you are using virtual pipeline parallelism.
-    :type num_chunks: int, optional
-    :param partition_method: This parameter determines how you want to split your model layers into stages,
-        you can set it as 'layer' or 'parameter'
-    :type partition_method: str, optional
-    :param verbose: Whether to print the logs
-    :type verbose: bool, optional
-    """
-    ori_model = build_model(config)
-    layers = ori_model.layers_cfg
-    layer_length = len(layers)
-    logger = get_dist_logger()
-    if verbose:
-        logger.info(f"The total length of layers is {layer_length}", ranks=[0])
-
-    pipeline_parallel_size = gpc.get_world_size(ParallelMode.PIPELINE)
-    pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
-
-    method = partition_method.lower()
-    # Make a partition
-    if method == 'layer':
-        num_layers = len(layers)
-        parts = partition_uniform(num_layers, pipeline_parallel_size, num_chunks)
-    elif method == 'parameter':
-        param_counts = count_layer_params(layers)
-        # print_rank_0(param_counts)
-        parts = partition_balanced(param_counts, pipeline_parallel_size, num_chunks)
-    else:
-        raise ValueError("Method should be a pre-set string in [layer, parameter]")
-
-    # Display the partition
-    if verbose:
-        log_str = 'Layer allocation after partitioning: \n'
-        for stage in range(pipeline_parallel_size):
-
-            num_layers = 0
-            for st, ed in parts[stage]:
-                num_layers += ed - st
-
-            log_str += f'\n===== stage={stage}, layers={num_layers} =====\n'
-            for st, ed in parts[stage]:
-                for idx, layer in enumerate(layers[st: ed]):
-                    log_str += f'\t{idx + st:2d}: {layer}\n'
-        logger.info(log_str, ranks=[0])
-
-    # Save the partition
-    interval = parts[pipeline_rank]
-
-    models = []
-    for st, ed in interval:
-        model = copy.deepcopy(ori_model)
-        model.build_from_cfg(st, ed)
-        models.append(model)
-
-    return nn.ModuleList(models) if len(models) > 1 else models[0]
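For the simpler `build_pipeline_model` below, the partition logic reduces to slicing an `nn.Sequential` into contiguous ranges, one chunk per (stage, chunk) pair. The slicing itself needs no distributed context, so it can be sketched locally (here the pipeline size and rank are plain variables rather than values from `gpc`, and the model is a throwaway stack of linear layers):

```python
import torch.nn as nn

layers = nn.Sequential(*[nn.Linear(8, 8) for _ in range(8)])
pipeline_parallel_size, pipeline_rank, num_chunks = 4, 1, 1

# uniform partition: 8 layers over 4 stages -> [(0, 2), (2, 4), (4, 6), (6, 8)]
per_stage = len(layers) // (pipeline_parallel_size * num_chunks)
partitions = [[(p * per_stage, (p + 1) * per_stage)]
              for p in range(pipeline_parallel_size)]

module_list = [nn.Sequential(*layers[st:ed]) for st, ed in partitions[pipeline_rank]]
stage_model = nn.ModuleList(module_list) if len(module_list) > 1 else module_list[0]
print(stage_model)  # the two layers owned by stage 1
```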
-
-
-def build_pipeline_model(layers: nn.Sequential, num_chunks: int = 1, verbose: bool = False):
-    """An initializer to split the model into different stages for pipeline parallelism.
-    Note that `layers` must be a `torch.nn.Sequential`.
-
-    :param layers: Layers of model
-    :type layers: `torch.nn.Sequential`
-    :param num_chunks: The number of chunks you want to have on the current stage. This value should be 1
-        in most cases unless you are using virtual pipeline parallelism.
-    :type num_chunks: int, optional
-    :param verbose: Whether to print the logs
-    :type verbose: bool, optional
-    """
-    pipeline_parallel_size = gpc.get_world_size(ParallelMode.PIPELINE)
-    pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
-    partitions = partition_uniform(len(layers), pipeline_parallel_size, num_chunks)
-    module_list = []
-    for start, end in partitions[pipeline_rank]:
-        module_list.append(nn.Sequential(*layers[start:end]))
-    if verbose:
-        logger = get_dist_logger()
-        logger.info(f'Total {len(layers)} layers', ranks=[0])
-        for rank, part in enumerate(partitions):
-            log_str = f'===== stage={rank} =====\n'
-            for chunk, (start, end) in enumerate(part):
-                log_str += f'===== chunk={chunk}, layer=[{start}-{end}] =====\n'
-                log_str += '\n'.join([str(layer) for layer in layers[start:end]]) + '\n'
-            logger.info(log_str, ranks=[0])
-    return nn.ModuleList(module_list) if len(module_list) > 1 else module_list[0]
diff --git a/colossalai/communication/__init__.py b/colossalai/communication/__init__.py
deleted file mode 100644
index 25e817f1f67f79908596a3e7146ab840dd7f9ee9..0000000000000000000000000000000000000000
--- a/colossalai/communication/__init__.py
+++ /dev/null
@@ -1,17 +0,0 @@
-from .collective import all_gather, reduce_scatter, all_reduce, broadcast, reduce
-from .p2p import (send_forward, send_forward_recv_forward,
-                  send_backward_recv_forward, send_backward,
-                  send_backward_recv_backward, send_forward_recv_backward,
-                  send_forward_backward_recv_forward_backward, recv_forward,
-                  recv_backward)
-from .ring import ring_forward
-from .utils import send_tensor_meta, recv_tensor_meta
-
-__all__ = [
-    'all_gather', 'reduce_scatter', 'all_reduce', 'broadcast', 'reduce',
-    'send_forward', 'send_forward_recv_forward',
-    'send_forward_backward_recv_forward_backward', 'send_backward',
-    'send_backward_recv_backward', 'send_backward_recv_forward',
-    'send_forward_recv_backward', 'recv_backward', 'recv_forward',
-    'ring_forward', 'send_tensor_meta', 'recv_tensor_meta',
-]
diff --git a/colossalai/communication/__pycache__/__init__.cpython-36.pyc b/colossalai/communication/__pycache__/__init__.cpython-36.pyc
deleted file mode 100644
index 634db9b6a7d70f266d1452e4202becbef96c4275..0000000000000000000000000000000000000000
Binary files a/colossalai/communication/__pycache__/__init__.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/communication/__pycache__/__init__.cpython-37.pyc b/colossalai/communication/__pycache__/__init__.cpython-37.pyc
deleted file mode 100644
index 8b7bb069882a5cbb78bafbf76e128ec777f14129..0000000000000000000000000000000000000000
Binary files a/colossalai/communication/__pycache__/__init__.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/communication/__pycache__/collective.cpython-36.pyc b/colossalai/communication/__pycache__/collective.cpython-36.pyc
deleted file mode 100644
index 0a02f6cade19161afb0372808c1562b838aadfbb..0000000000000000000000000000000000000000
Binary files a/colossalai/communication/__pycache__/collective.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/communication/__pycache__/collective.cpython-37.pyc b/colossalai/communication/__pycache__/collective.cpython-37.pyc
deleted file mode 100644
index d4f0c2aa4016a4be14253e6fc54578ce256ad003..0000000000000000000000000000000000000000
Binary files a/colossalai/communication/__pycache__/collective.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/communication/__pycache__/p2p.cpython-36.pyc b/colossalai/communication/__pycache__/p2p.cpython-36.pyc
deleted
file mode 100644 index bd0d8b8751cf71a2287e441a1abcea22b18136cb..0000000000000000000000000000000000000000 Binary files a/colossalai/communication/__pycache__/p2p.cpython-36.pyc and /dev/null differ diff --git a/colossalai/communication/__pycache__/p2p.cpython-37.pyc b/colossalai/communication/__pycache__/p2p.cpython-37.pyc deleted file mode 100644 index bf8abf1c99756cfb16d8a35bb1171ad690713714..0000000000000000000000000000000000000000 Binary files a/colossalai/communication/__pycache__/p2p.cpython-37.pyc and /dev/null differ diff --git a/colossalai/communication/__pycache__/ring.cpython-36.pyc b/colossalai/communication/__pycache__/ring.cpython-36.pyc deleted file mode 100644 index de6f5f6d1ee5128efda8c0fffafe2f8b4df23bfd..0000000000000000000000000000000000000000 Binary files a/colossalai/communication/__pycache__/ring.cpython-36.pyc and /dev/null differ diff --git a/colossalai/communication/__pycache__/ring.cpython-37.pyc b/colossalai/communication/__pycache__/ring.cpython-37.pyc deleted file mode 100644 index 5d7442d25fc9f343d62f8d0acc7a51fb7ea3fade..0000000000000000000000000000000000000000 Binary files a/colossalai/communication/__pycache__/ring.cpython-37.pyc and /dev/null differ diff --git a/colossalai/communication/__pycache__/utils.cpython-36.pyc b/colossalai/communication/__pycache__/utils.cpython-36.pyc deleted file mode 100644 index da8dded6decac7a6e6e6cd8b304e3216fbbbdbe8..0000000000000000000000000000000000000000 Binary files a/colossalai/communication/__pycache__/utils.cpython-36.pyc and /dev/null differ diff --git a/colossalai/communication/__pycache__/utils.cpython-37.pyc b/colossalai/communication/__pycache__/utils.cpython-37.pyc deleted file mode 100644 index ab44958772dbfa423afb15767e4934df4f9c799f..0000000000000000000000000000000000000000 Binary files a/colossalai/communication/__pycache__/utils.cpython-37.pyc and /dev/null differ diff --git a/colossalai/communication/collective.py b/colossalai/communication/collective.py deleted file mode 100644 index 5b4e5eeba4733366aee40be751322c55a16af9f3..0000000000000000000000000000000000000000 --- a/colossalai/communication/collective.py +++ /dev/null @@ -1,135 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -import torch.distributed as dist -from torch.distributed import ReduceOp -from torch import Tensor - -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.utils import get_current_device - - -def all_gather(tensor: Tensor, dim: int, parallel_mode: ParallelMode, async_op: bool = False) -> Tensor: - """Gathers all tensors from the parallel group and concatenates them in a - specific dimension. 
-
-    :param tensor: Tensor to be gathered
-    :param dim: The dimension to concatenate along
-    :param parallel_mode: Parallel group mode used in this communication
-    :param async_op: Whether the operation is asynchronous
-
-    :type tensor: :class:`torch.Tensor`
-    :type dim: int
-    :type parallel_mode: :class:`colossalai.context.ParallelMode`
-    :type async_op: bool, optional
-
-    :return: The tensor generated by all-gather
-    :rtype: :class:`torch.Tensor`
-    """
-    depth = gpc.get_world_size(parallel_mode)
-    if depth == 1:
-        out = tensor
-        work = None
-    else:
-        shape = list(tensor.shape)
-        shape[0], shape[dim] = shape[dim], shape[0]
-        shape[0] *= depth
-        out = torch.empty(shape, dtype=tensor.dtype, device=get_current_device())
-        temp = list(torch.chunk(out, depth, dim=0))
-        work = dist.all_gather(tensor_list=temp,
-                               tensor=tensor.transpose(0, dim).contiguous(),
-                               group=gpc.get_group(parallel_mode),
-                               async_op=async_op)
-        out = torch.transpose(out, 0, dim)
-    if async_op:
-        return out, work
-    else:
-        return out
-
-
-def reduce_scatter(tensor: Tensor,
-                   dim: int,
-                   parallel_mode: ParallelMode,
-                   op: ReduceOp = ReduceOp.SUM,
-                   async_op: bool = False) -> Tensor:
-    """Reduces all tensors, then scatters the result in a specific dimension to all
-    members in the parallel group.
-
-    :param tensor: Tensor to be reduced and scattered
-    :param dim: The dimension to scatter along
-    :param parallel_mode: Parallel group mode used in this communication
-    :param op: The type of reduce operation
-    :param async_op: Whether the operation is asynchronous
-
-    :type tensor: :class:`torch.Tensor`
-    :type dim: int
-    :type parallel_mode: :class:`colossalai.context.ParallelMode`
-    :type op: ReduceOp, optional
-    :type async_op: bool, optional
-
-    :return: The tensor generated by reduce-scatter
-    :rtype: :class:`torch.Tensor`
-    """
-    depth = gpc.get_world_size(parallel_mode)
-    if depth == 1:
-        out = tensor
-        work = None
-    else:
-        temp = list(map(lambda x: x.contiguous(), torch.chunk(tensor, depth, dim=dim)))
-        out = torch.empty(temp[0].shape, dtype=tensor.dtype, device=get_current_device())
-        work = dist.reduce_scatter(output=out,
-                                   input_list=temp,
-                                   op=op,
-                                   group=gpc.get_group(parallel_mode),
-                                   async_op=async_op)
-    if async_op:
-        return out, work
-    else:
-        return out
-
-
-def all_reduce(tensor: Tensor,
-               parallel_mode: ParallelMode,
-               op: ReduceOp = ReduceOp.SUM,
-               async_op: bool = False) -> Tensor:
-    depth = gpc.get_world_size(parallel_mode)
-    if depth == 1:
-        out = tensor
-        work = None
-    else:
-        out = tensor.contiguous()
-        work = dist.all_reduce(out, op=op, group=gpc.get_group(parallel_mode), async_op=async_op)
-    if async_op:
-        return out, work
-    else:
-        return out
-
-
-def broadcast(tensor: Tensor, src: int, parallel_mode: ParallelMode, async_op: bool = False):
-    depth = gpc.get_world_size(parallel_mode)
-    if depth == 1:
-        out = tensor
-        work = None
-    else:
-        out = tensor.contiguous()
-        work = dist.broadcast(out, src=src, group=gpc.get_group(parallel_mode), async_op=async_op)
-    if async_op:
-        return out, work
-    else:
-        return out
-
-
-def reduce(tensor: Tensor, dst: int, parallel_mode: ParallelMode, op: ReduceOp = ReduceOp.SUM, async_op: bool = False):
-    depth = gpc.get_world_size(parallel_mode)
-    if depth == 1:
-        out = tensor
-        work = None
-    else:
-        out = tensor.contiguous()
-        work = dist.reduce(out, dst=dst, op=op, group=gpc.get_group(parallel_mode), async_op=async_op)
-    if async_op:
-        return out, work
-    else:
-        return out
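`all_gather` gathers along an arbitrary dimension by transposing it to dim 0, so that each rank's contribution is a contiguous chunk of the output buffer. The shape bookkeeping can be checked without a process group; a single-process sketch where `torch.cat` stands in for `dist.all_gather` and every "rank" contributes the same tensor:

```python
import torch

depth, dim = 4, 1               # world size and gather dimension
tensor = torch.randn(2, 3, 5)   # each rank's local tensor

# the output buffer is allocated with the gather dim moved to dim 0 and scaled
shape = list(tensor.shape)
shape[0], shape[dim] = shape[dim], shape[0]
shape[0] *= depth

# simulate all_gather: every "rank" contributes tensor.transpose(0, dim)
gathered = torch.cat([tensor.transpose(0, dim).contiguous()] * depth, dim=0)
assert list(gathered.shape) == shape

out = torch.transpose(gathered, 0, dim)  # move the gather dim back in place
assert out.shape == (2, 12, 5)           # dim 1 grew by the world size
```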
diff --git a/colossalai/communication/p2p.py b/colossalai/communication/p2p.py
deleted file mode 100644
index 4aefe342da15af83df0a7741b3e131e5c451a47d..0000000000000000000000000000000000000000
--- a/colossalai/communication/p2p.py
+++ /dev/null
@@ -1,356 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from typing import List, Tuple, Union
-import torch
-import torch.distributed as dist
-
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device
-from functools import reduce
-import operator
-from .utils import split_tensor_into_1d_equal_chunks, gather_split_1d_tensor
-
-
-TensorShape = Union[torch.Size, List[int], Tuple[int]]
-
-
-def _get_tensor_shape(tensor_shape: TensorShape, chunk_tensor: bool = False) -> Tuple[TensorShape, bool]:
-    """Get the exact tensor shape used for communication and return whether the tensor is a chunk.
-
-    :param tensor_shape: shape of tensor
-    :type tensor_shape: TensorShape
-    :param chunk_tensor: whether to chunk the tensor, defaults to False
-    :type chunk_tensor: bool, optional
-    :return: exact tensor shape, whether the tensor is a chunk
-    :rtype: Tuple[Union[torch.Size, List[int], Tuple[int]], bool]
-    """
-    if chunk_tensor:
-        tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
-        tensor_parallel_world_size = gpc.get_world_size(ParallelMode.TENSOR)
-        if tensor_chunk_shape % tensor_parallel_world_size == 0:
-            tensor_chunk_shape = tensor_chunk_shape // tensor_parallel_world_size
-        else:
-            tensor_chunk_shape = tensor_shape
-            chunk_tensor = False
-    else:
-        tensor_chunk_shape = tensor_shape
-    return tensor_chunk_shape, chunk_tensor
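When scatter-gather is enabled, a tensor is shipped as a flat 1-D chunk of size `numel / tensor_parallel_world_size`, and only when that division is exact; otherwise the full shape is used. The arithmetic is easy to sanity-check standalone (a sketch mirroring `_get_tensor_shape`, with the world size passed as a plain argument instead of a `gpc` lookup):

```python
import operator
from functools import reduce


def chunk_shape(tensor_shape, chunk_tensor, tensor_parallel_world_size):
    """Return (shape-to-communicate, whether-it-is-a-chunk)."""
    if chunk_tensor:
        numel = reduce(operator.mul, tensor_shape, 1)
        if numel % tensor_parallel_world_size == 0:
            return numel // tensor_parallel_world_size, True
        return tensor_shape, False  # not evenly divisible: fall back to full shape
    return tensor_shape, False


assert chunk_shape((4, 8), chunk_tensor=True, tensor_parallel_world_size=2) == (16, True)
assert chunk_shape((3, 5), chunk_tensor=True, tensor_parallel_world_size=2) == ((3, 5), False)
```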
-
-
-def _communicate(tensor_send_next=None,
-                 tensor_send_prev=None,
-                 recv_prev=False,
-                 recv_next=False,
-                 recv_prev_shape=None,
-                 recv_next_shape=None,
-                 prev_rank=None,
-                 next_rank=None,
-                 dtype=None,
-                 scatter_gather_tensors=False):
-    """
-    Adapted from megatron.p2p_communication.
-    Communicate tensors between stages. Used as a helper method in other
-    communication methods that are used in the pipeline schedule.
-    Takes the following arguments:
-        tensor_send_next: tensor to send to next rank (no tensor sent if
-                          set to None).
-        tensor_send_prev: tensor to send to prev rank (no tensor sent if
-                          set to None).
-        recv_prev: boolean for whether a tensor should be received from the
-                   previous rank.
-        recv_next: boolean for whether a tensor should be received from the
-                   next rank.
-    Returns:
-        (tensor_recv_prev, tensor_recv_next)
-    """
-
-    # Create placeholder tensors for receive in forward and backward directions
-    # if needed.
-    tensor_recv_prev = None
-    tensor_recv_next = None
-
-    if recv_prev:
-        assert recv_prev_shape is not None
-        recv_prev_chunk_shape, recv_prev_split = _get_tensor_shape(recv_prev_shape, scatter_gather_tensors)
-        tensor_recv_prev = torch.empty(recv_prev_chunk_shape,
-                                       requires_grad=True,
-                                       device=get_current_device(),
-                                       dtype=dtype)
-    if recv_next:
-        assert recv_next_shape is not None
-        recv_next_chunk_shape, recv_next_split = _get_tensor_shape(recv_next_shape, scatter_gather_tensors)
-        tensor_recv_next = torch.empty(recv_next_chunk_shape,
-                                       requires_grad=True,
-                                       device=get_current_device(),
-                                       dtype=dtype)
-
-    if tensor_send_prev is not None or recv_prev:
-        if prev_rank is None:
-            prev_rank = gpc.get_prev_global_rank(
-                ParallelMode.PIPELINE)
-
-    if tensor_send_next is not None or recv_next:
-        if next_rank is None:
-            next_rank = gpc.get_next_global_rank(
-                ParallelMode.PIPELINE)
-
-    if tensor_send_prev is not None:
-        send_prev_split = _get_tensor_shape(tensor_send_prev.shape, scatter_gather_tensors)[1]
-        if send_prev_split:
-            tensor_send_prev = split_tensor_into_1d_equal_chunks(tensor_send_prev)
-
-    if tensor_send_next is not None:
-        send_next_split = _get_tensor_shape(tensor_send_next.shape, scatter_gather_tensors)[1]
-        if send_next_split:
-            tensor_send_next = split_tensor_into_1d_equal_chunks(tensor_send_next)
-
-    ops = []
-    if tensor_send_prev is not None:
-        send_prev_op = dist.P2POp(dist.isend, tensor_send_prev, prev_rank)
-        ops.append(send_prev_op)
-    if tensor_recv_prev is not None:
-        recv_prev_op = dist.P2POp(dist.irecv, tensor_recv_prev, prev_rank)
-        ops.append(recv_prev_op)
-    if tensor_recv_next is not None:
-        recv_next_op = dist.P2POp(dist.irecv, tensor_recv_next, next_rank)
-        ops.append(recv_next_op)
-    if tensor_send_next is not None:
-        send_next_op = dist.P2POp(dist.isend, tensor_send_next, next_rank)
-        ops.append(send_next_op)
-    if len(ops) > 0:
-        reqs = dist.batch_isend_irecv(ops)
-        for req in reqs:
-            req.wait()
-        # To protect against race condition when using batch_isend_irecv().
-        torch.cuda.synchronize()
-
-    if recv_prev and recv_prev_split:
-        tensor_recv_prev = gather_split_1d_tensor(tensor_recv_prev).view(recv_prev_shape).requires_grad_()
-    if recv_next and recv_next_split:
-        tensor_recv_next = gather_split_1d_tensor(tensor_recv_next).view(recv_next_shape).requires_grad_()
-    return tensor_recv_prev, tensor_recv_next
-
-
-def recv_forward(input_tensor_shape, prev_rank=None, dtype=torch.float, scatter_gather_tensors=False):
-    """Receives the input tensor from the previous member in pipeline.
-
-    :param input_tensor_shape: The shape of the tensor to be received
-    :param prev_rank: The rank of the source of the tensor
-    :type input_tensor_shape: torch.Size
-    :type prev_rank: int, optional
-    :return: The input tensor in forward step
-    :rtype: :class:`torch.Tensor`
-    """
-    if gpc.is_pipeline_first_stage():
-        input_tensor = None
-    else:
-        input_tensor, _ = _communicate(recv_prev=True,
-                                       recv_prev_shape=input_tensor_shape,
-                                       prev_rank=prev_rank,
-                                       dtype=dtype,
-                                       scatter_gather_tensors=scatter_gather_tensors)
-    return input_tensor
-
-
-def recv_backward(output_grad_shape, next_rank=None, dtype=torch.float, scatter_gather_tensors=False):
-    """Receives the grad tensor from the next member in pipeline.
-
-    :param output_grad_shape: The shape of the tensor to be received
-    :param next_rank: The rank of the source of the tensor
-    :type output_grad_shape: torch.Size
-    :type next_rank: int, optional
-    :return: The grad of output tensor in forward step
-    :rtype: :class:`torch.Tensor`
-    """
-    if gpc.is_pipeline_last_stage():
-        output_tensor_grad = None
-    else:
-        _, output_tensor_grad = _communicate(recv_next=True,
-                                             recv_next_shape=output_grad_shape,
-                                             next_rank=next_rank,
-                                             dtype=dtype,
-                                             scatter_gather_tensors=scatter_gather_tensors)
-    return output_tensor_grad
-
-
-def send_forward(output_tensor, next_rank=None, scatter_gather_tensors=False):
-    """Sends the input tensor to the next member in pipeline.
-
-    :param output_tensor: Tensor to be sent
-    :param next_rank: The rank of the recipient of the tensor
-    :type output_tensor: :class:`torch.Tensor`
-    :type next_rank: int, optional
-    """
-    if not gpc.is_pipeline_last_stage():
-        _communicate(tensor_send_next=output_tensor,
-                     next_rank=next_rank,
-                     scatter_gather_tensors=scatter_gather_tensors)
-
-
-def send_backward(input_tensor_grad, prev_rank=None, scatter_gather_tensors=False):
-    """Sends the grad tensor to the previous member in pipeline.
-
-    :param input_tensor_grad: Tensor to be sent
-    :param prev_rank: The rank of the recipient of the tensor
-    :type input_tensor_grad: :class:`torch.Tensor`
-    :type prev_rank: int, optional
-    """
-    if not gpc.is_pipeline_first_stage():
-        _communicate(tensor_send_prev=input_tensor_grad,
-                     prev_rank=prev_rank,
-                     scatter_gather_tensors=scatter_gather_tensors)
-
-
-def send_forward_recv_backward(output_tensor,
-                               output_grad_shape,
-                               recv_next=True,
-                               next_rank=None,
-                               dtype=torch.float,
-                               scatter_gather_tensors=False):
-    """Batched communication operation. Sends the input tensor to the
-    next member in pipeline, while receiving the grad tensor from the
-    next member in pipeline.
-
-    :param output_tensor: Tensor to be sent
-    :param output_grad_shape: The shape of the tensor to be received
-    :type output_tensor: :class:`torch.Tensor`
-    :type output_grad_shape: :class:`torch.Size`
-    :return: The grad of output tensor in forward step
-    :rtype: :class:`torch.Tensor`
-    """
-    if gpc.is_pipeline_last_stage():
-        output_tensor_grad = None
-    else:
-        _, output_tensor_grad = _communicate(tensor_send_next=output_tensor,
-                                             recv_next=recv_next,
-                                             recv_next_shape=output_grad_shape,
-                                             next_rank=next_rank,
-                                             dtype=dtype,
-                                             scatter_gather_tensors=scatter_gather_tensors)
-    return output_tensor_grad
-
-
-def send_backward_recv_forward(input_tensor_grad,
-                               input_tensor_shape,
-                               recv_prev=True,
-                               prev_rank=None,
-                               dtype=torch.float,
-                               scatter_gather_tensors=False):
-    """Batched communication operation. Sends the grad tensor to the
-    previous member in pipeline, while receiving the input tensor from the
-    previous member in pipeline.
-
-    :param input_tensor_grad: Tensor to be sent
-    :param input_tensor_shape: The shape of the tensor to be received
-    :type input_tensor_grad: :class:`torch.Tensor`
-    :type input_tensor_shape: :class:`torch.Size`
-    :return: The input tensor in forward step
-    :rtype: :class:`torch.Tensor`
-    """
-    if gpc.is_pipeline_first_stage():
-        input_tensor = None
-    else:
-        input_tensor, _ = _communicate(tensor_send_prev=input_tensor_grad,
-                                       recv_prev=recv_prev,
-                                       recv_prev_shape=input_tensor_shape,
-                                       prev_rank=prev_rank,
-                                       dtype=dtype,
-                                       scatter_gather_tensors=scatter_gather_tensors)
-    return input_tensor
-
-
-def send_forward_recv_forward(output_tensor,
-                              input_tensor_shape,
-                              recv_prev=True,
-                              prev_rank=None,
-                              next_rank=None,
-                              dtype=torch.float,
-                              scatter_gather_tensors=False):
-    """Batched communication operation. Sends the input tensor to the
-    next member in pipeline, while receiving the input tensor from the
-    previous member in pipeline.
-
-    :param output_tensor: Tensor to be sent
-    :param input_tensor_shape: The shape of the tensor to be received
-    :type output_tensor: :class:`torch.Tensor`
-    :type input_tensor_shape: :class:`torch.Size`
-    :return: The input tensor in forward step
-    :rtype: :class:`torch.Tensor`
-    """
-    input_tensor, _ = _communicate(tensor_send_next=output_tensor,
-                                   recv_prev=recv_prev,
-                                   recv_prev_shape=input_tensor_shape,
-                                   prev_rank=prev_rank,
-                                   next_rank=next_rank,
-                                   dtype=dtype,
-                                   scatter_gather_tensors=scatter_gather_tensors)
-    return input_tensor
-
-
-def send_backward_recv_backward(input_tensor_grad,
-                                output_grad_shape,
-                                recv_next=True,
-                                prev_rank=None,
-                                next_rank=None,
-                                dtype=torch.float,
-                                scatter_gather_tensors=False):
-    """Batched communication operation. Sends the grad tensor to the
-    previous member in pipeline, while receiving the grad tensor from the
-    next member in pipeline.
-
-    :param input_tensor_grad: Tensor to be sent
-    :param output_grad_shape: The shape of the tensor to be received
-    :type input_tensor_grad: :class:`torch.Tensor`
-    :type output_grad_shape: :class:`torch.Size`
-    :return: The grad of output tensor in forward step
-    :rtype: :class:`torch.Tensor`
-    """
-    _, output_tensor_grad = _communicate(tensor_send_prev=input_tensor_grad,
-                                         recv_next=recv_next,
-                                         recv_next_shape=output_grad_shape,
-                                         prev_rank=prev_rank,
-                                         next_rank=next_rank,
-                                         dtype=dtype,
-                                         scatter_gather_tensors=scatter_gather_tensors)
-    return output_tensor_grad
-
-
-def send_forward_backward_recv_forward_backward(output_tensor,
-                                                input_tensor_grad,
-                                                input_tensor_shape,
-                                                output_grad_shape,
-                                                recv_prev=True,
-                                                recv_next=True,
-                                                prev_rank=None,
-                                                next_rank=None,
-                                                dtype=torch.float,
-                                                scatter_gather_tensors=False):
-    """Batched communication operation. Sends the input tensor to the next and
-    the grad tensor to the previous, while receiving the grad tensor from the
-    next and the input tensor from the previous.
-
-    :param output_tensor: Tensor sent to the next
-    :param input_tensor_grad: Tensor sent to the previous
-    :param input_tensor_shape: The shape of the tensor received from the previous
-    :param output_grad_shape: The shape of the tensor received from the next
-    :type output_tensor: :class:`torch.Tensor`
-    :type input_tensor_grad: :class:`torch.Tensor`
-    :type input_tensor_shape: :class:`torch.Size`
-    :type output_grad_shape: :class:`torch.Size`
-    :return: (the input tensor in forward step, the grad of output tensor in forward step)
-    :rtype: (Tensor, Tensor)
-    """
-    input_tensor, output_tensor_grad = _communicate(
-        tensor_send_next=output_tensor,
-        tensor_send_prev=input_tensor_grad,
-        recv_prev=recv_prev,
-        recv_next=recv_next,
-        recv_prev_shape=input_tensor_shape,
-        recv_next_shape=output_grad_shape,
-        prev_rank=prev_rank,
-        next_rank=next_rank,
-        dtype=dtype,
-        scatter_gather_tensors=scatter_gather_tensors)
-    return input_tensor, output_tensor_grad
diff --git a/colossalai/communication/ring.py b/colossalai/communication/ring.py
deleted file mode 100644
index 6f42e90ab4a3f99bd5307dc54232a3ee1876d427..0000000000000000000000000000000000000000
--- a/colossalai/communication/ring.py
+++ /dev/null
@@ -1,54 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import torch
-
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device, synchronize
-
-
-def ring_forward(tensor_send_next: torch.Tensor, parallel_mode: ParallelMode):
-    """Sends a tensor to the next member and receives a tensor from the previous member.
-    This function returns the received tensor from the previous member.
-
-    :param tensor_send_next: Tensor sent to next member
-    :param parallel_mode: Parallel group mode used in this communication
-    :type tensor_send_next: :class:`torch.Tensor`
-    :type parallel_mode: :class:`colossalai.context.ParallelMode`
-    :return: The tensor received from the previous member
-    :rtype: :class:`torch.Tensor`
-    """
-    buffer_shape = tensor_send_next.size()
-
-    ops = []
-    current_rank = gpc.get_global_rank()
-
-    tensor_recv_prev = torch.empty(buffer_shape,
-                                   requires_grad=True,
-                                   device=get_current_device(),
-                                   dtype=tensor_send_next.dtype)
-
-    # send to next rank
-    send_next_op = torch.distributed.P2POp(
-        torch.distributed.isend, tensor_send_next,
-        gpc.get_next_global_rank(parallel_mode))
-    ops.append(send_next_op)
-
-    # receive from prev rank
-    recv_prev_op = torch.distributed.P2POp(
-        torch.distributed.irecv, tensor_recv_prev,
-        gpc.get_prev_global_rank(parallel_mode))
-    ops.append(recv_prev_op)
-
-    if current_rank % 2 == 0:
-        ops = ops[::-1]
-
-    reqs = torch.distributed.batch_isend_irecv(ops)
-    for req in reqs:
-        req.wait()
-
-    # To protect against race condition when using batch_isend_irecv().
-    synchronize()
-
-    return tensor_recv_prev
diff --git a/colossalai/communication/utils.py b/colossalai/communication/utils.py
deleted file mode 100644
index 234791e324ac152c90e172064f6b30801f98a863..0000000000000000000000000000000000000000
--- a/colossalai/communication/utils.py
+++ /dev/null
@@ -1,109 +0,0 @@
-import torch
-import torch.distributed as dist
-
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device
-
-
-def send_tensor_meta(tensor, need_meta=True, next_rank=None):
-    """Sends tensor meta information before sending a specific tensor.
-    Since the recipient must know the shape of the tensor in p2p communications,
-    meta information of the tensor should be sent before communications. This function
-    synchronizes with :func:`recv_tensor_meta`.
-
-    :param tensor: Tensor to be sent
-    :param need_meta: If False, meta information won't be sent
-    :param next_rank: The rank of the next member in pipeline parallel group
-    :type tensor: Tensor
-    :type need_meta: bool, optional
-    :type next_rank: int
-    :return: False
-    :rtype: bool
-    """
-    if need_meta:
-        if next_rank is None:
-            next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
-
-        tensor_kwargs = {'dtype': torch.long, 'device': get_current_device()}
-
-        send_shape = torch.tensor(tensor.size(), **tensor_kwargs)
-        send_ndims = torch.tensor(len(tensor.size()), **tensor_kwargs)
-        dist.send(send_ndims, next_rank)
-        dist.send(send_shape, next_rank)
-
-    return False
-
-
-def recv_tensor_meta(tensor_shape, prev_rank=None):
-    """Receives tensor meta information before receiving a specific tensor.
-    Since the recipient must know the shape of the tensor in p2p communications,
-    meta information of the tensor should be received before communications. This function
-    synchronizes with :func:`send_tensor_meta`.
-
-    :param tensor_shape: The shape of the tensor to be received
-    :param prev_rank: The rank of the source of the tensor
-    :type tensor_shape: torch.Size
-    :type prev_rank: int, optional
-    :return: The shape of the tensor to be received
-    :rtype: torch.Size
-    """
-    if tensor_shape is None:
-        if prev_rank is None:
-            prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
-
-        tensor_kwargs = {'dtype': torch.long, 'device': get_current_device()}
-
-        recv_ndims = torch.empty((), **tensor_kwargs)
-        dist.recv(recv_ndims, prev_rank)
-        recv_shape = torch.empty(recv_ndims, **tensor_kwargs)
-        dist.recv(recv_shape, prev_rank)
-
-        tensor_shape = torch.Size(recv_shape)
-
-    return tensor_shape
-
-
-def split_tensor_into_1d_equal_chunks(tensor, new_buffer=False):
-    """Break a tensor into equal 1D chunks.
-
-    :param tensor: Tensor to be split before communication
-    :param new_buffer: Whether to use a new buffer to store the sliced tensor
-
-    :type tensor: torch.Tensor
-    :type new_buffer: bool, optional
-
-    :return: The split tensor
-    :rtype: torch.Tensor
-    """
-    partition_size = torch.numel(tensor) // gpc.get_world_size(ParallelMode.PARALLEL_1D)
-    start_index = partition_size * gpc.get_local_rank(ParallelMode.PARALLEL_1D)
-    end_index = start_index + partition_size
-    if new_buffer:
-        data = torch.empty(partition_size, dtype=tensor.dtype,
-                           device=torch.cuda.current_device(),
-                           requires_grad=False)
-        data.copy_(tensor.view(-1)[start_index:end_index])
-    else:
-        data = tensor.view(-1)[start_index:end_index]
-    return data
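`split_tensor_into_1d_equal_chunks` and `gather_split_1d_tensor` (below) are inverses: each rank keeps a `numel / world_size` slice of the flattened tensor, and gathering concatenates the slices back in rank order. A single-process sketch of that round trip, with the world size and ranks as plain loop variables instead of `gpc` lookups:

```python
import torch

world_size = 4
tensor = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
partition_size = tensor.numel() // world_size

# what each rank would keep after split_tensor_into_1d_equal_chunks
chunks = [tensor.view(-1)[r * partition_size:(r + 1) * partition_size]
          for r in range(world_size)]

# what gather_split_1d_tensor would rebuild via all_gather
gathered = torch.cat(chunks).view(tensor.shape)
assert torch.equal(gathered, tensor)
```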
-
-
-def gather_split_1d_tensor(tensor):
-    """Opposite of :func:`split_tensor_into_1d_equal_chunks`: gathers values from model parallel ranks.
-
-    :param tensor: Tensor to be gathered after communication
-    :type tensor: torch.Tensor
-
-    :return: The gathered tensor
-    :rtype: torch.Tensor
-    """
-    world_size = gpc.get_world_size(ParallelMode.PARALLEL_1D)
-    numel = torch.numel(tensor)
-    numel_gathered = world_size * numel
-    gathered = torch.empty(numel_gathered, dtype=tensor.dtype,
-                           device=torch.cuda.current_device(),
-                           requires_grad=False)
-    chunks = [gathered[i*numel:(i+1)*numel] for i in range(world_size)]
-    dist.all_gather(chunks, tensor, group=gpc.get_group(ParallelMode.PARALLEL_1D))
-    return gathered
diff --git a/colossalai/constants.py b/colossalai/constants.py
deleted file mode 100644
index 33babff9616337e68af0ae92141b9cb6a11b2f7b..0000000000000000000000000000000000000000
--- a/colossalai/constants.py
+++ /dev/null
@@ -1,30 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-ALLOWED_MODES = [None, '1d', '2d', '2.5d', '3d', 'sequence']
-TENSOR_PARALLEL_MODE = 'tensor_parallel_mode'
-
-# initializer
-INITIALIZER_MAPPING = {
-    'data': 'Initializer_Data',
-    'tensor': 'Initializer_Tensor',
-    'pipeline': 'Initializer_Pipeline',
-    'embedding': 'Initializer_Embedding',
-    '1d': 'Initializer_1D',
-    '2d': 'Initializer_2D',
-    '2.5d': 'Initializer_2p5D',
-    '3d': 'Initializer_3D',
-    'sequence': 'Initializer_Sequence',
-    'model': 'Initializer_Model',
-    'moe': 'Initializer_Moe'
-}
-
-# 3D parallelism groups
-INPUT_GROUP_3D = 'input_group_3d'
-WEIGHT_GROUP_3D = 'weight_group_3d'
-OUTPUT_GROUP_3D = 'output_group_3d'
-
-# Attributes of tensor parallel parameters
-IS_TENSOR_PARALLEL = 'is_tensor_parallel'
-NUM_PARTITIONS = 'num_partitions'
-TENSOR_PARALLEL_ATTRIBUTES = [IS_TENSOR_PARALLEL, NUM_PARTITIONS]
diff --git a/colossalai/context/__init__.py b/colossalai/context/__init__.py
deleted file mode 100644
index ac14087739a7a3b5004492536d5db57e1ae91bd7..0000000000000000000000000000000000000000
--- a/colossalai/context/__init__.py
+++ /dev/null
@@ -1,5 +0,0 @@
-from .config import Config, ConfigException
-from .parallel_context import ParallelContext
-from .parallel_mode import ParallelMode
-from .process_group_initializer import *
-from .random import *
diff --git a/colossalai/context/__pycache__/__init__.cpython-36.pyc b/colossalai/context/__pycache__/__init__.cpython-36.pyc
deleted file mode 100644
index 5d4294637ce8fcfdd752ecc2502d7eddade17b40..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/__init__.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/__init__.cpython-37.pyc b/colossalai/context/__pycache__/__init__.cpython-37.pyc
deleted file mode 100644
index 9857254f5aa26c92ab84be8885444e39d9113c9e..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/__init__.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/config.cpython-36.pyc b/colossalai/context/__pycache__/config.cpython-36.pyc
deleted file mode 100644
index a1284cdacbf739d27c8534c96e4d40fd10bab38e..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/config.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/config.cpython-37.pyc b/colossalai/context/__pycache__/config.cpython-37.pyc
deleted file mode 100644
index c6bdf18da70351429cd703ffe400d3d150f5ea3a..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/config.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/parallel_context.cpython-36.pyc
b/colossalai/context/__pycache__/parallel_context.cpython-36.pyc
deleted file mode 100644
index 3abc72d58c3d8cb1bb21f287ded7d077d46679d9..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/parallel_context.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/parallel_context.cpython-37.pyc b/colossalai/context/__pycache__/parallel_context.cpython-37.pyc
deleted file mode 100644
index 181314ee7924c3d27fa22208725780a98f62145a..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/parallel_context.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/parallel_mode.cpython-36.pyc b/colossalai/context/__pycache__/parallel_mode.cpython-36.pyc
deleted file mode 100644
index e0d766e8fef75959c9a06e38fffd7bdaf1ff9541..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/parallel_mode.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/context/__pycache__/parallel_mode.cpython-37.pyc b/colossalai/context/__pycache__/parallel_mode.cpython-37.pyc
deleted file mode 100644
index f8d82d922259942e558d5e83d62413c351428dcf..0000000000000000000000000000000000000000
Binary files a/colossalai/context/__pycache__/parallel_mode.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/context/config.py b/colossalai/context/config.py
deleted file mode 100644
index de1e11c9ff72f3e0f241770a7bab1e8ac10fcf64..0000000000000000000000000000000000000000
--- a/colossalai/context/config.py
+++ /dev/null
@@ -1,104 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import inspect
-import sys
-from importlib.machinery import SourceFileLoader
-from pathlib import Path
-from colossalai.logging import get_dist_logger
-
-
-class Config(dict):
-    """This is a wrapper class for dict objects so that their values can be
-    accessed as attributes.
-
-    :param config: The dict object to be wrapped
-    :type config: dict
-    """
-
-    def __init__(self, config: dict = None):
-        if config is not None:
-            for k, v in config.items():
-                self._add_item(k, v)
-
-    def __missing__(self, key):
-        raise KeyError(key)
-
-    def __getattr__(self, key):
-        try:
-            value = super(Config, self).__getitem__(key)
-            return value
-        except KeyError:
-            raise AttributeError(key)
-
-    def __setattr__(self, key, value):
-        super(Config, self).__setitem__(key, value)
-
-    def _add_item(self, key, value):
-        if isinstance(value, dict):
-            self.__setattr__(key, Config(value))
-        else:
-            self.__setattr__(key, value)
-
-    def update(self, config):
-        assert isinstance(config, (Config, dict)), 'can only update dictionary or Config objects.'
-        for k, v in config.items():
-            self._add_item(k, v)
-        return self
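Because `_add_item` wraps nested dicts recursively, a `Config` behaves like a dot-accessible tree while remaining a plain `dict`. A quick sketch of the access pattern (assuming `Config` is importable as `colossalai.context.Config`, the path exported by the package `__init__` shown earlier; the config values themselves are made up):

```python
from colossalai.context import Config  # assumed import path per the package __init__

cfg = Config(dict(
    model=dict(type='VisionTransformerFromConfig', embed_dim=768),
    train=dict(epochs=100),
))

assert cfg.model.type == 'VisionTransformerFromConfig'  # nested dicts become Config
assert cfg['train']['epochs'] == 100                    # plain dict access still works
cfg.update(dict(train=dict(epochs=300)))
assert cfg.train.epochs == 300
```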
-
-    @staticmethod
-    def from_file(filename: str):
-        """Reads a python file and constructs a corresponding :class:`Config` object.
-
-        :param filename: Name of the file to construct the return object
-        :type filename: str
-        :raises AssertionError: Raises an AssertionError if the file does not exist, or the file
-            is not a .py file
-        :return: A :class:`Config` object constructed with information in the file
-        :rtype: :class:`Config`
-        """
-
-        # check config path
-        if isinstance(filename, str):
-            filepath = Path(filename).absolute()
-        elif isinstance(filename, Path):
-            filepath = filename.absolute()
-
-        assert filepath.exists(), f'{filename} is not found, please check your configuration path'
-
-        # check extension
-        extension = filepath.suffix
-        assert extension == '.py', 'only .py files are supported'
-
-        # import the config as module
-        remove_path = False
-        if filepath.parent not in sys.path:
-            sys.path.insert(0, str(filepath))  # sys.path entries should be strings
-            remove_path = True
-
-        module_name = filepath.stem
-        source_file = SourceFileLoader(fullname=str(module_name), path=str(filepath))
-        module = source_file.load_module()
-
-        # load into config
-        config = Config()
-
-        for k, v in module.__dict__.items():
-            if k.startswith('__') or inspect.ismodule(v) or inspect.isclass(v):
-                continue
-            else:
-                config._add_item(k, v)
-
-        logger = get_dist_logger()
-        logger.debug('variables starting with __, modules and class declarations are omitted in the config file')
-
-        # remove module
-        del sys.modules[module_name]
-        if remove_path:
-            sys.path.pop(0)
-
-        return config
-
-
-class ConfigException(Exception):
-    pass
diff --git a/colossalai/context/parallel_context.py b/colossalai/context/parallel_context.py
deleted file mode 100644
index b81c0b4524c3290fb32eb77598a92788cba103aa..0000000000000000000000000000000000000000
--- a/colossalai/context/parallel_context.py
+++ /dev/null
@@ -1,530 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import os
-import random
-from typing import Union
-
-import numpy as np
-import torch
-import torch.distributed as dist
-from colossalai.constants import ALLOWED_MODES, INITIALIZER_MAPPING
-from colossalai.context.config import Config
-from colossalai.global_variables import moe_env
-from colossalai.global_variables import tensor_parallel_env as env
-from colossalai.logging import get_dist_logger
-from colossalai.registry import DIST_GROUP_INITIALIZER
-
-from .parallel_mode import ParallelMode
-from .random import add_seed, get_seeds, set_mode
-
-
-class ParallelContext:
-    """This class provides interface functions for users to get the parallel context,
-    such as the global rank, the local rank, the world size, etc. of each device.
- - """ - - __instance = None - - @staticmethod - def get_instance(): - if ParallelContext.__instance is None: - ParallelContext() - return ParallelContext.__instance - - def __init__(self): - # create a singleton instance - if ParallelContext.__instance is not None: - raise Exception( - 'ParallelContext is a singleton class, you should get the instance by colossalai.core.global_context') - else: - ParallelContext.__instance = self - - # distributed settings - self._global_ranks = dict() - self._local_ranks = dict() - self._world_sizes = dict() - self._groups = dict() - self._ranks_in_group = dict() - - # load config from file - self._config = None - - # default 3D parallel args, will be overwritten during process group intialization - self.world_size = 1 - self.data_parallel_size = 1 - self.pipeline_parallel_size = 1 - self.tensor_parallel_size = 1 - self.virtual_pipeline_parallel_size = None - self.virtual_pipeline_parallel_rank = None - - # logging - self._verbose = False - self._logger = get_dist_logger() - - @property - def config(self): - return self._config - - @property - def verbose(self): - return self._verbose - - @verbose.setter - def verbose(self, verbose_: bool): - self._verbose = verbose_ - - def load_config(self, config: Union[dict, str]): - """Loads the configuration from either a dict or a file. - - :param config: Either a dict containing the configuration information or the filename - of a file containing the configuration information - :type config: dict or str - :raises TypeError: Raises a TypeError if `config` is neither a dict or a str - """ - if isinstance(config, str): - self._config = Config.from_file(config) - elif isinstance(config, dict): - self._config = Config(config) - else: - raise TypeError("Invalid type for config, only dictionary or string is supported") - - @staticmethod - def _check_parallel_mode(parallel_mode: ParallelMode): - assert isinstance(parallel_mode, ParallelMode) - - def get_global_rank(self): - """Returns the global rank of the current device. - - :return: The global rank of the current device - :rtype: int - """ - return self._global_ranks[ParallelMode.GLOBAL] - - def add_global_rank(self, parallel_mode: ParallelMode, rank: int): - """Adds the global rank of the current device for `parallel_mode` to the context. - - :param parallel_mode: The parallel mode for the rank - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :param rank: The rank to be added - :type rank: int - :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance - of :class:`colossalai.context.ParallelMode` - """ - self._check_parallel_mode(parallel_mode) - self._global_ranks[parallel_mode] = rank - - def get_local_rank(self, parallel_mode: ParallelMode): - """Returns the local rank of the current device. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance - of :class:`colossalai.context.ParallelMode` - :return: The local rank of the current device for `parallel_mode` - :rtype: int - """ - self._check_parallel_mode(parallel_mode) - return self._local_ranks[parallel_mode] - - def add_local_rank(self, parallel_mode: ParallelMode, rank: int): - """Adds the local rank of the current device for `parallel_mode` to the context. 
-
-        :param parallel_mode: The parallel mode for the rank
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :param rank: The rank to be added
-        :type rank: int
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        """
-        self._check_parallel_mode(parallel_mode)
-        self._local_ranks[parallel_mode] = rank
-
-    def get_next_global_rank(self, parallel_mode: ParallelMode):
-        """Returns the global rank of the next device.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: The global rank of the next device for `parallel_mode`
-        :rtype: int
-        """
-        self._check_parallel_mode(parallel_mode)
-
-        # get rank and world size
-        local_rank = self.get_local_rank(parallel_mode)
-        world_size = self.get_world_size(parallel_mode)
-        ranks_in_group = self.get_ranks_in_group(parallel_mode)
-
-        return ranks_in_group[(local_rank + 1) % world_size]
-
-    def get_prev_global_rank(self, parallel_mode: ParallelMode):
-        """Returns the global rank of the previous device.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: The global rank of the previous device for `parallel_mode`
-        :rtype: int
-        """
-        self._check_parallel_mode(parallel_mode)
-
-        # get rank and world size
-        local_rank = self.get_local_rank(parallel_mode)
-        world_size = self.get_world_size(parallel_mode)
-        ranks_in_group = self.get_ranks_in_group(parallel_mode)
-
-        return ranks_in_group[(local_rank - 1) % world_size]
-
-    def is_first_rank(self, parallel_mode: ParallelMode):
-        """Returns a boolean value indicating whether the current device is the first one
-        among its group for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: a boolean value indicating whether the current device is the first one
-            among its group for `parallel_mode`
-        :rtype: bool
-        """
-        rank = self.get_local_rank(parallel_mode)
-        return rank == 0
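The next/previous lookups implement a ring: neighbours are found by stepping the local rank modulo the group size and mapping back through the group's global-rank list. The index arithmetic in isolation (plain Python, with a made-up rank list instead of a live process group):

```python
ranks_in_group = [3, 7, 11, 15]  # hypothetical global ranks of one pipeline group
world_size = len(ranks_in_group)

local_rank = 3                   # this device is global rank 15
next_rank = ranks_in_group[(local_rank + 1) % world_size]
prev_rank = ranks_in_group[(local_rank - 1) % world_size]

assert next_rank == 3    # wraps around to the first member
assert prev_rank == 11
```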
-
-    def is_last_rank(self, parallel_mode: ParallelMode):
-        """Returns a boolean value indicating whether the current device is the last one
-        among its group for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: a boolean value indicating whether the current device is the last one
-            among its group for `parallel_mode`
-        :rtype: bool
-        """
-        rank = self.get_local_rank(parallel_mode)
-        world_size = self.get_world_size(parallel_mode)
-        return rank == world_size - 1
-
-    def is_pipeline_first_stage(self, ignore_virtual=False):
-        if not ignore_virtual:
-            if self.virtual_pipeline_parallel_size is not None and self.virtual_pipeline_parallel_rank != 0:
-                return False
-        return self.is_first_rank(ParallelMode.PIPELINE)
-
-    def is_pipeline_last_stage(self, ignore_virtual=False):
-        if not ignore_virtual:
-            if self.virtual_pipeline_parallel_size is not None and self.virtual_pipeline_parallel_rank != self.virtual_pipeline_parallel_size - 1:
-                return False
-        return self.is_last_rank(ParallelMode.PIPELINE)
-
-    def get_world_size(self, parallel_mode: ParallelMode):
-        """Returns the world size for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: The world size for `parallel_mode`
-        :rtype: int
-        """
-        self._check_parallel_mode(parallel_mode)
-        return self._world_sizes[parallel_mode]
-
-    def add_world_size(self, parallel_mode: ParallelMode, world_size: int):
-        """Adds world size for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :param world_size: The world size to be added
-        :type world_size: int
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        """
-        self._check_parallel_mode(parallel_mode)
-        self._world_sizes[parallel_mode] = world_size
-
-    def get_group(self, parallel_mode: ParallelMode):
-        """Returns the group of the current device for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: The group of the current device for `parallel_mode`
-        :rtype: torch.distributed.ProcessGroup
-        """
-        self._check_parallel_mode(parallel_mode)
-        return self._groups[parallel_mode]
-
-    def add_group(self, parallel_mode: ParallelMode, group: dist.ProcessGroup):
-        """Adds the group of the current device for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :param group: The group to be added
-        :type group: torch.distributed.ProcessGroup
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        """
-        self._check_parallel_mode(parallel_mode)
-        self._groups[parallel_mode] = group
-
-    def get_ranks_in_group(self, parallel_mode: ParallelMode):
-        """Returns the list of global ranks of the whole group for `parallel_mode`.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        :return: the list of global ranks in the group for `parallel_mode`
-        :rtype: list
-        """
-        self._check_parallel_mode(parallel_mode)
-        return self._ranks_in_group[parallel_mode]
-
-    def add_ranks_in_group(self, parallel_mode: ParallelMode, ranks: list):
-        """Adds the ranks of the current device for `parallel_mode` in the group.
-
-        :param parallel_mode: The chosen parallel mode
-        :type parallel_mode: :class:`colossalai.context.ParallelMode`
-        :param ranks: List of ranks to be added
-        :type ranks: list
-        :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance
-            of :class:`colossalai.context.ParallelMode`
-        """
-        self._check_parallel_mode(parallel_mode)
-        self._ranks_in_group[parallel_mode] = ranks
-
-    def init_global_dist(self,
-                         rank: int,
-                         world_size: int,
-                         backend: str,
-                         host: str,
-                         port: int
-                         ):
-        """Initializes the global distributed environment.
-
-        :param rank: rank for the default process group
-        :type rank: int
-        :param world_size: world size of the default process group
-        :type world_size: int
-        :param host: the master address for distributed training
-        :type host: str
-        :param port: the master port for distributed training
-        :type port: int
-        :param backend: backend for torch.distributed
-        :type backend: str
-        """
-        # initialize the default process group
-        init_method = f'tcp://{host}:{port}'
-        dist.init_process_group(rank=rank,
-                                world_size=world_size,
-                                backend=backend,
-                                init_method=init_method)
-
-        # None will give the default global process group for pytorch dist operations
-        self._register_dist(rank, world_size, None,
-                            list(range(world_size)), ParallelMode.GLOBAL)
-        self.add_global_rank(ParallelMode.GLOBAL, rank)
-
-    def _register_dist(self, local_rank, world_size,
-                       process_group, ranks_in_group, mode):
-        self.add_local_rank(mode, local_rank)
-        self.add_world_size(mode, world_size)
-        self.add_group(mode, process_group)
-        self.add_ranks_in_group(mode, ranks_in_group)
-
-    def check_sanity(self):
-        """Checks sanity of the parallel context.
-
-        :raises AssertionError: Raises an AssertionError if the world size does not equal the product
-            of data parallel size, pipeline parallel size and tensor parallel size
-        """
-        dps = self.data_parallel_size
-        pps = self.pipeline_parallel_size
-        tps = self.tensor_parallel_size
-        ws = self.world_size
-        assert ws == dps * pps * \
-            tps, f"Expected the world size {ws} to be equal to data parallel size ({dps}) * pipeline parallel size ({pps}) * tensor parallel size ({tps})"
-
-    def _set_parallel_size_from_config(self, config: dict, key: str, attr_name: str):
-        if key in config:
-            ele = config[key]
-            if isinstance(ele, int):
-                setattr(self, attr_name, ele)
-            elif isinstance(ele, dict):
-                setattr(self, attr_name, ele['size'])
-            else:
-                raise NotImplementedError(
-                    f"Parallel configuration does not support this kind of argument, please use int or dict"
-                )
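`init_parallel_groups` (below) reads these sizes from the `parallel` field of the loaded config: `pipeline` and `tensor` may each be an int or a dict with a `size` key, the tensor entry may additionally carry a `mode`, and the data-parallel size is derived rather than user-set. A config sketch consistent with that parsing (values illustrative, for a hypothetical 16-GPU job):

```python
# world_size = 16 = data(2) x pipeline(2) x tensor(4); the data size is inferred
parallel = dict(
    pipeline=2,                      # int form, or equivalently dict(size=2)
    tensor=dict(size=4, mode='2d'),  # mode must be one of ALLOWED_MODES
)
```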
- - :raises AssertionError: Raises an AssertionError if the field parallel is not present in the config file - """ - - # get rank and world size - rank = self.get_global_rank() - world_size = self.get_world_size(ParallelMode.GLOBAL) - self.world_size = world_size - - # set parallel size as attributes for global context - parallel_config = self.config.get('parallel', None) - if parallel_config is not None: - self._set_parallel_size_from_config(parallel_config, 'pipeline', 'pipeline_parallel_size') - self._set_parallel_size_from_config(parallel_config, 'tensor', 'tensor_parallel_size') - - # the user should not set the data parallel size manually - # instead, it should be calculated based on other parallel config - self.data_parallel_size = self.world_size // (self.pipeline_parallel_size * self.tensor_parallel_size) - - # get the tensor parallel mode and check - tensor_parallel_mode = None - if parallel_config is not None and 'tensor' in parallel_config and 'mode' in parallel_config['tensor']: - tensor_parallel_mode = parallel_config['tensor']['mode'] - assert tensor_parallel_mode in ALLOWED_MODES, f"mode in the parallel config must be set to one of {ALLOWED_MODES}" - env.mode = tensor_parallel_mode - - self.check_sanity() - - pg_init = [] - # LSG: init data parallel process group for compatibility with other parallel modules such as ZeRO - pg_init.append(dict(type=INITIALIZER_MAPPING['data'])) - - # LSG: init model parallel process group for compatibility with amp and clip grad - pg_init.append(dict(type=INITIALIZER_MAPPING['model'])) - - if self.pipeline_parallel_size > 1: - pg_init.append(dict(type=INITIALIZER_MAPPING['pipeline'])) - pg_init.append(dict(type=INITIALIZER_MAPPING['tensor'])) - - # init specific tensor parallel group - if tensor_parallel_mode is not None: - tensor_parallel_cfg = parallel_config['tensor'].copy() - - # remove duplicate parameters - tensor_parallel_cfg.pop('mode') - tensor_parallel_cfg.pop('size') - - # add this config to initialize later - pg_init.append(dict(type=INITIALIZER_MAPPING[tensor_parallel_mode.lower()], **tensor_parallel_cfg)) - - # initialization for moe environment - if parallel_config is not None and 'moe' in parallel_config: - param = parallel_config['moe'] - assert 'size' in param, "MoE model parallel size should be given" - moe_env.setup(param['size']) - pg_init.append(dict(type=INITIALIZER_MAPPING['moe'])) - - # run initialization of different process groups - for initializer_cfg in pg_init: - cfg = initializer_cfg.copy() - initializer_type = cfg.pop('type') - initializer = DIST_GROUP_INITIALIZER.get_module(initializer_type)( - rank, world_size, self.config, - self.data_parallel_size, - self.pipeline_parallel_size, - self.tensor_parallel_size, - **cfg) - parallel_setting = initializer.init_dist_group() - if isinstance(parallel_setting, list): - for args in parallel_setting: - self._register_dist(*args) - else: - self._register_dist(*parallel_setting) - - def is_initialized(self, parallel_mode: ParallelMode): - """Returns a boolean value indicating whether `parallel_mode` is initialized - in the current system. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :return: a boolean value indicating whether `parallel_mode` is initialized - in the current system - :rtype: bool - """ - return parallel_mode in self._groups - - def destroy(self): - """Destroys the current distributed parallel environment.
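The size bookkeeping in `init_parallel_groups` rests on the invariant enforced by `check_sanity`: `world_size == data_parallel_size * pipeline_parallel_size * tensor_parallel_size`, with the data parallel size derived rather than configured. A small sketch of that derivation (a standalone helper written for illustration, assuming the same config shape of `int` or `dict` with a `size` key):

```python
# Derive the data parallel size from world size and the parallel config,
# mirroring the arithmetic in init_parallel_groups above.
def infer_data_parallel_size(world_size: int, parallel_config: dict) -> int:
    def _size(key):
        ele = parallel_config.get(key, 1)
        return ele if isinstance(ele, int) else ele['size']

    pp, tp = _size('pipeline'), _size('tensor')
    assert world_size % (pp * tp) == 0, \
        f"world size {world_size} is not divisible by pipeline ({pp}) * tensor ({tp})"
    return world_size // (pp * tp)


# e.g. 32 GPUs with a 4-stage pipeline and tensor parallel size 8 -> dp = 1
print(infer_data_parallel_size(32, {'pipeline': 4, 'tensor': {'size': 8, 'mode': '2.5d'}}))
```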
- """ - for mode, group in self._groups.items(): - if mode is not ParallelMode.GLOBAL: - dist.destroy_process_group(group) - # destroy global process group - dist.destroy_process_group() - - def set_device(self, device_ordinal: int = None): - """Sets distributed processes to be bound to devices. - - :param device_ordinal: the device id to be bound to - :type device_ordinal: int, optional - """ - global_rank = self.get_global_rank() - if device_ordinal is None: - devices_per_node = torch.cuda.device_count() - device_ordinal = global_rank % devices_per_node - - torch.cuda.set_device(device_ordinal) - if self._verbose: - self._logger.info(f'process rank {global_rank} is bound to device {device_ordinal}') - - def set_seed(self, seed: int): - """Sets seeds for all random libraries. - - :param seed: seed for random states - :type seed: int - """ - random.seed(seed) - np.random.seed(seed) - torch.manual_seed(seed) - - global_rank = self.get_global_rank() - - if torch.cuda.is_available(): - # create random seed for different parallel modes - # data parallel seed are kept the same - parallel_seed = seed - add_seed(ParallelMode.DATA, parallel_seed) - - # model parallel seeds are different across ranks - pipeline_offset = self._local_ranks.get(ParallelMode.PIPELINE, 0) - - # add seed for data parallel and tensor parallel only - if self.is_initialized(ParallelMode.TENSOR): - tp_rank = self.get_local_rank(ParallelMode.TENSOR) - # 100 is only to increase the diff in seeds between pipeline stages - tp_rank_with_offset = tp_rank + pipeline_offset * 1024 - tp_seed = seed + tp_rank_with_offset - add_seed(ParallelMode.TENSOR, tp_seed) - - set_mode(ParallelMode.DATA) - seeds = get_seeds() - seed_str = ', '.join([f'{k}: {v}' for k, v in seeds.items()]) - - if self._verbose: - self._logger.info( - f"initialized seed on rank {global_rank}, " - f"numpy: {seed}, python random: {seed}, {seed_str}," - f"the default parallel seed is {ParallelMode.DATA}.") - else: - if self._verbose: - self._logger.info( - f"initialized seed on rank {global_rank}, " - f"numpy: {seed}, python random: {seed}, pytorch: {seed}", - ranks=[0]) - self._logger.info( - 'WARNING: CUDA is not available, thus CUDA RNG cannot be used to track CUDA random number states', - ranks=[0]) - - def set_virtual_pipeline_parallel_size(self, size): - self.virtual_pipeline_parallel_size = size - - def set_virtual_pipeline_parallel_rank(self, rank): - self.virtual_pipeline_parallel_rank = rank diff --git a/colossalai/context/parallel_mode.py b/colossalai/context/parallel_mode.py deleted file mode 100644 index 34c3ad475642bbf440424d126340d340cbab4e7c..0000000000000000000000000000000000000000 --- a/colossalai/context/parallel_mode.py +++ /dev/null @@ -1,51 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from enum import Enum - - -# parallel modes -class ParallelMode(Enum): - """This is an enumeration class containing all possible parallel modes. 
- """ - - GLOBAL = 'global' - - # common parallel - DATA = 'data' - - # model parallel - containing tensor and pipeline parallel groups - # this is added to facilitate amp and grad clipping in hybrid parallel - MODEL = 'model' - - # pipeline parallel - PIPELINE = 'pipe' - - # containing all ranks in tensor parallel - TENSOR = 'tensor' - - # sequence parallel - SEQUENCE = 'sequence' - SEQUENCE_DP = 'sequence_dp' - - # 1D Parallel - PARALLEL_1D = '1d' - - # 2D parallel - PARALLEL_2D_ROW = '2d_row' - PARALLEL_2D_COL = '2d_col' - - # 3D parallel - PARALLEL_3D_INPUT = '3d_input' - PARALLEL_3D_WEIGHT = '3d_weight' - PARALLEL_3D_OUTPUT = '3d_output' - - # 2.5D parallel - PARALLEL_2P5D_ROW = '2p5d_row' - PARALLEL_2P5D_COL = '2p5d_col' - PARALLEL_2P5D_DEP = '2p5d_dep' - PARALLEL_2P5D_XZ = '2p5d_xz' - - # MOE parallel - MOE_DATA = 'moe_data' - MOE_MODEL = 'moe_model' diff --git a/colossalai/context/process_group_initializer/__init__.py b/colossalai/context/process_group_initializer/__init__.py deleted file mode 100644 index e8262162b84b383a9a371bffd3b088b4c9d862f5..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/__init__.py +++ /dev/null @@ -1,18 +0,0 @@ -from .initializer_1d import Initializer_1D -from .initializer_2d import Initializer_2D -from .initializer_2p5d import Initializer_2p5D -from .initializer_3d import Initializer_3D -from .initializer_data import Initializer_Data -from .initializer_pipeline import Initializer_Pipeline -from .initializer_sequence import Initializer_Sequence -from .initializer_tensor import Initializer_Tensor -from .initializer_model import Initializer_Model -from .initializer_moe import Initializer_Moe -from .process_group_initializer import ProcessGroupInitializer - -__all__ = [ - 'Initializer_Tensor', 'Initializer_Sequence', 'Initializer_Pipeline', - 'Initializer_Data', 'Initializer_2p5D', 'Initializer_2D', 'Initializer_3D', - 'Initializer_1D', 'ProcessGroupInitializer', 'Initializer_Model', - 'Initializer_Moe' -] diff --git a/colossalai/context/process_group_initializer/__pycache__/__init__.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 90f09f86e66dba477fe8614a0156c3465a35246e..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/__init__.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 3f43a63e1601e9fc6fa9db3a6f66577c7eeb9a2c..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_1d.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_1d.cpython-36.pyc deleted file mode 100644 index 82602a7f11fc94897d0b1d10f5b2bd3fd60c881b..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_1d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_1d.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_1d.cpython-37.pyc deleted file mode 100644 index 
95d367a1cc842f3fa70066ff1d2fc00d587e5a3b..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_1d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_2d.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_2d.cpython-36.pyc deleted file mode 100644 index e9af26719a277f056ad9b9309ce30b04b8016a36..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_2d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_2d.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_2d.cpython-37.pyc deleted file mode 100644 index 79e31ff14496491a0a21fb23dcf22432d051dbc8..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_2d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_2p5d.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_2p5d.cpython-36.pyc deleted file mode 100644 index d6c506f0df7a67683346be777b057e903e975625..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_2p5d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_2p5d.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_2p5d.cpython-37.pyc deleted file mode 100644 index a4a9958c9580d64d25483ced7298fcaa4e0650eb..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_2p5d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_3d.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_3d.cpython-36.pyc deleted file mode 100644 index cd0a3a8d42b4c44aa211d6bfb2e0ae4a1d08050f..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_3d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_3d.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_3d.cpython-37.pyc deleted file mode 100644 index c15707b9265ee5b1d5343931f2ad1dedffe40c36..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_3d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_data.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_data.cpython-36.pyc deleted file mode 100644 index b01fd644b22ead7d939afbba42d2f0578ccf0de6..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_data.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_data.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_data.cpython-37.pyc deleted file mode 100644 index a37e824dcf8a3b3db7ba9002b91f7b77dd48cd66..0000000000000000000000000000000000000000 Binary files 
a/colossalai/context/process_group_initializer/__pycache__/initializer_data.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_model.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_model.cpython-36.pyc deleted file mode 100644 index 63a2534fddf1ec635c66b78e9bbb4fcb90282ff2..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_model.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_model.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_model.cpython-37.pyc deleted file mode 100644 index b61a1e074b4b422162b548fc9793f382f9700885..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_model.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_moe.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_moe.cpython-36.pyc deleted file mode 100644 index eaf3eee9288f6950ebd3b1faa8be549f86682048..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_moe.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_moe.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_moe.cpython-37.pyc deleted file mode 100644 index fca080bf2cf7aaa8bdab0ac2cf88028520991202..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_moe.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_pipeline.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_pipeline.cpython-36.pyc deleted file mode 100644 index ff871dc34fbabb172f2096bd6602a57b077dfcc8..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_pipeline.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_pipeline.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_pipeline.cpython-37.pyc deleted file mode 100644 index 68f4ec3ea51a0d16af4628267f8c4b1847a16906..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_pipeline.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_sequence.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_sequence.cpython-36.pyc deleted file mode 100644 index 37bca9e1582875acbba302dbb12e6021bf964c2f..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_sequence.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_sequence.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_sequence.cpython-37.pyc deleted file mode 100644 index 0c1beb10a7cb85c0215f9a657815e829317c1695..0000000000000000000000000000000000000000 Binary files 
a/colossalai/context/process_group_initializer/__pycache__/initializer_sequence.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_tensor.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_tensor.cpython-36.pyc deleted file mode 100644 index 336c611d8ef5bd7e4c3804a4e2851d5c066e399b..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_tensor.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/initializer_tensor.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/initializer_tensor.cpython-37.pyc deleted file mode 100644 index 5f4c6251b99a6b9e432078dbceb27c00e18610ab..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/initializer_tensor.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/process_group_initializer.cpython-36.pyc b/colossalai/context/process_group_initializer/__pycache__/process_group_initializer.cpython-36.pyc deleted file mode 100644 index 1ab0d1af8e94b8794f33e7a2a6d70bf2919e107d..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/process_group_initializer.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/__pycache__/process_group_initializer.cpython-37.pyc b/colossalai/context/process_group_initializer/__pycache__/process_group_initializer.cpython-37.pyc deleted file mode 100644 index 57ef1f0b9122e69481a5df043246ec2f2e20a062..0000000000000000000000000000000000000000 Binary files a/colossalai/context/process_group_initializer/__pycache__/process_group_initializer.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/process_group_initializer/initializer_1d.py b/colossalai/context/process_group_initializer/initializer_1d.py deleted file mode 100644 index 4d454f2a603f82af20e4d32140e7be41d25f91e8..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_1d.py +++ /dev/null @@ -1,44 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.distributed as dist -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.registry import DIST_GROUP_INITIALIZER - -from ..parallel_mode import ParallelMode -from .process_group_initializer import ProcessGroupInitializer - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_1D(ProcessGroupInitializer): - '''A ProcessGroupInitializer for 1d tensor parallelism. - ''' - - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.num_group = self.world_size // self.tensor_parallel_size - - def init_dist_group(self): - """Initialize 1D tensor parallel groups, and assign local_ranks and groups to each gpu. 
- - :return: (local_rank, group_world_size, process_group, ranks_in_group, mode) - :rtype: Tuple - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_1D - env.parallel_input_1d = False - - for i in range(self.num_group): - ranks = [i * self.tensor_parallel_size + j for j in range(self.tensor_parallel_size)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode diff --git a/colossalai/context/process_group_initializer/initializer_2d.py b/colossalai/context/process_group_initializer/initializer_2d.py deleted file mode 100644 index b48ce60f9e752ac6ee0420e60cdb50bc580059ee..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_2d.py +++ /dev/null @@ -1,137 +0,0 @@ -import math - -import torch.distributed as dist - -from colossalai.registry import DIST_GROUP_INITIALIZER -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode -from colossalai.global_variables import tensor_parallel_env as env - - -def _check_summa_env_var(summa_dim): - # check environment variable for SUMMA - env_summa_dim = env.summa_dim - - if env_summa_dim: - assert int(env_summa_dim) == summa_dim, \ - 'SUMMA_DIM has been set in the current environment and ' \ - 'does not match the value passed to this initializer' - else: - env.summa_dim = summa_dim - - -class Initializer_2D_Row(ProcessGroupInitializer): - """2d tensor parallel initialization among rows. - - :param num_group: The number of all tensor groups - :param summa_dim: The dimension of SUMMA - :param args: Args used to initialize base class - :param kwargs: Kwargs used to initialize base class - - :type num_group: int - :type summa_dim: int - """ - - def __init__(self, num_group, summa_dim, *args, **kwargs): - super(Initializer_2D_Row, self).__init__(*args, **kwargs) - self.num_group = num_group - self.summa_dim = summa_dim - - def init_dist_group(self): - """Initialize 2D tensor row parallel groups, and assign local_ranks and groups to each gpu. - - :return: 2D tensor row parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_2D_ROW - - for i in range(self.num_group): - for j in range(self.summa_dim): - ranks = [i * self.tensor_parallel_size + j * self.summa_dim + k - for k in range(self.summa_dim)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -class Initializer_2D_Col(ProcessGroupInitializer): - """2d tensor parallel initialization among cols.
- - :param num_group: The number of all tensor groups - :param summa_dim: The dimension of SUMMA - :param args: Args used to initialize base class - :param kwargs: Kwargs used to initialize base class - - :type num_group: int - :type summa_dim: int - """ - - def __init__(self, num_group, summa_dim, *args, **kwargs): - super(Initializer_2D_Col, self).__init__(*args, **kwargs) - self.num_group = num_group - self.summa_dim = summa_dim - - def init_dist_group(self): - """Initialize 2D tensor col parallel groups, and assign local_ranks and groups to each gpu. - - :return: 2D tensor col parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_2D_COL - - for i in range(self.num_group): - for j in range(self.summa_dim): - ranks = [i * self.tensor_parallel_size + j + k * self.summa_dim - for k in range(self.summa_dim)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_2D(ProcessGroupInitializer): - """ - Serve as the single entry point to 2D parallel initialization. - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.num_group = self.world_size // self.tensor_parallel_size - self.summa_dim = int(math.sqrt(self.tensor_parallel_size)) - - assert self.tensor_parallel_size == self.summa_dim ** 2, \ - "2D summa dim should equal tensor parallel size ^ 0.5" - _check_summa_env_var(self.summa_dim) - - self.col_initializer = Initializer_2D_Col(self.num_group, self.summa_dim, *args, **kwargs) - self.row_initializer = Initializer_2D_Row(self.num_group, self.summa_dim, *args, **kwargs) - - def init_dist_group(self): - """Initialize 2D tensor row and col parallel groups, and assign local_ranks and groups to each gpu.
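For a concrete picture of the row/col formulas above, here is the enumeration they produce for `tensor_parallel_size = 4` (`summa_dim = 2`) within a single tensor group (`i = 0`); a small illustrative script, not library code:

```python
# Enumerate 2D SUMMA row and col groups with the same index arithmetic.
summa_dim = 2  # tensor_parallel_size = summa_dim ** 2 = 4

rows = [[j * summa_dim + k for k in range(summa_dim)] for j in range(summa_dim)]
cols = [[j + k * summa_dim for k in range(summa_dim)] for j in range(summa_dim)]

print(rows)  # [[0, 1], [2, 3]] -- ranks sharing a SUMMA row
print(cols)  # [[0, 2], [1, 3]] -- ranks sharing a SUMMA column
```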
- - :return: 2D tensor parallelism's information - :rtype: list of Tuples (local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - parallel_setting = [self.row_initializer.init_dist_group(), self.col_initializer.init_dist_group()] - return parallel_setting diff --git a/colossalai/context/process_group_initializer/initializer_2p5d.py b/colossalai/context/process_group_initializer/initializer_2p5d.py deleted file mode 100644 index 3c3e1b9787ceed84a3d32bb39dbf6d8d815c43f5..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_2p5d.py +++ /dev/null @@ -1,288 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import math - -import torch.distributed as dist -from colossalai.context import Config -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.registry import DIST_GROUP_INITIALIZER - -from ..parallel_mode import ParallelMode -from .process_group_initializer import ProcessGroupInitializer - - -def _check_tesseract_env_var(tesseract_dim: int, - tesseract_dep: int): - # check global variable for TESSERACT - env_tesseract_dim = env.tesseract_dim - env_tesseract_dep = env.tesseract_dep - - if env_tesseract_dim and env_tesseract_dep: - assert int(env_tesseract_dim) == tesseract_dim, \ - 'TESSERACT_DIM has been set in the current environment and ' \ - 'does not match the value passed to this initializer' - assert int(env_tesseract_dep) == tesseract_dep, \ - 'TESSERACT_DEP has been set in the current environment and ' \ - 'does not match the value passed to this initializer' - else: - env.tesseract_dim = tesseract_dim - env.tesseract_dep = tesseract_dep - - -# i row j col k dep -class Initializer_2p5D_ROW(ProcessGroupInitializer): - """2p5d tensor parallel initialization among rows. - - :param tesseract_dim: The dimension of tesseract - :param tesseract_dep: The dimension of depth - :param args: Args used to initialize base class - - :type tesseract_dim: int - :type tesseract_dep: int - """ - - def __init__(self, - tesseract_dim: int, - tesseract_dep: int, - *args): - super(Initializer_2p5D_ROW, self).__init__(*args) - self.num_group = self.world_size // self.tensor_parallel_size - self.tesseract_dep = tesseract_dep - self.tesseract_dim = tesseract_dim - assert self.tensor_parallel_size == self.tesseract_dim ** 2 * self.tesseract_dep, \ - "Tensor parallel size should be depth * dim ** 2 in 2.5D parallel" - - def init_dist_group(self): - """Initialize 2p5D tensor row parallel groups, and assign local_ranks and groups to each gpu. - - :return: 2p5D tensor row parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_2P5D_ROW - - for h in range(self.num_group): - for j in range(self.tesseract_dim): - for k in range(self.tesseract_dep): - ranks = [h * self.tensor_parallel_size + i + self.tesseract_dim * ( - j + self.tesseract_dim * k) for i in range(self.tesseract_dim)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -class Initializer_2p5D_Col(ProcessGroupInitializer): - """2p5d tensor parallel initialization among cols.
- - :param tesseract_dim: The dimension of tesseract - :param tesseract_dep: The dimension of depth - :param args: Args used to initialize base class - - :type tesseract_dim: int - :type tesseract_dep: int - """ - - def __init__(self, - tesseract_dim: int, - tesseract_dep: int, - *args): - super(Initializer_2p5D_Col, self).__init__(*args) - self.num_group = self.world_size // self.tensor_parallel_size - self.tesseract_dep = tesseract_dep - self.tesseract_dim = tesseract_dim - assert self.tensor_parallel_size == self.tesseract_dim ** 2 * self.tesseract_dep, \ - "Tensor parallel size should be depth * dim ** 2 in 2.5D parallel" - - def init_dist_group(self): - """Initialize 2p5D tensor col parallel groups, and assign local_ranks and groups to each gpu. - - :return: 2p5D tensor col parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_2P5D_COL - - for h in range(self.num_group): - for i in range(self.tesseract_dim): - for k in range(self.tesseract_dep): - ranks = [h * self.tensor_parallel_size + i + self.tesseract_dim * ( - j + self.tesseract_dim * k) for j in range(self.tesseract_dim)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -class Initializer_2p5D_Dep(ProcessGroupInitializer): - """2p5D tensor parallel initialization among depths. - - :param tesseract_dim: The dimension of tesseract - :param tesseract_dep: The dimension of depth - :param args: Args used to initialize base class - - :type tesseract_dim: int - :type tesseract_dep: int - """ - - def __init__(self, - tesseract_dim: int, - tesseract_dep: int, - *args): - super(Initializer_2p5D_Dep, self).__init__(*args) - self.num_group = self.world_size // self.tensor_parallel_size - self.tesseract_dep = tesseract_dep - self.tesseract_dim = tesseract_dim - assert self.tensor_parallel_size == self.tesseract_dim ** 2 * self.tesseract_dep, \ - "Tensor parallel size should be depth * dim ** 2 in 2.5D parallel" - - def init_dist_group(self): - """Initialize 2p5D tensor depth parallel groups, and assign local_ranks and groups to each gpu. - - :return: 2p5D tensor depth parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_2P5D_DEP - - for h in range(self.num_group): - for i in range(self.tesseract_dim): - for j in range(self.tesseract_dim): - ranks = [h * self.tensor_parallel_size + i + self.tesseract_dim * ( - j + self.tesseract_dim * k) for k in range(self.tesseract_dep)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -# i row j col k dep -class Initializer_2p5D_XZ(ProcessGroupInitializer): - """2p5d tensor parallel initialization among cols times dep. 
- - :param tesseract_dim: The dimension of tesseract - :param tesseract_dep: The dimension of depth - :param args: Args used to initialize base class - - :type tesseract_dim: int - :type tesseract_dep: int - """ - - def __init__(self, - tesseract_dim: int, - tesseract_dep: int, - *args): - super(Initializer_2p5D_XZ, self).__init__(*args) - self.num_group = self.world_size // self.tensor_parallel_size - self.tesseract_dep = tesseract_dep - self.tesseract_dim = tesseract_dim - assert self.tensor_parallel_size == self.tesseract_dim ** 2 * self.tesseract_dep, \ - "Tensor parallel size should be depth * dim ** 2 in 2.5D parallel" - - def init_dist_group(self): - """Initialize 2p5D tensor colXdepth parallel groups, and assign local_ranks and groups to each gpu. - - :return: 2p5D tensor colXdepth parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_2P5D_XZ - - for h in range(self.num_group): - for i in range(self.tesseract_dim): - ranks = [h * self.tensor_parallel_size + i + self.tesseract_dim * ( - j + self.tesseract_dim * k) for k in range(self.tesseract_dep) for j in - range(self.tesseract_dim)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_2p5D(ProcessGroupInitializer): - """ - Serve as the single entry point to Tesseract parallel initialization. - - :param rank: The rank of current process - :param world_size: Size of whole communication world - :param config: Running configuration - :param data_parallel_size: Size of data parallel - :param pipeline_parallel_size: Size of pipeline parallel - :param tensor_parallel_size: Size of tensor parallel - :param depth: The depth of 2p5d parallel - :type rank: int - :type world_size: int - :type config: Config - :type data_parallel_size: int - :type pipeline_parallel_size: int - :type tensor_parallel_size: int - :type depth: int - """ - - def __init__(self, - rank: int, - world_size: int, - config: Config, - data_parallel_size: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - depth: int - ): - args = (rank, world_size, config, data_parallel_size, pipeline_parallel_size, tensor_parallel_size) - super().__init__(*args) - self.num_group = self.world_size // self.tensor_parallel_size - self.tesseract_dim = int(math.sqrt(self.tensor_parallel_size / depth)) - self.tesseract_dep = depth - - assert self.tensor_parallel_size == self.tesseract_dim ** 2 * self.tesseract_dep, \ - "2.5D tesseract dim should equal to (tensor parallel size / tesseract dep) ^ 0.5" - _check_tesseract_env_var(self.tesseract_dim, self.tesseract_dep) - - self.col_initializer = Initializer_2p5D_Col(self.tesseract_dim, self.tesseract_dep, *args) - self.row_initializer = Initializer_2p5D_ROW(self.tesseract_dim, self.tesseract_dep, *args) - self.dep_initializer = Initializer_2p5D_Dep(self.tesseract_dim, self.tesseract_dep, *args) - self.xz_initializer = Initializer_2p5D_XZ(self.tesseract_dim, self.tesseract_dep, *args) - - def init_dist_group(self): - """Initialize 2p5D tensor row, col, depth, and colXdepth parallel groups, and assign local_ranks and groups to each gpu. 
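All four 2.5D initializers index ranks as `i + dim * (j + dim * k)`, with `i` the row, `j` the column and `k` the depth; they differ only in which index varies within a group. A worked enumeration for `tesseract_dim = 2`, `tesseract_dep = 2` (so `tensor_parallel_size = 8`, single group `h = 0`), written as a small illustrative script:

```python
# Enumerate the 2.5D (tesseract) groups with the same index arithmetic.
dim, dep = 2, 2  # tensor_parallel_size = dim ** 2 * dep = 8


def rank(i, j, k):
    return i + dim * (j + dim * k)  # i row, j col, k dep


rows = [[rank(i, j, k) for i in range(dim)] for k in range(dep) for j in range(dim)]
cols = [[rank(i, j, k) for j in range(dim)] for k in range(dep) for i in range(dim)]
deps = [[rank(i, j, k) for k in range(dep)] for i in range(dim) for j in range(dim)]
xz = [[rank(i, j, k) for k in range(dep) for j in range(dim)] for i in range(dim)]

print(rows)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(cols)  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(deps)  # [[0, 4], [2, 6], [1, 5], [3, 7]]
print(xz)    # [[0, 2, 4, 6], [1, 3, 5, 7]]
```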
- - :return: Whole 2p5D tensor parallelism's information - :rtype: list of Tuples (local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - parallel_setting = [self.col_initializer.init_dist_group(), self.row_initializer.init_dist_group(), - self.dep_initializer.init_dist_group(), self.xz_initializer.init_dist_group()] - return parallel_setting diff --git a/colossalai/context/process_group_initializer/initializer_3d.py b/colossalai/context/process_group_initializer/initializer_3d.py deleted file mode 100644 index edd8b46940182a7fb31ad5440a35c2d02696baee..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_3d.py +++ /dev/null @@ -1,185 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import math - -import torch.distributed as dist -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.registry import DIST_GROUP_INITIALIZER - -from ..parallel_mode import ParallelMode -from .process_group_initializer import ProcessGroupInitializer - - -def _check_depth_env_var(depth): - # check global variable - env_depth = env.depth_3d - - if env_depth: - assert int(env_depth) == depth, \ - 'DEPTH_3D has been set in the current environment and ' \ - 'does not match the value passed to this initializer' - else: - env.depth_3d = depth - - -class Initializer_3D_Input(ProcessGroupInitializer): - """3D tensor parallel initialization among input. - - :param num_group: The number of all tensor groups - :param depth: Depth of 3D parallelism - :param args: Args used in base class - - :type num_group: int - :type depth: int - """ - - def __init__(self, num_group: int, depth: int, *args): - super().__init__(*args) - self.num_group = num_group - self.depth = depth - - def init_dist_group(self): - """Initialize 3D tensor parallel groups among input, and assign local_ranks and groups to each gpu. - - :return: 3D tensor parallelism's information among input - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_3D_INPUT - env.input_group_3d = mode - - for h in range(self.num_group): - for i in range(self.depth): - for k in range(self.depth): - ranks = [h * self.depth**3 + i + self.depth * (j + self.depth * k) for j in range(self.depth)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -class Initializer_3D_Weight(ProcessGroupInitializer): - """3D tensor parallel initialization among weight. - - :param num_group: The number of all tensor groups - :param depth: Depth of 3D parallelism - :param args: Args used in base class - - :type num_group: int - :type depth: int - """ - - def __init__(self, num_group: int, depth: int, *args): - super().__init__(*args) - self.num_group = num_group - self.depth = depth - - def init_dist_group(self): - """Initialize 3D tensor parallel groups among weight, and assign local_ranks and groups to each gpu.
- - :return: 3D tensor parallelism's information among weight - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_3D_WEIGHT - env.weight_group_3d = mode - - for h in range(self.num_group): - for k in range(self.depth): - for j in range(self.depth): - ranks = [h * self.depth**3 + i + self.depth * (j + self.depth * k) for i in range(self.depth)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -class Initializer_3D_Output(ProcessGroupInitializer): - """3D tensor parallel initialization among output. - - :param num_group: The number of all tensor groups - :param depth: Depth of 3D parallelism - :param args: Args used in base class - - :type num_group: int - :type depth: int - """ - - def __init__(self, num_group: int, depth: int, *args): - super().__init__(*args) - self.num_group = num_group - self.depth = depth - - def init_dist_group(self): - """Initialize 3D tensor parallel groups among output, and assign local_ranks and groups to each gpu. - - :return: 3D tensor parallelism's information among output - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.PARALLEL_3D_OUTPUT - env.output_group_3d = mode - - for h in range(self.num_group): - for i in range(self.depth): - for j in range(self.depth): - ranks = [h * self.depth**3 + i + self.depth * (j + self.depth * k) for k in range(self.depth)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_3D(ProcessGroupInitializer): - """Serve as the single entry point to 3D parallel initialization. - - :param args: Args used to initialize ProcessGroupInitializer - """ - - def __init__(self, *args): - super().__init__(*args) - self.num_group = self.world_size // self.tensor_parallel_size - self.depth = round(math.pow(self.tensor_parallel_size, 1 / 3)) - assert self.tensor_parallel_size == self.depth ** 3, \ - f'3D depth ({self.depth}) is not the cube root of tensor parallel size ({self.tensor_parallel_size})' - _check_depth_env_var(self.depth) - - self.input_initializer = Initializer_3D_Input(self.num_group, self.depth, *args) - self.weight_initializer = Initializer_3D_Weight(self.num_group, self.depth, *args) - self.output_initializer = Initializer_3D_Output(self.num_group, self.depth, *args) - - def init_dist_group(self): - """Initialize 3D tensor parallel groups, and assign local_ranks and groups to each gpu.
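The 3D input/weight/output initializers follow the same pattern with rank `i + depth * (j + depth * k)`, each varying a different index. For `depth = 2` (`tensor_parallel_size = 8`, single group `h = 0`) they enumerate as follows (an illustrative script, not library code):

```python
# Enumerate the 3D input / weight / output groups for depth = 2.
depth = 2  # tensor_parallel_size = depth ** 3 = 8


def rank(i, j, k):
    return i + depth * (j + depth * k)


input_groups = [[rank(i, j, k) for j in range(depth)] for i in range(depth) for k in range(depth)]
weight_groups = [[rank(i, j, k) for i in range(depth)] for k in range(depth) for j in range(depth)]
output_groups = [[rank(i, j, k) for k in range(depth)] for i in range(depth) for j in range(depth)]

print(input_groups)   # [[0, 2], [4, 6], [1, 3], [5, 7]]
print(weight_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(output_groups)  # [[0, 4], [2, 6], [1, 5], [3, 7]]
```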
- - :return: 3D tensor parallelism's information - :rtype: list of Tuples (local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - parallel_setting = [self.input_initializer.init_dist_group(), self.weight_initializer.init_dist_group(), - self.output_initializer.init_dist_group()] - return parallel_setting diff --git a/colossalai/context/process_group_initializer/initializer_data.py b/colossalai/context/process_group_initializer/initializer_data.py deleted file mode 100644 index 89f55a189dab3384e7bc61ff66c89613e60046bd..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_data.py +++ /dev/null @@ -1,44 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from torch import distributed as dist - -from colossalai.registry import DIST_GROUP_INITIALIZER -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Data(ProcessGroupInitializer): - """A ProcessGroupInitializer for data parallelism. - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.num_data_parallel_group = self.world_size // self.data_parallel_size - - def init_dist_group(self): - """Initialize data parallel groups, and assign local_ranks and groups to each gpu. - - :return: Data parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.DATA - - for i in range(self.num_data_parallel_group): - ranks = [i + j * self.num_data_parallel_group for j in range(self.data_parallel_size)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode diff --git a/colossalai/context/process_group_initializer/initializer_model.py b/colossalai/context/process_group_initializer/initializer_model.py deleted file mode 100644 index e4fe0e5e1ec36495f436020f84ba5b295072e644..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_model.py +++ /dev/null @@ -1,47 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.distributed as dist - -from colossalai.context import Config -from colossalai.registry import DIST_GROUP_INITIALIZER -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Model(ProcessGroupInitializer): - """A ProcessGroupInitializer for model parallelism (model parallel group contains pipeline and tensor parallel - groups). - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.model_parallel_size = self.tensor_parallel_size * self.pipeline_parallel_size - self.num_group = self.world_size // self.model_parallel_size - - def init_dist_group(self): - """Initialize model parallel groups, and assign local_ranks and groups to each gpu. 
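The data-parallel grouping above is the strided complement of the contiguous model-parallel blocks: rank `i + j * num_data_parallel_group` steps across model-parallel blocks rather than within one. For example (illustrative script):

```python
# With world_size = 8 and data_parallel_size = 2 there are 4 strided
# data-parallel groups of 2 ranks each.
world_size, dp_size = 8, 2
num_dp_group = world_size // dp_size  # 4

dp_groups = [[i + j * num_dp_group for j in range(dp_size)]
             for i in range(num_dp_group)]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```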
- - :return: (local_rank, group_world_size, process_group, ranks_in_group, mode) - :rtype: Tuple - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.MODEL - - for i in range(self.num_group): - ranks = [i * self.model_parallel_size + j for j in range(self.model_parallel_size)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - return local_rank, group_world_size, process_group, ranks_in_group, mode diff --git a/colossalai/context/process_group_initializer/initializer_moe.py b/colossalai/context/process_group_initializer/initializer_moe.py deleted file mode 100644 index 5632c3396265b85e95e771ed4fc534077fa93b93..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_moe.py +++ /dev/null @@ -1,119 +0,0 @@ -import torch.distributed as dist - -from colossalai.registry import DIST_GROUP_INITIALIZER -from colossalai.global_variables import moe_env -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Moemodel(ProcessGroupInitializer): - """Model parallel initialization for MoE system. - - :param moe_model: Size of moe model parallel - :param moe_data: Size of moe data parallel - :param args: Args used in base class - :param kwargs: Kwargs used in base class - - :type moe_model: int - :type moe_data: int - """ - def __init__(self, moe_model, moe_data, *args, **kwargs): - super().__init__(*args, **kwargs) - self.moe_model = moe_model - self.moe_data = moe_data - - def init_dist_group(self): - """Initialize model parallel groups in moe parallel environment, - and assign local_ranks and groups to each gpu. - - :return: MoE model parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.MOE_MODEL - - for i in range(self.moe_data): - ranks = [i * self.moe_model + j for j in range(self.moe_model)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Moedata(ProcessGroupInitializer): - """Data parallel initialization for MoE system. - - :param moe_model: Size of moe model parallel - :param moe_data: Size of moe data parallel - :param args: Args used in base class - :param kwargs: Kwargs used in base class - - :type moe_model: int - :type moe_data: int - """ - def __init__(self, moe_model, moe_data, *args, **kwargs): - super().__init__(*args, **kwargs) - self.moe_model = moe_model - self.moe_data = moe_data - - def init_dist_group(self): - """Initialize data parallel groups in moe parallel environment, - and assign local_ranks and groups to each gpu.
- - :return: MoE data parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.MOE_DATA - - for i in range(self.moe_model): - ranks = [i + j * self.moe_model for j in range(self.moe_data)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Moe(ProcessGroupInitializer): - """Serves as the single entry point to MoE parallel initialization. - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.moe_model = moe_env.model_parallel_size - self.moe_data = moe_env.data_parallel_size - self.model_initializer = Initializer_Moemodel( - self.moe_model, self.moe_data, *args, **kwargs) - self.data_initializer = Initializer_Moedata( - self.moe_model, self.moe_data, *args, **kwargs) - - def init_dist_group(self): - """Initializes MoE parallel communication groups. - - :return: MoE parallelism's information - :rtype: list of Tuples (local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - parallel_setting = [self.model_initializer.init_dist_group(), - self.data_initializer.init_dist_group()] - return parallel_setting diff --git a/colossalai/context/process_group_initializer/initializer_pipeline.py b/colossalai/context/process_group_initializer/initializer_pipeline.py deleted file mode 100644 index 773c3329dfa80b0cd88151cae4ac8c5f84684509..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_pipeline.py +++ /dev/null @@ -1,49 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from torch import distributed as dist - -from colossalai.registry import DIST_GROUP_INITIALIZER -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Pipeline(ProcessGroupInitializer): - """A ProcessGroupInitializer for pipeline parallelism. - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.data_group_size = self.world_size // self.data_parallel_size - self.pipeline_stage_size = self.data_group_size // self.pipeline_parallel_size - - def init_dist_group(self): - """Initialize pipeline parallel groups, and assign local_ranks and groups to each gpu. 
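The two MoE groupings are transposes of each other: contiguous blocks for the model-parallel groups, strided sets for the data-parallel groups. For `moe_model = 2` and `moe_data = 4` on 8 ranks (illustrative script):

```python
# Contiguous MoE model groups vs. strided MoE data groups on 8 ranks.
moe_model, moe_data = 2, 4

model_groups = [[i * moe_model + j for j in range(moe_model)] for i in range(moe_data)]
data_groups = [[i + j * moe_model for j in range(moe_data)] for i in range(moe_model)]

print(model_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(data_groups)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```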
- - :return: Pipeline parallelism's information - :rtype: list of Tuples (local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - dist_settings = list() - for i in range(self.data_parallel_size): - for j in range(self.pipeline_stage_size): - pipe_ranks = list( - range(i * self.data_group_size + j, - (i + 1) * self.data_group_size, - self.pipeline_stage_size)) - pipe_group_size = len(pipe_ranks) - pipe_group = dist.new_group(pipe_ranks) - - if self.rank in pipe_ranks: - local_rank = pipe_ranks.index(self.rank) - group_world_size = pipe_group_size - process_group = pipe_group - ranks_in_group = pipe_ranks - dist_settings.append( - tuple((local_rank, group_world_size, - process_group, ranks_in_group, - ParallelMode.PIPELINE))) - - return dist_settings diff --git a/colossalai/context/process_group_initializer/initializer_sequence.py b/colossalai/context/process_group_initializer/initializer_sequence.py deleted file mode 100644 index 8b702370dd2f1ec82cd2593734853380e3a41ae6..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_sequence.py +++ /dev/null @@ -1,84 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch.distributed as dist - -from colossalai.registry import DIST_GROUP_INITIALIZER -from .initializer_tensor import Initializer_Tensor -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Sequence_DP(ProcessGroupInitializer): - """A ProcessGroupInitializer for sequence parallelism all-reduce. - - In Sequence Parallelism, each GPU holds the full copy of model weights, - thus, gradient all-reduce occurs across all processes in the same pipeline stage - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.dp_size = self.world_size // self.pipeline_parallel_size - self.num_group = self.pipeline_parallel_size - - def init_dist_group(self): - """Initialize Sequence Parallel process groups used for gradient all-reduce. - - :return: (local_rank, group_world_size, process_group, ranks_in_group, mode) - :rtype: Tuple - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.SEQUENCE_DP - - for i in range(self.num_group): - ranks = [i * self.dp_size + j for j in range(self.dp_size)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - return local_rank, group_world_size, process_group, ranks_in_group, mode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Sequence(ProcessGroupInitializer): - """A ProcessGroupInitializer for sequence parallelism. - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - def __init__(self, - *args, **kwargs): - super().__init__(*args, **kwargs) - # reuse tensor parallel initializer code - self._sequence_initializer = Initializer_Tensor(*args, **kwargs) - self._sequence_dp_initializer = Initializer_Sequence_DP(*args, **kwargs) - - def init_dist_group(self): - """Initialize Sequence parallel process groups and assign local_ranks and groups to each gpu. - - Sequence parallelism requires 2 process groups. 
The first is for model forward where several processes - exchange partial query, key and value embeddings to compute self-attention values. The second is for - all-reduce to synchronize the model parameters. - - :return: Sequence parallelism's information - :rtype: list of Tuples (local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - - parallel_setting = [] - - local_rank, group_world_size, process_group, ranks_in_group, mode = self._sequence_initializer.init_dist_group() - # change mode to sequence - mode = ParallelMode.SEQUENCE - - parallel_setting.append((local_rank, group_world_size, process_group, ranks_in_group, mode)) - parallel_setting.append(self._sequence_dp_initializer.init_dist_group()) - return parallel_setting diff --git a/colossalai/context/process_group_initializer/initializer_tensor.py b/colossalai/context/process_group_initializer/initializer_tensor.py deleted file mode 100644 index 628a43434a06c3b3983acffe182c92727f918484..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/initializer_tensor.py +++ /dev/null @@ -1,44 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.distributed as dist - -from colossalai.registry import DIST_GROUP_INITIALIZER -from .process_group_initializer import ProcessGroupInitializer -from ..parallel_mode import ParallelMode - - -@DIST_GROUP_INITIALIZER.register_module -class Initializer_Tensor(ProcessGroupInitializer): - """A ProcessGroupInitializer for tensor parallelism. - - :param args: Args used to initialize ProcessGroupInitializer - :param kwargs: Kwargs used to initialize ProcessGroupInitializer - """ - def __init__(self, *args, **kwargs): - super().__init__(*args, **kwargs) - self.num_tensor_parallel_group = self.world_size // self.tensor_parallel_size - - def init_dist_group(self): - """Initialize tensor parallel groups, and assign local_ranks and groups to each gpu. - - :return: Tensor parallelism's information - :rtype: Tuple(local_rank, group_world_size, process_group, ranks_in_group, mode) - """ - local_rank = None - ranks_in_group = None - process_group = None - group_world_size = None - mode = ParallelMode.TENSOR - - for i in range(self.num_tensor_parallel_group): - ranks = [i * self.tensor_parallel_size + j for j in range(self.tensor_parallel_size)] - group = dist.new_group(ranks) - - if self.rank in ranks: - local_rank = ranks.index(self.rank) - group_world_size = len(ranks) - process_group = group - ranks_in_group = ranks - - return local_rank, group_world_size, process_group, ranks_in_group, mode diff --git a/colossalai/context/process_group_initializer/process_group_initializer.py b/colossalai/context/process_group_initializer/process_group_initializer.py deleted file mode 100644 index f8f2be6a75646b481c6a89120d2c7f805c62a007..0000000000000000000000000000000000000000 --- a/colossalai/context/process_group_initializer/process_group_initializer.py +++ /dev/null @@ -1,44 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC, abstractmethod - -from colossalai.context import Config - - -class ProcessGroupInitializer(ABC): - """An object, knowing the parallelism configuration, that initializes parallel groups.
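`Initializer_Sequence` shows a reuse-and-relabel trick: it runs the tensor-parallel initializer and only swaps the returned mode tag, since sequence-parallel groups have exactly the tensor-parallel layout. A toy sketch of that pattern (stand-in names, not the real classes):

```python
# Reuse one initializer's result under a different ParallelMode label.
from enum import Enum


class ParallelMode(Enum):  # stand-in for colossalai.context.ParallelMode
    TENSOR = 'tensor'
    SEQUENCE = 'sequence'


def fake_tensor_init():
    # (local_rank, group_world_size, process_group, ranks_in_group, mode)
    return 0, 4, 'pg-handle', [0, 1, 2, 3], ParallelMode.TENSOR


local_rank, ws, group, ranks, _ = fake_tensor_init()
setting = (local_rank, ws, group, ranks, ParallelMode.SEQUENCE)  # relabel only
print(setting)
```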
- - :param rank: The rank of the current process - :param world_size: Size of the whole communication world - :param config: Running configuration - :param data_parallel_size: Size of data parallelism - :param pipeline_parallel_size: Size of pipeline parallelism - :param tensor_parallel_size: Size of tensor parallelism - - :type rank: int - :type world_size: int - :type config: Config - :type data_parallel_size: int - :type pipeline_parallel_size: int - :type tensor_parallel_size: int - """ - def __init__(self, - rank: int, - world_size: int, - config: Config, - data_parallel_size: int, - pipeline_parallel_size: int, - tensor_parallel_size: int - ): - self.rank = rank - self.world_size = world_size - self.data_parallel_size = data_parallel_size - self.config = config - self.pipeline_parallel_size = pipeline_parallel_size - self.tensor_parallel_size = tensor_parallel_size - super().__init__() - - @abstractmethod - def init_dist_group(self): - pass diff --git a/colossalai/context/random/__init__.py b/colossalai/context/random/__init__.py deleted file mode 100644 index 675fea5aab111f762823634dd7dda2a67eb80253..0000000000000000000000000000000000000000 --- a/colossalai/context/random/__init__.py +++ /dev/null @@ -1,9 +0,0 @@ -from ._helper import (seed, set_mode, with_seed, add_seed, - get_seeds, get_states, get_current_mode, - set_seed_states, sync_states, moe_set_seed) - -__all__ = [ - 'seed', 'set_mode', 'with_seed', 'add_seed', 'get_seeds', - 'get_states', 'get_current_mode', 'set_seed_states', 'sync_states', - 'moe_set_seed' -] diff --git a/colossalai/context/random/__pycache__/__init__.cpython-36.pyc b/colossalai/context/random/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 4c9f5bebb18bbd8a019d06c044b570cc8df4f43f..0000000000000000000000000000000000000000 Binary files a/colossalai/context/random/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/random/__pycache__/__init__.cpython-37.pyc b/colossalai/context/random/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 41fbdafd2ec616cc072be22efab46a787777191f..0000000000000000000000000000000000000000 Binary files a/colossalai/context/random/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/random/__pycache__/_helper.cpython-36.pyc b/colossalai/context/random/__pycache__/_helper.cpython-36.pyc deleted file mode 100644 index 6cb6121e4f16b7c3512e768b178278c39d73c1ba..0000000000000000000000000000000000000000 Binary files a/colossalai/context/random/__pycache__/_helper.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/random/__pycache__/_helper.cpython-37.pyc b/colossalai/context/random/__pycache__/_helper.cpython-37.pyc deleted file mode 100644 index bd5fd7cf85c12ec285e0c19e06166907e4e7fa82..0000000000000000000000000000000000000000 Binary files a/colossalai/context/random/__pycache__/_helper.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/random/__pycache__/seed_manager.cpython-36.pyc b/colossalai/context/random/__pycache__/seed_manager.cpython-36.pyc deleted file mode 100644 index b620194f786e3265f7e7a033144c0069e008427b..0000000000000000000000000000000000000000 Binary files a/colossalai/context/random/__pycache__/seed_manager.cpython-36.pyc and /dev/null differ diff --git a/colossalai/context/random/__pycache__/seed_manager.cpython-37.pyc b/colossalai/context/random/__pycache__/seed_manager.cpython-37.pyc deleted file mode 100644 index
6a06a75e6f03f25f28a03ff0089e9dac67bddf14..0000000000000000000000000000000000000000 Binary files a/colossalai/context/random/__pycache__/seed_manager.cpython-37.pyc and /dev/null differ diff --git a/colossalai/context/random/_helper.py b/colossalai/context/random/_helper.py deleted file mode 100644 index ba5308cdc6b6d46de8814df866f38bd3162772f2..0000000000000000000000000000000000000000 --- a/colossalai/context/random/_helper.py +++ /dev/null @@ -1,157 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import functools -from contextlib import contextmanager - -import torch.cuda -from torch import Tensor - -from .seed_manager import SeedManager -from ..parallel_mode import ParallelMode - -_SEED_MANAGER = SeedManager() - - -def get_seeds(): - """Returns the seeds of the seed manager. - - :return: The seeds of the seed manager - :rtype: dict - """ - return _SEED_MANAGER.seeds - - -def get_states(copy=False): - """Returns the seed states of the seed manager. - - :param copy: Whether to return cloned copies of the states - :type copy: bool, optional - :return: The seed states of the seed manager - :rtype: dict - """ - states = _SEED_MANAGER.seed_states - - if copy: - new_states = dict() - - for parallel_mode, state in states.items(): - new_states[parallel_mode] = state.clone() - return new_states - else: - return _SEED_MANAGER.seed_states - - -def get_current_mode(): - """Returns the current mode of the seed manager. - - :return: The current mode of the seed manager. - :rtype: :class:`colossalai.context.ParallelMode` - """ - return _SEED_MANAGER.current_mode - - -def add_seed(parallel_mode: ParallelMode, seed: int, overwrite: bool = False): - """Adds a seed to the seed manager for `parallel_mode`. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :param seed: The seed to be added - :type seed: int - :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance of - :class:`colossalai.context.ParallelMode` or the seed for `parallel_mode` has been added - """ - _SEED_MANAGER.add_seed(parallel_mode, seed, overwrite) - - -def set_mode(parallel_mode: ParallelMode): - """Sets the current mode of the seed manager. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - """ - _SEED_MANAGER.set_mode(parallel_mode) - - -def set_seed_states(parallel_mode: ParallelMode, state: Tensor): - """Sets the state of the seed manager for `parallel_mode`. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :param state: the state to be set - :type state: :class:`torch.Tensor` - :raises AssertionError: Raises an AssertionError if `parallel_mode` is not found in the seed manager - """ - _SEED_MANAGER.set_state(parallel_mode, state) - - -def sync_states(): - """Saves the current CUDA RNG state into the seed manager under the current mode.""" - current_mode = get_current_mode() - current_states = torch.cuda.get_rng_state() - set_seed_states(current_mode, current_states) - - -@contextmanager -def seed(parallel_mode: ParallelMode): - """ A context manager for seed switching - - Examples:: - - with seed(ParallelMode.DATA): - output = F.dropout(input) - - """ - try: - # set to new mode - current_mode = _SEED_MANAGER.current_mode - yield _SEED_MANAGER.set_mode(parallel_mode) - finally: - # recover - _SEED_MANAGER.set_mode(current_mode) - - -def with_seed(func, parallel_mode: ParallelMode): - """ - A function wrapper which executes the function with a specified seed.
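- The seed mode is switched to ``parallel_mode`` before the wrapped function runs and restored afterwards, so the surrounding RNG stream is left untouched.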
- - Examples:: - - # use with decorator; note that ``with_seed`` takes the function as its - # first argument, so a plain decorator application needs functools.partial - @functools.partial(with_seed, parallel_mode=ParallelMode.DATA) - def forward(input): - return F.dropout(input) - out = forward(input) - # OR use it inline - def forward(input): - return F.dropout(input) - wrapped_forward = with_seed(forward, ParallelMode.DATA) - out = wrapped_forward(input) - - """ - - @functools.wraps(func) - def wrapper(*args, **kwargs): - # switch mode - current_mode = _SEED_MANAGER.current_mode - _SEED_MANAGER.set_mode(parallel_mode) - - # exec func - out = func(*args, **kwargs) - - # recover state - _SEED_MANAGER.set_mode(current_mode) - - return out - - return wrapper - - -def moe_set_seed(seed): - if torch.cuda.is_available(): - from colossalai.core import global_context as gpc - moe_mp_rank = gpc.get_local_rank(ParallelMode.MOE_MODEL) - moe_mp_seed = seed + moe_mp_rank - add_seed(ParallelMode.MOE_MODEL, moe_mp_seed) - - global_rank = gpc.get_global_rank() - add_seed(ParallelMode.TENSOR, global_rank, True) - print(f"moe seed condition: {global_rank} with moe seed {moe_mp_seed}, ", - f"tensor seed {global_rank}", flush=True) diff --git a/colossalai/context/random/seed_manager.py b/colossalai/context/random/seed_manager.py deleted file mode 100644 index 02b8a88a6c826d87492c9a346834cdc189df09aa..0000000000000000000000000000000000000000 --- a/colossalai/context/random/seed_manager.py +++ /dev/null @@ -1,80 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -from torch import Tensor - -from colossalai.context.parallel_mode import ParallelMode - - -class SeedManager: - """This class is a manager of all random seeds involved in the system. - """ - - def __init__(self): - self._current_mode = None - self._seeds = dict() - self._seed_states = dict() - - @property - def current_mode(self): - return self._current_mode - - @property - def seeds(self): - return self._seeds - - @property - def seed_states(self): - return self._seed_states - - def set_state(self, parallel_mode: ParallelMode, state: Tensor): - """Sets the state of the seed manager for `parallel_mode`. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :param state: the state to be set - :type state: :class:`torch.Tensor` - :raises AssertionError: Raises an AssertionError if `parallel_mode` is not found in the seed manager - """ - assert parallel_mode in self._seed_states, f'Parallel mode {parallel_mode} is not found in the seed manager' - self._seed_states[parallel_mode] = state - - def set_mode(self, parallel_mode: ParallelMode): - """Sets the current mode of the seed manager. - - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - """ - if self.current_mode: - # save the current state for current mode - self._seed_states[self._current_mode] = torch.cuda.get_rng_state() - - # set the new state for new mode - self._current_mode = parallel_mode - torch.cuda.set_rng_state(self._seed_states[parallel_mode]) - - def add_seed(self, parallel_mode: ParallelMode, seed: int, overwrite: bool = False): - """Adds a seed to the seed manager for `parallel_mode`.
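- The current CUDA RNG state is saved first, the new seed is applied and its resulting state is recorded for this mode, and the original state is then restored, so adding a seed does not disturb the active RNG stream.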
- - :param parallel_mode: The chosen parallel mode - :type parallel_mode: :class:`colossalai.context.ParallelMode` - :param seed: The seed to be added - :type seed: int - :param overwrite: Whether to allow overwriting a seed that has already been set - :type overwrite: bool, optional - :raises AssertionError: Raises an AssertionError if `parallel_mode` is not an instance of - :class:`colossalai.context.ParallelMode` or the seed for `parallel_mode` has been added - """ - assert isinstance( - parallel_mode, ParallelMode), 'A valid ParallelMode must be provided' - if overwrite is False: - assert parallel_mode not in self._seed_states, f'The seed for {parallel_mode} has been added' - elif parallel_mode in self._seed_states: - print(f"Warning: {parallel_mode} seed has been overwritten.", flush=True) - - current_state = torch.cuda.get_rng_state() - torch.cuda.manual_seed(seed) - self._seed_states[parallel_mode] = torch.cuda.get_rng_state() - self._seeds[parallel_mode] = seed - torch.cuda.set_rng_state(current_state) diff --git a/colossalai/core.py b/colossalai/core.py deleted file mode 100644 index ff30347913a37a0f2ba109570cfbce9f92891864..0000000000000000000000000000000000000000 --- a/colossalai/core.py +++ /dev/null @@ -1,6 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from colossalai.context import ParallelContext - -global_context = ParallelContext.get_instance() diff --git a/colossalai/engine/__init__.py b/colossalai/engine/__init__.py deleted file mode 100644 index 73ccb094e7561f91e9104aa608934950093f0a64..0000000000000000000000000000000000000000 --- a/colossalai/engine/__init__.py +++ /dev/null @@ -1,5 +0,0 @@ -from ._base_engine import Engine -from .gradient_handler import * - - -__all__ = ['Engine'] diff --git a/colossalai/engine/__pycache__/__init__.cpython-36.pyc b/colossalai/engine/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index c19bbfcb42cce46e4c1fb86fb51be65df18161a1..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/__pycache__/__init__.cpython-37.pyc b/colossalai/engine/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 1fd6ccb48f394f95db982509355a60f274b13f13..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/__pycache__/_base_engine.cpython-36.pyc b/colossalai/engine/__pycache__/_base_engine.cpython-36.pyc deleted file mode 100644 index 5c6dbb5fa9d2b66d9b687c6996203993a3589b99..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/__pycache__/_base_engine.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/__pycache__/_base_engine.cpython-37.pyc b/colossalai/engine/__pycache__/_base_engine.cpython-37.pyc deleted file mode 100644 index 26b08bbd0f3e1af1667bb6f34f349eeb68b38554..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/__pycache__/_base_engine.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/_base_engine.py b/colossalai/engine/_base_engine.py deleted file mode 100644 index df201e6af555db0e16bb059d3e565462fe56238e..0000000000000000000000000000000000000000 --- a/colossalai/engine/_base_engine.py +++ /dev/null @@ -1,145 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from typing import List -from torch.nn import Module -from torch.nn.modules.loss import _Loss -from torch.optim import Optimizer - -from
colossalai.logging import get_dist_logger -from torch import Tensor -from colossalai.engine.ophooks import register_ophooks_recursively, BaseOpHook - - -class Engine: - """Basic engine class for training and evaluation. It runs a specific process method - :meth:`step` which is based on the given :attr:`schedule` over each batch of a dataset. - It controls an iteration in training. - - :param model: The neural network model - :type model: ``torch.nn.Module`` - :param optimizer: Optimizer for updating the parameters - :type optimizer: ``torch.optim.Optimizer`` - :param criterion: Loss function for calculating loss - :type criterion: ``torch.nn.modules.loss._Loss`` - :param gradient_handlers: A list of gradient handlers used in backward - :type gradient_handlers: list - :param clip_grad_norm: The norm of gradient clipping - :type clip_grad_norm: float, optional - :param ophook_list: A list of operator hooks to register on the model - :type ophook_list: list, optional - :param verbose: whether to display log info - :type verbose: bool - """ - def __init__(self, - model: Module, - optimizer: Optimizer, - criterion: _Loss, - gradient_handlers: List = None, - clip_grad_norm: float = 0.0, - ophook_list: List[BaseOpHook] = None, - verbose: bool = True): - self._model = model - self._optimizer = optimizer - self._criterion = criterion - self._clip_grad_norm = clip_grad_norm - self._verbose = verbose - self._logger = get_dist_logger() - - # state - self.training = True # default - - # build gradient handler - if gradient_handlers: - self._gradient_handlers = gradient_handlers - else: - self._gradient_handlers = [] - - # avoid a shared mutable default argument for the ophook list - self._ophook_list = ophook_list if ophook_list is not None else [] - register_ophooks_recursively(self._model, self._ophook_list) - - @property - def model(self): - """Model attached to the engine""" - return self._model - - @property - def optimizer(self): - """Optimizer attached to the engine""" - return self._optimizer - - @property - def criterion(self): - """Criterion attached to the engine""" - return self._criterion - - def zero_grad(self): - """Set the gradient of parameters to zero - """ - self.optimizer.zero_grad() - - def step(self): - """Execute parameter update - """ - self._all_reduce_gradients() - self.optimizer.clip_grad_norm(self.model, self._clip_grad_norm) - return self.optimizer.step() - - def backward(self, loss: Tensor): - """Start backward propagation given the loss value computed by a loss function - - :param loss: Loss value computed by a loss function - :type loss: :class:`torch.Tensor` - """ - ret = self.optimizer.backward(loss) - for ophook in self._ophook_list: - ophook.post_iter() - return ret - - def backward_by_grad(self, tensor, grad): - """Start backward propagation given the gradient of the output tensor - - :param tensor: Output tensor - :type tensor: :class:`torch.Tensor` - :param grad: Gradient passed back to the output - :type grad: :class:`torch.Tensor` - """ - ret = self.optimizer.backward_by_grad(tensor, grad) - for ophook in self._ophook_list: - ophook.post_iter() - return ret - - def calc_loss(self, *args, **kwargs): - """Compute the loss value - - :param args: Args used in criterion function - :param kwargs: Kwargs used in criterion function - - :return: The loss value - :rtype: :class:`torch.Tensor` - """ - return self.criterion(*args, **kwargs) - - def __call__(self, *args, **kwargs): - """Run the forward step for the model - - :return: Output of the model - :rtype: Tuple[:class:`torch.Tensor`] or :class:`torch.Tensor` - """ - return self.model(*args, **kwargs) - - def _all_reduce_gradients(self): - """Handles all-reduce operations of gradients across different parallel groups.
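- Each registered gradient handler is invoked in registration order; when no handler is registered, this is a no-op.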
- """ - for handler in self._gradient_handlers: - handler.handle_gradient() - - def train(self): - """Sets the model to training mode. - """ - self.training = True - self._model.train() - - def eval(self): - """Sets the model to evaluation mode. - """ - self.training = False - self._model.eval() diff --git a/colossalai/engine/gradient_handler/__init__.py b/colossalai/engine/gradient_handler/__init__.py deleted file mode 100644 index b6503b7782a839c1d0ef69aa29bdda31fc5cb2ea..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/__init__.py +++ /dev/null @@ -1,12 +0,0 @@ -from ._base_gradient_handler import BaseGradientHandler -from ._data_parallel_gradient_handler import DataParallelGradientHandler -from ._zero_gradient_handler import ZeROGradientHandler -from ._sequence_parallel_gradient_handler import SequenceParallelGradientHandler -from ._pipeline_parallel_gradient_handler import PipelineSharedModuleGradientHandler -from ._moe_gradient_handler import MoeGradientHandler -from ._sequence_parallel_gradient_handler import SequenceParallelGradientHandler - - -__all__ = ['BaseGradientHandler', 'DataParallelGradientHandler', - 'ZeROGradientHandler', 'PipelineSharedModuleGradientHandler', - 'MoeGradientHandler', 'SequenceParallelGradientHandler'] \ No newline at end of file diff --git a/colossalai/engine/gradient_handler/__pycache__/__init__.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 6ee48d3af8cf01b7c8c2b0c7818714e227d5ffb7..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/__init__.cpython-37.pyc b/colossalai/engine/gradient_handler/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 6d28fc7e12cede33c838d8cc096bf326c6effb97..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_base_gradient_handler.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/_base_gradient_handler.cpython-36.pyc deleted file mode 100644 index afe708e533627925d45dafaf30c388ec1a78302c..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_base_gradient_handler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_base_gradient_handler.cpython-37.pyc b/colossalai/engine/gradient_handler/__pycache__/_base_gradient_handler.cpython-37.pyc deleted file mode 100644 index 5ed0d09bf9a0347352c1ed1f728c329a7ad2c1ce..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_base_gradient_handler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_data_parallel_gradient_handler.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/_data_parallel_gradient_handler.cpython-36.pyc deleted file mode 100644 index 39b28b0a1cc4918b3968f20d0706c7f8855e81c3..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_data_parallel_gradient_handler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_data_parallel_gradient_handler.cpython-37.pyc 
b/colossalai/engine/gradient_handler/__pycache__/_data_parallel_gradient_handler.cpython-37.pyc deleted file mode 100644 index d6ef1a868350b5c5378bfc2fee56c1a8ced7fab3..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_data_parallel_gradient_handler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_moe_gradient_handler.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/_moe_gradient_handler.cpython-36.pyc deleted file mode 100644 index dc0621ae293208807420b6ffb7261ae3f718b415..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_moe_gradient_handler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_moe_gradient_handler.cpython-37.pyc b/colossalai/engine/gradient_handler/__pycache__/_moe_gradient_handler.cpython-37.pyc deleted file mode 100644 index af91d588e5fb50787b64c3c3d6298c82cf0318dd..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_moe_gradient_handler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_pipeline_parallel_gradient_handler.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/_pipeline_parallel_gradient_handler.cpython-36.pyc deleted file mode 100644 index 996fb2135600c134669b29a401b4ba5db6253d55..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_pipeline_parallel_gradient_handler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_pipeline_parallel_gradient_handler.cpython-37.pyc b/colossalai/engine/gradient_handler/__pycache__/_pipeline_parallel_gradient_handler.cpython-37.pyc deleted file mode 100644 index ab9351a82e7f17a13440a5ce6016fdb596162da2..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_pipeline_parallel_gradient_handler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_sequence_parallel_gradient_handler.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/_sequence_parallel_gradient_handler.cpython-36.pyc deleted file mode 100644 index 8f88c06572a13958928bcd38e8e4d2df8f969a22..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_sequence_parallel_gradient_handler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_sequence_parallel_gradient_handler.cpython-37.pyc b/colossalai/engine/gradient_handler/__pycache__/_sequence_parallel_gradient_handler.cpython-37.pyc deleted file mode 100644 index 94c16539d1185e27d7cf758183ad89a074af39f9..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_sequence_parallel_gradient_handler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_zero_gradient_handler.cpython-36.pyc b/colossalai/engine/gradient_handler/__pycache__/_zero_gradient_handler.cpython-36.pyc deleted file mode 100644 index f8ea67e255bb2bacfb8f05558fa799657b308977..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_zero_gradient_handler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/__pycache__/_zero_gradient_handler.cpython-37.pyc 
b/colossalai/engine/gradient_handler/__pycache__/_zero_gradient_handler.cpython-37.pyc deleted file mode 100644 index 9fd4734458c1ffd4d5e00ff31119ff0deda777c5..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/gradient_handler/__pycache__/_zero_gradient_handler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/gradient_handler/_base_gradient_handler.py b/colossalai/engine/gradient_handler/_base_gradient_handler.py deleted file mode 100644 index 31f2e6e57eda7bf9309f11445dee03b63e0fca4e..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/_base_gradient_handler.py +++ /dev/null @@ -1,25 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC, abstractmethod - - -class BaseGradientHandler(ABC): - """A basic helper class to handle all-reduce operations of gradients across different parallel groups - before optimization. - - :param model: Model where the gradients accumulate - :param optimizer: Optimizer for updating the parameters - :type model: Module - :type optimizer: Optimizer - """ - def __init__(self, model, optimizer): - self._model = model - self._optimizer = optimizer - - @abstractmethod - def handle_gradient(self): - """A method to accumulate gradients across different parallel groups. Users should - write their own functions or just use the functions in pre-defined subclasses. - """ - pass diff --git a/colossalai/engine/gradient_handler/_data_parallel_gradient_handler.py b/colossalai/engine/gradient_handler/_data_parallel_gradient_handler.py deleted file mode 100644 index d29abb2d36b29988f993920732213eff8e2eb83e..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/_data_parallel_gradient_handler.py +++ /dev/null @@ -1,48 +0,0 @@ -#!/usr/bin/env python - -import torch.distributed as dist -from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors - -from colossalai.core import global_context as gpc -from colossalai.registry import GRADIENT_HANDLER -from ._base_gradient_handler import BaseGradientHandler -from ...context.parallel_mode import ParallelMode - - -@GRADIENT_HANDLER.register_module -class DataParallelGradientHandler(BaseGradientHandler): - """A helper class to handle all-reduce operations in a data parallel group. - An all-reduce collective communication will be executed in - :func:`handle_gradient` among a data parallel group. - For better performance, it bucketizes the gradients of all parameters of - the same type to improve the efficiency of communication. - """ - - def handle_gradient(self): - """A method running an all-reduce operation in a data parallel group. - """ - # TODO: add memory buffer - if gpc.data_parallel_size > 1: - # bucketize and all-reduce - buckets = {} - # Pack the buckets. - for param in self._model.parameters(): - if param.requires_grad and param.grad is not None: - tp = param.data.type() - if tp not in buckets: - buckets[tp] = [] - buckets[tp].append(param) - # param.main_grad = param.grad - - # For each bucket, all-reduce and copy all-reduced grads.
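- # The loop below is the classic bucketed all-reduce: flatten all gradients of - # one dtype into a single contiguous buffer, divide by the data parallel world - # size, all-reduce the buffer with a single collective call (summing the - # pre-divided gradients yields the global mean), then copy the reduced slices - # back into the original gradient tensors.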
- for tp in buckets: - bucket = buckets[tp] - grads = [param.grad.data for param in bucket] - coalesced = _flatten_dense_tensors(grads) - coalesced /= gpc.get_world_size(ParallelMode.DATA) - - dist.all_reduce( - coalesced, group=gpc.get_group(ParallelMode.DATA)) - for buf, synced in zip(grads, _unflatten_dense_tensors( - coalesced, grads)): - buf.copy_(synced) diff --git a/colossalai/engine/gradient_handler/_moe_gradient_handler.py b/colossalai/engine/gradient_handler/_moe_gradient_handler.py deleted file mode 100644 index dcdd02f860ab1d632c5fb0da82a4ad809c70b30b..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/_moe_gradient_handler.py +++ /dev/null @@ -1,61 +0,0 @@ -import torch.distributed as dist -from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors -from colossalai.core import global_context as gpc -from colossalai.registry import GRADIENT_HANDLER -from colossalai.global_variables import moe_env -from ._base_gradient_handler import BaseGradientHandler -from ...context.parallel_mode import ParallelMode - - -@GRADIENT_HANDLER.register_module -class MoeGradientHandler(BaseGradientHandler): - """A helper class to handle all-reduce operations in a data parallel group and - a MoE model parallel group. An all-reduce collective communication will be executed in - :func:`handle_gradient` among a data parallel group. - For better performance, it bucketizes the gradients of all parameters of - the same type to improve the efficiency of communication. - """ - - def handle_gradient(self): - """A method running an all-reduce operation in a data parallel group. - It then runs an all-reduce operation for all expert parameters - across the MoE model parallel group. - """ - moe_data = moe_env.data_parallel_size - global_data = gpc.data_parallel_size - - if global_data > 1: - # bucketize and all-reduce - buckets = {} - # Pack the buckets. - for param in self._model.parameters(): - if param.requires_grad and \ - param.grad is not None and \ - not hasattr(param, 'moe_param'): - tp = param.data.type() - if tp not in buckets: - buckets[tp] = [] - buckets[tp].append(param) - # param.main_grad = param.grad - - # For each bucket, all-reduce and copy all-reduced grads.
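- # Same bucketed flatten/all-reduce/unflatten pattern as the data parallel - # handler; parameters tagged with `moe_param` were excluded above and are - # reduced separately over the MoE data parallel group further below.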
- for tp in buckets: - bucket = buckets[tp] - grads = [param.grad.data for param in bucket] - coalesced = _flatten_dense_tensors(grads) - coalesced /= gpc.get_world_size(ParallelMode.DATA) - - dist.all_reduce( - coalesced, group=gpc.get_group(ParallelMode.DATA)) - for buf, synced in zip(grads, _unflatten_dense_tensors( - coalesced, grads)): - buf.copy_(synced) - - if global_data > 1: - for param in self._model.parameters(): - if not param.requires_grad or param.grad is None: - continue - if moe_data > 1 and hasattr(param, 'moe_param'): - param.grad.data /= moe_data - dist.all_reduce(param.grad.data, - group=gpc.get_group(ParallelMode.MOE_DATA)) diff --git a/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py b/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py deleted file mode 100644 index 458a11509336782198c5aca28148e1fab2c06058..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py +++ /dev/null @@ -1,41 +0,0 @@ -#!/usr/bin/env python - -import torch.distributed as dist -from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors - -from colossalai.core import global_context as gpc -from colossalai.registry import GRADIENT_HANDLER -from ._base_gradient_handler import BaseGradientHandler -from collections import defaultdict - - -@GRADIENT_HANDLER.register_module -class PipelineSharedModuleGradientHandler(BaseGradientHandler): - """A helper class to handle all-reduce operations in sub parallel groups. - An all-reduce collective communication will be executed in - :func:`handle_gradient` among all sub pipeline parallel groups. - For better performance, it bucketizes the gradients of all parameters of - the same type to improve the efficiency of communication. - """ - - def handle_gradient(self): - """A method running an all-reduce operation in sub pipeline parallel groups. - """ - if gpc.pipeline_parallel_size > 1: - # bucketize and all-reduce - buckets = defaultdict(lambda: defaultdict(list)) - # Pack the buckets. - for param in self._model.parameters(): - group = getattr(param, 'pipeline_shared_module_pg', None) - if param.requires_grad and param.grad is not None and group is not None: - tp = param.data.type() - buckets[group][tp].append(param) - - # For each bucket, all-reduce and copy all-reduced grads.
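- # Buckets are keyed first by the shared-module process group, then by dtype. - # Gradients are summed rather than averaged here, presumably because each - # pipeline stage holds only a partial gradient for a shared (e.g. tied) - # parameter, so the sum reconstructs the full gradient.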
- for group, group_buckets in buckets.items(): - for tp, bucket in group_buckets.items(): - grads = [param.grad.data for param in bucket] - coalesced = _flatten_dense_tensors(grads) - dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group) - for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): - buf.copy_(synced) diff --git a/colossalai/engine/gradient_handler/_sequence_parallel_gradient_handler.py b/colossalai/engine/gradient_handler/_sequence_parallel_gradient_handler.py deleted file mode 100644 index 69563acba7315b767d5b2fbe839e2bd058acc995..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/_sequence_parallel_gradient_handler.py +++ /dev/null @@ -1,51 +0,0 @@ -#!/usr/bin/env python -import torch -import torch.distributed as dist -from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors - -from colossalai.core import global_context as gpc -from colossalai.registry import GRADIENT_HANDLER -from ._base_gradient_handler import BaseGradientHandler -from ...context.parallel_mode import ParallelMode -import colossalai - - -@GRADIENT_HANDLER.register_module -class SequenceParallelGradientHandler(BaseGradientHandler): - """A helper class to handle all-reduce operations in a data parallel group. - An all-reduce collective communication will be executed in - :func:`handle_gradient` among a data parallel group. - For better performance, it bucketizes the gradients of all parameters of - the same type to improve the efficiency of communication. - """ - - def handle_gradient(self): - """A method running an all-reduce operation in a data parallel group. - """ - - # bucketize and all-reduce - buckets = {} - - # Pack the buckets. - for param in self._model.parameters(): - if param.requires_grad and param.grad is not None: - tp = param.data.type() - if tp not in buckets: - buckets[tp] = [] - buckets[tp].append(param) - - # For each bucket, all-reduce and copy all-reduced grads. - for tp in buckets: - bucket = buckets[tp] - grads = [param.grad.data for param in bucket] - coalesced = _flatten_dense_tensors(grads) - - coalesced /= gpc.get_world_size(ParallelMode.SEQUENCE_DP) - - dist.all_reduce( - coalesced, group=gpc.get_group(ParallelMode.SEQUENCE_DP)) - - for buf, synced in zip(grads, _unflatten_dense_tensors( - coalesced, grads)): - buf.copy_(synced) diff --git a/colossalai/engine/gradient_handler/_zero_gradient_handler.py b/colossalai/engine/gradient_handler/_zero_gradient_handler.py deleted file mode 100644 index b303bcb39657c855bb10374ad8f0858a46593ca4..0000000000000000000000000000000000000000 --- a/colossalai/engine/gradient_handler/_zero_gradient_handler.py +++ /dev/null @@ -1,16 +0,0 @@ -from colossalai.registry import GRADIENT_HANDLER -from ._base_gradient_handler import BaseGradientHandler - - -@GRADIENT_HANDLER.register_module -class ZeROGradientHandler(BaseGradientHandler): - """A helper class to handle all-reduce operations in a data parallel group. - An all-reduce collective communication will be executed in - :func:`handle_gradient` among a data parallel group. - This class is specialized with ZeRO optimization. - """ - - def handle_gradient(self): - """A method running an all-reduce operation in a data parallel group.
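- The reduction itself is delegated to the ZeRO optimizer's own ``allreduce_gradients()`` method.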
- """ - self._optimizer.allreduce_gradients() diff --git a/colossalai/engine/ophooks/__init__.py b/colossalai/engine/ophooks/__init__.py deleted file mode 100644 index abfe0a5819a035d472e46a36abe78542933f2df6..0000000000000000000000000000000000000000 --- a/colossalai/engine/ophooks/__init__.py +++ /dev/null @@ -1,115 +0,0 @@ -from ._base_ophook import BaseOpHook -from ._memtracer_ophook import MemTracerOpHook -import torch -from typing import List - -all = ["BaseOpHook", "MemTracerOpHook", "register_ophooks_recursively"] - - -# apply torch.autograd.Function that calls a backward_function to tensors in output -def _apply_to_tensors_only(module, functional, backward_function, outputs): - if type(outputs) is tuple: - touched_outputs = [] - for output in outputs: - touched_output = _apply_to_tensors_only(module, functional, - backward_function, output) - touched_outputs.append(touched_output) - return tuple(touched_outputs) - elif type(outputs) is torch.Tensor: - return functional.apply(module, backward_function, outputs) - else: - return outputs - - -class PreBackwardFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, module, pre_backward_function, outputs): - ctx.module = module - ctx.pre_backward_function = pre_backward_function - module.applied_pre_backward = False - outputs = outputs.detach() - return outputs - - @staticmethod - def backward(ctx, *args): - ctx.pre_backward_function(ctx.module) - return (None, None) + args - - -class PostBackwardFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, module, pre_backward_function, output): - ctx.module = module - output = output.detach() - ctx.pre_backward_function = pre_backward_function - return output - - @staticmethod - def backward(ctx, *args): - """ - Args: - activation_grad of the next layer. - Returns: - grad of the input activation. - """ - ctx.pre_backward_function(ctx.module) - return (None, None) + args - - -def register_ophooks_recursively(module: torch.nn.Module, - ophook_list: List[BaseOpHook] = None, - name: str = ""): - r"""Recursilvely register pre/post hooks for all submodules in the module in FWD and BWD.""" - assert isinstance(module, torch.nn.Module) - has_children = False - for child_name, child in module.named_children(): - register_ophooks_recursively(child, ophook_list, name + child_name) - has_children = True - - # Early return on modules with no parameters or buffers that - # are not in their children. - if (len(list(module.named_parameters(recurse=False))) == 0 - and len(list(module.named_buffers(recurse=False))) == 0): - return - - # return if the module has not childern. 
- if has_children: - return - - if ophook_list is not None: - for hook in ophook_list: - assert isinstance(hook, BaseOpHook) - - def _pre_forward_module_hook(submodule, *args): - for hook in ophook_list: - assert isinstance(submodule, torch.nn.Module) - hook.pre_fwd_exec(submodule, *args) - - def _post_forward_module_hook(submodule, *args): - for hook in ophook_list: - assert isinstance(submodule, torch.nn.Module) - hook.post_fwd_exec(submodule, *args) - - def _pre_backward_module_hook(submodule, inputs, output): - def _run_before_backward_function(submodule): - for hook in ophook_list: - assert isinstance(submodule, torch.nn.Module) - hook.pre_bwd_exec(submodule, inputs, output) - - return _apply_to_tensors_only(submodule, PreBackwardFunction, - _run_before_backward_function, output) - - def _post_backward_module_hook(submodule, inputs): - def _run_after_backward_function(submodule): - for hook in ophook_list: - assert isinstance(submodule, torch.nn.Module) - hook.post_bwd_exec(submodule, inputs) - - return _apply_to_tensors_only(submodule, PostBackwardFunction, - _run_after_backward_function, inputs) - - module.register_forward_pre_hook(_pre_forward_module_hook) - module.register_forward_hook(_post_forward_module_hook) - - # The backward hooks are also attached via forward hooks: they wrap the - # module's output (resp. input) in an autograd Function whose backward() - # fires the pre/post-backward ophooks when gradients flow through. - module.register_forward_hook(_pre_backward_module_hook) - module.register_forward_pre_hook(_post_backward_module_hook) diff --git a/colossalai/engine/ophooks/__pycache__/__init__.cpython-36.pyc b/colossalai/engine/ophooks/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index a8589fb4c1e56be312cbe1ae4696432472e7e2ee..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/ophooks/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/ophooks/__pycache__/__init__.cpython-37.pyc b/colossalai/engine/ophooks/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 784ab0dfed3132d82c24653497a1d93a308caa8e..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/ophooks/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/ophooks/__pycache__/_base_ophook.cpython-36.pyc b/colossalai/engine/ophooks/__pycache__/_base_ophook.cpython-36.pyc deleted file mode 100644 index 52be4f8f950812f6d7fba505f63e314ee1644df9..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/ophooks/__pycache__/_base_ophook.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/ophooks/__pycache__/_base_ophook.cpython-37.pyc b/colossalai/engine/ophooks/__pycache__/_base_ophook.cpython-37.pyc deleted file mode 100644 index b03b0e7f7568d58b7efdbb2e8b652977081a7e0c..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/ophooks/__pycache__/_base_ophook.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/ophooks/__pycache__/_memtracer_ophook.cpython-36.pyc b/colossalai/engine/ophooks/__pycache__/_memtracer_ophook.cpython-36.pyc deleted file mode 100644 index 478d9937805b20a3226fae4715dc1d9e6b0effbf..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/ophooks/__pycache__/_memtracer_ophook.cpython-36.pyc and /dev/null differ diff --git a/colossalai/engine/ophooks/__pycache__/_memtracer_ophook.cpython-37.pyc b/colossalai/engine/ophooks/__pycache__/_memtracer_ophook.cpython-37.pyc deleted file mode 100644 index 96b83906eaeb9d2461cc0bbf5420db6666c1cd2f..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/ophooks/__pycache__/_memtracer_ophook.cpython-37.pyc and /dev/null differ diff
--git a/colossalai/engine/ophooks/_base_ophook.py b/colossalai/engine/ophooks/_base_ophook.py deleted file mode 100644 index e948a8cfbcc13c72a978f2a0d8b6dc0a85325f92..0000000000000000000000000000000000000000 --- a/colossalai/engine/ophooks/_base_ophook.py +++ /dev/null @@ -1,29 +0,0 @@ -from abc import ABC, abstractmethod -import torch - - -class BaseOpHook(ABC): - """This class allows users to add customized operations - before and after the execution of a PyTorch submodule""" - def __init__(self): - pass - - @abstractmethod - def pre_fwd_exec(self, module: torch.nn.Module, *args): - pass - - @abstractmethod - def post_fwd_exec(self, module: torch.nn.Module, *args): - pass - - @abstractmethod - def pre_bwd_exec(self, module: torch.nn.Module, input, output): - pass - - @abstractmethod - def post_bwd_exec(self, module: torch.nn.Module, input): - pass - - @abstractmethod - def post_iter(self): - pass diff --git a/colossalai/engine/ophooks/_memtracer_ophook.py b/colossalai/engine/ophooks/_memtracer_ophook.py deleted file mode 100644 index 3f5671230351be724ab02f799ef688b660266735..0000000000000000000000000000000000000000 --- a/colossalai/engine/ophooks/_memtracer_ophook.py +++ /dev/null @@ -1,131 +0,0 @@ -import torch -from . import BaseOpHook -from concurrent.futures import ThreadPoolExecutor -from colossalai.registry import OPHOOKS -from colossalai.logging import get_dist_logger -from time import sleep, time -import psutil -import pickle - - -def get_cuda_memory_used(device): - """ - Get the CUDA memory currently allocated by tensors. - Note that the ``device`` argument is accepted but unused; the current - CUDA device is always queried. - """ - ret = torch.cuda.memory_allocated() - # get the peak memory to report correct data, so reset the counter for the next call - if hasattr(torch.cuda, "reset_peak_memory_stats"): # pytorch 1.4+ - torch.cuda.reset_peak_memory_stats() - return ret - - -class AsyncMemoryMonitor: - def __init__(self, power=10): - """ - An async memory monitor running during computation, - sampling GPU memory usage of the current GPU device - at intervals of 1/(10**power) sec.
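- Call :meth:`start` before a region of interest and :meth:`finish` afterwards; :meth:`finish` returns the peak usage observed in between.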
- """ - self.keep_measuring = False - self.executor = ThreadPoolExecutor(max_workers=1) - self.monitor_thread = None - self.interval = 1 / (10**power) - self.time_stamps = [] - self.mem_stats = [] - - def set_interval(self, power: int): - self.interval = 1 / (10**power) - - def is_measuring(self): - return self.keep_measuring - - def start(self): - self.keep_measuring = True - self.monitor_thread = self.executor.submit(self._measure_usage) - - def finish(self): - if self.keep_measuring is False: - return 0 - self.keep_measuring = False - max_usage = self.monitor_thread.result() - self.monitor_thread = None - self.time_stamps.append(time()) - self.mem_stats.append(max_usage) - return max_usage - - def _measure_usage(self): - max_usage = 0 - dev = torch.device(f"cuda:{torch.cuda.current_device()}") - while self.keep_measuring: - max_usage = max( - max_usage, - get_cuda_memory_used(dev), - ) - sleep(self.interval) - return max_usage - - def state_dict(self): - return { - "time_stamps": self.time_stamps, - "mem_stats": self.mem_stats, - } - - def save(self, filename): - with open(filename, "wb") as f: - pickle.dump(self.state_dict(), f) - - -@OPHOOKS.register_module -class MemTracerOpHook(BaseOpHook): - def __init__(self, niter=5): - super().__init__() - self.async_mem_monitor = AsyncMemoryMonitor() - self._niter = niter - self._curiter = 0 - self._logger = get_dist_logger() - - def _isvalid(self, module): - return module.training and self._curiter < self._niter - - def niter(self): - return self._niter - - def pre_fwd_exec(self, module: torch.nn.Module, *args): - if self._isvalid(module): - self.async_mem_monitor.finish() - self.async_mem_monitor.start() - self._logger.debug(f'FWD PRE {module.__class__.__name__}') - - def post_fwd_exec(self, module: torch.nn.Module, *args): - if self._isvalid(module): - self.async_mem_monitor.finish() - self._logger.debug(f'FWD POST {module.__class__.__name__}') - - def pre_bwd_exec(self, module: torch.nn.Module, input, output): - assert isinstance(module, torch.nn.Module) - if self._isvalid(module): - self.async_mem_monitor.finish() - self.async_mem_monitor.start() - self._logger.debug(f'BWD PRE {module.__class__.__name__}') - - def post_bwd_exec(self, module: torch.nn.Module, input): - assert isinstance(module, torch.nn.Module) - if self._isvalid(module): - self.async_mem_monitor.finish() - self._logger.debug(f'BWD POST {module.__class__.__name__}') - - def pre_iter(self): - pass - - def post_iter(self): - self.async_mem_monitor.finish() - if self._curiter == self._niter: - self._logger.info( - f'dump a memory statistics as pickle to ./memstats.pkl') - self.save_results("memstats.pkl") - self._curiter += 1 - - def save_results(self, filename): - self.async_mem_monitor.save(filename) diff --git a/colossalai/engine/schedule/__init__.py b/colossalai/engine/schedule/__init__.py deleted file mode 100644 index 36472413eab27d50294235865ad11037c3e19c55..0000000000000000000000000000000000000000 --- a/colossalai/engine/schedule/__init__.py +++ /dev/null @@ -1,5 +0,0 @@ -from ._base_schedule import BaseSchedule -from ._pipeline_schedule import PipelineSchedule, InterleavedPipelineSchedule -from ._non_pipeline_schedule import NonPipelineSchedule - -__all__ = ['BaseSchedule', 'NonPipelineSchedule', 'PipelineSchedule', 'InterleavedPipelineSchedule'] diff --git a/colossalai/engine/schedule/__pycache__/__init__.cpython-37.pyc b/colossalai/engine/schedule/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 
2ac2eda201e5ceee7a4fce9c73336e093843f2ed..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/schedule/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/schedule/__pycache__/_base_schedule.cpython-37.pyc b/colossalai/engine/schedule/__pycache__/_base_schedule.cpython-37.pyc deleted file mode 100644 index 07291b5a251e2116f36c76990104c30b47380938..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/schedule/__pycache__/_base_schedule.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/schedule/__pycache__/_non_pipeline_schedule.cpython-37.pyc b/colossalai/engine/schedule/__pycache__/_non_pipeline_schedule.cpython-37.pyc deleted file mode 100644 index ca8ef4c82ecfa80c8bc48dd685322b11e4f12860..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/schedule/__pycache__/_non_pipeline_schedule.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/schedule/__pycache__/_pipeline_schedule.cpython-37.pyc b/colossalai/engine/schedule/__pycache__/_pipeline_schedule.cpython-37.pyc deleted file mode 100644 index da73b9b5cbac8f631a5d03a26dcf911b53f5e95a..0000000000000000000000000000000000000000 Binary files a/colossalai/engine/schedule/__pycache__/_pipeline_schedule.cpython-37.pyc and /dev/null differ diff --git a/colossalai/engine/schedule/_base_schedule.py b/colossalai/engine/schedule/_base_schedule.py deleted file mode 100644 index d3c781b13dbdabf6375615e56373aae10a858202..0000000000000000000000000000000000000000 --- a/colossalai/engine/schedule/_base_schedule.py +++ /dev/null @@ -1,119 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC, abstractmethod - -import torch - -from typing import Iterable, Callable -from .._base_engine import Engine -from colossalai.logging import get_dist_logger -from colossalai.utils import get_current_device - - -class BaseSchedule(ABC): - """A basic helper class to control the process of training or evaluation. - It is mainly composed of forward_backward_step for gradient backpropagation and - optimizer_step for parameter updates. - For convenience in enabling FP16, we aggregate all code that controls FP16 into - the schedule class. - """ - - def __init__(self, batch_data_process_func: Callable = None): - self.logger = get_dist_logger() - self.batch_data_process_func = batch_data_process_func - - @staticmethod - def _move_tensor(element): - if torch.is_tensor(element): - if not element.is_cuda: - return element.to(get_current_device()).detach() - return element - - def _move_to_device(self, data): - if isinstance(data, dict): - data = {k: self._move_tensor(v) for k, v in data.items()} - else: - data = self._move_tensor(data) - return data - - @staticmethod - def _check_sanity(data, tag: str): - assert isinstance(data, (torch.Tensor, dict)), \ - f'{tag} must be torch.Tensor or dict' - - def load_batch(self, data_iter, to_gpu=True): - """Loads a batch from the data iterator. It returns the data and labels, already - placed on the same GPU as the model.
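- If a ``batch_data_process_func`` was supplied, it is applied to the raw batch to produce the ``(data, label)`` pair; otherwise the batch is unpacked directly.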
- - :param data_iter: Data iterator from which to get a batch of data - :type data_iter: DataIter - :param to_gpu: Whether the data should be moved to GPU - :type to_gpu: bool, optional - - :return: (data, label) - :rtype: (:class:`torch.Tensor`, :class:`torch.Tensor`) - """ - if data_iter is None: - raise RuntimeError('Dataloader is not defined.') - batch_data = next(data_iter) - - if self.batch_data_process_func: - data, label = self.batch_data_process_func(batch_data) - else: - data, label = batch_data - self._check_sanity(data, 'data') - self._check_sanity(label, 'label') - if isinstance(data, torch.Tensor): - self.batch_size = data.size(0) - else: - self.batch_size = next(iter(data.values())).size(0) - if to_gpu: - return self._move_to_device(data), self._move_to_device(label) - return data, label - - def pre_processing(self, engine: Engine): - """To perform actions before running the schedule. - """ - pass - - @abstractmethod - def forward_backward_step(self, - engine: Engine, - data_iter: Iterable, - forward_only: bool, - return_loss: bool = True, - return_output_label: bool = True - ): - """The process function over a batch of data for training or evaluation. - - :param engine: Colossalai training engine - :type engine: colossalai.engine.Engine - :param data_iter: Data iterator from which to get a batch of data - :type data_iter: DataIter - :param forward_only: If True, the process won't include backward - :type forward_only: bool - :param return_loss: If False, the loss won't be returned - :type return_loss: bool, optional - :param return_output_label: If False, the output and label won't be returned - :type return_output_label: bool, optional - """ - pass - - @staticmethod - def _call_engine(engine, inputs): - if isinstance(inputs, torch.Tensor): - return engine(inputs) - else: - return engine(**inputs) - - @staticmethod - def _call_engine_criterion(engine, outputs, labels): - assert isinstance(outputs, (torch.Tensor, list, tuple) - ), f'Expected output of model to be (torch.Tensor, list, tuple), got {type(outputs)}' - if isinstance(outputs, torch.Tensor): - outputs = (outputs,) - if isinstance(labels, torch.Tensor): - return engine.criterion(*outputs, labels) - else: - return engine.criterion(*outputs, **labels) diff --git a/colossalai/engine/schedule/_non_pipeline_schedule.py b/colossalai/engine/schedule/_non_pipeline_schedule.py deleted file mode 100644 index bc1a5664c35e0b36800e60cd01b28202f2849adb..0000000000000000000000000000000000000000 --- a/colossalai/engine/schedule/_non_pipeline_schedule.py +++ /dev/null @@ -1,65 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from typing import Iterable - -import torch - -from colossalai.engine import Engine -from ._base_schedule import BaseSchedule -from colossalai.utils import conditional_context - - -class NonPipelineSchedule(BaseSchedule): - """A helper schedule class for no pipeline parallelism running environment. - During one process, it loads a batch of data and feeds it to the model. - After getting the output and calculating the loss, it will use :meth:`step` - to update the parameters if it is in training mode. - """ - - def forward_backward_step(self, - engine: Engine, - data_iter: Iterable, - forward_only: bool = False, - return_loss: bool = True, - return_output_label: bool = True): - """The process function that loads a batch of data and feeds it to the model. - The returned labels and loss will be None if :attr:`return_loss` is False.
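- When ``forward_only`` is True, the forward pass runs under ``torch.no_grad()`` and no backward pass is executed.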
- - :param engine: Colossalai training engine - :param data_iter: Data iterator of the dataloader, e.g. iter(dataloader) - :param forward_only: If True, the model is run for the forward pass, else back propagation will be executed - :param return_loss: Loss will be returned if True - :param return_output_label: Output and label will be returned if True - :type engine: colossalai.engine.Engine - :type data_iter: Iterator - :type forward_only: bool, optional - :type return_loss: bool, optional - :type return_output_label: bool, optional - - :return: (output, label, loss) - :rtype: Tuple[:class:`torch.Tensor`] - """ - assert forward_only or return_loss, \ - "The argument 'return_loss' has to be True when 'forward_only' is False, but got False." - data, label = self.load_batch(data_iter) - - # forward - with conditional_context(torch.no_grad(), enable=forward_only): - output = self._call_engine(engine, data) - if return_loss: - loss = self._call_engine_criterion(engine, output, label) - - if not forward_only: - engine.backward(loss) - - if return_output_label: - if return_loss: - return output, label, loss - else: - return output, label, None - else: - if return_loss: - return None, None, loss - else: - return None, None, None diff --git a/colossalai/engine/schedule/_pipeline_schedule.py b/colossalai/engine/schedule/_pipeline_schedule.py deleted file mode 100644 index 5bab0d524d89fd303bd1e1ea1731600c2544f9c0..0000000000000000000000000000000000000000 --- a/colossalai/engine/schedule/_pipeline_schedule.py +++ /dev/null @@ -1,710 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import inspect -from typing import Callable, List, Tuple, Union - -import colossalai.communication as comm -import torch.cuda -from colossalai.amp.naive_amp import NaiveAMPModel -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import switch_virtual_pipeline_parallel_rank -from colossalai.utils.cuda import get_current_device -from colossalai.zero import (ZeroRedundancyOptimizer_Level_2, - ZeroRedundancyOptimizer_Level_3) - -from ._base_schedule import BaseSchedule - - -def pack_return_tensors(return_tensors): - output, label = tuple(zip(*return_tensors)) - if isinstance(output[0], torch.Tensor): - output = torch.cat(output, dim=0) - elif isinstance(output[0], (list, tuple)): - output = tuple(torch.cat(tensors, dim=0) for tensors in zip(*output)) - else: - raise TypeError('Output of model must be tensor or list/tuple of tensors') - if isinstance(label[0], torch.Tensor): - label = torch.cat(label, dim=0) - else: - merged_label = {k: [] for k in label[0].keys()} - for d in label: - for k, v in d.items(): - merged_label[k].append(v) - label = {k: torch.cat(v, dim=0) for k, v in merged_label.items()} - return output, label - - -class PipelineSchedule(BaseSchedule): - """A helper schedule class for pipeline parallelism running environment. - It uses the non-interleaved 1F1B strategy. Other properties are similar to - :class:`NonPipelineSchedule`.
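- In the 1F1B schedule, each stage first runs a warmup phase of forward-only microbatches, then alternates one forward with one backward pass in the steady state, and finally drains the outstanding backward passes; this bounds activation memory by the number of in-flight microbatches.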
- - :param num_microbatches: The number of microbatches - :type num_microbatches: int - :param batch_data_process_func: The preprocessing function which receives a batch of data; it will be executed in `load_batch` - :type batch_data_process_func: Callable, optional - :param tensor_shape: Specified shape in pipeline communication - :type tensor_shape: torch.Size, optional - :param scatter_gather_tensors: If set to `True`, communication will be reduced over pipeline when using 1D tensor parallelization - :type scatter_gather_tensors: bool, optional - """ - - def __init__(self, - num_microbatches, - batch_data_process_func: Callable = None, - tensor_shape: Union[torch.Size, List[int], Tuple[int]] = None, - scatter_gather_tensors: bool = False): - super().__init__(batch_data_process_func=batch_data_process_func) - self.num_microbatches = num_microbatches - self.dtype = torch.float - self.tensor_shape = tensor_shape - self.scatter_gather_tensors = False - if gpc.is_initialized(ParallelMode.PARALLEL_1D) and gpc.get_world_size(ParallelMode.PARALLEL_1D) > 1: - self.scatter_gather_tensors = scatter_gather_tensors - self._logger = get_dist_logger() - - def load_batch(self, data_iter): - # Pipeline schedule just puts data in memory - self.batch_data, self.batch_label = super().load_batch(data_iter, to_gpu=False) - self.microbatch_offset = 0 - if isinstance(self.batch_data, torch.Tensor): - batch_size = self.batch_data.size(0) - else: - batch_size = next(iter(self.batch_data.values())).size(0) - assert batch_size % self.num_microbatches == 0, \ - "Batch size should be divisible by the number of microbatches" - self.microbatch_size = batch_size // self.num_microbatches - - def _get_data_slice(self, data, offset): - if isinstance(data, torch.Tensor): - return data[offset: offset + self.microbatch_size] - else: - return {k: v[offset:offset + self.microbatch_size] for k, v in data.items()} - - def load_micro_batch(self): - data = self._get_data_slice(self.batch_data, self.microbatch_offset) - label = self._get_data_slice(self.batch_label, self.microbatch_offset) - self.microbatch_offset += self.microbatch_size - return self._move_to_device(data), self._move_to_device(label) - - def pre_processing(self, engine): - if isinstance(engine.optimizer, (ZeroRedundancyOptimizer_Level_2, ZeroRedundancyOptimizer_Level_3)): - raise TypeError( - "Pipeline schedule is currently not compatible with ZeRO Level 2 and Level 3" - ) - model = engine.model - if isinstance(model, NaiveAMPModel): - self.dtype = torch.half - model = model.model - sig = inspect.signature(model.forward) - for p in sig.parameters.values(): - assert p.kind != inspect.Parameter.VAR_POSITIONAL, '*args is not supported' - - @staticmethod - def _call_engine(model, input_tensor, batch_data): - if isinstance(model, NaiveAMPModel): - sig = inspect.signature(model.model.forward) - else: - sig = inspect.signature(model.forward) - if isinstance(batch_data, torch.Tensor): - if input_tensor is None: - return model(batch_data) - elif len(sig.parameters) > 1: - return model(input_tensor, batch_data) - else: - return model(input_tensor) - else: - filter_batch = True - for p in sig.parameters.values(): - if p.kind == inspect.Parameter.VAR_KEYWORD: - filter_batch = False - if filter_batch: - batch_data = {k: v for k, v in batch_data.items() if k in sig.parameters} - if input_tensor is None: - return model(**batch_data) - else: - return model(input_tensor, **batch_data) - - def forward_step(self, engine, input_tensor, return_tensors, return_output_label=True,
-    def forward_step(self, engine, input_tensor, return_tensors, return_output_label=True, accum_loss=None):
-        """Forward step for the passed-in model. If it is the first stage, the input tensor
-        is obtained from the data iterator; otherwise the passed-in input_tensor is used.
-        Returns the output tensor. This is a helper function and can be ignored by users.
-
-        :param engine: Your engine object
-        :type engine: colossalai.engine.Engine
-        :param input_tensor: Input tensor for this pipeline stage
-        :type input_tensor: :class:`torch.Tensor`
-        :param return_tensors: A list of tensors to return
-        :type return_tensors: List[:class:`torch.Tensor`]
-        :param return_output_label: Whether to return the output and label
-        :type return_output_label: bool, optional
-        :param accum_loss: The tensor in which the accumulated loss is stored
-        :type accum_loss: optional
-
-        :return: The output or the loss value of the current pipeline stage
-        :rtype: :class:`torch.Tensor`
-        """
-        data, label = self.load_micro_batch()
-        output_tensor = self._call_engine(engine.model, input_tensor, data)
-
-        if gpc.is_last_rank(ParallelMode.PIPELINE):
-            if return_output_label:
-                return_tensors.append((output_tensor, label))
-            if accum_loss is not None:
-                loss_reduced = self._call_engine_criterion(engine, output_tensor, label) / self.num_microbatches
-                accum_loss.add_(loss_reduced.detach())
-                return loss_reduced
-            else:
-                # forward only; the output is not needed for a backward pass
-                return output_tensor
-        else:
-            assert isinstance(
-                output_tensor, torch.Tensor), 'Output of model using pipeline parallelism must be a tensor (except the last stage).'
-            self._logger.debug(
-                f'Global rank {gpc.get_global_rank()}, pipeline rank {gpc.get_local_rank(ParallelMode.PIPELINE)} forward output tensor {output_tensor.shape}, dtype {output_tensor.dtype}')
-            return output_tensor
-
-    def backward_step(self, engine, input_tensor, output_tensor, output_tensor_grad):
-        """Backward step through the passed-in output tensor. If it is the last stage, the
-        output_tensor_grad is None; otherwise it is the gradient with respect to the stage's output tensor.
-        Returns the gradient with respect to the input tensor (None if first stage).
-        This is a helper function and can be ignored by users.
-
-        :param engine: Your engine object
-        :type engine: colossalai.engine.Engine
-        :param input_tensor: Input tensor for this pipeline stage
-        :type input_tensor: :class:`torch.Tensor`
-        :param output_tensor: Output tensor for this pipeline stage
-        :type output_tensor: :class:`torch.Tensor`
-        :param output_tensor_grad: Gradient of the output tensor for this pipeline stage
-        :type output_tensor_grad: :class:`torch.Tensor`
-
-        :return: Gradient of the input tensor
-        :rtype: :class:`torch.Tensor`
-        """
-
-        # Retain the grad on the input_tensor.
-        if input_tensor is not None:
-            input_tensor.retain_grad()
-
-        # Backward pass.
-        if output_tensor_grad is None:
-            engine.backward(output_tensor)
-        else:
-            engine.backward_by_grad(output_tensor, output_tensor_grad)
-
-        # Collect the grad of the input_tensor.
-        input_tensor_grad = None
-        if input_tensor is not None:
-            input_tensor_grad = input_tensor.grad
-
-        return input_tensor_grad
-
-    def forward_backward_step(self,
-                              engine,
-                              data_iter,
-                              forward_only=False,
-                              return_loss=True,
-                              return_output_label=True):
-        """Runs the non-interleaved 1F1B schedule, with communication between pipeline stages.
-        Returns a tuple with the loss if this is the last stage, an empty tuple otherwise.
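The schedule that follows sizes its warmup phase as `pipeline_world_size - local_rank - 1`, capped at `num_microbatches`; a small arithmetic sketch with illustrative values (not part of the deleted file):

```python
# Illustrative 1F1B warmup arithmetic: 4-stage pipeline, 8 microbatches.
pipeline_world_size = 4
num_microbatches = 8

for local_rank in range(pipeline_world_size):
    num_warmup = min(pipeline_world_size - local_rank - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup  # steady-state 1F1B iterations
    print(f"stage {local_rank}: warmup={num_warmup}, steady={num_steady}")
# stage 0: warmup=3, steady=5 ... stage 3: warmup=0, steady=8
```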
- - :param engine: Your engine object - :type engine: colossalai.engine.Engine - :param data_iter: Dataloader as the form of an iterator, obtained by calling iter(dataloader) - :type data_iter: Iterable - :param forward_only: Whether run forward step only. Default is false. If true, no backward will be run. - :type forward_only: bool - :param return_loss: Whether returns the loss value. Default is true. - :type return_loss: bool - :param return_output_label: If False, the output and label won't be returned - :type return_output_label: bool - - :return: (output, label, loss) - :rtype: Tuple[:class:`torch.Tensor`] - """ - - assert forward_only or return_loss, \ - 'The argument \'return_loss\' has to be True when \'forward_only\' is False, but got False.' - self.load_batch(data_iter) - num_warmup_microbatches = \ - (gpc.get_world_size(ParallelMode.PIPELINE) - - gpc.get_local_rank(ParallelMode.PIPELINE) - 1) - num_warmup_microbatches = min(num_warmup_microbatches, - self.num_microbatches) - num_microbatches_remaining = self.num_microbatches - num_warmup_microbatches - - # Input, output tensors only need to be saved when doing backward passes - input_tensors = None - output_tensors = None - if not forward_only: - input_tensors = [] - output_tensors = [] - return_tensors = [] - if return_loss and gpc.is_pipeline_last_stage(ignore_virtual=True): - accum_loss = torch.zeros(1, device=get_current_device()) - else: - accum_loss = None - # Used for tensor meta information communication - ft_shape = self.tensor_shape - bt_shape = None - fs_checker = self.tensor_shape is None - - # Run warmup forward passes. - for i in range(num_warmup_microbatches): - if not gpc.is_first_rank(ParallelMode.PIPELINE): - ft_shape = comm.recv_tensor_meta(ft_shape) - input_tensor = comm.recv_forward(ft_shape, dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - output_tensor = self.forward_step( - engine, input_tensor, return_tensors, - return_output_label=return_output_label, - accum_loss=accum_loss - ) - if not gpc.is_last_rank(ParallelMode.PIPELINE): - bt_shape = output_tensor.shape - fs_checker = comm.send_tensor_meta(output_tensor, fs_checker) - comm.send_forward(output_tensor, scatter_gather_tensors=self.scatter_gather_tensors) - - if not forward_only: - input_tensors.append(input_tensor) - output_tensors.append(output_tensor) - - # Before running 1F1B, need to receive first forward tensor. - # If all microbatches are run in warmup / cooldown phase, then no need to - # receive this tensor here. - if num_microbatches_remaining > 0: - if not gpc.is_first_rank(ParallelMode.PIPELINE): - ft_shape = comm.recv_tensor_meta(ft_shape) - input_tensor = comm.recv_forward(ft_shape, dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - - # Run 1F1B in steady state. - for i in range(num_microbatches_remaining): - last_iteration = (i == (num_microbatches_remaining - 1)) - - output_tensor = self.forward_step( - engine, input_tensor, return_tensors, - return_output_label=return_output_label, - accum_loss=accum_loss - ) - if forward_only: - comm.send_forward(output_tensor, scatter_gather_tensors=self.scatter_gather_tensors) - - if not last_iteration: - input_tensor = comm.recv_forward(ft_shape, dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - - else: - output_tensor_grad = comm.send_forward_recv_backward( - output_tensor, bt_shape, dtype=self.dtype, scatter_gather_tensors=self.scatter_gather_tensors) - - # Add input_tensor and output_tensor to end of list. 
- input_tensors.append(input_tensor) - output_tensors.append(output_tensor) - - # Pop input_tensor and output_tensor from the start of the list for - # the backward pass. - input_tensor = input_tensors.pop(0) - output_tensor = output_tensors.pop(0) - - input_tensor_grad = self.backward_step( - engine, - input_tensor, output_tensor, - output_tensor_grad - ) - - if last_iteration: - input_tensor = None - comm.send_backward(input_tensor_grad, scatter_gather_tensors=self.scatter_gather_tensors) - else: - input_tensor = comm.send_backward_recv_forward( - input_tensor_grad, ft_shape, dtype=self.dtype, scatter_gather_tensors=self.scatter_gather_tensors) - - # Run cooldown backward passes. - if not forward_only: - for i in range(num_warmup_microbatches): - input_tensor = input_tensors.pop(0) - output_tensor = output_tensors.pop(0) - - output_tensor_grad = comm.recv_backward(bt_shape, dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - - input_tensor_grad = self.backward_step( - engine, - input_tensor, output_tensor, - output_tensor_grad - ) - - comm.send_backward(input_tensor_grad, scatter_gather_tensors=self.scatter_gather_tensors) - - if len(return_tensors) > 0: - output, label = pack_return_tensors(return_tensors) - return output, label, accum_loss - else: - return None, None, accum_loss - - -class InterleavedPipelineSchedule(PipelineSchedule): - def __init__(self, - num_microbatches, - num_model_chunks, - batch_data_process_func: Callable = None, - tensor_shape: Union[torch.Size, List[int], Tuple[int]] = None, - scatter_gather_tensors: bool = False): - """A helper schedule class for pipeline parallelism running environment. - It uses interleaved 1F1B strategy. Other properties are similar as - :class:`NonPipelineSchedule`. - - :param num_microbatches: The number of microbatches - :type num_microbatches: int - :param num_model_chunks: The number of model chunks - :type num_model_chunks: int - :param batch_data_process_func: The preprocessing function which receives a batch of data, and it will be executed in `load_batch` - :type batch_data_process_func: Callable, optional - :param tensor_shape: Specified shape in pipeline communication - :type tensor_shape: torch.Size, optional - :param scatter_gather_tensors: If set to `True`, communication will be reduced over pipeline when using 1D tensor parallelization - :type scatter_gather_tensors: bool, optional - """ - assert num_microbatches % gpc.get_world_size(ParallelMode.PIPELINE) == 0, \ - 'num_microbatches must be an integer multiple of pipeline parallel world size' - super().__init__(num_microbatches, batch_data_process_func=batch_data_process_func, - tensor_shape=tensor_shape, scatter_gather_tensors=scatter_gather_tensors) - gpc.set_virtual_pipeline_parallel_size(num_model_chunks) - gpc.set_virtual_pipeline_parallel_rank(0) - self.num_model_chunks = num_model_chunks - - def pre_processing(self, engine): - if isinstance(engine.optimizer, (ZeroRedundancyOptimizer_Level_2, ZeroRedundancyOptimizer_Level_3)): - raise TypeError( - "Pipeline schedule is currently not compatible with ZeRO Level 2 and Level 3" - ) - - if isinstance(engine.model[0], NaiveAMPModel): - self.dtype = torch.half - - for model in engine.model: - if isinstance(model, NaiveAMPModel): - model = model.model - sig = inspect.signature(model.forward) - for p in sig.parameters.values(): - assert p.kind != inspect.Parameter.VAR_POSITIONAL, '*args is not supported' - - def load_batch(self, data_iter): - super().load_batch(data_iter) - # overwrite 
microbatch_offset, since model chunks load the same microbatch and each should track its own offset
-        self.microbatch_offset = [0 for _ in range(self.num_model_chunks)]
-
-    def load_micro_batch(self, model_chunk_id):
-        data = self._get_data_slice(self.batch_data, self.microbatch_offset[model_chunk_id])
-        label = self._get_data_slice(self.batch_label, self.microbatch_offset[model_chunk_id])
-        self.microbatch_offset[model_chunk_id] += self.microbatch_size
-        return self._move_to_device(data), self._move_to_device(label)
-
-    def forward_step(self, engine, model_chunk_id, input_tensor, return_tensors, return_output_label=True, accum_loss=None):
-        """Forward step for the passed-in model. If it is the first stage, the input tensor
-        is obtained from the data iterator; otherwise the passed-in input_tensor is used.
-        Returns the output tensor. This is a helper function and can be ignored by users.
-        """
-        data, label = self.load_micro_batch(model_chunk_id)
-        output_tensor = self._call_engine(engine.model[model_chunk_id], input_tensor, data)
-
-        if gpc.is_pipeline_last_stage():
-            if return_output_label:
-                return_tensors.append((output_tensor, label))
-            if accum_loss is not None:
-                loss_reduced = self._call_engine_criterion(engine, output_tensor, label) / self.num_microbatches
-                accum_loss.add_(loss_reduced.detach())
-                return loss_reduced
-            else:
-                # forward only; the output is not needed for a backward pass
-                return output_tensor
-        else:
-            assert isinstance(
-                output_tensor, torch.Tensor), 'Output of model using pipeline parallelism must be a tensor (except the last stage).'
-            self._logger.debug(
-                f'Global rank {gpc.get_global_rank()}, pipeline rank {gpc.get_local_rank(ParallelMode.PIPELINE)} forward output tensor {output_tensor.shape}, dtype {output_tensor.dtype}')
-            return output_tensor
-
-    def forward_backward_step(self, engine, data_iter, forward_only=False, return_loss=True, return_output_label=True):
-        """Run the interleaved 1F1B schedule (model split into model chunks), with
-        communication between pipeline stages as needed.
-
-        Returns a tuple ``(output, label, loss)``; the loss is only accumulated on the last stage.
-
-        :param engine: Your engine object
-        :type engine: colossalai.engine.Engine
-        :param data_iter: Dataloader in the form of an iterator, obtained by calling iter(dataloader)
-        :type data_iter: Iterable
-        :param forward_only: Whether to run the forward pass only. Defaults to False; if True, no backward pass will be run.
-        :type forward_only: bool
-        :param return_loss: Whether to return the loss value. Defaults to True.
-        :type return_loss: bool
-        :param return_output_label: If False, the output and label won't be returned
-        :type return_output_label: bool
-        """
-        assert forward_only or return_loss, \
-            'The argument \'return_loss\' has to be True when \'forward_only\' is False, but got False.'
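Each model chunk consumes the same batch but advances its own offset; a standalone sketch of that bookkeeping (illustrative sizes, not part of the deleted file):

```python
import torch

# Two model chunks slicing the same 8-sample batch into microbatches of 2.
batch = torch.arange(8)
microbatch_size, offsets = 2, [0, 0]  # one offset per model chunk

def load_micro_batch(chunk_id):
    start = offsets[chunk_id]
    offsets[chunk_id] += microbatch_size
    return batch[start:start + microbatch_size]

print(load_micro_batch(0))  # tensor([0, 1]) -- chunk 0, first microbatch
print(load_micro_batch(1))  # tensor([0, 1]) -- chunk 1 sees the same data
print(load_micro_batch(0))  # tensor([2, 3]) -- chunk 0 advances independently
```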
- self.load_batch(data_iter) - model = engine.model - input_tensors = [[] for _ in range(len(model))] - output_tensors = [[] for _ in range(len(model))] - return_tensors = [] - if not forward_only: - output_tensor_grads = [[] for _ in range(len(model))] - if return_loss and gpc.is_pipeline_last_stage(ignore_virtual=True): - accum_loss = torch.zeros(1, device=get_current_device()) - else: - accum_loss = None - - # Used for tensor meta information communication - input_tensor_shapes = [self.tensor_shape for _ in range(len(model))] - output_tensor_shapes = [None for _ in range(len(model))] - send_tensor_shape_flags = [self.tensor_shape is None for _ in range(len(model))] - - pipeline_parallel_size = gpc.get_world_size(ParallelMode.PIPELINE) - pipeline_parallel_rank = gpc.get_local_rank(ParallelMode.PIPELINE) - - # Compute number of warmup and remaining microbatches. - num_model_chunks = len(model) - num_microbatches = self.num_microbatches * num_model_chunks - all_warmup_microbatches = False - if forward_only: - num_warmup_microbatches = num_microbatches - else: - # Run all forward passes and then all backward passes if number of - # microbatches is just the number of pipeline stages. - # Otherwise, perform (num_model_chunks-1)*pipeline_parallel_size on - # all workers, followed by more microbatches after depending on - # stage ID (more forward passes for earlier stages, later stages can - # immediately start with 1F1B). - if self.num_microbatches == pipeline_parallel_size: - num_warmup_microbatches = num_microbatches - all_warmup_microbatches = True - else: - num_warmup_microbatches = \ - (pipeline_parallel_size - pipeline_parallel_rank - 1) * 2 - num_warmup_microbatches += ( - num_model_chunks - 1) * pipeline_parallel_size - num_warmup_microbatches = min(num_warmup_microbatches, - num_microbatches) - num_microbatches_remaining = \ - num_microbatches - num_warmup_microbatches - - def get_model_chunk_id(microbatch_id, forward): - """Helper method to get the model chunk ID given the iteration number.""" - microbatch_id_in_group = microbatch_id % (pipeline_parallel_size * num_model_chunks) - model_chunk_id = microbatch_id_in_group // pipeline_parallel_size - if not forward: - model_chunk_id = (num_model_chunks - model_chunk_id - 1) - return model_chunk_id - - def forward_step_helper(microbatch_id): - """Helper method to run forward step with model split into chunks - (run set_virtual_pipeline_model_parallel_rank() before calling - forward_step()).""" - model_chunk_id = get_model_chunk_id(microbatch_id, forward=True) - gpc.set_virtual_pipeline_parallel_rank(model_chunk_id) - - # forward step - if gpc.is_pipeline_first_stage(): - if len(input_tensors[model_chunk_id]) == \ - len(output_tensors[model_chunk_id]): - input_tensors[model_chunk_id].append(None) - input_tensor = input_tensors[model_chunk_id][-1] - output_tensor = self.forward_step(engine, model_chunk_id, input_tensor, - return_tensors, return_output_label=return_output_label, accum_loss=accum_loss) - output_tensors[model_chunk_id].append(output_tensor) - - # if forward-only, no need to save tensors for a backward pass - if forward_only: - input_tensors[model_chunk_id].pop() - output_tensors[model_chunk_id].pop() - - return output_tensor - - def backward_step_helper(microbatch_id): - """Helper method to run backward step with model split into chunks - (run set_virtual_pipeline_model_parallel_rank() before calling - backward_step()).""" - model_chunk_id = get_model_chunk_id(microbatch_id, forward=False) - 
gpc.set_virtual_pipeline_parallel_rank(model_chunk_id) - - if gpc.is_pipeline_last_stage(): - if len(output_tensor_grads[model_chunk_id]) == 0: - output_tensor_grads[model_chunk_id].append(None) - input_tensor = input_tensors[model_chunk_id].pop(0) - output_tensor = output_tensors[model_chunk_id].pop(0) - output_tensor_grad = output_tensor_grads[model_chunk_id].pop(0) - input_tensor_grad = self.backward_step(engine, input_tensor, output_tensor, output_tensor_grad) - - return input_tensor_grad - - # Run warmup forward passes. - gpc.set_virtual_pipeline_parallel_rank(0) - if not gpc.is_pipeline_first_stage(): - input_tensor_shapes[0] = comm.recv_tensor_meta(input_tensor_shapes[0]) - input_tensors[0].append(comm.recv_forward(input_tensor_shapes[0], dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors)) - - for k in range(num_warmup_microbatches): - model_chunk_id = get_model_chunk_id(k, forward=True) - output_tensor = forward_step_helper(k) - if not gpc.is_pipeline_last_stage(): - output_tensor_shapes[model_chunk_id] = output_tensor.shape - send_tensor_shape_flags[model_chunk_id] = comm.send_tensor_meta( - output_tensor, send_tensor_shape_flags[model_chunk_id]) - # Determine if tensor should be received from previous stage. - next_forward_model_chunk_id = get_model_chunk_id(k+1, forward=True) - recv_prev = True - if gpc.is_pipeline_first_stage(ignore_virtual=True): - if next_forward_model_chunk_id == 0: - recv_prev = False - if k == (num_microbatches - 1): - recv_prev = False - - # Don't send tensor downstream if on last stage. - if gpc.is_pipeline_last_stage(): - output_tensor = None - - with switch_virtual_pipeline_parallel_rank(next_forward_model_chunk_id): - if not gpc.is_pipeline_first_stage(): - input_tensor_shapes[next_forward_model_chunk_id] = comm.recv_tensor_meta( - input_tensor_shapes[next_forward_model_chunk_id]) - # Send and receive tensors as appropriate (send tensors computed - # in this iteration; receive tensors for next iteration). - input_shape = input_tensor_shapes[next_forward_model_chunk_id] if recv_prev else None - if k == (num_warmup_microbatches - 1) and not forward_only and \ - not all_warmup_microbatches: - input_tensor_grad = None - recv_next = True - if gpc.is_pipeline_last_stage(ignore_virtual=True): - recv_next = False - output_shape = output_tensor_shapes[num_model_chunks-1] if recv_next else None - input_tensor, output_tensor_grad = \ - comm.send_forward_backward_recv_forward_backward( - output_tensor, input_tensor_grad, - input_shape, - output_shape, - recv_prev=recv_prev, recv_next=recv_next, - dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - output_tensor_grads[num_model_chunks-1].append(output_tensor_grad) - else: - input_tensor = \ - comm.send_forward_recv_forward( - output_tensor, - input_shape, - recv_prev=recv_prev, - dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - input_tensors[next_forward_model_chunk_id].append(input_tensor) - - # Run 1F1B in steady state. - for k in range(num_microbatches_remaining): - # Forward pass. - forward_k = k + num_warmup_microbatches - output_tensor = forward_step_helper(forward_k) - - # Backward pass. - backward_k = k - input_tensor_grad = backward_step_helper(backward_k) - - # Send output_tensor and input_tensor_grad, receive input_tensor - # and output_tensor_grad. - - # Determine if current stage has anything to send in either direction, - # otherwise set tensor to None. 
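The routing in `get_model_chunk_id` above assigns microbatches to model chunks round-robin in groups of `pipeline_parallel_size`; a standalone sketch with illustrative sizes (not part of the deleted file):

```python
# Round-robin mapping from microbatch index to model chunk, mirroring
# get_model_chunk_id above (pipeline_parallel_size=4, num_model_chunks=2).
pipeline_parallel_size, num_model_chunks = 4, 2

def get_model_chunk_id(microbatch_id, forward):
    microbatch_id_in_group = microbatch_id % (pipeline_parallel_size * num_model_chunks)
    model_chunk_id = microbatch_id_in_group // pipeline_parallel_size
    if not forward:
        model_chunk_id = num_model_chunks - model_chunk_id - 1
    return model_chunk_id

print([get_model_chunk_id(k, forward=True) for k in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1] -- four microbatches per chunk, then the next chunk
print([get_model_chunk_id(k, forward=False) for k in range(8)])
# [1, 1, 1, 1, 0, 0, 0, 0] -- backward visits chunks in reverse order
```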
- forward_model_chunk_id = get_model_chunk_id(forward_k, forward=True) - gpc.set_virtual_pipeline_parallel_rank(forward_model_chunk_id) - if gpc.is_pipeline_last_stage(): - output_tensor = None - - backward_model_chunk_id = get_model_chunk_id(backward_k, forward=False) - gpc.set_virtual_pipeline_parallel_rank(backward_model_chunk_id) - if gpc.is_pipeline_first_stage(): - input_tensor_grad = None - - # Determine if peers are sending, and where in data structure to put - # received tensors. - recv_prev = True - if gpc.is_pipeline_first_stage(ignore_virtual=True): - # First stage is ahead of last stage by (pipeline_parallel_size - 1). - next_forward_model_chunk_id = get_model_chunk_id( - forward_k - (pipeline_parallel_size - 1), forward=True) - if next_forward_model_chunk_id == (num_model_chunks - 1): - recv_prev = False - next_forward_model_chunk_id += 1 - else: - next_forward_model_chunk_id = get_model_chunk_id(forward_k + 1, - forward=True) - - recv_next = True - if gpc.is_pipeline_last_stage(ignore_virtual=True): - # Last stage is ahead of first stage by (pipeline_parallel_size - 1). - next_backward_model_chunk_id = get_model_chunk_id( - backward_k - (pipeline_parallel_size - 1), forward=False) - if next_backward_model_chunk_id == 0: - recv_next = False - next_backward_model_chunk_id -= 1 - else: - next_backward_model_chunk_id = get_model_chunk_id(backward_k + 1, - forward=False) - - # If last iteration, don't receive; we already received one extra - # before the start of the for loop. - if k == (num_microbatches_remaining - 1): - recv_prev = False - - input_shape = input_tensor_shapes[next_forward_model_chunk_id] if recv_prev else None - output_shape = output_tensor_shapes[next_backward_model_chunk_id] if recv_next else None - # Communicate tensors. - input_tensor, output_tensor_grad = \ - comm.send_forward_backward_recv_forward_backward( - output_tensor, input_tensor_grad, - input_shape, - output_shape, - recv_prev=recv_prev, recv_next=recv_next, - dtype=self.dtype, - scatter_gather_tensors=self.scatter_gather_tensors) - - # Put input_tensor and output_tensor_grad in data structures in the - # right location. - if recv_prev: - input_tensors[next_forward_model_chunk_id].append(input_tensor) - if recv_next: - output_tensor_grads[next_backward_model_chunk_id].append( - output_tensor_grad) - - # Run cooldown backward passes (flush out pipeline). 
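For orientation, the phase lengths implied by the interleaved warmup formula above, computed for illustrative sizes (4 stages, 2 model chunks, 8 microbatches per chunk; not part of the deleted file):

```python
# Interleaved-1F1B phase lengths, following the warmup formula above.
pipeline_parallel_size, num_model_chunks = 4, 2
num_microbatches = 8 * num_model_chunks  # 16 microbatch slots in total

for rank in range(pipeline_parallel_size):
    warmup = (pipeline_parallel_size - rank - 1) * 2 \
        + (num_model_chunks - 1) * pipeline_parallel_size
    warmup = min(warmup, num_microbatches)
    steady = num_microbatches - warmup
    print(f"rank {rank}: warmup={warmup}, steady={steady}, cooldown={warmup}")
# rank 0: warmup=10, steady=6, cooldown=10 ... rank 3: warmup=4, steady=12, cooldown=4
```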
-        if not forward_only:
-            if all_warmup_microbatches:
-                output_tensor_grads[num_model_chunks-1].append(
-                    comm.recv_backward(output_tensor_shapes[num_model_chunks-1],
-                                       scatter_gather_tensors=self.scatter_gather_tensors))
-            for k in range(num_microbatches_remaining, num_microbatches):
-                input_tensor_grad = backward_step_helper(k)
-                next_backward_model_chunk_id = get_model_chunk_id(k+1, forward=False)
-                recv_next = True
-                if gpc.is_pipeline_last_stage(ignore_virtual=True):
-                    if next_backward_model_chunk_id == (num_model_chunks - 1):
-                        recv_next = False
-                if k == (num_microbatches - 1):
-                    recv_next = False
-                output_shape = output_tensor_shapes[next_backward_model_chunk_id] if recv_next else None
-                output_tensor_grads[next_backward_model_chunk_id].append(
-                    comm.send_backward_recv_backward(
-                        input_tensor_grad,
-                        output_shape,
-                        recv_next=recv_next,
-                        dtype=self.dtype,
-                        scatter_gather_tensors=self.scatter_gather_tensors))
-
-        if len(return_tensors) > 0:
-            output, label = pack_return_tensors(return_tensors)
-            return output, label, accum_loss
-        else:
-            return None, None, accum_loss
diff --git a/colossalai/global_variables.py b/colossalai/global_variables.py
deleted file mode 100644
index 04f6e891e8581a638e0ed1dba4a65f87c4af77a8..0000000000000000000000000000000000000000
--- a/colossalai/global_variables.py
+++ /dev/null
@@ -1,86 +0,0 @@
-from typing import Optional
-
-
-class TensorParallelEnv(object):
-
-    _instance = None
-
-    def __new__(cls, *args, **kwargs):
-        if cls._instance is None:
-            cls._instance = object.__new__(cls, *args, **kwargs)
-        return cls._instance
-
-    def __init__(self, *args, **kwargs):
-        self.load(*args, **kwargs)
-
-    def load(self,
-             mode: Optional[str] = None,
-             vocab_parallel: bool = False,
-             parallel_input_1d: bool = False,
-             summa_dim: int = None,
-             tesseract_dim: int = None,
-             tesseract_dep: int = None,
-             depth_3d: int = None,
-             input_group_3d=None,
-             weight_group_3d=None,
-             output_group_3d=None):
-        self.mode = mode
-        self.vocab_parallel = vocab_parallel
-        self.parallel_input_1d = parallel_input_1d
-        self.summa_dim = summa_dim
-        self.tesseract_dim = tesseract_dim
-        self.tesseract_dep = tesseract_dep
-        self.depth_3d = depth_3d
-        self.input_group_3d = input_group_3d
-        self.weight_group_3d = weight_group_3d
-        self.output_group_3d = output_group_3d
-
-    def save(self):
-        return dict(mode=self.mode,
-                    vocab_parallel=self.vocab_parallel,
-                    parallel_input_1d=self.parallel_input_1d,
-                    summa_dim=self.summa_dim,
-                    tesseract_dim=self.tesseract_dim,
-                    tesseract_dep=self.tesseract_dep,
-                    depth_3d=self.depth_3d,
-                    input_group_3d=self.input_group_3d,
-                    weight_group_3d=self.weight_group_3d,
-                    output_group_3d=self.output_group_3d)
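`TensorParallelEnv` above relies on a classic singleton `__new__`: every construction returns the same instance, so settings loaded once are visible everywhere. A simplified standalone sketch of the pattern (not the original class):

```python
class Env:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the instance once; hand back the same object afterwards.
        if cls._instance is None:
            cls._instance = object.__new__(cls)
        return cls._instance

    def __init__(self, mode=None):
        if mode is not None:  # simplified guard; the original reloads every field
            self.mode = mode

a = Env(mode='1d')
b = Env()
print(a is b, b.mode)  # True 1d
```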
- """ - - def __init__(self): - self.data_parallel_size = None - self.model_parallel_size = None - self.aux_loss = None - - def setup(self, moe_model_size): - from .core import global_context as gpc - if gpc.tensor_parallel_size > 1 or gpc.pipeline_parallel_size > 1: - raise NotImplementedError("Moe is not compatible with tensor or pipeline parallel") - - assert gpc.data_parallel_size % moe_model_size == 0, \ - "The size of data parallel needs to be divided by moe model parallel size" - - self.data_parallel_size = gpc.data_parallel_size // moe_model_size - self.model_parallel_size = moe_model_size - - def is_initialized(self): - return self.model_parallel_size is not None - - def reset_loss(self): - self.aux_loss = 0 - - def add_loss(self, loss): - self.aux_loss += loss - - def get_loss(self): - return self.aux_loss - - -tensor_parallel_env = TensorParallelEnv() - -moe_env = MoeEnv() diff --git a/colossalai/initialize.py b/colossalai/initialize.py deleted file mode 100644 index 9329dc0521fef2487a4f2efe0c6f23b779a77e2f..0000000000000000000000000000000000000000 --- a/colossalai/initialize.py +++ /dev/null @@ -1,417 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import argparse -import pprint -import os -from colossalai.nn.optimizer.colossalai_optimizer import ColossalaiOptimizer -import torch -import torch.nn as nn - -from pathlib import Path -from typing import Iterable, Union, Optional, Tuple, List, Dict - -from colossalai.amp import convert_to_amp, AMP_TYPE -from colossalai.context import Config, ParallelMode, ConfigException -from colossalai.core import global_context as gpc -from colossalai.engine import Engine -from colossalai.logging import get_dist_logger -from colossalai.utils import (accumulate_gradient, get_current_device, - sync_model_param, is_using_ddp, is_using_pp, is_using_sequence) -from colossalai.zero import convert_to_zero, ZeroRedundancyOptimizer_Level_2, ZeroRedundancyOptimizer_Level_3 -from colossalai.builder.builder import build_gradient_handler -from torch.optim.optimizer import Optimizer -from torch.optim.lr_scheduler import _LRScheduler -from torch.utils.data import DataLoader -from torch.nn.modules.loss import _Loss -from torch.nn.parallel import DistributedDataParallel as DDP -from colossalai.global_variables import moe_env - - -def get_default_parser(): - """Reads user command line and uses an argument parser to parse the input arguments. - Input arguments include configuration, host, port, world size, local rank, backend for torch.distributed. 
-
-
-def launch(config: Union[str, Path, Config, Dict],
-           rank: int,
-           world_size: int,
-           host: str,
-           port: int,
-           backend: str = 'nccl',
-           local_rank: int = None,
-           seed: int = 1024,
-           verbose: bool = True):
-    """This function first parses the configuration arguments, using :func:`parse_args()` in case one of the input
-    arguments is not given. It then initializes and sets the distributed environment by calling global_context's functions.
-
-    :param config: Config file or config file path are both acceptable
-    :type config: Union[str, dict, Config]
-    :param rank: Rank for the default process group
-    :type rank: int
-    :param world_size: World size of the default process group
-    :type world_size: int
-    :param host: The master address for distributed training
-    :type host: str
-    :param port: The master port for distributed training
-    :type port: int
-    :param backend: Backend for torch.distributed
-    :type backend: str, optional
-    :param local_rank: Rank of the process on the node, used to set the default CUDA device; defaults to None.
- If local_rank = None, the default device ordinal will be calculated automatically - :type local_rank: int, optional - :param seed: Specified random seed for every processes - :type seed: int, optional - :param verbose: Whether to print logs - :type verbose: bool, optional - :raises Exception: Raise exception when config type is wrong - """ - gpc.verbose = verbose - - # set config - assert isinstance(config, (Config, str, Path, dict)), \ - f'expected argument config to be Config, str or Path, but got {type(config)}' - if not isinstance(config, Config) and isinstance(config, dict): - config = Config(config) - if isinstance(config, (str, Path)): - config = Config.from_file(config) - gpc.load_config(config) - - # init default process group - gpc.init_global_dist(rank, world_size, backend, host, port) - - # init process groups for different parallel modes from config - gpc.init_parallel_groups() - - # set cuda device - if torch.cuda.is_available(): - # if local rank is not given, calculate automatically - gpc.set_device(local_rank) - - gpc.set_seed(seed) - - if verbose: - logger = get_dist_logger() - logger.info(f'Distributed environment is initialized, ' - f'data parallel size: {gpc.data_parallel_size}, pipeline parallel size: {gpc.pipeline_parallel_size}, ' - f'tensor parallel size: {gpc.tensor_parallel_size}', ranks=[0]) - - -def launch_from_slurm(config: Union[str, Path, Config, Dict], - host: str, - port: int, - backend: str = 'nccl', - seed: int = 1024, - verbose: bool = True): - """A wrapper for colossalai.launch for SLURM launcher by reading rank and world size from the environment variables - set by SLURM - - :param config: Config file or config file path are both acceptable - :type config: Union[str, dict, Config] - :param host: The master address for distributed training - :type host: str - :param port: The master port for distributed training - :type port: str - :param backend: Backend for torch.distributed - :type backend: str, optional - :param seed: Specified random seed for every processes - :type seed: int, optional - :param verbose: Whether to print logs - :type verbose: bool, optional - """ - rank = int(os.environ['SLURM_PROCID']) - world_size = int(os.environ['SLURM_NPROCS']) - launch(config=config, - rank=rank, - world_size=world_size, - host=host, - port=port, - backend=backend, - seed=seed, - verbose=verbose) - - -def launch_from_openmpi(config: Union[str, Path, Config, Dict], - host: str, - port: int, - backend: str = 'nccl', - seed: int = 1024, - verbose: bool = True): - """A wrapper for colossalai.launch for OpenMPI launcher by reading rank and world size from the environment variables - set by OpenMPI - - :param config: Config file or config file path are both acceptable - :type config: Union[str, dict, Config] - :param host: The master address for distributed training - :type host: str - :param port: The master port for distributed training - :type port: str - :param backend: Backend for torch.distributed - :type backend: str, optional - :param seed: Specified random seed for every processes - :type seed: int, optional - :param verbose: Whether to print logs - :type verbose: bool, optional - """ - rank = int(os.environ['OMPI_COMM_WORLD_RANK']) - local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK']) - world_size = int(os.environ['OMPI_COMM_WORLD_SIZE']) - launch(config=config, - local_rank=local_rank, - rank=rank, - world_size=world_size, - host=host, - port=port, - backend=backend, - seed=seed, - verbose=verbose) - - -def launch_from_torch(config: Union[str, 
Path, Config, Dict], - backend: str = 'nccl', - seed: int = 1024, - verbose: bool = True): - """A wrapper for colossalai.launch for torchrun or torch.distributed.launch by reading rank and world size - from the environment variables set by PyTorch - - :param config: Config file or config file path are both acceptable - :type config: Union[str, dict, Config] - :param backend: Backend for torch.distributed - :type backend: str, optional - :param seed: Specified random seed for every processes - :type seed: int, optional - :param verbose: Whether to print logs - :type verbose: bool, optional - """ - rank = int(os.environ['RANK']) - local_rank = int(os.environ['LOCAL_RANK']) - world_size = int(os.environ['WORLD_SIZE']) - host = os.environ['MASTER_ADDR'] - port = int(os.environ['MASTER_PORT']) - launch(config=config, - local_rank=local_rank, - rank=rank, - world_size=world_size, - host=host, - port=port, - backend=backend, - seed=seed, - verbose=verbose) - - -def initialize(model: Union[nn.Module, List[nn.Module]], - optimizer: Union[Optimizer, List[Optimizer]], - criterion: Union[_Loss, List[_Loss]], - train_dataloader: Optional[Union[Iterable, List[Iterable]]] = None, - test_dataloader: Optional[Union[Iterable, List[Iterable]]] = None, - lr_scheduler: _LRScheduler = None, - verbose: bool = True - ) -> Tuple[Engine, DataLoader, DataLoader, _LRScheduler]: - """Core function to wrap the essential training components with our functionality based on the config which is - loaded into gpc.config. - - :param model: Your model instance - :type model: :class:`torch.nn.Module` - :param optimizer: Your optimizer instance - :type optimizer: :class:`torch.optim.optimizer.Optimizer` - :param criterion: Your criterion instance - :type criterion: :class:`torch.nn.modules.loss._Loss` - :param train_dataloader: Dataloader for training - :type train_dataloader: :class:`torch.utils.data.DataLoader`, optional - :param test_dataloader: Dataloader for testing - :type test_dataloader: :class:`torch.utils.data.DataLoader`, optional - :param lr_scheduler: Your lr scheduler instance - :type lr_scheduler: :class:`torch.nn.lr_scheduler._LRScheduler`, optional - :param verbose: Whether to print logs - :type verbose: bool, optional - :return: (engine, train_dataloader, test_dataloader, lr_scheduler) - :rtype: Tuple - """ - # get logger - logger = get_dist_logger() - gpc.verbose = verbose - - # get config from gpc - config = gpc.config - - # print config - if verbose: - logger.info(f"\n========== Your Config ========\n" - f"{pprint.pformat(gpc.config)}\n" - f"================================\n", ranks=[0]) - - # cudnn - cudnn_benchmark = config.get('cudnn_benchmark', True) - cudnn_deterministic = config.get('cudnn_deterministic', False) - torch.backends.cudnn.benchmark = cudnn_benchmark - torch.backends.cudnn.deterministic = cudnn_deterministic - if verbose: - logger.info( - f"cuDNN benchmark = {cudnn_benchmark}, deterministic = {cudnn_deterministic}", ranks=[0]) - - # first sync model across dp ranks - model.to(get_current_device()) - use_zero3 = hasattr(gpc.config, 'zero') and gpc.config.zero.level == 3 - if not moe_env.is_initialized() and not use_zero3: - if is_using_sequence(): - sync_model_param(model, ParallelMode.SEQUENCE_DP) - elif is_using_ddp(): - sync_model_param(model, ParallelMode.DATA) - else: - logger.warning( - "The parameters of models is not automatically synchronized.\n" - "Please make sure that all parameters are the same in data parallel group.", - ranks=[0]) - - # check amp and zero - fp16_cfg = 
gpc.config.get('fp16', None) - zero_cfg = gpc.config.get('zero', None) - - if fp16_cfg is not None and fp16_cfg.mode is not None and zero_cfg is not None: - raise ConfigException( - "It is not allowed to set fp16 and zero configuration in your config file at the same time") - - # clip grad norm - clip_grad_norm = gpc.config.get('clip_grad_norm', 0.0) - if clip_grad_norm > 0: - if zero_cfg is not None: - raise ConfigException( - "clip_grad_norm should be specified with zero, you should specify clip_grad in zero configuration") - - # initialize amp - amp_mode = None - if fp16_cfg is not None and fp16_cfg.mode is not None: - cfg_ = fp16_cfg.copy() - amp_mode = cfg_.pop('mode') - if is_using_pp(): - assert amp_mode == AMP_TYPE.NAIVE, 'Pipeline only support NaiveAMP currently' - if amp_mode == AMP_TYPE.NAIVE: - cfg_['clip_grad'] = clip_grad_norm - model, optimizer, criterion = convert_to_amp(model=model, - optimizer=optimizer, - criterion=criterion, - mode=amp_mode, - amp_config=cfg_) - - if zero_cfg is not None: - cfg_ = zero_cfg.copy() - level = cfg_.pop('level') - model, optimizer = convert_to_zero(model=model, - optimizer=optimizer, - level=level, - zero_config=cfg_ - ) - - # gradient handler - gradient_handler_cfg = gpc.config.get('gradient_handler', None) - if gradient_handler_cfg is None: - # if gradient handler is not specified in the configuration file, - # check in the following order - # 1. if optimizer is ZERO, then use zero grad handler - # 2. if dp size is larger than 1 and pipeline is not used, use pytorch ddp - # 3. if using pipeline and dp size larger than 1, use data parallel grad handler - if isinstance(optimizer, (ZeroRedundancyOptimizer_Level_2, - ZeroRedundancyOptimizer_Level_3)): - gradient_handler_cfg = [dict(type='ZeROGradientHandler')] - if verbose: - logger.info( - "Training with zero is detected, ZeROGradientHandler is automatically " - "added even though not specified in the configuration", - ranks=[0]) - elif is_using_ddp() and moe_env.is_initialized(): - gradient_handler_cfg = [dict(type='MoeGradientHandler')] - if verbose: - logger.info( - "Data parallel training is detected with moe parallel, MoeGradientHandler is automatically " - "added even though not specified in the configuration", - ranks=[0]) - elif is_using_sequence(): - model = DDP(model, process_group=gpc.get_group(ParallelMode.SEQUENCE_DP), device_ids=[torch.cuda.current_device()]) - if verbose: - logger.info( - 'Model is using torch.nn.parallel.DistributedDataParallel for Sequence Parallelism', ranks=[0]) - elif is_using_ddp() and not is_using_pp() and amp_mode != AMP_TYPE.NAIVE: - model = DDP(model, process_group=gpc.get_group(ParallelMode.DATA), device_ids=[torch.cuda.current_device()]) - if verbose: - logger.info( - 'Model is using torch.nn.parallel.DistributedDataParallel for Data Parallelism', ranks=[0]) - elif is_using_ddp(): - gradient_handler_cfg = [dict(type='DataParallelGradientHandler')] - if verbose: - logger.info( - "Data parallel training is detected when using pipeline parallel, DataParallelGradientHandler is automatically " - "added even though not specified in the configuration", - ranks=[0]) - # add pipeline parallel gradient handler, if pipeline shared module is detected - for param in model.parameters(): - if getattr(param, 'pipeline_shared_module_pg', None) is not None: - if gradient_handler_cfg is None: - gradient_handler_cfg = [dict(type='PipelineSharedModuleGradientHandler')] - else: - gradient_handler_cfg.append(dict(type='PipelineSharedModuleGradientHandler')) - if verbose: 
- logger.info( - "pipeline_shared_module is detected, PipelineSharedModuleGradientHandler is automatically " - "added even though not specified in the configuration", - ranks=[0]) - break - else: - if not isinstance(gradient_handler_cfg, list): - raise ConfigException( - f"expected gradient_handler in the configuration file to be a list but got {type(gradient_handler_cfg)}") - - if gradient_handler_cfg is None: - gradient_handlers = None - if verbose and not isinstance(model, DDP): - logger.warning( - "No PyTorch DDP or gradient handler is set up, please make sure you do not need " - "to all-reduce the gradients after a training step.", - ranks=[0]) - else: - gradient_handlers = [build_gradient_handler(cfg, model, optimizer) for cfg in gradient_handler_cfg] - - # check if optimizer is ColossalaiOptimizer - if not isinstance(optimizer, (ColossalaiOptimizer, ZeroRedundancyOptimizer_Level_2, ZeroRedundancyOptimizer_Level_3)): - optimizer = ColossalaiOptimizer(optim=optimizer) - - # gradient accumulation - grad_accum_size = gpc.config.get('gradient_accumulation', None) - if grad_accum_size is not None: - optimizer, train_dataloader, gradient_handlers, lr_scheduler = accumulate_gradient(model=model, - optimizer=optimizer, - dataloader=train_dataloader, - accumulate_size=grad_accum_size, - gradient_handlers=gradient_handlers, - lr_scheduler=lr_scheduler) - - engine = Engine( - model=model, - optimizer=optimizer, - criterion=criterion, - gradient_handlers=gradient_handlers, - clip_grad_norm=clip_grad_norm - ) - - return engine, train_dataloader, test_dataloader, lr_scheduler diff --git a/colossalai/kernel/__init__.py b/colossalai/kernel/__init__.py deleted file mode 100644 index d3d0be02bc4e7ded05fb541eaf718374b1d421d8..0000000000000000000000000000000000000000 --- a/colossalai/kernel/__init__.py +++ /dev/null @@ -1,5 +0,0 @@ -from .cuda_native import LayerNorm, FusedScaleMaskSoftmax, MultiHeadAttention - -__all__ = [ - "LayerNorm", "FusedScaleMaskSoftmax", "MultiHeadAttention" -] diff --git a/colossalai/kernel/__pycache__/__init__.cpython-36.pyc b/colossalai/kernel/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 9ab45eb477bda4af2c9985f1763b949037a7c6f3..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/kernel/__pycache__/__init__.cpython-37.pyc b/colossalai/kernel/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index afc3027d75bc62b0ce014b3c81f59d9507345f95..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__init__.py b/colossalai/kernel/cuda_native/__init__.py deleted file mode 100644 index a35158b72c7c9c6c39f3c067e51c7dd520c067b7..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from .layer_norm import MixedFusedLayerNorm as LayerNorm -from .scaled_softmax import FusedScaleMaskSoftmax -from .multihead_attention import MultiHeadAttention diff --git a/colossalai/kernel/cuda_native/__pycache__/__init__.cpython-36.pyc b/colossalai/kernel/cuda_native/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index a67eb4583ecc7aba37326ff6ef77cb4e74233b44..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git 
a/colossalai/kernel/cuda_native/__pycache__/__init__.cpython-37.pyc b/colossalai/kernel/cuda_native/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 7b2a735caba73c5dc8075106a0b9002c63b67003..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__pycache__/layer_norm.cpython-36.pyc b/colossalai/kernel/cuda_native/__pycache__/layer_norm.cpython-36.pyc deleted file mode 100644 index a31c71b10289e4d2044b2a035fe35caccf81bba7..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/layer_norm.cpython-36.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__pycache__/layer_norm.cpython-37.pyc b/colossalai/kernel/cuda_native/__pycache__/layer_norm.cpython-37.pyc deleted file mode 100644 index 101395a89bb119479afa973c1bec5a510198012c..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/layer_norm.cpython-37.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__pycache__/multihead_attention.cpython-36.pyc b/colossalai/kernel/cuda_native/__pycache__/multihead_attention.cpython-36.pyc deleted file mode 100644 index 19b0956dca2441c7ced07e252a77b1d30a92441a..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/multihead_attention.cpython-36.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__pycache__/multihead_attention.cpython-37.pyc b/colossalai/kernel/cuda_native/__pycache__/multihead_attention.cpython-37.pyc deleted file mode 100644 index 9d285d320c75c1d26cb9659a990d4dd61954c5c5..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/multihead_attention.cpython-37.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__pycache__/scaled_softmax.cpython-36.pyc b/colossalai/kernel/cuda_native/__pycache__/scaled_softmax.cpython-36.pyc deleted file mode 100644 index e1168746606d857dfbafd67806fc3ff889e7f0ec..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/scaled_softmax.cpython-36.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/__pycache__/scaled_softmax.cpython-37.pyc b/colossalai/kernel/cuda_native/__pycache__/scaled_softmax.cpython-37.pyc deleted file mode 100644 index 914ba266f17ffb2bee5d9fb8fb0b6eca8f28404f..0000000000000000000000000000000000000000 Binary files a/colossalai/kernel/cuda_native/__pycache__/scaled_softmax.cpython-37.pyc and /dev/null differ diff --git a/colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp b/colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp deleted file mode 100644 index 735caf54e9ce2e243dc3123c85369ba2b00a1cb2..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp +++ /dev/null @@ -1,71 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_adam.cu -#include - -void multi_tensor_scale_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - float scale); - -void multi_tensor_sgd_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - float wd, - float momentum, - float dampening, - float lr, - bool nesterov, - bool first_run, - bool wd_after_momentum, - float scale); - -void multi_tensor_adam_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - const 
float lr, - const float beta1, - const float beta2, - const float epsilon, - const int step, - const int mode, - const int bias_correction, - const float weight_decay); - -void multi_tensor_lamb_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - const float lr, - const float beta1, - const float beta2, - const float epsilon, - const int step, - const int bias_correction, - const float weight_decay, - const int grad_averaging, - const int mode, - at::Tensor global_grad_norm, - const float max_grad_norm, - at::optional use_nvlamb_python); - -std::tuple multi_tensor_l2norm_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::optional per_tensor_python); - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) -{ - m.def("multi_tensor_scale", &multi_tensor_scale_cuda, - "Fused overflow check + scale for a list of contiguous tensors"); - m.def("multi_tensor_sgd", &multi_tensor_sgd_cuda, - "Fused SGD optimizer for list of contiguous tensors"); - m.def("multi_tensor_adam", &multi_tensor_adam_cuda, - "Compute and apply gradient update to parameters for Adam optimizer"); - m.def("multi_tensor_lamb", &multi_tensor_lamb_cuda, - "Computes and apply update for LAMB optimizer"); - m.def("multi_tensor_l2norm", &multi_tensor_l2norm_cuda, - "Computes L2 norm for a list of contiguous tensors"); -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/compat.h b/colossalai/kernel/cuda_native/csrc/compat.h deleted file mode 100644 index 00066dc95475296168c799904dc595ed435d2b0a..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/compat.h +++ /dev/null @@ -1,10 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/compat.h -#ifndef TORCH_CHECK -#define TORCH_CHECK AT_CHECK -#endif - -#ifdef VERSION_GE_1_3 -#define DATA_PTR data_ptr -#else -#define DATA_PTR data -#endif \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu b/colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu deleted file mode 100644 index 58d26235a9cc6954e9822119f215b9745b0a1684..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu +++ /dev/null @@ -1,191 +0,0 @@ -#include "block_reduce.h" -#include "cuda_util.h" -#include "kernels.h" -#include "ls_cub.cuh" - -ls::cub::CachingDeviceAllocator g_allocator(true); - -template -__global__ void ls_cross_entropy_fw_kernel( - const T *__restrict__ inputs, const int *__restrict__ targets, - float *__restrict__ outputs, float *__restrict__ nll_loss_outputs, - const int padding_idx, const float epsilon, const int vocab_size) { - /* step1: compute each thread's max_logit and sum_exp_logit, store in - * max_input, sum_exp_logit */ - const int block_start = blockIdx.x * vocab_size; - const int left_idx = block_start + threadIdx.x; - const int right_idx = (blockIdx.x + 1) * vocab_size; - float max_input[1] = {REDUCE_FLOAT_INF_NEG}; - float sum_logits[2] = {0.f, 0.f}; // logit and logit exp - int target_tid = targets[blockIdx.x]; - - if (target_tid == padding_idx) { - if (threadIdx.x == 0) { - nll_loss_outputs[blockIdx.x] = 0.f; - outputs[blockIdx.x] = 0.f; - } - return; - } - - for (int i = left_idx; i < right_idx; i += blockDim.x) { - max_input[0] = fmaxf(max_input[0], static_cast(inputs[i])); - } - blockReduce(max_input); - __shared__ float s_max_input; - if (threadIdx.x == 0) { - s_max_input = max_input[0]; - } - __syncthreads(); - - for (int i = left_idx; i < right_idx; i += 
blockDim.x) { - float logit = static_cast(inputs[i]) - s_max_input; - sum_logits[0] += logit; - sum_logits[1] += expf(logit); - } - - blockReduce(sum_logits); - __shared__ float s_sum_logit; - __shared__ float s_sum_exp; - if (threadIdx.x == 0) { - s_sum_logit = sum_logits[0]; - s_sum_exp = sum_logits[1]; - } - __syncthreads(); - - float eps_i = epsilon / (vocab_size - 1); - if (threadIdx.x == 0) { - // neg_log_prob = log(sum(exp(x - x_max))) - (x - x_max) - float nll_loss = logf(s_sum_exp) - - static_cast(inputs[block_start + target_tid]) + - s_max_input; - nll_loss_outputs[blockIdx.x] = nll_loss; - float sum_nll_loss = vocab_size * logf(s_sum_exp) - s_sum_logit; - outputs[blockIdx.x] = - (1.f - epsilon - eps_i) * nll_loss + eps_i * sum_nll_loss; - } -} - -template -__global__ void ls_cross_entropy_bw_kernel( - const float *__restrict__ grad_outputs, const T *__restrict__ inputs, - const int *__restrict__ targets, T *__restrict__ grad_inputs, - const int padding_idx, const float epsilon, const int vocab_size) { - /* step1: compute each thread's max_logit and sum_exp_logit, store in - * max_input, sum_exp_logit */ - const int block_start = blockIdx.x * vocab_size; - const int left_idx = block_start + threadIdx.x; - const int right_idx = (blockIdx.x + 1) * vocab_size; - float max_input[1] = {REDUCE_FLOAT_INF_NEG}; - float sum_logits[1] = {0.f}; - const float grad_out = static_cast(grad_outputs[0]); - int target_tid = targets[blockIdx.x]; - - if (target_tid == padding_idx) { - for (int i = left_idx; i < right_idx; i += blockDim.x) { - grad_inputs[i] = 0.f; - } - return; - } - - for (int i = left_idx; i < right_idx; i += blockDim.x) { - max_input[0] = fmaxf(max_input[0], static_cast(inputs[i])); - } - blockReduce(max_input); - __shared__ float s_max_input; - if (threadIdx.x == 0) { - s_max_input = max_input[0]; - } - __syncthreads(); - - for (int i = left_idx; i < right_idx; i += blockDim.x) { - float logit = static_cast(inputs[i]) - s_max_input; - sum_logits[0] += expf(logit); - } - - blockReduce(sum_logits); - __shared__ float s_sum_exp; - if (threadIdx.x == 0) { - s_sum_exp = sum_logits[0]; - } - __syncthreads(); - - float eps_i = epsilon / (vocab_size - 1); - float nll_weight = 1.0 - epsilon - eps_i; - - for (int i = left_idx; i < right_idx; i += blockDim.x) { - float prob = expf(static_cast(inputs[i]) - s_max_input) / s_sum_exp; - float grad = 0; - grad += (vocab_size * prob - 1) * eps_i; - grad += prob * nll_weight; - if ((i - block_start) == target_tid) { - grad -= nll_weight; - } - grad_inputs[i] = grad_out * grad; - } -} - -template -void launch_cross_entropy_fw(const T *inputs_ptr, const int *targets_ptr, - float *outputs_ptr, float *nll_loss_ptr, - float *loss_buffer, const int padding_idx, - const float epsilon, const int batch_size, - const int seq_len, const int vocab_size, - cudaStream_t stream) { - int grid_dim = batch_size * seq_len; - float *nll_loss_buffer = loss_buffer + grid_dim; - ls_cross_entropy_fw_kernel<<>>( - inputs_ptr, targets_ptr, loss_buffer, nll_loss_buffer, padding_idx, - epsilon, vocab_size); - - int num_items = grid_dim; - void *d_temp_storage = NULL; - size_t temp_storage_bytes = 0; - CHECK_GPU_ERROR(ls::cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, - loss_buffer, outputs_ptr, - num_items, stream)); - CHECK_GPU_ERROR( - g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes)); - CHECK_GPU_ERROR(ls::cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, - loss_buffer, outputs_ptr, - num_items, stream)); - 
CHECK_GPU_ERROR(ls::cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, - nll_loss_buffer, nll_loss_ptr, - num_items, stream)); - CHECK_GPU_ERROR(g_allocator.DeviceFree(d_temp_storage)); -} - -template void launch_cross_entropy_fw( - const float *inputs_ptr, const int *targets_ptr, float *outputs_ptr, - float *nll_loss_ptr, float *loss_buffer, const int padding_idx, - const float epsilon, const int batch_size, const int seq_len, - const int vocab_size, cudaStream_t stream); - -template void launch_cross_entropy_fw<__half>( - const __half *inputs_ptr, const int *targets_ptr, float *outputs_ptr, - float *nll_loss_ptr, float *loss_buffer, const int padding_idx, - const float epsilon, const int batch_size, const int seq_len, - const int vocab_size, cudaStream_t stream); - -template -void launch_cross_entropy_bw(const float *grad_outputs_ptr, const T *inputs_ptr, - const int *targets_ptr, T *grad_inputs_ptr, - const int padding_idx, const float epsilon, - const int batch_size, const int seq_len, - const int vocab_size, cudaStream_t stream) { - int grid_dim = batch_size * seq_len; - ls_cross_entropy_bw_kernel<<>>( - grad_outputs_ptr, inputs_ptr, targets_ptr, grad_inputs_ptr, padding_idx, - epsilon, vocab_size); -} - -template void launch_cross_entropy_bw( - const float *grad_outputs_ptr, const float *inputs_ptr, - const int *targets_ptr, float *grad_inputs_ptr, const int padding_idx, - const float epsilon, const int batch_size, const int seq_len, - const int vocab_size, cudaStream_t stream); - -template void launch_cross_entropy_bw<__half>( - const float *grad_outputs_ptr, const __half *inputs_ptr, - const int *targets_ptr, __half *grad_inputs_ptr, const int padding_idx, - const float epsilon, const int batch_size, const int seq_len, - const int vocab_size, cudaStream_t stream); diff --git a/colossalai/kernel/cuda_native/csrc/kernels/cublas_wrappers.cu b/colossalai/kernel/cuda_native/csrc/kernels/cublas_wrappers.cu deleted file mode 100644 index 6c49280ff2734a2ec08a4f628424320a62b3a7e7..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/cublas_wrappers.cu +++ /dev/null @@ -1,171 +0,0 @@ -/* Copyright 2021 The LightSeq Team - Copyright Microsoft DeepSpeed - This file is adapted from Microsoft DeepSpeed -*/ -#include "cublas_wrappers.h" - -#ifdef COLOSSAL_HIP -int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, rocblas_gemm_algo algo) { - cublasStatus_t status = - rocblas_gemm_ex(handle, transa, transb, m, n, k, (const void *)alpha, - (const void *)A, rocblas_datatype_f32_r, (transa == rocblas_operation_none) ? m : k, - (const void *)B, rocblas_datatype_f32_r, (transb == rocblas_operation_none) ? k : n, - (const void *)beta, C, rocblas_datatype_f32_r, m, C, rocblas_datatype_f32_r, m, rocblas_datatype_f32_r, algo, 0, 0); - - if (status != CUBLAS_STATUS_SUCCESS) { - fprintf(stderr, - "!!!! kernel execution error. 
-            "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n",
-            m, n, k, (int)status);
-    return EXIT_FAILURE;
-  }
-  return 0;
-}
-
-int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa,
-                   cublasOperation_t transb, int m, int n, int k,
-                   const float *alpha, const float *beta, const __half *A,
-                   const __half *B, __half *C, rocblas_gemm_algo algo) {
-  cublasStatus_t status = rocblas_gemm_ex(
-      handle, transa, transb, m, n, k, (const void *)alpha, (const void *)A,
-      rocblas_datatype_f16_r, (transa == rocblas_operation_none) ? m : k,
-      (const void *)B, rocblas_datatype_f16_r,
-      (transb == rocblas_operation_none) ? k : n, (const void *)beta,
-      (void *)C, rocblas_datatype_f16_r, m, (void *)C, rocblas_datatype_f16_r,
-      m, rocblas_datatype_f32_r, algo, 0, 0);
-
-  if (status != CUBLAS_STATUS_SUCCESS) {
-    fprintf(stderr,
-            "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n",
-            m, n, k, (int)status);
-    return EXIT_FAILURE;
-  }
-  return 0;
-}
-
-int cublas_strided_batched_gemm(cublasHandle_t handle, int m, int n, int k,
-                                const float *alpha, const float *beta,
-                                const float *A, const float *B, float *C,
-                                cublasOperation_t op_A, cublasOperation_t op_B,
-                                int stride_A, int stride_B, int stride_C,
-                                int batch, rocblas_gemm_algo algo) {
-  // Note: the D-matrix descriptor must be f32 here; it aliases C, which is
-  // float. The previous f16 descriptor was a copy-paste slip from the
-  // __half overload below.
-  cublasStatus_t status = rocblas_gemm_strided_batched_ex(
-      handle, op_A, op_B, m, n, k, alpha, A, rocblas_datatype_f32_r,
-      (op_A == rocblas_operation_none) ? m : k, stride_A, B,
-      rocblas_datatype_f32_r, (op_B == rocblas_operation_none) ? k : n,
-      stride_B, beta, C, rocblas_datatype_f32_r, m, stride_C, C,
-      rocblas_datatype_f32_r, m, stride_C, batch, rocblas_datatype_f32_r, algo,
-      0, 0);
-
-  if (status != CUBLAS_STATUS_SUCCESS) {
-    fprintf(stderr,
-            "!!!! kernel execution error. (batch: %d, m: %d, n: %d, k: %d, "
-            "error: %d) \n",
-            batch, m, n, k, (int)status);
-    return EXIT_FAILURE;
-  }
-  return 0;
-}
-
-int cublas_strided_batched_gemm(cublasHandle_t handle, int m, int n, int k,
-                                const float *alpha, const float *beta,
-                                const __half *A, const __half *B, __half *C,
-                                cublasOperation_t op_A, cublasOperation_t op_B,
-                                int stride_A, int stride_B, int stride_C,
-                                int batch, rocblas_gemm_algo algo) {
-  cublasStatus_t status = rocblas_gemm_strided_batched_ex(
-      handle, op_A, op_B, m, n, k, alpha, A, rocblas_datatype_f16_r,
-      (op_A == rocblas_operation_none) ? m : k, stride_A, B,
-      rocblas_datatype_f16_r, (op_B == rocblas_operation_none) ? k : n,
-      stride_B, beta, C, rocblas_datatype_f16_r, m, stride_C, C,
-      rocblas_datatype_f16_r, m, stride_C, batch, rocblas_datatype_f32_r, algo,
-      0, 0);
-
-  if (status != CUBLAS_STATUS_SUCCESS) {
-    fprintf(stderr,
-            "!!!! kernel execution error.
(m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, cublasGemmAlgo_t algo) { - cublasStatus_t status = cublasGemmEx( - handle, transa, transb, m, n, k, (const void *)alpha, (const void *)A, - CUDA_R_16F, (transa == CUBLAS_OP_N) ? m : k, (const void *)B, CUDA_R_16F, - (transb == CUBLAS_OP_N) ? k : n, (const void *)beta, (void *)C, - CUDA_R_16F, m, CUDA_R_32F, algo); - - if (status != CUBLAS_STATUS_SUCCESS) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_strided_batched_gemm(cublasHandle_t handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - cublasOperation_t op_A, cublasOperation_t op_B, - int stride_A, int stride_B, int stride_C, - int batch, cublasGemmAlgo_t algo) { - cublasStatus_t status = cublasGemmStridedBatchedEx( - handle, op_A, op_B, m, n, k, alpha, A, CUDA_R_32F, - (op_A == CUBLAS_OP_N) ? m : k, stride_A, B, CUDA_R_32F, - (op_B == CUBLAS_OP_N) ? k : n, stride_B, beta, C, CUDA_R_32F, m, stride_C, - batch, CUDA_R_32F, algo); - - if (status != CUBLAS_STATUS_SUCCESS) { - fprintf(stderr, - "!!!! kernel execution error. (batch: %d, m: %d, n: %d, k: %d, " - "error: %d) \n", - batch, m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_strided_batched_gemm(cublasHandle_t handle, int m, int n, int k, - const float *alpha, const float *beta, - const __half *A, const __half *B, __half *C, - cublasOperation_t op_A, cublasOperation_t op_B, - int stride_A, int stride_B, int stride_C, - int batch, cublasGemmAlgo_t algo) { - cublasStatus_t status = cublasGemmStridedBatchedEx( - handle, op_A, op_B, m, n, k, alpha, A, CUDA_R_16F, - (op_A == CUBLAS_OP_N) ? m : k, stride_A, B, CUDA_R_16F, - (op_B == CUBLAS_OP_N) ? k : n, stride_B, beta, C, CUDA_R_16F, m, stride_C, - batch, CUDA_R_32F, algo); - - if (status != CUBLAS_STATUS_SUCCESS) { - fprintf(stderr, - "!!!! kernel execution error. 
(m: %d, n: %d, k: %d, error: %d) \n",
-            m, n, k, (int)status);
-    return EXIT_FAILURE;
-  }
-
-  return 0;
-}
-#endif
diff --git a/colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu b/colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu
deleted file mode 100644
index 890131e9d6c27a892975b001859c0de3fb3092e9..0000000000000000000000000000000000000000
--- a/colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu
+++ /dev/null
@@ -1,174 +0,0 @@
-#include <thrust/execution_policy.h>
-#include <thrust/functional.h>
-#include <thrust/transform_reduce.h>
-
-#ifdef COLOSSAL_HIP
-#include "hip_util.h"
-#else
-#include "cuda_util.h"
-#endif
-
-/* GPU function guard */
-std::string _cudaGetErrorString(cudaError_t error) {
-  return cudaGetErrorString(error);
-}
-
-std::string _cudaGetErrorString(cublasStatus_t error) {
-  switch (error) {
-    case CUBLAS_STATUS_SUCCESS:
-      return "CUBLAS_STATUS_SUCCESS";
-
-    case CUBLAS_STATUS_NOT_INITIALIZED:
-      return "CUBLAS_STATUS_NOT_INITIALIZED";
-
-    case CUBLAS_STATUS_ALLOC_FAILED:
-      return "CUBLAS_STATUS_ALLOC_FAILED";
-
-    case CUBLAS_STATUS_INVALID_VALUE:
-      return "CUBLAS_STATUS_INVALID_VALUE";
-
-    case CUBLAS_STATUS_ARCH_MISMATCH:
-      return "CUBLAS_STATUS_ARCH_MISMATCH";
-#ifndef COLOSSAL_HIP
-    case CUBLAS_STATUS_MAPPING_ERROR:
-      return "CUBLAS_STATUS_MAPPING_ERROR";
-
-    case CUBLAS_STATUS_EXECUTION_FAILED:
-      return "CUBLAS_STATUS_EXECUTION_FAILED";
-
-    case CUBLAS_STATUS_INTERNAL_ERROR:
-      return "CUBLAS_STATUS_INTERNAL_ERROR";
-
-    case CUBLAS_STATUS_NOT_SUPPORTED:
-      return "CUBLAS_STATUS_NOT_SUPPORTED";
-
-    case CUBLAS_STATUS_LICENSE_ERROR:
-      return "CUBLAS_STATUS_LICENSE_ERROR";
-#endif
-  }
-  return "CUBLAS_UNKNOWN";
-}
-
-template <typename T>
-void check_gpu_error(T result, char const *const func, const char *const file,
-                     int const line) {
-  if (result) {
-    throw std::runtime_error(std::string("[CUDA][ERROR] ") + file + "(" +
-                             std::to_string(line) +
-                             "): " + (_cudaGetErrorString(result)) + "\n");
-  }
-}
-
-template void check_gpu_error<cudaError_t>(cudaError_t result,
-                                           char const *const func,
-                                           const char *const file,
-                                           int const line);
-template void check_gpu_error<cublasStatus_t>(cublasStatus_t result,
-                                              char const *const func,
-                                              const char *const file,
-                                              int const line);
-
-template <typename T>
-void print_vec(const T *outv, std::string outn, int num_output_ele) {
-  std::cout << outn << ": ";
-  std::vector<T> hout(num_output_ele, (T)0);
-  cudaMemcpy(hout.data(), outv, num_output_ele * sizeof(T),
-             cudaMemcpyDeviceToHost);
-  for (int i = 0; i < num_output_ele; i++) {
-    std::cout << hout[i] << ", ";
-  }
-  std::cout << std::endl;
-}
-
-template <>
-void print_vec<__half>(const __half *outv, std::string outn,
-                       int num_output_ele) {
-  std::cout << outn << ": ";
-  std::vector<__half> hout(num_output_ele, (__half)0.f);
-  cudaMemcpy(hout.data(), outv, num_output_ele * sizeof(__half),
-             cudaMemcpyDeviceToHost);
-  for (int i = 0; i < num_output_ele; i++) {
-    std::cout << __half2float(hout[i]) << ", ";
-  }
-  std::cout << std::endl;
-}
-
-template void print_vec<float>(const float *outv, std::string outn,
-                               int num_output_ele);
-
-template void print_vec<int>(const int *outv, std::string outn,
-                             int num_output_ele);
-
-template void print_vec<__half>(const __half *outv, std::string outn,
-                                int num_output_ele);
-
-template <typename T>
-T *cuda_malloc(size_t ele_num) {
-  size_t byte_size = ele_num * sizeof(T);
-  T *pdata = nullptr;
-  CHECK_GPU_ERROR(cudaMalloc((void **)&pdata, byte_size));
-  return pdata;
-}
-
-template float *cuda_malloc<float>(size_t ele_num);
-
-template __half *cuda_malloc<__half>(size_t ele_num);
-
-template uint8_t *cuda_malloc<uint8_t>(size_t ele_num);
-
-void cuda_free(void *pdata) {
-  if (pdata != nullptr) {
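-    // cudaFree(nullptr) is defined to be a no-op, but the guard makes the
-    // intent explicit and skips the driver call entirely for empty buffers.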
cudaFree(pdata); - } -} - -template -struct _isnan { - __device__ bool operator()(T a) const { return isnan(a); } -}; - -template <> -struct _isnan<__half> { - __device__ bool operator()(const __half a) const { return __hisnan(a); } -}; - -template -struct _isinf { - __device__ bool operator()(T a) const { return isinf(a); } -}; - -template <> -struct _isinf<__half> { - __device__ bool operator()(const __half a) const { return __hisinf(a); } -}; - -template -void check_nan_inf(const T *data_ptr, int dsize, bool check_nan_inf, - std::string file, int line, cudaStream_t stream) { - // check_nan_inf = 0 for checking nan - // check_nan_inf = 1 for checking inf - bool res = false; - std::string msg = file + "(" + std::to_string(line) + "): "; - if (check_nan_inf) { - msg += "nan."; - res = thrust::transform_reduce(thrust::cuda::par.on(stream), data_ptr, - data_ptr + dsize, _isnan(), false, - thrust::logical_or()); - } else { - msg += "inf."; - res = thrust::transform_reduce(thrust::cuda::par.on(stream), data_ptr, - data_ptr + dsize, _isinf(), false, - thrust::logical_or()); - } - if (res) { - throw std::runtime_error(msg); - } - std::cout << msg << " [check pass]." << std::endl; -} - -template void check_nan_inf(const float *data_ptr, int dsize, - bool check_nan_inf, std::string file, - int line, cudaStream_t stream); - -template void check_nan_inf<__half>(const __half *data_ptr, int dsize, - bool check_nan_inf, std::string file, - int line, cudaStream_t stream); diff --git a/colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu b/colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu deleted file mode 100644 index 0ceb16eabef7928b623b774a83645cc810afa1a4..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu +++ /dev/null @@ -1,1043 +0,0 @@ -#include -#include - -#include "kernels.h" - -#ifdef COLOSSAL_HIP -#include -#endif - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -curandStatePhilox4_32_10_t *curandstate; - -/** - * @brief element-wise activation function on device, like Relu, Gelu - * - * @tparam enum class ActivationType, kRelu, kGelu - * @tparam input type - * @param any shape of float and __half2 - * @return same shape and type with input - */ -template -__forceinline__ __device__ T activation_kernel(T x); - -template <> -__device__ float activation_kernel(float x) { - float cdf = - 0.5f * - (1.0f + tanhf((0.7978845608028654f * (x + 0.044715f * x * x * x)))); - return x * cdf; -} - -template <> -__device__ __half2 -activation_kernel(__half2 val) { - __half2 val_pow3 = __hmul2(val, __hmul2(val, val)); - float2 tmp_pow = __half22float2(val_pow3); - float2 tmp = __half22float2(val); - - tmp.x = - 0.5f * - (1.0f + tanhf((0.7978845608028654f * (tmp.x + 0.044715f * tmp_pow.x)))); - tmp.y = - 0.5f * - (1.0f + tanhf((0.7978845608028654f * (tmp.y + 0.044715f * tmp_pow.y)))); - return __hmul2(val, __float22half2_rn(tmp)); -} - -template <> -__device__ float activation_kernel(float x) { - return fmaxf(x, 0); -} - -template <> -__device__ __half2 -activation_kernel(__half2 x) { -#ifdef COLOSSAL_HIP - float2 tmp = __half22float2(x); - return __floats2half2_rn(fmaxf(0.f, tmp.x), - fmaxf(0.f, tmp.y)); -#else - return __floats2half2_rn(fmaxf(0.f, __half2float(x.x)), - fmaxf(0.f, __half2float(x.y))); -#endif -} - -/** - * @brief element-wise activation backward function on device - * - * @tparam enum class ActivationType - * @tparam input type - * @param any shape of float and __half2 - * @return 
same shape of input - */ -template -__forceinline__ __device__ T activation_bwd_kernel(T grad, T x); - -template <> -__device__ float activation_bwd_kernel(float grad, - float x) { - const float sqrt_param = 0.79788456080286535587989211986876f; - const float mul_param = 0.044715; - - float x2mul = x * x * mul_param; - float tan_h = tanhf(sqrt_param * (x + x * x2mul)); - float dg1 = 0.5f * (1.0f + tan_h); - float dg2 = x * 0.5f * sqrt_param * (1 - tan_h * tan_h); - float dg3 = dg2 * 3 * x2mul; - return grad * (dg1 + dg2 + dg3); -} - -template <> -__device__ __half activation_bwd_kernel( - __half grad, __half x_half) { - float x = __half2float(x_half); - const float sqrt_param = 0.79788456080286535587989211986876f; - const float mul_param = 0.044715; - - float x2mul = x * x * mul_param; - float tan_h = tanhf(sqrt_param * (x + x * x2mul)); - float dg1 = 0.5f * (1.0f + tan_h); - float dg2 = x * 0.5f * sqrt_param * (1 - tan_h * tan_h); - float dg3 = dg2 * 3 * x2mul; - return grad * __float2half(dg1 + dg2 + dg3); -} - -template <> -__device__ float activation_bwd_kernel(float grad, - float x) { - return x > 0.f ? grad : 0.f; -} - -template <> -__device__ __half -activation_bwd_kernel(__half grad, __half x) { - const __half half_zero = __float2half(0.f); - return x > half_zero ? grad : half_zero; -} - -template <> -__device__ __half2 activation_bwd_kernel( - __half2 grad2, __half2 x_half2) { -#ifdef COLOSSAL_HIP - float2 tmp_x = __half22float2(x_half2); - float2 tmp_grad2 = __half22float2(grad2); - - return __floats2half2_rn(tmp_x.x > 0.0 ? tmp_grad2.x : 0.0, - tmp_x.y > 0.0 ? tmp_grad2.y : 0.0); -#else - const __half half_zero = __float2half(0.f); - return __floats2half2_rn(x_half2.x > half_zero ? grad2.x : half_zero, - x_half2.y > half_zero ? grad2.y : half_zero); -#endif -} - -/** - * @brief init curand states in global memory - * - * @thread grid_dim * block*dim to suuport any size of states - * @param state persistant curand states - * @param seed seed to init states - * @return void - */ -__global__ void curand_init_kernel(curandStatePhilox4_32_10_t *state, - int seed) { - /* Each thread gets same seed, a different sequence - number, no offset */ - int id = threadIdx.x + blockIdx.x * blockDim.x; - curand_init(seed, id, 0, &state[id]); -} - -void launch_curand_init(int total_count, int dim, cudaStream_t stream) { - cudaMalloc(&curandstate, total_count * sizeof(curandStatePhilox4_32_10_t)); - int grid_dim = total_count >> 9; - curand_init_kernel<<>>( - curandstate, std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count()); -} - -/** - * @brief element-wise dropout, store dropped position in mask, it's not - * in-place - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @param total_count total elements - * @param ratio drop ratio - * @param out any size of float and __half - * @param in same with out - * @param mask uint8 type, same size with out - * @param seed seed to curand - * @return void - */ -__global__ void ls_dropout_kernel(const int total_count, const float ratio, - float *__restrict__ out, - const float *__restrict__ in, - uint8_t *__restrict__ mask, const int seed) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - curandStatePhilox4_32_10_t state; - curand_init(seed, i, 0, &state); - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *data4 = reinterpret_cast(in); - uint32_t *mask4 = reinterpret_cast(mask); - 
float4 rand = curand_uniform4(&state); - - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - - uint32_t *m4 = reinterpret_cast(m); - mask4[i] = m4[0]; - - float4 input4 = data4[i]; - float4 res4; - res4.x = input4.x * scale * m[0]; - res4.y = input4.y * scale * m[1]; - res4.z = input4.z * scale * m[2]; - res4.w = input4.w * scale * m[3]; - out4[i] = res4; -} - -__global__ void ls_dropout_kernel(const int total_count, const float ratio, - __half *__restrict__ out, - const __half *__restrict__ in, - uint8_t *__restrict__ mask, const int seed) { - const float scale = 1.f / (1.f - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - curandStatePhilox4_32_10_t state; - curand_init(seed, i, 0, &state); - - const float4 *vals_float4 = reinterpret_cast(in); - float4 *outs_float4 = reinterpret_cast(out); - uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - float4 rand = curand_uniform4(&state); - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - rand = curand_uniform4(&state); - m[4] = (uint8_t)(rand.x > ratio); - m[5] = (uint8_t)(rand.y > ratio); - m[6] = (uint8_t)(rand.z > ratio); - m[7] = (uint8_t)(rand.w > ratio); - uint64_t *m8 = reinterpret_cast(m); - mask8[i] = *m8; - - float4 val_float4 = vals_float4[i]; - float4 out_float4; - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - __half2 scale_mask_1 = __floats2half2_rn(scale * m[0], scale * m[1]); - __half2 scale_mask_2 = __floats2half2_rn(scale * m[2], scale * m[3]); - __half2 scale_mask_3 = __floats2half2_rn(scale * m[4], scale * m[5]); - __half2 scale_mask_4 = __floats2half2_rn(scale * m[6], scale * m[7]); - out_half2[0] = __hmul2(val_half2[0], scale_mask_1); - out_half2[1] = __hmul2(val_half2[1], scale_mask_2); - out_half2[2] = __hmul2(val_half2[2], scale_mask_3); - out_half2[3] = __hmul2(val_half2[3], scale_mask_4); - outs_float4[i] = out_float4; -} - -/** - * @brief element-wise dropout backward with dropout mask, it's - * not in-place - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @param total_count total elements - * @param ratio drop ratio - * @param in any size of float and __half - * @param mask uint8 type, same size with in - * @return void - */ -__global__ void ls_dropout_bwd_kernel(const int total_count, const float ratio, - float *out, const float *in, - const uint8_t *__restrict__ mask) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *in4 = reinterpret_cast(in); - const uint32_t *mask4 = reinterpret_cast(mask); - - uint32_t *m4 = reinterpret_cast(m); - m4[0] = mask4[i]; - - float4 input4 = in4[i]; - float4 res4; - res4.x = input4.x * scale * static_cast(m[0]); - res4.y = input4.y * scale * static_cast(m[1]); - res4.z = input4.z * scale * static_cast(m[2]); - res4.w = input4.w * scale * static_cast(m[3]); - out4[i] = res4; -} - -__global__ void ls_dropout_bwd_kernel(const int total_count, const float ratio, - __half *out, const __half *in, - const uint8_t *__restrict__ mask) { - const __half scale = 1.f / (1.f - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - float4 
*out4 = reinterpret_cast(out); - const float4 *vals_float4 = reinterpret_cast(in); - const uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - uint64_t *m8 = reinterpret_cast(m); - m8[0] = mask8[i]; - - float4 val_float4 = vals_float4[i]; - float4 out_float4; - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - __half2 scale_mask_1 = - __halves2half2(scale * __float2half(m[0]), scale * __float2half(m[1])); - __half2 scale_mask_2 = - __halves2half2(scale * __float2half(m[2]), scale * __float2half(m[3])); - __half2 scale_mask_3 = - __halves2half2(scale * __float2half(m[4]), scale * __float2half(m[5])); - __half2 scale_mask_4 = - __halves2half2(scale * __float2half(m[6]), scale * __float2half(m[7])); - out_half2[0] = __hmul2(val_half2[0], scale_mask_1); - out_half2[1] = __hmul2(val_half2[1], scale_mask_2); - out_half2[2] = __hmul2(val_half2[2], scale_mask_3); - out_half2[3] = __hmul2(val_half2[3], scale_mask_4); - out4[i] = out_float4; -} - -template <> -void launch_ls_dropout(float *out, const float *vals, uint8_t *mask, - int total_count, float ratio, cudaStream_t stream, - bool backward) { - int grid_dim = total_count >> 12; - if (!backward) { - ls_dropout_kernel<<>>( - total_count, ratio, out, vals, mask, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count()); - } else { - ls_dropout_bwd_kernel<<>>(total_count, ratio, - out, vals, mask); - } -} - -template <> -void launch_ls_dropout<__half>(__half *out, const __half *vals, uint8_t *mask, - int total_count, float ratio, - cudaStream_t stream, bool backward) { - int grid_dim = total_count >> 13; - if (!backward) { - ls_dropout_kernel<<>>( - total_count, ratio, out, vals, mask, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count()); - } else { - ls_dropout_bwd_kernel<<>>(total_count, ratio, - out, vals, mask); - } -} - -/** - * @brief fused bias, dropout, and residual at the end of Attention and FFN, - * store dropped position in mask, it's not in-place - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @param total_count total elements - * @param ratio drop ratio - * @param out [batch_size, seq_len, hidden_size], float and __half - * @param in [batch_size, seq_len, hidden_size], float and __half - * @param mask [batch_size, seq_len, hidden_size], uint8 type - * @param bias [hidden_size], ffn bias - * @param residual [batch_size, seq_len, hidden_size], float and __half - * @param seed seed to curand - * @param hidden_size hidden size - * @return void - */ -__global__ void ls_dropout_res_bias_kernel( - const int total_count, const float ratio, float *__restrict__ out, - const float *__restrict__ in, uint8_t *__restrict__ mask, - const float *__restrict__ bias, const float *__restrict__ residual, - const int seed, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - curandStatePhilox4_32_10_t state; - curand_init(seed, i, 0, &state); - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *data4 = reinterpret_cast(in); - const float4 *residual4 = reinterpret_cast(residual); - const float4 *bias4 = reinterpret_cast(bias); - uint32_t *mask4 = reinterpret_cast(mask); - float4 rand = curand_uniform4(&state); - - m[0] = static_cast(rand.x > ratio); - m[1] = static_cast(rand.y > ratio); - m[2] = static_cast(rand.z > ratio); - 
m[3] = static_cast(rand.w > ratio); - - int bias_i = i % (hidden_size >> 2); - uint32_t *m4 = reinterpret_cast(m); - mask4[i] = m4[0]; - const float4 input4 = data4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - const float4 res4 = residual4[i]; - float4 output4; - - output4.x = (input4.x + b4.x) * scale * m[0] + res4.x; - output4.y = (input4.y + b4.y) * scale * m[1] + res4.y; - output4.z = (input4.z + b4.z) * scale * m[2] + res4.z; - output4.w = (input4.w + b4.w) * scale * m[3] + res4.w; - - out4[i] = output4; -} - -__global__ void ls_dropout_res_bias_kernel( - const int total_count, const float ratio, __half *__restrict__ out, - const __half *__restrict__ in, uint8_t *__restrict__ mask, - const __half *__restrict__ bias, const __half *__restrict__ residual, - const int seed, const int hidden_size) { - const __half scale = 1. / (1. - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - curandStatePhilox4_32_10_t state; - curand_init(seed, i, 0, &state); - - const float4 *vals_float4 = reinterpret_cast(in); - float4 *outs_float4 = reinterpret_cast(out); - const float4 *residual4 = reinterpret_cast(residual); - const float4 *bias4 = reinterpret_cast(bias); - uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - float4 rand = curand_uniform4(&state); - m[0] = static_cast(rand.x > ratio); - m[1] = static_cast(rand.y > ratio); - m[2] = static_cast(rand.z > ratio); - m[3] = static_cast(rand.w > ratio); - rand = curand_uniform4(&state); - m[4] = static_cast(rand.x > ratio); - m[5] = static_cast(rand.y > ratio); - m[6] = static_cast(rand.z > ratio); - m[7] = static_cast(rand.w > ratio); - uint64_t *m8 = reinterpret_cast(m); - mask8[i] = m8[0]; - - int bias_i = i % (hidden_size >> 3); - float4 val_float4 = vals_float4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - const float4 res4 = residual4[i]; - float4 out_float4; - - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - const __half2 *b_half2 = reinterpret_cast(&b4); - const __half2 *res_half2 = reinterpret_cast(&res4); - __half2 scale_mask_1 = - __halves2half2(scale * __float2half(m[0]), scale * __float2half(m[1])); - __half2 scale_mask_2 = - __halves2half2(scale * __float2half(m[2]), scale * __float2half(m[3])); - __half2 scale_mask_3 = - __halves2half2(scale * __float2half(m[4]), scale * __float2half(m[5])); - __half2 scale_mask_4 = - __halves2half2(scale * __float2half(m[6]), scale * __float2half(m[7])); - out_half2[0] = - __hfma2(__hadd2(val_half2[0], b_half2[0]), scale_mask_1, res_half2[0]); - out_half2[1] = - __hfma2(__hadd2(val_half2[1], b_half2[1]), scale_mask_2, res_half2[1]); - out_half2[2] = - __hfma2(__hadd2(val_half2[2], b_half2[2]), scale_mask_3, res_half2[2]); - out_half2[3] = - __hfma2(__hadd2(val_half2[3], b_half2[3]), scale_mask_4, res_half2[3]); - outs_float4[i] = out_float4; -} - -template <> -void launch_ls_dropout_res_bias(float *out, const float *vals, - uint8_t *mask, const float *bias, - const float *residual, int total_count, - int dim, float ratio, - cudaStream_t stream) { - int grid_dim = total_count >> 12; - ls_dropout_res_bias_kernel<<>>( - total_count, ratio, out, vals, mask, bias, residual, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_res_bias<__half>(__half *out, const __half *vals, - uint8_t *mask, const __half *bias, - const __half *residual, int total_count, - int dim, 
float ratio, - cudaStream_t stream) { - int grid_dim = total_count >> 13; - ls_dropout_res_bias_kernel<<>>( - total_count, ratio, out, vals, mask, bias, residual, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -/** - * @brief fused bias and dropout backward at the end of Attention and FFN - * - * @thread - * gridDim.x = hidden_size / 8 - * blockDim.x = 8 - * blockDim.y = 1024 / 8 = 128 - * - * @param row_size batch_size * seq_len - * @param ratio dropout ratio - * @param in_grad [batch_size, seq_len, hidden_size], input grad - * @param bias_grad [hidden_size], bias grad - * @param out_grad [batch_size, seq_len, hidden_size], output grad - * @param mask [batch_size, seq_len, hidden_size], dropout mask - * @param hidden_size - * @return void - */ -__global__ void ls_dropout_bias_bwd_kernel( - const int row_size, const float ratio, float *__restrict__ in_grad, - float *__restrict__ bias_grad, const float *__restrict__ out_grad, - const uint8_t *__restrict__ mask, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - // every block generate 8 bias result - __shared__ float tile[8][129]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int col_idx = flat_2dim(blockIdx.x, threadIdx.x, 8); - int stride = hidden_size * 128; - float local_sum = 0; - - int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); - for (int r = threadIdx.y; r < row_size; r += 128) { - float val = out_grad[idx]; - val *= scale * static_cast(mask[idx]); - local_sum += val; - in_grad[idx] = val; - idx += stride; - } - - tile[threadIdx.x][threadIdx.y] = local_sum; - __syncthreads(); - - float sum = 0; - int tid = threadIdx.y * blockDim.x + threadIdx.x; - int x = tid >> 7; - int y = tid & (127); - if (y < 32) { -#pragma unroll - for (int i = 0; i < 4; i++) { - sum += tile[x][y + i * 32]; - } - } - __syncthreads(); - -#ifdef COLOSSAL_HIP - for (int i = 1; i < 32; i <<= 1) sum += __shfl_down(sum, i); -#else - for (int i = 1; i < 32; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (y == 0) tile[0][x] = sum; - __syncthreads(); - - if (threadIdx.x < 8) { - int pos = flat_2dim(blockIdx.x, threadIdx.x, 8); - bias_grad[pos] = tile[0][threadIdx.x]; - } -} - -__global__ void ls_dropout_bias_bwd_kernel( - const int row_size, const float ratio, __half *__restrict__ in_grad, - __half *__restrict__ bias_grad, const __half *__restrict__ out_grad, - const uint8_t *__restrict__ mask, const int hidden_size) { - const __half2 scale = __float2half2_rn(1.f / (1.f - ratio)); - __shared__ __half2 tile[8][129]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - - __half2 *in_grad2 = reinterpret_cast<__half2 *>(in_grad); - const __half2 *out_grad2 = reinterpret_cast(out_grad); - __half2 *bias_grad2 = reinterpret_cast<__half2 *>(bias_grad); - - int col_idx = flat_2dim(blockIdx.x, threadIdx.x, 8); - int stride = hidden_size * 128; - __half2 local_sum = __float2half2_rn(0.f); - - int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); - for (int r = threadIdx.y; r < row_size; r += 128) { - __half2 val = out_grad2[idx]; - __half2 m2 = __floats2half2_rn(mask[2 * idx], mask[2 * idx + 1]); - val *= scale * m2; - local_sum += val; - in_grad2[idx] = val; - idx += stride; - } - - tile[threadIdx.x][threadIdx.y] = local_sum; - __syncthreads(); - - __half2 sum = __float2half2_rn(0.f); - int tid = threadIdx.y * 
blockDim.x + threadIdx.x; - int x = tid >> 7; - int y = tid & (127); - if (y < 32) { -#pragma unroll - for (int i = 0; i < 4; i++) { - sum += tile[x][y + i * 32]; - } - } - __syncthreads(); - -#ifdef COLOSSAL_HIP - float2 sum_f2 = __half22float2(sum); - for (int i = 1; i < WARP_SIZE; i <<= 1) sum_f2.x += __shfl_down(sum_f2.x, i); - for (int i = 1; i < WARP_SIZE; i <<= 1) sum_f2.y += __shfl_down(sum_f2.y, i); - sum = __float22half2_rn(sum_f2); -#else - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (y == 0) tile[0][x] = sum; - __syncthreads(); - - if (threadIdx.x < 8) { - int pos = flat_2dim(blockIdx.x, threadIdx.x, 8); - bias_grad2[pos] = tile[0][threadIdx.x]; - } -} - -template -void launch_ls_dropout_bias_bwd(T *in_grad, T *bias_grad, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream) { - dim3 grid_dim((dim - 1) / 8 + 1); - dim3 block_dim(8, 128); - ls_dropout_bias_bwd_kernel<<>>( - row_size, ratio, in_grad, bias_grad, out_grad, mask, dim); -} - -template <> -void launch_ls_dropout_bias_bwd(__half *in_grad, __half *bias_grad, - const __half *out_grad, const uint8_t *mask, - int row_size, int dim, float ratio, - cudaStream_t stream) { - dim >>= 1; - dim3 grid_dim((dim - 1) / 8 + 1); - dim3 block_dim(8, 128); - ls_dropout_bias_bwd_kernel<<>>( - row_size, ratio, in_grad, bias_grad, out_grad, mask, dim); -} - -template void launch_ls_dropout_bias_bwd(float *in_grad, float *bias_grad, - const float *out_grad, - const uint8_t *mask, int row_size, - int dim, float ratio, - cudaStream_t stream); - -/** - * @brief fused bias, activation, and dropout at the end of first ffn - * - * @thread - * gridDim.x = hidden_size / 8 - * blockDim.x = 8 - * blockDim.y = 1024 / 8 = 128 - * - * @tparam act_type activation function, like kRelu, kGelu - * @param total_count total elements - * @param ratio drop ratio - * @param out [batch_size, seq_len, hidden_size], float and __half - * @param in [batch_size, seq_len, hidden_size], float and __half - * @param mask [batch_size, seq_len, hidden_size], uint8 type - * @param bias [hidden_size], ffn bias - * @param seed seed to curand - * @param hidden_size - * @return void - */ -template -__global__ void ls_dropout_act_bias_kernel( - const int total_count, const float ratio, float *__restrict__ out, - const float *__restrict__ in, uint8_t *__restrict__ mask, - const float *__restrict__ bias, const int seed, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - curandStatePhilox4_32_10_t state; - curand_init(seed, i, 0, &state); - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *data4 = reinterpret_cast(in); - const float4 *bias4 = reinterpret_cast(bias); - uint32_t *mask4 = reinterpret_cast(mask); - float4 rand = curand_uniform4(&state); - - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - - int bias_i = i % (hidden_size >> 2); - uint32_t *m4 = reinterpret_cast(m); - mask4[i] = m4[0]; - const float4 input4 = data4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - float4 output4; - - output4.x = - activation_kernel(input4.x + b4.x) * scale * m[0]; - output4.y = - activation_kernel(input4.y + b4.y) * scale * m[1]; - output4.z = - activation_kernel(input4.z + b4.z) * scale * m[2]; - output4.w = - activation_kernel(input4.w + b4.w) * scale * m[3]; - - out4[i] = output4; 
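-  // output4 now holds dropout(activation(x + bias)) for four consecutive
-  // elements; the single float4 store keeps the write coalesced.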
-} - -template -__global__ void ls_dropout_act_bias_kernel( - const int total_count, const float ratio, __half *__restrict__ out, - const __half *__restrict__ in, uint8_t *__restrict__ mask, - const __half *__restrict__ bias, const int seed, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - curandStatePhilox4_32_10_t state; - curand_init(seed, i, 0, &state); - - const float4 *vals_float4 = reinterpret_cast(in); - float4 *outs_float4 = reinterpret_cast(out); - const float4 *bias4 = reinterpret_cast(bias); - uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - float4 rand = curand_uniform4(&state); - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - rand = curand_uniform4(&state); - m[4] = (uint8_t)(rand.x > ratio); - m[5] = (uint8_t)(rand.y > ratio); - m[6] = (uint8_t)(rand.z > ratio); - m[7] = (uint8_t)(rand.w > ratio); - uint64_t *m8 = reinterpret_cast(m); - mask8[i] = *m8; - - int bias_i = i % (hidden_size >> 3); - float4 val_float4 = vals_float4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - float4 out_float4; - - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - const __half2 *b_half2 = reinterpret_cast(&b4); - - __half2 scale_mask_1 = __floats2half2_rn(scale * m[0], scale * m[1]); - __half2 scale_mask_2 = __floats2half2_rn(scale * m[2], scale * m[3]); - __half2 scale_mask_3 = __floats2half2_rn(scale * m[4], scale * m[5]); - __half2 scale_mask_4 = __floats2half2_rn(scale * m[6], scale * m[7]); - out_half2[0] = __hmul2( - activation_kernel(__hadd2(val_half2[0], b_half2[0])), - scale_mask_1); - out_half2[1] = __hmul2( - activation_kernel(__hadd2(val_half2[1], b_half2[1])), - scale_mask_2); - out_half2[2] = __hmul2( - activation_kernel(__hadd2(val_half2[2], b_half2[2])), - scale_mask_3); - out_half2[3] = __hmul2( - activation_kernel(__hadd2(val_half2[3], b_half2[3])), - scale_mask_4); - outs_float4[i] = out_float4; -} - -template <> -void launch_ls_dropout_act_bias( - float *out, const float *vals, uint8_t *mask, const float *bias, - int total_count, int dim, float ratio, cudaStream_t stream) { - int grid_dim = total_count >> 10; - ls_dropout_act_bias_kernel - <<>>( - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_act_bias( - __half *out, const __half *vals, uint8_t *mask, const __half *bias, - int total_count, int dim, float ratio, cudaStream_t stream) { - int grid_dim = total_count >> 11; - ls_dropout_act_bias_kernel - <<>>( - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_act_bias( - float *out, const float *vals, uint8_t *mask, const float *bias, - int total_count, int dim, float ratio, cudaStream_t stream) { - int grid_dim = total_count >> 10; - ls_dropout_act_bias_kernel - <<>>( - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_act_bias( - __half *out, const __half *vals, uint8_t *mask, const __half *bias, - int total_count, int dim, float ratio, cudaStream_t stream) { - 
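-  // Each thread consumes 8 __half values; threads whose slice begins past
-  // total_count exit early through the kernel's (i * 8 >= total_count) guard,
-  // so the grid does not need to divide the element count exactly.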
int grid_dim = total_count >> 11; - ls_dropout_act_bias_kernel - <<>>( - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -/** - * @brief fused bias, activation, and dropout backward - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @tparam act_type kRelu - * @param row_size batch_size * seq_len - * @param ratio dropout ratio - * @param in_grad [batch_size, seq_len, hidden_size], input grad - * @param bias_grad [hidden_size], bias grad - * @param out_grad [batch_size, seq_len, hidden_size], output grad - * @param mask [batch_size, seq_len, hidden_size], dropout mask - * @param hidden_size - * @return void - */ -template -__global__ void ls_dropout_act_bias_bwd_kernel( - const int row_size, const float ratio, T *in_grad, - T *__restrict__ bias_grad, const T *__restrict__ input, - const T *__restrict__ bias, const T *out_grad, - const uint8_t *__restrict__ mask, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - __shared__ float tile[WARP_SIZE][WARP_SIZE + 1]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int col_idx = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - - int stride = hidden_size * WARP_SIZE; - float local_sum = 0; - - int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); - if (col_idx < hidden_size) { - for (int r = threadIdx.y; r < row_size; r += WARP_SIZE) { - float val = out_grad[idx]; - float in = input[idx]; - float b = bias[idx % hidden_size]; - val = activation_bwd_kernel( - val * scale * static_cast(mask[idx]), in + b); - local_sum += val; - in_grad[idx] = val; - idx += stride; - } - } - - tile[threadIdx.x][threadIdx.y] = local_sum; - __syncthreads(); - float sum = tile[threadIdx.y][threadIdx.x]; - __syncthreads(); - -#ifdef COLOSSAL_HIP - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += __shfl_down(sum, i); -#else - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (threadIdx.x == 0) tile[0][threadIdx.y] = sum; - __syncthreads(); - - if (threadIdx.y == 0) { - int pos = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - bias_grad[pos] = tile[0][threadIdx.x]; - } -} - -// @brief fused bias, activation, and dropout backward -// It is deprecated for precision reason. Keep it for future optimization. 
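The precision issue mentioned above is easy to reproduce in isolation: an fp16 accumulator has a 10-bit mantissa, so a running sum of ones stops growing at 2048, while a float accumulator stays exact well beyond that. The block below is a minimal standalone sketch of that failure mode (hypothetical `half_accumulation_demo`, assuming an sm_53+ GPU for the `__hadd` intrinsic); it is illustrative only, not part of these sources:

```cpp
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Summing 4096 ones: the __half sum stalls at 2048 (adding 1 to 2048 rounds
// back to 2048 in fp16), while the float sum reaches 4096 exactly.
__global__ void half_accumulation_demo() {
  __half h = __float2half(0.f);
  float f = 0.f;
  for (int i = 0; i < 4096; ++i) {
    h = __hadd(h, __float2half(1.f));
    f += 1.f;
  }
  printf("half sum: %.1f, float sum: %.1f\n", __half2float(h), f);
}

int main() {
  half_accumulation_demo<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}
```

This is why the active `ls_dropout_act_bias_bwd_kernel` template accumulates `local_sum` in float even when T is `__half`, trading the vectorized `__half2` loads of the commented-out variant below for correct bias gradients.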
-// -// template -// __global__ void ls_dropout_act_bias_bwd_kernel( -// const int row_size, const float ratio, __half * in_grad, -// __half *__restrict__ bias_grad, const __half *__restrict__ input, const -// __half *__restrict__ bias, const __half * out_grad, const uint8_t -// *__restrict__ mask, const int hidden_size) { -// const __half2 scale = __float2half2_rn(1.f / (1.f - ratio)); -// __shared__ __half2 tile[WARP_SIZE][WARP_SIZE + 1]; - -// cg::thread_block b = cg::this_thread_block(); -// cg::thread_block_tile g = cg::tiled_partition(b); - -// __half2 *in_grad2 = reinterpret_cast<__half2 *>(in_grad); -// __half2 *bias_grad2 = reinterpret_cast<__half2 *>(bias_grad); -// const __half2 *out_grad2 = reinterpret_cast(out_grad); -// const __half2 *input2 = reinterpret_cast(input); -// const __half2 *bias2 = reinterpret_cast(bias); - -// int col_idx = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - -// int stride = hidden_size * WARP_SIZE; -// __half2 local_sum = __float2half2_rn(0.f); - -// int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); -// if (col_idx < hidden_size) { -// for (int r = threadIdx.y; r < row_size; r += WARP_SIZE) { -// __half2 val = out_grad2[idx]; -// __half2 in2 = input2[idx]; -// __half2 b2 = bias2[idx % hidden_size ]; -// __half2 m2 = __floats2half2_rn(mask[2 * idx], mask[2 * idx + 1]); -// val = activation_bwd_kernel(val * scale -// * -// m2, -// in2+b2); -// local_sum += val; -// in_grad2[idx] = val; -// idx += stride; -// } -// } - -// tile[threadIdx.x][threadIdx.y] = local_sum; -// __syncthreads(); -// __half2 sum = tile[threadIdx.y][threadIdx.x]; -// __syncthreads(); - -// for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); - -// if (threadIdx.x == 0) tile[0][threadIdx.y] = sum; -// __syncthreads(); - -// if (threadIdx.y == 0) { -// int pos = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); -// bias_grad2[pos] = tile[0][threadIdx.x]; -// } -// } - -template -void launch_ls_dropout_act_bias_bwd(T *in_grad, T *bias_grad, const T *input, - const T *bias, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream) { - dim3 grid_dim((dim - 1) / WARP_SIZE + 1); - dim3 block_dim(WARP_SIZE, WARP_SIZE); - ls_dropout_act_bias_bwd_kernel<<>>( - row_size, ratio, in_grad, bias_grad, input, bias, out_grad, mask, dim); -} - -// template <> -// void launch_ls_dropout_act_bias_bwd( -// __half *in_grad, __half *bias_grad,const __half *input, const __half -// *bias, const __half *out_grad, const uint8_t *mask, int row_size, int -// dim, float ratio, cudaStream_t stream) { -// dim >>= 1; -// dim3 grid_dim((dim - 1) / WARP_SIZE + 1); -// dim3 block_dim(WARP_SIZE, WARP_SIZE); -// ls_dropout_act_bias_bwd_kernel -// <<>>(row_size, ratio, in_grad, -// bias_grad, -// input, bias,out_grad, mask, dim); -// } - -template void launch_ls_dropout_act_bias_bwd( - float *in_grad, float *bias_grad, const float *input, const float *bias, - const float *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream); - -template void launch_ls_dropout_act_bias_bwd( - __half *in_grad, __half *bias_grad, const __half *input, const __half *bias, - const __half *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream); - -template void launch_ls_dropout_act_bias_bwd( - float *in_grad, float *bias_grad, const float *input, const float *bias, - const float *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream); - -template void 
launch_ls_dropout_act_bias_bwd( - __half *in_grad, __half *bias_grad, const __half *input, const __half *bias, - const __half *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream); diff --git a/colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu b/colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu deleted file mode 100644 index 300cf6a15ae7113174e6fa3a7aadcc25ad2eccbf..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu +++ /dev/null @@ -1,240 +0,0 @@ -#include "kernels.h" - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -/** -@brief: fuse_transpose_bias -Calculate the sum of elements in each column of the matrix. - -@thread -gridDim.x = ceil(cols / WARP_SIZE) -blockDim.x = WARP_SIZE -blockDim.y = WARP_SIZE - -@param -inp: [rows, cols] -out: [cols] -rows: the number of rows in the matrix -cols: the number of cols in the matrix -*/ -template -__global__ void column_sum_reduce(const T *__restrict__ inp, - T *__restrict__ out, int rows, int cols) { - __shared__ float tile[WARP_SIZE][WARP_SIZE]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int idx = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - int y_stride = cols * WARP_SIZE; - float localSum = 0; - - // Loop across matrix row - // TODO: optimize to log complexity - if (idx < cols) { - int offset = flat_2dim(threadIdx.y, idx, cols); - for (int r = threadIdx.y; r < rows; r += WARP_SIZE) { - localSum += (float)inp[offset]; - offset += y_stride; - } - } - - // The sum of a row in tile is equal to the sum of a col in original matrix - tile[threadIdx.x][threadIdx.y] = localSum; - - __syncthreads(); - - // Sum the shared buffer. - // The change of threadIdx.x is continuous - float sum = tile[threadIdx.y][threadIdx.x]; - - __syncthreads(); - - // Calculate the sum of a row in tile -#ifdef COLOSSAL_HIP - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += __shfl_down(sum, i); -#else - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (threadIdx.x == 0) { - int pos = flat_2dim(blockIdx.x, threadIdx.y, WARP_SIZE); - if (pos < cols) out[pos] = sum; - } -} - -// [r, c] -> [c] -template <> -void launch_fuse_transpose_bias_kernel(const float *inp, float *out, - int rows, int cols, - cudaStream_t stream) { - dim3 grid_dim((cols - 1) / WARP_SIZE + 1); - dim3 block_dim(WARP_SIZE, WARP_SIZE); - - column_sum_reduce - <<>>(inp, out, rows, cols); -} - -template <> -void launch_fuse_transpose_bias_kernel<__half>(const __half *inp, __half *out, - int rows, int cols, - cudaStream_t stream) { - dim3 grid_dim((cols - 1) / WARP_SIZE + 1); - dim3 block_dim(WARP_SIZE, WARP_SIZE); - - column_sum_reduce<__half> - <<>>(inp, out, rows, cols); -} - -/** -@brief: fused_add2 -Add two matrix inp1 and inp2 to out. 
- -@thread -gridDim.x = batch_size * seq_len -blockDim.x = min(hidden_dim, MAX_THREADS) - -@param -inp1: [batch_size, seq_len, hidden_dim] -inp2: [batch_size, seq_len, hidden_dim] -out: [batch_size, seq_len, hidden_dim] -batch_size: the size of the current batch -seq_len: the sequence length of the current batch -hidden_dim: dim of the hidden tensor -*/ -template -__global__ void fused_add2_kernel(T *out, const T *inp1, const T *inp2, - int hidden_dim); - -template <> -__global__ void fused_add2_kernel(float *out, const float *inp1, - const float *inp2, int hidden_dim) { - int row_id = blockIdx.x; - int offset = flat_2dim(row_id, 0, hidden_dim); - - const float4 *inp1_4 = reinterpret_cast(inp1); - const float4 *inp2_4 = reinterpret_cast(inp2); - float4 *out_4 = reinterpret_cast(out); - float4 vinp1; - float4 vinp2; - float4 val; - - for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) { - vinp1 = inp1_4[offset + i]; - vinp2 = inp2_4[offset + i]; - val.x = vinp1.x + vinp2.x; - val.y = vinp1.y + vinp2.y; - val.z = vinp1.z + vinp2.z; - val.w = vinp1.w + vinp2.w; - out_4[offset + i] = val; - } -} - -template <> -__global__ void fused_add2_kernel<__half>(__half *out, const __half *inp1, - const __half *inp2, int hidden_dim) { - int row_id = blockIdx.x; - int offset = flat_2dim(row_id, 0, hidden_dim); - - const float4 *inp1_4 = reinterpret_cast(inp1); - const float4 *inp2_4 = reinterpret_cast(inp2); - float4 *out_4 = reinterpret_cast(out); - float4 vinp1; - float4 vinp2; - float4 val; - __half2 *h2_inp1 = reinterpret_cast<__half2 *>(&vinp1); - __half2 *h2_inp2 = reinterpret_cast<__half2 *>(&vinp2); - __half2 *h2_val = reinterpret_cast<__half2 *>(&val); - - for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) { - vinp1 = inp1_4[offset + i]; - vinp2 = inp2_4[offset + i]; - h2_val[0] = __hadd2(h2_inp1[0], h2_inp2[0]); - h2_val[1] = __hadd2(h2_inp1[1], h2_inp2[1]); - h2_val[2] = __hadd2(h2_inp1[2], h2_inp2[2]); - h2_val[3] = __hadd2(h2_inp1[3], h2_inp2[3]); - out_4[offset + i] = val; - } -} - -//[b, s, h] -> [b, s, h] -template <> -void launch_fused_add2(float *out, const float *inp1, const float *inp2, - int batch_size, int seq_len, int hidden_dim, - cudaStream_t &stream) { - hidden_dim >>= 2; - - dim3 grid_dim(batch_size * seq_len); - dim3 block_dim(min(hidden_dim, MAX_THREADS)); - - fused_add2_kernel<<>>(out, inp1, inp2, - hidden_dim); -} - -template <> -void launch_fused_add2<__half>(__half *out, const __half *inp1, - const __half *inp2, int batch_size, int seq_len, - int hidden_dim, cudaStream_t &stream) { - hidden_dim >>= 3; - - dim3 grid_dim(batch_size * seq_len); - dim3 block_dim(min(hidden_dim, MAX_THREADS)); - - fused_add2_kernel<<>>(out, inp1, inp2, - hidden_dim); -} - -template -__global__ void kernel_concat3_dim1(const T *inp1, const T *inp2, T *output, - int sz0, int sz2, int sz1_1, int sz1_2) { - int nele = sz0 * sz2 * (sz1_1 + sz1_2); - int idx = flat_2dim(blockIdx.x, threadIdx.x, blockDim.x); - if (idx >= nele) { - return; - } - float4 *dst_ptr = (float4 *)output + idx; - int idx2 = idx % sz2; - idx = idx / sz2; - int idx1 = idx % (sz1_1 + sz1_2); - int idx0 = idx / (sz1_1 + sz1_2); - float4 *src_ptr = nullptr; - int sz1 = 0; - if (idx1 < sz1_1) { - sz1 = sz1_1; - src_ptr = (float4 *)inp1; - } else { - idx1 -= sz1_1; - sz1 = sz1_2; - src_ptr = (float4 *)inp2; - } - src_ptr += flat_3dim(idx0, idx1, idx2, sz1, sz2); - dst_ptr[0] = src_ptr[0]; -} - -template <> -void launch_concat3_dim1(const float *inp1, const float *inp2, - float *output, int sz0, int sz2, 
int sz1_1, - int sz1_2, cudaStream_t stream) { - sz2 >>= 2; - int nele = sz0 * sz2 * (sz1_1 + sz1_2); - int nblock = (nele + MAX_THREADS - 1) / MAX_THREADS; - kernel_concat3_dim1<<>>( - inp1, inp2, output, sz0, sz2, sz1_1, sz1_2); -} - -template <> -void launch_concat3_dim1<__half>(const __half *inp1, const __half *inp2, - __half *output, int sz0, int sz2, int sz1_1, - int sz1_2, cudaStream_t stream) { - sz2 >>= 3; - int nele = sz0 * sz2 * (sz1_1 + sz1_2); - int nblock = (nele + MAX_THREADS - 1) / MAX_THREADS; - kernel_concat3_dim1<<>>( - inp1, inp2, output, sz0, sz2, sz1_1, sz1_2); -} diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h b/colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h deleted file mode 100644 index 9ad27a8d716efa0af67e21215f1b1ffc4419dba4..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h +++ /dev/null @@ -1,391 +0,0 @@ -/* Copyright 2021 The LightSeq Team - Copyright Tencent/TurboTransformers - This block_reduce_n is adapted from Tencent/TurboTransformers -*/ -#pragma once -#include -#include -#include - -enum class ReduceType { kMax = 0, kSum }; -const unsigned int WARP_REDUCE_MASK = 0xffffffff; -const float REDUCE_FLOAT_INF_NEG = -100000000.f; -const float REDUCE_FLOAT_INF_POS = 100000000.f; - -#ifdef COLOSSAL_HIP -const unsigned int WARP_REDUCE_SIZE = 64; -#else -const unsigned int WARP_REDUCE_SIZE = 32; -#endif - -template -__forceinline__ __device__ T warpReduceSum(T val) { - for (int mask = (WARP_REDUCE_SIZE >> 1); mask > 0; mask >>= 1) -#ifdef COLOSSAL_HIP - val += __shfl_xor_sync(val, mask, WARP_REDUCE_SIZE); -#else - val += __shfl_xor_sync(WARP_REDUCE_MASK, val, mask, WARP_REDUCE_SIZE); -#endif - return val; -} - -/* Calculate the sum of all elements in a block */ -template -__forceinline__ __device__ T blockReduceSum(T val) { - static __shared__ T shared[32]; - int lane = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - val = warpReduceSum(val); - - if (lane == 0) shared[wid] = val; - __syncthreads(); - - val = (threadIdx.x < (blockDim.x >> 5)) ? 
shared[lane] : (T)0.0f; - val = warpReduceSum(val); - return val; -} - -template -__inline__ __device__ void blockReduce(float *pval); - -// use template to make code more concise -template -__inline__ __device__ void warpReduce(float *pval); - -// static -template <> -__inline__ __device__ void warpReduce(float *pval) { -#ifdef COLOSSAL_HIP - *pval = max(*pval, __shfl_xor(*pval, 32, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 16, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 8, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 4, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 2, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 1, WARP_REDUCE_SIZE)); -#else - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 16, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 8, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 4, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 2, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 1, 32)); -#endif -} - -template <> -__inline__ __device__ void warpReduce(float *pval) { - float val0_tmp, val1_tmp; -#ifdef COLOSSAL_HIP -#define WarpReduceMaxOneStep(a, b) \ - val0_tmp = __shfl_xor(*(pval), a, b); \ - val1_tmp = __shfl_xor(*(pval + 1), a, b); \ - *(pval) = max(val0_tmp, *(pval)); \ - *(pval + 1) = max(val1_tmp, *(pval + 1)); - - WarpReduceMaxOneStep(32, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(16, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(8, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(4, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(2, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(1, WARP_REDUCE_SIZE); -#else -#define WarpReduceMaxOneStep(a, b) \ - val0_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval), a, b); \ - val1_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 1), a, b); \ - *(pval) = max(val0_tmp, *(pval)); \ - *(pval + 1) = max(val1_tmp, *(pval + 1)); - - WarpReduceMaxOneStep(16, 32); - WarpReduceMaxOneStep(8, 32); - WarpReduceMaxOneStep(4, 32); - WarpReduceMaxOneStep(2, 32); - WarpReduceMaxOneStep(1, 32); -#endif - -#undef WarpReduceMaxOneStep -} - -template <> -__inline__ __device__ void warpReduce(float *pval) { -#ifdef COLOSSAL_HIP - *pval += __shfl_xor(*pval, 32, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 16, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 8, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 4, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 2, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 1, WARP_REDUCE_SIZE); -#else - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 16, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 8, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 4, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 2, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 1, 32); -#endif -} - -/* - * Unorll for loop for warpreduce to - * imporve instruction issue efficiency - * ElemX means there are X numbers to be summed - */ - -template <> -__inline__ __device__ void warpReduce(float *pval) { - float val0_tmp, val1_tmp; - -#ifdef COLOSSAL_HIP -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor(*(pval + 0), a, b); \ - val1_tmp = __shfl_xor(*(pval + 1), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp - - WarpReduceSumOneStep(32, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(16, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(8, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(4, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(2, WARP_REDUCE_SIZE); - 
WarpReduceSumOneStep(1, WARP_REDUCE_SIZE); -#else -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 0), a, b); \ - val1_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 1), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp - - WarpReduceSumOneStep(16, 32); - WarpReduceSumOneStep(8, 32); - WarpReduceSumOneStep(4, 32); - WarpReduceSumOneStep(2, 32); - WarpReduceSumOneStep(1, 32); -#endif - -#undef WarpReduceSumOneStep -} - -template <> -__inline__ __device__ void warpReduce(float *pval) { - float val0_tmp, val1_tmp, val2_tmp, val3_tmp; - -#ifdef COLOSSAL_HIP -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor(*(pval + 0), a, b); \ - val1_tmp = __shfl_xor(*(pval + 1), a, b); \ - val2_tmp = __shfl_xor(*(pval + 2), a, b); \ - val3_tmp = __shfl_xor(*(pval + 3), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp; \ - *(pval + 2) += val2_tmp; \ - *(pval + 3) += val3_tmp - - WarpReduceSumOneStep(32, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(16, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(8, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(4, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(2, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(1, WARP_REDUCE_SIZE); -#else -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 0), a, b); \ - val1_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 1), a, b); \ - val2_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 2), a, b); \ - val3_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 3), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp; \ - *(pval + 2) += val2_tmp; \ - *(pval + 3) += val3_tmp - - WarpReduceSumOneStep(16, 32); - WarpReduceSumOneStep(8, 32); - WarpReduceSumOneStep(4, 32); - WarpReduceSumOneStep(2, 32); - WarpReduceSumOneStep(1, 32); -#endif -#undef WarpReduceSumOneStep -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = 0.f; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 2; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = 0.f; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 4; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - 
} else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = 0.f; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = REDUCE_FLOAT_INF_NEG; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = REDUCE_FLOAT_INF_NEG; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = REDUCE_FLOAT_INF_NEG; - } - } - warpReduce(pval); -} diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/context.h b/colossalai/kernel/cuda_native/csrc/kernels/include/context.h deleted file mode 100644 index f7d75f38cc2b568db74c935ef26cc14afce312ef..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/context.h +++ /dev/null @@ -1,36 +0,0 @@ -#pragma once - -#include -#include - -#include -#include - -#include "cuda_util.h" - -class Context { - public: - Context() : _stream(nullptr) { - CHECK_GPU_ERROR(cublasCreate(&_cublasHandle)); - } - - virtual ~Context() {} - - static Context &Instance() { - static Context _ctx; - return _ctx; - } - - void set_stream(cudaStream_t stream) { - _stream = stream; - CHECK_GPU_ERROR(cublasSetStream(_cublasHandle, _stream)); - } - - cudaStream_t get_stream() { return _stream; } - - cublasHandle_t get_cublashandle() { return _cublasHandle; } - - private: - cudaStream_t _stream; - cublasHandle_t _cublasHandle; -}; diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h b/colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h deleted file mode 100644 index f4e9befc6588563e04c889da6460bd50ffa5aa56..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h +++ /dev/null @@ -1,46 +0,0 @@ -#pragma once - -#include -#include -#include - -#include - -#include "cuda_util.h" - -template -class CrossEntropyLayer { - public: - CrossEntropyLayer(float epsilon, int padding_idx, int max_batch_tokens); - - virtual ~CrossEntropyLayer(); 
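// A typical call sequence for this layer (a sketch with illustrative
// variable names; logits is a [batch, seq_len, vocab] device tensor): the
// internal loss buffer is sized once from max_batch_tokens, so the current
// batch shape must be registered before every Forward/Backward pair:
//
//   CrossEntropyLayer<float> ce(0.1f /*epsilon*/, 0 /*padding_idx*/,
//                               max_batch_tokens);
//   ce.set_cur_batch_shape(batch_size, seq_len, vocab_size);
//   ce.Forward(logits, targets, losses, nll_loss);
//   ce.Backward(grad_losses, logits, targets, grad_logits);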
- - void Forward(const T *inputs_ptr, const int *targets_ptr, float *outputs_ptr, - float *nll_loss_ptr); - - void Backward(const float *grad_outputs_ptr, const T *inputs_ptr, - const int *targets_ptr, T *grad_inputs_ptr); - - void set_cur_batch_shape(int batch_size, int seq_len, int vocab_size); - - private: - void allocate_mem_buffer() { - // allocate local gpu memory - _loss_buffer = cuda_malloc(_max_batch_tokens * 2); - } - - void free_mem_buffer() { - // free local gpu memory - cuda_free(_loss_buffer); - } - - const int _padding_idx; - const float _epsilon; - const int _max_batch_tokens; - - size_t _batch_size; - size_t _seq_len; - size_t _vocab_size; - - float *_loss_buffer; -}; diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/cublas_wrappers.h b/colossalai/kernel/cuda_native/csrc/kernels/include/cublas_wrappers.h deleted file mode 100644 index 77b49c231f87b8bb7f03241c232363c84a9c8342..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/cublas_wrappers.h +++ /dev/null @@ -1,71 +0,0 @@ -/* Copyright 2021 The LightSeq Team - Copyright Microsoft DeepSpeed - This file is adapted from Microsoft DeepSpeed -*/ -#pragma once - -#include -#include -#include -#include -#include -#ifndef COLOSSAL_HIP -#include -#endif -#include - -#ifdef COLOSSAL_HIP -int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); - -int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); - -int cublas_strided_batched_gemm(cublasHandle_t handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - cublasOperation_t op_A, cublasOperation_t op_B, - int stride_A, int stride_B, int stride_C, - int batch, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); - -int cublas_strided_batched_gemm( - cublasHandle_t handle, int m, int n, int k, const float *alpha, - const float *beta, const __half *A, const __half *B, __half *C, - cublasOperation_t op_A, cublasOperation_t op_B, int stride_A, int stride_B, - int stride_C, int batch, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); -#else -int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT); - -int cublas_gemm_ex(cublasHandle_t handle, cublasOperation_t transa, - cublasOperation_t transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP); - -int cublas_strided_batched_gemm(cublasHandle_t handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - cublasOperation_t op_A, cublasOperation_t op_B, - int stride_A, int stride_B, int stride_C, - int batch, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT); - -int cublas_strided_batched_gemm( - cublasHandle_t handle, int m, int n, int k, const float *alpha, - const float *beta, const __half *A, const __half *B, __half *C, - cublasOperation_t op_A, 
cublasOperation_t op_B, int stride_A, int stride_B, - int stride_C, int batch, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP); -#endif diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h b/colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h deleted file mode 100644 index 46a460ea4add2478cadbb063833f8c35bf6b3a14..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h +++ /dev/null @@ -1,37 +0,0 @@ -#pragma once - -#include -#include - -#ifndef COLOSSAL_HIP -#include -#endif - -#include -#include -#include -#include -#include -#include - -template -void check_gpu_error(T result, char const *const func, const char *const file, - int const line); - -#define CHECK_GPU_ERROR(val) check_gpu_error((val), #val, __FILE__, __LINE__) - -template -void print_vec(const T *outv, std::string outn, int num_output_ele); - -template -T *cuda_malloc(size_t ele_num); - -void cuda_free(void *pdata); - -template -void check_nan_inf(const T *data_ptr, int dsize, bool check_nan_inf, - std::string file, int line, cudaStream_t stream); - -#define CHECK_NAN_INF(ptr, size, stream) \ - check_nan_inf((ptr), (size), true, __FILE__, __LINE__, (stream)); \ - check_nan_inf((ptr), (size), false, __FILE__, __LINE__, (stream)) diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h b/colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h deleted file mode 100644 index 336bbacc922715b7fe1f4e2bb4a6ca2b2f7f2ece..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h +++ /dev/null @@ -1,95 +0,0 @@ -#pragma once - -#include -#include -#include -#include - -#include "kernels.h" - -template -class Dropout { - public: - struct Config { - float ratio; - bool training; - - Config(float r) : ratio(r), training(true) {} - float RATIO() const { return training ? ratio : 0.0; } - }; - - Dropout(const Config &config, size_t max_ele_num) - : _config(config), _mask(nullptr) { - _mask = cuda_malloc(max_ele_num); - } - - virtual ~Dropout() { cuda_free(_mask); } - - // after attention softmax - void dropout(T *output, const T *input, int count, cudaStream_t stream, - bool bwd = false) { - launch_ls_dropout(output, input, _mask, count, _config.RATIO(), stream, - bwd); - } - - void d_dropout(T *d_inp_out, int count, cudaStream_t stream) { - launch_ls_dropout(d_inp_out, d_inp_out, _mask, count, _config.RATIO(), - stream, true); - } - - // transformer layer's postprocessing dropout, after attn or ffn module, - // before residual add. - void bias_dropout_residual(T *output, const T *input, const T *residual, - const T *bias, int rows, int cols, - cudaStream_t stream) { - launch_ls_dropout_res_bias(output, input, _mask, bias, residual, - rows * cols, cols, _config.RATIO(), stream); - } - - void d_bias_dropout_residual(T *d_input, T *d_bias, const T *d_output, - int rows, int cols, cudaStream_t stream) { - launch_ls_dropout_bias_bwd(d_input, d_bias, d_output, _mask, rows, cols, - _config.RATIO(), stream); - } - - // dropout inside ffn. 
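// All wrappers in this class share one uint8_t mask buffer sized for
// max_ele_num: the forward launches sample keep/drop decisions and record
// them in _mask, and the matching backward launches (bwd = true, *_bwd)
// replay the recorded mask instead of re-sampling, so forward and backward
// drop exactly the same elements. A minimal sketch of that record/replay
// idea (illustrative kernels, not the ones behind launch_ls_dropout;
// assumes <curand_kernel.h>):
//
//   __global__ void dropout_fw(float *y, const float *x, uint8_t *mask,
//                              int n, float ratio, uint64_t seed) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i >= n) return;
//     curandStatePhilox4_32_10_t st;
//     curand_init(seed, i, 0, &st);
//     mask[i] = curand_uniform(&st) > ratio;       // record the decision
//     y[i] = x[i] * mask[i] / (1.f - ratio);       // scale kept values
//   }
//
//   __global__ void dropout_bw(float *dx, const float *dy,
//                              const uint8_t *mask, int n, float ratio) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i < n) dx[i] = dy[i] * mask[i] / (1.f - ratio);  // replay
//   }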
-  void bias_act_dropout(T *output, const T *input, const T *bias, int rows,
-                        int cols, std::string activation_fn,
-                        cudaStream_t stream) {
-    if (activation_fn == "relu") {
-      launch_ls_dropout_act_bias<ActivationType::kRelu, T>(
-          output, input, _mask, bias, rows * cols, cols, _config.RATIO(),
-          stream);
-    } else if (activation_fn == "gelu") {
-      launch_ls_dropout_act_bias<ActivationType::kGelu, T>(
-          output, input, _mask, bias, rows * cols, cols, _config.RATIO(),
-          stream);
-    } else {
-      throw std::runtime_error("unsupported activation: " + activation_fn);
-    }
-  }
-
-  void d_bias_act_dropout(T *d_inp_out, T *d_bias_out, const T *input,
-                          const T *bias, int rows, int cols,
-                          std::string activation_fn, cudaStream_t stream) {
-    if (activation_fn == "relu") {
-      launch_ls_dropout_act_bias_bwd<ActivationType::kRelu, T>(
-          d_inp_out, d_bias_out, input, bias, d_inp_out, _mask, rows, cols,
-          _config.RATIO(), stream);
-    } else if (activation_fn == "gelu") {
-      launch_ls_dropout_act_bias_bwd<ActivationType::kGelu, T>(
-          d_inp_out, d_bias_out, input, bias, d_inp_out, _mask, rows, cols,
-          _config.RATIO(), stream);
-    } else {
-      throw std::runtime_error("unsupported activation: " + activation_fn);
-    }
-  }
-
-  bool HasDropout() const { return _config.RATIO() > 0.0; }
-
-  void SetTrainingMode(bool training) { _config.training = training; }
-
- private:
-  uint8_t *_mask;
-  Config _config;
-};
diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h b/colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h
deleted file mode 100644
index 64c53bc41bf3da4e84634d07d132654a1a285921..0000000000000000000000000000000000000000
--- a/colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h
+++ /dev/null
@@ -1,84 +0,0 @@
-#pragma once
-
-/* Copyright 2021 The LightSeq Team
-   Copyright Microsoft DeepSpeed
-   This file is adapted from Microsoft DeepSpeed
-*/
-#include <cuda.h>
-#include <cuda_fp16.h>
-#include <stdio.h>
-
-#include <array>
-
-#include "cublas_wrappers.h"
-#include "kernels.h"
-
-template <typename T>
-class FeedForward {
- public:
-  struct Config {
-    int outputSize;
-    int inputSize;
-    std::array<int, 3> gemm_algos;
-    Config(int outputs, int inputs)
-        : outputSize(outputs),
-          inputSize(inputs),
-          gemm_algos(std::array<int, 3>({99, 99, 99})) {}
-  };
-
-  FeedForward(Config config) : config_(config) {}
-
-  ~FeedForward() {}
-
-  void Forward(int bsz, const T *input_ptr, const T *weights, T *out,
-               cublasHandle_t &_cublasHandle) {
-    float alpha = T(1.);
-    float beta = T(0.);
-
-#ifdef COLOSSAL_HIP
-    cublas_gemm_ex(_cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N, config_.outputSize,
-                   bsz, config_.inputSize, &alpha, &beta, weights, input_ptr,
-                   out, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
-    cublas_gemm_ex(_cublasHandle, CUBLAS_OP_T, CUBLAS_OP_N, config_.outputSize,
-                   bsz, config_.inputSize, &alpha, &beta, weights, input_ptr,
-                   out, cublasGemmAlgo_t(config_.gemm_algos[0]));
-#endif
-  }
-  void Backward(int bsz, const T *out_grad, const T *input_ptr,
-                const T *weights, T *weights_grad, T *bias_grad,
-                cublasHandle_t &_cublasHandle, cudaStream_t &stream,
-                T *inp_grad_out = nullptr, T *out_grad_trans_out = nullptr,
-                bool compute_bias = true) {
-    float alpha = (T)1.0, beta = (T)0.0;
-#ifdef COLOSSAL_HIP
-    cublas_gemm_ex(_cublasHandle, CUBLAS_OP_N, CUBLAS_OP_T, config_.inputSize,
-                   config_.outputSize, bsz, &alpha, &beta, input_ptr, out_grad,
-                   weights_grad, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-
-    cublas_gemm_ex(_cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, config_.inputSize,
-                   bsz, config_.outputSize, &alpha, &beta, weights, out_grad,
-                   inp_grad_out, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
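    // Dimension bookkeeping for the two GEMMs below (a note on the cuBLAS
    // column-major convention: a row-major X[bsz x in] is what cuBLAS sees
    // as a column-major [in x bsz] matrix). With dY = out_grad[bsz x out]
    // and W = weights[out x in]:
    //   dW = dY^T * X : m = inputSize, n = outputSize, k = bsz,
    //                   A = input (op N), B = out_grad (op T);
    //   dX = dY * W   : m = inputSize, n = bsz, k = outputSize,
    //                   A = weights (op N), B = out_grad (op N).
    // The same trick is what lets Forward compute Y = X * W^T with
    // op(A) = T on the weights.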
cublas_gemm_ex(_cublasHandle, CUBLAS_OP_N, CUBLAS_OP_T, config_.inputSize, - config_.outputSize, bsz, &alpha, &beta, input_ptr, out_grad, - weights_grad, cublasGemmAlgo_t(config_.gemm_algos[1])); - - cublas_gemm_ex(_cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, config_.inputSize, - bsz, config_.outputSize, &alpha, &beta, weights, out_grad, - inp_grad_out, cublasGemmAlgo_t(config_.gemm_algos[2])); -#endif - if (compute_bias) { - launch_fuse_transpose_bias_kernel(out_grad, bias_grad, bsz, - config_.outputSize, stream); - } - } - - void reset_size(int outputSize, int inputSize) { - config_.outputSize = outputSize; - config_.inputSize = inputSize; - } - - private: - Config config_; -}; diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h b/colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h deleted file mode 100644 index 4467f4914d6b9bef8856bed7e5e5658b8e18736d..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h +++ /dev/null @@ -1,282 +0,0 @@ -#pragma once - -#include -#include -#ifdef COLOSSAL_HIP -#include -#else -#include -#endif -#include -#include -#include - -#define MAX_THREADS 1024 -// HC -#ifdef COLOSSAL_HIP - #define WARP_SIZE 64 -#else - #define WARP_SIZE 32 -#endif -enum class ActivationType { kRelu, kGelu }; - -void launch_curand_init(int total_count, int dim, cudaStream_t stream); - -template -void launch_layer_norm(T *ln_res, T *vars, T *means, const T *inp, - const T *scale, const T *bias, int batch_size, - int hidden_dim, cudaStream_t stream); - -template -void launch_ln_bw(T *gamma_grad, T *betta_grad, T *inp_grad, const T *out_grad, - const T *residual_grad, const T *inp_or_out, const T *gamma, - const T *betta, const T *vars, const T *means, int batch, - int hidden_dim, cudaStream_t stream[2]); - -template -void launch_attn_softmax(T *vals, const T *attn_mask, int batch_size, int heads, - int from_len, int to_len, bool mask_future, - cudaStream_t stream); - -template -void launch_attn_softmax_bw(T *out_grad, const T *soft_inp, int rows, - int softmax_len, cudaStream_t stream); - -// [b, s, h] -> [b, nh, s, ad] -template -void launch_transform_0213(T *output, const T *vals, int batch_size, - int seq_length, int hidden_dim, int nhead, - cudaStream_t stream); - -// [b, s, 3, h] -> [3, b, nh, s, ad] -template -void launch_bias_add_transform_20314(T *output, const T *input, const T *bias, - int dim_0, int dim_1, int dim_2, int dim_3, - int dim_4, cudaStream_t stream); - -// [tc, b, nh, s, ad] -> [b, s, tc, nh, ad] -template -void launch_transform4d_0213(T *output, const T *vals, int batch_size, - int seq_len, int hidden_dim, int nhead, - int trans_count, cudaStream_t stream); - -template -void launch_ls_dropout(T *out, const T *vals, uint8_t *mask, int total_count, - float ratio, cudaStream_t stream, bool backward = false); - -template -void launch_ls_dropout_res_bias(T *out, const T *vals, uint8_t *mask, - const T *bias, const T *residual, - int total_count, int dim, float ratio, - cudaStream_t stream); - -template -void launch_ls_dropout_act_bias(T *out, const T *vals, uint8_t *mask, - const T *bias, int total_count, int dim, - float ratio, cudaStream_t stream); - -template -void launch_ls_dropout_bias_bwd(T *in_grad, T *bias_grad, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, cudaStream_t stream); - -template -void launch_ls_dropout_act_bias_bwd(T *in_grad, T *bias_grad, const T *input, - const T *bias, const T *out_grad, - const uint8_t *mask, int row_size, int 
dim, - float ratio, cudaStream_t stream); - -template -void launch_fuse_transpose_bias_kernel(const T *inp, T *out, int rows, int cols, - cudaStream_t stream); - -void launch_param_update(const float *input, __half *output, int size, - cudaStream_t stream); - -template -void launch_concat3_dim1(const T *inp1, const T *inp2, T *output, int sz0, - int sz2, int sz1_1, int sz1_2, cudaStream_t stream); - -template -void launch_fused_add2(T *out, const T *inp1, const T *inp2, int batch_size, - int seq_len, int hidden_size, cudaStream_t &stream); - -template -void launch_cross_entropy_fw(const T *inputs_ptr, const int *targets_ptr, - float *outputs_ptr, float *nll_loss_ptr, - float *loss_buffer, const int padding_idx, - const float epsilon, const int batch_size, - const int seq_len, const int vocab_size, - cudaStream_t stream); - -template -void launch_cross_entropy_bw(const float *grad_outputs_ptr, const T *inputs_ptr, - const int *targets_ptr, T *grad_inputs_ptr, - const int padding_idx, const float epsilon, - const int batch_size, const int seq_len, - const int vocab_size, cudaStream_t stream); - -template -void launch_lookup_scale_pos_dropout( - T *output, const int *input, const T *embeddings, const T *pos_embeddings, - uint8_t *dropout_mask, int batch_size, int seq_len, int embedding_dim, - int padding_idx, float dropout_ratio, int step, cudaStream_t &stream); - -template -void launch_d_lookup_scale_pos_dropout( - T *grad_embeddings, const T *grad_output, const int *input, - const uint8_t *dropout_mask, int batch_size, int seq_len, int embedding_dim, - int vocab_size, int padding_idx, float dropout_ratio, cudaStream_t &stream); - -/* Convert 2-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_2dim(int id1, int id2, int dim2) { - return id1 * dim2 + id2; -} - -/* Convert 3-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_3dim(int id1, int id2, int id3, - int dim2, int dim3) { - return id1 * dim2 * dim3 + id2 * dim3 + id3; -} - -/* Convert 4-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_4dim(int id1, int id2, int id3, - int id4, int dim2, int dim3, - int dim4) { - // return id1*(dim2*dim3*dim4) + id2*(dim3*dim4) + id3*dim4 + id4; - int res = id4; - - int ld = dim4; - res += id3 * ld; - - ld *= dim3; - res += id2 * ld; - - ld *= dim2; - res += id1 * ld; - - return res; -} - -/* Convert 5-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_5dim(int id1, int id2, int id3, - int id4, int id5, int dim2, - int dim3, int dim4, - int dim5) { - // return id1*(dim2*dim3*dim4*dim5) + id2*(dim3*dim4*dim5) + id3*(dim4*dim5) + - // id4*dim5 + dim5; - int res = id5; - - int ld = dim5; - res += id4 * ld; - - ld *= dim4; - res += id3 * ld; - - ld *= dim3; - res += id2 * ld; - - ld *= dim2; - res += id1 * ld; - - return res; -} - -/* Convert 6-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_6dim(int id1, int id2, int id3, - int id4, int id5, int id6, - int dim2, int dim3, int dim4, - int dim5, int dim6) { - // return id1*(dim2*dim3*dim4*dim5*dim6) + id2*(dim3*dim4*dim5*dim6) + - // id3*(dim4*dim5*dim6) + id4*(dim5*dim6) + id5*dim6 + id6; - int res = id6; - - int ld = dim6; - res += id5 * ld; - - ld *= dim5; - res += id4 * ld; - - ld *= dim4; - res += id3 * ld; - - ld *= dim3; - res += id2 * ld; - - ld *= dim2; - res += id1 * ld; - - return res; -} - -/* Convert vector index to 6-dim tensor index */ -__forceinline__ __host__ __device__ 
void decompose_6dim( - int src, int dim1, int dim2, int dim3, int dim4, int dim5, int *id0, - int *id1, int *id2, int *id3, int *id4, int *id5) { - *id5 = src % dim5; - src /= dim5; - - *id4 = src % dim4; - src /= dim4; - - *id3 = src % dim3; - src /= dim3; - - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 5-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_5dim(int src, int dim1, - int dim2, int dim3, - int dim4, int *id0, - int *id1, int *id2, - int *id3, int *id4) { - *id4 = src % dim4; - src /= dim4; - - *id3 = src % dim3; - src /= dim3; - - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 4-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_4dim(int src, int dim1, - int dim2, int dim3, - int *id0, int *id1, - int *id2, int *id3) { - *id3 = src % dim3; - src /= dim3; - - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 3-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_3dim(int src, int dim1, - int dim2, int *id0, - int *id1, int *id2) { - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 2-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_2dim(int src, int dim1, - int *id0, int *id1) { - *id1 = src % dim1; - *id0 = src / dim1; -} diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/ls_cub.cuh b/colossalai/kernel/cuda_native/csrc/kernels/include/ls_cub.cuh deleted file mode 100644 index 4f65e7b54ba19e9520e19d969bebbe4e5d43c266..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/ls_cub.cuh +++ /dev/null @@ -1,12 +0,0 @@ -// copied from https://github.com/dmlc/dgl/pull/2758 -#ifndef DGL_ARRAY_CUDA_DGL_CUB_CUH_ -#define DGL_ARRAY_CUDA_DGL_CUB_CUH_ - -#define CUB_NS_PREFIX namespace ls { -#define CUB_NS_POSTFIX } -#include "cub/cub.cuh" -#include "cub/util_allocator.cuh" -#undef CUB_NS_POSTFIX -#undef CUB_NS_PREFIX - -#endif diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/normalize_layer.h b/colossalai/kernel/cuda_native/csrc/kernels/include/normalize_layer.h deleted file mode 100644 index 22e16fe90f9714f93c8b27b175b0f13ec98d530d..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/normalize_layer.h +++ /dev/null @@ -1,65 +0,0 @@ -#pragma once - -#include -#include -#include - -#include - -#include "kernels.h" - -using namespace std; - -template -class Normalize_Layer { - public: - struct Config { - uint32_t hidden_dim; - bool use_mean; - Config(uint32_t hidden_dim, bool use_mean = false) - : hidden_dim(hidden_dim), use_mean(use_mean) {} - }; - - Normalize_Layer(Config config, size_t max_rows) - : config_(config), vars_(nullptr), means_(nullptr) { - vars_ = cuda_malloc(max_rows); - if (config_.use_mean) { - means_ = cuda_malloc(max_rows); - } - } - - ~Normalize_Layer() { - cuda_free(vars_); - cuda_free(means_); - } - - void Forward(T *ln_res, const T *inp, const T *gamma, const T *betta, - int batch_size, cudaStream_t stream) { - launch_layer_norm(ln_res, vars_, means_, inp, gamma, betta, batch_size, - config_.hidden_dim, stream); - } - - /* - residual_grad, inp_or_out, betta should be treated carefully. - inp_or_out = input if use_mean else output - residual_grad, betta can be nullptr. 
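  Why those rules hold: with use_mean == false the layer runs post-LN style
  and only the *output* is kept, so backward must reconstruct
  xhat = (out - betta) / gamma and therefore needs betta; with
  use_mean == true the *input* is kept along with (vars, means) and betta is
  not needed. A usage sketch (illustrative names, use_mean == true):

    Normalize_Layer<float> ln({hidden_dim, true}, max_batch_tokens);
    ln.Forward(y, x, gamma, betta, batch_tokens, stream);
    ln.Backward(dgamma, dbetta, dx, dy, dresidual, x, gamma, nullptr,
                batch_tokens, streams);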
- residual_grad will be added to dinp if it is not nullptr - which is useful in transformer layer when pre-ln - betta are only used to compute xhat, - (use_mean == false) ^ (betta == nullptr) should be true - */ - void Backward(T *gamma_grad, T *betta_grad, T *inp_grad, const T *out_grad, - const T *residual_grad, const T *inp_or_out, const T *gamma, - const T *betta, int batch_size, cudaStream_t stream[2]) { - launch_ln_bw(gamma_grad, betta_grad, inp_grad, out_grad, residual_grad, - inp_or_out, gamma, betta, vars_, means_, batch_size, - config_.hidden_dim, stream); - } - - inline bool use_mean() const { return config_.use_mean; } - - private: - Config config_; - T *vars_; - T *means_; -}; diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h b/colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h deleted file mode 100644 index 978c72fed288b638bba531b019d04477cf873c26..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h +++ /dev/null @@ -1,44 +0,0 @@ -#pragma once - -#include -#include -#include - -#include - -#include "kernels.h" - -using namespace std; - -template -class Softmax { - public: - struct Config { - size_t nhead; - Config(size_t nhead) : nhead(nhead) {} - }; - - Softmax(Config config) : config_(config) {} - - ~Softmax() {} - - void Forward(T *vals, const T *attn_mask, int batch_size, int from_len, - int to_len, cudaStream_t &stream, bool mask_future = true) { - launch_attn_softmax(vals, attn_mask, batch_size, config_.nhead, from_len, - to_len, mask_future, stream); - } - - void Backward(T *out_grad, const T *soft_out, int batch_size, int from_len, - int to_len, cudaStream_t stream) { - launch_attn_softmax_bw(out_grad, soft_out, - batch_size * config_.nhead * from_len, to_len, - stream); - } - - void reset_size(size_t nhead) { - config_.nhead = nhead; - } - - private: - Config config_; -}; diff --git a/colossalai/kernel/cuda_native/csrc/kernels/include/strided_batch_gemm.h b/colossalai/kernel/cuda_native/csrc/kernels/include/strided_batch_gemm.h deleted file mode 100644 index e30c9d0a51820f7d115bb87102f20aa69d037b0d..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/include/strided_batch_gemm.h +++ /dev/null @@ -1,121 +0,0 @@ -/* Copyright 2021 The LightSeq Team - Copyright Microsoft DeepSpeed - This file is adapted from Microsoft DeepSpeed -*/ -#pragma once - -#include -#include -#include - -#include - -#include "cublas_wrappers.h" - -template -class StridedBatchGemm { - public: - struct Config { - int m; - int n; - int k; - float alpha; - float beta; - cublasOperation_t op_A; - cublasOperation_t op_B; - std::array gemm_algos; - - Config(float param_alpha, float param_beta, cublasOperation_t opA, - cublasOperation_t opB) - : alpha(param_alpha), - beta(param_beta), - op_A(opA), - op_B(opB), - gemm_algos(std::array({99, 99, 99})) {} - void SetConfig(int mm, int nn, int kk) { - m = mm; - n = nn; - k = kk; - } - }; - - StridedBatchGemm(const Config &config) : _config(config) {} - - virtual ~StridedBatchGemm() {} - - void Forward(int bsz, T *output, const T *_buffer_a, const T *_buffer_b, - cublasHandle_t handle) { - int stride_a = _config.m * _config.k; - int stride_b = _config.n * _config.k; - int stride_c = _config.m * _config.n; - -#ifdef COLOSSAL_HIP - cublas_strided_batched_gemm( - handle, _config.m, _config.n, _config.k, &_config.alpha, &_config.beta, - _buffer_a, _buffer_b, output, _config.op_A, _config.op_B, stride_a, - stride_b, stride_c, bsz, 
rocblas_gemm_algo(rocblas_gemm_algo_standard)); -#else - cublas_strided_batched_gemm( - handle, _config.m, _config.n, _config.k, &_config.alpha, &_config.beta, - _buffer_a, _buffer_b, output, _config.op_A, _config.op_B, stride_a, - stride_b, stride_c, bsz, cublasGemmAlgo_t(_config.gemm_algos[0])); -#endif - } - - void Backward(int bsz, const T *d_output, const T *_buffer_a, - const T *_buffer_b, cublasHandle_t handle, - T *inpGradA = nullptr, T *inpGradB = nullptr) { - int mb = (_config.op_A == CUBLAS_OP_T ? _config.k : _config.m); - int kb = (_config.op_A == CUBLAS_OP_T ? _config.m : _config.k); - - int stride_a = mb * _config.n; - int stride_b = _config.n * kb; - int stride_c = _config.m * _config.k; - - // B need to transpose. - cublasOperation_t op_b = - (_config.op_B == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T); - - // Calculate d_A. -#ifdef COLOSSAL_HIP - cublas_strided_batched_gemm( - handle, mb, kb, _config.n, &_config.alpha, &_config.beta, - (_config.op_A == CUBLAS_OP_T ? _buffer_b : d_output), - (_config.op_A == CUBLAS_OP_T ? d_output : _buffer_b), inpGradA, - CUBLAS_OP_N, op_b, stride_a, stride_b, stride_c, bsz, - rocblas_gemm_algo(rocblas_gemm_algo_standard)); -#else - cublas_strided_batched_gemm( - handle, mb, kb, _config.n, &_config.alpha, &_config.beta, - (_config.op_A == CUBLAS_OP_T ? _buffer_b : d_output), - (_config.op_A == CUBLAS_OP_T ? d_output : _buffer_b), inpGradA, - CUBLAS_OP_N, op_b, stride_a, stride_b, stride_c, bsz, - cublasGemmAlgo_t(_config.gemm_algos[1])); -#endif - // A need to transpose. - cublasOperation_t op_a = - (_config.op_A == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T); - - stride_a = _config.m * _config.k; - stride_b = _config.m * _config.n; - stride_c = _config.n * _config.k; - - // Calculate d_B. -#ifdef COLOSSAL_HIP - cublas_strided_batched_gemm( - handle, _config.k, _config.n, _config.m, &_config.alpha, &_config.beta, - _buffer_a, d_output, inpGradB, op_a, CUBLAS_OP_N, stride_a, stride_b, - stride_c, bsz, rocblas_gemm_algo(rocblas_gemm_algo_standard)); -#else - cublas_strided_batched_gemm( - handle, _config.k, _config.n, _config.m, &_config.alpha, &_config.beta, - _buffer_a, d_output, inpGradB, op_a, CUBLAS_OP_N, stride_a, stride_b, - stride_c, bsz, cublasGemmAlgo_t(_config.gemm_algos[2])); -#endif - } - - inline void SetConfig(int m, int n, int k) { _config.SetConfig(m, n, k); } - - private: - Config _config; -}; diff --git a/colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu b/colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu deleted file mode 100644 index d594774238013e146cd591077ef1a8d7a3ac98c6..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu +++ /dev/null @@ -1,1286 +0,0 @@ -#include "block_reduce.h" -#include "kernels.h" - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -const float LN_EPSILON = 1e-8f; -#define TILE_DIM 32 - -template -__forceinline__ __device__ T add_eps(T x) { - return fabsf(x) > LN_EPSILON ? x : (x < 0 ? -LN_EPSILON : LN_EPSILON); -} - -/** -@brief: ker_layer_norm -Standard layer normalization. -It will not only output the layer norm result, - but also outputs variance. - may also output means, depends on whether - the means argument is nullptr - -@thread -gridDim.x = batch_size * seq_len -blockDim.x = hidden_size - -@param -ln_res: [batch_size* seq_len, hidden_size], ln result. 
-vars: [batch_size* seq_len], variance per token -means: [batch_size* seq_len], means per token, can be nullput -inp: [batch_size * seq_len, hidden_size], ln input. -scale: [hidden_size], ln scale -bias: [hidden_size], ln bias -*/ -template -__global__ void ker_layer_norm(T *ln_res, T *vars, T *means, const T *inp, - const T *scale, const T *bias, int hidden_size) { - // step 0. compute local sum - float l_sum = 0; - float l_square_sum = 0; - const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float4 val = inp_f4[idx]; - l_sum += val.x + val.y + val.z + val.w; - l_square_sum += - val.x * val.x + val.y * val.y + val.z * val.z + val.w * val.w; - } - - // step 1. compute reduce sum - float mean_dim = float(hidden_size) * 4.f; - float reduce_val[2] = {l_sum, l_square_sum}; - blockReduce(reduce_val); - __shared__ float s_mean, s_var; - if (threadIdx.x == 0) { - s_mean = reduce_val[0] / mean_dim; - if (means != nullptr) { - means[blockIdx.x] = s_mean; - } - s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; - vars[blockIdx.x] = s_var; - s_var = rsqrtf(s_var); - } - __syncthreads(); - - // step 2. layer norm result - float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float4 vscale = __ldg((const float4 *)scale + idx); - float4 vbias = __ldg((const float4 *)bias + idx); - float4 val = inp_f4[idx]; - val.x = (val.x - s_mean) * s_var * vscale.x + vbias.x; - val.y = (val.y - s_mean) * s_var * vscale.y + vbias.y; - val.z = (val.z - s_mean) * s_var * vscale.z + vbias.z; - val.w = (val.w - s_mean) * s_var * vscale.w + vbias.w; - output_f4[idx] = val; - } -} - -template <> -__global__ void ker_layer_norm<__half>(__half *ln_res, __half *vars, - __half *means, const __half *inp, - const __half *scale, const __half *bias, - int hidden_size) { - // step 0. compute local sum - float l_sum = 0; - float l_square_sum = 0; - const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float4 val_f4 = inp_f4[idx]; - __half2 *val_h2 = (__half2 *)(&val_f4); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 val_f2 = __half22float2(val_h2[i]); - l_sum += val_f2.x + val_f2.y; - l_square_sum += val_f2.x * val_f2.x + val_f2.y * val_f2.y; - } - } - - // step 1. compute reduce sum - float mean_dim = float(hidden_size) * 8.f; - float reduce_val[2] = {l_sum, l_square_sum}; - blockReduce(reduce_val); - __shared__ float s_mean, s_var; - if (threadIdx.x == 0) { - s_mean = reduce_val[0] / mean_dim; - if (means != nullptr) { - means[blockIdx.x] = s_mean; - } - s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; - vars[blockIdx.x] = s_var; - s_var = rsqrtf(s_var); - } - __syncthreads(); - - // step 2. 
layer norm result - float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - // load scale, bias, input - float4 scale_f4 = __ldg((const float4 *)scale + idx); - __half2 *scale_h2 = (__half2 *)(&scale_f4); - float4 bias_f4 = __ldg((const float4 *)bias + idx); - __half2 *bias_h2 = (__half2 *)(&bias_f4); - float4 val_f4 = inp_f4[idx]; - __half2 *val_h2 = (__half2 *)(&val_f4); - -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 scale_f2 = __half22float2(scale_h2[i]); - float2 bias_f2 = __half22float2(bias_h2[i]); - float2 val_f2 = __half22float2(val_h2[i]); - val_f2.x = (val_f2.x - s_mean) * s_var * scale_f2.x + bias_f2.x; - val_f2.y = (val_f2.y - s_mean) * s_var * scale_f2.y + bias_f2.y; - val_h2[i] = __float22half2_rn(val_f2); - } - output_f4[idx] = val_f4; - } -} - -// __global__ void ker_layer_norm_x2(__half *ln_res, __half *vars, -// __half *means, const __half *inp, -// const __half *scale, const __half *bias, -// int hidden_size) { -// // step 0. compute local sum -// float l_sum = 0; -// float l_square_sum = 0; -// const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * 2 * hidden_size; -// for (uint idx = 2 * threadIdx.x; idx < hidden_size * 2; idx += blockDim.x * 2) { -// float4 val_f4 = inp_f4[idx]; -// float4 val_f4_1 = inp_f4[idx+1]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// l_sum += val_f2.x + val_f2.y + val_f2_1.x + val_f2_1.y; -// l_square_sum += val_f2.x * val_f2.x + val_f2.y * val_f2.y + val_f2_1.x * val_f2_1.x + val_f2_1.y * val_f2_1.y; -// } -// } - -// // step 1. compute reduce sum -// float mean_dim = float(hidden_size) * 8.f * 2; -// float reduce_val[2] = {l_sum, l_square_sum}; -// blockReduce(reduce_val); -// __shared__ float s_mean, s_var; -// if (threadIdx.x == 0) { -// s_mean = reduce_val[0] / mean_dim; -// if (means != nullptr) { -// means[blockIdx.x] = s_mean; -// } -// s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; -// vars[blockIdx.x] = s_var; -// s_var = rsqrtf(s_var); -// } -// __syncthreads(); - -// // step 2. 
layer norm result -// float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size * 2; -// for (uint idx = 2 * threadIdx.x; idx < hidden_size * 2; idx += blockDim.x * 2) { -// // load scale, bias, input -// float4 scale_f4 = __ldg((const float4 *)scale + idx); -// __half2 *scale_h2 = (__half2 *)(&scale_f4); -// float4 scale_f4_1 = __ldg((const float4 *)scale + idx + 1); -// __half2 *scale_h2_1 = (__half2 *)(&scale_f4_1); -// float4 bias_f4 = __ldg((const float4 *)bias + idx); -// __half2 *bias_h2 = (__half2 *)(&bias_f4); -// float4 bias_f4_1 = __ldg((const float4 *)bias + idx + 1); -// __half2 *bias_h2_1 = (__half2 *)(&bias_f4_1); -// float4 val_f4 = inp_f4[idx]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// float4 val_f4_1 = inp_f4[idx+1]; -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); - -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 scale_f2 = __half22float2(scale_h2[i]); -// float2 scale_f2_1 = __half22float2(scale_h2_1[i]); -// float2 bias_f2 = __half22float2(bias_h2[i]); -// float2 bias_f2_1 = __half22float2(bias_h2_1[i]); -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// val_f2.x = (val_f2.x - s_mean) * s_var * scale_f2.x + bias_f2.x; -// val_f2.y = (val_f2.y - s_mean) * s_var * scale_f2.y + bias_f2.y; -// val_h2[i] = __float22half2_rn(val_f2); -// val_f2_1.x = (val_f2_1.x - s_mean) * s_var * scale_f2_1.x + bias_f2_1.x; -// val_f2_1.y = (val_f2_1.y - s_mean) * s_var * scale_f2_1.y + bias_f2_1.y; -// val_h2_1[i] = __float22half2_rn(val_f2_1); -// } -// output_f4[idx] = val_f4; -// output_f4[idx+1] = val_f4_1; -// } -// } - -// __global__ void ker_layer_norm_x4(__half *ln_res, __half *vars, -// __half *means, const __half *inp, -// const __half *scale, const __half *bias, -// int hidden_size) { -// // step 0. compute local sum -// float l_sum = 0; -// float l_square_sum = 0; -// const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * hidden_size * 4; -// for (uint idx = 4 * threadIdx.x; idx < hidden_size * 4; idx += blockDim.x * 4) { -// float4 val_f4 = inp_f4[idx]; -// float4 val_f4_1 = inp_f4[idx+1]; -// float4 val_f4_2 = inp_f4[idx+2]; -// float4 val_f4_3 = inp_f4[idx+3]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); -// __half2 *val_h2_2 = (__half2 *)(&val_f4_2); -// __half2 *val_h2_3 = (__half2 *)(&val_f4_3); -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// float2 val_f2_2 = __half22float2(val_h2_2[i]); -// float2 val_f2_3 = __half22float2(val_h2_3[i]); -// l_sum += val_f2.x + val_f2.y + val_f2_1.x + val_f2_1.y + val_f2_2.x + val_f2_2.y + val_f2_3.x + val_f2_3.y; -// l_square_sum += val_f2.x * val_f2.x + val_f2.y * val_f2.y; -// l_square_sum += val_f2_1.x * val_f2_1.x + val_f2_1.y * val_f2_1.y; -// l_square_sum += val_f2_2.x * val_f2_2.x + val_f2_2.y * val_f2_2.y; -// l_square_sum += val_f2_3.x * val_f2_3.x + val_f2_3.y * val_f2_3.y; -// } -// } - -// // step 1. compute reduce sum -// float mean_dim = float(hidden_size) * 8.f * 4; -// float reduce_val[2] = {l_sum, l_square_sum}; -// blockReduce(reduce_val); -// __shared__ float s_mean, s_var; -// if (threadIdx.x == 0) { -// s_mean = reduce_val[0] / mean_dim; -// if (means != nullptr) { -// means[blockIdx.x] = s_mean; -// } -// s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; -// vars[blockIdx.x] = s_var; -// s_var = rsqrtf(s_var); -// } -// __syncthreads(); - -// // step 2. 
layer norm result -// float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size * 4; -// for (uint idx = 4 * threadIdx.x; idx < hidden_size * 4; idx += blockDim.x * 4) { -// // load scale, bias, input -// float4 scale_f4 = __ldg((const float4 *)scale + idx); -// __half2 *scale_h2 = (__half2 *)(&scale_f4); -// float4 scale_f4_1 = __ldg((const float4 *)scale + idx + 1); -// __half2 *scale_h2_1 = (__half2 *)(&scale_f4_1); -// float4 scale_f4_2 = __ldg((const float4 *)scale + idx + 2); -// __half2 *scale_h2_2 = (__half2 *)(&scale_f4_2); -// float4 scale_f4_3 = __ldg((const float4 *)scale + idx + 3); -// __half2 *scale_h2_3 = (__half2 *)(&scale_f4_3); -// float4 bias_f4 = __ldg((const float4 *)bias + idx); -// __half2 *bias_h2 = (__half2 *)(&bias_f4); -// float4 bias_f4_1 = __ldg((const float4 *)bias + idx + 1); -// __half2 *bias_h2_1 = (__half2 *)(&bias_f4_1); -// float4 bias_f4_2 = __ldg((const float4 *)bias + idx + 2); -// __half2 *bias_h2_2 = (__half2 *)(&bias_f4_2); -// float4 bias_f4_3 = __ldg((const float4 *)bias + idx + 3); -// __half2 *bias_h2_3 = (__half2 *)(&bias_f4_3); -// float4 val_f4 = inp_f4[idx]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// float4 val_f4_1 = inp_f4[idx+1]; -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); -// float4 val_f4_2 = inp_f4[idx+2]; -// __half2 *val_h2_2 = (__half2 *)(&val_f4_2); -// float4 val_f4_3 = inp_f4[idx+3]; -// __half2 *val_h2_3 = (__half2 *)(&val_f4_3); - -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 scale_f2 = __half22float2(scale_h2[i]); -// float2 scale_f2_1 = __half22float2(scale_h2_1[i]); -// float2 scale_f2_2 = __half22float2(scale_h2_2[i]); -// float2 scale_f2_3 = __half22float2(scale_h2_3[i]); -// float2 bias_f2 = __half22float2(bias_h2[i]); -// float2 bias_f2_1 = __half22float2(bias_h2_1[i]); -// float2 bias_f2_2 = __half22float2(bias_h2_2[i]); -// float2 bias_f2_3 = __half22float2(bias_h2_3[i]); -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// float2 val_f2_2 = __half22float2(val_h2_2[i]); -// float2 val_f2_3 = __half22float2(val_h2_3[i]); -// val_f2.x = (val_f2.x - s_mean) * s_var * scale_f2.x + bias_f2.x; -// val_f2.y = (val_f2.y - s_mean) * s_var * scale_f2.y + bias_f2.y; -// val_f2_1.x = (val_f2_1.x - s_mean) * s_var * scale_f2_1.x + bias_f2_1.x; -// val_f2_1.y = (val_f2_1.y - s_mean) * s_var * scale_f2_1.y + bias_f2_1.y; -// val_f2_2.x = (val_f2_2.x - s_mean) * s_var * scale_f2_2.x + bias_f2_2.x; -// val_f2_2.y = (val_f2_2.y - s_mean) * s_var * scale_f2_2.y + bias_f2_2.y; -// val_f2_3.x = (val_f2_3.x - s_mean) * s_var * scale_f2_3.x + bias_f2_3.x; -// val_f2_3.y = (val_f2_3.y - s_mean) * s_var * scale_f2_3.y + bias_f2_3.y; -// val_h2[i] = __float22half2_rn(val_f2); -// val_h2_1[i] = __float22half2_rn(val_f2_1); -// val_h2_2[i] = __float22half2_rn(val_f2_2); -// val_h2_3[i] = __float22half2_rn(val_f2_3); -// } -// output_f4[idx] = val_f4; -// output_f4[idx+1] = val_f4_1; -// output_f4[idx+2] = val_f4_2; -// output_f4[idx+3] = val_f4_3; -// } -// } - -template <> -void launch_layer_norm(float *ln_res, float *vars, float *means, - const float *inp, const float *scale, - const float *bias, int batch_size, int hidden_dim, - cudaStream_t stream) { - if (hidden_dim % 4 != 0) { - throw std::runtime_error("violate hidden_dim % 4 = 0"); - } - hidden_dim >>= 2; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - dim3 grid_dim(batch_size); - dim3 block_dim(nthread); - - ker_layer_norm<<>>( - ln_res, vars, means, inp, scale, bias, hidden_dim); -} - 
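// Both launchers divide hidden_dim by the vector width up front (4 floats or
// 8 halves per float4 load), so the kernels index in float4 units with one
// block per token; that is also why hidden_dim must be a multiple of 4 (fp32)
// or 8 (fp16). The commented-out x2/x4 kernels above were an attempt to cover
// hidden sizes beyond one float4 per thread by unrolling 2 or 4 vectors, with
// the size-based dispatch kept (commented) in the launcher below. A plain
// host reference the kernel's output can be checked against (a sketch, not
// part of this codebase):
//
//   void layer_norm_ref(float *out, float *var, float *mean, const float *inp,
//                       const float *scale, const float *bias, int tokens,
//                       int hidden) {
//     for (int t = 0; t < tokens; ++t) {
//       const float *x = inp + t * hidden;
//       double s = 0.0, sq = 0.0;
//       for (int i = 0; i < hidden; ++i) { s += x[i]; sq += x[i] * x[i]; }
//       float m = (float)(s / hidden);
//       float v = (float)(sq / hidden) - m * m + 1e-8f;  // LN_EPSILON, as stored in vars
//       mean[t] = m; var[t] = v;
//       float r = 1.f / sqrtf(v);
//       for (int i = 0; i < hidden; ++i)
//         out[t * hidden + i] = (x[i] - m) * r * scale[i] + bias[i];
//     }
//   }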
-template <> -void launch_layer_norm<__half>(__half *ln_res, __half *vars, __half *means, - const __half *inp, const __half *scale, - const __half *bias, int batch_size, - int hidden_dim, cudaStream_t stream) { - if (hidden_dim % 8 != 0) { - throw std::runtime_error("violate hidden_dim % 8 = 0"); - } - hidden_dim >>= 3; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - dim3 grid_dim(batch_size); - dim3 block_dim(nthread); - - ker_layer_norm<__half><<>>( - ln_res, vars, means, inp, scale, bias, hidden_dim); - // if (hidden_dim % 8 != 0) { - // throw std::runtime_error("violate hidden_dim % 8 = 0"); - // } - // hidden_dim >>= 3; - - // if (hidden_dim * 8 < 8192) { - // int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - // dim3 grid_dim(batch_size); - // dim3 block_dim(nthread); - // ker_layer_norm<__half><<>>( - // ln_res, vars, means, inp, scale, bias, hidden_dim); - // } else if (hidden_dim * 8 >= 8192 && hidden_dim * 8 <= 8192 * 2) { - // hidden_dim >>= 1; - // int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - // dim3 grid_dim(batch_size); - // dim3 block_dim(nthread); - // ker_layer_norm_x2<<>>( - // ln_res, vars, means, inp, scale, bias, hidden_dim); - // } else if (hidden_dim * 8 > 8192 * 2 && hidden_dim * 8 <= 8192 * 4) { - // hidden_dim >>= 2; - // int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - // dim3 grid_dim(batch_size); - // dim3 block_dim(nthread); - // ker_layer_norm_x4<<>>( - // ln_res, vars, means, inp, scale, bias, hidden_dim); - // } else { - // throw std::runtime_error("hidden_dim % 4 != 0 || hidden_dim > 32768"); - // } -} - -/** -@brief: ker_ln_bw_dgamma_dbetta -Layer norm backword kernel, compute the gradient of gamma and betta. -dbetta = sum(dout, dim=0) -dgamma = sum(xhat * dout, dim=0) -xhat = (input - mean) * rsqrt(var) or - (output - betta) / gamma - - -@thread -gridDim.x = hidden_size / 32 -blockDim.x = 32 -blockDim.y = 32 - -@param -gamma_grad: [hidden_size], gradient of gamma -betta_grad: [hidden_size], gradient of betta -out_grad: [batch_size * seq_len, hidden_size], gradient of betta ln output -inp_or_out: [batch_size * seq_len, hidden_size], ln output if means is nullptr - ln input if means is not nullptr -gamma: [hidden_size], gamma of ln, - used to compute xhat, maybe nullptr -betta: [hidden_size], betta of ln, - used to compute xhat, maybe nullptr -vars: [batch_size * seq_len], variance of ln forward, - used to compute xhat, maybe nullptr -means: [batch_size * seq_len], mean of ln forward, - used to compute xhat, maybe nullptr -(gamma && betta) ^ (vars && means) should be true -*/ -template -__global__ void ker_ln_bw_dgamma_dbetta(T *gamma_grad, T *betta_grad, - const T *out_grad, const T *inp_or_out, - const T *gamma, const T *betta, - const T *vars, const T *means, int rows, - int width) { - __shared__ float betta_buffer[TILE_DIM][TILE_DIM]; - __shared__ float gamma_buffer[TILE_DIM][TILE_DIM]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int idx = blockDim.x * blockIdx.x + threadIdx.x; - int offset = threadIdx.y * width + idx; - int y_stride = width * TILE_DIM; - - // Loop across inp height - float dbetta = 0; - float dgamma = 0; - float dout, val; - if (idx < width) { - if (means == nullptr) { - float vbetta = (float)betta[idx]; - float vgamma = (float)gamma[idx]; - for (int r = threadIdx.y; r < rows; r += TILE_DIM) { - dout = (float)out_grad[offset]; - // inp_or_out is output - val = 
(float)inp_or_out[offset]; - dbetta += dout; - dgamma += ((val - vbetta) / add_eps(vgamma) * dout); - offset += y_stride; - } - } else { - for (int r = threadIdx.y; r < rows; r += TILE_DIM) { - dout = (float)out_grad[offset]; - // inp_or_out is input - val = (float)inp_or_out[offset]; - dbetta += dout; - dgamma += ((val - (float)means[r]) * - rsqrtf((float)vars[r] + LN_EPSILON) * dout); - offset += y_stride; - } - } - } - - // Sum the shared buffer. - betta_buffer[threadIdx.x][threadIdx.y] = dbetta; - gamma_buffer[threadIdx.x][threadIdx.y] = dgamma; - __syncthreads(); - float s1 = betta_buffer[threadIdx.y][threadIdx.x]; - float s2 = gamma_buffer[threadIdx.y][threadIdx.x]; - __syncthreads(); - - for (int i = 1; i < TILE_DIM; i <<= 1) { -#ifdef COLOSSAL_HIP - s1 += __shfl_down(s1, i); - s2 += __shfl_down(s2, i); -#else - s1 += g.shfl_down(s1, i); - s2 += g.shfl_down(s2, i); -#endif - } - - int pos = blockIdx.x * TILE_DIM + threadIdx.y; - if (threadIdx.x == 0 && idx < width) { - betta_grad[pos] = s1; - gamma_grad[pos] = s2; - } -} - -/** -@brief: ker_ln_bw_dinp -Layer norm backword kernel, compute the gradient of input. -dinp = (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / hidden_dim) - * rsqrt(var) -xhat = (input - mean) * rsqrt(var) if mean is not nullptr - (output - betta) / gamma if mean is nullptr -dxhat = dout * gamma - - -@thread -gridDim.x = batch_size * seq_len -blockDim.x = hidden_size - -@param -inp_grad: [batch_size * seq_len, hidden_size], gradient of betta ln output -out_grad: [batch_size * seq_len, hidden_size], gradient of betta ln output -residual_grad: [batch_size * seq_len, hidden_size], gradient of residual input, - usually appear in pre-layer-norm for transformer layer, maybe nullptr -inp_or_out: [batch_size * seq_len, hidden_size], ln output if means is nullptr - ln input if means is not nullptr -gamma: [hidden_size], gamma of ln, - used to compute xhat and dxhat -betta: [hidden_size], betta of ln, - used to compute xhat, maybe nullptr -vars: [batch_size * seq_len], variance of ln forward, - used to compute xhat and dinp -means: [batch_size * seq_len], mean of ln forward, - used to compute xhat, maybe nullptr -*/ -template -__global__ void ker_ln_bw_dinp(T *inp_grad, const T *out_grad, - const T *residual_grad, const T *inp_or_out, - const T *gamma, const T *betta, const T *vars, - const T *means, int hidden_dim) { - int offset = blockIdx.x * hidden_dim + threadIdx.x; - float4 dxhat, xhat; - float var_rsqrt; - - if (threadIdx.x < hidden_dim) { - // step 0. dxhat = dout * gamma - dxhat = ((const float4 *)out_grad)[offset]; - float4 vgamma = ((const float4 *)gamma)[threadIdx.x]; - dxhat.x *= vgamma.x; - dxhat.y *= vgamma.y; - dxhat.z *= vgamma.z; - dxhat.w *= vgamma.w; - - /* - step 1. 
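  Where the step-3 formula below comes from: with
  xhat_i = (x_i - mu) * rsqrt(var), dxhat_i = dy_i * gamma_i, and a row of
  d = mean_dim elements, differentiating through mu and var gives

    dx_i = rsqrt(var) * ( dxhat_i
           - (1/d) * sum_j dxhat_j
           - xhat_i * (1/d) * sum_j dxhat_j * xhat_j )

  which is exactly (dxhat - s_sum_dxhat - xhat * s_sum_dxhat_xhat) * var_rsqrt
  once the two block-reduced sums have been pre-divided by mean_dim.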
xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - xhat = ((const float4 *)inp_or_out)[offset]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[threadIdx.x]; - xhat.x = (xhat.x - vbetta.x) / add_eps(vgamma.x); - xhat.y = (xhat.y - vbetta.y) / add_eps(vgamma.y); - xhat.z = (xhat.z - vbetta.z) / add_eps(vgamma.z); - xhat.w = (xhat.w - vbetta.w) / add_eps(vgamma.w); - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; - xhat.x = (xhat.x - fmean) * var_rsqrt; - xhat.y = (xhat.y - fmean) * var_rsqrt; - xhat.z = (xhat.z - fmean) * var_rsqrt; - xhat.w = (xhat.w - fmean) * var_rsqrt; - } - } - - /* step2. block reduce sum for dxhat and dxhat*xhat */ - float reduce_val[2] = {0.f, 0.f}; - if (threadIdx.x < hidden_dim) { - reduce_val[0] = dxhat.x + dxhat.y + dxhat.z + dxhat.w; - reduce_val[1] = dxhat.x * xhat.x + dxhat.y * xhat.y + dxhat.z * xhat.z + - dxhat.w * xhat.w; - } - blockReduce(reduce_val); - __shared__ float s_sum_dxhat, s_sum_dxhat_xhat; - if (threadIdx.x == 0) { - float mean_dim = hidden_dim * 4; - s_sum_dxhat = reduce_val[0] / mean_dim; - s_sum_dxhat_xhat = reduce_val[1] / mean_dim; - } - __syncthreads(); - - /* - step3. compute input gradient - (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var) - */ - if (threadIdx.x >= hidden_dim) { - return; - } - dxhat.x = (dxhat.x - s_sum_dxhat - xhat.x * s_sum_dxhat_xhat) * var_rsqrt; - dxhat.y = (dxhat.y - s_sum_dxhat - xhat.y * s_sum_dxhat_xhat) * var_rsqrt; - dxhat.z = (dxhat.z - s_sum_dxhat - xhat.z * s_sum_dxhat_xhat) * var_rsqrt; - dxhat.w = (dxhat.w - s_sum_dxhat - xhat.w * s_sum_dxhat_xhat) * var_rsqrt; - if (residual_grad) { - // Add the residual grad, - // usually in pre-layer-norm for transformer layer - float4 dresidual = ((const float4 *)residual_grad)[offset]; - dxhat.x += dresidual.x; - dxhat.y += dresidual.y; - dxhat.z += dresidual.z; - dxhat.w += dresidual.w; - } - ((float4 *)inp_grad)[offset] = dxhat; -} - -template <> -__global__ void ker_ln_bw_dinp<__half>(__half *inp_grad, const __half *out_grad, - const __half *residual_grad, - const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, - int hidden_dim) { - int offset = blockIdx.x * hidden_dim + threadIdx.x; - - float2 dxhat[4], xhat[4]; - float var_rsqrt; - float4 vtmp; - __half2 *tmp_h2; - float reduce_val[2] = {0.f, 0.f}; - - if (threadIdx.x < hidden_dim) { - // step 0. dxhat = dout * gamma - vtmp = ((const float4 *)out_grad)[offset]; - tmp_h2 = reinterpret_cast<__half2 *>(&vtmp); - float4 gamma_f4 = ((const float4 *)gamma)[threadIdx.x]; - __half2 *gamma_h2 = reinterpret_cast<__half2 *>(&gamma_f4); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vdout = __half22float2(tmp_h2[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - dxhat[i].x = vdout.x * vgamma.x; - dxhat[i].y = vdout.y * vgamma.y; - reduce_val[0] += dxhat[i].x + dxhat[i].y; - } - - /* - step 1. 
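  Note the precision discipline in this __half specialization: operands are
  widened with __half22float2 before any arithmetic, the block reduction of
  (sum dxhat, sum dxhat*xhat) runs entirely in fp32, and values are narrowed
  back only at the final __float2half / __float22half2_rn stores, so fp16
  storage does not degrade the reduction itself.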
xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - vtmp = ((const float4 *)inp_or_out)[offset]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[threadIdx.x]; - __half2 *betta_h2 = reinterpret_cast<__half2 *>(&vbetta); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vout = __half22float2(tmp_h2[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vbetta = __half22float2(betta_h2[i]); - xhat[i].x = (vout.x - vbetta.x) / add_eps(vgamma.x); - xhat[i].y = (vout.y - vbetta.y) / add_eps(vgamma.y); - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - } - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vinp = __half22float2(tmp_h2[i]); - xhat[i].x = (vinp.x - fmean) * var_rsqrt; - xhat[i].y = (vinp.y - fmean) * var_rsqrt; - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - } - } - } - - /* step2. block reduce sum for dxhat and dxhat*xhat */ - blockReduce(reduce_val); - __shared__ float s_sum_dxhat, s_sum_dxhat_xhat; - if (threadIdx.x == 0) { - float mean_dim = hidden_dim * 8; - s_sum_dxhat = reduce_val[0] / mean_dim; - s_sum_dxhat_xhat = reduce_val[1] / mean_dim; - } - __syncthreads(); - - /* - step3. compute input gradient - (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var) - */ - if (threadIdx.x >= hidden_dim) { - return; - } - if (residual_grad) { - // Add the residual grad, - // usually in pre-layer-norm for transformer layer - float4 dresidual = ((const float4 *)residual_grad)[offset]; - __half *hdres = reinterpret_cast<__half *>(&dresidual); -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1]))); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1])); -#endif - } - } else { -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt); -#endif - } - } - ((float4 *)inp_grad)[offset] = vtmp; -} - -__global__ void ker_ln_bw_dinp_x2(__half *inp_grad, const __half *out_grad, - const __half *residual_grad, - const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, - int hidden_dim) { - int offset = blockIdx.x * hidden_dim * 2 + threadIdx.x * 2; - - float2 dxhat[4], xhat[4]; - float2 dxhat_1[4], xhat_1[4]; - float var_rsqrt; - float4 vtmp, vtmp_1; - __half2 *tmp_h2; - __half2 *tmp_h2_1; - float reduce_val[2] = {0.f, 0.f}; - - if (threadIdx.x < hidden_dim) { - // 
step 0. dxhat = dout * gamma - vtmp = ((const float4 *)out_grad)[offset]; - vtmp_1 = ((const float4 *)out_grad)[offset + 1]; - tmp_h2 = reinterpret_cast<__half2 *>(&vtmp); - tmp_h2_1 = reinterpret_cast<__half2 *>(&vtmp_1); - float4 gamma_f4 = ((const float4 *)gamma)[threadIdx.x * 2]; - float4 gamma_f4_1 = ((const float4 *)gamma)[threadIdx.x * 2 + 1]; - __half2 *gamma_h2 = reinterpret_cast<__half2 *>(&gamma_f4); - __half2 *gamma_h2_1 = reinterpret_cast<__half2 *>(&gamma_f4_1); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vdout = __half22float2(tmp_h2[i]); - float2 vdout_1 = __half22float2(tmp_h2_1[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - dxhat[i].x = vdout.x * vgamma.x; - dxhat[i].y = vdout.y * vgamma.y; - dxhat_1[i].x = vdout_1.x * vgamma_1.x; - dxhat_1[i].y = vdout_1.y * vgamma_1.y; - reduce_val[0] += dxhat[i].x + dxhat[i].y + dxhat_1[i].x + dxhat_1[i].y; - } - - /* - step 1. xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - vtmp = ((const float4 *)inp_or_out)[offset]; - vtmp_1 = ((const float4 *)inp_or_out)[offset + 1]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[2 * threadIdx.x]; - float4 vbetta_1 = ((const float4 *)betta)[2 * threadIdx.x + 1]; - __half2 *betta_h2 = reinterpret_cast<__half2 *>(&vbetta); - __half2 *betta_h2_1 = reinterpret_cast<__half2 *>(&vbetta_1); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vout = __half22float2(tmp_h2[i]); - float2 vout_1 = __half22float2(tmp_h2_1[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - float2 vbetta = __half22float2(betta_h2[i]); - float2 vbetta_1 = __half22float2(betta_h2_1[i]); - xhat[i].x = (vout.x - vbetta.x) / add_eps(vgamma.x); - xhat_1[i].x = (vout_1.x - vbetta_1.x) / add_eps(vgamma_1.x); - xhat[i].y = (vout.y - vbetta.y) / add_eps(vgamma.y); - xhat_1[i].y = (vout_1.y - vbetta_1.y) / add_eps(vgamma_1.y); - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - } - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vinp = __half22float2(tmp_h2[i]); - float2 vinp_1 = __half22float2(tmp_h2_1[i]); - xhat[i].x = (vinp.x - fmean) * var_rsqrt; - xhat_1[i].x = (vinp_1.x - fmean) * var_rsqrt; - xhat[i].y = (vinp.y - fmean) * var_rsqrt; - xhat_1[i].y = (vinp_1.y - fmean) * var_rsqrt; - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - } - } - } - - /* step2. block reduce sum for dxhat and dxhat*xhat */ - blockReduce(reduce_val); - __shared__ float s_sum_dxhat, s_sum_dxhat_xhat; - if (threadIdx.x == 0) { - float mean_dim = hidden_dim * 8 * 2; - s_sum_dxhat = reduce_val[0] / mean_dim; - s_sum_dxhat_xhat = reduce_val[1] / mean_dim; - } - __syncthreads(); - - /* - step3. 
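  The _x2 variant exists for hidden sizes too large for one float4 per
  thread: each thread owns two consecutive float4 vectors (offsets
  2 * threadIdx.x and 2 * threadIdx.x + 1), so a block covers twice the row
  width, and mean_dim is scaled to hidden_dim * 8 * 2 accordingly; the _x4
  kernel below extends the same unrolling to four vectors per thread.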
compute input gradient - (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var) - */ - if (threadIdx.x >= hidden_dim) { - return; - } - if (residual_grad) { - // Add the residual grad, - // usually in pre-layer-norm for transformer layer - float4 dresidual = ((const float4 *)residual_grad)[offset]; - float4 dresidual_1 = ((const float4 *)residual_grad)[offset+1]; - __half *hdres = reinterpret_cast<__half *>(&dresidual); - __half *hdres_1 = reinterpret_cast<__half *>(&dresidual_1); -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1]))); - tmp_h2_1[i] = make_half2(__float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i])), - __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i + 1]))); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])); - tmp_h2_1[i].x = __float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i])); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1])); - tmp_h2_1[i].y = __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i + 1])); -#endif - } - } else { -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); - tmp_h2_1[i] = make_half2(__float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_1[i].x = __float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_1[i].y = __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt); -#endif - } - } - ((float4 *)inp_grad)[offset] = vtmp; - ((float4 *)inp_grad)[offset + 1] = vtmp_1; -} - -__global__ void ker_ln_bw_dinp_x4(__half *inp_grad, const __half *out_grad, - const __half *residual_grad, - const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, - int hidden_dim) { - int offset = blockIdx.x * hidden_dim * 4 + threadIdx.x * 4; - - float2 dxhat[4], xhat[4]; - float2 dxhat_1[4], xhat_1[4]; - float2 dxhat_2[4], xhat_2[4]; - float2 dxhat_3[4], xhat_3[4]; - float var_rsqrt; - float4 vtmp, vtmp_1, vtmp_2, vtmp_3; - __half2 *tmp_h2; - __half2 *tmp_h2_1; - __half2 *tmp_h2_2; - __half2 *tmp_h2_3; - float reduce_val[2] = {0.f, 0.f}; - - if (threadIdx.x < hidden_dim) { - // step 0. 
dxhat = dout * gamma - vtmp = ((const float4 *)out_grad)[offset]; - vtmp_1 = ((const float4 *)out_grad)[offset + 1]; - vtmp_2 = ((const float4 *)out_grad)[offset + 2]; - vtmp_3 = ((const float4 *)out_grad)[offset + 3]; - tmp_h2 = reinterpret_cast<__half2 *>(&vtmp); - tmp_h2_1 = reinterpret_cast<__half2 *>(&vtmp_1); - tmp_h2_2 = reinterpret_cast<__half2 *>(&vtmp_2); - tmp_h2_3 = reinterpret_cast<__half2 *>(&vtmp_3); - float4 gamma_f4 = ((const float4 *)gamma)[threadIdx.x * 4]; - float4 gamma_f4_1 = ((const float4 *)gamma)[threadIdx.x * 4 + 1]; - float4 gamma_f4_2 = ((const float4 *)gamma)[threadIdx.x * 4 + 2]; - float4 gamma_f4_3 = ((const float4 *)gamma)[threadIdx.x * 4 + 3]; - __half2 *gamma_h2 = reinterpret_cast<__half2 *>(&gamma_f4); - __half2 *gamma_h2_1 = reinterpret_cast<__half2 *>(&gamma_f4_1); - __half2 *gamma_h2_2 = reinterpret_cast<__half2 *>(&gamma_f4_2); - __half2 *gamma_h2_3 = reinterpret_cast<__half2 *>(&gamma_f4_3); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vdout = __half22float2(tmp_h2[i]); - float2 vdout_1 = __half22float2(tmp_h2_1[i]); - float2 vdout_2 = __half22float2(tmp_h2_2[i]); - float2 vdout_3 = __half22float2(tmp_h2_3[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - float2 vgamma_2 = __half22float2(gamma_h2_2[i]); - float2 vgamma_3 = __half22float2(gamma_h2_3[i]); - dxhat[i].x = vdout.x * vgamma.x; - dxhat[i].y = vdout.y * vgamma.y; - dxhat_1[i].x = vdout_1.x * vgamma_1.x; - dxhat_1[i].y = vdout_1.y * vgamma_1.y; - dxhat_2[i].x = vdout_2.x * vgamma_2.x; - dxhat_2[i].y = vdout_2.y * vgamma_2.y; - dxhat_3[i].x = vdout_3.x * vgamma_3.x; - dxhat_3[i].y = vdout_3.y * vgamma_3.y; - reduce_val[0] += dxhat[i].x + dxhat[i].y + dxhat_1[i].x + dxhat_1[i].y + dxhat_2[i].x + - dxhat_2[i].y + dxhat_3[i].x + dxhat_3[i].y; - } - - /* - step 1. 
xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - vtmp = ((const float4 *)inp_or_out)[offset]; - vtmp_1 = ((const float4 *)inp_or_out)[offset + 1]; - vtmp_2 = ((const float4 *)inp_or_out)[offset + 2]; - vtmp_3 = ((const float4 *)inp_or_out)[offset + 3]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[4 * threadIdx.x]; - float4 vbetta_1 = ((const float4 *)betta)[4 * threadIdx.x + 1]; - float4 vbetta_2 = ((const float4 *)betta)[4 * threadIdx.x + 2]; - float4 vbetta_3 = ((const float4 *)betta)[4 * threadIdx.x + 3]; - __half2 *betta_h2 = reinterpret_cast<__half2 *>(&vbetta); - __half2 *betta_h2_1 = reinterpret_cast<__half2 *>(&vbetta_1); - __half2 *betta_h2_2 = reinterpret_cast<__half2 *>(&vbetta_2); - __half2 *betta_h2_3 = reinterpret_cast<__half2 *>(&vbetta_3); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vout = __half22float2(tmp_h2[i]); - float2 vout_1 = __half22float2(tmp_h2_1[i]); - float2 vout_2 = __half22float2(tmp_h2_2[i]); - float2 vout_3 = __half22float2(tmp_h2_3[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - float2 vgamma_2 = __half22float2(gamma_h2_2[i]); - float2 vgamma_3 = __half22float2(gamma_h2_3[i]); - float2 vbetta = __half22float2(betta_h2[i]); - float2 vbetta_1 = __half22float2(betta_h2_1[i]); - float2 vbetta_2 = __half22float2(betta_h2_2[i]); - float2 vbetta_3 = __half22float2(betta_h2_3[i]); - xhat[i].x = (vout.x - vbetta.x) / add_eps(vgamma.x); - xhat_1[i].x = (vout_1.x - vbetta_1.x) / add_eps(vgamma_1.x); - xhat_2[i].x = (vout_2.x - vbetta_2.x) / add_eps(vgamma_2.x); - xhat_3[i].x = (vout_3.x - vbetta_3.x) / add_eps(vgamma_3.x); - xhat[i].y = (vout.y - vbetta.y) / add_eps(vgamma.y); - xhat_1[i].y = (vout_1.y - vbetta_1.y) / add_eps(vgamma_1.y); - xhat_2[i].y = (vout_2.y - vbetta_2.y) / add_eps(vgamma_2.y); - xhat_3[i].y = (vout_3.y - vbetta_3.y) / add_eps(vgamma_3.y); - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - reduce_val[1] += xhat_2[i].x * dxhat_2[i].x + xhat_2[i].y * dxhat_2[i].y; - reduce_val[1] += xhat_3[i].x * dxhat_3[i].x + xhat_3[i].y * dxhat_3[i].y; - } - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vinp = __half22float2(tmp_h2[i]); - float2 vinp_1 = __half22float2(tmp_h2_1[i]); - float2 vinp_2 = __half22float2(tmp_h2_2[i]); - float2 vinp_3 = __half22float2(tmp_h2_3[i]); - xhat[i].x = (vinp.x - fmean) * var_rsqrt; - xhat_1[i].x = (vinp_1.x - fmean) * var_rsqrt; - xhat_2[i].x = (vinp_2.x - fmean) * var_rsqrt; - xhat_3[i].x = (vinp_3.x - fmean) * var_rsqrt; - xhat[i].y = (vinp.y - fmean) * var_rsqrt; - xhat_1[i].y = (vinp_1.y - fmean) * var_rsqrt; - xhat_2[i].y = (vinp_2.y - fmean) * var_rsqrt; - xhat_3[i].y = (vinp_3.y - fmean) * var_rsqrt; - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - reduce_val[1] += xhat_2[i].x * dxhat_2[i].x + xhat_2[i].y * dxhat_2[i].y; - reduce_val[1] += xhat_3[i].x * dxhat_3[i].x + xhat_3[i].y * dxhat_3[i].y; - } - } - } - - /* step2. 
block reduce sum for dxhat and dxhat*xhat */
-  blockReduce<ReduceType::kSum, 2>(reduce_val);
-  __shared__ float s_sum_dxhat, s_sum_dxhat_xhat;
-  if (threadIdx.x == 0) {
-    float mean_dim = hidden_dim * 8 * 4;
-    s_sum_dxhat = reduce_val[0] / mean_dim;
-    s_sum_dxhat_xhat = reduce_val[1] / mean_dim;
-  }
-  __syncthreads();
-
-  /*
-  step3. compute input gradient
-  (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var)
-  */
-  if (threadIdx.x >= hidden_dim) {
-    return;
-  }
-  if (residual_grad) {
-    // Add the residual grad,
-    // usually in pre-layer-norm for transformer layer
-    float4 dresidual = ((const float4 *)residual_grad)[offset];
-    float4 dresidual_1 = ((const float4 *)residual_grad)[offset + 1];
-    float4 dresidual_2 = ((const float4 *)residual_grad)[offset + 2];
-    float4 dresidual_3 = ((const float4 *)residual_grad)[offset + 3];
-    __half *hdres = reinterpret_cast<__half *>(&dresidual);
-    __half *hdres_1 = reinterpret_cast<__half *>(&dresidual_1);
-    __half *hdres_2 = reinterpret_cast<__half *>(&dresidual_2);
-    __half *hdres_3 = reinterpret_cast<__half *>(&dresidual_3);
-#pragma unroll
-    for (int i = 0; i < 4; i++) {
-#ifdef COLOSSAL_HIP
-      tmp_h2[i] = make_half2(
-          __float2half(
-              (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres[2 * i])),
-          __float2half(
-              (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres[2 * i + 1])));
-      tmp_h2_1[i] = make_half2(
-          __float2half(
-              (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_1[2 * i])),
-          __float2half(
-              (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_1[2 * i + 1])));
-      tmp_h2_2[i] = make_half2(
-          __float2half(
-              (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_2[2 * i])),
-          __float2half(
-              (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_2[2 * i + 1])));
-      tmp_h2_3[i] = make_half2(
-          __float2half(
-              (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_3[2 * i])),
-          __float2half(
-              (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_3[2 * i + 1])));
-#else
-      tmp_h2[i].x = __float2half(
-          (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres[2 * i]));
-      tmp_h2_1[i].x = __float2half(
-          (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_1[2 * i]));
-      tmp_h2_2[i].x = __float2half(
-          (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_2[2 * i]));
-      tmp_h2_3[i].x = __float2half(
-          (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_3[2 * i]));
-      tmp_h2[i].y = __float2half(
-          (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres[2 * i + 1]));
-      tmp_h2_1[i].y = __float2half(
-          (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_1[2 * i + 1]));
-      tmp_h2_2[i].y = __float2half(
-          (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_2[2 * i + 1]));
-      tmp_h2_3[i].y = __float2half(
-          (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_3[2 * i + 1]));
-#endif
-    }
-  } else {
-#pragma unroll
-    for (int i = 0; i < 4; i++) {
-#ifdef COLOSSAL_HIP
-      tmp_h2[i] = make_half2(
-          __float2half(
-              (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt),
-          __float2half(
-              (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt));
-      tmp_h2_1[i] = make_half2(
-          __float2half(
-              (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt),
-          __float2half(
-              (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt));
-      tmp_h2_2[i] = make_half2(
-          __float2half(
-              (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt),
-          __float2half(
-              (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt));
-      tmp_h2_3[i] = make_half2(
-          __float2half(
-              (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt),
-          __float2half(
-              (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt));
-#else
-      tmp_h2[i].x = __float2half(
-          (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2_1[i].x = __float2half(
-          (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2_2[i].x = __float2half(
-          (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2_3[i].x = __float2half(
-          (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2[i].y = __float2half(
-          (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2_1[i].y = __float2half(
-          (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2_2[i].y = __float2half(
-          (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) *
-          var_rsqrt);
-      tmp_h2_3[i].y = __float2half(
-          (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) *
-          var_rsqrt);
-#endif
-    }
-  }
-  ((float4 *)inp_grad)[offset] = vtmp;
-  ((float4 *)inp_grad)[offset + 1] = vtmp_1;
-  ((float4 *)inp_grad)[offset + 2] = vtmp_2;
-  ((float4 *)inp_grad)[offset + 3] = vtmp_3;
-}
-
-/**
-Layer norm backward,
-  compute the gradient of gamma, betta and input.
-dbetta = sum(dout, dim=0)
-xhat = (input - mean) * rsqrt(var) if mean is not nullptr
-       (output - betta) / gamma if mean is nullptr
-dgamma = sum(xhat * dout, dim=0)
-dxhat = dout * gamma
-dinp = (dxhat - (sum(dxhat, 1) + xhat * sum(dxhat * xhat, 1)) / hidden_dim)
-       * rsqrt(var)
-
-residual_grad, means, betta can be nullptr.
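-
-The dinp expression is the standard layer-norm backward: differentiating
-out = xhat * gamma + betta with xhat = (inp - mean) * rsqrt(var) gives
-dinp = (dxhat - mean(dxhat) - xhat * mean(dxhat * xhat)) * rsqrt(var),
-where both means run over the hidden dimension (mean_dim in the kernels,
-i.e. the full hidden size after undoing the float4/__half2 vectorization).
-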
-residual_grad will be added to dinp if it is not nullptr - which is useful in transformer layer when pre-ln -means and betta are only used to compute xhat, - (means == nullptr) ^ (betta == nullptr) should be true -*/ -template <> -void launch_ln_bw(float *gamma_grad, float *betta_grad, float *inp_grad, - const float *out_grad, const float *residual_grad, - const float *inp_or_out, const float *gamma, - const float *betta, const float *vars, - const float *means, int batch, int hidden_dim, - cudaStream_t stream[2]) { - // compute grad of gamma and betta - dim3 grid_dim(((hidden_dim + TILE_DIM - 1) / TILE_DIM) * TILE_DIM); - dim3 block_dim(TILE_DIM, TILE_DIM); - ker_ln_bw_dgamma_dbetta<<>>( - gamma_grad, betta_grad, out_grad, inp_or_out, gamma, betta, vars, means, - batch, hidden_dim); - - // compute grad of input - if (hidden_dim % 4 != 0 || hidden_dim > 4096) { - throw std::runtime_error("hidden_dim % 4 != 0 || hidden_dim > 4096"); - } - hidden_dim >>= 2; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - ker_ln_bw_dinp<<>>( - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); -} - -template <> -void launch_ln_bw<__half>(__half *gamma_grad, __half *betta_grad, - __half *inp_grad, const __half *out_grad, - const __half *residual_grad, const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, int batch, - int hidden_dim, cudaStream_t stream[2]) { - // compute grad of gamma and betta - dim3 grid_dim(((hidden_dim + TILE_DIM - 1) / TILE_DIM) * TILE_DIM); - dim3 block_dim(TILE_DIM, TILE_DIM); - ker_ln_bw_dgamma_dbetta<__half><<>>( - gamma_grad, betta_grad, out_grad, inp_or_out, gamma, betta, vars, means, - batch, hidden_dim); - - // compute grad of input - if (hidden_dim % 8 != 0) { - throw std::runtime_error("hidden_dim % 8 != 0"); - } - hidden_dim >>= 3; - - if (hidden_dim * 8 <= 8192) { - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - ker_ln_bw_dinp<<>>( - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); - } else if (hidden_dim * 8 > 8192 && hidden_dim * 8 <= 8192 * 2) { - hidden_dim >>= 1; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - ker_ln_bw_dinp_x2<<>>( - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); - } else if (hidden_dim * 8 > 2 * 8192 && hidden_dim * 8 <= 8192 * 4) { - hidden_dim >>= 2; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - ker_ln_bw_dinp_x4<<>>( - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); - } else { - throw std::runtime_error("hidden_dim % 4 != 0 || hidden_dim > 32768"); - } -} - diff --git a/colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu b/colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu deleted file mode 100644 index 7b39f0865ae9506f1111500fe7029d1861cd7b97..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu +++ /dev/null @@ -1,393 +0,0 @@ -#include - -#include -#include - -#include "block_reduce.h" -#include "kernels.h" - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -const float EPSILON = 1e-8f; - -/** -@brief: softmax_kernel -Softmax forward kernel for - enc-self-attn, dec-self-attn, encdec-attn - -@thread -gridDim.x = dynamic -gridDim.y = batch_size -gridDim.z = nhead -blockDim.x = from_len - -@param -inp: [batch_size, nhead, from_len, to_len], softmax 
input. -attn_mask: [batch_size, to_len], padding tokens are -inf, - non padding tokens are 0. - attn_mask!=nullptr for enc-self-attn and enc-dec-attn - attn_mask=nullptr and mask_future=ture for dec-self-attn training - attn_mask=nullptr and mask_future=false for dec-self-attn infer -*/ -template -__global__ void ker_attn_softmax(T *inp, const T *attn_mask, int from_len, - int to_len, bool mask_future) { - int batch_id = blockIdx.y; - int head_id = blockIdx.z; - const int nhead = gridDim.z; - const int token_per_reduce = 1; -#ifdef COLOSSAL_HIP - typedef hipcub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef hipcub::BlockStore - BlockStore; -#else - typedef cub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef cub::BlockStore - BlockStore; -#endif - __shared__ typename BlockStore::TempStorage ts_store; - - T mval[ele_per_thread]; - if (attn_mask) { - attn_mask += batch_id * to_len; - BlockLoad(ts_load).Load(attn_mask, mval, to_len, REDUCE_FLOAT_INF_NEG); - } - - inp += flat_3dim(batch_id, head_id, 0, nhead, from_len * to_len); - for (int token_id = blockIdx.x * token_per_reduce; token_id < from_len; - token_id += gridDim.x * token_per_reduce) { - T inp_val[token_per_reduce][ele_per_thread]; - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - BlockLoad(ts_load).Load(inp + (token_id + i) * to_len, inp_val[i], to_len, - REDUCE_FLOAT_INF_NEG); - } - - /* step 1. compute max */ - // thread local max - float val[token_per_reduce][ele_per_thread]; - float l_max[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_max[i] = REDUCE_FLOAT_INF_NEG; - for (int j = 0; j < ele_per_thread; j++) { - if (attn_mask) { - val[i][j] = (float)inp_val[i][j] + (float)mval[j]; - } else { - if (mask_future && ele_per_thread * threadIdx.x + j > token_id + i) { - val[i][j] = REDUCE_FLOAT_INF_NEG; - } else { - val[i][j] = (float)inp_val[i][j]; - } - } - l_max[i] = fmaxf(l_max[i], val[i][j]); - } - } - // block reduce max - blockReduce(l_max); - // write shared - __shared__ float s_max[token_per_reduce]; - if (threadIdx.x == 0) { - for (int i = 0; i < token_per_reduce; i++) { - s_max[i] = l_max[i]; - } - } - __syncthreads(); - - /* step 2. compute sum */ - // thread local sum - float l_sum[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_sum[i] = 0.f; - for (int j = 0; j < ele_per_thread; j++) { - val[i][j] = __expf(val[i][j] - s_max[i]); - l_sum[i] += val[i][j]; - } - } - // block reduce sum - blockReduce(l_sum); - // write shared - __shared__ float s_sum[token_per_reduce]; - if (threadIdx.x == 0) { - for (int i = 0; i < token_per_reduce; i++) { - s_sum[i] = __fdividef(1.0f, l_sum[i] + EPSILON); - } - } - __syncthreads(); - - /* step 3. 
compute final result */ - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - for (int j = 0; j < ele_per_thread; j++) { - inp_val[i][j] = (T)(val[i][j] * s_sum[i]); - } - BlockStore(ts_store).Store(inp + (token_id + i) * to_len, inp_val[i], - to_len); - } - } // blockIdx.x -} - -template -__global__ void ker_attn_softmax_lt32(T *inp, const T *attn_mask, int from_len, - int to_len, bool mask_future) { - int batch_id = blockIdx.y; - int head_id = blockIdx.z; - const int nhead = gridDim.z; - const int token_per_reduce = 1; -#ifdef COLOSSAL_HIP - typedef hipcub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef hipcub::BlockStore - BlockStore; -#else - typedef cub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef cub::BlockStore - BlockStore; -#endif - __shared__ typename BlockStore::TempStorage ts_store; - - T mval[ele_per_thread]; - if (attn_mask) { - attn_mask += batch_id * to_len; - BlockLoad(ts_load).Load(attn_mask, mval, to_len, REDUCE_FLOAT_INF_NEG); - } - - inp += flat_3dim(batch_id, head_id, 0, nhead, from_len * to_len); - for (int token_id = blockIdx.x * token_per_reduce; token_id < from_len; - token_id += gridDim.x * token_per_reduce) { - T inp_val[token_per_reduce][ele_per_thread]; - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - BlockLoad(ts_load).Load(inp + (token_id + i) * to_len, inp_val[i], to_len, - REDUCE_FLOAT_INF_NEG); - } - - /* step 1. compute max */ - // thread local max - float val[token_per_reduce][ele_per_thread]; - float l_max[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_max[i] = REDUCE_FLOAT_INF_NEG; - for (int j = 0; j < ele_per_thread; j++) { - if (attn_mask) { - val[i][j] = (float)inp_val[i][j] + (float)mval[j]; - } else { - if (mask_future && ele_per_thread * threadIdx.x + j > token_id + i) { - val[i][j] = REDUCE_FLOAT_INF_NEG; - } else { - val[i][j] = (float)inp_val[i][j]; - } - } - l_max[i] = fmaxf(l_max[i], val[i][j]); - } - } - // warp reduce max - warpReduce(l_max); - - /* step 2. compute sum */ - // thread local sum - float l_sum[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_sum[i] = 0.f; - for (int j = 0; j < ele_per_thread; j++) { - val[i][j] = __expf(val[i][j] - l_max[i]); - l_sum[i] += val[i][j]; - } - } - // warp reduce sum - warpReduce(l_sum); - - /* step 3. 
compute final result */
-    for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) {
-      l_sum[i] = __fdividef(1.0f, l_sum[i] + EPSILON);
-      for (int j = 0; j < ele_per_thread; j++) {
-        inp_val[i][j] = (T)(val[i][j] * l_sum[i]);
-      }
-      BlockStore(ts_store).Store(inp + (token_id + i) * to_len, inp_val[i],
-                                 to_len);
-    }
-  }  // blockIdx.x
-}
-
-/*
-  attn_mask!=nullptr for enc-self-attn and enc-dec-attn
-  attn_mask=nullptr and mask_future=true for dec-self-attn training
-  attn_mask=nullptr and mask_future=false for dec-self-attn infer
-*/
-template <>
-void launch_attn_softmax<float>(float *inp, const float *attn_mask,
-                                int batch_size, int nhead, int from_len,
-                                int to_len, bool mask_future,
-                                cudaStream_t stream) {
-  dim3 grid_dim(1, batch_size, nhead);
-  if (to_len <= 32) {
-    ker_attn_softmax_lt32<float, 32, 1><<<grid_dim, 32, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 64) {
-    ker_attn_softmax_lt32<float, 32, 2><<<grid_dim, 32, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 128) {
-    grid_dim.x = 16;
-    ker_attn_softmax<float, 64, 2><<<grid_dim, 64, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 256) {
-    grid_dim.x = 32;
-    ker_attn_softmax<float, 128, 2><<<grid_dim, 128, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 512) {
-    grid_dim.x = 64;
-    ker_attn_softmax<float, 256, 2><<<grid_dim, 256, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else {
-    throw std::runtime_error(
-        "Sequence length greater than 512 is currently not supported");
-  }
-}
-
-template <>
-void launch_attn_softmax<__half>(__half *inp, const __half *attn_mask,
-                                 int batch_size, int nhead, int from_len,
-                                 int to_len, bool mask_future,
-                                 cudaStream_t stream) {
-  dim3 grid_dim(1, batch_size, nhead);
-  if (to_len <= 32) {
-    ker_attn_softmax_lt32<__half, 32, 1><<<grid_dim, 32, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 64) {
-    ker_attn_softmax_lt32<__half, 32, 2><<<grid_dim, 32, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 128) {
-    grid_dim.x = 8;
-    ker_attn_softmax<__half, 64, 2><<<grid_dim, 64, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 256) {
-    grid_dim.x = 16;
-    ker_attn_softmax<__half, 128, 2><<<grid_dim, 128, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else if (to_len <= 512) {
-    grid_dim.x = 32;
-    ker_attn_softmax<__half, 256, 2><<<grid_dim, 256, 0, stream>>>(
-        inp, attn_mask, from_len, to_len, mask_future);
-  } else {
-    throw std::runtime_error(
-        "Sequence length greater than 512 is currently not supported");
-  }
-}
-
-/**
-@brief: ker_attn_softmax_bw
-Softmax backward in self attention.
-
-@thread
-gridDim.x = batch_size * nhead * seq_len / warps_per_block
-blockDim.x = WARP_SIZE
-blockDim.y = warps_per_block
-
-@param
-grad: [batch_size, nhead, seq_len, seq_len], output grad.
-output: [batch_size, nhead, seq_len, seq_len], output of softmax forward.
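-
-Given y = softmax(x) along the last dim, the backward rule is
-dx = y * (dy - sum(y * dy, dim=-1)); each warp below computes that row-wise
-dot product with shuffle reductions and writes the result back into grad.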
-*/ -template -__global__ void ker_attn_softmax_bw(T *grad, const T *inp, int softmax_length) { - int batch_idx = blockIdx.x * blockDim.y + threadIdx.y; - int offset = batch_idx * softmax_length + threadIdx.x; - - grad += offset; - inp += offset; - - T grad_reg[ITERATIONS]; - T inp_reg[ITERATIONS]; - float sum = 0.0; - -#pragma unroll - for (int i = 0; i < ITERATIONS; ++i) { - int curr_idx = threadIdx.x + i * WARP_SIZE; - if (curr_idx < softmax_length) { - grad_reg[i] = grad[i * WARP_SIZE]; - inp_reg[i] = inp[i * WARP_SIZE]; - sum += (float)grad_reg[i] * (float)inp_reg[i]; - } - } - -#ifdef COLOSSAL_HIP - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += __shfl_xor(sum, i); -#else - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); - - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_xor(sum, i); -#endif - -#pragma unroll - for (int i = 0; i < ITERATIONS; ++i) { - int curr_idx = threadIdx.x + i * WARP_SIZE; - if (curr_idx < softmax_length) - grad[i * WARP_SIZE] = (T)((float)inp_reg[i] * ((float)grad_reg[i] - sum)); - } -} - -template -void launch_attn_softmax_bw(T *out_grad, const T *soft_inp, int rows, - int softmax_len, cudaStream_t stream) { - const int warps_per_block = 4; - // rows = batch_size * nhead * from_len - dim3 grid_dim(rows / warps_per_block); - dim3 block_dim(WARP_SIZE, warps_per_block); - - if (softmax_len <= 32) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 64) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 128) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 256) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 384) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 512) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 768) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 1024) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else if (softmax_len <= 2048) - ker_attn_softmax_bw - <<>>(out_grad, soft_inp, softmax_len); - else - throw std::runtime_error( - std::string( - "Special sequence length found in softmax backward, seq_len: ") + - std::to_string(softmax_len)); -} - -template void launch_attn_softmax_bw<__half>(__half *out_grad, - const __half *soft_inp, int rows, - int softmax_len, - cudaStream_t stream); -template void launch_attn_softmax_bw(float *out_grad, - const float *soft_inp, int rows, - int softmax_len, - cudaStream_t stream); diff --git a/colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu b/colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu deleted file mode 100644 index 3d39d59c36fdea5a2c248e4da243351ad7d4f439..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu +++ /dev/null @@ -1,325 +0,0 @@ -#ifdef COLOSSAL_HIP -#include -//#include -//#include -//#include -#else -#include -#include -#include -#endif - -#include "kernels.h" - -#ifdef COLOSSAL_HIP -using namespace hipcub; -#else -using namespace cub; -#endif - -/** -@brief: transform_0213 -Split the attention heads and reshape input -during backward progress of encoder self-attention - -@thread -gridDim.x = batch_size -gridDim.y = seq_len -blockDim.x = min(hidden_dim, MAX_THREADS) - -@param -input: [batch_size, seq_len, hidden_dim] -output: [batch_size, nhead, seq_len, 
head_dim]
-batch_size: the size of the current batch
-seq_len: the sequence length of the current batch
-hidden_dim: dim of the hidden tensor
-nhead: number of attention heads
-*/
-
-template <typename T>
-__global__ void transform_0213(T *output, const T *input, int hidden_dim,
-                               int head_dim);
-
-template <>
-__global__ void transform_0213<float>(float *output, const float *input,
-                                      int hidden_dim, int head_dim) {
-  int batch_id = blockIdx.x;
-  int token_id = blockIdx.y;
-  int seq_len = gridDim.y;
-  int nhead = hidden_dim / head_dim;
-
-  // [b, s, h]
-  int src_offset = flat_3dim(batch_id, token_id, 0, seq_len, hidden_dim);
-  // [b, nh, s, ad]
-  int trg_offset =
-      flat_4dim(batch_id, 0, token_id, 0, nhead, seq_len, head_dim);
-
-  const float4 *input4 = reinterpret_cast<const float4 *>(input);
-  float4 *res4 = reinterpret_cast<float4 *>(output);
-  float4 vinput4;
-
-  for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
-    vinput4 = input4[src_offset + i];
-
-    int head_id = i / head_dim;
-    int dim_id = i % head_dim;
-    int cur_trg_offset = flat_3dim(head_id, 0, dim_id, seq_len, head_dim);
-    res4[trg_offset + cur_trg_offset] = vinput4;
-  }
-}
-
-template <>
-__global__ void transform_0213<__half>(__half *output, const __half *input,
-                                       int hidden_dim, int head_dim) {
-  int batch_id = blockIdx.x;
-  int token_id = blockIdx.y;
-  int seq_len = gridDim.y;
-  int nhead = hidden_dim / head_dim;
-
-  // [b, s, h]
-  int src_offset = flat_3dim(batch_id, token_id, 0, seq_len, hidden_dim);
-  // [b, nh, s, ad]
-  int trg_offset =
-      flat_4dim(batch_id, 0, token_id, 0, nhead, seq_len, head_dim);
-
-  const float4 *input4 = reinterpret_cast<const float4 *>(input);
-  float4 *res4 = reinterpret_cast<float4 *>(output);
-  float4 vinput4;
-
-  for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
-    vinput4 = input4[src_offset + i];
-
-    int head_id = i / head_dim;
-    int dim_id = i % head_dim;
-    int cur_trg_offset = flat_3dim(head_id, 0, dim_id, seq_len, head_dim);
-    res4[trg_offset + cur_trg_offset] = vinput4;
-  }
-}
-
-// [b, s, h] -> [b, nh, s, ad]
-template <>
-void launch_transform_0213<float>(float *output, const float *input,
-                                  int batch_size, int seq_len, int hidden_dim,
-                                  int nhead, cudaStream_t stream) {
-  hidden_dim >>= 2;
-  int head_dim = hidden_dim / nhead;
-
-  dim3 grid_dim(batch_size, seq_len);
-  dim3 block_dim(min(hidden_dim, MAX_THREADS));
-
-  transform_0213<float><<<grid_dim, block_dim, 0, stream>>>(
-      output, input, hidden_dim, head_dim);
-}
-
-template <>
-void launch_transform_0213<__half>(__half *output, const __half *input,
-                                   int batch_size, int seq_len, int hidden_dim,
-                                   int nhead, cudaStream_t stream) {
-  hidden_dim >>= 3;
-  int head_dim = hidden_dim / nhead;
-
-  dim3 grid_dim(batch_size, seq_len);
-  dim3 block_dim(min(hidden_dim, MAX_THREADS));
-
-  transform_0213<__half><<<grid_dim, block_dim, 0, stream>>>(
-      output, input, hidden_dim, head_dim);
-}
-
-/**
-@brief: bias_add_transform_20314
-Add bias to input, transform from
-[0, 1, 2, 3, 4] to [2, 0, 3, 1, 4]
-
-@thread
-gridDim.x = dim_0
-gridDim.y = dim_1
-gridDim.z = dim_2
-blockDim.x = min(dim_3 * dim_4, MAX_THREADS)
-
-@param
-input: [dim_0, dim_1, dim_2, dim_3, dim_4]
-bias: [dim_2, dim_3, dim_4]
-output: [dim_2, dim_0, dim_3, dim_1, dim_4]
-*/
-template <typename T>
-__global__ void bias_add_transform_20314(T *output, const T *input,
-                                         const T *bias, int dim_3, int dim_4);
-
-template <>
-__global__ void bias_add_transform_20314<float>(float *output,
-                                                const float *input,
-                                                const float *bias, int dim_3,
-                                                int dim_4) {
-  int id0 = blockIdx.x;
-  int id1 = blockIdx.y;
-  int id2 = blockIdx.z;
-  int dim_0 = gridDim.x;
-  int dim_1 = gridDim.y;
-  int dim_2 = gridDim.z;
-  int dim_34 = dim_3 * dim_4;
-
-  int src_offset = flat_4dim(id0, id1, id2, 0, dim_1, dim_2, dim_34);
-  int trg_offset = flat_5dim(id2, id0, 0, id1, 0, dim_0, dim_3, dim_1, dim_4);
-  int bias_offset = flat_2dim(id2, 0, dim_34);
-
-  const float4 *qkv4 = reinterpret_cast<const float4 *>(input);
-  const float4 *bias4 = reinterpret_cast<const float4 *>(bias);
-  float4 *res4 = reinterpret_cast<float4 *>(output);
-  float4 vqkv4;
-  float4 vbias4;
-  float4 vres4;
-
-  for (std::size_t i = threadIdx.x; i < dim_34; i += blockDim.x) {
-    vqkv4 = qkv4[src_offset + i];
-    vbias4 = bias4[bias_offset + i];
-    vres4.x = vqkv4.x + vbias4.x;
-    vres4.y = vqkv4.y + vbias4.y;
-    vres4.z = vqkv4.z + vbias4.z;
-    vres4.w = vqkv4.w + vbias4.w;
-
-    int id3 = i / dim_4;
-    int id4 = i % dim_4;
-    int cur_trg_offset = flat_3dim(id3, 0, id4, dim_1, dim_4);
-    res4[trg_offset + cur_trg_offset] = vres4;
-  }
-}
-
-template <>
-__global__ void bias_add_transform_20314<__half>(__half *output,
-                                                 const __half *input,
-                                                 const __half *bias, int dim_3,
-                                                 int dim_4) {
-  int id0 = blockIdx.x;
-  int id1 = blockIdx.y;
-  int id2 = blockIdx.z;
-  int dim_0 = gridDim.x;
-  int dim_1 = gridDim.y;
-  int dim_2 = gridDim.z;
-  int dim_34 = dim_3 * dim_4;
-
-  int src_offset = flat_4dim(id0, id1, id2, 0, dim_1, dim_2, dim_34);
-  int trg_offset = flat_5dim(id2, id0, 0, id1, 0, dim_0, dim_3, dim_1, dim_4);
-  int bias_offset = flat_2dim(id2, 0, dim_34);
-
-  const float4 *qkv4 = reinterpret_cast<const float4 *>(input);
-  const float4 *bias4 = reinterpret_cast<const float4 *>(bias);
-  float4 *res4 = reinterpret_cast<float4 *>(output);
-  float4 vqkv4;
-  float4 vbias4;
-  float4 vres4;
-  __half2 *h2_qkv = reinterpret_cast<__half2 *>(&vqkv4);
-  __half2 *h2_bias = reinterpret_cast<__half2 *>(&vbias4);
-  __half2 *h2_res = reinterpret_cast<__half2 *>(&vres4);
-
-  for (std::size_t i = threadIdx.x; i < dim_34; i += blockDim.x) {
-    vqkv4 = qkv4[src_offset + i];
-    vbias4 = bias4[bias_offset + i];
-    h2_res[0] = __hadd2(h2_qkv[0], h2_bias[0]);
-    h2_res[1] = __hadd2(h2_qkv[1], h2_bias[1]);
-    h2_res[2] = __hadd2(h2_qkv[2], h2_bias[2]);
-    h2_res[3] = __hadd2(h2_qkv[3], h2_bias[3]);
-
-    int id3 = i / dim_4;
-    int id4 = i % dim_4;
-    int cur_trg_offset = flat_3dim(id3, 0, id4, dim_1, dim_4);
-    res4[trg_offset + cur_trg_offset] = vres4;
-  }
-}
-
-// [b, s, 3, h] -> [3, b, nh, s, ad]
-template <>
-void launch_bias_add_transform_20314<float>(float *output, const float *input,
-                                            const float *bias, int dim_0,
-                                            int dim_1, int dim_2, int dim_3,
-                                            int dim_4, cudaStream_t stream) {
-  dim_4 >>= 2;
-
-  dim3 grid_dim(dim_0, dim_1, dim_2);
-  dim3 block_dim(min(dim_3 * dim_4, MAX_THREADS));
-
-  bias_add_transform_20314<float>
-      <<<grid_dim, block_dim, 0, stream>>>(output, input, bias, dim_3, dim_4);
-}
-
-template <>
-void launch_bias_add_transform_20314<__half>(__half *output,
-                                             const __half *input,
-                                             const __half *bias, int dim_0,
-                                             int dim_1, int dim_2, int dim_3,
-                                             int dim_4, cudaStream_t stream) {
-  dim_4 >>= 3;
-
-  dim3 grid_dim(dim_0, dim_1, dim_2);
-  dim3 block_dim(min(dim_3 * dim_4, MAX_THREADS));
-
-  bias_add_transform_20314<__half>
-      <<<grid_dim, block_dim, 0, stream>>>(output, input, bias, dim_3, dim_4);
-}
-
-/**
-@brief: transform4d_0213
-Reshape the input matrix to merge the heads
-
-@thread
-gridDim.x = (num_all + max_block_thread - 1) / max_block_thread
-blockDim.x = max_block_thread
-
-@param
-input: [trans_count, batch_size, nhead, seq_len, head_dim]
-output: [batch_size, seq_len, trans_count, nhead, head_dim]
-batch_size: the size of the current batch
-seq_len: the sequence length of the current batch
-hidden_dim: dim of the hidden tensor
-nhead: number of attention heads
-trans_count: 1 or 3, the number of matrices to be transformed
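-
-e.g. with trans_count = 3 this undoes the head split done by
-bias_add_transform_20314 (minus the bias add): [3, b, nh, s, ad] is
-rearranged to [b, s, 3, nh, ad], which can be viewed as [b, s, 3 * h].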
-*/ -template -__global__ void transform4d_0213(T *output, const T *input, int batch_size, - int seq_len, int trans_count, int nhead, - int head_dim, int num_all) { - int offset = blockIdx.x * blockDim.x + threadIdx.x; - if (offset >= num_all) { - return; - } - int trans_id, batch_id, head_id, token_id, dim_id; - decompose_5dim(offset, batch_size, nhead, seq_len, head_dim, &trans_id, - &batch_id, &head_id, &token_id, &dim_id); - // [b, s, tc, nh, ad] - int trg_offset = flat_5dim(batch_id, token_id, trans_id, head_id, dim_id, - seq_len, trans_count, nhead, head_dim); - - const float4 *input4 = reinterpret_cast(input); - float4 *res4 = reinterpret_cast(output); - res4[trg_offset] = input4[offset]; -} - -// [tc, b, nh, s, ad] -> [b, s, tc, nh, ad] -template <> -void launch_transform4d_0213(float *output, const float *input, - int batch_size, int seq_len, int hidden_dim, - int nhead, int trans_count, - cudaStream_t stream) { - hidden_dim >>= 2; - int head_dim = hidden_dim / nhead; - int num_all = batch_size * seq_len * trans_count * hidden_dim; - int nblock = (num_all + MAX_THREADS - 1) / MAX_THREADS; - - transform4d_0213<<>>( - output, input, batch_size, seq_len, trans_count, nhead, head_dim, - num_all); -} - -template <> -void launch_transform4d_0213<__half>(__half *output, const __half *input, - int batch_size, int seq_len, - int hidden_dim, int nhead, int trans_count, - cudaStream_t stream) { - hidden_dim >>= 3; - int head_dim = hidden_dim / nhead; - int num_all = batch_size * seq_len * trans_count * hidden_dim; - int nblock = (num_all + MAX_THREADS - 1) / MAX_THREADS; - - transform4d_0213<__half><<>>( - output, input, batch_size, seq_len, trans_count, nhead, head_dim, - num_all); -} diff --git a/colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp b/colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp deleted file mode 100644 index c42d91d36f45f624d93c282dc90deec158976e56..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp +++ /dev/null @@ -1,185 +0,0 @@ -/*This code from NVIDIA apex: - * https://github.com/NVIDIA/apex - * with minor changes. 
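- *
- * This file hosts only the shape/contiguity checks and the pybind11
- * bindings; the kernels live in layer_norm_cuda_kernel.cu. A minimal
- * usage sketch from Python (the module name depends on how the extension
- * is built; `colossal_layer_norm_cuda` below is just an assumed example):
- *
- *   import colossal_layer_norm_cuda as ln
- *   out, mean, invvar = ln.forward_affine(x, (hidden,), gamma, beta, 1e-5)
- *   dx, dgamma, dbeta = ln.backward_affine(dout, mean, invvar, x,
- *                                          (hidden,), gamma, beta, 1e-5)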
*/ - -#include -#include -#include -#include "compat.h" - -namespace { - -void compute_n1_n2( - at::Tensor input, - at::IntArrayRef normalized_shape, - int& n1, - int& n2) { - int idiff = input.ndimension() - normalized_shape.size(); - n2 = 1; - for (int i = 0; i < (int)normalized_shape.size(); ++i) { - assert( input.sizes()[i+idiff] == normalized_shape[i] ); - n2 *= normalized_shape[i]; - } - n1 = 1; - for (int i = 0; i < idiff; ++i) { - n1 *= input.sizes()[i]; - } -} - -void check_args( - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta - ) -{ - TORCH_CHECK(!gamma.defined() || gamma.sizes().equals(normalized_shape)); - TORCH_CHECK(!beta.defined() || beta.sizes().equals(normalized_shape)); -} - -void check_args( - at::Tensor input, - at::IntArrayRef normalized_shape, - int& n1, - int& n2 - ) -{ - int64_t normalized_ndim = normalized_shape.size(); - - if (normalized_ndim < 1) { - std::stringstream ss; - ss << "Expected normalized_shape to be at least 1-dimensional, i.e., " - << "containing at least one element, but got normalized_shape=" - << normalized_shape; - throw std::runtime_error(ss.str()); - } - - auto input_shape = input.sizes(); - auto input_ndim = input.dim(); - - if (input_ndim < normalized_ndim || - !input_shape.slice(input_ndim - normalized_ndim).equals(normalized_shape)) { - std::stringstream ss; - ss << "Given normalized_shape=" << normalized_shape - << ", expected input with shape [*"; - for (auto size : normalized_shape) { - ss << ", " << size; - } - ss << "], but got input of size" << input_shape; - throw std::runtime_error(ss.str()); - } - - compute_n1_n2(input,normalized_shape,n1,n2); -} - - -void check_args( - at::Tensor input, - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta, - int& n1, - int& n2 - ) -{ - check_args(input,normalized_shape,n1,n2); - check_args(normalized_shape,gamma,beta); -} -} - -void cuda_layer_norm( - at::Tensor* output, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - at::IntArrayRef normalized_shape, - at::Tensor* gamma, - at::Tensor* beta, - double epsilon); - -#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor") -#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous") -#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x) - -std::vector layer_norm_affine( - at::Tensor input, - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta, - double epsilon) { - - CHECK_INPUT(input); - CHECK_INPUT(gamma); - CHECK_INPUT(beta); - int n1, n2; - check_args(input, normalized_shape, gamma, beta, n1, n2); - - at::Tensor output = at::empty_like( - input, gamma.options().dtype(gamma.scalar_type())); - at::Tensor mean = at::empty( - {n1}, input.options().dtype(at::ScalarType::Float)); - at::Tensor invvar = at::empty_like(mean); - - cuda_layer_norm(&output, &mean, &invvar, &input, n1, n2, - normalized_shape, &gamma, &beta, epsilon); - - return {output, mean, invvar}; - -} - - -void cuda_layer_norm_gradient( - at::Tensor* dout, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - at::IntArrayRef normalized_shape, - at::Tensor* gamma, - at::Tensor* beta, - double epsilon, - at::Tensor* grad_input, - at::Tensor* grad_gamma, - at::Tensor* grad_beta - ); - -std::vector layer_norm_gradient_affine( - at::Tensor dout, - at::Tensor mean, - at::Tensor invvar, - at::Tensor input, - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta, - double epsilon) { - - 
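-  // CHECK_INPUT (defined above) asserts that each tensor is CUDA-resident
-  // and contiguous before raw data pointers are handed to the kernels.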
CHECK_INPUT(dout); - CHECK_INPUT(mean); - CHECK_INPUT(invvar); - CHECK_INPUT(input); - CHECK_INPUT(gamma); - CHECK_INPUT(beta); - int n1, n2; - check_args(input, normalized_shape, gamma, beta, n1, n2); - - at::Tensor grad_input = at::empty_like(input); - at::Tensor grad_gamma = at::empty_like(gamma); - at::Tensor grad_beta = at::empty_like(beta); - - cuda_layer_norm_gradient(&dout, &mean, &invvar, &input, n1, n2, - normalized_shape, &gamma, &beta, epsilon, - &grad_input, &grad_gamma, &grad_beta); - - return {grad_input, grad_gamma, grad_beta}; - -} - - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("forward_affine", &layer_norm_affine, - "LayerNorm forward (CUDA)"); - m.def("backward_affine", &layer_norm_gradient_affine, - "LayerNorm backward (CUDA)"); -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu b/colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu deleted file mode 100644 index 2eb36f668cedaa67b4e66e28a476777c10de74d9..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu +++ /dev/null @@ -1,833 +0,0 @@ -/*This code from NVIDIA apex: - * https://github.com/NVIDIA/apex - * with minor changes. */ - -#include "ATen/ATen.h" -#include "ATen/AccumulateType.h" -#include "ATen/cuda/CUDAContext.h" -#include - -#include -#include - -#include "type_shim.h" - -template __device__ -void cuWelfordOnlineSum( - const U curr, - U& mu, - U& sigma2, - U& count) -{ - count = count + U(1); - U delta = curr - mu; - U lmean = mu + delta / count; - mu = lmean; - U delta2 = curr - lmean; - sigma2 = sigma2 + delta * delta2; -} - -template __device__ -void cuChanOnlineSum( - const U muB, - const U sigma2B, - const U countB, - U& mu, - U& sigma2, - U& count) -{ - U delta = muB - mu; - U nA = count; - U nB = countB; - count = count + countB; - U nX = count; - if (nX > U(0)) { - nA = nA / nX; - nB = nB / nX; - mu = nA*mu + nB*muB; - sigma2 = sigma2 + sigma2B + delta * delta * nA * nB * nX; - } else { - mu = U(0); - sigma2 = U(0); - } -} - -template __device__ -void cuWelfordMuSigma2( - const T* __restrict__ vals, - const int n1, - const int n2, - const int i1, - U& mu, - U& sigma2, - U* buf) -{ - // Assumptions: - // 1) blockDim.x == warpSize - // 2) Tensor is contiguous - // 3) 2*blockDim.y*sizeof(U)+blockDim.y*sizeof(int) shared memory available. 
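-  //
-  // cuWelfordOnlineSum above is Welford's single-pass update:
-  //   count += 1; mu += (x - mu) / count;
-  //   sigma2 += (x - mu_old) * (x - mu_new)
-  // and cuChanOnlineSum merges two partial (mu, sigma2, count) accumulators
-  // with Chan's parallel combination formula, which is how the warp- and
-  // block-level reductions below join their partial statistics.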
- // - // compute variance and mean over n2 - U count = U(0); - mu= U(0); - sigma2 = U(0); - if (i1 < n1) { - // one warp normalizes one n1 index, - // synchronization is implicit - // initialize with standard Welford algorithm - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - const T* lvals = vals + i1*n2; - int l = 4*thrx; - for (; l+3 < n2; l+=4*numx) { - for (int k = 0; k < 4; ++k) { - U curr = static_cast(lvals[l+k]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - } - for (; l < n2; ++l) { - U curr = static_cast(lvals[l]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - // intra-warp reductions - for (int l = 0; l <= 4; ++l) { - int srcLaneB = (threadIdx.x+(1<(muB,sigma2B,countB,mu,sigma2,count); - } - // threadIdx.x == 0 has correct values for each warp - // inter-warp reductions - if (blockDim.y > 1) { - U* ubuf = (U*)buf; - U* ibuf = (U*)(ubuf + blockDim.y); - for (int offset = blockDim.y/2; offset > 0; offset /= 2) { - // upper half of warps write to shared - if (threadIdx.x == 0 && threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int wrt_y = threadIdx.y - offset; - ubuf[2*wrt_y] = mu; - ubuf[2*wrt_y+1] = sigma2; - ibuf[wrt_y] = count; - } - __syncthreads(); - // lower half merges - if (threadIdx.x == 0 && threadIdx.y < offset) { - U muB = ubuf[2*threadIdx.y]; - U sigma2B = ubuf[2*threadIdx.y+1]; - U countB = ibuf[threadIdx.y]; - cuChanOnlineSum(muB,sigma2B,countB,mu,sigma2,count); - } - __syncthreads(); - } - // threadIdx.x = 0 && threadIdx.y == 0 only thread that has correct values - if (threadIdx.x == 0 && threadIdx.y == 0) { - ubuf[0] = mu; - ubuf[1] = sigma2; - } - __syncthreads(); - mu = ubuf[0]; - sigma2 = ubuf[1]/U(n2); - // don't care about final value of count, we know count == n2 - } else { - mu = WARP_SHFL(mu, 0); - sigma2 = WARP_SHFL(sigma2/U(n2), 0); - } - } -} - -template<> __device__ -void cuWelfordMuSigma2( - const at::Half* __restrict__ vals, - const int n1, - const int n2, - const int i1, - float& mu, - float& sigma2, - float* buf) -{ - // Assumptions: - // 1) blockDim.x == warpSize - // 2) Tensor is contiguous - // 3) 2*blockDim.y*sizeof(U)+blockDim.y*sizeof(int) shared memory available. - // - // compute variance and mean over n2 - float count = 0.0f; - mu= float(0); - sigma2 = float(0); - if (i1 < n1) { - // one warp normalizes one n1 index, - // synchronization is implicit - // initialize with standard Welford algorithm - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - const at::Half* lvals = vals + i1*n2; - int l = 8*thrx; - if ((((size_t)lvals)&3) != 0) { - // 16 bit alignment - // first thread consumes first point - if (thrx == 0) { - float curr = static_cast(lvals[0]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - ++l; - } - // at this point, lvals[l] are 32 bit aligned for all threads. 
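-    // each iteration below consumes 8 halves per thread as four __half2
-    // loads, feeding both lanes of each pair into the scalar Welford update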
- for (; l+7 < n2; l+=8*numx) { - for (int k = 0; k < 8; k+=2) { - float2 curr = __half22float2(*((__half2*)(lvals+l+k))); - cuWelfordOnlineSum(curr.x,mu,sigma2,count); - cuWelfordOnlineSum(curr.y,mu,sigma2,count); - } - } - for (; l < n2; ++l) { - float curr = static_cast(lvals[l]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - // intra-warp reductions - for (int l = 0; l <= 4; ++l) { - int srcLaneB = (threadIdx.x+(1< 1) { - float* ubuf = (float*)buf; - float* ibuf = (float*)(ubuf + blockDim.y); - for (int offset = blockDim.y/2; offset > 0; offset /= 2) { - // upper half of warps write to shared - if (threadIdx.x == 0 && threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int wrt_y = threadIdx.y - offset; - ubuf[2*wrt_y] = mu; - ubuf[2*wrt_y+1] = sigma2; - ibuf[wrt_y] = count; - } - __syncthreads(); - // lower half merges - if (threadIdx.x == 0 && threadIdx.y < offset) { - float muB = ubuf[2*threadIdx.y]; - float sigma2B = ubuf[2*threadIdx.y+1]; - float countB = ibuf[threadIdx.y]; - cuChanOnlineSum(muB,sigma2B,countB,mu,sigma2,count); - } - __syncthreads(); - } - // threadIdx.x = 0 && threadIdx.y == 0 only thread that has correct values - if (threadIdx.x == 0 && threadIdx.y == 0) { - ubuf[0] = mu; - ubuf[1] = sigma2; - } - __syncthreads(); - mu = ubuf[0]; - sigma2 = ubuf[1]/float(n2); - // don't care about final value of count, we know count == n2 - } else { - mu = WARP_SHFL(mu, 0); - sigma2 = WARP_SHFL(sigma2/float(n2), 0); - } - } -} - -#ifdef COLOSSAL_HIP - template __device__ U rsqrt(U v) { - return U(1) / sqrt(v); - } - template<> __device__ float rsqrt(float v) { - return rsqrtf(v); - } - template<> __device__ double rsqrt(double v) { - return rsqrt(v); - } -#else - template U rsqrt(U v) { - return U(1) / sqrt(v); - } - template<> float rsqrt(float v) { - return rsqrtf(v); - } - template<> double rsqrt(double v) { - return rsqrt(v); - } -#endif - -namespace { -// This is the un-specialized struct. Note that we prevent instantiation of this -// struct by putting an undefined symbol in the function body so it won't compile. 
-// template -// struct SharedMemory -// { -// // Ensure that we won't compile any un-specialized types -// __device__ T *getPointer() -// { -// extern __device__ void error(void); -// error(); -// return NULL; -// } -// }; -// https://github.com/NVIDIA/apex/issues/246 -template -struct SharedMemory; - -template <> -struct SharedMemory -{ - __device__ float *getPointer() - { - extern __shared__ float s_float[]; - return s_float; - } -}; - -} - -template __global__ -void cuApplyLayerNorm( - V* __restrict__ output_vals, - U* __restrict__ mean, - U* __restrict__ invvar, - const T* __restrict__ vals, - const int n1, - const int n2, - const U epsilon, - const V* __restrict__ gamma, - const V* __restrict__ beta - ) -{ - // Assumptions: - // 1) blockDim.x == warpSize - // 2) Tensors are contiguous - // -#ifdef COLOSSAL_HIP - for (size_t i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#else - for (auto i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#endif - SharedMemory shared; - U* buf = shared.getPointer(); - U mu,sigma2; - cuWelfordMuSigma2(vals,n1,n2,i1,mu,sigma2,buf); - const T* lvals = vals + i1*n2; - V* ovals = output_vals + i1*n2; - U c_invvar = rsqrt(sigma2 + epsilon); - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - if (gamma != NULL && beta != NULL) { - for (int i = thrx; i < n2; i+=numx) { - U curr = static_cast(lvals[i]); - ovals[i] = gamma[i] * static_cast(c_invvar * (curr - mu)) + beta[i]; - } - } else { - for (int i = thrx; i < n2; i+=numx) { - U curr = static_cast(lvals[i]); - ovals[i] = static_cast(c_invvar * (curr - mu)); - } - } - if (threadIdx.x == 0 && threadIdx.y == 0) { - mean[i1] = mu; - invvar[i1] = c_invvar; - } - } -} - -template __device__ -void cuLoadWriteStridedInputs( - const int i1_block, - const int thr_load_row_off, - const int thr_load_col_off, - const int i2_off, - const int row_stride, - U* warp_buf1, - U* warp_buf2, - const T* input, - const V* dout, - const int i1_end, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar - ) -{ - int i1 = i1_block+thr_load_row_off; - if (i1 < i1_end) { - U curr_mean = mean[i1]; - U curr_invvar = invvar[i1]; - for (int k = 0; k < blockDim.y; ++k) { - int i2 = i2_off + k; - int load_idx = i1*n2+i2; - int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; - if (i2(input[load_idx]); - U curr_dout = static_cast(dout[load_idx]); - warp_buf1[write_idx] = curr_dout; - warp_buf2[write_idx] = curr_dout * (curr_input - curr_mean) * curr_invvar; - } else { - warp_buf1[write_idx] = U(0); - warp_buf2[write_idx] = U(0); - } - } - } else { - for (int k = 0; k < blockDim.y; ++k) { - int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; - warp_buf1[write_idx] = U(0); - warp_buf2[write_idx] = U(0); - } - } -} - -template __device__ -void cuLoadAddStridedInputs( - const int i1_block, - const int thr_load_row_off, - const int thr_load_col_off, - const int i2_off, - const int row_stride, - U* warp_buf1, - U* warp_buf2, - const T* input, - const V* dout, - const int i1_end, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar - ) -{ - int i1 = i1_block+thr_load_row_off; - if (i1 < i1_end) { - U curr_mean = mean[i1]; - U curr_invvar = invvar[i1]; - for (int k = 0; k < blockDim.y; ++k) { - int i2 = i2_off + k; - int load_idx = i1*n2+i2; - int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; - if (i2(input[load_idx]); - U curr_dout = static_cast(dout[load_idx]); - warp_buf1[write_idx] += curr_dout; - warp_buf2[write_idx] += 
curr_dout * (curr_input - curr_mean) * curr_invvar; - } - } - } -} - -template __global__ -void cuComputePartGradGammaBeta( - const V* __restrict__ dout, - const T* __restrict__ input, - const int n1, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar, - U epsilon, - U* part_grad_gamma, - U* part_grad_beta) -{ - const int numsegs_n1 = (n1+blockDim.y*blockDim.y-1) / (blockDim.y*blockDim.y); - const int segs_per_block = (numsegs_n1 + gridDim.y - 1) / gridDim.y; - const int i1_beg = blockIdx.y * segs_per_block * blockDim.y*blockDim.y; - const int i1_beg_plus_one = (blockIdx.y+1) * segs_per_block * blockDim.y*blockDim.y; - const int i1_end = i1_beg_plus_one < n1 ? i1_beg_plus_one : n1; - const int row_stride = blockDim.x+1; - const int thr_load_col_off = (threadIdx.x*blockDim.y)&(blockDim.x-1); - const int thr_load_row_off = (threadIdx.x*blockDim.y)/blockDim.x + threadIdx.y*blockDim.y; - const int i2_off = blockIdx.x * blockDim.x + thr_load_col_off; - SharedMemory shared; - U* buf = shared.getPointer(); // buf has at least blockDim.x * blockDim.y * blockDim.y + (blockDim.y - 1)*(blockDim.x/blockDim.y) elements - U* warp_buf1 = (U*)buf; - U* warp_buf2 = warp_buf1 + blockDim.y * blockDim.y * row_stride; - // compute partial sums from strided inputs - // do this to increase number of loads in flight - cuLoadWriteStridedInputs(i1_beg,thr_load_row_off,thr_load_col_off,i2_off,row_stride,warp_buf1,warp_buf2,input,dout,i1_end,n2,mean,invvar); - for (int i1_block = i1_beg+blockDim.y*blockDim.y; i1_block < i1_end; i1_block+=blockDim.y*blockDim.y) { - cuLoadAddStridedInputs(i1_block,thr_load_row_off,thr_load_col_off,i2_off,row_stride,warp_buf1,warp_buf2,input,dout,i1_end,n2,mean,invvar); - } - __syncthreads(); - // inter-warp reductions - // sum within each warp - U acc1 = U(0); - U acc2 = U(0); - for (int k = 0; k < blockDim.y; ++k) { - int row1 = threadIdx.y + k*blockDim.y; - int idx1 = row1*row_stride + threadIdx.x; - acc1 += warp_buf1[idx1]; - acc2 += warp_buf2[idx1]; - } - warp_buf1[threadIdx.y*row_stride+threadIdx.x] = acc1; - warp_buf2[threadIdx.y*row_stride+threadIdx.x] = acc2; - __syncthreads(); - // sum all warps - for (int offset = blockDim.y/2; offset > 1; offset /= 2) { - if (threadIdx.y < offset) { - int row1 = threadIdx.y; - int row2 = threadIdx.y + offset; - int idx1 = row1*row_stride + threadIdx.x; - int idx2 = row2*row_stride + threadIdx.x; - warp_buf1[idx1] += warp_buf1[idx2]; - warp_buf2[idx1] += warp_buf2[idx2]; - } - __syncthreads(); - } - int i2 = blockIdx.x * blockDim.x + threadIdx.x; - if (threadIdx.y == 0 && i2 < n2) { - int row1 = threadIdx.y; - int row2 = threadIdx.y + 1; - int idx1 = row1*row_stride + threadIdx.x; - int idx2 = row2*row_stride + threadIdx.x; - part_grad_beta[blockIdx.y*n2+i2] = warp_buf1[idx1] + warp_buf1[idx2]; - part_grad_gamma[blockIdx.y*n2+i2] = warp_buf2[idx1] + warp_buf2[idx2]; - } -} - -template __global__ -void cuComputeGradGammaBeta( - const U* part_grad_gamma, - const U* part_grad_beta, - const int part_size, - const int n1, - const int n2, - V* grad_gamma, - V* grad_beta) -{ - // sum partial gradients for gamma and beta - SharedMemory shared; - U* buf = shared.getPointer(); - int i2 = blockIdx.x * blockDim.x + threadIdx.x; - if (i2 < n2) { - // each warp does sequential reductions until reduced part_size is num_warps - int num_warp_reductions = part_size / blockDim.y; - U sum_gamma = U(0); - U sum_beta = U(0); - const U* part_grad_gamma_ptr = part_grad_gamma + threadIdx.y * num_warp_reductions * n2 + i2; - const U* 
part_grad_beta_ptr = part_grad_beta + threadIdx.y * num_warp_reductions * n2 + i2; - for (int warp_offset = 0; warp_offset < num_warp_reductions; ++warp_offset) { - sum_gamma += part_grad_gamma_ptr[warp_offset*n2]; - sum_beta += part_grad_beta_ptr[warp_offset*n2]; - } - // inter-warp reductions - const int nbsize3 = blockDim.x * blockDim.y / 2; - for (int offset = blockDim.y/2; offset >= 1; offset /= 2) { - // top half write to shared memory - if (threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int write_idx = (threadIdx.y - offset) * blockDim.x + threadIdx.x; - buf[write_idx] = sum_gamma; - buf[write_idx+nbsize3] = sum_beta; - } - __syncthreads(); - // bottom half sums - if (threadIdx.y < offset) { - const int read_idx = threadIdx.y * blockDim.x + threadIdx.x; - sum_gamma += buf[read_idx]; - sum_beta += buf[read_idx+nbsize3]; - } - __syncthreads(); - } - // write out fully summed gradients - if (threadIdx.y == 0) { - grad_gamma[i2] = sum_gamma; - grad_beta[i2] = sum_beta; - } - } -} - -template __global__ -void cuComputeGradInput( - const V* __restrict__ dout, - const T* __restrict__ input, - const int n1, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar, - U epsilon, - const V* gamma, - T* grad_input) -{ -#ifdef COLOSSAL_HIP - for (size_t i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#else - for (auto i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#endif - U sum_loss1 = U(0); - U sum_loss2 = U(0); - const U c_mean = mean[i1]; - const U c_invvar = invvar[i1]; - const T* k_input = input + i1*n2; - const V* k_dout = dout + i1*n2; - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - if (gamma != NULL) { - int l = 4*thrx; - for (; l+3 < n2; l+=4*numx) { - for (int k = 0; k < 4; ++k) { - const U c_h = static_cast(k_input[l+k]); - const U c_loss = static_cast(k_dout[l+k]); - sum_loss1 += c_loss * gamma[l+k]; - sum_loss2 += c_loss * gamma[l+k] * (c_h - c_mean) * c_invvar; - } - } - for (; l < n2; ++l) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - sum_loss1 += c_loss * gamma[l]; - sum_loss2 += c_loss * gamma[l] * (c_h - c_mean) * c_invvar; - } - } else { - int l = 4*thrx; - for (; l+3 < n2; l+=4*numx) { - for (int k = 0; k < 4; ++k) { - const U c_h = static_cast(k_input[l+k]); - const U c_loss = static_cast(k_dout[l+k]); - sum_loss1 += c_loss; - sum_loss2 += c_loss * (c_h - c_mean) * c_invvar; - } - } - for (; l < n2; ++l) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - sum_loss1 += c_loss; - sum_loss2 += c_loss * (c_h - c_mean) * c_invvar; - } - } - // intra-warp reductions - for (int mask = blockDim.x/2; mask > 0; mask /= 2) { - sum_loss1 += WARP_SHFL_XOR(sum_loss1, mask); - sum_loss2 += WARP_SHFL_XOR(sum_loss2, mask); - } - // inter-warp reductions - if (blockDim.y > 1) { - SharedMemory shared; - U* buf = shared.getPointer(); - for (int offset = blockDim.y/2; offset > 0; offset /= 2) { - // upper half of warps write to shared - if (threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int wrt_i = (threadIdx.y - offset) * blockDim.x + threadIdx.x; - buf[2*wrt_i] = sum_loss1; - buf[2*wrt_i+1] = sum_loss2; - } - __syncthreads(); - // lower half merges - if (threadIdx.y < offset) { - const int read_i = threadIdx.y * blockDim.x + threadIdx.x; - sum_loss1 += buf[2*read_i]; - sum_loss2 += buf[2*read_i+1]; - } - __syncthreads(); - } - if (threadIdx.y == 0) { - buf[2*threadIdx.x] = sum_loss1; - buf[2*threadIdx.x+1] = sum_loss2; - } 
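// For reference, with H = n2 and the two block-wide sums
//   sum_loss1 = sum_l dy_l * gamma_l
//   sum_loss2 = sum_l dy_l * gamma_l * (x_l - mean) * invvar
// (gamma_l omitted from both sums when gamma == NULL), the loop further
// below computes the standard layer-norm input gradient in scalar form:
//   dx_l = (invvar / H) * (H * dy_l * gamma_l - sum_loss1
//                          - (x_l - mean) * invvar * sum_loss2)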
- __syncthreads(); - if (threadIdx.y !=0) { - sum_loss1 = buf[2*threadIdx.x]; - sum_loss2 = buf[2*threadIdx.x+1]; - } - } - // all threads now have the two sums over l - U fH = (U)n2; - U term1 = (U(1) / fH) * c_invvar; - T* k_grad_input = grad_input + i1*n2; - if (gamma != NULL) { - for (int l = thrx; l < n2; l+=numx) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - U f_grad_input = fH * c_loss * gamma[l]; - f_grad_input -= sum_loss1; - f_grad_input -= (c_h - c_mean) * c_invvar * sum_loss2; - f_grad_input *= term1; - k_grad_input[l] = static_cast(f_grad_input); - } - } else { - for (int l = thrx; l < n2; l+=numx) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - U f_grad_input = fH * c_loss; - f_grad_input -= sum_loss1; - f_grad_input -= (c_h - c_mean) * c_invvar * sum_loss2; - f_grad_input *= term1; - k_grad_input[l] = static_cast(f_grad_input); - } - } - } -} - - - - -template -void HostApplyLayerNorm( - V* output, - U* mean, - U* invvar, - const T* input, - int n1, - int n2, - double epsilon, - const V* gamma, - const V* beta - ) -{ - auto stream = at::cuda::getCurrentCUDAStream().stream(); - const dim3 threads(32,4,1); - const uint64_t maxGridY = - at::cuda::getCurrentDeviceProperties()->maxGridSize[1]; - const dim3 blocks(1, std::min((uint64_t)n1, maxGridY), 1); - int nshared = - threads.y > 1 ? - threads.y*sizeof(U)+(threads.y/2)*sizeof(U) : - 0; - cuApplyLayerNorm<<>>( - output, - mean, - invvar, - input, - n1,n2, - U(epsilon), - gamma,beta); -} - - -void cuda_layer_norm( - at::Tensor* output, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - #ifdef VERSION_GE_1_1 - at::IntArrayRef normalized_shape, - #else - at::IntList normalized_shape, - #endif - at::Tensor* gamma, - at::Tensor* beta, - double epsilon) -{ - using namespace at; - DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES( - input->scalar_type(), output->scalar_type(), "cuda_layer_norm_kernel", - HostApplyLayerNorm( - output->DATA_PTR(), - mean->DATA_PTR(), - invvar->DATA_PTR(), - input->DATA_PTR(), - n1,n2, - epsilon, - gamma != NULL ? gamma->DATA_PTR() : NULL, - beta != NULL ? beta->DATA_PTR() : NULL); - ) -} - - -template -void HostLayerNormGradient( - const V* dout, - const U* mean, - const U* invvar, - at::Tensor* input, - int n1, - int n2, - const V* gamma, - const V* beta, - double epsilon, - T* grad_input, - V* grad_gamma, - V* grad_beta - ) -{ - auto stream = at::cuda::getCurrentCUDAStream().stream(); - - if (gamma != NULL && beta != NULL) { - // compute grad_gamma(j) and grad_beta(j) - const int part_size = 16; - const dim3 threads2(32,4,1); - const dim3 blocks2((n2+threads2.x-1)/threads2.x,part_size,1); - const int nshared2_a = 2 * sizeof(U) * threads2.y * threads2.y * - (threads2.x + 1); - const int nshared2_b = threads2.x * threads2.y * sizeof(U); - const int nshared2 = nshared2_a > nshared2_b ? 
nshared2_a : nshared2_b; - at::Tensor part_grad_gamma = at::empty( - {part_size,n2}, input->options().dtype(at::ScalarType::Float)); - at::Tensor part_grad_beta = at::empty_like(part_grad_gamma); - cuComputePartGradGammaBeta<<>>( - dout, - input->DATA_PTR(), - n1,n2, - mean, - invvar, - U(epsilon), - part_grad_gamma.DATA_PTR(), - part_grad_beta.DATA_PTR()); - - const dim3 threads3(32,8,1); - const dim3 blocks3((n2+threads2.x-1)/threads2.x,1,1); - const int nshared3 = threads3.x * threads3.y * sizeof(U); - cuComputeGradGammaBeta<<>>( - part_grad_gamma.DATA_PTR(), - part_grad_beta.DATA_PTR(), - part_size, - n1,n2, - grad_gamma, - grad_beta); - } - - // compute grad_input - const uint64_t maxGridY = - at::cuda::getCurrentDeviceProperties()->maxGridSize[1]; - const dim3 blocks1(1, std::min((uint64_t)n1, maxGridY), 1); - const dim3 threads1(32,4,1); - int nshared = - threads1.y > 1 ? - threads1.y*threads1.x*sizeof(U) : - 0; - cuComputeGradInput<<>>( - dout, - input->DATA_PTR(), - n1,n2, - mean, - invvar, - U(epsilon), - gamma, - grad_input); -} - - -void cuda_layer_norm_gradient( - at::Tensor* dout, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - #ifdef VERSION_GE_1_1 - at::IntArrayRef normalized_shape, - #else - at::IntList normalized_shape, - #endif - at::Tensor* gamma, - at::Tensor* beta, - double epsilon, - at::Tensor* grad_input, - at::Tensor* grad_gamma, - at::Tensor* grad_beta) -{ - using namespace at; - DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES( - input->scalar_type(), gamma->scalar_type(), - "cuda_layer_norm_gradient_kernel", - HostLayerNormGradient( - dout->DATA_PTR(), - mean->DATA_PTR(), - invvar->DATA_PTR(), - input, - n1,n2, - // TMJ pass NULL argument for gamma, beta, grad_gamma and grad_beta - // if gamma Tensor is NULL on input. - gamma != NULL ? gamma->DATA_PTR() : NULL, - gamma != NULL ? beta->DATA_PTR() : NULL, - epsilon, - grad_input->DATA_PTR(), - gamma != NULL ? grad_gamma->DATA_PTR() : NULL, - gamma != NULL ? grad_beta->DATA_PTR() : NULL); - ) -} diff --git a/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu b/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu deleted file mode 100644 index 633e2d63fd12304c9e503d78625d8e43d9f74f8a..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu +++ /dev/null @@ -1,177 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_adam.cu -#include -#include -#include -#include -// Another possibility: -// #include - -#include - -#include "type_shim.h" -#include "multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -typedef enum -{ - ADAM_MODE_0 = 0, // L2 regularization mode - ADAM_MODE_1 = 1 // Decoupled weight decay mode(AdamW) -} adamMode_t; - -using MATH_T = float; - -template -struct AdamFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<4> &tl, - const float beta1, - const float beta2, - const float beta1_correction, - const float beta2_correction, - const float epsilon, - const float lr, - adamMode_t mode, - const float decay) - { - // I'd like this kernel to propagate infs/nans. 
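// In scalar form, the ILP loops below implement the usual Adam step, with
// `mode` selecting where weight decay enters:
//   ADAM_MODE_0 (L2):    g += decay * p
//   m = beta1 * m + (1 - beta1) * g;   v = beta2 * v + (1 - beta2) * g * g
//   m_hat = m / beta1_correction;      v_hat = v / beta2_correction
//   ADAM_MODE_0:         p -= lr * m_hat / (sqrt(v_hat) + epsilon)
//   ADAM_MODE_1 (AdamW): p -= lr * (m_hat / (sqrt(v_hat) + epsilon) + decay * p)
// where beta{1,2}_correction = 1 - beta{1,2}^step when bias correction is on.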
- // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - - // potentially use to pass in list of scalar - // int tensor_num = tl.start_tensor_this_launch + tensor_loc; - - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - T *g = (T *)tl.addresses[0][tensor_loc]; - g += chunk_idx * chunk_size; - - T *p = (T *)tl.addresses[1][tensor_loc]; - p += chunk_idx * chunk_size; - - T *m = (T *)tl.addresses[2][tensor_loc]; - m += chunk_idx * chunk_size; - - T *v = (T *)tl.addresses[3][tensor_loc]; - v += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - // see note in multi_tensor_scale_kernel.cu - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { - MATH_T r_g[ILP]; - MATH_T r_p[ILP]; - MATH_T r_m[ILP]; - MATH_T r_v[ILP]; -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - r_g[ii] = g[i]; - r_p[ii] = p[i]; - r_m[ii] = m[i]; - r_v[ii] = v[i]; - } - else - { - r_g[ii] = MATH_T(0); - r_p[ii] = MATH_T(0); - r_m[ii] = MATH_T(0); - r_v[ii] = MATH_T(0); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - if (mode == ADAM_MODE_0) - { // L2 - r_g[ii] = r_g[ii] + (decay * r_p[ii]); - r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii]; - r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii]; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - MATH_T update = next_m_unbiased / denom; - r_p[ii] = r_p[ii] - (lr * update); - } - else - { // weight decay - r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii]; - r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii]; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - MATH_T update = (next_m_unbiased / denom) + (decay * r_p[ii]); - r_p[ii] = r_p[ii] - (lr * update); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - p[i] = r_p[ii]; - m[i] = r_m[ii]; - v[i] = r_v[ii]; - } - } - } - } -}; - -void multi_tensor_adam_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - const float lr, - const float beta1, - const float beta2, - const float epsilon, - const int step, - const int mode, - const int bias_correction, - const float weight_decay) -{ - using namespace at; - - // Handle bias correction mode - float bias_correction1 = 1.0f, bias_correction2 = 1.0f; - if (bias_correction == 1) - { - bias_correction1 = 1 - std::pow(beta1, step); - bias_correction2 = 1 - std::pow(beta2, step); - } - - // Assume single type across p,g,m1,m2 now - DISPATCH_DOUBLE_FLOAT_AND_HALF( - tensor_lists[0][0].scalar_type(), 0, "adam", - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - AdamFunctor(), - beta1, - beta2, - bias_correction1, - bias_correction2, - epsilon, - lr, - (adamMode_t)mode, - weight_decay);) - - AT_CUDA_CHECK(cudaGetLastError()); -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh b/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh deleted file mode 100644 index 9ce41191133eeac43534d5d4d33dc0071c4ccb27..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh +++ 
/dev/null @@ -1,133 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_apply.cuh -#include -#include -#include -#include -#include -#include "compat.h" - -#include - -// #include - -// This header is the one-stop shop for all your multi-tensor apply needs. - -// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson) -constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30}; -constexpr int depth_to_max_blocks[5] = {320, 320, 320, 320, 320}; - -template -struct TensorListMetadata -{ - void *addresses[n][depth_to_max_tensors[n - 1]]; - int sizes[depth_to_max_tensors[n - 1]]; - unsigned char block_to_tensor[depth_to_max_blocks[n - 1]]; - int block_to_chunk[depth_to_max_blocks[n - 1]]; // I fear this needs to be a full int. - int start_tensor_this_launch; -}; - -template -__global__ void multi_tensor_apply_kernel( - int chunk_size, - volatile int *noop_flag, - T tl, - U callable, - ArgTypes... args) -{ - // Hand the chunk information to the user-supplied functor to process however it likes. - callable(chunk_size, noop_flag, tl, args...); -} - -template -void multi_tensor_apply( - int block_size, - int chunk_size, - const at::Tensor &noop_flag, - const std::vector> &tensor_lists, - T callable, - ArgTypes... args) -{ - TORCH_CHECK(tensor_lists.size() == depth, "tensor_lists.size() != depth"); - int len0 = tensor_lists[0].size(); - TORCH_CHECK(len0 > 0, "tensor_lists[0].size() is not > 0"); - auto ref_device = tensor_lists[0][0].device(); - TORCH_CHECK(ref_device.type() == at::kCUDA, "expected input to be on cuda"); - for (int l = 0; l < tensor_lists.size(); l++) // No range-based for because I need indices - { - TORCH_CHECK(tensor_lists[l].size() == len0, "Size mismatch among tensor lists"); - for (int t = 0; t < tensor_lists[l].size(); t++) - { - // TODO: Print which tensor fails. 
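// The checks below enforce the preconditions the packing loop relies on:
// every list has the same length as list 0, every tensor is contiguous
// (or channels-last on newer PyTorch), lives on the same CUDA device as
// tensor_lists[0][0], and matches the element count of its list-0 partner.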
- bool contiguous_memory = tensor_lists[l][t].is_contiguous(); -#ifdef VERSION_GE_1_5 - contiguous_memory = (contiguous_memory || tensor_lists[l][t].is_contiguous(at::MemoryFormat::ChannelsLast)); -#endif - TORCH_CHECK(contiguous_memory, "A tensor was not contiguous."); - TORCH_CHECK(tensor_lists[l][t].device() == ref_device, "A tensor was not on the same device as the first tensor"); - TORCH_CHECK(tensor_lists[l][t].numel() == tensor_lists[0][t].numel(), "Size mismatch"); - } - } - - int ntensors = tensor_lists[0].size(); - - TensorListMetadata tl; - - const at::cuda::OptionalCUDAGuard device_guard(device_of(tensor_lists[0][0])); - auto stream = at::cuda::getCurrentCUDAStream(); - - tl.start_tensor_this_launch = 0; - int loc_block_info = 0; - int loc_tensor_info = 0; - for (int t = 0; t < ntensors; t++) - { - tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel(); - for (int d = 0; d < depth; d++) - tl.addresses[d][loc_tensor_info] = tensor_lists[d][t].data_ptr(); - loc_tensor_info++; - - int chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size; - - for (int chunk = 0; chunk < chunks_this_tensor; chunk++) - { - // std::cout << chunks_this_tensor << std::endl; - tl.block_to_tensor[loc_block_info] = loc_tensor_info - 1; - tl.block_to_chunk[loc_block_info] = chunk; - loc_block_info++; - - bool tensors_full = (loc_tensor_info == depth_to_max_tensors[depth - 1] && - chunk == chunks_this_tensor - 1); - bool blocks_full = (loc_block_info == depth_to_max_blocks[depth - 1]); - bool last_chunk = (t == ntensors - 1 && chunk == chunks_this_tensor - 1); - if (tensors_full || blocks_full || last_chunk) - { - // using accscalar_t = acc_type; - multi_tensor_apply_kernel<<>>( - chunk_size, - noop_flag.DATA_PTR(), - tl, - callable, - args...); - - AT_CUDA_CHECK(cudaGetLastError()); - - // Reset. The control flow possibilities here make my brain hurt. 
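// Two cases, mirroring the code below: if this launch ended exactly at a
// tensor boundary, clear both counters and start the next launch at tensor
// t + 1; if it ended mid-tensor (the block table filled first), copy the
// current tensor's size/addresses down to slot 0 so its remaining chunks
// can be emitted by the next launch.
// E.g. with chunk_size = 512 and tensor sizes {1000, 300}, tensor 0 yields
// (1000 + 511) / 512 = 2 chunks and tensor 1 yields 1, so blocks 0..2 of a
// launch map to (t0,c0), (t0,c1), (t1,c0) via block_to_tensor/block_to_chunk.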
- loc_block_info = 0; - if (chunk == chunks_this_tensor - 1) - { - // std::cout << "Hit case 1 " << cond1 << " " << cond2 << " " << cond3 << std::endl; - loc_tensor_info = 0; - tl.start_tensor_this_launch = t + 1; - } - else - { - // std::cout << "Hit case 2 " << cond1 << " " << cond2 << " " << cond3 << std::endl; - tl.sizes[0] = tl.sizes[loc_tensor_info - 1]; - for (int d = 0; d < depth; d++) - tl.addresses[d][0] = tl.addresses[d][loc_tensor_info - 1]; - loc_tensor_info = 1; - tl.start_tensor_this_launch = t; - } - } - } - } -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu b/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu deleted file mode 100644 index 03f60b34c5fc252a0a50708a8c4794c2afa76cf5..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu +++ /dev/null @@ -1,455 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_l2norm_kernel.cu -#include -#include -#include -#include -#include -// Another possibility: -// #include - -#include - -#include "type_shim.h" -#include "multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -template -__device__ __forceinline__ bool is_aligned(T *p) -{ - return ((uint64_t)p) % (ILP * sizeof(T)) == 0; -} - -template -__device__ __forceinline__ void load_store(T *dst, T *src, int dst_offset, int src_offset) -{ - typedef typename std::aligned_storage::type LT; - ((LT *)dst)[dst_offset] = ((LT *)src)[src_offset]; -} - -template -struct L2NormFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<1> &tl, - float *output, - float *output_per_tensor, - bool per_tensor, - int max_chunks_per_tensor) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - x_t *x = (x_t *)tl.addresses[0][tensor_loc]; - x += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - __shared__ float s_vals[512]; - - float vals[ILP]; // = {0}; // this probably works too but I want to be sure... - x_t r_x[ILP]; - for (int i = 0; i < ILP; i++) - { - vals[i] = 0.f; - r_x[i] = 0; - } - - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(x)) - { - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_x, x, 0, i_start); -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - float next = static_cast(r_x[ii]); - vals[ii] += next * next; - } - } - } - else - { - for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) - { -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - float next = static_cast(x[i]); - vals[ii] += next * next; - } - } - } - } - - float val = 0.f; - for (int i = 0; i < ILP; i++) - val += vals[i]; - - float final = reduce_block_into_lanes(s_vals, val); - - if (threadIdx.x == 0) - { - if (!isfinite(final)) - *noop_gmem = 1; // Blindly fire off a write. These will race but that's ok. 
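// Each block accumulates its chunk's sum of squares into output[blockIdx.x];
// output is zero-initialized on the host with one slot per schedulable block
// (320, matching depth_to_max_blocks), and the sqrt over the summed partials
// is deferred to the cleanup kernel.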
- output[blockIdx.x] += final; - if (per_tensor) - output_per_tensor[(tl.start_tensor_this_launch + tensor_loc) * max_chunks_per_tensor + chunk_idx] = final; - } - } -}; - -// Probably better to template, but since we are not likely to support other norm -template -struct MaxNormFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<1> &tl, - float *output, - float *output_per_tensor, - bool per_tensor, - int max_chunks_per_tensor) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - x_t *x = (x_t *)tl.addresses[0][tensor_loc]; - x += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - __shared__ float s_vals[512]; - - float vals[ILP]; // = {0}; // this probably works too but I want to be sure... - x_t r_x[ILP]; - for (int i = 0; i < ILP; i++) - { - vals[i] = 0.f; - r_x[i] = 0; - } - - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(x)) - { - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_x, x, 0, i_start); -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - float next = static_cast(r_x[ii]); - vals[ii] = fmaxf(fabsf(vals[ii]), fabsf(next)); - } - } - } - else - { - for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) - { -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - float next = static_cast(x[i]); - vals[ii] = fmaxf(fabsf(vals[ii]), fabsf(next)); - } - } - } - } - - float val = 0.f; - for (int i = 0; i < ILP; i++) - val = fmaxf(fabsf(val), fabsf(vals[i])); - - float final = reduce_block_into_lanes_max_op(s_vals, val); - - if (threadIdx.x == 0) - { - if (!isfinite(final)) - *noop_gmem = 1; // Blindly fire off a write. These will race but that's ok. 
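// L-inf path: per-chunk partials combine by max rather than sum, so cleanup
// needs no sqrt. As the comment before multi_tensor_norm_out_cuda notes, old
// and new norms are then blended as
//   L-2:   gn = sqrt(alpha * gn^2 + beta * n^2)
//   L-inf: gn = alpha * gn + beta * n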
- output[blockIdx.x] = fmaxf(fabsf(output[blockIdx.x]), fabsf(final)); - if (per_tensor) - output_per_tensor[(tl.start_tensor_this_launch + tensor_loc) * max_chunks_per_tensor + chunk_idx] = final; - } - } -}; - -__global__ void cleanup( - float *output, - float *output_per_tensor, - float *ret, - float *ret_per_tensor, - bool per_tensor, - int max_chunks_per_tensor) -{ - __shared__ float vals[512]; - - if (blockIdx.x == 0) - { - float val = 0; - if (threadIdx.x < 320) - val = output[threadIdx.x]; - - float final = reduce_block_into_lanes(vals, val); - - if (threadIdx.x == 0) - *ret = sqrt(final); - } - - if (per_tensor) - { - float *output_this_tensor = output_per_tensor + blockIdx.x * max_chunks_per_tensor; - - float val = 0; - for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) - val += output_this_tensor[i]; - - float final = reduce_block_into_lanes(vals, val); - - if (threadIdx.x == 0) - ret_per_tensor[blockIdx.x] = sqrt(final); - } -} - -__global__ void cleanup_v2( - float *output, - float *output_per_tensor, - float *ret, - float *ret_per_tensor, - bool per_tensor, - int max_chunks_per_tensor, - int norm_type, - float alpha, - float beta) -{ - __shared__ float vals[512]; - - if (blockIdx.x == 0) - { - float val = 0; - if (threadIdx.x < 320) - val = output[threadIdx.x]; - - if (norm_type == 0) - { - float final = reduce_block_into_lanes_max_op(vals, val); - if (threadIdx.x == 0) - *ret = alpha * (*ret) + beta * final; - } - else - { - float final = reduce_block_into_lanes(vals, val); - if (threadIdx.x == 0) - *ret = sqrt(alpha * (*ret) * (*ret) + beta * final); - } - } - - if (per_tensor) - { - float *output_this_tensor = output_per_tensor + blockIdx.x * max_chunks_per_tensor; - - if (norm_type == 0) - { - float val = 0; - for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) - val = fmaxf(fabsf(val), fabsf(output_this_tensor[i])); - - float final = reduce_block_into_lanes_max_op(vals, val); - - if (threadIdx.x == 0) - ret_per_tensor[blockIdx.x] = alpha * ret_per_tensor[blockIdx.x] + beta * final; - } - else - { - float val = 0; - for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) - val += output_this_tensor[i]; - - float final = reduce_block_into_lanes(vals, val); - - if (threadIdx.x == 0) - ret_per_tensor[blockIdx.x] = sqrt(alpha * ret_per_tensor[blockIdx.x] * ret_per_tensor[blockIdx.x] + beta * final); - } - } -} - -std::tuple multi_tensor_l2norm_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::optional per_tensor_python) -{ - bool per_tensor = per_tensor_python.has_value() ? 
per_tensor_python.value() : false; - - auto float_options = tensor_lists[0][0].options().dtype(at::kFloat); - auto output = at::zeros({320}, float_options); - - at::Tensor output_per_tensor; - at::Tensor ret_per_tensor; - - int ntensors = tensor_lists[0].size(); - int max_chunks_per_tensor = -1; - - if (per_tensor) - { - for (int t = 0; t < ntensors; t++) - { - int max_chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size; - if (max_chunks_this_tensor > max_chunks_per_tensor) - max_chunks_per_tensor = max_chunks_this_tensor; - } - output_per_tensor = at::zeros({ntensors * max_chunks_per_tensor}, float_options); - ret_per_tensor = at::empty({ntensors}, float_options); - } - else - { - ret_per_tensor = at::empty({0}, float_options); - } - - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "multi_tensor_l2norm_cuda", - multi_tensor_apply<1>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - L2NormFunctor(), - output.DATA_PTR(), - per_tensor ? output_per_tensor.DATA_PTR() : nullptr, - per_tensor, - max_chunks_per_tensor);) - - AT_CUDA_CHECK(cudaGetLastError()); - // AT_CUDA_CHECK(cudaDeviceSynchronize()); - - // This involves one more small kernel launches, but will be negligible end to end. - // I could get rid of these by hacking the functor + multi tensor harness with persistence - // logic, but keeping it simple for now - auto ret = at::empty({1}, output.options()); - const at::cuda::OptionalCUDAGuard device_guard(device_of(output)); - auto stream = at::cuda::getCurrentCUDAStream(); - cleanup<<>>( - output.DATA_PTR(), - per_tensor ? output_per_tensor.DATA_PTR() : nullptr, - ret.DATA_PTR(), - per_tensor ? ret_per_tensor.DATA_PTR() : nullptr, - per_tensor, - max_chunks_per_tensor); - - return std::tuple(ret, ret_per_tensor); -} - -// Compute and update grad norm -// Here use a per tensor norm, and blend new norm(n) and old norm(gn) by -// L-2: gn = sqrt(a * gn^2 + b * n^2) -// L-inf: gn = a * gn + b * n -void multi_tensor_norm_out_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::Tensor out, - const float alpha, - const float beta, - const int norm_type) -{ - auto float_options = tensor_lists[0][0].options().dtype(at::kFloat); - TORCH_CHECK(tensor_lists[0][0].device() == noop_flag.device(), "noop flag should be on the same device as tensors"); - // we don't need global thus uses empty here - auto output = at::empty({320}, float_options); - - at::Tensor output_per_tensor; - at::Tensor ret_per_tensor; - - int ntensors = tensor_lists[0].size(); - int max_chunks_per_tensor = -1; - - for (int t = 0; t < ntensors; t++) - { - int max_chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size; - if (max_chunks_this_tensor > max_chunks_per_tensor) - max_chunks_per_tensor = max_chunks_this_tensor; - } - - // Although it is single write then read, still need to be zero - // Since tailing element also participate cleanup - output_per_tensor = at::zeros({ntensors * max_chunks_per_tensor}, float_options); - - if (norm_type == 0) - { - DISPATCH_FLOAT_AND_HALF( - tensor_lists[0][0].scalar_type(), 0, "multi_tensor_maxnorm_cuda", - multi_tensor_apply<1>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - MaxNormFunctor(), - output.DATA_PTR(), - output_per_tensor.DATA_PTR(), - true, - max_chunks_per_tensor);) - } - else - { - DISPATCH_FLOAT_AND_HALF( - tensor_lists[0][0].scalar_type(), 0, "multi_tensor_l2norm_cuda", - multi_tensor_apply<1>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - 
L2NormFunctor(), - output.DATA_PTR(), - output_per_tensor.DATA_PTR(), - true, - max_chunks_per_tensor);) - } - AT_CUDA_CHECK(cudaGetLastError()); - - // AT_CUDA_CHECK(cudaDeviceSynchronize()); - - // This involves one more small kernel launches, but will be negligible end to end. - // I could get rid of these by hacking the functor + multi tensor harness with persistence - // logic, but keeping it simple for now - auto ret = at::empty({1}, output.options()); - - // Adding the following device guard since it happens sometimes that the - // tensors are on one device and the cuda stream is on another device which - // results in ILLEGAL MEM ACCESS error. - const at::cuda::OptionalCUDAGuard device_guard(device_of(output)); - auto stream = at::cuda::getCurrentCUDAStream(); - cleanup_v2<<>>( - output.DATA_PTR(), - output_per_tensor.DATA_PTR(), - ret.DATA_PTR(), - out.DATA_PTR(), - true, - max_chunks_per_tensor, - norm_type, - alpha, - beta); - - return; -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu b/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu deleted file mode 100644 index d67ce92cd3f8f576aeb8a31e36b1cf4a4ed455e1..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu +++ /dev/null @@ -1,427 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_lamb.cu -#include -#include -#include -#include -// Another possibility: -// #include - -#include - -#include "type_shim.h" -#include "multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -template -__device__ __forceinline__ bool is_aligned(T *p) -{ - return ((uint64_t)p) % (ILP * sizeof(T)) == 0; -} - -template -__device__ __forceinline__ void load_store(T *dst, T *src, int dst_offset, int src_offset) -{ - typedef typename std::aligned_storage::type LT; - ((LT *)dst)[dst_offset] = ((LT *)src)[src_offset]; -} - -typedef enum -{ - MOMENT_MODE_0 = 0, // L2 regularization mode - MOMENT_MODE_1 = 1 // Decoupled weight decay mode -} adamMode_t; - -std::tuple multi_tensor_l2norm_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::optional per_tensor_python); - -using MATH_T = float; - -template -struct LAMBStage1Functor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<4> &tl, - const float beta1, - const float beta2, - const float beta3, - const float beta1_correction, - const float beta2_correction, - const float epsilon, - adamMode_t mode, - const float decay, - const float *global_grad_norm, - const float max_global_grad_norm) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - float clipped_global_grad_norm = (*global_grad_norm) > max_global_grad_norm ? 
(*global_grad_norm) / max_global_grad_norm : 1.0f; - - T *g = (T *)tl.addresses[0][tensor_loc]; - g += chunk_idx * chunk_size; - - T *p = (T *)tl.addresses[1][tensor_loc]; - p += chunk_idx * chunk_size; - - T *m = (T *)tl.addresses[2][tensor_loc]; - m += chunk_idx * chunk_size; - - T *v = (T *)tl.addresses[3][tensor_loc]; - v += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - MATH_T r_g[ILP]; - MATH_T r_p[ILP]; - MATH_T r_m[ILP]; - MATH_T r_v[ILP]; - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && - chunk_size % ILP == 0 && - is_aligned(g) && - is_aligned(p) && - is_aligned(m) && - is_aligned(v)) - { - T l_g[ILP]; - T l_p[ILP]; - T l_m[ILP]; - T l_v[ILP]; - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(l_g, g, 0, i_start); - if (decay != 0) - load_store(l_p, p, 0, i_start); - load_store(l_m, m, 0, i_start); - load_store(l_v, v, 0, i_start); - // unpack -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - r_g[ii] = l_g[ii]; - if (decay == 0) - { - r_p[ii] = MATH_T(0); - } - else - { - r_p[ii] = l_p[ii]; - } - r_m[ii] = l_m[ii]; - r_v[ii] = l_v[ii]; - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - if (mode == MOMENT_MODE_0) - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - // L2 on scaled grad - scaled_grad = scaled_grad + decay * r_p[ii]; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = next_m_unbiased / denom; - } - else - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = (next_m_unbiased / denom) + (decay * r_p[ii]); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - l_p[ii] = r_p[ii]; - l_m[ii] = r_m[ii]; - l_v[ii] = r_v[ii]; - } - // store - load_store(g, l_p, i_start, 0); - load_store(m, l_m, i_start, 0); - load_store(v, l_v, i_start, 0); - } - } - else - { - // see note in multi_tensor_scale_kernel.cu - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { - MATH_T r_g[ILP]; - MATH_T r_p[ILP]; - MATH_T r_m[ILP]; - MATH_T r_v[ILP]; -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - r_g[ii] = g[i]; - // special ?optimization? 
for lamb stage 1 - if (decay == 0) - { - r_p[ii] = MATH_T(0); - } - else - { - r_p[ii] = p[i]; - } - r_m[ii] = m[i]; - r_v[ii] = v[i]; - } - else - { - r_g[ii] = MATH_T(0); - r_p[ii] = MATH_T(0); - r_m[ii] = MATH_T(0); - r_v[ii] = MATH_T(0); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - if (mode == MOMENT_MODE_0) - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - // L2 on scaled grad - scaled_grad = scaled_grad + decay * r_p[ii]; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = next_m_unbiased / denom; - } - else - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = (next_m_unbiased / denom) + (decay * r_p[ii]); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - g[i] = r_p[ii]; - m[i] = r_m[ii]; - v[i] = r_v[ii]; - } - } - } - } - } -}; - -// Step 2 reads in 'update' value and per-tensor param_norm and update_norm. -// It computes new parameter value. -template -struct LAMBStage2Functor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<2> &tl, - const float *per_tensor_param_norm, - const float *per_tensor_update_norm, - const float learning_rate, - const float decay, - bool use_nvlamb) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int tensor_num = tl.start_tensor_this_launch + tensor_loc; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - MATH_T ratio = learning_rate; - // nvlamb: apply adaptive learning rate to all parameters - // otherwise, only apply to those with non-zero weight decay - if (use_nvlamb || (decay != 0.0)) - { - float param_norm = per_tensor_param_norm[tensor_num]; - float update_norm = per_tensor_update_norm[tensor_num]; - ratio = (update_norm != 0.0f && param_norm != 0.0f) ? 
learning_rate * (param_norm / update_norm) : learning_rate; - } - - T *update = (T *)tl.addresses[0][tensor_loc]; - update += chunk_idx * chunk_size; - - T *p = (T *)tl.addresses[1][tensor_loc]; - p += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && - chunk_size % ILP == 0 && - is_aligned(p) && - is_aligned(update)) - { - T r_p[ILP]; - T r_update[ILP]; - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_p, p, 0, i_start); - load_store(r_update, update, 0, i_start); -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - r_p[ii] = static_cast(r_p[ii]) - (ratio * static_cast(r_update[ii])); - } - load_store(p, r_p, i_start, 0); - } - } - else - { - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { - MATH_T r_p[ILP]; - MATH_T r_update[ILP]; -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - r_p[ii] = p[i]; - r_update[ii] = update[i]; - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - r_p[ii] = r_p[ii] - (ratio * r_update[ii]); - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - p[i] = r_p[ii]; - } - } - } - } - } -}; - -void multi_tensor_lamb_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - const float lr, - const float beta1, - const float beta2, - const float epsilon, - const int step, - const int bias_correction, - const float weight_decay, - const int grad_averaging, - const int mode, - at::Tensor global_grad_norm, - const float max_grad_norm, - at::optional use_nvlamb_python) -{ - using namespace at; - // Master weight and 32bit momentum(potentially changing) is not handled by this - // So we assume every tensor are all in the same type - - bool use_nvlamb = use_nvlamb_python.has_value() ? 
use_nvlamb_python.value() : false; - - // Handle bias correction mode - float bias_correction1 = 1.0f, bias_correction2 = 1.0f; - if (bias_correction == 1) - { - bias_correction1 = 1 - std::pow(beta1, step); - bias_correction2 = 1 - std::pow(beta2, step); - } - - // Handle grad averaging mode - float beta3 = 1.0f; - if (grad_averaging == 1) - beta3 = 1 - beta1; - - std::vector> grad_list(tensor_lists.begin(), tensor_lists.begin() + 1); - std::vector> param_list(tensor_lists.begin() + 1, tensor_lists.begin() + 2); - - // Compute per tensor param norm - auto param_norm_tuple = multi_tensor_l2norm_cuda(chunk_size, noop_flag, param_list, true); - - // We now in-place modify grad to store update before compute its norm - // Generally this is not a issue since people modify grad in step() method all the time - // We can also grab list of empty tensor to avoid this, but I'd like to save space/cpu code - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "lamb_stage_1", - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - LAMBStage1Functor(), - beta1, - beta2, - beta3, // 1-beta1 or 1 depends on averaging mode - bias_correction1, - bias_correction2, - epsilon, - (adamMode_t)mode, - weight_decay, - global_grad_norm.DATA_PTR(), - max_grad_norm);) - - // Compute update norms - auto update_norm_tuple = multi_tensor_l2norm_cuda(chunk_size, noop_flag, grad_list, true); - - std::vector> grad_param_list(tensor_lists.begin(), tensor_lists.begin() + 2); - - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "lamb_stage_2", - multi_tensor_apply<2>( - BLOCK_SIZE, - chunk_size, - noop_flag, - grad_param_list, - LAMBStage2Functor(), - std::get<1>(param_norm_tuple).DATA_PTR(), - std::get<1>(update_norm_tuple).DATA_PTR(), - lr, - weight_decay, - use_nvlamb);) - - AT_CUDA_CHECK(cudaGetLastError()); -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu b/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu deleted file mode 100644 index 40bd2c7a0f71e69eaf915463fdaf646289f51be1..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu +++ /dev/null @@ -1,136 +0,0 @@ -#include -#include -#include -#include -// Another possibility: -// #include - -#include -// Stringstream is a big hammer, but I want to rely on operator<< for dtype. -#include - -#include "type_shim.h" -#include "multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -template -__device__ __forceinline__ bool is_aligned(T* p){ - return ((uint64_t)p) % (ILP*sizeof(T)) == 0; -} - -template -__device__ __forceinline__ void load_store(T* dst, T* src, int dst_offset, int src_offset){ - typedef typename std::aligned_storage::type LT; - ((LT*)dst)[dst_offset] = ((LT*)src)[src_offset]; -} - -template -struct ScaleFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int* noop_gmem, - TensorListMetadata<2>& tl, - float scale) - { - // I'd like this kernel to propagate infs/nans. 
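// Scalar form of the loop below: out = static_cast<out_t>(in) * scale, with
// `finite` tracking isfinite() of every input and raising the noop flag on
// overflow -- the typical dynamic-loss-scaling pattern, where the caller
// inspects the flag and skips the optimizer step for that iteration.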
- // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - in_t* in = (in_t*)tl.addresses[0][tensor_loc]; - in += chunk_idx*chunk_size; - - out_t* out = (out_t*)tl.addresses[1][tensor_loc]; - out += chunk_idx*chunk_size; - - n -= chunk_idx*chunk_size; - - bool finite = true; - in_t r_in[ILP]; - out_t r_out[ILP]; - - // to make things simple, we put aligned case in a different code path - if(n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(in) && is_aligned(out)) - { - for(int i_start = threadIdx.x; i_start*ILP < n && i_start*ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_in, in, 0 , i_start); -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - r_out[ii] = static_cast(r_in[ii]) * scale; - finite = finite && isfinite(r_in[ii]); - } - // store - load_store(out, r_out, i_start, 0); - } - } - else - { - // Non-divergent exit condition for __syncthreads, not necessary here - for(int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x*ILP) - { -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - r_in[ii] = 0; - int i = i_start + threadIdx.x + ii*blockDim.x; - if(i < n && i < chunk_size) - r_in[ii] = in[i]; - } - // note for clarification to future michael: - // From a pure memory dependency perspective, there's likely no point unrolling - // the write loop, since writes just fire off once their LDGs arrive. - // Put another way, the STGs are dependent on the LDGs, but not on each other. - // There is still compute ILP benefit from unrolling the loop though. -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - r_out[ii] = static_cast(r_in[ii]) * scale; - finite = finite && isfinite(r_in[ii]); - } -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii*blockDim.x; - if(i < n && i < chunk_size) - out[i] = r_out[ii]; - } - } - } - if(!finite) - *noop_gmem = 1; // Blindly fire off a write. These will race but that's ok. - } -}; - -void multi_tensor_scale_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - float scale) -{ - using namespace at; - // The output (downscaled) type is always float. - // If build times suffer, think about where to put this dispatch, - // and what logic should be moved out of multi_tensor_apply. 
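// The nested dispatch below instantiates ScaleFunctor over the full
// (input dtype x output dtype) float/half product -- e.g. half gradients
// unscaled into float master gradients -- one dispatch macro per tensor list.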
- - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "multi_tensor_scale_cuda", - DISPATCH_FLOAT_AND_HALF(tensor_lists[1][0].scalar_type(), 1, "multi_tensor_scale_cuda", - multi_tensor_apply<2>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - ScaleFunctor(), - scale); )) - AT_CUDA_CHECK(cudaGetLastError()); - - // AT_CUDA_CHECK(cudaDeviceSynchronize()); -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu b/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu deleted file mode 100644 index bc30e272282ff8e4c4e457bff2dd6b662d46ef0f..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu +++ /dev/null @@ -1,282 +0,0 @@ -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_sgd_kernel.cu -#include -#include -#include -#include -#include "multi_tensor_apply.cuh" -#include "compat.h" - -#include -#include - -#define BLOCK_SIZE 512 -#define ILP 4 - -/** - * Perform fused SGD on multiple buffers - * N: number of tensors - * tl[0] : gradients - * tl[1] : weights - * tl[2] : momentum buffers - * tl[3] : fp16 weights (if appropriate) - * wd : weight_decay (scalar) - * momentum : momentum (scalar) - * dampening : momentum dampening (scalar) - * lr : learning rate (scalar) - * nesterov : enable nesterov (bool) - * first run : necessary for proper momentum handling & init - * wd_after_momentum : apply weight decay _after_ momentum instead of before - **/ -template -struct SGDFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata &tl, - float wd, - float momentum, - float dampening, - float lr, - bool nesterov, - bool first_run, - bool wd_after_momentum, - float scale) - { - // Early exit if we don't need to do anything - if (*noop_gmem) - return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - T_grad *grad_in = (T_grad *)tl.addresses[0][tensor_loc]; - grad_in += chunk_idx * chunk_size; - - T_weight *weight_in = (T_weight *)tl.addresses[1][tensor_loc]; - weight_in += chunk_idx * chunk_size; - - T_weight *mom_in = (T_weight *)tl.addresses[2][tensor_loc]; - mom_in += chunk_idx * chunk_size; - - at::Half *model_weights_out = nullptr; - if (N == 4) - { - model_weights_out = (at::Half *)tl.addresses[3][tensor_loc]; - model_weights_out += chunk_idx * chunk_size; - } - - n -= chunk_idx * chunk_size; - - // Non-divergent exit condition for the __syncthreads - float incoming_grads[ILP]; - float incoming_weights[ILP]; - float incoming_moms[ILP]; - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - incoming_grads[ii] = 0; - incoming_weights[ii] = 0; - incoming_moms[ii] = 0; - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - incoming_grads[ii] = static_cast(grad_in[i]) * scale; - incoming_weights[ii] = static_cast(weight_in[i]); - incoming_moms[ii] = static_cast(mom_in[i]); - } - } - -// note for clarification to future michael: -// From a pure memory dependency perspective, there's likely no point unrolling -// the write loop, since writes just fire off once their LDGs arrive. -// Put another way, the STGs are dependent on the LDGs, but not on each other. -// There is still compute ILP benefit from unrolling the loop though. 
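// Scalar form of the fused update applied below:
//   g = grad * scale                  (already applied in the load loop above)
//   g += wd * w                       (if wd != 0 and !wd_after_momentum)
//   m = first_run ? g : momentum * m + (1 - dampening) * g   (if momentum != 0)
//   g = nesterov ? g + momentum * m : m
//   g += wd * w                       (if wd != 0 and wd_after_momentum)
//   w -= lr * g;  optionally mirror w to the fp16 copy and store m back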
-#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - // apply weight decay before momentum if necessary - if (wd != 0.f && !wd_after_momentum) - incoming_grads[ii] += wd * incoming_weights[ii]; - - if (momentum != 0.f) - { - if (!first_run) - incoming_moms[ii] = incoming_moms[ii] * momentum + (1.f - dampening) * incoming_grads[ii]; - else // initialize momentums to current incoming grads - incoming_moms[ii] = incoming_grads[ii]; - - if (nesterov) - incoming_grads[ii] += momentum * incoming_moms[ii]; - else - incoming_grads[ii] = incoming_moms[ii]; - } - - // Apply WD after momentum if desired - if (wd != 0.f && wd_after_momentum) - incoming_grads[ii] += wd * incoming_weights[ii]; - - // adjust the weight and write out - weight_in[i] += (-lr * incoming_grads[ii]); - - // if necessary, write out an fp16 copy of the weights - if (N == 4) - model_weights_out[i] = static_cast(weight_in[i]); - - // also write out the new momentum - if (momentum != 0.f) - mom_in[i] = incoming_moms[ii]; - } - } - } - } -}; - -void multi_tensor_sgd_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - float wd, - float momentum, - float dampening, - float lr, - bool nesterov, - bool first_run, - bool wd_after_momentum, - float scale) -{ - auto num_tensors = tensor_lists.size(); - auto grad_type = tensor_lists[0][0].scalar_type(); - auto weight_type = tensor_lists[1][0].scalar_type(); - - if (num_tensors == 4) - for (int i = 0; i < tensor_lists[3].size(); i++) - TORCH_CHECK(tensor_lists[3][i].scalar_type() == at::ScalarType::Half, - "Additional output tensors should always be fp16."); - - TORCH_CHECK(noop_flag.device() == tensor_lists[0][0].device(), "expected noop flag to be on the same device as tensors"); - - // We have 3 possibilities to handle here, in terms of - // grad_type, param_type, momentum_type, requires_fp16_copy - // 1. fp16, fp16, fp16, No - // 2. fp32, fp32, fp32, No - // 3. fp16, fp32, fp32, Yes - // 4. fp32, fp32, fp32, Yes // this is the materialize_master_grads=True case - // It's easier to hardcode these possibilities than to use - // switches etc. to handle the cross-product of cases where - // we don't want the majority of them. - - // Case 1. fp16, fp16, fp16, No - if (grad_type == at::ScalarType::Half && - weight_type == at::ScalarType::Half && - num_tensors == 3) - { - multi_tensor_apply<3>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<3, at::Half, at::Half>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - // Case 2. fp16, fp32, fp32, No - // else if (grad_type == at::ScalarType::Half && - // weight_type == at::ScalarType::Float && - // num_tensors == 3) { - // multi_tensor_apply<3>( - // BLOCK_SIZE, - // chunk_size, - // noop_flag, - // tensor_lists, - // SGDFunctor<3, at::Half, float>(), - // wd, - // momentum, - // dampening, - // lr, - // nesterov, - // first_run, - // wd_after_momentum); - // } - // Case 2. fp32, fp32, fp32, No - else if (grad_type == at::ScalarType::Float && - weight_type == at::ScalarType::Float && - num_tensors == 3) - { - multi_tensor_apply<3>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<3, float, float>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - // Case 3. 
fp16, fp32, fp32, Yes - else if (grad_type == at::ScalarType::Half && - weight_type == at::ScalarType::Float && - num_tensors == 4) - { - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<4, at::Half, float>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - // Case 4. fp32, fp32, fp32, Yes - else if (grad_type == at::ScalarType::Float && - weight_type == at::ScalarType::Float && - num_tensors == 4) - { - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<4, float, float>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - else - { - AT_ERROR("multi_tensor_sgd only supports some combinations of gradient & weight types. Given: ", - "gradient: ", grad_type, ", weight: ", weight_type, ", num_lists: ", num_tensors); - } - - AT_CUDA_CHECK(cudaGetLastError()); -} \ No newline at end of file diff --git a/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp b/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp deleted file mode 100644 index 63bf633f5fab94bafc708207f5752f6ef16a8f53..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp +++ /dev/null @@ -1,364 +0,0 @@ -#include "multihead_attention_1d.h" - -#include -#include - -#include -#include - -#include "context.h" -#include "kernels.h" - -template -MultiHeadAttention::MultiHeadAttention(int layer_id, int max_batch_tokens, int max_seq_len, - int hidden_size, int num_heads, - float attn_prob_dropout_ratio, - float hidden_output_dropout_ratio, - bool pre_or_postLayerNorm) - : _layer_id(layer_id), - _max_batch_tokens(max_batch_tokens), - _max_seq_len(max_seq_len), - _hidden_size(hidden_size), - _heads(num_heads), - _training(true), - _pre_or_postLayerNorm(pre_or_postLayerNorm), - _qkv_linear(typename FeedForward::Config(3 * hidden_size, hidden_size)), - _attn_out_linear(typename FeedForward::Config(hidden_size, hidden_size)), - _attn_ln(typename Normalize_Layer::Config(hidden_size, false), _max_batch_tokens), - _softmax(typename Softmax::Config(num_heads)), - _attn_prob_dropout(typename Dropout::Config(attn_prob_dropout_ratio), - _max_batch_tokens * _heads * _max_seq_len), - _attn_dropout(typename Dropout::Config(hidden_output_dropout_ratio), - _max_batch_tokens * _hidden_size), - _attn_scores(typename StridedBatchGemm::Config((T(1.0) / T(sqrt(_hidden_size / _heads))), - T(0.0), CUBLAS_OP_T, CUBLAS_OP_N)), - _attn_context( - typename StridedBatchGemm::Config(T(1.0), T(0.0), CUBLAS_OP_N, CUBLAS_OP_N)) { - assert(_hidden_size % _heads == 0); -} - -template -MultiHeadAttention::~MultiHeadAttention() { - free_mem_buffer(); -} - -template -void MultiHeadAttention::attn_layer_fw(const T *input_ptr, const T *input_mask_ptr, - T *output_ptr, T *buffer) { - T *q_tf_ptr = _qkv_ptr; - T *k_tf_ptr = q_tf_ptr + _batch_dim / pg_size; - T *v_tf_ptr = k_tf_ptr + _batch_dim / pg_size; - - if (_pre_or_postLayerNorm) { - _attn_ln.Forward(_gemmQKV_inp_ptr, input_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, - _stream); - } - const T *gemmQKV_inp_ptr = _pre_or_postLayerNorm ? 
_gemmQKV_inp_ptr : input_ptr; - _qkv_linear.reset_size(3 * _hidden_size / pg_size, _hidden_size); - _qkv_linear.Forward(_batch_tokens, gemmQKV_inp_ptr, _attn_qkvw_ptr, buffer, _cublasHandle); - - launch_bias_add_transform_20314(q_tf_ptr, buffer, _attn_qkvb_ptr, _batch_size, _seq_len, 3, - _heads / pg_size, _hidden_size / _heads, _stream); - - // attention scores, q*k - _attn_scores.Forward(_batch_heads, _soft_out_ptr, k_tf_ptr, q_tf_ptr, _cublasHandle); - - // Softmax + Mask - _softmax.reset_size(_heads / pg_size); - _softmax.Forward(_soft_out_ptr, input_mask_ptr, _batch_size, _seq_len, _seq_len, _stream, true); - - // attn prob dropout. - _attn_prob_dropout.dropout(_ctx_bufB_ptr, _soft_out_ptr, _batch_heads * _seq_len * _seq_len, - _stream); - - // attention context, score * v - _attn_context.Forward(_batch_heads, buffer, v_tf_ptr, _ctx_bufB_ptr, _cublasHandle); - - // [b, nh, s, ad] -> [b, s, nh, ad] - launch_transform4d_0213(_attn_o_inp_ptr, buffer, _batch_size, _seq_len, _hidden_size / pg_size, - _heads / pg_size, 1, _stream); - - _attn_out_linear.reset_size(_hidden_size, _hidden_size / pg_size); - _attn_out_linear.Forward(_batch_tokens, _attn_o_inp_ptr, _attn_ow_ptr, output_ptr, _cublasHandle); - - // allreduce - if (pg == c10::detail::UniqueVoidPtr() || pg->getSize() == 1) { - } else { - auto data_type = torch::kFloat; - if (typeid(T) != typeid(float)) { - data_type = torch::kHalf; - } - auto output_tensor = - torch::from_blob(output_ptr, {int(_batch_size), int(_seq_len), int(_hidden_size)}, - torch::TensorOptions(torch::kCUDA).dtype(data_type)); - std::vector allreduce_tensors = {output_tensor}; - auto work = pg->allreduce(allreduce_tensors, c10d::AllreduceOptions()); - work->wait(); - } - - _attn_dropout.bias_dropout_residual(output_ptr, output_ptr, input_ptr, _attn_ob_ptr, - _batch_tokens, _hidden_size, _stream); - if (!_pre_or_postLayerNorm) { - // in-place ln since ln-input will not be used in post-ln mode - _attn_ln.Forward(output_ptr, output_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, _stream); - } -} - -template -void MultiHeadAttention::Forward(const T *input_ptr, const T *input_mask_ptr, T *out_ptr) { - _stream = Context::Instance().get_stream(); - _cublasHandle = Context::Instance().get_cublashandle(); - T *attn_buffer = _shared_mem_ptr; // 3 * _batch_dim - - attn_layer_fw(input_ptr, input_mask_ptr, out_ptr, attn_buffer); -} - -template -void MultiHeadAttention::attn_layer_bw(const T *input_ptr, const T *input_mask_ptr, const T *output_ptr, - const T *grad_output_ptr, T *grad_input_ptr, T *buffer) { - cudaStream_t streams[2] = {_stream, _stream}; - - const T *q_tf_ptr = _qkv_ptr; - const T *k_tf_ptr = q_tf_ptr + _batch_dim / pg_size; - const T *v_tf_ptr = k_tf_ptr + _batch_dim / pg_size; - // batch_dim = batch_size * seq_len * hidden_size - // buffer size: batch_dim * 3 + max(batch_dim * 3, - // batch_size * head_num * seq_len * seq_len) - T *grad_residual_ptr = buffer; - buffer += _batch_dim; - - T *grad_input_buf_ptr = buffer; // batch_dim - T *grad_qkv_5d_ptr = buffer; // batch_dim * 3 - buffer += 3 * _batch_dim / pg_size; - - T *grad_qkv_4d_ptr = buffer; // batch_dim * 3 - T *grad_softmax_ptr = buffer; // batch_size * head_num * seq_len * seq_len - // buffer += max(3 * _batch_dim, - // batch_size * head_num * seq_len * seq_len); - - if (_pre_or_postLayerNorm) { - _attn_dropout.d_bias_dropout_residual(grad_input_ptr, _grad_attn_ob_ptr, grad_output_ptr, - _batch_tokens, _hidden_size, _stream); - } else { - _attn_ln.Backward(_grad_attn_nw_ptr, _grad_attn_nb_ptr, 
grad_residual_ptr, grad_output_ptr, - nullptr, output_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, streams); - _attn_dropout.d_bias_dropout_residual(grad_input_ptr, _grad_attn_ob_ptr, grad_residual_ptr, - _batch_tokens, _hidden_size, _stream); - } - - // bw of output project - _attn_out_linear.reset_size(_hidden_size, _hidden_size / pg_size); - _attn_out_linear.Backward(_batch_tokens, grad_input_ptr, _attn_o_inp_ptr, _attn_ow_ptr, - _grad_attn_ow_ptr, _grad_attn_ob_ptr, _cublasHandle, _stream, - grad_input_buf_ptr, nullptr, false); - launch_transform_0213(grad_input_ptr, grad_input_buf_ptr, _batch_size, _seq_len, - _hidden_size / pg_size, _heads / pg_size, _stream); - - // bw of score * v - _attn_context.Backward(_batch_heads, grad_input_ptr, v_tf_ptr, _ctx_bufB_ptr, _cublasHandle, - grad_qkv_5d_ptr + 2 * _batch_dim / pg_size, grad_softmax_ptr); - - _attn_prob_dropout.d_dropout(grad_softmax_ptr, _batch_heads * _seq_len * _seq_len, _stream); - - _softmax.reset_size(_heads / pg_size); - _softmax.Backward(grad_softmax_ptr, _soft_out_ptr, _batch_size, _seq_len, _seq_len, _stream); - - // bw of q * k - _attn_scores.Backward(_batch_heads, grad_softmax_ptr, k_tf_ptr, q_tf_ptr, _cublasHandle, - grad_qkv_5d_ptr + _batch_dim / pg_size, grad_qkv_5d_ptr); - - // [3, b, nh, s, ad] -> [b, s, 3, h] - launch_transform4d_0213(grad_qkv_4d_ptr, grad_qkv_5d_ptr, _batch_size, _seq_len, - _hidden_size / pg_size, _heads / pg_size, 3, _stream); - - const T *gemmQKV_inp_ptr = _pre_or_postLayerNorm ? _gemmQKV_inp_ptr : input_ptr; - _qkv_linear.reset_size(3 * _hidden_size / pg_size, _hidden_size); - _qkv_linear.Backward(_batch_tokens, grad_qkv_4d_ptr, gemmQKV_inp_ptr, _attn_qkvw_ptr, - _grad_attn_qkvw_ptr, _grad_attn_qkvb_ptr, _cublasHandle, _stream, - grad_input_buf_ptr, nullptr, true); - - // allreduce - if (pg == c10::detail::UniqueVoidPtr() || pg->getSize() == 1) { - } else { - auto data_type = torch::kFloat; - if (typeid(T) != typeid(float)) { - data_type = torch::kHalf; - } - auto grad_input_tensor = - torch::from_blob(grad_input_buf_ptr, {int(_batch_size), int(_seq_len), int(_hidden_size)}, - torch::TensorOptions(torch::kCUDA).dtype(data_type)); - std::vector allreduce_tensors = {grad_input_tensor}; - auto work = pg->allreduce(allreduce_tensors, c10d::AllreduceOptions()); - work->wait(); - } - - if (_pre_or_postLayerNorm) { - _attn_ln.Backward(_grad_attn_nw_ptr, _grad_attn_nb_ptr, grad_input_ptr, grad_input_buf_ptr, - grad_output_ptr, gemmQKV_inp_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, - streams); - } else { - // FIXME later - launch_fused_add2(grad_input_ptr, grad_input_buf_ptr, grad_residual_ptr, _batch_size, - _seq_len, _hidden_size, _stream); - } -} - -template -void MultiHeadAttention::Backward(const T *grad_output_ptr, const T *input_ptr, const T *output_ptr, - const T *input_mask_ptr, T *grad_input_ptr) { - _stream = Context::Instance().get_stream(); - _cublasHandle = Context::Instance().get_cublashandle(); - T *buffer = _shared_mem_ptr; - - /* - buffer size needed by attn bw: - 4 * _batch_dim + max(3 * _batch_dim, - _batch_size * _head_num * _seq_len * _seq_len); - */ - attn_layer_bw(input_ptr, input_mask_ptr, output_ptr, grad_output_ptr, grad_input_ptr, buffer); -} - -template -void MultiHeadAttention::SetTrainingMode(bool training) { - // Dropout will be skipped when not in training model. 
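- // Both the attention-probability dropout and the post-projection dropout
- // share this flag, so a single call switches the whole attention layer
- // between train and eval behavior.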
- _attn_prob_dropout.SetTrainingMode(training); - _attn_dropout.SetTrainingMode(training); -} - -template -T *MultiHeadAttention::_shared_mem_ptr = nullptr; - -template class MultiHeadAttention; -template class MultiHeadAttention<__half>; - -// x is torch::Tensor -#define CHECK_CUDA(x) AT_ASSERTM(x.is_cuda(), #x " must be a CUDA tensor") -#define CHECK_CONTIGUOUS(x) AT_ASSERTM(x.is_contiguous(), #x " must be contiguous") -#define CHECK_INPUT(x) \ - CHECK_CUDA(x); \ - CHECK_CONTIGUOUS(x) - -static std::unordered_map> s_multihead_attention; - -template -int create_multihead_attention(int layer_id, int max_batch_tokens, int max_seq_len, int hidden_dim, - int num_heads, float attn_prob_dropout_ratio, - float hidden_dropout_ratio, bool pre_or_postLayerNorm, - c10::intrusive_ptr pg_) { - cudaStream_t stream = at::cuda::getCurrentCUDAStream(); - Context::Instance().set_stream(stream); - auto layer = std::make_shared>( - layer_id, max_batch_tokens, max_seq_len, hidden_dim, num_heads, attn_prob_dropout_ratio, - hidden_dropout_ratio, pre_or_postLayerNorm); - - layer->SetPG(pg_); - - s_multihead_attention[layer_id] = layer; - - std::string dtype = (std::is_same::value) ? "half" : "float"; - - return 0; -} - -template -std::vector multihead_attention_fw(int layer_id, const torch::Tensor &input, - const torch::Tensor &input_mask, - const torch::Tensor &in_proj_weight, - const torch::Tensor &in_proj_bias, - const torch::Tensor &out_proj_weight, - const torch::Tensor &out_proj_bias, - const torch::Tensor &norm_weight, - const torch::Tensor &norm_bias, - bool training_mode, bool prelayernorm) { - CHECK_INPUT(input); - CHECK_INPUT(input_mask); - - const T *input_ptr = (const T *)input.data_ptr(); - const T *input_mask_ptr = (const T *)input_mask.data_ptr(); - - auto output = torch::empty_like(input); - T *out_ptr = (T *)output.data_ptr(); - - std::shared_ptr> layer = - std::static_pointer_cast>(s_multihead_attention[layer_id]); - layer->set_cur_batch_shape(input.size(0), input.size(1)); - layer->SetTrainingMode(training_mode); - - layer->_attn_qkvw_ptr = (const T *)in_proj_weight.data_ptr(); - layer->_attn_qkvb_ptr = (const T *)in_proj_bias.data_ptr(); - layer->_attn_ow_ptr = (const T *)out_proj_weight.data_ptr(); - layer->_attn_ob_ptr = (const T *)out_proj_bias.data_ptr(); - layer->_attn_nw_ptr = (const T *)norm_weight.data_ptr(); - layer->_attn_nb_ptr = (const T *)norm_bias.data_ptr(); - - layer->Forward(input_ptr, input_mask_ptr, out_ptr); - - return {output}; -} - -template -std::vector multihead_attention_bw(int layer_id, - const torch::Tensor &grad_dec_output, - const torch::Tensor &output, - const torch::Tensor &input, - const torch::Tensor &input_mask, - const torch::Tensor &in_proj_weight, - const torch::Tensor &in_proj_bias, - const torch::Tensor &out_proj_weight, - const torch::Tensor &out_proj_bias, - const torch::Tensor &norm_weight, - const torch::Tensor &norm_bias) { - auto g_output = grad_dec_output.contiguous(); - CHECK_INPUT(g_output); - CHECK_INPUT(output); - CHECK_INPUT(input); - CHECK_INPUT(input_mask); - - auto grad_input = torch::empty_like(input); - auto grad_in_proj_weight = torch::empty_like(in_proj_weight); - auto grad_in_proj_bias = torch::empty_like(in_proj_bias); - auto grad_out_proj_weight = torch::empty_like(out_proj_weight); - auto grad_out_proj_bias = torch::empty_like(out_proj_bias); - auto grad_norm_weight = torch::empty_like(norm_weight); - auto grad_norm_bias = torch::empty_like(norm_bias); - - // inputs. 
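- // Raw device pointers for the inputs. T is guaranteed to match the tensors'
- // scalar type because the fp32/fp16 pybind entry points select the
- // instantiation, and grad_dec_output was made contiguous above.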
- const T *grad_dec_output_ptr = (const T *)g_output.data_ptr(); - const T *input_ptr = (const T *)input.data_ptr(); - const T *output_ptr = (const T *)output.data_ptr(); - const T *input_mask_ptr = (const T *)input_mask.data_ptr(); - - // outputs. - T *grad_input_ptr = (T *)grad_input.data_ptr(); - - std::shared_ptr> layer = - std::static_pointer_cast>(s_multihead_attention[layer_id]); - layer->set_cur_batch_shape(g_output.size(0), g_output.size(1)); - - layer->_grad_attn_qkvw_ptr = (T *)grad_in_proj_weight.data_ptr(); - layer->_grad_attn_qkvb_ptr = (T *)grad_in_proj_bias.data_ptr(); - layer->_grad_attn_ow_ptr = (T *)grad_out_proj_weight.data_ptr(); - layer->_grad_attn_ob_ptr = (T *)grad_out_proj_bias.data_ptr(); - layer->_grad_attn_nw_ptr = (T *)grad_norm_weight.data_ptr(); - layer->_grad_attn_nb_ptr = (T *)grad_norm_bias.data_ptr(); - - layer->Backward(grad_dec_output_ptr, input_ptr, output_ptr, input_mask_ptr, grad_input_ptr); - - return {grad_input, grad_in_proj_weight, grad_in_proj_bias, grad_out_proj_weight, - grad_out_proj_bias, grad_norm_weight, grad_norm_bias}; -} - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("multihead_attention_fw_fp32", &multihead_attention_fw, - "Multi-head Attention forward with fp32 (CUDA)"); - m.def("multihead_attention_fw_fp16", &multihead_attention_fw<__half>, - "Multi-head Attention forward with fp16 (CUDA)"); - m.def("multihead_attention_bw_fp32", &multihead_attention_bw, - "Multi-head Attention backward with fp32 (CUDA)"); - m.def("multihead_attention_bw_fp16", &multihead_attention_bw<__half>, - "Multi-head Attention backward with fp16 (CUDA)"); - m.def("create_multihead_attention_fp32", &create_multihead_attention, - "Create Multi-head Attention with fp32 (CUDA)"); - m.def("create_multihead_attention_fp16", &create_multihead_attention<__half>, - "Create Multi-head Attention with fp16 (CUDA)"); -} diff --git a/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h b/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h deleted file mode 100644 index 138c34be6fbab70f21db30709a46fcfd0c06de44..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h +++ /dev/null @@ -1,158 +0,0 @@ -#pragma once - -#include -#include -#include -#include - -#include -#include -#include - -#ifdef COLOSSAL_HIP -#include "hip_util.h" -#else -#include "cuda_util.h" -#endif - -#include "dropout.h" -#include "feed_forward.h" -#include "normalize_layer.h" -#include "softmax.h" -#include "strided_batch_gemm.h" - -template -class MultiHeadAttention { - public: - MultiHeadAttention(int layer_id, int max_batch_tokens, int _max_seq_len, int hidden_size, - int num_heads, float attn_dropout_ratio, float hidden_output_dropout_ratio, - bool pre_or_postLayerNorm); - - virtual ~MultiHeadAttention(); - - void Forward(const T *input_ptr, const T *input_mask_ptr, T *out_ptr); - - void Backward(const T *grad_output_ptr, const T *input_ptr, const T *output_ptr, - const T *input_mask_ptr, T *grad_input_ptr); - - void attn_layer_fw(const T *input_ptr, const T *input_mask_ptr, T *output_ptr, T *buffer); - - void attn_layer_bw(const T *input_ptr, const T *input_mask_ptr, const T *output_ptr, - const T *grad_output_ptr, T *grad_input_attn_layer_bwptr, T *buffer); - - void set_cur_batch_shape(int batch_size, int seq_len) { - _batch_size = batch_size; - _seq_len = seq_len; - _batch_tokens = batch_size * seq_len; - _batch_heads = batch_size * _heads / pg_size; - _batch_dim = _batch_tokens * _hidden_size; - 
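- // The strided-batch GEMM descriptors depend on the current sequence length,
- // so they are refreshed whenever the batch shape changes.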
_attn_scores.SetConfig(_seq_len, _seq_len, _hidden_size / _heads); - _attn_context.SetConfig(_hidden_size / _heads, _seq_len, _seq_len); - } - - void SetTrainingMode(bool training); - inline bool IsTrainingMode() const { return _training; } - - void SetPG(c10::intrusive_ptr pg_) { - pg = pg_; - pg_size = 1; - if (pg != c10::detail::UniqueVoidPtr()) { - pg_size = pg->getSize(); - } - allocate_mem_buffer(); - } - - // weights ptr - const T *_attn_qkvw_ptr; - const T *_attn_qkvb_ptr; - const T *_attn_ow_ptr; - const T *_attn_ob_ptr; - const T *_attn_nw_ptr; - const T *_attn_nb_ptr; - - // grads ptr - T *_grad_attn_qkvw_ptr; - T *_grad_attn_qkvb_ptr; - T *_grad_attn_ow_ptr; - T *_grad_attn_ob_ptr; - T *_grad_attn_nw_ptr; - T *_grad_attn_nb_ptr; - - private: - void allocate_mem_buffer() { - // allocate local gpu memory - if (_pre_or_postLayerNorm) { - _gemmQKV_inp_ptr = cuda_malloc(_max_batch_tokens * _hidden_size); - } else { - _gemmQKV_inp_ptr = nullptr; - } - - _qkv_ptr = cuda_malloc(_max_batch_tokens * _hidden_size * 3); - _soft_out_ptr = cuda_malloc(_max_batch_tokens * _heads / pg_size * _max_seq_len); - _ctx_bufB_ptr = cuda_malloc(_max_batch_tokens * _heads / pg_size * _max_seq_len); - _attn_o_inp_ptr = cuda_malloc(_max_batch_tokens * _hidden_size); - - // buffer size needed by attn bw - size_t smem_size = 4 * _max_batch_tokens * _hidden_size / pg_size + - std::max(3 * _max_batch_tokens * _hidden_size / pg_size, - _max_batch_tokens * _heads / pg_size * _max_seq_len); - - if (!_shared_mem_ptr) { - cuda_free(_shared_mem_ptr); - _shared_mem_ptr = cuda_malloc(smem_size); - } - } - - void free_mem_buffer() { - // free local gpu memory - cuda_free(_gemmQKV_inp_ptr); - cuda_free(_qkv_ptr); - cuda_free(_soft_out_ptr); - cuda_free(_ctx_bufB_ptr); - cuda_free(_attn_o_inp_ptr); - - // free shared gpu memory between layers - cuda_free(_shared_mem_ptr); - _shared_mem_ptr = nullptr; - } - - // const parameter between batch - const size_t _layer_id; - const size_t _hidden_size; - const size_t _heads; - const size_t _max_batch_tokens; - const size_t _max_seq_len; - const bool _pre_or_postLayerNorm; - // dynamic parameter between batch - size_t _batch_size; - size_t _seq_len; - size_t _batch_tokens; - size_t _batch_heads; - size_t _batch_dim; - bool _training; - - cublasHandle_t _cublasHandle; - cudaStream_t _stream; - - // layers - FeedForward _qkv_linear; - FeedForward _attn_out_linear; - Normalize_Layer _attn_ln; - Softmax _softmax; - Dropout _attn_prob_dropout; - Dropout _attn_dropout; - StridedBatchGemm _attn_scores; - StridedBatchGemm _attn_context; - - // local GPU memory - T *_gemmQKV_inp_ptr; - T *_qkv_ptr; - T *_soft_out_ptr; - T *_ctx_bufB_ptr; - T *_attn_o_inp_ptr; - // shared GPU memory between layer - static T *_shared_mem_ptr; - - c10::intrusive_ptr pg; - int pg_size; -}; diff --git a/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.cpp b/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.cpp deleted file mode 100644 index 4ae3c853ca5e844272ca4fdb907c8c95a7f2b787..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.cpp +++ /dev/null @@ -1,84 +0,0 @@ -/*This code from NVIDIA Megatron: - * with minor changes. 
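- *
- * This translation unit only validates shapes and dtypes and forwards to the
- * CUDA implementations; the pybind11 module at the bottom exposes forward,
- * backward and get_batch_per_block. Illustrative Python-side usage (the
- * module name depends on how the extension is built):
- *   probs = scaled_masked_softmax.forward(scores, mask, scale)
- *   grads = scaled_masked_softmax.backward(grad_out, probs, scale)
- * NB: bwd checks dim() == 4 even though its assertion message says
- * "expected 3D tensor".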
*/ - -#include -#include -#include - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_masked_softmax { - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - torch::Tensor const& mask, - float scale_factor); - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor); - -int get_batch_per_block_cuda( - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads); - -torch::Tensor fwd( - torch::Tensor const& input, - torch::Tensor const& mask, - float scale_factor) { - AT_ASSERTM(input.dim() == 4, "expected 4D tensor"); - AT_ASSERTM((input.scalar_type() == at::ScalarType::Half) || - (input.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - AT_ASSERTM(mask.dim() == 4, "expected 4D tensor"); - - return fwd_cuda(input, mask, scale_factor); -} - -torch::Tensor bwd( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor) { - - AT_ASSERTM(output_grads.dim() == 4, "expected 3D tensor"); - AT_ASSERTM(softmax_results.dim() == 4, "expected 3D tensor"); - - AT_ASSERTM((output_grads.scalar_type() == at::ScalarType::Half) || - (output_grads.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - AT_ASSERTM((softmax_results.scalar_type() == at::ScalarType::Half) || - (softmax_results.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - - return bwd_cuda(output_grads, softmax_results, scale_factor); -} - -int get_batch_per_block( - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads) { - return get_batch_per_block_cuda(query_seq_len, key_seq_len, batches, attn_heads); -} - -} // end namespace scaled_masked_softmax -} // end namespace fused_softmax -} // end namespace multihead_attn - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("forward", - &multihead_attn::fused_softmax::scaled_masked_softmax::fwd, - "Self Multihead Attention scaled, time masked softmax -- Forward."); - - m.def("backward", - &multihead_attn::fused_softmax::scaled_masked_softmax::bwd, - "Self Multihead Attention scaled, time masked softmax -- Backward."); - - m.def("get_batch_per_block", - &multihead_attn::fused_softmax::scaled_masked_softmax::get_batch_per_block, - "Return Batch per block size." - ); -} diff --git a/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h b/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h deleted file mode 100644 index 1583030b8235acfb3a3af1a86fa938901ae52bbb..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h +++ /dev/null @@ -1,492 +0,0 @@ -/*This code from NVIDIA Megatron: - * with minor changes. 
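- *
- * Warp-level softmax with input scaling and an explicit uint8 mask: each
- * logical warp owns WARP_BATCH rows, loads ELEMENTS_PER_LDG_STG elements per
- * access, and reduces max/sum with warp shuffles. Masked positions are set to
- * -10000 before the max/exp/normalize steps. The dispatch functions at the
- * end launch one kernel instantiation per log2(key_seq_len) in [0, 11], with
- * blocks = (query_seq_len / batches_per_block, attn_heads, batches) and
- * threads = (warp_size, warps_per_block).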
*/ - -#pragma once - -#include -#include -#include -#include -#include -#include -#include - -namespace { - -template -__device__ __inline__ void copy_vector(Datatype *dst, const Datatype *src); - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) {*((half2*) dst) = *((half2*) src); } - -int log2_ceil(int value) { - int log2_value = 0; - while ((1 << log2_value) < value) ++log2_value; - return log2_value; -} - -template -struct Add { - __device__ __forceinline__ T operator()(T a, T b) const { - return a + b; - } -}; - -template -struct Max { - __device__ __forceinline__ T operator()(T a, T b) const { - return a < b ? b : a; - } -}; - -template -__device__ __forceinline__ T WARP_SHFL_XOR_NATIVE(T value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff) -{ -#if CUDA_VERSION >= 9000 - return __shfl_xor_sync(mask, value, laneMask, width); -#else - return __shfl_xor(value, laneMask, width); -#endif -} - -template class ReduceOp> -__device__ __forceinline__ void warp_reduce(acc_t* sum) { - ReduceOp r; - #pragma unroll - for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) { - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - acc_t b = WARP_SHFL_XOR_NATIVE(sum[i], offset, WARP_SIZE); - sum[i] = r(sum[i], b); - } - } -} - -/* - * Extended softmax (from native aten pytorch) with following additional features - * 1) input scaling - * 2) Explicit masking - */ -template -__global__ void scaled_masked_softmax_warp_forward( - output_t *dst, - const input_t *src, - const uint8_t *mask, - const acc_t scale, - int micro_batch_size, - int element_count, - int pad_batches) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_forward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4; - - // blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, ) - // gridDim/blockIdx = (seq_len, attn_heads, batches) - int first_batch = (blockDim.y * (blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z))+ threadIdx.y) * WARP_BATCH; - int pad_first_batch = 0; - if (pad_batches != 1) { // bert style - pad_first_batch = (blockDim.y * (blockIdx.x + gridDim.x * blockIdx.z) + threadIdx.y) * WARP_BATCH; - } else { // gpt2 style - pad_first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * WARP_BATCH; - } - - // micro_batch_size might not be a multiple of WARP_BATCH. Check how - // many batches have to computed within this WARP. 
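- // e.g. with WARP_BATCH == 2 and an odd number of rows, the last warp owns
- // only one valid row; the clamp below (together with the zero
- // batch_element_count fallback) keeps it from reading past the end.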
- int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - src += first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - dst += first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - mask += pad_first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - - // load data from global memory - acc_t elements[WARP_BATCH][WARP_ITERATIONS]; - input_t temp_data[ELEMENTS_PER_LDG_STG]; - uint8_t temp_mask[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : element_count; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - - if (element_index < batch_element_count) { - int itr_idx = i*element_count+it*WARP_SIZE; - copy_vector(temp_data, src + itr_idx); - copy_vector(temp_mask, mask + itr_idx); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (temp_mask[element] != 1) { - elements[i][it + element] = (acc_t)temp_data[element] * scale; - } else { - elements[i][it + element] = -10000.0; - } - } - } else { - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - elements[i][it + element] = -std::numeric_limits::infinity(); - } - } - } - } - - // compute max_value - acc_t max_value[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - max_value[i] = elements[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - max_value[i] = (max_value[i] > elements[i][it]) ? max_value[i] : elements[i][it]; - } - } - warp_reduce(max_value); - - acc_t sum[WARP_BATCH] { 0.0f }; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; ++it) { - elements[i][it] = std::exp((elements[i][it] - max_value[i])); - sum[i] += elements[i][it]; - } - } - warp_reduce(sum); - - // store result - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < element_count) { - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - out[element] = elements[i][it + element] / sum[i]; - } - copy_vector(dst + i * element_count + it * WARP_SIZE, out); - } else { - break; - } - } - } -} - -template -__global__ void scaled_masked_softmax_warp_backward( - output_t *gradInput, - input_t *grad, - const input_t *output, - acc_t scale, - int micro_batch_size, - int element_count) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_backward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 
1 : 4; - - // blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, ) - // gridDim/blockIdx = (seq_len, attn_heads, batches) - int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * WARP_BATCH; - - // micro_batch_size might not be a multiple of WARP_BATCH. Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - // the first element to process by the current thread - int thread_offset = first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - grad += thread_offset; - output += thread_offset; - gradInput += thread_offset; - - // load data from global memory - acc_t grad_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - acc_t output_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - input_t temp_grad[ELEMENTS_PER_LDG_STG]; - input_t temp_output[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : element_count; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < batch_element_count) { - copy_vector(temp_grad, grad + i * element_count + it * WARP_SIZE); - copy_vector(temp_output, output + i * element_count + it * WARP_SIZE); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - output_reg[i][it + element] = (acc_t)temp_output[element]; - } - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - grad_reg[i][it + element] = (acc_t)temp_grad[element] * output_reg[i][it + element]; - } - } - } - } - - acc_t sum[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - sum[i] = grad_reg[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - sum[i] += grad_reg[i][it]; - } - } - warp_reduce(sum); - - // store result - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < element_count) { - // compute gradients - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - out[element] = (output_t)(scale * (grad_reg[i][it + element] - output_reg[i][it + element] * sum[i])); - } - copy_vector(gradInput + i * element_count + it * WARP_SIZE, out); - } - } - } -} -} // end of anonymous namespace - -int get_batch_per_block(int query_seq_len, int key_seq_len, int batches, int attn_heads){ - int log2_elements = log2_ceil(key_seq_len); - const int next_power_of_two = 1 << log2_elements; - - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - int batches_per_warp = (next_power_of_two <= 128) ? 
2 : 1; - - constexpr int threads_per_block = 128; - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - - return batches_per_block; -} - -template -void dispatch_scaled_masked_softmax_forward( - output_t *dst, - const input_t *src, - const uint8_t *mask, - const input_t scale, - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads, - int pad_batches) -{ - TORCH_INTERNAL_ASSERT(key_seq_len >= 0 && key_seq_len <= 2048 ); - if (key_seq_len == 0) { - return; - } else { - int log2_elements = log2_ceil(key_seq_len); - const int next_power_of_two = 1 << log2_elements; - int batch_count = batches * attn_heads * query_seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_forward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_forward. - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - TORCH_INTERNAL_ASSERT(query_seq_len%batches_per_block == 0); - dim3 blocks(query_seq_len/batches_per_block, attn_heads, batches); - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 1: // 2 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 2: // 4 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 3: // 8 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 4: // 16 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 5: // 32 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 6: // 64 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 7: // 128 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 8: // 256 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 9: // 512 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 10: // 1024 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 11: // 2048 - scaled_masked_softmax_warp_forward - <<>>(dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - default: - break; - } - } -} - -template -void dispatch_scaled_masked_softmax_backward( - output_t *grad_input, - input_t *grad, - const input_t *output, - const acc_t scale, - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads) -{ - TORCH_INTERNAL_ASSERT( key_seq_len >= 0 && key_seq_len <= 2048 ); - if (key_seq_len == 0) { - return; - } else { - int log2_elements = 
log2_ceil(key_seq_len); - const int next_power_of_two = 1 << log2_elements; - int batch_count = batches * attn_heads * query_seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_backward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_backward. - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - int blocks = batch_count/batches_per_block; - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 1: // 2 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 2: // 4 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 3: // 8 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 4: // 16 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 5: // 32 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 6: // 64 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 7: // 128 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 8: // 256 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 9: // 512 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 10: // 1024 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 11: // 2048 - scaled_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, key_seq_len); - break; - default: - break; - } - } -} diff --git a/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu b/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu deleted file mode 100644 index 1100a4bd129e7a68be64da7e1ffb2074eabc0b3d..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu +++ /dev/null @@ -1,108 +0,0 @@ -/*This code from NVIDIA Megatron: - * with minor changes. 
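- *
- * Host-side glue: fwd_cuda validates the [batches, attn_heads, query_seq_len,
- * key_seq_len] layout (key_seq_len <= 2048) and dispatches on fp16/bf16 via
- * DISPATCH_HALF_AND_BFLOAT; bwd_cuda computes the gradient in place on
- * output_grads and returns it, so the incoming gradient tensor is clobbered.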
*/ - -#include -#include -#include -#include - -#ifndef COLOSSAL_HIP -#include -#endif - -#include -#include -#include "scaled_masked_softmax.h" -#include "type_shim.h" - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_masked_softmax { - -int get_batch_per_block_cuda(int query_seq_len, int key_seq_len, int batches, int attn_heads){ - return get_batch_per_block(query_seq_len, key_seq_len, batches, attn_heads); -} - - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - torch::Tensor const& mask, - float scale_factor) -{ - // input is a 4d tensor with dimensions [batches, attn_heads, seq_len, seq_len] - const int batches = input.size(0); - const int pad_batches = mask.size(0); - const int attn_heads = input.size(1); - const int query_seq_len = input.size(2); - const int key_seq_len = input.size(3); - TORCH_INTERNAL_ASSERT(key_seq_len <= 2048); - TORCH_INTERNAL_ASSERT(query_seq_len > 1); - TORCH_INTERNAL_ASSERT(pad_batches == 1 || pad_batches == batches); - TORCH_INTERNAL_ASSERT(mask.size(1) == 1); - TORCH_INTERNAL_ASSERT(mask.size(2) == query_seq_len); - TORCH_INTERNAL_ASSERT(mask.size(3) == key_seq_len); - - // Output - auto act_options = input.options().requires_grad(false); - torch::Tensor softmax_results = - torch::empty({batches, attn_heads, query_seq_len, key_seq_len}, act_options); - - // Softmax Intermediate Result Ptr - void* input_ptr = static_cast(input.data_ptr()); - void* mask_ptr = static_cast(mask.data_ptr()); - void* softmax_results_ptr = static_cast(softmax_results.data_ptr()); - - DISPATCH_HALF_AND_BFLOAT( - input.scalar_type(), - "dispatch_scaled_masked_softmax_forward", - dispatch_scaled_masked_softmax_forward( - reinterpret_cast(softmax_results_ptr), - reinterpret_cast(input_ptr), - reinterpret_cast(mask_ptr), - scale_factor, - query_seq_len, - key_seq_len, - batches, - attn_heads, - pad_batches); - ); - return softmax_results; -} - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads_, - torch::Tensor const& softmax_results_, - float scale_factor) { - - auto output_grads = output_grads_.contiguous(); - auto softmax_results = softmax_results_.contiguous(); - - //output grads is a 4d tensor with dimensions [batches, attn_heads, seq_len, seq_len] - const int batches = output_grads.size(0); - const int attn_heads = output_grads.size(1); - const int query_seq_len = output_grads.size(2); - const int key_seq_len = output_grads.size(3); - - void* output_grads_ptr = static_cast(output_grads.data_ptr()); - - //Softmax Grad - DISPATCH_HALF_AND_BFLOAT( - output_grads_.scalar_type(), - "dispatch_scaled_masked_softmax_backward", - dispatch_scaled_masked_softmax_backward( - reinterpret_cast(output_grads_ptr), - reinterpret_cast(output_grads_ptr), - reinterpret_cast(softmax_results.data_ptr()), - scale_factor, - query_seq_len, - key_seq_len, - batches, - attn_heads); - ); - - //backward pass is completely in-place - return output_grads; -} -} -} -} diff --git a/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp b/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp deleted file mode 100644 index 590ea7b3fc8775b06af6760f447d8caef59abbe4..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp +++ /dev/null @@ -1,59 +0,0 @@ -/*This code from NVIDIA Megatron: - * with minor changes. 
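- *
- * Bindings for the causal (upper-triangular masked) softmax used for
- * GPT-style attention. Unlike the explicitly masked variant it takes no mask
- * tensor: only a 3D [attn_batches, seq_len, seq_len] input and a scale
- * factor.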
*/ - -#include -#include -#include - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_upper_triang_masked_softmax { - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - float scale_factor); - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor); - -torch::Tensor fwd(torch::Tensor const& input, float scale_factor) { - AT_ASSERTM(input.dim() == 3, "expected 3D tensor"); - AT_ASSERTM((input.scalar_type() == at::ScalarType::Half) || - (input.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - - return fwd_cuda(input, scale_factor); -} - -torch::Tensor bwd( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor) { - - AT_ASSERTM(output_grads.dim() == 3, "expected 3D tensor"); - AT_ASSERTM(softmax_results.dim() == 3, "expected 3D tensor"); - - AT_ASSERTM((output_grads.scalar_type() == at::ScalarType::Half) || - (output_grads.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - AT_ASSERTM((softmax_results.scalar_type() == at::ScalarType::Half) || - (softmax_results.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - - return bwd_cuda(output_grads, softmax_results, scale_factor); -} - -} // end namespace scaled_upper_triang_masked_softmax -} // end namespace fused_softmax -} // end namespace multihead_attn - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("forward", - &multihead_attn::fused_softmax::scaled_upper_triang_masked_softmax::fwd, - "Self Multihead Attention scaled, time masked softmax -- Forward."); - m.def("backward", - &multihead_attn::fused_softmax::scaled_upper_triang_masked_softmax::bwd, - "Self Multihead Attention scaled, time masked softmax -- Backward."); -} diff --git a/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h b/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h deleted file mode 100644 index 3af487f9de0ffdc22faaca142cbc2ff86b68d03e..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h +++ /dev/null @@ -1,500 +0,0 @@ -/*This code from NVIDIA Megatron: - * with minor changes. 
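- *
- * Same warp-softmax scheme as scaled_masked_softmax.h, but the mask is
- * implicit in the launch geometry: the block at blockIdx.x == i handles query
- * row i and treats only the first local_seq = i + 1 keys as valid; positions
- * beyond that are filled with -inf before the reduction and written back as
- * zero probabilities.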
*/ - -#pragma once - -#include -#include -#include -#include -#include -#include - -namespace { - -template -__device__ __inline__ void copy_vector(Datatype *dst, const Datatype *src); - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) {*((half2*) dst) = *((half2*) src); } - -template -__device__ __inline__ void copy_zero_vector(Datatype *dst); - -template <> -__device__ __inline__ void copy_zero_vector(c10::BFloat16 *dst) { *dst = 0.0; } - -template <> -__device__ __inline__ void copy_zero_vector(c10::BFloat16 *dst) { *((float2*) dst) = make_float2(0.0f, 0.0f); } - -template <> -__device__ __inline__ void copy_zero_vector(c10::Half *dst) { *dst = 0.0; } - -template <> -__device__ __inline__ void copy_zero_vector(c10::Half *dst) { *((float2*) dst) = make_float2(0.0f, 0.0f); } - - -int log2_ceil(int value) { - int log2_value = 0; - while ((1 << log2_value) < value) ++log2_value; - return log2_value; -} - -template -struct Add { - __device__ __forceinline__ T operator()(T a, T b) const { - return a + b; - } -}; - -template -struct Max { - __device__ __forceinline__ T operator()(T a, T b) const { - return a < b ? b : a; - } -}; - -template -__device__ __forceinline__ T WARP_SHFL_XOR_NATIVE(T value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff) -{ -#if CUDA_VERSION >= 9000 - return __shfl_xor_sync(mask, value, laneMask, width); -#else - return __shfl_xor(value, laneMask, width); -#endif -} - -template class ReduceOp> -__device__ __forceinline__ void warp_reduce(acc_t* sum) { - ReduceOp r; - #pragma unroll - for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) { - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - acc_t b = WARP_SHFL_XOR_NATIVE(sum[i], offset, WARP_SIZE); - sum[i] = r(sum[i], b); - } - } -} - -/* - * Extended softmax (from native aten pytorch) with following additional features - * 1) input scaling - * 2) Implicit time (diagonal masking) - */ -template -__global__ void scaled_upper_triang_masked_softmax_warp_forward( - output_t *dst, - const input_t *src, - const acc_t scale, - int micro_batch_size, - int stride, - int element_count) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_forward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4; - - int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x; - int local_seq = blockIdx.x + 1; - int warp_iteration_limit = (local_seq + ELEMENTS_PER_LDG_STG * WARP_SIZE - 1)/ WARP_SIZE; - - // micro_batch_size might not be a multiple of WARP_BATCH. 
Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - src += first_batch * stride + ELEMENTS_PER_LDG_STG * local_idx; - dst += first_batch * stride + ELEMENTS_PER_LDG_STG * local_idx; - - // load data from global memory - acc_t elements[WARP_BATCH][WARP_ITERATIONS]; - input_t temp_data[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : local_seq; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - - if (element_index < batch_element_count) { - copy_vector(temp_data, src + i*element_count*stride + it*WARP_SIZE); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if ((element_index + element) < batch_element_count) { - elements[i][it+element] = (acc_t)temp_data[element] * scale; - } else { - elements[i][it + element] = -std::numeric_limits::infinity(); - } - } - } else { - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - elements[i][it + element] = -std::numeric_limits::infinity(); - } - } - } - } - - // compute max_value - acc_t max_value[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - max_value[i] = elements[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - max_value[i] = (max_value[i] > elements[i][it]) ? max_value[i] : elements[i][it]; - } - } - warp_reduce(max_value); - - acc_t sum[WARP_BATCH] { 0.0f }; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; ++it) { - if (it < warp_iteration_limit) { - elements[i][it] = std::exp((elements[i][it] - max_value[i])); - sum[i] += elements[i][it]; - } - } - } - warp_reduce(sum); - - // store result - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - - if (element_index < local_seq) { - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (element_index + element < local_seq) { - out[element] = elements[i][it + element] / sum[i]; - } else { - out[element] = 0; - } - } - copy_vector(dst + i * element_count * stride + it * WARP_SIZE, out); - } else if (element_index < element_count) { - copy_zero_vector(dst + i * element_count * stride + it * WARP_SIZE); - } else { - break; - } - } - } -} - -template -__global__ void scaled_upper_triang_masked_softmax_warp_backward( - output_t *gradInput, - input_t *grad, - const input_t *output, - acc_t scale, - int micro_batch_size, - int stride, - int element_count) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_backward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 
2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4; - - int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x; - int local_seq = blockIdx.x + 1; - - // micro_batch_size might not be a multiple of WARP_BATCH. Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - // the first element to process by the current thread - int thread_offset = first_batch * stride + ELEMENTS_PER_LDG_STG * local_idx; - grad += thread_offset; - output += thread_offset; - gradInput += thread_offset; - - // load data from global memory - acc_t grad_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - acc_t output_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - input_t temp_grad[ELEMENTS_PER_LDG_STG]; - input_t temp_output[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : local_seq; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < batch_element_count) { - copy_vector(temp_grad, grad + i * element_count * stride + it * WARP_SIZE); - copy_vector(temp_output, output + i * element_count * stride + it * WARP_SIZE); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (element_index + element < batch_element_count) { - output_reg[i][it + element] = (acc_t)temp_output[element]; - } - } - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (element_index + element < batch_element_count) { - grad_reg[i][it + element] = (acc_t)temp_grad[element] * output_reg[i][it + element]; - } - } - } - } - } - - acc_t sum[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - sum[i] = grad_reg[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - sum[i] += grad_reg[i][it]; - } - } - warp_reduce(sum); - - // store result - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < element_count) { - // compute gradients - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - out[element] = (output_t)(scale * (grad_reg[i][it + element] - output_reg[i][it + element] * sum[i])); - } - copy_vector(gradInput + i * element_count * stride + it * WARP_SIZE, out); - } - } - } -} - -} // end of anonymous namespace - -template -void dispatch_scaled_upper_triang_masked_softmax_forward( - output_t *dst, - const input_t *src, - const input_t scale, - int softmax_elements, - int softmax_elements_stride, - int attn_batches) -{ - TORCH_INTERNAL_ASSERT(softmax_elements >= 0 && softmax_elements <= 2048 ); - if (softmax_elements == 0) { - return; - } else { - int log2_elements = log2_ceil(softmax_elements); - const int next_power_of_two = 1 << log2_elements; - int seq_len = softmax_elements; - int batch_count = attn_batches * seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_forward. 
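- // warp_size here is the logical warp width: rows shorter than a hardware
- // warp reduce over a narrower shuffle width (e.g. softmax_elements = 8
- // gives warp_size = 8).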
- int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_forward. - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - TORCH_INTERNAL_ASSERT(attn_batches % batches_per_block == 0); - - int blocks_per_seq = attn_batches / batches_per_block; - dim3 blocks(seq_len, blocks_per_seq, 1); - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 1: // 2 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 2: // 4 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 3: // 8 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 4: // 16 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 5: // 32 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 6: // 64 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 7: // 128 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 8: // 256 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 9: // 512 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 10: // 1024 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 11: // 2048 - scaled_upper_triang_masked_softmax_warp_forward - <<>>(dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - default: - break; - } - } -} - -template -void dispatch_scaled_upper_triang_masked_softmax_backward( - output_t *grad_input, - input_t *grad, - const input_t *output, - const acc_t scale, - int softmax_elements, - int softmax_elements_stride, - int attn_batches) -{ - TORCH_INTERNAL_ASSERT( softmax_elements >= 0 && softmax_elements <= 2048 ); - if (softmax_elements == 0) { - return; - } else { - int log2_elements = log2_ceil(softmax_elements); - const int next_power_of_two = 1 << log2_elements; - int seq_len = softmax_elements; - int batch_count = attn_batches * seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_backward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_backward. 
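- // Rows of 128 elements or fewer are cheap enough that each warp processes
- // two of them (WARP_BATCH == 2); longer rows get a dedicated warp.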
- int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - TORCH_INTERNAL_ASSERT(attn_batches % batches_per_block == 0); - - int blocks_per_seq = attn_batches / batches_per_block; - dim3 blocks(seq_len, blocks_per_seq, 1); - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 1: // 2 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 2: // 4 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 3: // 8 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 4: // 16 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 5: // 32 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 6: // 64 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 7: // 128 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 8: // 256 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 9: // 512 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 10: // 1024 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 11: // 2048 - scaled_upper_triang_masked_softmax_warp_backward - <<>>(grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - default: - break; - } - } -} diff --git a/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu b/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu deleted file mode 100644 index aa9b241508a1e72f6ef4e984e5170a2e9b898c01..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu +++ /dev/null @@ -1,89 +0,0 @@ -/*This code from NVIDIA Megatron: - * with minor changes. 
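- *
- * Host-side glue for the causal softmax: seq_len is asserted <= 2048 and the
- * gradient tensor must be square in its last two dimensions; as in the
- * explicitly masked version, the backward pass runs in place on output_grads.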
*/ - -#include -#include -#include -#include - -#ifndef COLOSSAL_HIP -#include -#endif - -#include -#include -#include "scaled_upper_triang_masked_softmax.h" -#include "type_shim.h" - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_upper_triang_masked_softmax { - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - float scale_factor) -{ - // input is a 3d tensor with dimensions [attn_batches, seq_len, seq_len] - const int attn_batches = input.size(0); - const int seq_len = input.size(1); - TORCH_INTERNAL_ASSERT(seq_len <= 2048); - - // Output - auto act_options = input.options().requires_grad(false); - torch::Tensor softmax_results = - torch::empty({attn_batches, seq_len, seq_len}, act_options); - - // Softmax Intermediate Result Ptr - void* input_ptr = static_cast(input.data_ptr()); - void* softmax_results_ptr = static_cast(softmax_results.data_ptr()); - - DISPATCH_HALF_AND_BFLOAT( - input.scalar_type(), - "dispatch_scaled_upper_triang_masked_softmax_forward", - dispatch_scaled_upper_triang_masked_softmax_forward( - reinterpret_cast(softmax_results_ptr), - reinterpret_cast(input_ptr), - scale_factor, - seq_len, - seq_len, - attn_batches); - ); - return softmax_results; -} - - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads_, - torch::Tensor const& softmax_results_, - float scale_factor) { - - auto output_grads = output_grads_.contiguous(); - auto softmax_results = softmax_results_.contiguous(); - - //output grads is a 3d tensor with dimensions [attn_batches, seq_len, seq_len] - const int attn_batches = output_grads.size(0); - const int seq_len = output_grads.size(1); - TORCH_INTERNAL_ASSERT(output_grads.size(1) == output_grads.size(2)); - - void* output_grads_ptr = static_cast(output_grads.data_ptr()); - - //Softmax Grad - DISPATCH_HALF_AND_BFLOAT( - output_grads_.scalar_type(), - "dispatch_scaled_upper_triang_masked_softmax_backward", - dispatch_scaled_upper_triang_masked_softmax_backward( - reinterpret_cast(output_grads_ptr), - reinterpret_cast(output_grads_ptr), - reinterpret_cast(softmax_results.data_ptr()), - scale_factor, - seq_len, - seq_len, - attn_batches); - ); - - //backward pass is completely in-place - return output_grads; -} -} -} -} diff --git a/colossalai/kernel/cuda_native/csrc/type_shim.h b/colossalai/kernel/cuda_native/csrc/type_shim.h deleted file mode 100644 index f7c155a4d75017b376c19261a151c253308187e1..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/csrc/type_shim.h +++ /dev/null @@ -1,280 +0,0 @@ -#include -#include "compat.h" - - -#define DISPATCH_HALF_AND_BFLOAT(TYPE, NAME, ...) \ - switch(TYPE) \ - { \ - case at::ScalarType::Half: \ - { \ - using scalar_t = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::BFloat16: \ - { \ - using scalar_t = at::BFloat16; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - - - -#define DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES(TYPEIN, TYPEOUT, NAME, ...) 
\ - switch(TYPEIN) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_in = float; \ - switch(TYPEOUT) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_out = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_out = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::BFloat16: \ - { \ - using scalar_t_out = at::BFloat16; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPEOUT), "'"); \ - } \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_in = at::Half; \ - using scalar_t_out = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::BFloat16: \ - { \ - using scalar_t_in = at::BFloat16; \ - using scalar_t_out = at::BFloat16; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPEIN), "'"); \ - } - -// Forward/backward compatiblity hack around -// https://github.com/pytorch/pytorch/commit/3aeb78079bcd68282fe9117088e138b77318e288 -// pending more future-proof guidance from upstream. -// struct TypeShim -// { -// const at::Type& payload; -// TypeShim(const at::Type& type) : payload(type) {} -// // Enable trivial conversion to a const at::Type& for pre-3aeb78 -// operator const at::Type&(){ return payload; }; -// // Enable dispatch switch statements to take *this directly for post-3aeb78 -// //operator at::ScalarType(){ return payload.; }; -// }; - -#define DISPATCH_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_##LEVEL = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -#define DISPATCH_FLOAT_HALF_AND_BYTE(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_##LEVEL = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Byte: \ - { \ - using scalar_t_##LEVEL = uint8_t; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -#define DISPATCH_DOUBLE_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Double: \ - { \ - using scalar_t_##LEVEL = double; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_##LEVEL = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -#define DISPATCH_DOUBLE_AND_FLOAT(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Double: \ - { \ - using scalar_t_##LEVEL = double; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -template -__device__ __forceinline__ T reduce_block_into_lanes(T *x, - T val, - int lanes = 1, - bool share_result = false) // lanes is intended to be <= 32. 
-{ - int tid = threadIdx.x + threadIdx.y * blockDim.x; - int blockSize = blockDim.x * blockDim.y; // blockSize is intended to be a multiple of 32. - - if (blockSize >= 64) - { - x[tid] = val; - __syncthreads(); - } - -#pragma unroll - for (int i = (blockSize >> 1); i >= 64; i >>= 1) - { - if (tid < i) - x[tid] = x[tid] + x[tid + i]; - __syncthreads(); - } - - T final; - - if (tid < 32) - { - if (blockSize >= 64) - final = x[tid] + x[tid + 32]; - else - final = val; - // __SYNCWARP(); - -#pragma unroll - for (int i = 16; i >= lanes; i >>= 1) -#ifdef COLOSSAL_HIP - final = final + __shfl_down(final, i); -#else - final = final + __shfl_down_sync(0xffffffff, final, i); -#endif - } - - if (share_result) - { - if (tid < lanes) - x[tid] = final; // EpilogueOp - // Make sure the smem result is visible to all warps. - __syncthreads(); - } - - return final; -} - -template -__device__ __forceinline__ T reduce_block_into_lanes_max_op(T *x, - T val, - int lanes = 1, - bool share_result = false) // lanes is intended to be <= 32. -{ - int tid = threadIdx.x + threadIdx.y * blockDim.x; - int blockSize = blockDim.x * blockDim.y; // blockSize is intended to be a multiple of 32. - - if (blockSize >= 64) - { - x[tid] = val; - __syncthreads(); - } - -#pragma unroll - for (int i = (blockSize >> 1); i >= 64; i >>= 1) - { - if (tid < i) - x[tid] = fmaxf(fabsf(x[tid]), fabsf(x[tid + i])); - __syncthreads(); - } - - T final; - - if (tid < 32) - { - if (blockSize >= 64) - final = fmaxf(fabsf(x[tid]), fabsf(x[tid + 32])); - else - final = val; - // __SYNCWARP(); - -#pragma unroll - for (int i = 16; i >= lanes; i >>= 1) -#ifdef COLOSSAL_HIP - final = fmaxf(fabsf(final), fabsf(__shfl_down(final, i))); -#else - final = fmaxf(fabsf(final), fabsf(__shfl_down_sync(0xffffffff, final, i))); -#endif - } - - if (share_result) - { - if (tid < lanes) - x[tid] = final; // EpilogueOp - // Make sure the smem result is visible to all warps. - __syncthreads(); - } - - return final; -} diff --git a/colossalai/kernel/cuda_native/layer_norm.py b/colossalai/kernel/cuda_native/layer_norm.py deleted file mode 100644 index b2ecd9ff9796a36913b82d550538a2252a5fd072..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/layer_norm.py +++ /dev/null @@ -1,79 +0,0 @@ -"""This code is from NVIDIA apex: - https://github.com/NVIDIA/apex - with some changes. 
""" - -import numbers -import torch -from torch.nn.parameter import Parameter -from torch.nn import init -from torch.cuda.amp import custom_fwd, custom_bwd -import importlib - -global colossal_layer_norm_cuda -colossal_layer_norm_cuda = None - - -class FusedLayerNormAffineFunction(torch.autograd.Function): - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, input, weight, bias, normalized_shape, eps): - - ctx.normalized_shape = normalized_shape - ctx.eps = eps - input_ = input.contiguous() - weight_ = weight.contiguous() - bias_ = bias.contiguous() - output, mean, invvar = colossal_layer_norm_cuda.forward_affine( - input_, ctx.normalized_shape, weight_, bias_, ctx.eps) - ctx.save_for_backward(input_, weight_, bias_, mean, invvar) - - return output - - @staticmethod - @custom_bwd - def backward(ctx, grad_output): - - input_, weight_, bias_, mean, invvar = ctx.saved_tensors - grad_input = grad_weight = grad_bias = None - grad_input, grad_weight, grad_bias \ - = colossal_layer_norm_cuda.backward_affine( - grad_output.contiguous(), mean, invvar, - input_, ctx.normalized_shape, - weight_, bias_, ctx.eps) - - return grad_input, grad_weight, grad_bias, None, None - - -class MixedFusedLayerNorm(torch.nn.Module): - - def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None): - super(MixedFusedLayerNorm, self).__init__() - - global colossal_layer_norm_cuda - if colossal_layer_norm_cuda is None: - try: - colossal_layer_norm_cuda = importlib.import_module("colossal_layer_norm_cuda") - except ImportError: - raise RuntimeError('MixedFusedLayerNorm requires cuda extensions') - - if isinstance(normalized_shape, numbers.Integral): - normalized_shape = (normalized_shape,) - self.normalized_shape = torch.Size(normalized_shape) - self.eps = eps - self.weight = Parameter(torch.empty(*normalized_shape, device=device, dtype=dtype)) - self.bias = Parameter(torch.empty(*normalized_shape, device=device, dtype=dtype)) - self.reset_parameters() - - def reset_parameters(self): - - init.ones_(self.weight) - init.zeros_(self.bias) - - def forward(self, input): - - return FusedLayerNormAffineFunction.apply(input, self.weight, self.bias, - self.normalized_shape, self.eps) - - def __repr__(self): - return f'MixedFusedLayerNorm(normalized_shape={self.normalized_shape}, eps={self.eps})' diff --git a/colossalai/kernel/cuda_native/multihead_attention.py b/colossalai/kernel/cuda_native/multihead_attention.py deleted file mode 100644 index 3e776b610c7abc969a05cde0a870debbdd77490c..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/multihead_attention.py +++ /dev/null @@ -1,276 +0,0 @@ -import math -import importlib -from dataclasses import dataclass - -import torch -from torch import nn -from torch.autograd import Function - - -def check_config(config): - if config.hidden_size % config.nhead != 0: - raise Exception(f"hidden_size % nhead != 0") - - factor = 8 if config.fp16 else 4 - upbound = factor * 1024 * 4 - if config.hidden_size > upbound: - # as required by ln backward kernel currently - raise Exception(f"hidden_size > {upbound}") - - head_dim = config.hidden_size // config.nhead - if head_dim % factor != 0: - # as required by reshape kernel - raise Exception(f"head_dim({head_dim}) % {factor} != 0") - - -def calc_offset(sizes): - offsets = [0] - tmp = 0 - for x in sizes: - tmp += x - offsets.append(tmp) - return offsets - - -colossal_multihead_attention = None - - -@dataclass -class Config: - max_batch_tokens: int # max batch token numbers - max_seq_len: int 
diff --git a/colossalai/kernel/cuda_native/multihead_attention.py b/colossalai/kernel/cuda_native/multihead_attention.py
deleted file mode 100644
index 3e776b610c7abc969a05cde0a870debbdd77490c..0000000000000000000000000000000000000000
--- a/colossalai/kernel/cuda_native/multihead_attention.py
+++ /dev/null
@@ -1,276 +0,0 @@
-import math
-import importlib
-from dataclasses import dataclass
-
-import torch
-from torch import nn
-from torch.autograd import Function
-
-
-def check_config(config):
-    if config.hidden_size % config.nhead != 0:
-        raise Exception(f"hidden_size ({config.hidden_size}) is not divisible by nhead ({config.nhead})")
-
-    factor = 8 if config.fp16 else 4
-    upbound = factor * 1024 * 4
-    if config.hidden_size > upbound:
-        # as required by ln backward kernel currently
-        raise Exception(f"hidden_size > {upbound}")
-
-    head_dim = config.hidden_size // config.nhead
-    if head_dim % factor != 0:
-        # as required by reshape kernel
-        raise Exception(f"head_dim({head_dim}) % {factor} != 0")
-
-
-def calc_offset(sizes):
-    offsets = [0]
-    tmp = 0
-    for x in sizes:
-        tmp += x
-        offsets.append(tmp)
-    return offsets
-
-
-colossal_multihead_attention = None
-
-
-@dataclass
-class Config:
-    max_batch_tokens: int  # max batch token numbers
-    max_seq_len: int  # max sequence length
-    hidden_size: int  # size of transformer hidden layers
-    nhead: int  # number of heads in attention
-    attn_prob_dropout_ratio: float  # attention score dropout ratio
-    hidden_dropout_ratio: float  # dropout ratio before residual
-    norm_first: bool  # norm_first
-    fp16: bool  # fp16 precision
-
-
-class MultiHeadAttention1DFunc(Function):
-
-    @staticmethod
-    def forward(ctx, input, input_mask, in_proj_weight, in_proj_bias, out_proj_weight,
-                out_proj_bias, norm_weight, norm_bias, config):
-        cuda_module = colossal_multihead_attention
-        forward_func = (cuda_module.multihead_attention_fw_fp16
-                        if config.fp16 else cuda_module.multihead_attention_fw_fp32)
-        if config.fp16:
-            input = input.to(torch.half)
-            input_mask = input_mask.to(torch.half)
-
-        (output,) = forward_func(config.layer_id, input, input_mask, in_proj_weight, in_proj_bias,
-                                 out_proj_weight, out_proj_bias, norm_weight, norm_bias,
-                                 config.training, config.norm_first)
-
-        if config.is_grad_enabled and config.training:
-            ctx.save_for_backward(output, input, input_mask, in_proj_weight, in_proj_bias,
-                                  out_proj_weight, out_proj_bias, norm_weight, norm_bias)
-            ctx.config = config
-        return output
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        assert ctx.config.training
-
-        cuda_module = colossal_multihead_attention
-        backward_func = (cuda_module.multihead_attention_bw_fp16
-                         if ctx.config.fp16 else cuda_module.multihead_attention_bw_fp32)
-
-        output, input, input_mask, in_proj_weight, in_proj_bias, out_proj_weight, \
-            out_proj_bias, norm_weight, norm_bias = ctx.saved_tensors
-
-        grad_input = None
-        grad_in_proj_weight = None
-        grad_in_proj_bias = None
-        grad_out_proj_weight = None
-        grad_out_proj_bias = None
-        grad_norm_weight = None
-        grad_norm_bias = None
-
-        if ctx.config.fp16:
-            grad_output = grad_output.to(torch.half)
-            output = output.to(torch.half)
-            input = input.to(torch.half)
-            input_mask = input_mask.to(torch.half)
-        grad_input, grad_in_proj_weight, grad_in_proj_bias, grad_out_proj_weight, \
-            grad_out_proj_bias, grad_norm_weight, grad_norm_bias = backward_func(
-                ctx.config.layer_id, grad_output, output, input, input_mask, in_proj_weight,
-                in_proj_bias, out_proj_weight, out_proj_bias, norm_weight, norm_bias)
-
-        return (grad_input, None, grad_in_proj_weight, grad_in_proj_bias, grad_out_proj_weight,
-                grad_out_proj_bias, grad_norm_weight, grad_norm_bias, None)
-
-
-class MultiHeadAttention(nn.Module):
-    """Initialize the MultiHeadAttention.
-
-    Static variable:
-
-        layer_id: The layer-index counter, starting from 0 and incremented by 1 every time
-            a layer object is instantiated, e.g. if a model has 24 transformer layers,
-            layer_id goes from 0 to 23.
-
-    Arguments:
-        hidden_size: Total dimension of hidden_size.
-        nhead: Number of parallel attention heads.
- batch_size: Batch Size for one foward - max_seq_len: Max length of input sequence - dropout: Dropout probability - norm_first: perform LayerNorms before attention - """ - - layer_id = 0 - - def __init__(self, - hidden_size, - nhead, - batch_size, - max_seq_len, - dropout=0.0, - norm_first=False, - fp16=True, - pg=None): - super(MultiHeadAttention, self).__init__() - - self.config = Config(batch_size * max_seq_len, max_seq_len, hidden_size, nhead, dropout, - dropout, norm_first, fp16) - check_config(self.config) - self.pg = pg - self.pg_size = 1 - if self.pg: - self.pg_size = pg.size() - self.config.layer_id = MultiHeadAttention.layer_id - MultiHeadAttention.layer_id = MultiHeadAttention.layer_id + 1 - - # Load cuda modules if needed - global colossal_multihead_attention - if colossal_multihead_attention is None: - try: - colossal_multihead_attention = importlib.import_module("colossal_multihead_attention") - except ImportError: - raise RuntimeError('MultiHeadAttention requires cuda extensions') - - # create the layer in cuda kernels. - cuda_module = colossal_multihead_attention - create_layer_func = (cuda_module.create_multihead_attention_fp16 - if self.config.fp16 else cuda_module.create_multihead_attention_fp32) - - create_layer_func( - self.config.layer_id, - self.config.max_batch_tokens, - self.config.max_seq_len, - self.config.hidden_size, - self.config.nhead, - self.config.attn_prob_dropout_ratio, - self.config.hidden_dropout_ratio, - self.config.norm_first, - self.pg, - ) - - hs = self.config.hidden_size - - self.precision = torch.float32 - if self.config.fp16: - self.precision = torch.half - - self.hs_per_rank = int(hs / self.pg_size) - - self.in_proj_weight = nn.Parameter(torch.Tensor(3, self.hs_per_rank, hs)) - self.in_proj_bias = nn.Parameter(torch.Tensor(3, self.hs_per_rank)) - self.out_proj_weight = nn.Parameter(torch.Tensor(hs, self.hs_per_rank)) - self.out_proj_bias = nn.Parameter(torch.Tensor(hs)) - self.norm_weight = nn.Parameter(torch.Tensor(hs)) - self.norm_bias = nn.Parameter(torch.Tensor(hs)) - - self.reset_parameters() - torch.cuda.empty_cache() - - def calc_bound(self, w): - fan_in, _ = nn.init._calculate_fan_in_and_fan_out(w) - bound = 1.0 / math.sqrt(fan_in) - return bound - - def reset_parameters(self): - hs = self.config.hidden_size - - nn.init.zeros_(self.out_proj_bias) - - nn.init.ones_(self.norm_weight) - nn.init.zeros_(self.norm_bias) - - if self.pg_size > 1: - rank_in_pg = torch.distributed.get_rank(self.pg) - attn_qkvw_global = torch.empty(hs * 3, hs) - attn_qkvb_global = torch.empty(hs * 3) - nn.init.xavier_uniform_(attn_qkvw_global, 1.0 / math.sqrt(2.0)) - bound = self.calc_bound(attn_qkvw_global) - nn.init.uniform_(attn_qkvb_global, -bound, bound) - - attn_qkvw_global = attn_qkvw_global.cuda() - attn_qkvb_global = attn_qkvb_global.cuda() - torch.distributed.broadcast(attn_qkvw_global, src=0, group=self.pg) - torch.distributed.broadcast(attn_qkvb_global, src=0, group=self.pg) - attn_qkvw_global = attn_qkvw_global.cpu() - attn_qkvb_global = attn_qkvb_global.cpu() - - with torch.no_grad(): - self.in_proj_weight.copy_( - attn_qkvw_global.view(3, hs, hs)[:, - int(hs * rank_in_pg / - self.pg_size):int(hs * (rank_in_pg + 1) / - self.pg_size), :]) - self.in_proj_bias.copy_( - attn_qkvb_global.view(3, hs)[:, - int(hs * rank_in_pg / - self.pg_size):int(hs * (rank_in_pg + 1) / - self.pg_size)]) - - attn_ow_global = torch.empty(hs, hs) - nn.init.xavier_uniform_(attn_ow_global, 1.0) - attn_ow_global = attn_ow_global.cuda() - 
torch.distributed.broadcast(attn_ow_global, src=0, group=self.pg) - attn_ow_global = attn_ow_global.cpu() - with torch.no_grad(): - self.out_proj_weight.copy_(attn_ow_global[:, - int(hs * rank_in_pg / - self.pg_size):int(hs * (rank_in_pg + 1) / - self.pg_size)]) - - else: - attn_qkvw = self.in_proj_weight.view(-1, hs) - nn.init.xavier_uniform_(attn_qkvw, 1.0 / math.sqrt(2.0)) - bound = self.calc_bound(attn_qkvw) - nn.init.uniform_(self.in_proj_bias, -bound, bound) - - nn.init.xavier_uniform_(self.out_proj_weight, 1.0) - - def state_dict(self, destination=None, prefix="", keep_vars=False): - destination = torch.nn.Module.state_dict(self, - destination=destination, - prefix=prefix, - keep_vars=keep_vars) - return destination - - def forward(self, hidden_states, encoder_padding_mask): - self.config.training = self.training - self.config.is_grad_enabled = torch.is_grad_enabled() - hidden_states = hidden_states.contiguous() - encoder_padding_mask = ((encoder_padding_mask * -1e8).type_as(hidden_states).contiguous()) - - bs, sl, dim = hidden_states.size() - if bs * sl > self.config.max_batch_tokens: - raise ValueError( - f"Batch token numbers {bs * sl} exceeds the limit {self.config.max_batch_tokens}.") - if sl > self.config.max_seq_len: - raise ValueError(f"Sequence length {sl} exceeds the limit {self.config.max_seq_len}.") - if len(encoder_padding_mask.size()) == 1: - assert bs == 1 and sl == encoder_padding_mask.size(0) - else: - assert bs == encoder_padding_mask.size(0) and sl == encoder_padding_mask.size(1) - - output = MultiHeadAttention1DFunc.apply(hidden_states, encoder_padding_mask, - self.in_proj_weight, self.in_proj_bias, - self.out_proj_weight, self.out_proj_bias, - self.norm_weight, self.norm_bias, self.config) - - return output.to(self.precision) diff --git a/colossalai/kernel/cuda_native/scaled_softmax.py b/colossalai/kernel/cuda_native/scaled_softmax.py deleted file mode 100644 index 786b922c6905791c0308f765f591672d7dca2f99..0000000000000000000000000000000000000000 --- a/colossalai/kernel/cuda_native/scaled_softmax.py +++ /dev/null @@ -1,201 +0,0 @@ -"""This code from NVIDIA Megatron - with some changes. """ - -import torch -import torch.nn as nn -import enum - - -class AttnMaskType(enum.Enum): - padding = 1 - causal = 2 - - -class ScaledUpperTriangMaskedSoftmax(torch.autograd.Function): - """ - Fused operation which performs following three operations in sequence - - 1. Scale the tensor. - 2. Apply upper triangular mask (typically used in gpt models). - 3. Perform softmax. - """ - - @staticmethod - def forward(ctx, inputs, scale): - try: - import colossal_scaled_upper_triang_masked_softmax - except ImportError: - raise RuntimeError('ScaledUpperTriangMaskedSoftmax requires cuda extensions') - - scale_t = torch.tensor([scale]) - softmax_results = colossal_scaled_upper_triang_masked_softmax.forward( - inputs, scale_t[0] - ) - - ctx.save_for_backward(softmax_results, scale_t) - return softmax_results - - @staticmethod - def backward(ctx, output_grads): - try: - import colossal_scaled_upper_triang_masked_softmax - except ImportError: - raise RuntimeError('ScaledUpperTriangMaskedSoftmax requires cuda extensions') - - softmax_results, scale_t = ctx.saved_tensors - input_grads = colossal_scaled_upper_triang_masked_softmax.backward( - output_grads, softmax_results, scale_t[0] - ) - - return input_grads, None - - -class ScaledMaskedSoftmax(torch.autograd.Function): - """ - Fused operation which performs following three operations in sequence - - 1. Scale the tensor. - 2. Apply the mask. 
-    3. Perform softmax.
-    """
-
-    @staticmethod
-    def forward(ctx, inputs, mask, scale):
-        try:
-            import colossal_scaled_masked_softmax
-        except ImportError:
-            raise RuntimeError('ScaledMaskedSoftmax requires cuda extensions')
-
-        scale_t = torch.tensor([scale])
-
-        softmax_results = colossal_scaled_masked_softmax.forward(inputs, mask, scale_t[0])
-        ctx.save_for_backward(softmax_results, scale_t)
-        return softmax_results
-
-    @staticmethod
-    def backward(ctx, output_grads):
-        try:
-            import colossal_scaled_masked_softmax
-        except ImportError:
-            raise RuntimeError('ScaledMaskedSoftmax requires cuda extensions')
-
-        softmax_results, scale_t = ctx.saved_tensors
-
-        input_grads = colossal_scaled_masked_softmax.backward(
-            output_grads, softmax_results, scale_t[0]
-        )
-        return input_grads, None, None
-
-
-class FusedScaleMaskSoftmax(nn.Module):
-    """
-    Fused operation: scaling + mask + softmax
-
-    Arguments:
-        input_in_fp16: Flag to indicate if input is in fp16 data format.
-        input_in_bf16: Flag to indicate if input is in bf16 data format.
-        attn_mask_type: Attention mask type (pad or causal)
-        scaled_masked_softmax_fusion: Flag to indicate whether the user wants to use softmax fusion
-        mask_func: Mask function to be applied.
-        softmax_in_fp32: If True, softmax is performed at fp32 precision.
-        scale: Scaling factor used in input tensor scaling.
-    """
-
-    def __init__(
-        self,
-        input_in_fp16,
-        input_in_bf16,
-        attn_mask_type,
-        scaled_masked_softmax_fusion,
-        mask_func,
-        softmax_in_fp32,
-        scale,
-    ):
-        super(FusedScaleMaskSoftmax, self).__init__()
-        self.input_in_fp16 = input_in_fp16
-        self.input_in_bf16 = input_in_bf16
-        assert not (
-            self.input_in_fp16 and self.input_in_bf16
-        ), "both fp16 and bf16 flags cannot be active at the same time."
-        self.input_in_float16 = self.input_in_fp16 or self.input_in_bf16
-        self.attn_mask_type = attn_mask_type
-        self.scaled_masked_softmax_fusion = scaled_masked_softmax_fusion
-        self.mask_func = mask_func
-        self.softmax_in_fp32 = softmax_in_fp32
-        self.scale = scale
-
-        assert (
-            self.scale is None or softmax_in_fp32
-        ), "softmax should be in fp32 when scaled"
-
-    def forward(self, input, mask):
-        # [b, np, sq, sk]
-        assert input.dim() == 4
-
-        if self.is_kernel_available(mask, *input.size()):
-            return self.forward_fused_softmax(input, mask)
-        else:
-            return self.forward_torch_softmax(input, mask)
-
-    def is_kernel_available(self, mask, b, np, sq, sk):
-        attn_batches = b * np
-
-        if (
-            self.scaled_masked_softmax_fusion  # user wants to fuse
-            and self.input_in_float16  # input must be fp16
-            and mask is not None  # mask tensor must not be None
-            and 16 < sk <= 2048  # sk must be 16 ~ 2048
-            and sq % 4 == 0  # sq must be divisible by 4
-            and attn_batches % 4 == 0  # np * b must be divisible by 4
-        ):
-            if 0 <= sk <= 2048:
-                batch_per_block = self.get_batch_per_block(sq, sk, b, np)
-
-                if self.attn_mask_type == AttnMaskType.causal:
-                    if attn_batches % batch_per_block == 0:
-                        return True
-                else:
-                    if sq % batch_per_block == 0:
-                        return True
-        return False
-
-    def forward_fused_softmax(self, input, mask):
-        b, np, sq, sk = input.size()
-        scale = self.scale if self.scale is not None else 1.0
-
-        if self.attn_mask_type == AttnMaskType.causal:
-            assert sq == sk, "causal mask is only for self attention"
-
-            # input is reshaped to a 3D tensor (attn_batches, sq, sk)
-            input = input.view(-1, sq, sk)
-            probs = ScaledUpperTriangMaskedSoftmax.apply(input, scale)
-            return probs.view(b, np, sq, sk)
-        else:
-            # input is a 4D tensor (b, np, sq, sk)
-            return ScaledMaskedSoftmax.apply(input, mask, scale)
-
-    def forward_torch_softmax(self, input, mask):
-        if self.input_in_float16 and self.softmax_in_fp32:
-            input = input.float()
-
-        if self.scale is not None:
-            input = input * self.scale
-        mask_output = self.mask_func(input, mask) if mask is not None else input
-        probs = torch.nn.Softmax(dim=-1)(mask_output)
-
-        if self.input_in_float16 and self.softmax_in_fp32:
-            if self.input_in_fp16:
-                probs = probs.half()
-            else:
-                probs = probs.bfloat16()
-
-        return probs
-
-    @staticmethod
-    def get_batch_per_block(sq, sk, b, np):
-        try:
-            import colossal_scaled_masked_softmax
-        except ImportError:
-            raise RuntimeError('ScaledMaskedSoftmax requires cuda extensions')
-
-        return colossal_scaled_masked_softmax.get_batch_per_block(sq, sk, b, np)
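`forward_torch_softmax` above is the eager fallback used when the fused kernel is unavailable. For the causal path, the same result can be reproduced with plain torch ops; the reference sketch below (function names are ours) mirrors the scale, upper-triangular mask, softmax sequence of `ScaledUpperTriangMaskedSoftmax`.

```python
import torch

def causal_mask_func(scores: torch.Tensor, _mask=None) -> torch.Tensor:
    """Upper-triangular masking, matching the kernel's causal mask step."""
    sq, sk = scores.shape[-2:]
    causal = torch.triu(torch.ones(sq, sk, dtype=torch.bool, device=scores.device),
                        diagonal=1)
    return scores.masked_fill(causal, float('-inf'))

def scaled_upper_triang_masked_softmax_ref(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Eager reference: scale -> causal mask -> softmax over the last dim."""
    return torch.softmax(causal_mask_func(x * scale), dim=-1)

x = torch.randn(8, 512, 512)   # (attn_batches, sq, sk), the kernel's 3D layout
probs = scaled_upper_triang_masked_softmax_ref(x, scale=0.125)
assert torch.allclose(probs.sum(-1), torch.ones(8, 512))  # each row normalizes to 1
```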
diff --git a/colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp b/colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp
deleted file mode 100644
index 07154ac4c6473165483a1d6a7ad8cca1df64d02f..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp
+++ /dev/null
@@ -1,72 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
-// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_adam.cu
-#include <torch/extension.h>
-
-void multi_tensor_scale_cuda(
-    int chunk_size,
-    at::Tensor noop_flag,
-    std::vector<std::vector<at::Tensor>> tensor_lists,
-    float scale);
-
-void multi_tensor_sgd_cuda(
-    int chunk_size,
-    at::Tensor noop_flag,
-    std::vector<std::vector<at::Tensor>> tensor_lists,
-    float wd,
-    float momentum,
-    float dampening,
-    float lr,
-    bool nesterov,
-    bool first_run,
-    bool wd_after_momentum,
-    float scale);
-
-void multi_tensor_adam_cuda(
-    int chunk_size,
-    at::Tensor noop_flag,
-    std::vector<std::vector<at::Tensor>> tensor_lists,
-    const float lr,
-    const float beta1,
-    const float beta2,
-    const float epsilon,
-    const int step,
-    const int mode,
-    const int bias_correction,
-    const float weight_decay);
-
-void multi_tensor_lamb_cuda(
-    int chunk_size,
-    at::Tensor noop_flag,
-    std::vector<std::vector<at::Tensor>> tensor_lists,
-    const float lr,
-    const float beta1,
-    const float beta2,
-    const float epsilon,
-    const int step,
-    const int bias_correction,
-    const float weight_decay,
-    const int grad_averaging,
-    const int mode,
-    at::Tensor global_grad_norm,
-    const float max_grad_norm,
-    at::optional<bool> use_nvlamb_python);
-
-std::tuple<at::Tensor, at::Tensor> multi_tensor_l2norm_cuda(
-    int chunk_size,
-    at::Tensor noop_flag,
-    std::vector<std::vector<at::Tensor>> tensor_lists,
-    at::optional<bool> per_tensor_python);
-
-PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
-{
-  m.def("multi_tensor_scale", &multi_tensor_scale_cuda,
-        "Fused overflow check + scale for a list of contiguous tensors");
-  m.def("multi_tensor_sgd", &multi_tensor_sgd_cuda,
-        "Fused SGD optimizer for a list of contiguous tensors");
-  m.def("multi_tensor_adam", &multi_tensor_adam_cuda,
-        "Computes and applies gradient update to parameters for Adam optimizer");
-  m.def("multi_tensor_lamb", &multi_tensor_lamb_cuda,
-        "Computes and applies update for LAMB optimizer");
-  m.def("multi_tensor_l2norm", &multi_tensor_l2norm_cuda,
-        "Computes L2 norm for a list of contiguous tensors");
-}
\ No newline at end of file
diff --git a/colossalai/kernel/hip_native/csrc/compat.h b/colossalai/kernel/hip_native/csrc/compat.h
deleted file mode 100644
index aaa6c92581f35d2b4a3544038d94126f8e6c249d..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/compat.h
+++ /dev/null
@@ -1,11 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
-// modified from https://github.com/NVIDIA/apex/blob/master/csrc/compat.h
-#ifndef TORCH_CHECK
-#define TORCH_CHECK AT_CHECK
-#endif
-
-#ifdef VERSION_GE_1_3
-#define DATA_PTR data_ptr
-#else
-#define DATA_PTR data
-#endif
\ No newline at end of file
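The next deleted file, cross_entropy.hip, computes label-smoothed cross entropy on device: per token it produces nll_loss = -log p(target) and the smoothed loss (1 - eps - eps_i) * nll + eps_i * sum(-log p), with eps_i = eps / (vocab_size - 1), zeroing both at padding positions. As a hedged eager reference for that forward math (helper name and example values are ours, not from the source):

```python
import torch
import torch.nn.functional as F

def ls_cross_entropy_ref(logits: torch.Tensor, targets: torch.Tensor,
                         epsilon: float, padding_idx: int = -1):
    """Eager reference for the label-smoothed loss computed by the kernel below.

    Returns (smoothed_loss, nll_loss) per token, both zeroed at padding positions.
    """
    vocab_size = logits.size(-1)
    logp = F.log_softmax(logits.float(), dim=-1)   # handles the max-subtraction trick
    nll = -logp.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    eps_i = epsilon / (vocab_size - 1)
    sum_nll = -logp.sum(dim=-1)                    # sum of neg log probs over the vocab
    loss = (1.0 - epsilon - eps_i) * nll + eps_i * sum_nll
    pad = targets.eq(padding_idx)
    return loss.masked_fill(pad, 0.0), nll.masked_fill(pad, 0.0)

logits = torch.randn(4, 7, 32000)        # (batch_size, seq_len, vocab_size)
targets = torch.randint(0, 32000, (4, 7))
loss, nll = ls_cross_entropy_ref(logits, targets, epsilon=0.1)
print(loss.sum(), nll.sum())             # the kernel reduces these sums on device
```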
diff --git a/colossalai/kernel/hip_native/csrc/kernels/cross_entropy.hip b/colossalai/kernel/hip_native/csrc/kernels/cross_entropy.hip
deleted file mode 100644
index ceff7a373b0695e37f0b7f1933b9d15e8c4437d8..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/cross_entropy.hip
+++ /dev/null
@@ -1,193 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
-#include "hip/hip_runtime.h"
-#include "block_reduce.h"
-#include "cuda_util.h"
-#include "kernels.h"
-#include "ls_cub.cuh"
-
-ls::hipcub::CachingDeviceAllocator g_allocator(true);
-
-template <typename T>
-__global__ void ls_cross_entropy_fw_kernel(
-    const T *__restrict__ inputs, const int *__restrict__ targets,
-    float *__restrict__ outputs, float *__restrict__ nll_loss_outputs,
-    const int padding_idx, const float epsilon, const int vocab_size) {
-  /* step1: compute each thread's max_logit and sum_exp_logit, store in
-   * max_input, sum_exp_logit */
-  const int block_start = blockIdx.x * vocab_size;
-  const int left_idx = block_start + threadIdx.x;
-  const int right_idx = (blockIdx.x + 1) * vocab_size;
-  float max_input[1] = {REDUCE_FLOAT_INF_NEG};
-  float sum_logits[2] = {0.f, 0.f};  // logit and logit exp
-  int target_tid = targets[blockIdx.x];
-
-  if (target_tid == padding_idx) {
-    if (threadIdx.x == 0) {
-      nll_loss_outputs[blockIdx.x] = 0.f;
-      outputs[blockIdx.x] = 0.f;
-    }
-    return;
-  }
-
-  for (int i = left_idx; i < right_idx; i += blockDim.x) {
-    max_input[0] = fmaxf(max_input[0], static_cast<float>(inputs[i]));
-  }
-  blockReduce<ReduceType::kMax, 1>(max_input);
-  __shared__ float s_max_input;
-  if (threadIdx.x == 0) {
-    s_max_input = max_input[0];
-  }
-  __syncthreads();
-
-  for (int i = left_idx; i < right_idx; i += blockDim.x) {
-    float logit = static_cast<float>(inputs[i]) - s_max_input;
-    sum_logits[0] += logit;
-    sum_logits[1] += expf(logit);
-  }
-
-  blockReduce<ReduceType::kSum, 2>(sum_logits);
-  __shared__ float s_sum_logit;
-  __shared__ float s_sum_exp;
-  if (threadIdx.x == 0) {
-    s_sum_logit = sum_logits[0];
-    s_sum_exp = sum_logits[1];
-  }
-  __syncthreads();
-
-  float eps_i = epsilon / (vocab_size - 1);
-  if (threadIdx.x == 0) {
-    // neg_log_prob = log(sum(exp(x - x_max))) - (x - x_max)
-    float nll_loss = logf(s_sum_exp) -
-                     static_cast<float>(inputs[block_start + target_tid]) +
-                     s_max_input;
-    nll_loss_outputs[blockIdx.x] = nll_loss;
-    float sum_nll_loss = vocab_size * logf(s_sum_exp) - s_sum_logit;
-    outputs[blockIdx.x] =
-        (1.f - epsilon - eps_i) * nll_loss + eps_i * sum_nll_loss;
-  }
-}
-
-template <typename T>
-__global__ void ls_cross_entropy_bw_kernel(
-    const float *__restrict__ grad_outputs, const T *__restrict__ inputs,
-    const int *__restrict__ targets, T *__restrict__ grad_inputs,
-    const int padding_idx, const float epsilon, const int vocab_size) {
-  /* step1: compute each thread's max_logit and sum_exp_logit, store in
-   * max_input, sum_exp_logit */
-  const int block_start = blockIdx.x * vocab_size;
-  const int left_idx = block_start + threadIdx.x;
-  const int right_idx = (blockIdx.x + 1) * vocab_size;
-  float max_input[1] = {REDUCE_FLOAT_INF_NEG};
-  float sum_logits[1] = {0.f};
-  const float grad_out = static_cast<float>(grad_outputs[0]);
-  int target_tid = targets[blockIdx.x];
-
-  if (target_tid == padding_idx) {
-    for (int i = left_idx; i < right_idx; i += blockDim.x) {
-      grad_inputs[i] = 0.f;
-    }
-    return;
-  }
-
-  for (int i = left_idx; i < right_idx; i += blockDim.x) {
-    max_input[0] = fmaxf(max_input[0], static_cast<float>(inputs[i]));
-  }
-  blockReduce<ReduceType::kMax, 1>(max_input);
-  __shared__ float s_max_input;
-  if (threadIdx.x == 0) {
-    s_max_input = max_input[0];
-  }
-  __syncthreads();
-
-  for (int i = left_idx; i < right_idx; i += blockDim.x) {
-    float logit = static_cast<float>(inputs[i]) - s_max_input;
-    sum_logits[0] += expf(logit);
-  }
-
-  blockReduce<ReduceType::kSum, 1>(sum_logits);
-  __shared__ float s_sum_exp;
-  if (threadIdx.x == 0) {
-    s_sum_exp = sum_logits[0];
-  }
-  __syncthreads();
-
-  float eps_i = epsilon / (vocab_size - 1);
-  float nll_weight = 1.0 - epsilon - eps_i;
-
-  for (int i = left_idx; i < right_idx; i += blockDim.x) {
-    float prob = expf(static_cast<float>(inputs[i]) - s_max_input) / s_sum_exp;
-    float grad = 0;
-    grad += (vocab_size * prob - 1) * eps_i;
-    grad += prob * nll_weight;
-    if ((i - block_start) == target_tid) {
-      grad -= nll_weight;
-    }
-    grad_inputs[i] = grad_out * grad;
-  }
-}
-
-template <typename T>
-void launch_cross_entropy_fw(const T *inputs_ptr, const int *targets_ptr,
-                             float *outputs_ptr, float *nll_loss_ptr,
-                             float *loss_buffer, const int padding_idx,
-                             const float epsilon, const int batch_size,
-                             const int seq_len, const int vocab_size,
-                             hipStream_t stream) {
-  int grid_dim = batch_size * seq_len;
-  float *nll_loss_buffer = loss_buffer + grid_dim;
-  hipLaunchKernelGGL((ls_cross_entropy_fw_kernel<T>), dim3(grid_dim), dim3(MAX_THREADS), 0, stream,
-      inputs_ptr, targets_ptr, loss_buffer, nll_loss_buffer, padding_idx,
-      epsilon, vocab_size);
-
-  int num_items = grid_dim;
-  void *d_temp_storage = NULL;
-  size_t temp_storage_bytes = 0;
-  CHECK_GPU_ERROR(ls::hipcub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
-                                                loss_buffer, outputs_ptr,
-                                                num_items, stream));
-  CHECK_GPU_ERROR(
-      g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes));
-  CHECK_GPU_ERROR(ls::hipcub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
-                                                loss_buffer, outputs_ptr,
-                                                num_items, stream));
-  CHECK_GPU_ERROR(ls::hipcub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
-                                                nll_loss_buffer, nll_loss_ptr,
-                                                num_items, stream));
-  CHECK_GPU_ERROR(g_allocator.DeviceFree(d_temp_storage));
-}
-
-template void launch_cross_entropy_fw<float>(
-    const float *inputs_ptr, const int *targets_ptr, float *outputs_ptr,
-    float *nll_loss_ptr, float *loss_buffer, const int padding_idx,
-    const float epsilon, const int batch_size, const int seq_len,
-    const int vocab_size, hipStream_t stream);
-
-template void launch_cross_entropy_fw<__half>(
-    const __half *inputs_ptr, const int *targets_ptr, float *outputs_ptr,
-    float *nll_loss_ptr, float *loss_buffer, const int padding_idx,
-    const float epsilon, const int batch_size, const int seq_len,
-    const int vocab_size, hipStream_t stream);
-
-template <typename T>
-void launch_cross_entropy_bw(const float *grad_outputs_ptr, const T *inputs_ptr,
-                             const int *targets_ptr, T *grad_inputs_ptr,
-                             const int padding_idx, const float epsilon,
-                             const int batch_size, const int seq_len,
-                             const int vocab_size, hipStream_t stream) {
-  int grid_dim = batch_size * seq_len;
-  hipLaunchKernelGGL((ls_cross_entropy_bw_kernel<T>), dim3(grid_dim), dim3(MAX_THREADS), 0, stream,
-      grad_outputs_ptr, inputs_ptr, targets_ptr, grad_inputs_ptr, padding_idx,
-      epsilon, vocab_size);
-}
-
-template void launch_cross_entropy_bw<float>(
-    const float *grad_outputs_ptr, const float *inputs_ptr,
-    const int *targets_ptr, float *grad_inputs_ptr, const int padding_idx,
-    const float epsilon, const int batch_size, const int seq_len,
-    const
int vocab_size, hipStream_t stream); - -template void launch_cross_entropy_bw<__half>( - const float *grad_outputs_ptr, const __half *inputs_ptr, - const int *targets_ptr, __half *grad_inputs_ptr, const int padding_idx, - const float epsilon, const int batch_size, const int seq_len, - const int vocab_size, hipStream_t stream); diff --git a/colossalai/kernel/hip_native/csrc/kernels/cublas_wrappers.hip b/colossalai/kernel/hip_native/csrc/kernels/cublas_wrappers.hip deleted file mode 100644 index 7f0f0a6144853f11adcc5b31a02297d6f704f05a..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/cublas_wrappers.hip +++ /dev/null @@ -1,172 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/* Copyright 2021 The LightSeq Team - Copyright Microsoft DeepSpeed - This file is adapted from Microsoft DeepSpeed -*/ -#include "cublas_wrappers.h" - -#ifdef COLOSSAL_HIP -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, rocblas_gemm_algo algo) { - rocblas_status status = - rocblas_gemm_ex(handle, transa, transb, m, n, k, (const void *)alpha, - (const void *)A, rocblas_datatype_f32_r, (transa == rocblas_operation_none) ? m : k, - (const void *)B, rocblas_datatype_f32_r, (transb == rocblas_operation_none) ? k : n, - (const void *)beta, C, rocblas_datatype_f32_r, m, C, rocblas_datatype_f32_r, m, rocblas_datatype_f32_r, algo, 0, 0); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, rocblas_gemm_algo algo) { - rocblas_status status = rocblas_gemm_ex( - handle, transa, transb, m, n, k, (const void *)alpha, (const void *)A, - rocblas_datatype_f16_r, (transa == rocblas_operation_none) ? m : k, (const void *)B, rocblas_datatype_f16_r, - (transb == rocblas_operation_none) ? k : n, (const void *)beta, (void *)C, - rocblas_datatype_f16_r, m, (void *)C, rocblas_datatype_f16_r, m, rocblas_datatype_f32_r, algo, 0, 0); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_strided_batched_gemm(rocblas_handle handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - rocblas_operation op_A, rocblas_operation op_B, - int stride_A, int stride_B, int stride_C, - int batch, rocblas_gemm_algo algo) { - rocblas_status status = rocblas_gemm_strided_batched_ex( - handle, op_A, op_B, m, n, k, alpha, A, rocblas_datatype_f32_r, - (op_A == rocblas_operation_none) ? m : k, stride_A, B, rocblas_datatype_f32_r, - (op_B == rocblas_operation_none) ? k : n, stride_B, beta, C, rocblas_datatype_f32_r, m, stride_C, - C, rocblas_datatype_f16_r, m, stride_C, batch, rocblas_datatype_f32_r, algo, 0, 0); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. 
(batch: %d, m: %d, n: %d, k: %d, " - "error: %d) \n", - batch, m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_strided_batched_gemm(rocblas_handle handle, int m, int n, int k, - const float *alpha, const float *beta, - const __half *A, const __half *B, __half *C, - rocblas_operation op_A, rocblas_operation op_B, - int stride_A, int stride_B, int stride_C, - int batch, rocblas_gemm_algo algo) { - rocblas_status status = rocblas_gemm_strided_batched_ex( - handle, op_A, op_B, m, n, k, alpha, A, rocblas_datatype_f16_r, - (op_A == rocblas_operation_none) ? m : k, stride_A, B, rocblas_datatype_f16_r, - (op_B == rocblas_operation_none) ? k : n, stride_B, beta, C, rocblas_datatype_f16_r, m, stride_C, - C, rocblas_datatype_f16_r, m, stride_C, batch, rocblas_datatype_f32_r, algo, 0, 0); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - - return 0; -} -#else -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, cublasGemmAlgo_t algo) { - rocblas_status status = - rocblas_gemmex(handle, transa, transb, m, n, k, (const void *)alpha, - (const void *)A, hipR32F, (transa == rocblas_operation_none) ? m : k, - (const void *)B, hipR32F, (transb == rocblas_operation_none) ? k : n, - (const void *)beta, C, hipR32F, m, hipR32F, algo); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, cublasGemmAlgo_t algo) { - rocblas_status status = rocblas_gemmex( - handle, transa, transb, m, n, k, (const void *)alpha, (const void *)A, - hipR16F, (transa == rocblas_operation_none) ? m : k, (const void *)B, hipR16F, - (transb == rocblas_operation_none) ? k : n, (const void *)beta, (void *)C, - hipR16F, m, hipR32F, algo); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_strided_batched_gemm(rocblas_handle handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - rocblas_operation op_A, rocblas_operation op_B, - int stride_A, int stride_B, int stride_C, - int batch, cublasGemmAlgo_t algo) { - rocblas_status status = cublasGemmStridedBatchedEx( - handle, op_A, op_B, m, n, k, alpha, A, hipR32F, - (op_A == rocblas_operation_none) ? m : k, stride_A, B, hipR32F, - (op_B == rocblas_operation_none) ? k : n, stride_B, beta, C, hipR32F, m, stride_C, - batch, hipR32F, algo); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. 
(batch: %d, m: %d, n: %d, k: %d, " - "error: %d) \n", - batch, m, n, k, (int)status); - return EXIT_FAILURE; - } - return 0; -} - -int cublas_strided_batched_gemm(rocblas_handle handle, int m, int n, int k, - const float *alpha, const float *beta, - const __half *A, const __half *B, __half *C, - rocblas_operation op_A, rocblas_operation op_B, - int stride_A, int stride_B, int stride_C, - int batch, cublasGemmAlgo_t algo) { - rocblas_status status = cublasGemmStridedBatchedEx( - handle, op_A, op_B, m, n, k, alpha, A, hipR16F, - (op_A == rocblas_operation_none) ? m : k, stride_A, B, hipR16F, - (op_B == rocblas_operation_none) ? k : n, stride_B, beta, C, hipR16F, m, stride_C, - batch, hipR32F, algo); - - if (status != rocblas_status_success) { - fprintf(stderr, - "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n", - m, n, k, (int)status); - return EXIT_FAILURE; - } - - return 0; -} -#endif diff --git a/colossalai/kernel/hip_native/csrc/kernels/dropout_kernels.hip b/colossalai/kernel/hip_native/csrc/kernels/dropout_kernels.hip deleted file mode 100644 index 3b0e9c24a3b68c95cdbfe7e84534b9539d7aac66..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/dropout_kernels.hip +++ /dev/null @@ -1,1045 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -#include -#include - -#include "kernels.h" - -#ifdef COLOSSAL_HIP -#include -#endif - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -hiprandStatePhilox4_32_10_t *curandstate; - -/** - * @brief element-wise activation function on device, like Relu, Gelu - * - * @tparam enum class ActivationType, kRelu, kGelu - * @tparam input type - * @param any shape of float and __half2 - * @return same shape and type with input - */ -template -__forceinline__ __device__ T activation_kernel(T x); - -template <> -__device__ float activation_kernel(float x) { - float cdf = - 0.5f * - (1.0f + tanhf((0.7978845608028654f * (x + 0.044715f * x * x * x)))); - return x * cdf; -} - -template <> -__device__ __half2 -activation_kernel(__half2 val) { - __half2 val_pow3 = __hmul2(val, __hmul2(val, val)); - float2 tmp_pow = __half22float2(val_pow3); - float2 tmp = __half22float2(val); - - tmp.x = - 0.5f * - (1.0f + tanhf((0.7978845608028654f * (tmp.x + 0.044715f * tmp_pow.x)))); - tmp.y = - 0.5f * - (1.0f + tanhf((0.7978845608028654f * (tmp.y + 0.044715f * tmp_pow.y)))); - return __hmul2(val, __float22half2_rn(tmp)); -} - -template <> -__device__ float activation_kernel(float x) { - return fmaxf(x, 0); -} - -template <> -__device__ __half2 -activation_kernel(__half2 x) { -#ifdef COLOSSAL_HIP - float2 tmp = __half22float2(x); - return __floats2half2_rn(fmaxf(0.f, tmp.x), - fmaxf(0.f, tmp.y)); -#else - return __floats2half2_rn(fmaxf(0.f, __half2float(x.x)), - fmaxf(0.f, __half2float(x.y))); -#endif -} - -/** - * @brief element-wise activation backward function on device - * - * @tparam enum class ActivationType - * @tparam input type - * @param any shape of float and __half2 - * @return same shape of input - */ -template -__forceinline__ __device__ T activation_bwd_kernel(T grad, T x); - -template <> -__device__ float activation_bwd_kernel(float grad, - float x) { - const float sqrt_param = 0.79788456080286535587989211986876f; - const float mul_param = 0.044715; - - float x2mul = x * x * mul_param; - float tan_h = tanhf(sqrt_param * (x + x * x2mul)); - float dg1 = 0.5f * (1.0f + tan_h); - float dg2 = x * 0.5f * sqrt_param * (1 - tan_h * tan_h); - float 
dg3 = dg2 * 3 * x2mul; - return grad * (dg1 + dg2 + dg3); -} - -template <> -__device__ __half activation_bwd_kernel( - __half grad, __half x_half) { - float x = __half2float(x_half); - const float sqrt_param = 0.79788456080286535587989211986876f; - const float mul_param = 0.044715; - - float x2mul = x * x * mul_param; - float tan_h = tanhf(sqrt_param * (x + x * x2mul)); - float dg1 = 0.5f * (1.0f + tan_h); - float dg2 = x * 0.5f * sqrt_param * (1 - tan_h * tan_h); - float dg3 = dg2 * 3 * x2mul; - return grad * __float2half(dg1 + dg2 + dg3); -} - -template <> -__device__ float activation_bwd_kernel(float grad, - float x) { - return x > 0.f ? grad : 0.f; -} - -template <> -__device__ __half -activation_bwd_kernel(__half grad, __half x) { - const __half half_zero = __float2half(0.f); - return x > half_zero ? grad : half_zero; -} - -template <> -__device__ __half2 activation_bwd_kernel( - __half2 grad2, __half2 x_half2) { -#ifdef COLOSSAL_HIP - float2 tmp_x = __half22float2(x_half2); - float2 tmp_grad2 = __half22float2(grad2); - - return __floats2half2_rn(tmp_x.x > 0.0 ? tmp_grad2.x : 0.0, - tmp_x.y > 0.0 ? tmp_grad2.y : 0.0); -#else - const __half half_zero = __float2half(0.f); - return __floats2half2_rn(x_half2.x > half_zero ? grad2.x : half_zero, - x_half2.y > half_zero ? grad2.y : half_zero); -#endif -} - -/** - * @brief init hiprand states in global memory - * - * @thread grid_dim * block*dim to suuport any size of states - * @param state persistant hiprand states - * @param seed seed to init states - * @return void - */ -__global__ void curand_init_kernel(hiprandStatePhilox4_32_10_t *state, - int seed) { - /* Each thread gets same seed, a different sequence - number, no offset */ - int id = threadIdx.x + blockIdx.x * blockDim.x; - hiprand_init(seed, id, 0, &state[id]); -} - -void launch_curand_init(int total_count, int dim, hipStream_t stream) { - hipMalloc(&curandstate, total_count * sizeof(hiprandStatePhilox4_32_10_t)); - int grid_dim = total_count >> 9; - hipLaunchKernelGGL(( curand_init_kernel), dim3(grid_dim), dim3(512), 0, stream, - curandstate, std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count()); -} - -/** - * @brief element-wise dropout, store dropped position in mask, it's not - * in-place - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @param total_count total elements - * @param ratio drop ratio - * @param out any size of float and __half - * @param in same with out - * @param mask uint8 type, same size with out - * @param seed seed to hiprand - * @return void - */ -__global__ void ls_dropout_kernel(const int total_count, const float ratio, - float *__restrict__ out, - const float *__restrict__ in, - uint8_t *__restrict__ mask, const int seed) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - hiprandStatePhilox4_32_10_t state; - hiprand_init(seed, i, 0, &state); - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *data4 = reinterpret_cast(in); - uint32_t *mask4 = reinterpret_cast(mask); - float4 rand = hiprand_uniform4(&state); - - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - - uint32_t *m4 = reinterpret_cast(m); - mask4[i] = m4[0]; - - float4 input4 = data4[i]; - float4 res4; - res4.x = input4.x * scale * m[0]; - res4.y = input4.y * scale * m[1]; - res4.z = input4.z * scale * m[2]; - res4.w = 
input4.w * scale * m[3]; - out4[i] = res4; -} - -__global__ void ls_dropout_kernel(const int total_count, const float ratio, - __half *__restrict__ out, - const __half *__restrict__ in, - uint8_t *__restrict__ mask, const int seed) { - const float scale = 1.f / (1.f - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - hiprandStatePhilox4_32_10_t state; - hiprand_init(seed, i, 0, &state); - - const float4 *vals_float4 = reinterpret_cast(in); - float4 *outs_float4 = reinterpret_cast(out); - uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - float4 rand = hiprand_uniform4(&state); - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - rand = hiprand_uniform4(&state); - m[4] = (uint8_t)(rand.x > ratio); - m[5] = (uint8_t)(rand.y > ratio); - m[6] = (uint8_t)(rand.z > ratio); - m[7] = (uint8_t)(rand.w > ratio); - uint64_t *m8 = reinterpret_cast(m); - mask8[i] = *m8; - - float4 val_float4 = vals_float4[i]; - float4 out_float4; - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - __half2 scale_mask_1 = __floats2half2_rn(scale * m[0], scale * m[1]); - __half2 scale_mask_2 = __floats2half2_rn(scale * m[2], scale * m[3]); - __half2 scale_mask_3 = __floats2half2_rn(scale * m[4], scale * m[5]); - __half2 scale_mask_4 = __floats2half2_rn(scale * m[6], scale * m[7]); - out_half2[0] = __hmul2(val_half2[0], scale_mask_1); - out_half2[1] = __hmul2(val_half2[1], scale_mask_2); - out_half2[2] = __hmul2(val_half2[2], scale_mask_3); - out_half2[3] = __hmul2(val_half2[3], scale_mask_4); - outs_float4[i] = out_float4; -} - -/** - * @brief element-wise dropout backward with dropout mask, it's - * not in-place - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @param total_count total elements - * @param ratio drop ratio - * @param in any size of float and __half - * @param mask uint8 type, same size with in - * @return void - */ -__global__ void ls_dropout_bwd_kernel(const int total_count, const float ratio, - float *out, const float *in, - const uint8_t *__restrict__ mask) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *in4 = reinterpret_cast(in); - const uint32_t *mask4 = reinterpret_cast(mask); - - uint32_t *m4 = reinterpret_cast(m); - m4[0] = mask4[i]; - - float4 input4 = in4[i]; - float4 res4; - res4.x = input4.x * scale * static_cast(m[0]); - res4.y = input4.y * scale * static_cast(m[1]); - res4.z = input4.z * scale * static_cast(m[2]); - res4.w = input4.w * scale * static_cast(m[3]); - out4[i] = res4; -} - -__global__ void ls_dropout_bwd_kernel(const int total_count, const float ratio, - __half *out, const __half *in, - const uint8_t *__restrict__ mask) { - const __half scale = 1.f / (1.f - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - float4 *out4 = reinterpret_cast(out); - const float4 *vals_float4 = reinterpret_cast(in); - const uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - uint64_t *m8 = reinterpret_cast(m); - m8[0] = mask8[i]; - - float4 val_float4 = vals_float4[i]; - float4 out_float4; - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - 
__half2 scale_mask_1 = - __halves2half2(scale * __float2half(m[0]), scale * __float2half(m[1])); - __half2 scale_mask_2 = - __halves2half2(scale * __float2half(m[2]), scale * __float2half(m[3])); - __half2 scale_mask_3 = - __halves2half2(scale * __float2half(m[4]), scale * __float2half(m[5])); - __half2 scale_mask_4 = - __halves2half2(scale * __float2half(m[6]), scale * __float2half(m[7])); - out_half2[0] = __hmul2(val_half2[0], scale_mask_1); - out_half2[1] = __hmul2(val_half2[1], scale_mask_2); - out_half2[2] = __hmul2(val_half2[2], scale_mask_3); - out_half2[3] = __hmul2(val_half2[3], scale_mask_4); - out4[i] = out_float4; -} - -template <> -void launch_ls_dropout(float *out, const float *vals, uint8_t *mask, - int total_count, float ratio, hipStream_t stream, - bool backward) { - int grid_dim = total_count >> 12; - if (!backward) { - hipLaunchKernelGGL(( ls_dropout_kernel), dim3(grid_dim + 1), dim3(1024), 0, stream, - total_count, ratio, out, vals, mask, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count()); - } else { - hipLaunchKernelGGL(( ls_dropout_bwd_kernel), dim3(grid_dim + 1), dim3(1024), 0, stream, total_count, ratio, - out, vals, mask); - } -} - -template <> -void launch_ls_dropout<__half>(__half *out, const __half *vals, uint8_t *mask, - int total_count, float ratio, - hipStream_t stream, bool backward) { - int grid_dim = total_count >> 13; - if (!backward) { - hipLaunchKernelGGL(( ls_dropout_kernel), dim3(grid_dim + 1), dim3(1024), 0, stream, - total_count, ratio, out, vals, mask, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count()); - } else { - hipLaunchKernelGGL(( ls_dropout_bwd_kernel), dim3(grid_dim + 1), dim3(1024), 0, stream, total_count, ratio, - out, vals, mask); - } -} - -/** - * @brief fused bias, dropout, and residual at the end of Attention and FFN, - * store dropped position in mask, it's not in-place - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @param total_count total elements - * @param ratio drop ratio - * @param out [batch_size, seq_len, hidden_size], float and __half - * @param in [batch_size, seq_len, hidden_size], float and __half - * @param mask [batch_size, seq_len, hidden_size], uint8 type - * @param bias [hidden_size], ffn bias - * @param residual [batch_size, seq_len, hidden_size], float and __half - * @param seed seed to hiprand - * @param hidden_size hidden size - * @return void - */ -__global__ void ls_dropout_res_bias_kernel( - const int total_count, const float ratio, float *__restrict__ out, - const float *__restrict__ in, uint8_t *__restrict__ mask, - const float *__restrict__ bias, const float *__restrict__ residual, - const int seed, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - hiprandStatePhilox4_32_10_t state; - hiprand_init(seed, i, 0, &state); - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *data4 = reinterpret_cast(in); - const float4 *residual4 = reinterpret_cast(residual); - const float4 *bias4 = reinterpret_cast(bias); - uint32_t *mask4 = reinterpret_cast(mask); - float4 rand = hiprand_uniform4(&state); - - m[0] = static_cast(rand.x > ratio); - m[1] = static_cast(rand.y > ratio); - m[2] = static_cast(rand.z > ratio); - m[3] = static_cast(rand.w > ratio); - - int bias_i = i % (hidden_size >> 2); - uint32_t *m4 = reinterpret_cast(m); - mask4[i] = m4[0]; - const float4 input4 
= data4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - const float4 res4 = residual4[i]; - float4 output4; - - output4.x = (input4.x + b4.x) * scale * m[0] + res4.x; - output4.y = (input4.y + b4.y) * scale * m[1] + res4.y; - output4.z = (input4.z + b4.z) * scale * m[2] + res4.z; - output4.w = (input4.w + b4.w) * scale * m[3] + res4.w; - - out4[i] = output4; -} - -__global__ void ls_dropout_res_bias_kernel( - const int total_count, const float ratio, __half *__restrict__ out, - const __half *__restrict__ in, uint8_t *__restrict__ mask, - const __half *__restrict__ bias, const __half *__restrict__ residual, - const int seed, const int hidden_size) { - const __half scale = 1. / (1. - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - hiprandStatePhilox4_32_10_t state; - hiprand_init(seed, i, 0, &state); - - const float4 *vals_float4 = reinterpret_cast(in); - float4 *outs_float4 = reinterpret_cast(out); - const float4 *residual4 = reinterpret_cast(residual); - const float4 *bias4 = reinterpret_cast(bias); - uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - float4 rand = hiprand_uniform4(&state); - m[0] = static_cast(rand.x > ratio); - m[1] = static_cast(rand.y > ratio); - m[2] = static_cast(rand.z > ratio); - m[3] = static_cast(rand.w > ratio); - rand = hiprand_uniform4(&state); - m[4] = static_cast(rand.x > ratio); - m[5] = static_cast(rand.y > ratio); - m[6] = static_cast(rand.z > ratio); - m[7] = static_cast(rand.w > ratio); - uint64_t *m8 = reinterpret_cast(m); - mask8[i] = m8[0]; - - int bias_i = i % (hidden_size >> 3); - float4 val_float4 = vals_float4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - const float4 res4 = residual4[i]; - float4 out_float4; - - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - const __half2 *b_half2 = reinterpret_cast(&b4); - const __half2 *res_half2 = reinterpret_cast(&res4); - __half2 scale_mask_1 = - __halves2half2(scale * __float2half(m[0]), scale * __float2half(m[1])); - __half2 scale_mask_2 = - __halves2half2(scale * __float2half(m[2]), scale * __float2half(m[3])); - __half2 scale_mask_3 = - __halves2half2(scale * __float2half(m[4]), scale * __float2half(m[5])); - __half2 scale_mask_4 = - __halves2half2(scale * __float2half(m[6]), scale * __float2half(m[7])); - out_half2[0] = - __hfma2(__hadd2(val_half2[0], b_half2[0]), scale_mask_1, res_half2[0]); - out_half2[1] = - __hfma2(__hadd2(val_half2[1], b_half2[1]), scale_mask_2, res_half2[1]); - out_half2[2] = - __hfma2(__hadd2(val_half2[2], b_half2[2]), scale_mask_3, res_half2[2]); - out_half2[3] = - __hfma2(__hadd2(val_half2[3], b_half2[3]), scale_mask_4, res_half2[3]); - outs_float4[i] = out_float4; -} - -template <> -void launch_ls_dropout_res_bias(float *out, const float *vals, - uint8_t *mask, const float *bias, - const float *residual, int total_count, - int dim, float ratio, - hipStream_t stream) { - int grid_dim = total_count >> 12; - hipLaunchKernelGGL(( ls_dropout_res_bias_kernel), dim3(grid_dim + 1), dim3(1024), 0, stream, - total_count, ratio, out, vals, mask, bias, residual, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_res_bias<__half>(__half *out, const __half *vals, - uint8_t *mask, const __half *bias, - const __half *residual, int total_count, - int dim, float ratio, - hipStream_t stream) { - int grid_dim = total_count >> 13; - hipLaunchKernelGGL(( 
ls_dropout_res_bias_kernel), dim3(grid_dim + 1), dim3(1024), 0, stream, - total_count, ratio, out, vals, mask, bias, residual, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -/** - * @brief fused bias and dropout backward at the end of Attention and FFN - * - * @thread - * gridDim.x = hidden_size / 8 - * blockDim.x = 8 - * blockDim.y = 1024 / 8 = 128 - * - * @param row_size batch_size * seq_len - * @param ratio dropout ratio - * @param in_grad [batch_size, seq_len, hidden_size], input grad - * @param bias_grad [hidden_size], bias grad - * @param out_grad [batch_size, seq_len, hidden_size], output grad - * @param mask [batch_size, seq_len, hidden_size], dropout mask - * @param hidden_size - * @return void - */ -__global__ void ls_dropout_bias_bwd_kernel( - const int row_size, const float ratio, float *__restrict__ in_grad, - float *__restrict__ bias_grad, const float *__restrict__ out_grad, - const uint8_t *__restrict__ mask, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - // every block generate 8 bias result - __shared__ float tile[8][129]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int col_idx = flat_2dim(blockIdx.x, threadIdx.x, 8); - int stride = hidden_size * 128; - float local_sum = 0; - - int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); - for (int r = threadIdx.y; r < row_size; r += 128) { - float val = out_grad[idx]; - val *= scale * static_cast(mask[idx]); - local_sum += val; - in_grad[idx] = val; - idx += stride; - } - - tile[threadIdx.x][threadIdx.y] = local_sum; - __syncthreads(); - - float sum = 0; - int tid = threadIdx.y * blockDim.x + threadIdx.x; - int x = tid >> 7; - int y = tid & (127); - if (y < 32) { -#pragma unroll - for (int i = 0; i < 4; i++) { - sum += tile[x][y + i * 32]; - } - } - __syncthreads(); - -#ifdef COLOSSAL_HIP - for (int i = 1; i < 32; i <<= 1) sum += __shfl_down(sum, i); -#else - for (int i = 1; i < 32; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (y == 0) tile[0][x] = sum; - __syncthreads(); - - if (threadIdx.x < 8) { - int pos = flat_2dim(blockIdx.x, threadIdx.x, 8); - bias_grad[pos] = tile[0][threadIdx.x]; - } -} - -__global__ void ls_dropout_bias_bwd_kernel( - const int row_size, const float ratio, __half *__restrict__ in_grad, - __half *__restrict__ bias_grad, const __half *__restrict__ out_grad, - const uint8_t *__restrict__ mask, const int hidden_size) { - const __half2 scale = __float2half2_rn(1.f / (1.f - ratio)); - __shared__ __half2 tile[8][129]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - - __half2 *in_grad2 = reinterpret_cast<__half2 *>(in_grad); - const __half2 *out_grad2 = reinterpret_cast(out_grad); - __half2 *bias_grad2 = reinterpret_cast<__half2 *>(bias_grad); - - int col_idx = flat_2dim(blockIdx.x, threadIdx.x, 8); - int stride = hidden_size * 128; - __half2 local_sum = __float2half2_rn(0.f); - - int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); - for (int r = threadIdx.y; r < row_size; r += 128) { - __half2 val = out_grad2[idx]; - __half2 m2 = __floats2half2_rn(mask[2 * idx], mask[2 * idx + 1]); - val *= scale * m2; - local_sum += val; - in_grad2[idx] = val; - idx += stride; - } - - tile[threadIdx.x][threadIdx.y] = local_sum; - __syncthreads(); - - __half2 sum = __float2half2_rn(0.f); - int tid = threadIdx.y * blockDim.x + threadIdx.x; - int x = tid 
>> 7; - int y = tid & (127); - if (y < 32) { -#pragma unroll - for (int i = 0; i < 4; i++) { - sum += tile[x][y + i * 32]; - } - } - __syncthreads(); - -#ifdef COLOSSAL_HIP - float2 sum_f2 = __half22float2(sum); - for (int i = 1; i < WARP_SIZE; i <<= 1) sum_f2.x += __shfl_down(sum_f2.x, i); - for (int i = 1; i < WARP_SIZE; i <<= 1) sum_f2.y += __shfl_down(sum_f2.y, i); - sum = __float22half2_rn(sum_f2); -#else - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (y == 0) tile[0][x] = sum; - __syncthreads(); - - if (threadIdx.x < 8) { - int pos = flat_2dim(blockIdx.x, threadIdx.x, 8); - bias_grad2[pos] = tile[0][threadIdx.x]; - } -} - -template -void launch_ls_dropout_bias_bwd(T *in_grad, T *bias_grad, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream) { - dim3 grid_dim((dim - 1) / 8 + 1); - dim3 block_dim(8, 128); - hipLaunchKernelGGL(( ls_dropout_bias_bwd_kernel), dim3(grid_dim), dim3(block_dim), 0, stream, - row_size, ratio, in_grad, bias_grad, out_grad, mask, dim); -} - -template <> -void launch_ls_dropout_bias_bwd(__half *in_grad, __half *bias_grad, - const __half *out_grad, const uint8_t *mask, - int row_size, int dim, float ratio, - hipStream_t stream) { - dim >>= 1; - dim3 grid_dim((dim - 1) / 8 + 1); - dim3 block_dim(8, 128); - hipLaunchKernelGGL(( ls_dropout_bias_bwd_kernel), dim3(grid_dim), dim3(block_dim), 0, stream, - row_size, ratio, in_grad, bias_grad, out_grad, mask, dim); -} - -template void launch_ls_dropout_bias_bwd(float *in_grad, float *bias_grad, - const float *out_grad, - const uint8_t *mask, int row_size, - int dim, float ratio, - hipStream_t stream); - -/** - * @brief fused bias, activation, and dropout at the end of first ffn - * - * @thread - * gridDim.x = hidden_size / 8 - * blockDim.x = 8 - * blockDim.y = 1024 / 8 = 128 - * - * @tparam act_type activation function, like kRelu, kGelu - * @param total_count total elements - * @param ratio drop ratio - * @param out [batch_size, seq_len, hidden_size], float and __half - * @param in [batch_size, seq_len, hidden_size], float and __half - * @param mask [batch_size, seq_len, hidden_size], uint8 type - * @param bias [hidden_size], ffn bias - * @param seed seed to hiprand - * @param hidden_size - * @return void - */ -template -__global__ void ls_dropout_act_bias_kernel( - const int total_count, const float ratio, float *__restrict__ out, - const float *__restrict__ in, uint8_t *__restrict__ mask, - const float *__restrict__ bias, const int seed, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 4 >= total_count) return; - - hiprandStatePhilox4_32_10_t state; - hiprand_init(seed, i, 0, &state); - uint8_t m[4]; - - float4 *out4 = reinterpret_cast(out); - const float4 *data4 = reinterpret_cast(in); - const float4 *bias4 = reinterpret_cast(bias); - uint32_t *mask4 = reinterpret_cast(mask); - float4 rand = hiprand_uniform4(&state); - - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - - int bias_i = i % (hidden_size >> 2); - uint32_t *m4 = reinterpret_cast(m); - mask4[i] = m4[0]; - const float4 input4 = data4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - float4 output4; - - output4.x = - activation_kernel(input4.x + b4.x) * scale * m[0]; - output4.y = - activation_kernel(input4.y + b4.y) * scale * m[1]; - output4.z = - activation_kernel(input4.z + b4.z) * scale * m[2]; - 
output4.w = - activation_kernel(input4.w + b4.w) * scale * m[3]; - - out4[i] = output4; -} - -template -__global__ void ls_dropout_act_bias_kernel( - const int total_count, const float ratio, __half *__restrict__ out, - const __half *__restrict__ in, uint8_t *__restrict__ mask, - const __half *__restrict__ bias, const int seed, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - - int i = blockIdx.x * blockDim.x + threadIdx.x; - - if (i * 8 >= total_count) return; - - hiprandStatePhilox4_32_10_t state; - hiprand_init(seed, i, 0, &state); - - const float4 *vals_float4 = reinterpret_cast(in); - float4 *outs_float4 = reinterpret_cast(out); - const float4 *bias4 = reinterpret_cast(bias); - uint64_t *mask8 = reinterpret_cast(mask); - - uint8_t m[8]; - float4 rand = hiprand_uniform4(&state); - m[0] = (uint8_t)(rand.x > ratio); - m[1] = (uint8_t)(rand.y > ratio); - m[2] = (uint8_t)(rand.z > ratio); - m[3] = (uint8_t)(rand.w > ratio); - rand = hiprand_uniform4(&state); - m[4] = (uint8_t)(rand.x > ratio); - m[5] = (uint8_t)(rand.y > ratio); - m[6] = (uint8_t)(rand.z > ratio); - m[7] = (uint8_t)(rand.w > ratio); - uint64_t *m8 = reinterpret_cast(m); - mask8[i] = *m8; - - int bias_i = i % (hidden_size >> 3); - float4 val_float4 = vals_float4[i]; - const float4 b4 = __ldg(&bias4[bias_i]); - float4 out_float4; - - __half2 *val_half2 = reinterpret_cast<__half2 *>(&val_float4); - __half2 *out_half2 = reinterpret_cast<__half2 *>(&out_float4); - const __half2 *b_half2 = reinterpret_cast(&b4); - - __half2 scale_mask_1 = __floats2half2_rn(scale * m[0], scale * m[1]); - __half2 scale_mask_2 = __floats2half2_rn(scale * m[2], scale * m[3]); - __half2 scale_mask_3 = __floats2half2_rn(scale * m[4], scale * m[5]); - __half2 scale_mask_4 = __floats2half2_rn(scale * m[6], scale * m[7]); - out_half2[0] = __hmul2( - activation_kernel(__hadd2(val_half2[0], b_half2[0])), - scale_mask_1); - out_half2[1] = __hmul2( - activation_kernel(__hadd2(val_half2[1], b_half2[1])), - scale_mask_2); - out_half2[2] = __hmul2( - activation_kernel(__hadd2(val_half2[2], b_half2[2])), - scale_mask_3); - out_half2[3] = __hmul2( - activation_kernel(__hadd2(val_half2[3], b_half2[3])), - scale_mask_4); - outs_float4[i] = out_float4; -} - -template <> -void launch_ls_dropout_act_bias( - float *out, const float *vals, uint8_t *mask, const float *bias, - int total_count, int dim, float ratio, hipStream_t stream) { - int grid_dim = total_count >> 10; - hipLaunchKernelGGL(( ls_dropout_act_bias_kernel) - , dim3(grid_dim + 1), dim3(256), 0, stream, - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_act_bias( - __half *out, const __half *vals, uint8_t *mask, const __half *bias, - int total_count, int dim, float ratio, hipStream_t stream) { - int grid_dim = total_count >> 11; - hipLaunchKernelGGL(( ls_dropout_act_bias_kernel) - , dim3(grid_dim + 1), dim3(256), 0, stream, - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_act_bias( - float *out, const float *vals, uint8_t *mask, const float *bias, - int total_count, int dim, float ratio, hipStream_t stream) { - int grid_dim = total_count >> 10; - hipLaunchKernelGGL(( ls_dropout_act_bias_kernel) - , dim3(grid_dim + 1), dim3(256), 0, stream, - total_count, ratio, out, vals, mask, bias, - 
std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -template <> -void launch_ls_dropout_act_bias( - __half *out, const __half *vals, uint8_t *mask, const __half *bias, - int total_count, int dim, float ratio, hipStream_t stream) { - int grid_dim = total_count >> 11; - hipLaunchKernelGGL(( ls_dropout_act_bias_kernel) - , dim3(grid_dim + 1), dim3(256), 0, stream, - total_count, ratio, out, vals, mask, bias, - std::chrono::duration_cast( - std::chrono::system_clock::now().time_since_epoch()) - .count(), - dim); -} - -/** - * @brief fused bias, activation, and dropout backward - * - * @thread - * gridDim.x = total_count / 1024 - * blockDim.x = 1024 - * - * @tparam act_type kRelu - * @param row_size batch_size * seq_len - * @param ratio dropout ratio - * @param in_grad [batch_size, seq_len, hidden_size], input grad - * @param bias_grad [hidden_size], bias grad - * @param out_grad [batch_size, seq_len, hidden_size], output grad - * @param mask [batch_size, seq_len, hidden_size], dropout mask - * @param hidden_size - * @return void - */ -template -__global__ void ls_dropout_act_bias_bwd_kernel( - const int row_size, const float ratio, T *in_grad, - T *__restrict__ bias_grad, const T *__restrict__ input, - const T *__restrict__ bias, const T *out_grad, - const uint8_t *__restrict__ mask, const int hidden_size) { - const float scale = 1.f / (1.f - ratio); - __shared__ float tile[WARP_SIZE][WARP_SIZE + 1]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int col_idx = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - - int stride = hidden_size * WARP_SIZE; - float local_sum = 0; - - int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); - if (col_idx < hidden_size) { - for (int r = threadIdx.y; r < row_size; r += WARP_SIZE) { - float val = out_grad[idx]; - float in = input[idx]; - float b = bias[idx % hidden_size]; - val = activation_bwd_kernel( - val * scale * static_cast(mask[idx]), in + b); - local_sum += val; - in_grad[idx] = val; - idx += stride; - } - } - - tile[threadIdx.x][threadIdx.y] = local_sum; - __syncthreads(); - float sum = tile[threadIdx.y][threadIdx.x]; - __syncthreads(); - -#ifdef COLOSSAL_HIP - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += __shfl_down(sum, i); -#else - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (threadIdx.x == 0) tile[0][threadIdx.y] = sum; - __syncthreads(); - - if (threadIdx.y == 0) { - int pos = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - bias_grad[pos] = tile[0][threadIdx.x]; - } -} - -// @brief fused bias, activation, and dropout backward -// It is deprecated for precision reason. Keep it for future optimization. 
-// -// template -// __global__ void ls_dropout_act_bias_bwd_kernel( -// const int row_size, const float ratio, __half * in_grad, -// __half *__restrict__ bias_grad, const __half *__restrict__ input, const -// __half *__restrict__ bias, const __half * out_grad, const uint8_t -// *__restrict__ mask, const int hidden_size) { -// const __half2 scale = __float2half2_rn(1.f / (1.f - ratio)); -// __shared__ __half2 tile[WARP_SIZE][WARP_SIZE + 1]; - -// cg::thread_block b = cg::this_thread_block(); -// cg::thread_block_tile g = cg::tiled_partition(b); - -// __half2 *in_grad2 = reinterpret_cast<__half2 *>(in_grad); -// __half2 *bias_grad2 = reinterpret_cast<__half2 *>(bias_grad); -// const __half2 *out_grad2 = reinterpret_cast(out_grad); -// const __half2 *input2 = reinterpret_cast(input); -// const __half2 *bias2 = reinterpret_cast(bias); - -// int col_idx = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - -// int stride = hidden_size * WARP_SIZE; -// __half2 local_sum = __float2half2_rn(0.f); - -// int idx = flat_2dim(threadIdx.y, col_idx, hidden_size); -// if (col_idx < hidden_size) { -// for (int r = threadIdx.y; r < row_size; r += WARP_SIZE) { -// __half2 val = out_grad2[idx]; -// __half2 in2 = input2[idx]; -// __half2 b2 = bias2[idx % hidden_size ]; -// __half2 m2 = __floats2half2_rn(mask[2 * idx], mask[2 * idx + 1]); -// val = activation_bwd_kernel(val * scale -// * -// m2, -// in2+b2); -// local_sum += val; -// in_grad2[idx] = val; -// idx += stride; -// } -// } - -// tile[threadIdx.x][threadIdx.y] = local_sum; -// __syncthreads(); -// __half2 sum = tile[threadIdx.y][threadIdx.x]; -// __syncthreads(); - -// for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); - -// if (threadIdx.x == 0) tile[0][threadIdx.y] = sum; -// __syncthreads(); - -// if (threadIdx.y == 0) { -// int pos = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); -// bias_grad2[pos] = tile[0][threadIdx.x]; -// } -// } - -template -void launch_ls_dropout_act_bias_bwd(T *in_grad, T *bias_grad, const T *input, - const T *bias, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream) { - dim3 grid_dim((dim - 1) / WARP_SIZE + 1); - dim3 block_dim(WARP_SIZE, WARP_SIZE); - hipLaunchKernelGGL(( ls_dropout_act_bias_bwd_kernel), dim3(grid_dim), dim3(block_dim), 0, stream, - row_size, ratio, in_grad, bias_grad, input, bias, out_grad, mask, dim); -} - -// template <> -// void launch_ls_dropout_act_bias_bwd( -// __half *in_grad, __half *bias_grad,const __half *input, const __half -// *bias, const __half *out_grad, const uint8_t *mask, int row_size, int -// dim, float ratio, hipStream_t stream) { -// dim >>= 1; -// dim3 grid_dim((dim - 1) / WARP_SIZE + 1); -// dim3 block_dim(WARP_SIZE, WARP_SIZE); -// hipLaunchKernelGGL(( ls_dropout_act_bias_bwd_kernel) -// , dim3(grid_dim), dim3(block_dim), 0, stream, row_size, ratio, in_grad, -// bias_grad, -// input, bias,out_grad, mask, dim); -// } - -template void launch_ls_dropout_act_bias_bwd( - float *in_grad, float *bias_grad, const float *input, const float *bias, - const float *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream); - -template void launch_ls_dropout_act_bias_bwd( - __half *in_grad, __half *bias_grad, const __half *input, const __half *bias, - const __half *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream); - -template void launch_ls_dropout_act_bias_bwd( - float *in_grad, float *bias_grad, const float *input, const float *bias, - const float *out_grad, 
const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream); - -template void launch_ls_dropout_act_bias_bwd( - __half *in_grad, __half *bias_grad, const __half *input, const __half *bias, - const __half *out_grad, const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream); diff --git a/colossalai/kernel/hip_native/csrc/kernels/general_kernels.hip b/colossalai/kernel/hip_native/csrc/kernels/general_kernels.hip deleted file mode 100644 index 248906678497a8741277e839de92249128c01f9e..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/general_kernels.hip +++ /dev/null @@ -1,242 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -#include "kernels.h" - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -/** -@brief: fuse_transpose_bias -Calculate the sum of elements in each column of the matrix. - -@thread -gridDim.x = ceil(cols / WARP_SIZE) -blockDim.x = WARP_SIZE -blockDim.y = WARP_SIZE - -@param -inp: [rows, cols] -out: [cols] -rows: the number of rows in the matrix -cols: the number of cols in the matrix -*/ -template -__global__ void column_sum_reduce(const T *__restrict__ inp, - T *__restrict__ out, int rows, int cols) { - __shared__ float tile[WARP_SIZE][WARP_SIZE]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int idx = flat_2dim(blockIdx.x, threadIdx.x, WARP_SIZE); - int y_stride = cols * WARP_SIZE; - float localSum = 0; - - // Loop across matrix row - // TODO: optimize to log complexity - if (idx < cols) { - int offset = flat_2dim(threadIdx.y, idx, cols); - for (int r = threadIdx.y; r < rows; r += WARP_SIZE) { - localSum += (float)inp[offset]; - offset += y_stride; - } - } - - // The sum of a row in tile is equal to the sum of a col in original matrix - tile[threadIdx.x][threadIdx.y] = localSum; - - __syncthreads(); - - // Sum the shared buffer. - // The change of threadIdx.x is continuous - float sum = tile[threadIdx.y][threadIdx.x]; - - __syncthreads(); - - // Calculate the sum of a row in tile -#ifdef COLOSSAL_HIP - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += __shfl_down(sum, i); -#else - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_down(sum, i); -#endif - - if (threadIdx.x == 0) { - int pos = flat_2dim(blockIdx.x, threadIdx.y, WARP_SIZE); - if (pos < cols) out[pos] = sum; - } -} - -// [r, c] -> [c] -template <> -void launch_fuse_transpose_bias_kernel(const float *inp, float *out, - int rows, int cols, - hipStream_t stream) { - dim3 grid_dim((cols - 1) / WARP_SIZE + 1); - dim3 block_dim(WARP_SIZE, WARP_SIZE); - - hipLaunchKernelGGL(( column_sum_reduce) - , dim3(grid_dim), dim3(block_dim), 0, stream, inp, out, rows, cols); -} - -template <> -void launch_fuse_transpose_bias_kernel<__half>(const __half *inp, __half *out, - int rows, int cols, - hipStream_t stream) { - dim3 grid_dim((cols - 1) / WARP_SIZE + 1); - dim3 block_dim(WARP_SIZE, WARP_SIZE); - - hipLaunchKernelGGL(( column_sum_reduce<__half>) - , dim3(grid_dim), dim3(block_dim), 0, stream, inp, out, rows, cols); -} - -/** -@brief: fused_add2 -Add two matrix inp1 and inp2 to out. 
- -@thread -gridDim.x = batch_size * seq_len -blockDim.x = min(hidden_dim, MAX_THREADS) - -@param -inp1: [batch_size, seq_len, hidden_dim] -inp2: [batch_size, seq_len, hidden_dim] -out: [batch_size, seq_len, hidden_dim] -batch_size: the size of the current batch -seq_len: the sequence length of the current batch -hidden_dim: dim of the hidden tensor -*/ -template -__global__ void fused_add2_kernel(T *out, const T *inp1, const T *inp2, - int hidden_dim); - -template <> -__global__ void fused_add2_kernel(float *out, const float *inp1, - const float *inp2, int hidden_dim) { - int row_id = blockIdx.x; - int offset = flat_2dim(row_id, 0, hidden_dim); - - const float4 *inp1_4 = reinterpret_cast(inp1); - const float4 *inp2_4 = reinterpret_cast(inp2); - float4 *out_4 = reinterpret_cast(out); - float4 vinp1; - float4 vinp2; - float4 val; - - for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) { - vinp1 = inp1_4[offset + i]; - vinp2 = inp2_4[offset + i]; - val.x = vinp1.x + vinp2.x; - val.y = vinp1.y + vinp2.y; - val.z = vinp1.z + vinp2.z; - val.w = vinp1.w + vinp2.w; - out_4[offset + i] = val; - } -} - -template <> -__global__ void fused_add2_kernel<__half>(__half *out, const __half *inp1, - const __half *inp2, int hidden_dim) { - int row_id = blockIdx.x; - int offset = flat_2dim(row_id, 0, hidden_dim); - - const float4 *inp1_4 = reinterpret_cast(inp1); - const float4 *inp2_4 = reinterpret_cast(inp2); - float4 *out_4 = reinterpret_cast(out); - float4 vinp1; - float4 vinp2; - float4 val; - __half2 *h2_inp1 = reinterpret_cast<__half2 *>(&vinp1); - __half2 *h2_inp2 = reinterpret_cast<__half2 *>(&vinp2); - __half2 *h2_val = reinterpret_cast<__half2 *>(&val); - - for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) { - vinp1 = inp1_4[offset + i]; - vinp2 = inp2_4[offset + i]; - h2_val[0] = __hadd2(h2_inp1[0], h2_inp2[0]); - h2_val[1] = __hadd2(h2_inp1[1], h2_inp2[1]); - h2_val[2] = __hadd2(h2_inp1[2], h2_inp2[2]); - h2_val[3] = __hadd2(h2_inp1[3], h2_inp2[3]); - out_4[offset + i] = val; - } -} - -//[b, s, h] -> [b, s, h] -template <> -void launch_fused_add2(float *out, const float *inp1, const float *inp2, - int batch_size, int seq_len, int hidden_dim, - hipStream_t &stream) { - hidden_dim >>= 2; - - dim3 grid_dim(batch_size * seq_len); - dim3 block_dim(min(hidden_dim, MAX_THREADS)); - - hipLaunchKernelGGL(( fused_add2_kernel), dim3(grid_dim), dim3(block_dim), 0, stream, out, inp1, inp2, - hidden_dim); -} - -template <> -void launch_fused_add2<__half>(__half *out, const __half *inp1, - const __half *inp2, int batch_size, int seq_len, - int hidden_dim, hipStream_t &stream) { - hidden_dim >>= 3; - - dim3 grid_dim(batch_size * seq_len); - dim3 block_dim(min(hidden_dim, MAX_THREADS)); - - hipLaunchKernelGGL(( fused_add2_kernel), dim3(grid_dim), dim3(block_dim), 0, stream, out, inp1, inp2, - hidden_dim); -} - -template -__global__ void kernel_concat3_dim1(const T *inp1, const T *inp2, T *output, - int sz0, int sz2, int sz1_1, int sz1_2) { - int nele = sz0 * sz2 * (sz1_1 + sz1_2); - int idx = flat_2dim(blockIdx.x, threadIdx.x, blockDim.x); - if (idx >= nele) { - return; - } - float4 *dst_ptr = (float4 *)output + idx; - int idx2 = idx % sz2; - idx = idx / sz2; - int idx1 = idx % (sz1_1 + sz1_2); - int idx0 = idx / (sz1_1 + sz1_2); - float4 *src_ptr = nullptr; - int sz1 = 0; - if (idx1 < sz1_1) { - sz1 = sz1_1; - src_ptr = (float4 *)inp1; - } else { - idx1 -= sz1_1; - sz1 = sz1_2; - src_ptr = (float4 *)inp2; - } - src_ptr += flat_3dim(idx0, idx1, idx2, sz1, sz2); - dst_ptr[0] = 
src_ptr[0]; -} - -template <> -void launch_concat3_dim1(const float *inp1, const float *inp2, - float *output, int sz0, int sz2, int sz1_1, - int sz1_2, hipStream_t stream) { - sz2 >>= 2; - int nele = sz0 * sz2 * (sz1_1 + sz1_2); - int nblock = (nele + MAX_THREADS - 1) / MAX_THREADS; - hipLaunchKernelGGL(( kernel_concat3_dim1), dim3(nblock), dim3(MAX_THREADS), 0, stream, - inp1, inp2, output, sz0, sz2, sz1_1, sz1_2); -} - -template <> -void launch_concat3_dim1<__half>(const __half *inp1, const __half *inp2, - __half *output, int sz0, int sz2, int sz1_1, - int sz1_2, hipStream_t stream) { - sz2 >>= 3; - int nele = sz0 * sz2 * (sz1_1 + sz1_2); - int nblock = (nele + MAX_THREADS - 1) / MAX_THREADS; - hipLaunchKernelGGL(( kernel_concat3_dim1), dim3(nblock), dim3(MAX_THREADS), 0, stream, - inp1, inp2, output, sz0, sz2, sz1_1, sz1_2); -} diff --git a/colossalai/kernel/hip_native/csrc/kernels/hip_util.hip b/colossalai/kernel/hip_native/csrc/kernels/hip_util.hip deleted file mode 100644 index bcc261ebaced18d61a774dbdab64bb47f79d99e4..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/hip_util.hip +++ /dev/null @@ -1,175 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include -#include - -#ifdef COLOSSAL_HIP -#include "hip_util.h" -#else -#include "cuda_util.h" -#endif - -/* GPU function guard */ -std::string _cudaGetErrorString(hipError_t error) { - return hipGetErrorString(error); -} - -std::string _cudaGetErrorString(rocblas_status error) { - switch (error) { - case rocblas_status_success: - return "rocblas_status_success"; - - case rocblas_status_invalid_handle: - return "rocblas_status_invalid_handle"; - - case rocblas_status_memory_error: - return "rocblas_status_memory_error"; - - case rocblas_status_invalid_pointer: - return "rocblas_status_invalid_pointer"; - - case rocblas_status_not_implemented: - return "rocblas_status_not_implemented"; -#ifndef COLOSSAL_HIP - case rocblas_status_internal_error: - return "rocblas_status_internal_error"; - - case rocblas_status_internal_error: - return "rocblas_status_internal_error"; - - case rocblas_status_internal_error: - return "rocblas_status_internal_error"; - - case rocblas_status_not_implemented: - return "rocblas_status_not_implemented"; - - case CUBLAS_STATUS_LICENSE_ERROR: - return "CUBLAS_STATUS_LICENSE_ERROR"; -#endif - } - return "CUBLAS_UNKNOW"; -} - -template -void check_gpu_error(T result, char const *const func, const char *const file, - int const line) { - if (result) { - throw std::runtime_error(std::string("[CUDA][ERROR] ") + +file + "(" + - std::to_string(line) + - "): " + (_cudaGetErrorString(result)) + "\n"); - } -} - -template void check_gpu_error(hipError_t result, - char const *const func, - const char *const file, - int const line); -template void check_gpu_error(rocblas_status result, - char const *const func, - const char *const file, - int const line); - -template -void print_vec(const T *outv, std::string outn, int num_output_ele) { - std::cout << outn << ": "; - std::vector hout(num_output_ele, (T)0); - hipMemcpy(hout.data(), outv, num_output_ele * sizeof(T), - hipMemcpyDeviceToHost); - for (int i = 0; i < num_output_ele; i++) { - std::cout << hout[i] << ", "; - } - std::cout << std::endl; -} - -template <> -void print_vec<__half>(const __half *outv, std::string outn, - int num_output_ele) { - std::cout << outn << ": "; - std::vector<__half> hout(num_output_ele, (__half)0.f); - hipMemcpy(hout.data(), outv, num_output_ele * sizeof(__half), - 
hipMemcpyDeviceToHost);
-  for (int i = 0; i < num_output_ele; i++) {
-    std::cout << __half2float(hout[i]) << ", ";
-  }
-  std::cout << std::endl;
-}
-
-template void print_vec<float>(const float *outv, std::string outn,
-                               int num_output_ele);
-
-template void print_vec<int>(const int *outv, std::string outn,
-                             int num_output_ele);
-
-template void print_vec<__half>(const __half *outv, std::string outn,
-                                int num_output_ele);
-
-template <typename T>
-T *cuda_malloc(size_t ele_num) {
-  size_t byte_size = ele_num * sizeof(T);
-  T *pdata = nullptr;
-  CHECK_GPU_ERROR(hipMalloc((void **)&pdata, byte_size));
-  return pdata;
-}
-
-template float *cuda_malloc<float>(size_t ele_num);
-
-template __half *cuda_malloc<__half>(size_t ele_num);
-
-template uint8_t *cuda_malloc<uint8_t>(size_t ele_num);
-
-void cuda_free(void *pdata) {
-  if (pdata != nullptr) {
-    hipFree(pdata);
-  }
-}
-
-template <typename T>
-struct _isnan {
-  __device__ bool operator()(T a) const { return isnan(a); }
-};
-
-template <>
-struct _isnan<__half> {
-  __device__ bool operator()(const __half a) const { return __hisnan(a); }
-};
-
-template <typename T>
-struct _isinf {
-  __device__ bool operator()(T a) const { return isinf(a); }
-};
-
-template <>
-struct _isinf<__half> {
-  __device__ bool operator()(const __half a) const { return __hisinf(a); }
-};
-
-template <typename T>
-void check_nan_inf(const T *data_ptr, int dsize, bool check_nan_inf,
-                   std::string file, int line, hipStream_t stream) {
-  // check_nan_inf = true checks for nan, false checks for inf
-  bool res = false;
-  std::string msg = file + "(" + std::to_string(line) + "): ";
-  if (check_nan_inf) {
-    msg += "nan.";
-    res = thrust::transform_reduce(thrust::hip::par.on(stream), data_ptr,
-                                   data_ptr + dsize, _isnan<T>(), false,
-                                   thrust::logical_or<bool>());
-  } else {
-    msg += "inf.";
-    res = thrust::transform_reduce(thrust::hip::par.on(stream), data_ptr,
-                                   data_ptr + dsize, _isinf<T>(), false,
-                                   thrust::logical_or<bool>());
-  }
-  if (res) {
-    throw std::runtime_error(msg);
-  }
-  std::cout << msg << " [check pass]." << std::endl;
-}
-
-template void check_nan_inf<float>(const float *data_ptr, int dsize,
-                                   bool check_nan_inf, std::string file,
-                                   int line, hipStream_t stream);
-
-template void check_nan_inf<__half>(const __half *data_ptr, int dsize,
-                                    bool check_nan_inf, std::string file,
-                                    int line, hipStream_t stream);
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/block_reduce.h b/colossalai/kernel/hip_native/csrc/kernels/include/block_reduce.h
deleted file mode 100644
index 2d750810c96ffcebff6b8c7a2733114c7fbe7546..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/block_reduce.h
+++ /dev/null
@@ -1,392 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
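A hedged sketch (not part of the original sources) of how the hip_util helpers above are typically combined when debugging a kernel: allocate through `cuda_malloc`, guard every runtime call with `CHECK_GPU_ERROR`, and bracket suspect tensors with `CHECK_NAN_INF`. Names and sizes here are illustrative.

```cpp
// Illustrative host-side usage of the hip_util helpers; assumes a HIP device
// and that this translation unit links against hip_util.hip.
#include <hip/hip_runtime.h>

#include "hip_util.h"

int main() {
  hipStream_t stream;
  CHECK_GPU_ERROR(hipStreamCreate(&stream));

  const int n = 8;
  float *buf = cuda_malloc<float>(n);  // throws on allocation failure
  CHECK_GPU_ERROR(hipMemsetAsync(buf, 0, n * sizeof(float), stream));

  // Runs check_nan_inf twice (nan pass, then inf pass); each pass either
  // throws std::runtime_error or prints "<file>(<line>): ... [check pass]."
  CHECK_NAN_INF(buf, n, stream);

  print_vec(buf, "buf", n);  // copies device data back and prints it

  cuda_free(buf);
  CHECK_GPU_ERROR(hipStreamDestroy(stream));
  return 0;
}
```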
-/* Copyright 2021 The LightSeq Team - Copyright Tencent/TurboTransformers - This block_reduce_n is adapted from Tencent/TurboTransformers -*/ -#pragma once -#include -#include -#include - -enum class ReduceType { kMax = 0, kSum }; -const unsigned int WARP_REDUCE_MASK = 0xffffffff; -const float REDUCE_FLOAT_INF_NEG = -100000000.f; -const float REDUCE_FLOAT_INF_POS = 100000000.f; - -#ifdef COLOSSAL_HIP -const unsigned int WARP_REDUCE_SIZE = 64; -#else -const unsigned int WARP_REDUCE_SIZE = 32; -#endif - -template -__forceinline__ __device__ T warpReduceSum(T val) { - for (int mask = (WARP_REDUCE_SIZE >> 1); mask > 0; mask >>= 1) -#ifdef COLOSSAL_HIP - val += __shfl_xor_sync(val, mask, WARP_REDUCE_SIZE); -#else - val += __shfl_xor_sync(WARP_REDUCE_MASK, val, mask, WARP_REDUCE_SIZE); -#endif - return val; -} - -/* Calculate the sum of all elements in a block */ -template -__forceinline__ __device__ T blockReduceSum(T val) { - static __shared__ T shared[32]; - int lane = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - val = warpReduceSum(val); - - if (lane == 0) shared[wid] = val; - __syncthreads(); - - val = (threadIdx.x < (blockDim.x >> 5)) ? shared[lane] : (T)0.0f; - val = warpReduceSum(val); - return val; -} - -template -__inline__ __device__ void blockReduce(float *pval); - -// use template to make code more concise -template -__inline__ __device__ void warpReduce(float *pval); - -// static -template <> -__inline__ __device__ void warpReduce(float *pval) { -#ifdef COLOSSAL_HIP - *pval = max(*pval, __shfl_xor(*pval, 32, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 16, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 8, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 4, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 2, WARP_REDUCE_SIZE)); - *pval = max(*pval, __shfl_xor(*pval, 1, WARP_REDUCE_SIZE)); -#else - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 16, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 8, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 4, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 2, 32)); - *pval = max(*pval, __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 1, 32)); -#endif -} - -template <> -__inline__ __device__ void warpReduce(float *pval) { - float val0_tmp, val1_tmp; -#ifdef COLOSSAL_HIP -#define WarpReduceMaxOneStep(a, b) \ - val0_tmp = __shfl_xor(*(pval), a, b); \ - val1_tmp = __shfl_xor(*(pval + 1), a, b); \ - *(pval) = max(val0_tmp, *(pval)); \ - *(pval + 1) = max(val1_tmp, *(pval + 1)); - - WarpReduceMaxOneStep(32, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(16, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(8, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(4, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(2, WARP_REDUCE_SIZE); - WarpReduceMaxOneStep(1, WARP_REDUCE_SIZE); -#else -#define WarpReduceMaxOneStep(a, b) \ - val0_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval), a, b); \ - val1_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 1), a, b); \ - *(pval) = max(val0_tmp, *(pval)); \ - *(pval + 1) = max(val1_tmp, *(pval + 1)); - - WarpReduceMaxOneStep(16, 32); - WarpReduceMaxOneStep(8, 32); - WarpReduceMaxOneStep(4, 32); - WarpReduceMaxOneStep(2, 32); - WarpReduceMaxOneStep(1, 32); -#endif - -#undef WarpReduceMaxOneStep -} - -template <> -__inline__ __device__ void warpReduce(float *pval) { -#ifdef COLOSSAL_HIP - *pval += __shfl_xor(*pval, 32, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 16, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 8, 
WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 4, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 2, WARP_REDUCE_SIZE); - *pval += __shfl_xor(*pval, 1, WARP_REDUCE_SIZE); -#else - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 16, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 8, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 4, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 2, 32); - *pval += __shfl_xor_sync(WARP_REDUCE_MASK, *pval, 1, 32); -#endif -} - -/* - * Unorll for loop for warpreduce to - * imporve instruction issue efficiency - * ElemX means there are X numbers to be summed - */ - -template <> -__inline__ __device__ void warpReduce(float *pval) { - float val0_tmp, val1_tmp; - -#ifdef COLOSSAL_HIP -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor(*(pval + 0), a, b); \ - val1_tmp = __shfl_xor(*(pval + 1), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp - - WarpReduceSumOneStep(32, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(16, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(8, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(4, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(2, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(1, WARP_REDUCE_SIZE); -#else -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 0), a, b); \ - val1_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 1), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp - - WarpReduceSumOneStep(16, 32); - WarpReduceSumOneStep(8, 32); - WarpReduceSumOneStep(4, 32); - WarpReduceSumOneStep(2, 32); - WarpReduceSumOneStep(1, 32); -#endif - -#undef WarpReduceSumOneStep -} - -template <> -__inline__ __device__ void warpReduce(float *pval) { - float val0_tmp, val1_tmp, val2_tmp, val3_tmp; - -#ifdef COLOSSAL_HIP -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor(*(pval + 0), a, b); \ - val1_tmp = __shfl_xor(*(pval + 1), a, b); \ - val2_tmp = __shfl_xor(*(pval + 2), a, b); \ - val3_tmp = __shfl_xor(*(pval + 3), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp; \ - *(pval + 2) += val2_tmp; \ - *(pval + 3) += val3_tmp - - WarpReduceSumOneStep(32, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(16, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(8, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(4, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(2, WARP_REDUCE_SIZE); - WarpReduceSumOneStep(1, WARP_REDUCE_SIZE); -#else -#define WarpReduceSumOneStep(a, b) \ - val0_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 0), a, b); \ - val1_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 1), a, b); \ - val2_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 2), a, b); \ - val3_tmp = __shfl_xor_sync(WARP_REDUCE_MASK, *(pval + 3), a, b); \ - *(pval + 0) += val0_tmp; \ - *(pval + 1) += val1_tmp; \ - *(pval + 2) += val2_tmp; \ - *(pval + 3) += val3_tmp - - WarpReduceSumOneStep(16, 32); - WarpReduceSumOneStep(8, 32); - WarpReduceSumOneStep(4, 32); - WarpReduceSumOneStep(2, 32); - WarpReduceSumOneStep(1, 32); -#endif -#undef WarpReduceSumOneStep -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int 
i = 0; i < num; ++i) { - *(pval + i) = 0.f; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 2; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = 0.f; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 4; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = 0.f; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = REDUCE_FLOAT_INF_NEG; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = REDUCE_FLOAT_INF_NEG; - } - } - warpReduce(pval); -} - -template <> -__inline__ __device__ void blockReduce(float *pval) { - const int num = 1; - static __shared__ float shared[num][32]; - int lane_id = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - - warpReduce(pval); - - if (lane_id == 0) { -#pragma unroll - for (int i = 0; i < num; ++i) { - shared[i][wid] = *(pval + i); - } - } - __syncthreads(); - - if (threadIdx.x < (blockDim.x >> 5)) { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = shared[i][lane_id]; - } - } else { -#pragma unroll - for (int i = 0; i < num; ++i) { - *(pval + i) = REDUCE_FLOAT_INF_NEG; - } - } - warpReduce(pval); -} diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/context.h b/colossalai/kernel/hip_native/csrc/kernels/include/context.h deleted file mode 100644 index 1a228e3cac3308bade968ac17b2ee6980eb1060d..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/include/context.h +++ /dev/null @@ -1,37 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
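To make the reduction helpers in block_reduce.h above concrete, here is a minimal sketch of a kernel built on `blockReduceSum`. The kernel name and launch shape are assumptions, and the lane arithmetic in the header targets the 32-lane (`WARP_REDUCE_SIZE == 32`) path.

```cpp
// Minimal sketch, not from the original sources: one block sums `n` floats
// using blockReduceSum from block_reduce.h. Assumes blockDim.x <= 1024.
#include <hip/hip_runtime.h>

#include "block_reduce.h"

__global__ void block_sum_kernel(const float *in, float *out, int n) {
  float val = 0.f;
  // Each thread accumulates a strided slice of the input.
  for (int i = threadIdx.x; i < n; i += blockDim.x) val += in[i];
  // Warp shuffles plus one shared-memory round combine the partial sums.
  val = blockReduceSum<float>(val);
  if (threadIdx.x == 0) *out = val;  // thread 0 holds the block total
}

// Launch (illustrative):
//   hipLaunchKernelGGL(block_sum_kernel, dim3(1), dim3(256), 0, stream,
//                      d_in, d_out, n);
```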
-#pragma once
-
-#include <hip/hip_runtime.h>
-#include <rocblas.h>
-
-#include <iostream>
-#include <string>
-
-#include "../../../../hip_native/csrc/kernels/include/hip_util.h"
-
-class Context {
- public:
-  Context() : _stream(nullptr) {
-    CHECK_GPU_ERROR(rocblas_create_handle(&_cublasHandle));
-  }
-
-  virtual ~Context() {}
-
-  static Context &Instance() {
-    static Context _ctx;
-    return _ctx;
-  }
-
-  void set_stream(hipStream_t stream) {
-    _stream = stream;
-    CHECK_GPU_ERROR(rocblas_set_stream(_cublasHandle, _stream));
-  }
-
-  hipStream_t get_stream() { return _stream; }
-
-  rocblas_handle get_cublashandle() { return _cublasHandle; }
-
- private:
-  hipStream_t _stream;
-  rocblas_handle _cublasHandle;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/cross_entropy_layer.h b/colossalai/kernel/hip_native/csrc/kernels/include/cross_entropy_layer.h
deleted file mode 100644
index 0e6d97093ec30a4ff8efdbf3eaa48fed550f7f5a..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/cross_entropy_layer.h
+++ /dev/null
@@ -1,47 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
-#pragma once
-
-#include <hip/hip_runtime.h>
-#include <hip/hip_fp16.h>
-#include <stdio.h>
-
-#include <fstream>
-
-#include "../../../../hip_native/csrc/kernels/include/hip_util.h"
-
-template <typename T>
-class CrossEntropyLayer {
- public:
-  CrossEntropyLayer(float epsilon, int padding_idx, int max_batch_tokens);
-
-  virtual ~CrossEntropyLayer();
-
-  void Forward(const T *inputs_ptr, const int *targets_ptr, float *outputs_ptr,
-               float *nll_loss_ptr);
-
-  void Backward(const float *grad_outputs_ptr, const T *inputs_ptr,
-                const int *targets_ptr, T *grad_inputs_ptr);
-
-  void set_cur_batch_shape(int batch_size, int seq_len, int vocab_size);
-
- private:
-  void allocate_mem_buffer() {
-    // allocate local gpu memory
-    _loss_buffer = cuda_malloc<float>(_max_batch_tokens * 2);
-  }
-
-  void free_mem_buffer() {
-    // free local gpu memory
-    cuda_free(_loss_buffer);
-  }
-
-  const int _padding_idx;
-  const float _epsilon;
-  const int _max_batch_tokens;
-
-  size_t _batch_size;
-  size_t _seq_len;
-  size_t _vocab_size;
-
-  float *_loss_buffer;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h b/colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h
deleted file mode 100644
index 37a82b1ee7bacf3ade1a675ad1fe8102bbdd6080..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h
+++ /dev/null
@@ -1,72 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
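A short usage sketch for the `Context` singleton above, assuming the caller owns the stream; the binding step is the important part, since it re-targets the shared rocBLAS handle.

```cpp
// Illustrative only: bind a stream once, then reuse the shared handle.
#include <hip/hip_runtime.h>

#include "context.h"

void bind_stream(hipStream_t stream) {
  // Also calls rocblas_set_stream on the singleton's handle, so subsequent
  // GEMMs and kernels issued through the context share this stream.
  Context::Instance().set_stream(stream);

  rocblas_handle handle = Context::Instance().get_cublashandle();
  (void)handle;  // pass to cublas_gemm_ex / cublas_strided_batched_gemm
}
```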
-/* Copyright 2021 The LightSeq Team - Copyright Microsoft DeepSpeed - This file is adapted from Microsoft DeepSpeed -*/ -#pragma once - -#include -#include -#include -#include -#include -#ifndef COLOSSAL_HIP -#include -#endif -#include - -#ifdef COLOSSAL_HIP -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); - -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); - -int cublas_strided_batched_gemm(rocblas_handle handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - rocblas_operation op_A, rocblas_operation op_B, - int stride_A, int stride_B, int stride_C, - int batch, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); - -int cublas_strided_batched_gemm( - rocblas_handle handle, int m, int n, int k, const float *alpha, - const float *beta, const __half *A, const __half *B, __half *C, - rocblas_operation op_A, rocblas_operation op_B, int stride_A, int stride_B, - int stride_C, int batch, - rocblas_gemm_algo algo = rocblas_gemm_algo_standard); -#else -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const float *A, - const float *B, float *C, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT); - -int cublas_gemm_ex(rocblas_handle handle, rocblas_operation transa, - rocblas_operation transb, int m, int n, int k, - const float *alpha, const float *beta, const __half *A, - const __half *B, __half *C, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP); - -int cublas_strided_batched_gemm(rocblas_handle handle, int m, int n, int k, - const float *alpha, const float *beta, - const float *A, const float *B, float *C, - rocblas_operation op_A, rocblas_operation op_B, - int stride_A, int stride_B, int stride_C, - int batch, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT); - -int cublas_strided_batched_gemm( - rocblas_handle handle, int m, int n, int k, const float *alpha, - const float *beta, const __half *A, const __half *B, __half *C, - rocblas_operation op_A, rocblas_operation op_B, int stride_A, int stride_B, - int stride_C, int batch, - cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP); -#endif diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/dropout.h b/colossalai/kernel/hip_native/csrc/kernels/include/dropout.h deleted file mode 100644 index 8269f9ed6528bc1dfa31b4ac48b8848a92b17d91..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/include/dropout.h +++ /dev/null @@ -1,96 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#pragma once - -#include -#include -#include -#include - -#include "../../../../hip_native/csrc/kernels/include/kernels.h" - -template -class Dropout { - public: - struct Config { - float ratio; - bool training; - - Config(float r) : ratio(r), training(true) {} - float RATIO() const { return training ? 
ratio : 0.0; }
-  };
-
-  Dropout(const Config &config, size_t max_ele_num)
-      : _config(config), _mask(nullptr) {
-    _mask = cuda_malloc<uint8_t>(max_ele_num);
-  }
-
-  virtual ~Dropout() { cuda_free(_mask); }
-
-  // after attention softmax
-  void dropout(T *output, const T *input, int count, hipStream_t stream,
-               bool bwd = false) {
-    launch_ls_dropout<T>(output, input, _mask, count, _config.RATIO(), stream,
-                         bwd);
-  }
-
-  void d_dropout(T *d_inp_out, int count, hipStream_t stream) {
-    launch_ls_dropout<T>(d_inp_out, d_inp_out, _mask, count, _config.RATIO(),
-                         stream, true);
-  }
-
-  // transformer layer's postprocessing dropout, after attn or ffn module,
-  // before residual add.
-  void bias_dropout_residual(T *output, const T *input, const T *residual,
-                             const T *bias, int rows, int cols,
-                             hipStream_t stream) {
-    launch_ls_dropout_res_bias<T>(output, input, _mask, bias, residual,
-                                  rows * cols, cols, _config.RATIO(), stream);
-  }
-
-  void d_bias_dropout_residual(T *d_input, T *d_bias, const T *d_output,
-                               int rows, int cols, hipStream_t stream) {
-    launch_ls_dropout_bias_bwd<T>(d_input, d_bias, d_output, _mask, rows, cols,
-                                  _config.RATIO(), stream);
-  }
-
-  // dropout inside ffn.
-  void bias_act_dropout(T *output, const T *input, const T *bias, int rows,
-                        int cols, std::string activation_fn,
-                        hipStream_t stream) {
-    if (activation_fn == "relu") {
-      launch_ls_dropout_act_bias<ActivationType::kRelu, T>(
-          output, input, _mask, bias, rows * cols, cols, _config.RATIO(),
-          stream);
-    } else if (activation_fn == "gelu") {
-      launch_ls_dropout_act_bias<ActivationType::kGelu, T>(
-          output, input, _mask, bias, rows * cols, cols, _config.RATIO(),
-          stream);
-    } else {
-      throw std::runtime_error("not supported activation: " + activation_fn);
-    }
-  }
-
-  void d_bias_act_dropout(T *d_inp_out, T *d_bias_out, const T *input,
-                          const T *bias, int rows, int cols,
-                          std::string activation_fn, hipStream_t stream) {
-    if (activation_fn == "relu") {
-      launch_ls_dropout_act_bias_bwd<ActivationType::kRelu, T>(
-          d_inp_out, d_bias_out, input, bias, d_inp_out, _mask, rows, cols,
-          _config.RATIO(), stream);
-    } else if (activation_fn == "gelu") {
-      launch_ls_dropout_act_bias_bwd<ActivationType::kGelu, T>(
-          d_inp_out, d_bias_out, input, bias, d_inp_out, _mask, rows, cols,
-          _config.RATIO(), stream);
-    } else {
-      throw std::runtime_error("not supported activation: " + activation_fn);
-    }
-  }
-
-  bool HasDropout() const { return _config.RATIO() > 0.0; }
-
-  void SetTrainingMode(bool training) { _config.training = training; }
-
- private:
-  uint8_t *_mask;
-  Config _config;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/feed_forward.h b/colossalai/kernel/hip_native/csrc/kernels/include/feed_forward.h
deleted file mode 100644
index a66ebe22c16eed2ba3e14933cd55196985b5db94..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/feed_forward.h
+++ /dev/null
@@ -1,85 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
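To see how a layer would drive the `Dropout` wrapper above, here is a hedged sketch of the post-attention fusion; the function name, the 0.1 ratio, and the buffer lifetime are assumptions, with `rows = batch_size * seq_len` and `cols = hidden_size`.

```cpp
// Illustrative post-attention use of Dropout<T>; not from the original layer
// code. rows = batch_size * seq_len, cols = hidden_size.
#include <hip/hip_runtime.h>

#include "dropout.h"

void attn_postprocess(float *out, const float *attn_out, const float *residual,
                      const float *bias, int rows, int cols,
                      hipStream_t stream) {
  // In the real layers the wrapper (and its mask buffer) lives as long as the
  // module; a static here stands in for that.
  static Dropout<float> dropout(Dropout<float>::Config(0.1f),
                                /*max_ele_num=*/size_t(rows) * cols);
  dropout.SetTrainingMode(true);
  // One fused kernel: out = (attn_out + bias) * mask * 1/(1-ratio) + residual
  dropout.bias_dropout_residual(out, attn_out, residual, bias, rows, cols,
                                stream);
}
```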
-#pragma once
-
-/* Copyright 2021 The LightSeq Team
-   Copyright Microsoft DeepSpeed
-   This file is adapted from Microsoft DeepSpeed
-*/
-#include <hip/hip_runtime.h>
-#include <hip/hip_fp16.h>
-#include <stdio.h>
-
-#include <array>
-
-#include "../../../../hip_native/csrc/kernels/include/cublas_wrappers.h"
-#include "../../../../hip_native/csrc/kernels/include/kernels.h"
-
-template <typename T>
-class FeedForward {
- public:
-  struct Config {
-    int outputSize;
-    int inputSize;
-    std::array<int, 3> gemm_algos;
-    Config(int outputs, int inputs)
-        : outputSize(outputs),
-          inputSize(inputs),
-          gemm_algos(std::array<int, 3>({99, 99, 99})) {}
-  };
-
-  FeedForward(Config config) : config_(config) {}
-
-  ~FeedForward() {}
-
-  void Forward(int bsz, const T *input_ptr, const T *weights, T *out,
-               rocblas_handle &_cublasHandle) {
-    float alpha = T(1.);
-    float beta = T(0.);
-
-#ifdef COLOSSAL_HIP
-    cublas_gemm_ex(_cublasHandle, rocblas_operation_transpose,
-                   rocblas_operation_none, config_.outputSize, bsz,
-                   config_.inputSize, &alpha, &beta, weights, input_ptr, out,
-                   rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
-    cublas_gemm_ex(_cublasHandle, rocblas_operation_transpose,
-                   rocblas_operation_none, config_.outputSize, bsz,
-                   config_.inputSize, &alpha, &beta, weights, input_ptr, out,
-                   cublasGemmAlgo_t(config_.gemm_algos[0]));
-#endif
-  }
-  void Backward(int bsz, const T *out_grad, const T *input_ptr,
-                const T *weights, T *weights_grad, T *bias_grad,
-                rocblas_handle &_cublasHandle, hipStream_t &stream,
-                T *inp_grad_out = nullptr, T *out_grad_trans_out = nullptr,
-                bool compute_bias = true) {
-    float alpha = (T)1.0, beta = (T)0.0;
-#ifdef COLOSSAL_HIP
-    cublas_gemm_ex(_cublasHandle, rocblas_operation_none,
-                   rocblas_operation_transpose, config_.inputSize,
-                   config_.outputSize, bsz, &alpha, &beta, input_ptr, out_grad,
-                   weights_grad, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-
-    cublas_gemm_ex(_cublasHandle, rocblas_operation_none,
-                   rocblas_operation_none, config_.inputSize, bsz,
-                   config_.outputSize, &alpha, &beta, weights, out_grad,
-                   inp_grad_out, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
-    cublas_gemm_ex(_cublasHandle, rocblas_operation_none,
-                   rocblas_operation_transpose, config_.inputSize,
-                   config_.outputSize, bsz, &alpha, &beta, input_ptr, out_grad,
-                   weights_grad, cublasGemmAlgo_t(config_.gemm_algos[1]));
-
-    cublas_gemm_ex(_cublasHandle, rocblas_operation_none,
-                   rocblas_operation_none, config_.inputSize, bsz,
-                   config_.outputSize, &alpha, &beta, weights, out_grad,
-                   inp_grad_out, cublasGemmAlgo_t(config_.gemm_algos[2]));
-#endif
-    if (compute_bias) {
-      launch_fuse_transpose_bias_kernel<T>(out_grad, bias_grad, bsz,
-                                           config_.outputSize, stream);
-    }
-  }
-
-  void reset_size(int outputSize, int inputSize) {
-    config_.outputSize = outputSize;
-    config_.inputSize = inputSize;
-  }
-
- private:
-  Config config_;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h b/colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h
deleted file mode 100644
index 7bdc6e3818f9f27cb4d614fef6eabc9630a421cd..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h
+++ /dev/null
@@ -1,38 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
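The `Forward` above issues a single transposed GEMM: under rocBLAS's column-major convention, `out(outputSize x bsz) = W^T(outputSize x inputSize) * input(inputSize x bsz)`, i.e. weights are stored `[outputSize, inputSize]` row-major. Below is a hedged sketch of a one-shot call; the handle setup and function name are illustrative.

```cpp
// Illustrative single linear layer through FeedForward<T>; all pointers are
// device buffers, weights laid out [out_dim, in_dim] row-major.
#include <hip/hip_runtime.h>

#include "feed_forward.h"
#include "hip_util.h"

void linear_forward(const float *input, const float *weights, float *output,
                    int bsz, int in_dim, int out_dim) {
  FeedForward<float> ff(FeedForward<float>::Config(out_dim, in_dim));

  rocblas_handle handle;
  CHECK_GPU_ERROR(rocblas_create_handle(&handle));

  // output[bsz, out_dim] = input[bsz, in_dim] x weights[out_dim, in_dim]^T
  ff.Forward(bsz, input, weights, output, handle);

  CHECK_GPU_ERROR(rocblas_destroy_handle(&handle));
}
```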
-#pragma once - -#include -#include - -#ifndef COLOSSAL_HIP -#include -#endif - -#include -#include -#include -#include -#include -#include - -template -void check_gpu_error(T result, char const *const func, const char *const file, - int const line); - -#define CHECK_GPU_ERROR(val) check_gpu_error((val), #val, __FILE__, __LINE__) - -template -void print_vec(const T *outv, std::string outn, int num_output_ele); - -template -T *cuda_malloc(size_t ele_num); - -void cuda_free(void *pdata); - -template -void check_nan_inf(const T *data_ptr, int dsize, bool check_nan_inf, - std::string file, int line, hipStream_t stream); - -#define CHECK_NAN_INF(ptr, size, stream) \ - check_nan_inf((ptr), (size), true, __FILE__, __LINE__, (stream)); \ - check_nan_inf((ptr), (size), false, __FILE__, __LINE__, (stream)) diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/kernels.h b/colossalai/kernel/hip_native/csrc/kernels/include/kernels.h deleted file mode 100644 index 0eb18e18370ac0f6f1e20138a275ea52a3e25e8b..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/include/kernels.h +++ /dev/null @@ -1,283 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#pragma once - -#include -#include -#ifdef COLOSSAL_HIP -#include -#else -#include -#endif -#include -#include -#include - -#define MAX_THREADS 1024 -// HC -#ifdef COLOSSAL_HIP - #define WARP_SIZE 64 -#else - #define WARP_SIZE 32 -#endif -enum class ActivationType { kRelu, kGelu }; - -void launch_curand_init(int total_count, int dim, hipStream_t stream); - -template -void launch_layer_norm(T *ln_res, T *vars, T *means, const T *inp, - const T *scale, const T *bias, int batch_size, - int hidden_dim, hipStream_t stream); - -template -void launch_ln_bw(T *gamma_grad, T *betta_grad, T *inp_grad, const T *out_grad, - const T *residual_grad, const T *inp_or_out, const T *gamma, - const T *betta, const T *vars, const T *means, int batch, - int hidden_dim, hipStream_t stream[2]); - -template -void launch_attn_softmax(T *vals, const T *attn_mask, int batch_size, int heads, - int from_len, int to_len, bool mask_future, - hipStream_t stream); - -template -void launch_attn_softmax_bw(T *out_grad, const T *soft_inp, int rows, - int softmax_len, hipStream_t stream); - -// [b, s, h] -> [b, nh, s, ad] -template -void launch_transform_0213(T *output, const T *vals, int batch_size, - int seq_length, int hidden_dim, int nhead, - hipStream_t stream); - -// [b, s, 3, h] -> [3, b, nh, s, ad] -template -void launch_bias_add_transform_20314(T *output, const T *input, const T *bias, - int dim_0, int dim_1, int dim_2, int dim_3, - int dim_4, hipStream_t stream); - -// [tc, b, nh, s, ad] -> [b, s, tc, nh, ad] -template -void launch_transform4d_0213(T *output, const T *vals, int batch_size, - int seq_len, int hidden_dim, int nhead, - int trans_count, hipStream_t stream); - -template -void launch_ls_dropout(T *out, const T *vals, uint8_t *mask, int total_count, - float ratio, hipStream_t stream, bool backward = false); - -template -void launch_ls_dropout_res_bias(T *out, const T *vals, uint8_t *mask, - const T *bias, const T *residual, - int total_count, int dim, float ratio, - hipStream_t stream); - -template -void launch_ls_dropout_act_bias(T *out, const T *vals, uint8_t *mask, - const T *bias, int total_count, int dim, - float ratio, hipStream_t stream); - -template -void launch_ls_dropout_bias_bwd(T *in_grad, T *bias_grad, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream); 
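These `launch_*` entry points are the raw C++ surface beneath the wrapper classes. A hedged sketch of driving plain dropout directly (function and buffer names assumed):

```cpp
// Illustrative direct call into the launcher declared above; `mask` must hold
// one byte per element and persist until the backward pass.
#include <hip/hip_runtime.h>

#include "kernels.h"

void dropout_fw_bw(float *out, const float *in, float *grad, uint8_t *mask,
                   int count, float ratio, hipStream_t stream) {
  // Forward: draws the mask and scales survivors by 1/(1-ratio).
  launch_ls_dropout<float>(out, in, mask, count, ratio, stream, false);
  // Backward: replays the saved mask onto the gradient, in place.
  launch_ls_dropout<float>(grad, grad, mask, count, ratio, stream, true);
}
```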
- -template -void launch_ls_dropout_act_bias_bwd(T *in_grad, T *bias_grad, const T *input, - const T *bias, const T *out_grad, - const uint8_t *mask, int row_size, int dim, - float ratio, hipStream_t stream); - -template -void launch_fuse_transpose_bias_kernel(const T *inp, T *out, int rows, int cols, - hipStream_t stream); - -void launch_param_update(const float *input, __half *output, int size, - hipStream_t stream); - -template -void launch_concat3_dim1(const T *inp1, const T *inp2, T *output, int sz0, - int sz2, int sz1_1, int sz1_2, hipStream_t stream); - -template -void launch_fused_add2(T *out, const T *inp1, const T *inp2, int batch_size, - int seq_len, int hidden_size, hipStream_t &stream); - -template -void launch_cross_entropy_fw(const T *inputs_ptr, const int *targets_ptr, - float *outputs_ptr, float *nll_loss_ptr, - float *loss_buffer, const int padding_idx, - const float epsilon, const int batch_size, - const int seq_len, const int vocab_size, - hipStream_t stream); - -template -void launch_cross_entropy_bw(const float *grad_outputs_ptr, const T *inputs_ptr, - const int *targets_ptr, T *grad_inputs_ptr, - const int padding_idx, const float epsilon, - const int batch_size, const int seq_len, - const int vocab_size, hipStream_t stream); - -template -void launch_lookup_scale_pos_dropout( - T *output, const int *input, const T *embeddings, const T *pos_embeddings, - uint8_t *dropout_mask, int batch_size, int seq_len, int embedding_dim, - int padding_idx, float dropout_ratio, int step, hipStream_t &stream); - -template -void launch_d_lookup_scale_pos_dropout( - T *grad_embeddings, const T *grad_output, const int *input, - const uint8_t *dropout_mask, int batch_size, int seq_len, int embedding_dim, - int vocab_size, int padding_idx, float dropout_ratio, hipStream_t &stream); - -/* Convert 2-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_2dim(int id1, int id2, int dim2) { - return id1 * dim2 + id2; -} - -/* Convert 3-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_3dim(int id1, int id2, int id3, - int dim2, int dim3) { - return id1 * dim2 * dim3 + id2 * dim3 + id3; -} - -/* Convert 4-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_4dim(int id1, int id2, int id3, - int id4, int dim2, int dim3, - int dim4) { - // return id1*(dim2*dim3*dim4) + id2*(dim3*dim4) + id3*dim4 + id4; - int res = id4; - - int ld = dim4; - res += id3 * ld; - - ld *= dim3; - res += id2 * ld; - - ld *= dim2; - res += id1 * ld; - - return res; -} - -/* Convert 5-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_5dim(int id1, int id2, int id3, - int id4, int id5, int dim2, - int dim3, int dim4, - int dim5) { - // return id1*(dim2*dim3*dim4*dim5) + id2*(dim3*dim4*dim5) + id3*(dim4*dim5) + - // id4*dim5 + dim5; - int res = id5; - - int ld = dim5; - res += id4 * ld; - - ld *= dim4; - res += id3 * ld; - - ld *= dim3; - res += id2 * ld; - - ld *= dim2; - res += id1 * ld; - - return res; -} - -/* Convert 6-dim tensor index into vector index */ -__forceinline__ __host__ __device__ int flat_6dim(int id1, int id2, int id3, - int id4, int id5, int id6, - int dim2, int dim3, int dim4, - int dim5, int dim6) { - // return id1*(dim2*dim3*dim4*dim5*dim6) + id2*(dim3*dim4*dim5*dim6) + - // id3*(dim4*dim5*dim6) + id4*(dim5*dim6) + id5*dim6 + id6; - int res = id6; - - int ld = dim6; - res += id5 * ld; - - ld *= dim5; - res += id4 * ld; - - ld *= dim4; - res += id3 * ld; - - ld *= dim3; - res 
+= id2 * ld; - - ld *= dim2; - res += id1 * ld; - - return res; -} - -/* Convert vector index to 6-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_6dim( - int src, int dim1, int dim2, int dim3, int dim4, int dim5, int *id0, - int *id1, int *id2, int *id3, int *id4, int *id5) { - *id5 = src % dim5; - src /= dim5; - - *id4 = src % dim4; - src /= dim4; - - *id3 = src % dim3; - src /= dim3; - - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 5-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_5dim(int src, int dim1, - int dim2, int dim3, - int dim4, int *id0, - int *id1, int *id2, - int *id3, int *id4) { - *id4 = src % dim4; - src /= dim4; - - *id3 = src % dim3; - src /= dim3; - - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 4-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_4dim(int src, int dim1, - int dim2, int dim3, - int *id0, int *id1, - int *id2, int *id3) { - *id3 = src % dim3; - src /= dim3; - - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 3-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_3dim(int src, int dim1, - int dim2, int *id0, - int *id1, int *id2) { - *id2 = src % dim2; - src /= dim2; - - *id1 = src % dim1; - *id0 = src / dim1; -} - -/* Convert vector index to 2-dim tensor index */ -__forceinline__ __host__ __device__ void decompose_2dim(int src, int dim1, - int *id0, int *id1) { - *id1 = src % dim1; - *id0 = src / dim1; -} diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/ls_cub.cuh b/colossalai/kernel/hip_native/csrc/kernels/include/ls_cub.cuh deleted file mode 100644 index bc0dac3609da82963bf08eea659c9add17f811b0..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/include/ls_cub.cuh +++ /dev/null @@ -1,13 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -// copied from https://github.com/dmlc/dgl/pull/2758 -#ifndef DGL_ARRAY_CUDA_DGL_CUB_CUH_ -#define DGL_ARRAY_CUDA_DGL_CUB_CUH_ - -#define CUB_NS_PREFIX namespace ls { -#define CUB_NS_POSTFIX } -#include "hipcub/hipcub.hpp" -#include "hipcub/hipcub.hpp" -#undef CUB_NS_POSTFIX -#undef CUB_NS_PREFIX - -#endif diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/normalize_layer.h b/colossalai/kernel/hip_native/csrc/kernels/include/normalize_layer.h deleted file mode 100644 index eeb3e8c9fc850d869744f5f76d2a1030125bdd39..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/include/normalize_layer.h +++ /dev/null @@ -1,66 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
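The `flat_*dim` and `decompose_*dim` helpers above are mutually inverse row-major index maps, so flattening a coordinate tuple and decomposing the result must return the original coordinates. A minimal host-side sketch of that round trip (the two bodies are copied from the header; dropping the `__host__ __device__` qualifiers so it builds with a plain C++ compiler is the only change):

```cpp
#include <cassert>
#include <cstdio>

// Host-only copies of the device index helpers above.
int flat_4dim(int id1, int id2, int id3, int id4, int dim2, int dim3, int dim4) {
  return ((id1 * dim2 + id2) * dim3 + id3) * dim4 + id4;
}

void decompose_4dim(int src, int dim1, int dim2, int dim3,
                    int *id0, int *id1, int *id2, int *id3) {
  *id3 = src % dim3;
  src /= dim3;
  *id2 = src % dim2;
  src /= dim2;
  *id1 = src % dim1;
  *id0 = src / dim1;
}

int main() {
  // A contiguous [2, 3, 4, 5] tensor: both helpers take only the trailing
  // three dims, since the leading dim never enters the stride arithmetic.
  int flat = flat_4dim(1, 2, 3, 4, 3, 4, 5);  // ((1*3+2)*4+3)*5+4 = 119
  int i0, i1, i2, i3;
  decompose_4dim(flat, 3, 4, 5, &i0, &i1, &i2, &i3);
  assert(i0 == 1 && i1 == 2 && i2 == 3 && i3 == 4);  // round trip holds
  printf("flat index %d -> (%d, %d, %d, %d)\n", flat, i0, i1, i2, i3);
  return 0;
}
```

The same invariant holds for the 2- through 6-dim variants.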
-#pragma once
-
-#include <hip/hip_runtime.h>
-#include <hip/hip_fp16.h>
-#include <stdio.h>
-
-#include <fstream>
-
-#include "../../../../hip_native/csrc/kernels/include/kernels.h"
-
-using namespace std;
-
-template <typename T>
-class Normalize_Layer {
- public:
-  struct Config {
-    uint32_t hidden_dim;
-    bool use_mean;
-    Config(uint32_t hidden_dim, bool use_mean = false)
-        : hidden_dim(hidden_dim), use_mean(use_mean) {}
-  };
-
-  Normalize_Layer(Config config, size_t max_rows)
-      : config_(config), vars_(nullptr), means_(nullptr) {
-    vars_ = cuda_malloc<T>(max_rows);
-    if (config_.use_mean) {
-      means_ = cuda_malloc<T>(max_rows);
-    }
-  }
-
-  ~Normalize_Layer() {
-    cuda_free(vars_);
-    cuda_free(means_);
-  }
-
-  void Forward(T *ln_res, const T *inp, const T *gamma, const T *betta,
-               int batch_size, hipStream_t stream) {
-    launch_layer_norm<T>(ln_res, vars_, means_, inp, gamma, betta, batch_size,
-                         config_.hidden_dim, stream);
-  }
-
-  /*
-  residual_grad, inp_or_out, betta should be treated carefully.
-  inp_or_out = input if use_mean else output
-  residual_grad, betta can be nullptr.
-  residual_grad will be added to dinp if it is not nullptr,
-    which is useful in transformer layers with pre-ln.
-  betta is only used to compute xhat;
-    (use_mean == false) ^ (betta == nullptr) should be true.
-  */
-  void Backward(T *gamma_grad, T *betta_grad, T *inp_grad, const T *out_grad,
-                const T *residual_grad, const T *inp_or_out, const T *gamma,
-                const T *betta, int batch_size, hipStream_t stream[2]) {
-    launch_ln_bw<T>(gamma_grad, betta_grad, inp_grad, out_grad, residual_grad,
-                    inp_or_out, gamma, betta, vars_, means_, batch_size,
-                    config_.hidden_dim, stream);
-  }
-
-  inline bool use_mean() const { return config_.use_mean; }
-
- private:
-  Config config_;
-  T *vars_;
-  T *means_;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/softmax.h b/colossalai/kernel/hip_native/csrc/kernels/include/softmax.h
deleted file mode 100644
index 74033646e930d5d505436a8f743e8f18c473e935..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/softmax.h
+++ /dev/null
@@ -1,45 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
-#pragma once
-
-#include <hip/hip_runtime.h>
-#include <hip/hip_fp16.h>
-#include <stdio.h>
-
-#include <fstream>
-
-#include "../../../../hip_native/csrc/kernels/include/kernels.h"
-
-using namespace std;
-
-template <typename T>
-class Softmax {
- public:
-  struct Config {
-    size_t nhead;
-    Config(size_t nhead) : nhead(nhead) {}
-  };
-
-  Softmax(Config config) : config_(config) {}
-
-  ~Softmax() {}
-
-  void Forward(T *vals, const T *attn_mask, int batch_size, int from_len,
-               int to_len, hipStream_t &stream, bool mask_future = true) {
-    launch_attn_softmax<T>(vals, attn_mask, batch_size, config_.nhead, from_len,
-                           to_len, mask_future, stream);
-  }
-
-  void Backward(T *out_grad, const T *soft_out, int batch_size, int from_len,
-                int to_len, hipStream_t stream) {
-    launch_attn_softmax_bw<T>(out_grad, soft_out,
-                              batch_size * config_.nhead * from_len, to_len,
-                              stream);
-  }
-
-  void reset_size(size_t nhead) { config_.nhead = nhead; }
-
- private:
-  Config config_;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/include/strided_batch_gemm.h b/colossalai/kernel/hip_native/csrc/kernels/include/strided_batch_gemm.h
deleted file mode 100644
index 5aa8314ada582bd608365dbe90504ff8ddcdf9b4..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/include/strided_batch_gemm.h
+++ /dev/null
@@ -1,122 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
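The `Backward` method in the `StridedBatchGemm` class below recovers the operand gradients of the batched product $C = \mathrm{op}_A(A)\,\mathrm{op}_B(B)$ from the standard matrix-calculus identities; as a reminder (my own restatement, not taken from the source), for the untransposed case $\mathrm{op}_A = \mathrm{op}_B = N$:

$$
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}\,B^{\top},
\qquad
\frac{\partial L}{\partial B} = A^{\top}\,\frac{\partial L}{\partial C}.
$$

When an operand was transposed in the forward pass, the corresponding factors swap and transpose accordingly, which is why `Backward` flips `op_A`/`op_B` and recomputes the strides before each of its two gemm calls.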
-/* Copyright 2021 The LightSeq Team
-   Copyright Microsoft DeepSpeed
-   This file is adapted from Microsoft DeepSpeed
-*/
-#pragma once
-
-#include <hip/hip_runtime.h>
-#include <hip/hip_fp16.h>
-#include <stdio.h>
-
-#include <array>
-
-#include "../../../../hip_native/csrc/kernels/include/cublas_wrappers.h"
-
-template <typename T>
-class StridedBatchGemm {
- public:
-  struct Config {
-    int m;
-    int n;
-    int k;
-    float alpha;
-    float beta;
-    rocblas_operation op_A;
-    rocblas_operation op_B;
-    std::array<int, 3> gemm_algos;
-
-    Config(float param_alpha, float param_beta, rocblas_operation opA,
-           rocblas_operation opB)
-        : alpha(param_alpha),
-          beta(param_beta),
-          op_A(opA),
-          op_B(opB),
-          gemm_algos(std::array<int, 3>({99, 99, 99})) {}
-    void SetConfig(int mm, int nn, int kk) {
-      m = mm;
-      n = nn;
-      k = kk;
-    }
-  };
-
-  StridedBatchGemm(const Config &config) : _config(config) {}
-
-  virtual ~StridedBatchGemm() {}
-
-  void Forward(int bsz, T *output, const T *_buffer_a, const T *_buffer_b,
-               rocblas_handle handle) {
-    int stride_a = _config.m * _config.k;
-    int stride_b = _config.n * _config.k;
-    int stride_c = _config.m * _config.n;
-
-#ifdef COLOSSAL_HIP
-    cublas_strided_batched_gemm(
-        handle, _config.m, _config.n, _config.k, &_config.alpha, &_config.beta,
-        _buffer_a, _buffer_b, output, _config.op_A, _config.op_B, stride_a,
-        stride_b, stride_c, bsz, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
-    cublas_strided_batched_gemm(
-        handle, _config.m, _config.n, _config.k, &_config.alpha, &_config.beta,
-        _buffer_a, _buffer_b, output, _config.op_A, _config.op_B, stride_a,
-        stride_b, stride_c, bsz, cublasGemmAlgo_t(_config.gemm_algos[0]));
-#endif
-  }
-
-  void Backward(int bsz, const T *d_output, const T *_buffer_a,
-                const T *_buffer_b, rocblas_handle handle,
-                T *inpGradA = nullptr, T *inpGradB = nullptr) {
-    int mb = (_config.op_A == rocblas_operation_transpose ? _config.k : _config.m);
-    int kb = (_config.op_A == rocblas_operation_transpose ? _config.m : _config.k);
-
-    int stride_a = mb * _config.n;
-    int stride_b = _config.n * kb;
-    int stride_c = _config.m * _config.k;
-
-    // B needs to be transposed.
-    rocblas_operation op_b =
-        (_config.op_B == rocblas_operation_transpose ? rocblas_operation_none
-                                                     : rocblas_operation_transpose);
-
-    // Calculate d_A.
-#ifdef COLOSSAL_HIP
-    cublas_strided_batched_gemm(
-        handle, mb, kb, _config.n, &_config.alpha, &_config.beta,
-        (_config.op_A == rocblas_operation_transpose ? _buffer_b : d_output),
-        (_config.op_A == rocblas_operation_transpose ? d_output : _buffer_b), inpGradA,
-        rocblas_operation_none, op_b, stride_a, stride_b, stride_c, bsz,
-        rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
-    cublas_strided_batched_gemm(
-        handle, mb, kb, _config.n, &_config.alpha, &_config.beta,
-        (_config.op_A == rocblas_operation_transpose ? _buffer_b : d_output),
-        (_config.op_A == rocblas_operation_transpose ? d_output : _buffer_b), inpGradA,
-        rocblas_operation_none, op_b, stride_a, stride_b, stride_c, bsz,
-        cublasGemmAlgo_t(_config.gemm_algos[1]));
-#endif
-    // A needs to be transposed.
-    rocblas_operation op_a =
-        (_config.op_A == rocblas_operation_transpose ? rocblas_operation_none
-                                                     : rocblas_operation_transpose);
-
-    stride_a = _config.m * _config.k;
-    stride_b = _config.m * _config.n;
-    stride_c = _config.n * _config.k;
-
-    // Calculate d_B.
-#ifdef COLOSSAL_HIP
-    cublas_strided_batched_gemm(
-        handle, _config.k, _config.n, _config.m, &_config.alpha, &_config.beta,
-        _buffer_a, d_output, inpGradB, op_a, rocblas_operation_none, stride_a, stride_b,
-        stride_c, bsz, rocblas_gemm_algo(rocblas_gemm_algo_standard));
-#else
-    cublas_strided_batched_gemm(
-        handle, _config.k, _config.n, _config.m, &_config.alpha, &_config.beta,
-        _buffer_a, d_output, inpGradB, op_a, rocblas_operation_none, stride_a, stride_b,
-        stride_c, bsz, cublasGemmAlgo_t(_config.gemm_algos[2]));
-#endif
-  }
-
-  inline void SetConfig(int m, int n, int k) { _config.SetConfig(m, n, k); }
-
- private:
-  Config _config;
-};
diff --git a/colossalai/kernel/hip_native/csrc/kernels/normalize_kernels.hip b/colossalai/kernel/hip_native/csrc/kernels/normalize_kernels.hip
deleted file mode 100644
index 4394eb18a750ba013df522028cb8188ed708aa8c..0000000000000000000000000000000000000000
--- a/colossalai/kernel/hip_native/csrc/kernels/normalize_kernels.hip
+++ /dev/null
@@ -1,1288 +0,0 @@
-// !!! This is a file automatically generated by hipify!!!
-#include "hip/hip_runtime.h"
-#include "block_reduce.h"
-#include "kernels.h"
-
-#ifndef COLOSSAL_HIP
-#include <cooperative_groups.h>
-
-namespace cg = cooperative_groups;
-#endif
-
-const float LN_EPSILON = 1e-8f;
-#define TILE_DIM 32
-
-template <typename T>
-__forceinline__ __device__ T add_eps(T x) {
-  return fabsf(x) > LN_EPSILON ? x : (x < 0 ? -LN_EPSILON : LN_EPSILON);
-}
-
-/**
-@brief: ker_layer_norm
-Standard layer normalization.
-It outputs the layer norm result as well as the variance per token,
-  and may also output the means, depending on whether
-  the means argument is nullptr.
-
-@thread
-gridDim.x = batch_size * seq_len
-blockDim.x = hidden_size
-
-@param
-ln_res: [batch_size * seq_len, hidden_size], ln result.
-vars: [batch_size * seq_len], variance per token
-means: [batch_size * seq_len], means per token, may be nullptr
-inp: [batch_size * seq_len, hidden_size], ln input.
-scale: [hidden_size], ln scale
-bias: [hidden_size], ln bias
-*/
-template <typename T>
-__global__ void ker_layer_norm(T *ln_res, T *vars, T *means, const T *inp,
-                               const T *scale, const T *bias, int hidden_size) {
-  // step 0. compute local sum
-  float l_sum = 0;
-  float l_square_sum = 0;
-  const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * hidden_size;
-  for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
-    float4 val = inp_f4[idx];
-    l_sum += val.x + val.y + val.z + val.w;
-    l_square_sum +=
-        val.x * val.x + val.y * val.y + val.z * val.z + val.w * val.w;
-  }
-
-  // step 1. compute reduce sum
-  float mean_dim = float(hidden_size) * 4.f;
-  float reduce_val[2] = {l_sum, l_square_sum};
-  blockReduce<ReduceType::kSum, 2>(reduce_val);
-  __shared__ float s_mean, s_var;
-  if (threadIdx.x == 0) {
-    s_mean = reduce_val[0] / mean_dim;
-    if (means != nullptr) {
-      means[blockIdx.x] = s_mean;
-    }
-    s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON;
-    vars[blockIdx.x] = s_var;
-    s_var = rsqrtf(s_var);
-  }
-  __syncthreads();
-
-  // step 2.
layer norm result - float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float4 vscale = __ldg((const float4 *)scale + idx); - float4 vbias = __ldg((const float4 *)bias + idx); - float4 val = inp_f4[idx]; - val.x = (val.x - s_mean) * s_var * vscale.x + vbias.x; - val.y = (val.y - s_mean) * s_var * vscale.y + vbias.y; - val.z = (val.z - s_mean) * s_var * vscale.z + vbias.z; - val.w = (val.w - s_mean) * s_var * vscale.w + vbias.w; - output_f4[idx] = val; - } -} - -template <> -__global__ void ker_layer_norm<__half>(__half *ln_res, __half *vars, - __half *means, const __half *inp, - const __half *scale, const __half *bias, - int hidden_size) { - // step 0. compute local sum - float l_sum = 0; - float l_square_sum = 0; - const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - float4 val_f4 = inp_f4[idx]; - __half2 *val_h2 = (__half2 *)(&val_f4); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 val_f2 = __half22float2(val_h2[i]); - l_sum += val_f2.x + val_f2.y; - l_square_sum += val_f2.x * val_f2.x + val_f2.y * val_f2.y; - } - } - - // step 1. compute reduce sum - float mean_dim = float(hidden_size) * 8.f; - float reduce_val[2] = {l_sum, l_square_sum}; - blockReduce(reduce_val); - __shared__ float s_mean, s_var; - if (threadIdx.x == 0) { - s_mean = reduce_val[0] / mean_dim; - if (means != nullptr) { - means[blockIdx.x] = s_mean; - } - s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; - vars[blockIdx.x] = s_var; - s_var = rsqrtf(s_var); - } - __syncthreads(); - - // step 2. layer norm result - float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size; - for (uint idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { - // load scale, bias, input - float4 scale_f4 = __ldg((const float4 *)scale + idx); - __half2 *scale_h2 = (__half2 *)(&scale_f4); - float4 bias_f4 = __ldg((const float4 *)bias + idx); - __half2 *bias_h2 = (__half2 *)(&bias_f4); - float4 val_f4 = inp_f4[idx]; - __half2 *val_h2 = (__half2 *)(&val_f4); - -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 scale_f2 = __half22float2(scale_h2[i]); - float2 bias_f2 = __half22float2(bias_h2[i]); - float2 val_f2 = __half22float2(val_h2[i]); - val_f2.x = (val_f2.x - s_mean) * s_var * scale_f2.x + bias_f2.x; - val_f2.y = (val_f2.y - s_mean) * s_var * scale_f2.y + bias_f2.y; - val_h2[i] = __float22half2_rn(val_f2); - } - output_f4[idx] = val_f4; - } -} - -// __global__ void ker_layer_norm_x2(__half *ln_res, __half *vars, -// __half *means, const __half *inp, -// const __half *scale, const __half *bias, -// int hidden_size) { -// // step 0. compute local sum -// float l_sum = 0; -// float l_square_sum = 0; -// const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * 2 * hidden_size; -// for (uint idx = 2 * threadIdx.x; idx < hidden_size * 2; idx += blockDim.x * 2) { -// float4 val_f4 = inp_f4[idx]; -// float4 val_f4_1 = inp_f4[idx+1]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// l_sum += val_f2.x + val_f2.y + val_f2_1.x + val_f2_1.y; -// l_square_sum += val_f2.x * val_f2.x + val_f2.y * val_f2.y + val_f2_1.x * val_f2_1.x + val_f2_1.y * val_f2_1.y; -// } -// } - -// // step 1. 
compute reduce sum -// float mean_dim = float(hidden_size) * 8.f * 2; -// float reduce_val[2] = {l_sum, l_square_sum}; -// blockReduce(reduce_val); -// __shared__ float s_mean, s_var; -// if (threadIdx.x == 0) { -// s_mean = reduce_val[0] / mean_dim; -// if (means != nullptr) { -// means[blockIdx.x] = s_mean; -// } -// s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; -// vars[blockIdx.x] = s_var; -// s_var = rsqrtf(s_var); -// } -// __syncthreads(); - -// // step 2. layer norm result -// float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size * 2; -// for (uint idx = 2 * threadIdx.x; idx < hidden_size * 2; idx += blockDim.x * 2) { -// // load scale, bias, input -// float4 scale_f4 = __ldg((const float4 *)scale + idx); -// __half2 *scale_h2 = (__half2 *)(&scale_f4); -// float4 scale_f4_1 = __ldg((const float4 *)scale + idx + 1); -// __half2 *scale_h2_1 = (__half2 *)(&scale_f4_1); -// float4 bias_f4 = __ldg((const float4 *)bias + idx); -// __half2 *bias_h2 = (__half2 *)(&bias_f4); -// float4 bias_f4_1 = __ldg((const float4 *)bias + idx + 1); -// __half2 *bias_h2_1 = (__half2 *)(&bias_f4_1); -// float4 val_f4 = inp_f4[idx]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// float4 val_f4_1 = inp_f4[idx+1]; -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); - -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 scale_f2 = __half22float2(scale_h2[i]); -// float2 scale_f2_1 = __half22float2(scale_h2_1[i]); -// float2 bias_f2 = __half22float2(bias_h2[i]); -// float2 bias_f2_1 = __half22float2(bias_h2_1[i]); -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// val_f2.x = (val_f2.x - s_mean) * s_var * scale_f2.x + bias_f2.x; -// val_f2.y = (val_f2.y - s_mean) * s_var * scale_f2.y + bias_f2.y; -// val_h2[i] = __float22half2_rn(val_f2); -// val_f2_1.x = (val_f2_1.x - s_mean) * s_var * scale_f2_1.x + bias_f2_1.x; -// val_f2_1.y = (val_f2_1.y - s_mean) * s_var * scale_f2_1.y + bias_f2_1.y; -// val_h2_1[i] = __float22half2_rn(val_f2_1); -// } -// output_f4[idx] = val_f4; -// output_f4[idx+1] = val_f4_1; -// } -// } - -// __global__ void ker_layer_norm_x4(__half *ln_res, __half *vars, -// __half *means, const __half *inp, -// const __half *scale, const __half *bias, -// int hidden_size) { -// // step 0. compute local sum -// float l_sum = 0; -// float l_square_sum = 0; -// const float4 *inp_f4 = (const float4 *)inp + blockIdx.x * hidden_size * 4; -// for (uint idx = 4 * threadIdx.x; idx < hidden_size * 4; idx += blockDim.x * 4) { -// float4 val_f4 = inp_f4[idx]; -// float4 val_f4_1 = inp_f4[idx+1]; -// float4 val_f4_2 = inp_f4[idx+2]; -// float4 val_f4_3 = inp_f4[idx+3]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); -// __half2 *val_h2_2 = (__half2 *)(&val_f4_2); -// __half2 *val_h2_3 = (__half2 *)(&val_f4_3); -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// float2 val_f2_2 = __half22float2(val_h2_2[i]); -// float2 val_f2_3 = __half22float2(val_h2_3[i]); -// l_sum += val_f2.x + val_f2.y + val_f2_1.x + val_f2_1.y + val_f2_2.x + val_f2_2.y + val_f2_3.x + val_f2_3.y; -// l_square_sum += val_f2.x * val_f2.x + val_f2.y * val_f2.y; -// l_square_sum += val_f2_1.x * val_f2_1.x + val_f2_1.y * val_f2_1.y; -// l_square_sum += val_f2_2.x * val_f2_2.x + val_f2_2.y * val_f2_2.y; -// l_square_sum += val_f2_3.x * val_f2_3.x + val_f2_3.y * val_f2_3.y; -// } -// } - -// // step 1. 
compute reduce sum -// float mean_dim = float(hidden_size) * 8.f * 4; -// float reduce_val[2] = {l_sum, l_square_sum}; -// blockReduce(reduce_val); -// __shared__ float s_mean, s_var; -// if (threadIdx.x == 0) { -// s_mean = reduce_val[0] / mean_dim; -// if (means != nullptr) { -// means[blockIdx.x] = s_mean; -// } -// s_var = reduce_val[1] / mean_dim - s_mean * s_mean + LN_EPSILON; -// vars[blockIdx.x] = s_var; -// s_var = rsqrtf(s_var); -// } -// __syncthreads(); - -// // step 2. layer norm result -// float4 *output_f4 = (float4 *)ln_res + blockIdx.x * hidden_size * 4; -// for (uint idx = 4 * threadIdx.x; idx < hidden_size * 4; idx += blockDim.x * 4) { -// // load scale, bias, input -// float4 scale_f4 = __ldg((const float4 *)scale + idx); -// __half2 *scale_h2 = (__half2 *)(&scale_f4); -// float4 scale_f4_1 = __ldg((const float4 *)scale + idx + 1); -// __half2 *scale_h2_1 = (__half2 *)(&scale_f4_1); -// float4 scale_f4_2 = __ldg((const float4 *)scale + idx + 2); -// __half2 *scale_h2_2 = (__half2 *)(&scale_f4_2); -// float4 scale_f4_3 = __ldg((const float4 *)scale + idx + 3); -// __half2 *scale_h2_3 = (__half2 *)(&scale_f4_3); -// float4 bias_f4 = __ldg((const float4 *)bias + idx); -// __half2 *bias_h2 = (__half2 *)(&bias_f4); -// float4 bias_f4_1 = __ldg((const float4 *)bias + idx + 1); -// __half2 *bias_h2_1 = (__half2 *)(&bias_f4_1); -// float4 bias_f4_2 = __ldg((const float4 *)bias + idx + 2); -// __half2 *bias_h2_2 = (__half2 *)(&bias_f4_2); -// float4 bias_f4_3 = __ldg((const float4 *)bias + idx + 3); -// __half2 *bias_h2_3 = (__half2 *)(&bias_f4_3); -// float4 val_f4 = inp_f4[idx]; -// __half2 *val_h2 = (__half2 *)(&val_f4); -// float4 val_f4_1 = inp_f4[idx+1]; -// __half2 *val_h2_1 = (__half2 *)(&val_f4_1); -// float4 val_f4_2 = inp_f4[idx+2]; -// __half2 *val_h2_2 = (__half2 *)(&val_f4_2); -// float4 val_f4_3 = inp_f4[idx+3]; -// __half2 *val_h2_3 = (__half2 *)(&val_f4_3); - -// #pragma unroll -// for (int i = 0; i < 4; i++) { -// float2 scale_f2 = __half22float2(scale_h2[i]); -// float2 scale_f2_1 = __half22float2(scale_h2_1[i]); -// float2 scale_f2_2 = __half22float2(scale_h2_2[i]); -// float2 scale_f2_3 = __half22float2(scale_h2_3[i]); -// float2 bias_f2 = __half22float2(bias_h2[i]); -// float2 bias_f2_1 = __half22float2(bias_h2_1[i]); -// float2 bias_f2_2 = __half22float2(bias_h2_2[i]); -// float2 bias_f2_3 = __half22float2(bias_h2_3[i]); -// float2 val_f2 = __half22float2(val_h2[i]); -// float2 val_f2_1 = __half22float2(val_h2_1[i]); -// float2 val_f2_2 = __half22float2(val_h2_2[i]); -// float2 val_f2_3 = __half22float2(val_h2_3[i]); -// val_f2.x = (val_f2.x - s_mean) * s_var * scale_f2.x + bias_f2.x; -// val_f2.y = (val_f2.y - s_mean) * s_var * scale_f2.y + bias_f2.y; -// val_f2_1.x = (val_f2_1.x - s_mean) * s_var * scale_f2_1.x + bias_f2_1.x; -// val_f2_1.y = (val_f2_1.y - s_mean) * s_var * scale_f2_1.y + bias_f2_1.y; -// val_f2_2.x = (val_f2_2.x - s_mean) * s_var * scale_f2_2.x + bias_f2_2.x; -// val_f2_2.y = (val_f2_2.y - s_mean) * s_var * scale_f2_2.y + bias_f2_2.y; -// val_f2_3.x = (val_f2_3.x - s_mean) * s_var * scale_f2_3.x + bias_f2_3.x; -// val_f2_3.y = (val_f2_3.y - s_mean) * s_var * scale_f2_3.y + bias_f2_3.y; -// val_h2[i] = __float22half2_rn(val_f2); -// val_h2_1[i] = __float22half2_rn(val_f2_1); -// val_h2_2[i] = __float22half2_rn(val_f2_2); -// val_h2_3[i] = __float22half2_rn(val_f2_3); -// } -// output_f4[idx] = val_f4; -// output_f4[idx+1] = val_f4_1; -// output_f4[idx+2] = val_f4_2; -// output_f4[idx+3] = val_f4_3; -// } -// } - -template <> -void 
launch_layer_norm(float *ln_res, float *vars, float *means, - const float *inp, const float *scale, - const float *bias, int batch_size, int hidden_dim, - hipStream_t stream) { - if (hidden_dim % 4 != 0) { - throw std::runtime_error("violate hidden_dim % 4 = 0"); - } - hidden_dim >>= 2; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - dim3 grid_dim(batch_size); - dim3 block_dim(nthread); - - hipLaunchKernelGGL(( ker_layer_norm), dim3(grid_dim), dim3(block_dim), 0, stream, - ln_res, vars, means, inp, scale, bias, hidden_dim); -} - -template <> -void launch_layer_norm<__half>(__half *ln_res, __half *vars, __half *means, - const __half *inp, const __half *scale, - const __half *bias, int batch_size, - int hidden_dim, hipStream_t stream) { - if (hidden_dim % 8 != 0) { - throw std::runtime_error("violate hidden_dim % 8 = 0"); - } - hidden_dim >>= 3; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - dim3 grid_dim(batch_size); - dim3 block_dim(nthread); - - hipLaunchKernelGGL(( ker_layer_norm<__half>), dim3(grid_dim), dim3(block_dim), 0, stream, - ln_res, vars, means, inp, scale, bias, hidden_dim); - // if (hidden_dim % 8 != 0) { - // throw std::runtime_error("violate hidden_dim % 8 = 0"); - // } - // hidden_dim >>= 3; - - // if (hidden_dim * 8 < 8192) { - // int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - // dim3 grid_dim(batch_size); - // dim3 block_dim(nthread); - // hipLaunchKernelGGL(( ker_layer_norm<__half>), dim3(grid_dim), dim3(block_dim), 0, stream, - // ln_res, vars, means, inp, scale, bias, hidden_dim); - // } else if (hidden_dim * 8 >= 8192 && hidden_dim * 8 <= 8192 * 2) { - // hidden_dim >>= 1; - // int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - // dim3 grid_dim(batch_size); - // dim3 block_dim(nthread); - // hipLaunchKernelGGL(( ker_layer_norm_x2), dim3(grid_dim), dim3(block_dim), 0, stream, - // ln_res, vars, means, inp, scale, bias, hidden_dim); - // } else if (hidden_dim * 8 > 8192 * 2 && hidden_dim * 8 <= 8192 * 4) { - // hidden_dim >>= 2; - // int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - // dim3 grid_dim(batch_size); - // dim3 block_dim(nthread); - // hipLaunchKernelGGL(( ker_layer_norm_x4), dim3(grid_dim), dim3(block_dim), 0, stream, - // ln_res, vars, means, inp, scale, bias, hidden_dim); - // } else { - // throw std::runtime_error("hidden_dim % 4 != 0 || hidden_dim > 32768"); - // } -} - -/** -@brief: ker_ln_bw_dgamma_dbetta -Layer norm backword kernel, compute the gradient of gamma and betta. 
-dbetta = sum(dout, dim=0) -dgamma = sum(xhat * dout, dim=0) -xhat = (input - mean) * rsqrt(var) or - (output - betta) / gamma - - -@thread -gridDim.x = hidden_size / 32 -blockDim.x = 32 -blockDim.y = 32 - -@param -gamma_grad: [hidden_size], gradient of gamma -betta_grad: [hidden_size], gradient of betta -out_grad: [batch_size * seq_len, hidden_size], gradient of betta ln output -inp_or_out: [batch_size * seq_len, hidden_size], ln output if means is nullptr - ln input if means is not nullptr -gamma: [hidden_size], gamma of ln, - used to compute xhat, maybe nullptr -betta: [hidden_size], betta of ln, - used to compute xhat, maybe nullptr -vars: [batch_size * seq_len], variance of ln forward, - used to compute xhat, maybe nullptr -means: [batch_size * seq_len], mean of ln forward, - used to compute xhat, maybe nullptr -(gamma && betta) ^ (vars && means) should be true -*/ -template -__global__ void ker_ln_bw_dgamma_dbetta(T *gamma_grad, T *betta_grad, - const T *out_grad, const T *inp_or_out, - const T *gamma, const T *betta, - const T *vars, const T *means, int rows, - int width) { - __shared__ float betta_buffer[TILE_DIM][TILE_DIM]; - __shared__ float gamma_buffer[TILE_DIM][TILE_DIM]; - -#ifndef COLOSSAL_HIP - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); -#endif - - int idx = blockDim.x * blockIdx.x + threadIdx.x; - int offset = threadIdx.y * width + idx; - int y_stride = width * TILE_DIM; - - // Loop across inp height - float dbetta = 0; - float dgamma = 0; - float dout, val; - if (idx < width) { - if (means == nullptr) { - float vbetta = (float)betta[idx]; - float vgamma = (float)gamma[idx]; - for (int r = threadIdx.y; r < rows; r += TILE_DIM) { - dout = (float)out_grad[offset]; - // inp_or_out is output - val = (float)inp_or_out[offset]; - dbetta += dout; - dgamma += ((val - vbetta) / add_eps(vgamma) * dout); - offset += y_stride; - } - } else { - for (int r = threadIdx.y; r < rows; r += TILE_DIM) { - dout = (float)out_grad[offset]; - // inp_or_out is input - val = (float)inp_or_out[offset]; - dbetta += dout; - dgamma += ((val - (float)means[r]) * - rsqrtf((float)vars[r] + LN_EPSILON) * dout); - offset += y_stride; - } - } - } - - // Sum the shared buffer. - betta_buffer[threadIdx.x][threadIdx.y] = dbetta; - gamma_buffer[threadIdx.x][threadIdx.y] = dgamma; - __syncthreads(); - float s1 = betta_buffer[threadIdx.y][threadIdx.x]; - float s2 = gamma_buffer[threadIdx.y][threadIdx.x]; - __syncthreads(); - - for (int i = 1; i < TILE_DIM; i <<= 1) { -#ifdef COLOSSAL_HIP - s1 += __shfl_down(s1, i); - s2 += __shfl_down(s2, i); -#else - s1 += g.shfl_down(s1, i); - s2 += g.shfl_down(s2, i); -#endif - } - - int pos = blockIdx.x * TILE_DIM + threadIdx.y; - if (threadIdx.x == 0 && idx < width) { - betta_grad[pos] = s1; - gamma_grad[pos] = s2; - } -} - -/** -@brief: ker_ln_bw_dinp -Layer norm backword kernel, compute the gradient of input. 
-dinp = (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / hidden_dim) - * rsqrt(var) -xhat = (input - mean) * rsqrt(var) if mean is not nullptr - (output - betta) / gamma if mean is nullptr -dxhat = dout * gamma - - -@thread -gridDim.x = batch_size * seq_len -blockDim.x = hidden_size - -@param -inp_grad: [batch_size * seq_len, hidden_size], gradient of betta ln output -out_grad: [batch_size * seq_len, hidden_size], gradient of betta ln output -residual_grad: [batch_size * seq_len, hidden_size], gradient of residual input, - usually appear in pre-layer-norm for transformer layer, maybe nullptr -inp_or_out: [batch_size * seq_len, hidden_size], ln output if means is nullptr - ln input if means is not nullptr -gamma: [hidden_size], gamma of ln, - used to compute xhat and dxhat -betta: [hidden_size], betta of ln, - used to compute xhat, maybe nullptr -vars: [batch_size * seq_len], variance of ln forward, - used to compute xhat and dinp -means: [batch_size * seq_len], mean of ln forward, - used to compute xhat, maybe nullptr -*/ -template -__global__ void ker_ln_bw_dinp(T *inp_grad, const T *out_grad, - const T *residual_grad, const T *inp_or_out, - const T *gamma, const T *betta, const T *vars, - const T *means, int hidden_dim) { - int offset = blockIdx.x * hidden_dim + threadIdx.x; - float4 dxhat, xhat; - float var_rsqrt; - - if (threadIdx.x < hidden_dim) { - // step 0. dxhat = dout * gamma - dxhat = ((const float4 *)out_grad)[offset]; - float4 vgamma = ((const float4 *)gamma)[threadIdx.x]; - dxhat.x *= vgamma.x; - dxhat.y *= vgamma.y; - dxhat.z *= vgamma.z; - dxhat.w *= vgamma.w; - - /* - step 1. xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - xhat = ((const float4 *)inp_or_out)[offset]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[threadIdx.x]; - xhat.x = (xhat.x - vbetta.x) / add_eps(vgamma.x); - xhat.y = (xhat.y - vbetta.y) / add_eps(vgamma.y); - xhat.z = (xhat.z - vbetta.z) / add_eps(vgamma.z); - xhat.w = (xhat.w - vbetta.w) / add_eps(vgamma.w); - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; - xhat.x = (xhat.x - fmean) * var_rsqrt; - xhat.y = (xhat.y - fmean) * var_rsqrt; - xhat.z = (xhat.z - fmean) * var_rsqrt; - xhat.w = (xhat.w - fmean) * var_rsqrt; - } - } - - /* step2. block reduce sum for dxhat and dxhat*xhat */ - float reduce_val[2] = {0.f, 0.f}; - if (threadIdx.x < hidden_dim) { - reduce_val[0] = dxhat.x + dxhat.y + dxhat.z + dxhat.w; - reduce_val[1] = dxhat.x * xhat.x + dxhat.y * xhat.y + dxhat.z * xhat.z + - dxhat.w * xhat.w; - } - blockReduce(reduce_val); - __shared__ float s_sum_dxhat, s_sum_dxhat_xhat; - if (threadIdx.x == 0) { - float mean_dim = hidden_dim * 4; - s_sum_dxhat = reduce_val[0] / mean_dim; - s_sum_dxhat_xhat = reduce_val[1] / mean_dim; - } - __syncthreads(); - - /* - step3. 
compute input gradient - (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var) - */ - if (threadIdx.x >= hidden_dim) { - return; - } - dxhat.x = (dxhat.x - s_sum_dxhat - xhat.x * s_sum_dxhat_xhat) * var_rsqrt; - dxhat.y = (dxhat.y - s_sum_dxhat - xhat.y * s_sum_dxhat_xhat) * var_rsqrt; - dxhat.z = (dxhat.z - s_sum_dxhat - xhat.z * s_sum_dxhat_xhat) * var_rsqrt; - dxhat.w = (dxhat.w - s_sum_dxhat - xhat.w * s_sum_dxhat_xhat) * var_rsqrt; - if (residual_grad) { - // Add the residual grad, - // usually in pre-layer-norm for transformer layer - float4 dresidual = ((const float4 *)residual_grad)[offset]; - dxhat.x += dresidual.x; - dxhat.y += dresidual.y; - dxhat.z += dresidual.z; - dxhat.w += dresidual.w; - } - ((float4 *)inp_grad)[offset] = dxhat; -} - -template <> -__global__ void ker_ln_bw_dinp<__half>(__half *inp_grad, const __half *out_grad, - const __half *residual_grad, - const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, - int hidden_dim) { - int offset = blockIdx.x * hidden_dim + threadIdx.x; - - float2 dxhat[4], xhat[4]; - float var_rsqrt; - float4 vtmp; - __half2 *tmp_h2; - float reduce_val[2] = {0.f, 0.f}; - - if (threadIdx.x < hidden_dim) { - // step 0. dxhat = dout * gamma - vtmp = ((const float4 *)out_grad)[offset]; - tmp_h2 = reinterpret_cast<__half2 *>(&vtmp); - float4 gamma_f4 = ((const float4 *)gamma)[threadIdx.x]; - __half2 *gamma_h2 = reinterpret_cast<__half2 *>(&gamma_f4); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vdout = __half22float2(tmp_h2[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - dxhat[i].x = vdout.x * vgamma.x; - dxhat[i].y = vdout.y * vgamma.y; - reduce_val[0] += dxhat[i].x + dxhat[i].y; - } - - /* - step 1. xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - vtmp = ((const float4 *)inp_or_out)[offset]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[threadIdx.x]; - __half2 *betta_h2 = reinterpret_cast<__half2 *>(&vbetta); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vout = __half22float2(tmp_h2[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vbetta = __half22float2(betta_h2[i]); - xhat[i].x = (vout.x - vbetta.x) / add_eps(vgamma.x); - xhat[i].y = (vout.y - vbetta.y) / add_eps(vgamma.y); - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - } - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vinp = __half22float2(tmp_h2[i]); - xhat[i].x = (vinp.x - fmean) * var_rsqrt; - xhat[i].y = (vinp.y - fmean) * var_rsqrt; - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - } - } - } - - /* step2. block reduce sum for dxhat and dxhat*xhat */ - blockReduce(reduce_val); - __shared__ float s_sum_dxhat, s_sum_dxhat_xhat; - if (threadIdx.x == 0) { - float mean_dim = hidden_dim * 8; - s_sum_dxhat = reduce_val[0] / mean_dim; - s_sum_dxhat_xhat = reduce_val[1] / mean_dim; - } - __syncthreads(); - - /* - step3. 
compute input gradient - (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var) - */ - if (threadIdx.x >= hidden_dim) { - return; - } - if (residual_grad) { - // Add the residual grad, - // usually in pre-layer-norm for transformer layer - float4 dresidual = ((const float4 *)residual_grad)[offset]; - __half *hdres = reinterpret_cast<__half *>(&dresidual); -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1]))); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1])); -#endif - } - } else { -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt); -#endif - } - } - ((float4 *)inp_grad)[offset] = vtmp; -} - -__global__ void ker_ln_bw_dinp_x2(__half *inp_grad, const __half *out_grad, - const __half *residual_grad, - const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, - int hidden_dim) { - int offset = blockIdx.x * hidden_dim * 2 + threadIdx.x * 2; - - float2 dxhat[4], xhat[4]; - float2 dxhat_1[4], xhat_1[4]; - float var_rsqrt; - float4 vtmp, vtmp_1; - __half2 *tmp_h2; - __half2 *tmp_h2_1; - float reduce_val[2] = {0.f, 0.f}; - - if (threadIdx.x < hidden_dim) { - // step 0. dxhat = dout * gamma - vtmp = ((const float4 *)out_grad)[offset]; - vtmp_1 = ((const float4 *)out_grad)[offset + 1]; - tmp_h2 = reinterpret_cast<__half2 *>(&vtmp); - tmp_h2_1 = reinterpret_cast<__half2 *>(&vtmp_1); - float4 gamma_f4 = ((const float4 *)gamma)[threadIdx.x * 2]; - float4 gamma_f4_1 = ((const float4 *)gamma)[threadIdx.x * 2 + 1]; - __half2 *gamma_h2 = reinterpret_cast<__half2 *>(&gamma_f4); - __half2 *gamma_h2_1 = reinterpret_cast<__half2 *>(&gamma_f4_1); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vdout = __half22float2(tmp_h2[i]); - float2 vdout_1 = __half22float2(tmp_h2_1[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - dxhat[i].x = vdout.x * vgamma.x; - dxhat[i].y = vdout.y * vgamma.y; - dxhat_1[i].x = vdout_1.x * vgamma_1.x; - dxhat_1[i].y = vdout_1.y * vgamma_1.y; - reduce_val[0] += dxhat[i].x + dxhat[i].y + dxhat_1[i].x + dxhat_1[i].y; - } - - /* - step 1. 
xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - vtmp = ((const float4 *)inp_or_out)[offset]; - vtmp_1 = ((const float4 *)inp_or_out)[offset + 1]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[2 * threadIdx.x]; - float4 vbetta_1 = ((const float4 *)betta)[2 * threadIdx.x + 1]; - __half2 *betta_h2 = reinterpret_cast<__half2 *>(&vbetta); - __half2 *betta_h2_1 = reinterpret_cast<__half2 *>(&vbetta_1); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vout = __half22float2(tmp_h2[i]); - float2 vout_1 = __half22float2(tmp_h2_1[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - float2 vbetta = __half22float2(betta_h2[i]); - float2 vbetta_1 = __half22float2(betta_h2_1[i]); - xhat[i].x = (vout.x - vbetta.x) / add_eps(vgamma.x); - xhat_1[i].x = (vout_1.x - vbetta_1.x) / add_eps(vgamma_1.x); - xhat[i].y = (vout.y - vbetta.y) / add_eps(vgamma.y); - xhat_1[i].y = (vout_1.y - vbetta_1.y) / add_eps(vgamma_1.y); - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - } - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vinp = __half22float2(tmp_h2[i]); - float2 vinp_1 = __half22float2(tmp_h2_1[i]); - xhat[i].x = (vinp.x - fmean) * var_rsqrt; - xhat_1[i].x = (vinp_1.x - fmean) * var_rsqrt; - xhat[i].y = (vinp.y - fmean) * var_rsqrt; - xhat_1[i].y = (vinp_1.y - fmean) * var_rsqrt; - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - } - } - } - - /* step2. block reduce sum for dxhat and dxhat*xhat */ - blockReduce(reduce_val); - __shared__ float s_sum_dxhat, s_sum_dxhat_xhat; - if (threadIdx.x == 0) { - float mean_dim = hidden_dim * 8 * 2; - s_sum_dxhat = reduce_val[0] / mean_dim; - s_sum_dxhat_xhat = reduce_val[1] / mean_dim; - } - __syncthreads(); - - /* - step3. 
compute input gradient - (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var) - */ - if (threadIdx.x >= hidden_dim) { - return; - } - if (residual_grad) { - // Add the residual grad, - // usually in pre-layer-norm for transformer layer - float4 dresidual = ((const float4 *)residual_grad)[offset]; - float4 dresidual_1 = ((const float4 *)residual_grad)[offset+1]; - __half *hdres = reinterpret_cast<__half *>(&dresidual); - __half *hdres_1 = reinterpret_cast<__half *>(&dresidual_1); -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1]))); - tmp_h2_1[i] = make_half2(__float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i])), - __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i + 1]))); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i])); - tmp_h2_1[i].x = __float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i])); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres[2 * i + 1])); - tmp_h2_1[i].y = __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt + - __half2float(hdres_1[2 * i + 1])); -#endif - } - } else { -#pragma unroll - for (int i = 0; i < 4; i++) { -#ifdef COLOSSAL_HIP - tmp_h2[i] = make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); - tmp_h2_1[i] = make_half2(__float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_1[i].x = __float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_1[i].y = __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt); -#endif - } - } - ((float4 *)inp_grad)[offset] = vtmp; - ((float4 *)inp_grad)[offset + 1] = vtmp_1; -} - -__global__ void ker_ln_bw_dinp_x4(__half *inp_grad, const __half *out_grad, - const __half *residual_grad, - const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, - int hidden_dim) { - int offset = blockIdx.x * hidden_dim * 4 + threadIdx.x * 4; - - float2 dxhat[4], xhat[4]; - float2 dxhat_1[4], xhat_1[4]; - float2 dxhat_2[4], xhat_2[4]; - float2 dxhat_3[4], xhat_3[4]; - float var_rsqrt; - float4 vtmp, vtmp_1, vtmp_2, vtmp_3; - __half2 *tmp_h2; - __half2 *tmp_h2_1; - __half2 *tmp_h2_2; - __half2 *tmp_h2_3; - float reduce_val[2] = {0.f, 0.f}; - - if (threadIdx.x < hidden_dim) { - // step 0. 
dxhat = dout * gamma - vtmp = ((const float4 *)out_grad)[offset]; - vtmp_1 = ((const float4 *)out_grad)[offset + 1]; - vtmp_2 = ((const float4 *)out_grad)[offset + 2]; - vtmp_3 = ((const float4 *)out_grad)[offset + 3]; - tmp_h2 = reinterpret_cast<__half2 *>(&vtmp); - tmp_h2_1 = reinterpret_cast<__half2 *>(&vtmp_1); - tmp_h2_2 = reinterpret_cast<__half2 *>(&vtmp_2); - tmp_h2_3 = reinterpret_cast<__half2 *>(&vtmp_3); - float4 gamma_f4 = ((const float4 *)gamma)[threadIdx.x * 4]; - float4 gamma_f4_1 = ((const float4 *)gamma)[threadIdx.x * 4 + 1]; - float4 gamma_f4_2 = ((const float4 *)gamma)[threadIdx.x * 4 + 2]; - float4 gamma_f4_3 = ((const float4 *)gamma)[threadIdx.x * 4 + 3]; - __half2 *gamma_h2 = reinterpret_cast<__half2 *>(&gamma_f4); - __half2 *gamma_h2_1 = reinterpret_cast<__half2 *>(&gamma_f4_1); - __half2 *gamma_h2_2 = reinterpret_cast<__half2 *>(&gamma_f4_2); - __half2 *gamma_h2_3 = reinterpret_cast<__half2 *>(&gamma_f4_3); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vdout = __half22float2(tmp_h2[i]); - float2 vdout_1 = __half22float2(tmp_h2_1[i]); - float2 vdout_2 = __half22float2(tmp_h2_2[i]); - float2 vdout_3 = __half22float2(tmp_h2_3[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - float2 vgamma_2 = __half22float2(gamma_h2_2[i]); - float2 vgamma_3 = __half22float2(gamma_h2_3[i]); - dxhat[i].x = vdout.x * vgamma.x; - dxhat[i].y = vdout.y * vgamma.y; - dxhat_1[i].x = vdout_1.x * vgamma_1.x; - dxhat_1[i].y = vdout_1.y * vgamma_1.y; - dxhat_2[i].x = vdout_2.x * vgamma_2.x; - dxhat_2[i].y = vdout_2.y * vgamma_2.y; - dxhat_3[i].x = vdout_3.x * vgamma_3.x; - dxhat_3[i].y = vdout_3.y * vgamma_3.y; - reduce_val[0] += dxhat[i].x + dxhat[i].y + dxhat_1[i].x + dxhat_1[i].y + dxhat_2[i].x + - dxhat_2[i].y + dxhat_3[i].x + dxhat_3[i].y; - } - - /* - step 1. 
xhat = (output - betta) / gamma or - (input - mean) * rsqrtf(var) - */ - vtmp = ((const float4 *)inp_or_out)[offset]; - vtmp_1 = ((const float4 *)inp_or_out)[offset + 1]; - vtmp_2 = ((const float4 *)inp_or_out)[offset + 2]; - vtmp_3 = ((const float4 *)inp_or_out)[offset + 3]; - var_rsqrt = rsqrtf((float)vars[blockIdx.x] + LN_EPSILON); - if (means == nullptr) { - // inp_or_out is output, xhat = (output - betta) / gamma - float4 vbetta = ((const float4 *)betta)[4 * threadIdx.x]; - float4 vbetta_1 = ((const float4 *)betta)[4 * threadIdx.x + 1]; - float4 vbetta_2 = ((const float4 *)betta)[4 * threadIdx.x + 2]; - float4 vbetta_3 = ((const float4 *)betta)[4 * threadIdx.x + 3]; - __half2 *betta_h2 = reinterpret_cast<__half2 *>(&vbetta); - __half2 *betta_h2_1 = reinterpret_cast<__half2 *>(&vbetta_1); - __half2 *betta_h2_2 = reinterpret_cast<__half2 *>(&vbetta_2); - __half2 *betta_h2_3 = reinterpret_cast<__half2 *>(&vbetta_3); -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vout = __half22float2(tmp_h2[i]); - float2 vout_1 = __half22float2(tmp_h2_1[i]); - float2 vout_2 = __half22float2(tmp_h2_2[i]); - float2 vout_3 = __half22float2(tmp_h2_3[i]); - float2 vgamma = __half22float2(gamma_h2[i]); - float2 vgamma_1 = __half22float2(gamma_h2_1[i]); - float2 vgamma_2 = __half22float2(gamma_h2_2[i]); - float2 vgamma_3 = __half22float2(gamma_h2_3[i]); - float2 vbetta = __half22float2(betta_h2[i]); - float2 vbetta_1 = __half22float2(betta_h2_1[i]); - float2 vbetta_2 = __half22float2(betta_h2_2[i]); - float2 vbetta_3 = __half22float2(betta_h2_3[i]); - xhat[i].x = (vout.x - vbetta.x) / add_eps(vgamma.x); - xhat_1[i].x = (vout_1.x - vbetta_1.x) / add_eps(vgamma_1.x); - xhat_2[i].x = (vout_2.x - vbetta_2.x) / add_eps(vgamma_2.x); - xhat_3[i].x = (vout_3.x - vbetta_3.x) / add_eps(vgamma_3.x); - xhat[i].y = (vout.y - vbetta.y) / add_eps(vgamma.y); - xhat_1[i].y = (vout_1.y - vbetta_1.y) / add_eps(vgamma_1.y); - xhat_2[i].y = (vout_2.y - vbetta_2.y) / add_eps(vgamma_2.y); - xhat_3[i].y = (vout_3.y - vbetta_3.y) / add_eps(vgamma_3.y); - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - reduce_val[1] += xhat_2[i].x * dxhat_2[i].x + xhat_2[i].y * dxhat_2[i].y; - reduce_val[1] += xhat_3[i].x * dxhat_3[i].x + xhat_3[i].y * dxhat_3[i].y; - } - } else { - // inp_or_out is input, xhat = (input - mean) * rsqrtf(var) - float fmean = (float)means[blockIdx.x]; -#pragma unroll - for (int i = 0; i < 4; i++) { - float2 vinp = __half22float2(tmp_h2[i]); - float2 vinp_1 = __half22float2(tmp_h2_1[i]); - float2 vinp_2 = __half22float2(tmp_h2_2[i]); - float2 vinp_3 = __half22float2(tmp_h2_3[i]); - xhat[i].x = (vinp.x - fmean) * var_rsqrt; - xhat_1[i].x = (vinp_1.x - fmean) * var_rsqrt; - xhat_2[i].x = (vinp_2.x - fmean) * var_rsqrt; - xhat_3[i].x = (vinp_3.x - fmean) * var_rsqrt; - xhat[i].y = (vinp.y - fmean) * var_rsqrt; - xhat_1[i].y = (vinp_1.y - fmean) * var_rsqrt; - xhat_2[i].y = (vinp_2.y - fmean) * var_rsqrt; - xhat_3[i].y = (vinp_3.y - fmean) * var_rsqrt; - reduce_val[1] += xhat[i].x * dxhat[i].x + xhat[i].y * dxhat[i].y; - reduce_val[1] += xhat_1[i].x * dxhat_1[i].x + xhat_1[i].y * dxhat_1[i].y; - reduce_val[1] += xhat_2[i].x * dxhat_2[i].x + xhat_2[i].y * dxhat_2[i].y; - reduce_val[1] += xhat_3[i].x * dxhat_3[i].x + xhat_3[i].y * dxhat_3[i].y; - } - } - } - - /* step2. 
block reduce sum for dxhat and dxhat*xhat */
-  blockReduce<ReduceType::kSum, 2>(reduce_val);
-  __shared__ float s_sum_dxhat, s_sum_dxhat_xhat;
-  if (threadIdx.x == 0) {
-    float mean_dim = hidden_dim * 8 * 4;
-    s_sum_dxhat = reduce_val[0] / mean_dim;
-    s_sum_dxhat_xhat = reduce_val[1] / mean_dim;
-  }
-  __syncthreads();
-
-  /*
-  step3. compute input gradient
-  (dxhat - (sum(dxhat) + xhat * sum(dxhat * xhat)) / mean_dim) * rsqrt(var)
-  */
-  if (threadIdx.x >= hidden_dim) {
-    return;
-  }
-  if (residual_grad) {
-    // Add the residual grad,
-    // usually in pre-layer-norm for transformer layer
-    float4 dresidual = ((const float4 *)residual_grad)[offset];
-    float4 dresidual_1 = ((const float4 *)residual_grad)[offset + 1];
-    float4 dresidual_2 = ((const float4 *)residual_grad)[offset + 2];
-    float4 dresidual_3 = ((const float4 *)residual_grad)[offset + 3];
-    __half *hdres = reinterpret_cast<__half *>(&dresidual);
-    __half *hdres_1 = reinterpret_cast<__half *>(&dresidual_1);
-    __half *hdres_2 = reinterpret_cast<__half *>(&dresidual_2);
-    __half *hdres_3 = reinterpret_cast<__half *>(&dresidual_3);
-#pragma unroll
-    for (int i = 0; i < 4; i++) {
-#ifdef COLOSSAL_HIP
-      tmp_h2[i] = make_half2(
-          __float2half(
-              (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres[2 * i])),
-          __float2half(
-              (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres[2 * i + 1])));
-      tmp_h2_1[i] = make_half2(
-          __float2half(
-              (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_1[2 * i])),
-          __float2half(
-              (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_1[2 * i + 1])));
-      tmp_h2_2[i] = make_half2(
-          __float2half(
-              (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_2[2 * i])),
-          __float2half(
-              (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_2[2 * i + 1])));
-      tmp_h2_3[i] = make_half2(
-          __float2half(
-              (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_3[2 * i])),
-          __float2half(
-              (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) *
-                  var_rsqrt +
-              __half2float(hdres_3[2 * i + 1])));
-#else
-      tmp_h2[i].x = __float2half(
-          (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres[2 * i]));
-      tmp_h2_1[i].x = __float2half(
-          (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_1[2 * i]));
-      tmp_h2_2[i].x = __float2half(
-          (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_2[2 * i]));
-      tmp_h2_3[i].x = __float2half(
-          (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_3[2 * i]));
-      tmp_h2[i].y = __float2half(
-          (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres[2 * i + 1]));
-      tmp_h2_1[i].y = __float2half(
-          (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_1[2 * i + 1]));
-      tmp_h2_2[i].y = __float2half(
-          (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_2[2 * i + 1]));
-      tmp_h2_3[i].y = __float2half(
-          (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) *
-              var_rsqrt +
-          __half2float(hdres_3[2 * i + 1]));
-#endif
-    }
-  } else {
-#pragma unroll
-    for (int i = 0; i < 4; i++) {
-#ifdef COLOSSAL_HIP
-      tmp_h2[i]
= make_half2(__float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); - tmp_h2_1[i] = make_half2(__float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); - tmp_h2_2[i] = make_half2(__float2half( - (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); - tmp_h2_3[i] = make_half2(__float2half( - (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) * - var_rsqrt), - __float2half( - (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) * - var_rsqrt)); -#else - tmp_h2[i].x = __float2half( - (dxhat[i].x - s_sum_dxhat - xhat[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_1[i].x = __float2half( - (dxhat_1[i].x - s_sum_dxhat - xhat_1[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_2[i].x = __float2half( - (dxhat_2[i].x - s_sum_dxhat - xhat_2[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_3[i].x = __float2half( - (dxhat_3[i].x - s_sum_dxhat - xhat_3[i].x * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2[i].y = __float2half( - (dxhat[i].y - s_sum_dxhat - xhat[i].y * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_1[i].y = __float2half( - (dxhat_1[i].y - s_sum_dxhat - xhat_1[i].y * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_2[i].y = __float2half( - (dxhat_2[i].y - s_sum_dxhat - xhat_2[i].y * s_sum_dxhat_xhat) * - var_rsqrt); - tmp_h2_3[i].y = __float2half( - (dxhat_3[i].y - s_sum_dxhat - xhat_3[i].y * s_sum_dxhat_xhat) * - var_rsqrt); -#endif - } - } - ((float4 *)inp_grad)[offset] = vtmp; - ((float4 *)inp_grad)[offset + 1] = vtmp_1; - ((float4 *)inp_grad)[offset + 2] = vtmp_2; - ((float4 *)inp_grad)[offset + 3] = vtmp_3; -} - -/** -Layer norm backword, - compute the gradient of gamma, betta and input. -dbetta = sum(dout, dim=0) -xhat = (input - mean) * rsqrt(var) if mean is not nullptr - (output - betta) / gamma if mean is nullptr -dgamma = sum(xhat * dout, dim=0) -dxhat = dout * gamma -dinp = (dxhat - (sum(dxhat, 1) + xhat * sum(dxhat * xhat, 1)) / hidden_dim) - * rsqrt(var) - -residual_grad, means, betta can be nullptr. 
-residual_grad will be added to dinp if it is not nullptr - which is useful in transformer layer when pre-ln -means and betta are only used to compute xhat, - (means == nullptr) ^ (betta == nullptr) should be true -*/ -template <> -void launch_ln_bw(float *gamma_grad, float *betta_grad, float *inp_grad, - const float *out_grad, const float *residual_grad, - const float *inp_or_out, const float *gamma, - const float *betta, const float *vars, - const float *means, int batch, int hidden_dim, - hipStream_t stream[2]) { - // compute grad of gamma and betta - dim3 grid_dim(((hidden_dim + TILE_DIM - 1) / TILE_DIM) * TILE_DIM); - dim3 block_dim(TILE_DIM, TILE_DIM); - hipLaunchKernelGGL(( ker_ln_bw_dgamma_dbetta), dim3(grid_dim), dim3(block_dim), 0, stream[0], - gamma_grad, betta_grad, out_grad, inp_or_out, gamma, betta, vars, means, - batch, hidden_dim); - - // compute grad of input - if (hidden_dim % 4 != 0 || hidden_dim > 4096) { - throw std::runtime_error("hidden_dim % 4 != 0 || hidden_dim > 4096"); - } - hidden_dim >>= 2; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - hipLaunchKernelGGL(( ker_ln_bw_dinp), dim3(batch), dim3(nthread), 0, stream[1], - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); -} - -template <> -void launch_ln_bw<__half>(__half *gamma_grad, __half *betta_grad, - __half *inp_grad, const __half *out_grad, - const __half *residual_grad, const __half *inp_or_out, - const __half *gamma, const __half *betta, - const __half *vars, const __half *means, int batch, - int hidden_dim, hipStream_t stream[2]) { - // compute grad of gamma and betta - dim3 grid_dim(((hidden_dim + TILE_DIM - 1) / TILE_DIM) * TILE_DIM); - dim3 block_dim(TILE_DIM, TILE_DIM); - hipLaunchKernelGGL(( ker_ln_bw_dgamma_dbetta<__half>), dim3(grid_dim), dim3(block_dim), 0, stream[0], - gamma_grad, betta_grad, out_grad, inp_or_out, gamma, betta, vars, means, - batch, hidden_dim); - - // compute grad of input - if (hidden_dim % 8 != 0) { - throw std::runtime_error("hidden_dim % 8 != 0"); - } - hidden_dim >>= 3; - - if (hidden_dim * 8 <= 8192) { - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - hipLaunchKernelGGL(( ker_ln_bw_dinp), dim3(batch), dim3(nthread), 0, stream[1], - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); - } else if (hidden_dim * 8 > 8192 && hidden_dim * 8 <= 8192 * 2) { - hidden_dim >>= 1; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - hipLaunchKernelGGL(( ker_ln_bw_dinp_x2), dim3(batch), dim3(nthread), 0, stream[1], - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); - } else if (hidden_dim * 8 > 2 * 8192 && hidden_dim * 8 <= 8192 * 4) { - hidden_dim >>= 2; - int nthread = min(((hidden_dim + 31) / 32) * 32, MAX_THREADS); - hipLaunchKernelGGL(( ker_ln_bw_dinp_x4), dim3(batch), dim3(nthread), 0, stream[1], - inp_grad, out_grad, residual_grad, inp_or_out, gamma, betta, vars, means, - hidden_dim); - } else { - throw std::runtime_error("hidden_dim % 4 != 0 || hidden_dim > 32768"); - } -} - diff --git a/colossalai/kernel/hip_native/csrc/kernels/softmax_kernels.hip b/colossalai/kernel/hip_native/csrc/kernels/softmax_kernels.hip deleted file mode 100644 index 409c77e0bb270562278d759367ef7a738b45e990..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/softmax_kernels.hip +++ /dev/null @@ -1,395 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
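The attention-softmax kernels that follow all implement the same numerically stable recipe per row: add the padding mask, subtract the row max before exponentiating, and normalize by the sum plus EPSILON. A scalar reference of that recipe, written as a sketch for a single attention row (the causal `mask_future` path, which instead sets future positions to -inf, is omitted here):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Scalar reference for one attention row, mirroring the kernel's three steps:
// max-reduce, exp-sum with an epsilon guard, then normalize. `mask` holds 0
// for visible tokens and -inf for padding tokens, matching attn_mask above.
std::vector<float> softmax_row(std::vector<float> logits,
                               const std::vector<float> &mask) {
  const float kEpsilon = 1e-8f;  // same role as EPSILON in the kernel
  float row_max = -INFINITY;
  for (size_t j = 0; j < logits.size(); ++j) {
    logits[j] += mask[j];
    row_max = std::max(row_max, logits[j]);
  }
  float sum = 0.f;
  for (float &v : logits) {
    v = std::exp(v - row_max);  // shift by the max for numerical stability
    sum += v;
  }
  const float inv_sum = 1.f / (sum + kEpsilon);
  for (float &v : logits) v *= inv_sum;
  return logits;
}

int main() {
  // The two padded positions are masked out and receive ~zero probability.
  std::vector<float> probs =
      softmax_row({1.f, 2.f, 3.f, 0.f, 0.f},
                  {0.f, 0.f, 0.f, -INFINITY, -INFINITY});
  for (float p : probs) printf("%.4f ", p);
  printf("\n");
  return 0;
}
```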
-#include "hip/hip_runtime.h" -#include - -#include -#include - -#include "block_reduce.h" -#include "kernels.h" - -#ifndef COLOSSAL_HIP -#include - -namespace cg = cooperative_groups; -#endif - -const float EPSILON = 1e-8f; - -/** -@brief: softmax_kernel -Softmax forward kernel for - enc-self-attn, dec-self-attn, encdec-attn - -@thread -gridDim.x = dynamic -gridDim.y = batch_size -gridDim.z = nhead -blockDim.x = from_len - -@param -inp: [batch_size, nhead, from_len, to_len], softmax input. -attn_mask: [batch_size, to_len], padding tokens are -inf, - non padding tokens are 0. - attn_mask!=nullptr for enc-self-attn and enc-dec-attn - attn_mask=nullptr and mask_future=ture for dec-self-attn training - attn_mask=nullptr and mask_future=false for dec-self-attn infer -*/ -template -__global__ void ker_attn_softmax(T *inp, const T *attn_mask, int from_len, - int to_len, bool mask_future) { - int batch_id = blockIdx.y; - int head_id = blockIdx.z; - const int nhead = gridDim.z; - const int token_per_reduce = 1; -#ifdef COLOSSAL_HIP - typedef hipcub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef hipcub::BlockStore - BlockStore; -#else - typedef cub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef cub::BlockStore - BlockStore; -#endif - __shared__ typename BlockStore::TempStorage ts_store; - - T mval[ele_per_thread]; - if (attn_mask) { - attn_mask += batch_id * to_len; - BlockLoad(ts_load).Load(attn_mask, mval, to_len, REDUCE_FLOAT_INF_NEG); - } - - inp += flat_3dim(batch_id, head_id, 0, nhead, from_len * to_len); - for (int token_id = blockIdx.x * token_per_reduce; token_id < from_len; - token_id += gridDim.x * token_per_reduce) { - T inp_val[token_per_reduce][ele_per_thread]; - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - BlockLoad(ts_load).Load(inp + (token_id + i) * to_len, inp_val[i], to_len, - REDUCE_FLOAT_INF_NEG); - } - - /* step 1. compute max */ - // thread local max - float val[token_per_reduce][ele_per_thread]; - float l_max[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_max[i] = REDUCE_FLOAT_INF_NEG; - for (int j = 0; j < ele_per_thread; j++) { - if (attn_mask) { - val[i][j] = (float)inp_val[i][j] + (float)mval[j]; - } else { - if (mask_future && ele_per_thread * threadIdx.x + j > token_id + i) { - val[i][j] = REDUCE_FLOAT_INF_NEG; - } else { - val[i][j] = (float)inp_val[i][j]; - } - } - l_max[i] = fmaxf(l_max[i], val[i][j]); - } - } - // block reduce max - blockReduce(l_max); - // write shared - __shared__ float s_max[token_per_reduce]; - if (threadIdx.x == 0) { - for (int i = 0; i < token_per_reduce; i++) { - s_max[i] = l_max[i]; - } - } - __syncthreads(); - - /* step 2. compute sum */ - // thread local sum - float l_sum[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_sum[i] = 0.f; - for (int j = 0; j < ele_per_thread; j++) { - val[i][j] = __expf(val[i][j] - s_max[i]); - l_sum[i] += val[i][j]; - } - } - // block reduce sum - blockReduce(l_sum); - // write shared - __shared__ float s_sum[token_per_reduce]; - if (threadIdx.x == 0) { - for (int i = 0; i < token_per_reduce; i++) { - s_sum[i] = __fdividef(1.0f, l_sum[i] + EPSILON); - } - } - __syncthreads(); - - /* step 3. 
compute final result */ - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - for (int j = 0; j < ele_per_thread; j++) { - inp_val[i][j] = (T)(val[i][j] * s_sum[i]); - } - BlockStore(ts_store).Store(inp + (token_id + i) * to_len, inp_val[i], - to_len); - } - } // blockIdx.x -} - -template -__global__ void ker_attn_softmax_lt32(T *inp, const T *attn_mask, int from_len, - int to_len, bool mask_future) { - int batch_id = blockIdx.y; - int head_id = blockIdx.z; - const int nhead = gridDim.z; - const int token_per_reduce = 1; -#ifdef COLOSSAL_HIP - typedef hipcub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef hipcub::BlockStore - BlockStore; -#else - typedef cub::BlockLoad - BlockLoad; - __shared__ typename BlockLoad::TempStorage ts_load; - typedef cub::BlockStore - BlockStore; -#endif - __shared__ typename BlockStore::TempStorage ts_store; - - T mval[ele_per_thread]; - if (attn_mask) { - attn_mask += batch_id * to_len; - BlockLoad(ts_load).Load(attn_mask, mval, to_len, REDUCE_FLOAT_INF_NEG); - } - - inp += flat_3dim(batch_id, head_id, 0, nhead, from_len * to_len); - for (int token_id = blockIdx.x * token_per_reduce; token_id < from_len; - token_id += gridDim.x * token_per_reduce) { - T inp_val[token_per_reduce][ele_per_thread]; - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - BlockLoad(ts_load).Load(inp + (token_id + i) * to_len, inp_val[i], to_len, - REDUCE_FLOAT_INF_NEG); - } - - /* step 1. compute max */ - // thread local max - float val[token_per_reduce][ele_per_thread]; - float l_max[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_max[i] = REDUCE_FLOAT_INF_NEG; - for (int j = 0; j < ele_per_thread; j++) { - if (attn_mask) { - val[i][j] = (float)inp_val[i][j] + (float)mval[j]; - } else { - if (mask_future && ele_per_thread * threadIdx.x + j > token_id + i) { - val[i][j] = REDUCE_FLOAT_INF_NEG; - } else { - val[i][j] = (float)inp_val[i][j]; - } - } - l_max[i] = fmaxf(l_max[i], val[i][j]); - } - } - // warp reduce max - warpReduce(l_max); - - /* step 2. compute sum */ - // thread local sum - float l_sum[token_per_reduce]; - for (int i = 0; i < token_per_reduce; i++) { - l_sum[i] = 0.f; - for (int j = 0; j < ele_per_thread; j++) { - val[i][j] = __expf(val[i][j] - l_max[i]); - l_sum[i] += val[i][j]; - } - } - // warp reduce sum - warpReduce(l_sum); - - /* step 3. 
compute final result */ - for (int i = 0; i < token_per_reduce && (token_id + i) < from_len; i++) { - l_sum[i] = __fdividef(1.0f, l_sum[i] + EPSILON); - for (int j = 0; j < ele_per_thread; j++) { - inp_val[i][j] = (T)(val[i][j] * l_sum[i]); - } - BlockStore(ts_store).Store(inp + (token_id + i) * to_len, inp_val[i], - to_len); - } - } // blockIdx.x -} - -/* - attn_mask!=nullptr for enc-self-attn and enc-dec-attn - attn_mask=nullptr and mask_future=ture for dec-self-attn training - attn_mask=nullptr and mask_future=false for dec-self-attn infer -*/ -template <> -void launch_attn_softmax(float *inp, const float *attn_mask, - int batch_size, int nhead, int from_len, - int to_len, bool mask_future, - hipStream_t stream) { - dim3 grid_dim(1, batch_size, nhead); - if (to_len <= 32) { - hipLaunchKernelGGL(( ker_attn_softmax_lt32), dim3(grid_dim), dim3(32), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 64) { - hipLaunchKernelGGL(( ker_attn_softmax_lt32), dim3(grid_dim), dim3(32), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 128) { - grid_dim.x = 16; - hipLaunchKernelGGL(( ker_attn_softmax), dim3(grid_dim), dim3(64), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 256) { - grid_dim.x = 32; - hipLaunchKernelGGL(( ker_attn_softmax), dim3(grid_dim), dim3(128), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 512) { - grid_dim.x = 64; - hipLaunchKernelGGL(( ker_attn_softmax), dim3(grid_dim), dim3(256), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else { - throw std::runtime_error( - "Sequence length greater than 512 is currently not supported"); - } -} - -template <> -void launch_attn_softmax<__half>(__half *inp, const __half *attn_mask, - int batch_size, int nhead, int from_len, - int to_len, bool mask_future, - hipStream_t stream) { - dim3 grid_dim(1, batch_size, nhead); - if (to_len <= 32) { - hipLaunchKernelGGL(( ker_attn_softmax_lt32<__half, 32, 1>), dim3(grid_dim), dim3(32), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 64) { - hipLaunchKernelGGL(( ker_attn_softmax_lt32<__half, 32, 2>), dim3(grid_dim), dim3(32), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 128) { - grid_dim.x = 8; - hipLaunchKernelGGL(( ker_attn_softmax<__half, 64, 2>), dim3(grid_dim), dim3(64), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 256) { - grid_dim.x = 16; - hipLaunchKernelGGL(( ker_attn_softmax<__half, 128, 2>), dim3(grid_dim), dim3(128), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else if (to_len <= 512) { - grid_dim.x = 32; - hipLaunchKernelGGL(( ker_attn_softmax<__half, 256, 2>), dim3(grid_dim), dim3(256), 0, stream, - inp, attn_mask, from_len, to_len, mask_future); - } else { - throw std::runtime_error( - "Sequence length greater than 512 is currently not supported"); - } -} - -/** -@brief: ker_attn_softmax_bw -Softmax backward in self attention. - -@thread -gridDim.x = batch_size * nhead * seq_len / warps_per_block -blockDim.x = WARP_SIZE -blockDim.y = warps_per_block - -@param -grad: [batch_size, nhead, seq_len, seq_len], output grad. -output: [batch_size, nhead, seq_len, seq_len], output of softmax forward. 
-*/ -template -__global__ void ker_attn_softmax_bw(T *grad, const T *inp, int softmax_length) { - int batch_idx = blockIdx.x * blockDim.y + threadIdx.y; - int offset = batch_idx * softmax_length + threadIdx.x; - - grad += offset; - inp += offset; - - T grad_reg[ITERATIONS]; - T inp_reg[ITERATIONS]; - float sum = 0.0; - -#pragma unroll - for (int i = 0; i < ITERATIONS; ++i) { - int curr_idx = threadIdx.x + i * WARP_SIZE; - if (curr_idx < softmax_length) { - grad_reg[i] = grad[i * WARP_SIZE]; - inp_reg[i] = inp[i * WARP_SIZE]; - sum += (float)grad_reg[i] * (float)inp_reg[i]; - } - } - -#ifdef COLOSSAL_HIP - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += __shfl_xor(sum, i); -#else - cg::thread_block b = cg::this_thread_block(); - cg::thread_block_tile g = cg::tiled_partition(b); - - for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_xor(sum, i); -#endif - -#pragma unroll - for (int i = 0; i < ITERATIONS; ++i) { - int curr_idx = threadIdx.x + i * WARP_SIZE; - if (curr_idx < softmax_length) - grad[i * WARP_SIZE] = (T)((float)inp_reg[i] * ((float)grad_reg[i] - sum)); - } -} - -template -void launch_attn_softmax_bw(T *out_grad, const T *soft_inp, int rows, - int softmax_len, hipStream_t stream) { - const int warps_per_block = 4; - // rows = batch_size * nhead * from_len - dim3 grid_dim(rows / warps_per_block); - dim3 block_dim(WARP_SIZE, warps_per_block); - - if (softmax_len <= 32) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 64) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 128) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 256) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 384) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 512) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 768) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 1024) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else if (softmax_len <= 2048) - hipLaunchKernelGGL(( ker_attn_softmax_bw) - , dim3(grid_dim), dim3(block_dim), 0, stream, out_grad, soft_inp, softmax_len); - else - throw std::runtime_error( - std::string( - "Special sequence length found in softmax backward, seq_len: ") + - std::to_string(softmax_len)); -} - -template void launch_attn_softmax_bw<__half>(__half *out_grad, - const __half *soft_inp, int rows, - int softmax_len, - hipStream_t stream); -template void launch_attn_softmax_bw(float *out_grad, - const float *soft_inp, int rows, - int softmax_len, - hipStream_t stream); diff --git a/colossalai/kernel/hip_native/csrc/kernels/transform_kernels.hip b/colossalai/kernel/hip_native/csrc/kernels/transform_kernels.hip deleted file mode 100644 index 57cc574b3b77737f3e0c6b003238f94fc333a897..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/kernels/transform_kernels.hip +++ /dev/null 
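The softmax kernels above all follow the same per-row recipe. A minimal CPU sketch of one row, assuming the padding-mask path only (the `mask_future` causal branch and the block/warp reductions are omitted; names are illustrative):

```cpp
#include <algorithm>
#include <cmath>

// Forward pass for one attention row, mirroring the three kernel steps:
// (1) add the padding mask and take the row max for numerical stability,
// (2) exponentiate and sum, (3) normalize with an epsilon-guarded reciprocal.
void attn_softmax_row(float *row, const float *attn_mask, int to_len) {
  const float kEpsilon = 1e-8f;  // same guard as the EPSILON constant above
  float row_max = -INFINITY;
  for (int j = 0; j < to_len; ++j) {
    if (attn_mask) row[j] += attn_mask[j];  // padding positions hold -inf
    row_max = std::max(row_max, row[j]);
  }
  float sum = 0.f;
  for (int j = 0; j < to_len; ++j) {
    row[j] = std::exp(row[j] - row_max);
    sum += row[j];
  }
  float inv_sum = 1.f / (sum + kEpsilon);
  for (int j = 0; j < to_len; ++j) row[j] *= inv_sum;
}

// Backward pass for one row, as in ker_attn_softmax_bw:
// grad_in = y * (grad_out - sum(grad_out * y)), where y is the forward output.
void attn_softmax_bw_row(float *grad, const float *y, int softmax_len) {
  float dot = 0.f;
  for (int j = 0; j < softmax_len; ++j) dot += grad[j] * y[j];
  for (int j = 0; j < softmax_len; ++j) grad[j] = y[j] * (grad[j] - dot);
}
```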
@@ -1,327 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -#ifdef COLOSSAL_HIP -#include -//#include -//#include -//#include -#else -#include -#include -#include -#endif - -#include "kernels.h" - -#ifdef COLOSSAL_HIP -using namespace hipcub; -#else -using namespace cub; -#endif - -/** -@brief: transform_0213 -Split the attention heads and reshape input -during backward progress of encoder self-attention - -@thread -gridDim.x = batch_size -gridDim.y = seq_len -blockDim.x = min(hidden_dim, MAX_THREADS) - -@param -input: [batch_size, seq_len, hidden_dim] -output: [batch_size, nhead, seq_len, head_dim] -batch_size: the size of the current batch -seq_len: the sequence length of the current batch -hidden_dim: dim of the hidden tensor -nhead: number of attention heads -*/ - -template -__global__ void transform_0213(T *output, const T *input, int hidden_dim, - int head_dim); - -template <> -__global__ void transform_0213(float *output, const float *input, - int hidden_dim, int head_dim) { - int batch_id = blockIdx.x; - int token_id = blockIdx.y; - int seq_len = gridDim.y; - int nhead = hidden_dim / head_dim; - - // [b, s, h] - int src_offset = flat_3dim(batch_id, token_id, 0, seq_len, hidden_dim); - // [b, nh, s, ad] - int trg_offset = - flat_4dim(batch_id, 0, token_id, 0, nhead, seq_len, head_dim); - - const float4 *input4 = reinterpret_cast(input); - float4 *res4 = reinterpret_cast(output); - float4 vinput4; - - for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) { - vinput4 = input4[src_offset + i]; - - int head_id = i / head_dim; - int dim_id = i % head_dim; - int cur_trg_offset = flat_3dim(head_id, 0, dim_id, seq_len, head_dim); - res4[trg_offset + cur_trg_offset] = vinput4; - } -} - -template <> -__global__ void transform_0213<__half>(__half *output, const __half *input, - int hidden_dim, int head_dim) { - int batch_id = blockIdx.x; - int token_id = blockIdx.y; - int seq_len = gridDim.y; - int nhead = hidden_dim / head_dim; - - // [b, s, h] - int src_offset = flat_3dim(batch_id, token_id, 0, seq_len, hidden_dim); - // [b, nh, s, ad] - int trg_offset = - flat_4dim(batch_id, 0, token_id, 0, nhead, seq_len, head_dim); - - const float4 *input4 = reinterpret_cast(input); - float4 *res4 = reinterpret_cast(output); - float4 vinput4; - - for (std::size_t i = threadIdx.x; i < hidden_dim; i += blockDim.x) { - vinput4 = input4[src_offset + i]; - - int head_id = i / head_dim; - int dim_id = i % head_dim; - int cur_trg_offset = flat_3dim(head_id, 0, dim_id, seq_len, head_dim); - res4[trg_offset + cur_trg_offset] = vinput4; - } -} - -// [b, s, h] -> [b, nh, s, ad] -template <> -void launch_transform_0213(float *output, const float *input, - int batch_size, int seq_len, int hidden_dim, - int nhead, hipStream_t stream) { - hidden_dim >>= 2; - int head_dim = hidden_dim / nhead; - - dim3 grid_dim(batch_size, seq_len); - dim3 block_dim(min(hidden_dim, MAX_THREADS)); - - hipLaunchKernelGGL(( transform_0213) - , dim3(grid_dim), dim3(block_dim), 0, stream, output, input, hidden_dim, head_dim); -} - -template <> -void launch_transform_0213<__half>(__half *output, const __half *input, - int batch_size, int seq_len, int hidden_dim, - int nhead, hipStream_t stream) { - hidden_dim >>= 3; - int head_dim = hidden_dim / nhead; - - dim3 grid_dim(batch_size, seq_len); - dim3 block_dim(min(hidden_dim, MAX_THREADS)); - - hipLaunchKernelGGL(( transform_0213<__half>) - , dim3(grid_dim), dim3(block_dim), 0, stream, output, input, hidden_dim, head_dim); -} - -/** 
-@brief: bias_add_transform_20314 -Add bias to input, transform from -[0, 1, 2, 3, 4] to [2, 0, 3, 1, 4] - -@thread -gridDim.x = dim_0 -gridDim.y = dim_1 -gridDim.z = dim_2 -blockDim.x = min(dim_3 * dim_4, MAX_THREADS) - -@param -input: [dim_0, dim_1, dim_2, dim_3, dim_4] -bias: [dim_2, dim_3, dim_4] -output: [dim_2, dim_0, dim_3, dim_1, dim_4] -*/ -template -__global__ void bias_add_transform_20314(T *output, const T *input, - const T *bias, int dim_3, int dim_4); - -template <> -__global__ void bias_add_transform_20314(float *output, - const float *input, - const float *bias, int dim_3, - int dim_4) { - int id0 = blockIdx.x; - int id1 = blockIdx.y; - int id2 = blockIdx.z; - int dim_0 = gridDim.x; - int dim_1 = gridDim.y; - int dim_2 = gridDim.z; - int dim_34 = dim_3 * dim_4; - - int src_offset = flat_4dim(id0, id1, id2, 0, dim_1, dim_2, dim_34); - int trg_offset = flat_5dim(id2, id0, 0, id1, 0, dim_0, dim_3, dim_1, dim_4); - int bias_offset = flat_2dim(id2, 0, dim_34); - - const float4 *qkv4 = reinterpret_cast(input); - const float4 *bias4 = reinterpret_cast(bias); - float4 *res4 = reinterpret_cast(output); - float4 vqkv4; - float4 vbias4; - float4 vres4; - - for (std::size_t i = threadIdx.x; i < dim_34; i += blockDim.x) { - vqkv4 = qkv4[src_offset + i]; - vbias4 = bias4[bias_offset + i]; - vres4.x = vqkv4.x + vbias4.x; - vres4.y = vqkv4.y + vbias4.y; - vres4.z = vqkv4.z + vbias4.z; - vres4.w = vqkv4.w + vbias4.w; - - int id3 = i / dim_4; - int id4 = i % dim_4; - int cur_trg_offset = flat_3dim(id3, 0, id4, dim_1, dim_4); - res4[trg_offset + cur_trg_offset] = vres4; - } -} - -template <> -__global__ void bias_add_transform_20314<__half>(__half *output, - const __half *input, - const __half *bias, int dim_3, - int dim_4) { - int id0 = blockIdx.x; - int id1 = blockIdx.y; - int id2 = blockIdx.z; - int dim_0 = gridDim.x; - int dim_1 = gridDim.y; - int dim_2 = gridDim.z; - int dim_34 = dim_3 * dim_4; - - int src_offset = flat_4dim(id0, id1, id2, 0, dim_1, dim_2, dim_34); - int trg_offset = flat_5dim(id2, id0, 0, id1, 0, dim_0, dim_3, dim_1, dim_4); - int bias_offset = flat_2dim(id2, 0, dim_34); - - const float4 *qkv4 = reinterpret_cast(input); - const float4 *bias4 = reinterpret_cast(bias); - float4 *res4 = reinterpret_cast(output); - float4 vqkv4; - float4 vbias4; - float4 vres4; - __half2 *h2_qkv = reinterpret_cast<__half2 *>(&vqkv4); - __half2 *h2_bias = reinterpret_cast<__half2 *>(&vbias4); - __half2 *h2_res = reinterpret_cast<__half2 *>(&vres4); - - for (std::size_t i = threadIdx.x; i < dim_34; i += blockDim.x) { - vqkv4 = qkv4[src_offset + i]; - vbias4 = bias4[bias_offset + i]; - h2_res[0] = __hadd2(h2_qkv[0], h2_bias[0]); - h2_res[1] = __hadd2(h2_qkv[1], h2_bias[1]); - h2_res[2] = __hadd2(h2_qkv[2], h2_bias[2]); - h2_res[3] = __hadd2(h2_qkv[3], h2_bias[3]); - - int id3 = i / dim_4; - int id4 = i % dim_4; - int cur_trg_offset = flat_3dim(id3, 0, id4, dim_1, dim_4); - res4[trg_offset + cur_trg_offset] = vres4; - } -} - -// [b, s, 3, h] -> [3, b, nh, s, ad] -template <> -void launch_bias_add_transform_20314(float *output, const float *input, - const float *bias, int dim_0, - int dim_1, int dim_2, int dim_3, - int dim_4, hipStream_t stream) { - dim_4 >>= 2; - - dim3 grid_dim(dim_0, dim_1, dim_2); - dim3 block_dim(min(dim_3 * dim_4, MAX_THREADS)); - - hipLaunchKernelGGL(( bias_add_transform_20314) - , dim3(grid_dim), dim3(block_dim), 0, stream, output, input, bias, dim_3, dim_4); -} - -template <> -void launch_bias_add_transform_20314<__half>(__half *output, - const __half *input, - const 
__half *bias, int dim_0, - int dim_1, int dim_2, int dim_3, - int dim_4, hipStream_t stream) { - dim_4 >>= 3; - - dim3 grid_dim(dim_0, dim_1, dim_2); - dim3 block_dim(min(dim_3 * dim_4, MAX_THREADS)); - - hipLaunchKernelGGL(( bias_add_transform_20314<__half>) - , dim3(grid_dim), dim3(block_dim), 0, stream, output, input, bias, dim_3, dim_4); -} - -/** -@brief: transform4d_0213 -Reshape the input matrix to merge the heads - -@thread -gridDim.x = (num_all + max_block_thread - 1) / max_block_thread -blockDim.x = max_block_thread - -@param -input: [trans_count, batch_size, nhead, seq_len, head_dim] -output: [batch_size, seq_len, trans_count, nhead, head_dim] -batch_size: the size of the current batch -seq_len: the sequence length of the current batch -hidden_dim: dim of the hidden tensor -nhead: number of attention heads -trans_count: 1 or 3, the count of matrice need to be transformed -*/ -template -__global__ void transform4d_0213(T *output, const T *input, int batch_size, - int seq_len, int trans_count, int nhead, - int head_dim, int num_all) { - int offset = blockIdx.x * blockDim.x + threadIdx.x; - if (offset >= num_all) { - return; - } - int trans_id, batch_id, head_id, token_id, dim_id; - decompose_5dim(offset, batch_size, nhead, seq_len, head_dim, &trans_id, - &batch_id, &head_id, &token_id, &dim_id); - // [b, s, tc, nh, ad] - int trg_offset = flat_5dim(batch_id, token_id, trans_id, head_id, dim_id, - seq_len, trans_count, nhead, head_dim); - - const float4 *input4 = reinterpret_cast(input); - float4 *res4 = reinterpret_cast(output); - res4[trg_offset] = input4[offset]; -} - -// [tc, b, nh, s, ad] -> [b, s, tc, nh, ad] -template <> -void launch_transform4d_0213(float *output, const float *input, - int batch_size, int seq_len, int hidden_dim, - int nhead, int trans_count, - hipStream_t stream) { - hidden_dim >>= 2; - int head_dim = hidden_dim / nhead; - int num_all = batch_size * seq_len * trans_count * hidden_dim; - int nblock = (num_all + MAX_THREADS - 1) / MAX_THREADS; - - hipLaunchKernelGGL(( transform4d_0213), dim3(nblock), dim3(MAX_THREADS), 0, stream, - output, input, batch_size, seq_len, trans_count, nhead, head_dim, - num_all); -} - -template <> -void launch_transform4d_0213<__half>(__half *output, const __half *input, - int batch_size, int seq_len, - int hidden_dim, int nhead, int trans_count, - hipStream_t stream) { - hidden_dim >>= 3; - int head_dim = hidden_dim / nhead; - int num_all = batch_size * seq_len * trans_count * hidden_dim; - int nblock = (num_all + MAX_THREADS - 1) / MAX_THREADS; - - hipLaunchKernelGGL(( transform4d_0213<__half>), dim3(nblock), dim3(MAX_THREADS), 0, stream, - output, input, batch_size, seq_len, trans_count, nhead, head_dim, - num_all); -} diff --git a/colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp b/colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp deleted file mode 100644 index b28fb8e5d0f0ceae89666ad229c95f7dbb99d1ab..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp +++ /dev/null @@ -1,186 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/*This code from NVIDIA apex: - * https://github.com/NVIDIA/apex - * with minor changes. 
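The permutations in the transform kernels above are easiest to read as plain index arithmetic. A hedged CPU sketch of `bias_add_transform_20314` (scalar loops stand in for the kernel's float4 loads; the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// CPU reference for bias_add_transform_20314: add a broadcast bias, then
// permute [d0, d1, d2, d3, d4] -> [d2, d0, d3, d1, d4]. For a fused QKV
// projection this maps [batch, seq, 3, nhead, head_dim] to
// [3, batch, nhead, seq, head_dim].
void bias_add_transform_20314_ref(std::vector<float> &out,
                                  const std::vector<float> &in,
                                  const std::vector<float> &bias,  // [d2, d3, d4]
                                  int d0, int d1, int d2, int d3, int d4) {
  for (int i0 = 0; i0 < d0; ++i0)
    for (int i1 = 0; i1 < d1; ++i1)
      for (int i2 = 0; i2 < d2; ++i2)
        for (int i3 = 0; i3 < d3; ++i3)
          for (int i4 = 0; i4 < d4; ++i4) {
            std::size_t src =
                ((((std::size_t)i0 * d1 + i1) * d2 + i2) * d3 + i3) * d4 + i4;
            std::size_t trg =
                ((((std::size_t)i2 * d0 + i0) * d3 + i3) * d1 + i1) * d4 + i4;
            out[trg] = in[src] + bias[((std::size_t)i2 * d3 + i3) * d4 + i4];
          }
}
```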
*/ - -#include -#include -#include -#include "../../hip_native/csrc/compat.h" - -namespace { - -void compute_n1_n2( - at::Tensor input, - at::IntArrayRef normalized_shape, - int& n1, - int& n2) { - int idiff = input.ndimension() - normalized_shape.size(); - n2 = 1; - for (int i = 0; i < (int)normalized_shape.size(); ++i) { - assert( input.sizes()[i+idiff] == normalized_shape[i] ); - n2 *= normalized_shape[i]; - } - n1 = 1; - for (int i = 0; i < idiff; ++i) { - n1 *= input.sizes()[i]; - } -} - -void check_args( - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta - ) -{ - TORCH_CHECK(!gamma.defined() || gamma.sizes().equals(normalized_shape)); - TORCH_CHECK(!beta.defined() || beta.sizes().equals(normalized_shape)); -} - -void check_args( - at::Tensor input, - at::IntArrayRef normalized_shape, - int& n1, - int& n2 - ) -{ - int64_t normalized_ndim = normalized_shape.size(); - - if (normalized_ndim < 1) { - std::stringstream ss; - ss << "Expected normalized_shape to be at least 1-dimensional, i.e., " - << "containing at least one element, but got normalized_shape=" - << normalized_shape; - throw std::runtime_error(ss.str()); - } - - auto input_shape = input.sizes(); - auto input_ndim = input.dim(); - - if (input_ndim < normalized_ndim || - !input_shape.slice(input_ndim - normalized_ndim).equals(normalized_shape)) { - std::stringstream ss; - ss << "Given normalized_shape=" << normalized_shape - << ", expected input with shape [*"; - for (auto size : normalized_shape) { - ss << ", " << size; - } - ss << "], but got input of size" << input_shape; - throw std::runtime_error(ss.str()); - } - - compute_n1_n2(input,normalized_shape,n1,n2); -} - - -void check_args( - at::Tensor input, - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta, - int& n1, - int& n2 - ) -{ - check_args(input,normalized_shape,n1,n2); - check_args(normalized_shape,gamma,beta); -} -} - -void cuda_layer_norm( - at::Tensor* output, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - at::IntArrayRef normalized_shape, - at::Tensor* gamma, - at::Tensor* beta, - double epsilon); - -#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor") -#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous") -#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x) - -std::vector layer_norm_affine( - at::Tensor input, - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta, - double epsilon) { - - CHECK_INPUT(input); - CHECK_INPUT(gamma); - CHECK_INPUT(beta); - int n1, n2; - check_args(input, normalized_shape, gamma, beta, n1, n2); - - at::Tensor output = at::empty_like( - input, gamma.options().dtype(gamma.scalar_type())); - at::Tensor mean = at::empty( - {n1}, input.options().dtype(at::ScalarType::Float)); - at::Tensor invvar = at::empty_like(mean); - - cuda_layer_norm(&output, &mean, &invvar, &input, n1, n2, - normalized_shape, &gamma, &beta, epsilon); - - return {output, mean, invvar}; - -} - - -void cuda_layer_norm_gradient( - at::Tensor* dout, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - at::IntArrayRef normalized_shape, - at::Tensor* gamma, - at::Tensor* beta, - double epsilon, - at::Tensor* grad_input, - at::Tensor* grad_gamma, - at::Tensor* grad_beta - ); - -std::vector layer_norm_gradient_affine( - at::Tensor dout, - at::Tensor mean, - at::Tensor invvar, - at::Tensor input, - at::IntArrayRef normalized_shape, - at::Tensor gamma, - at::Tensor beta, - double 
epsilon) { - - CHECK_INPUT(dout); - CHECK_INPUT(mean); - CHECK_INPUT(invvar); - CHECK_INPUT(input); - CHECK_INPUT(gamma); - CHECK_INPUT(beta); - int n1, n2; - check_args(input, normalized_shape, gamma, beta, n1, n2); - - at::Tensor grad_input = at::empty_like(input); - at::Tensor grad_gamma = at::empty_like(gamma); - at::Tensor grad_beta = at::empty_like(beta); - - cuda_layer_norm_gradient(&dout, &mean, &invvar, &input, n1, n2, - normalized_shape, &gamma, &beta, epsilon, - &grad_input, &grad_gamma, &grad_beta); - - return {grad_input, grad_gamma, grad_beta}; - -} - - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("forward_affine", &layer_norm_affine, - "LayerNorm forward (CUDA)"); - m.def("backward_affine", &layer_norm_gradient_affine, - "LayerNorm backward (CUDA)"); -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/layer_norm_hip_kernel.hip b/colossalai/kernel/hip_native/csrc/layer_norm_hip_kernel.hip deleted file mode 100644 index 2f9fe61036985ac6c9f363052bdb12bce7503422..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/layer_norm_hip_kernel.hip +++ /dev/null @@ -1,834 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/*This code from NVIDIA apex: - * https://github.com/NVIDIA/apex - * with minor changes. */ - -#include "ATen/ATen.h" -#include "ATen/AccumulateType.h" -#include "ATen/hip/HIPContext.h" -#include - -#include -#include - -#include "../../hip_native/csrc/type_shim.h" - -template __device__ -void cuWelfordOnlineSum( - const U curr, - U& mu, - U& sigma2, - U& count) -{ - count = count + U(1); - U delta = curr - mu; - U lmean = mu + delta / count; - mu = lmean; - U delta2 = curr - lmean; - sigma2 = sigma2 + delta * delta2; -} - -template __device__ -void cuChanOnlineSum( - const U muB, - const U sigma2B, - const U countB, - U& mu, - U& sigma2, - U& count) -{ - U delta = muB - mu; - U nA = count; - U nB = countB; - count = count + countB; - U nX = count; - if (nX > U(0)) { - nA = nA / nX; - nB = nB / nX; - mu = nA*mu + nB*muB; - sigma2 = sigma2 + sigma2B + delta * delta * nA * nB * nX; - } else { - mu = U(0); - sigma2 = U(0); - } -} - -template __device__ -void cuWelfordMuSigma2( - const T* __restrict__ vals, - const int n1, - const int n2, - const int i1, - U& mu, - U& sigma2, - U* buf) -{ - // Assumptions: - // 1) blockDim.x == warpSize - // 2) Tensor is contiguous - // 3) 2*blockDim.y*sizeof(U)+blockDim.y*sizeof(int) shared memory available. 
- // - // compute variance and mean over n2 - U count = U(0); - mu= U(0); - sigma2 = U(0); - if (i1 < n1) { - // one warp normalizes one n1 index, - // synchronization is implicit - // initialize with standard Welford algorithm - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - const T* lvals = vals + i1*n2; - int l = 4*thrx; - for (; l+3 < n2; l+=4*numx) { - for (int k = 0; k < 4; ++k) { - U curr = static_cast(lvals[l+k]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - } - for (; l < n2; ++l) { - U curr = static_cast(lvals[l]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - // intra-warp reductions - for (int l = 0; l <= 4; ++l) { - int srcLaneB = (threadIdx.x+(1<(muB,sigma2B,countB,mu,sigma2,count); - } - // threadIdx.x == 0 has correct values for each warp - // inter-warp reductions - if (blockDim.y > 1) { - U* ubuf = (U*)buf; - U* ibuf = (U*)(ubuf + blockDim.y); - for (int offset = blockDim.y/2; offset > 0; offset /= 2) { - // upper half of warps write to shared - if (threadIdx.x == 0 && threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int wrt_y = threadIdx.y - offset; - ubuf[2*wrt_y] = mu; - ubuf[2*wrt_y+1] = sigma2; - ibuf[wrt_y] = count; - } - __syncthreads(); - // lower half merges - if (threadIdx.x == 0 && threadIdx.y < offset) { - U muB = ubuf[2*threadIdx.y]; - U sigma2B = ubuf[2*threadIdx.y+1]; - U countB = ibuf[threadIdx.y]; - cuChanOnlineSum(muB,sigma2B,countB,mu,sigma2,count); - } - __syncthreads(); - } - // threadIdx.x = 0 && threadIdx.y == 0 only thread that has correct values - if (threadIdx.x == 0 && threadIdx.y == 0) { - ubuf[0] = mu; - ubuf[1] = sigma2; - } - __syncthreads(); - mu = ubuf[0]; - sigma2 = ubuf[1]/U(n2); - // don't care about final value of count, we know count == n2 - } else { - mu = WARP_SHFL(mu, 0); - sigma2 = WARP_SHFL(sigma2/U(n2), 0); - } - } -} - -template<> __device__ -void cuWelfordMuSigma2( - const at::Half* __restrict__ vals, - const int n1, - const int n2, - const int i1, - float& mu, - float& sigma2, - float* buf) -{ - // Assumptions: - // 1) blockDim.x == warpSize - // 2) Tensor is contiguous - // 3) 2*blockDim.y*sizeof(U)+blockDim.y*sizeof(int) shared memory available. - // - // compute variance and mean over n2 - float count = 0.0f; - mu= float(0); - sigma2 = float(0); - if (i1 < n1) { - // one warp normalizes one n1 index, - // synchronization is implicit - // initialize with standard Welford algorithm - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - const at::Half* lvals = vals + i1*n2; - int l = 8*thrx; - if ((((size_t)lvals)&3) != 0) { - // 16 bit alignment - // first thread consumes first point - if (thrx == 0) { - float curr = static_cast(lvals[0]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - ++l; - } - // at this point, lvals[l] are 32 bit aligned for all threads. 
- for (; l+7 < n2; l+=8*numx) { - for (int k = 0; k < 8; k+=2) { - float2 curr = __half22float2(*((__half2*)(lvals+l+k))); - cuWelfordOnlineSum(curr.x,mu,sigma2,count); - cuWelfordOnlineSum(curr.y,mu,sigma2,count); - } - } - for (; l < n2; ++l) { - float curr = static_cast(lvals[l]); - cuWelfordOnlineSum(curr,mu,sigma2,count); - } - // intra-warp reductions - for (int l = 0; l <= 4; ++l) { - int srcLaneB = (threadIdx.x+(1< 1) { - float* ubuf = (float*)buf; - float* ibuf = (float*)(ubuf + blockDim.y); - for (int offset = blockDim.y/2; offset > 0; offset /= 2) { - // upper half of warps write to shared - if (threadIdx.x == 0 && threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int wrt_y = threadIdx.y - offset; - ubuf[2*wrt_y] = mu; - ubuf[2*wrt_y+1] = sigma2; - ibuf[wrt_y] = count; - } - __syncthreads(); - // lower half merges - if (threadIdx.x == 0 && threadIdx.y < offset) { - float muB = ubuf[2*threadIdx.y]; - float sigma2B = ubuf[2*threadIdx.y+1]; - float countB = ibuf[threadIdx.y]; - cuChanOnlineSum(muB,sigma2B,countB,mu,sigma2,count); - } - __syncthreads(); - } - // threadIdx.x = 0 && threadIdx.y == 0 only thread that has correct values - if (threadIdx.x == 0 && threadIdx.y == 0) { - ubuf[0] = mu; - ubuf[1] = sigma2; - } - __syncthreads(); - mu = ubuf[0]; - sigma2 = ubuf[1]/float(n2); - // don't care about final value of count, we know count == n2 - } else { - mu = WARP_SHFL(mu, 0); - sigma2 = WARP_SHFL(sigma2/float(n2), 0); - } - } -} - -#ifdef COLOSSAL_HIP - template __device__ U rsqrt(U v) { - return U(1) / sqrt(v); - } - template<> __device__ float rsqrt(float v) { - return rsqrtf(v); - } - template<> __device__ double rsqrt(double v) { - return rsqrt(v); - } -#else - template U rsqrt(U v) { - return U(1) / sqrt(v); - } - template<> float rsqrt(float v) { - return rsqrtf(v); - } - template<> double rsqrt(double v) { - return rsqrt(v); - } -#endif - -namespace { -// This is the un-specialized struct. Note that we prevent instantiation of this -// struct by putting an undefined symbol in the function body so it won't compile. 
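The two Welford helpers above have simple scalar counterparts. A minimal sketch of the running update and of Chan et al.'s merge formula, which is what the warp and block reductions apply to the shuffled partials (struct and names are illustrative):

```cpp
// Scalar counterparts of cuWelfordOnlineSum and cuChanOnlineSum.
struct Welford {
  float mu = 0.f;      // running mean
  float sigma2 = 0.f;  // running sum of squared deviations (M2)
  float count = 0.f;
};

// Fold one new value into a running (mean, M2, count).
void welford_add(Welford &w, float curr) {
  w.count += 1.f;
  float delta = curr - w.mu;
  w.mu += delta / w.count;
  w.sigma2 += delta * (curr - w.mu);
}

// Combine two partial results (parallel variance merge).
void chan_merge(Welford &a, const Welford &b) {
  float delta = b.mu - a.mu;
  float nA = a.count, nB = b.count, nX = nA + nB;
  if (nX > 0.f) {
    a.mu = (nA / nX) * a.mu + (nB / nX) * b.mu;
    a.sigma2 += b.sigma2 + delta * delta * nA * nB / nX;
    a.count = nX;
  } else {
    a = Welford{};
  }
}
// Variance over n2 elements is sigma2 / n2 at the end, matching the
// final "sigma2 = ubuf[1] / U(n2)" division in the kernel.
```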
-// template -// struct SharedMemory -// { -// // Ensure that we won't compile any un-specialized types -// __device__ T *getPointer() -// { -// extern __device__ void error(void); -// error(); -// return NULL; -// } -// }; -// https://github.com/NVIDIA/apex/issues/246 -template -struct SharedMemory; - -template <> -struct SharedMemory -{ - __device__ float *getPointer() - { - HIP_DYNAMIC_SHARED( float, s_float) - return s_float; - } -}; - -} - -template __global__ -void cuApplyLayerNorm( - V* __restrict__ output_vals, - U* __restrict__ mean, - U* __restrict__ invvar, - const T* __restrict__ vals, - const int n1, - const int n2, - const U epsilon, - const V* __restrict__ gamma, - const V* __restrict__ beta - ) -{ - // Assumptions: - // 1) blockDim.x == warpSize - // 2) Tensors are contiguous - // -#ifdef COLOSSAL_HIP - for (size_t i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#else - for (auto i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#endif - SharedMemory shared; - U* buf = shared.getPointer(); - U mu,sigma2; - cuWelfordMuSigma2(vals,n1,n2,i1,mu,sigma2,buf); - const T* lvals = vals + i1*n2; - V* ovals = output_vals + i1*n2; - U c_invvar = rsqrt(sigma2 + epsilon); - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - if (gamma != NULL && beta != NULL) { - for (int i = thrx; i < n2; i+=numx) { - U curr = static_cast(lvals[i]); - ovals[i] = gamma[i] * static_cast(c_invvar * (curr - mu)) + beta[i]; - } - } else { - for (int i = thrx; i < n2; i+=numx) { - U curr = static_cast(lvals[i]); - ovals[i] = static_cast(c_invvar * (curr - mu)); - } - } - if (threadIdx.x == 0 && threadIdx.y == 0) { - mean[i1] = mu; - invvar[i1] = c_invvar; - } - } -} - -template __device__ -void cuLoadWriteStridedInputs( - const int i1_block, - const int thr_load_row_off, - const int thr_load_col_off, - const int i2_off, - const int row_stride, - U* warp_buf1, - U* warp_buf2, - const T* input, - const V* dout, - const int i1_end, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar - ) -{ - int i1 = i1_block+thr_load_row_off; - if (i1 < i1_end) { - U curr_mean = mean[i1]; - U curr_invvar = invvar[i1]; - for (int k = 0; k < blockDim.y; ++k) { - int i2 = i2_off + k; - int load_idx = i1*n2+i2; - int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; - if (i2(input[load_idx]); - U curr_dout = static_cast(dout[load_idx]); - warp_buf1[write_idx] = curr_dout; - warp_buf2[write_idx] = curr_dout * (curr_input - curr_mean) * curr_invvar; - } else { - warp_buf1[write_idx] = U(0); - warp_buf2[write_idx] = U(0); - } - } - } else { - for (int k = 0; k < blockDim.y; ++k) { - int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; - warp_buf1[write_idx] = U(0); - warp_buf2[write_idx] = U(0); - } - } -} - -template __device__ -void cuLoadAddStridedInputs( - const int i1_block, - const int thr_load_row_off, - const int thr_load_col_off, - const int i2_off, - const int row_stride, - U* warp_buf1, - U* warp_buf2, - const T* input, - const V* dout, - const int i1_end, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar - ) -{ - int i1 = i1_block+thr_load_row_off; - if (i1 < i1_end) { - U curr_mean = mean[i1]; - U curr_invvar = invvar[i1]; - for (int k = 0; k < blockDim.y; ++k) { - int i2 = i2_off + k; - int load_idx = i1*n2+i2; - int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; - if (i2(input[load_idx]); - U curr_dout = static_cast(dout[load_idx]); - warp_buf1[write_idx] += curr_dout; - warp_buf2[write_idx] += 
curr_dout * (curr_input - curr_mean) * curr_invvar; - } - } - } -} - -template __global__ -void cuComputePartGradGammaBeta( - const V* __restrict__ dout, - const T* __restrict__ input, - const int n1, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar, - U epsilon, - U* part_grad_gamma, - U* part_grad_beta) -{ - const int numsegs_n1 = (n1+blockDim.y*blockDim.y-1) / (blockDim.y*blockDim.y); - const int segs_per_block = (numsegs_n1 + gridDim.y - 1) / gridDim.y; - const int i1_beg = blockIdx.y * segs_per_block * blockDim.y*blockDim.y; - const int i1_beg_plus_one = (blockIdx.y+1) * segs_per_block * blockDim.y*blockDim.y; - const int i1_end = i1_beg_plus_one < n1 ? i1_beg_plus_one : n1; - const int row_stride = blockDim.x+1; - const int thr_load_col_off = (threadIdx.x*blockDim.y)&(blockDim.x-1); - const int thr_load_row_off = (threadIdx.x*blockDim.y)/blockDim.x + threadIdx.y*blockDim.y; - const int i2_off = blockIdx.x * blockDim.x + thr_load_col_off; - SharedMemory shared; - U* buf = shared.getPointer(); // buf has at least blockDim.x * blockDim.y * blockDim.y + (blockDim.y - 1)*(blockDim.x/blockDim.y) elements - U* warp_buf1 = (U*)buf; - U* warp_buf2 = warp_buf1 + blockDim.y * blockDim.y * row_stride; - // compute partial sums from strided inputs - // do this to increase number of loads in flight - cuLoadWriteStridedInputs(i1_beg,thr_load_row_off,thr_load_col_off,i2_off,row_stride,warp_buf1,warp_buf2,input,dout,i1_end,n2,mean,invvar); - for (int i1_block = i1_beg+blockDim.y*blockDim.y; i1_block < i1_end; i1_block+=blockDim.y*blockDim.y) { - cuLoadAddStridedInputs(i1_block,thr_load_row_off,thr_load_col_off,i2_off,row_stride,warp_buf1,warp_buf2,input,dout,i1_end,n2,mean,invvar); - } - __syncthreads(); - // inter-warp reductions - // sum within each warp - U acc1 = U(0); - U acc2 = U(0); - for (int k = 0; k < blockDim.y; ++k) { - int row1 = threadIdx.y + k*blockDim.y; - int idx1 = row1*row_stride + threadIdx.x; - acc1 += warp_buf1[idx1]; - acc2 += warp_buf2[idx1]; - } - warp_buf1[threadIdx.y*row_stride+threadIdx.x] = acc1; - warp_buf2[threadIdx.y*row_stride+threadIdx.x] = acc2; - __syncthreads(); - // sum all warps - for (int offset = blockDim.y/2; offset > 1; offset /= 2) { - if (threadIdx.y < offset) { - int row1 = threadIdx.y; - int row2 = threadIdx.y + offset; - int idx1 = row1*row_stride + threadIdx.x; - int idx2 = row2*row_stride + threadIdx.x; - warp_buf1[idx1] += warp_buf1[idx2]; - warp_buf2[idx1] += warp_buf2[idx2]; - } - __syncthreads(); - } - int i2 = blockIdx.x * blockDim.x + threadIdx.x; - if (threadIdx.y == 0 && i2 < n2) { - int row1 = threadIdx.y; - int row2 = threadIdx.y + 1; - int idx1 = row1*row_stride + threadIdx.x; - int idx2 = row2*row_stride + threadIdx.x; - part_grad_beta[blockIdx.y*n2+i2] = warp_buf1[idx1] + warp_buf1[idx2]; - part_grad_gamma[blockIdx.y*n2+i2] = warp_buf2[idx1] + warp_buf2[idx2]; - } -} - -template __global__ -void cuComputeGradGammaBeta( - const U* part_grad_gamma, - const U* part_grad_beta, - const int part_size, - const int n1, - const int n2, - V* grad_gamma, - V* grad_beta) -{ - // sum partial gradients for gamma and beta - SharedMemory shared; - U* buf = shared.getPointer(); - int i2 = blockIdx.x * blockDim.x + threadIdx.x; - if (i2 < n2) { - // each warp does sequential reductions until reduced part_size is num_warps - int num_warp_reductions = part_size / blockDim.y; - U sum_gamma = U(0); - U sum_beta = U(0); - const U* part_grad_gamma_ptr = part_grad_gamma + threadIdx.y * num_warp_reductions * n2 + i2; - const U* 
part_grad_beta_ptr = part_grad_beta + threadIdx.y * num_warp_reductions * n2 + i2; - for (int warp_offset = 0; warp_offset < num_warp_reductions; ++warp_offset) { - sum_gamma += part_grad_gamma_ptr[warp_offset*n2]; - sum_beta += part_grad_beta_ptr[warp_offset*n2]; - } - // inter-warp reductions - const int nbsize3 = blockDim.x * blockDim.y / 2; - for (int offset = blockDim.y/2; offset >= 1; offset /= 2) { - // top half write to shared memory - if (threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int write_idx = (threadIdx.y - offset) * blockDim.x + threadIdx.x; - buf[write_idx] = sum_gamma; - buf[write_idx+nbsize3] = sum_beta; - } - __syncthreads(); - // bottom half sums - if (threadIdx.y < offset) { - const int read_idx = threadIdx.y * blockDim.x + threadIdx.x; - sum_gamma += buf[read_idx]; - sum_beta += buf[read_idx+nbsize3]; - } - __syncthreads(); - } - // write out fully summed gradients - if (threadIdx.y == 0) { - grad_gamma[i2] = sum_gamma; - grad_beta[i2] = sum_beta; - } - } -} - -template __global__ -void cuComputeGradInput( - const V* __restrict__ dout, - const T* __restrict__ input, - const int n1, - const int n2, - const U* __restrict__ mean, - const U* __restrict__ invvar, - U epsilon, - const V* gamma, - T* grad_input) -{ -#ifdef COLOSSAL_HIP - for (size_t i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#else - for (auto i1=blockIdx.y; i1 < n1; i1 += gridDim.y) { -#endif - U sum_loss1 = U(0); - U sum_loss2 = U(0); - const U c_mean = mean[i1]; - const U c_invvar = invvar[i1]; - const T* k_input = input + i1*n2; - const V* k_dout = dout + i1*n2; - const int numx = blockDim.x * blockDim.y; - const int thrx = threadIdx.x + threadIdx.y * blockDim.x; - if (gamma != NULL) { - int l = 4*thrx; - for (; l+3 < n2; l+=4*numx) { - for (int k = 0; k < 4; ++k) { - const U c_h = static_cast(k_input[l+k]); - const U c_loss = static_cast(k_dout[l+k]); - sum_loss1 += c_loss * gamma[l+k]; - sum_loss2 += c_loss * gamma[l+k] * (c_h - c_mean) * c_invvar; - } - } - for (; l < n2; ++l) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - sum_loss1 += c_loss * gamma[l]; - sum_loss2 += c_loss * gamma[l] * (c_h - c_mean) * c_invvar; - } - } else { - int l = 4*thrx; - for (; l+3 < n2; l+=4*numx) { - for (int k = 0; k < 4; ++k) { - const U c_h = static_cast(k_input[l+k]); - const U c_loss = static_cast(k_dout[l+k]); - sum_loss1 += c_loss; - sum_loss2 += c_loss * (c_h - c_mean) * c_invvar; - } - } - for (; l < n2; ++l) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - sum_loss1 += c_loss; - sum_loss2 += c_loss * (c_h - c_mean) * c_invvar; - } - } - // intra-warp reductions - for (int mask = blockDim.x/2; mask > 0; mask /= 2) { - sum_loss1 += WARP_SHFL_XOR(sum_loss1, mask); - sum_loss2 += WARP_SHFL_XOR(sum_loss2, mask); - } - // inter-warp reductions - if (blockDim.y > 1) { - SharedMemory shared; - U* buf = shared.getPointer(); - for (int offset = blockDim.y/2; offset > 0; offset /= 2) { - // upper half of warps write to shared - if (threadIdx.y >= offset && threadIdx.y < 2*offset) { - const int wrt_i = (threadIdx.y - offset) * blockDim.x + threadIdx.x; - buf[2*wrt_i] = sum_loss1; - buf[2*wrt_i+1] = sum_loss2; - } - __syncthreads(); - // lower half merges - if (threadIdx.y < offset) { - const int read_i = threadIdx.y * blockDim.x + threadIdx.x; - sum_loss1 += buf[2*read_i]; - sum_loss2 += buf[2*read_i+1]; - } - __syncthreads(); - } - if (threadIdx.y == 0) { - buf[2*threadIdx.x] = sum_loss1; - buf[2*threadIdx.x+1] = sum_loss2; - } 
- __syncthreads(); - if (threadIdx.y !=0) { - sum_loss1 = buf[2*threadIdx.x]; - sum_loss2 = buf[2*threadIdx.x+1]; - } - } - // all threads now have the two sums over l - U fH = (U)n2; - U term1 = (U(1) / fH) * c_invvar; - T* k_grad_input = grad_input + i1*n2; - if (gamma != NULL) { - for (int l = thrx; l < n2; l+=numx) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - U f_grad_input = fH * c_loss * gamma[l]; - f_grad_input -= sum_loss1; - f_grad_input -= (c_h - c_mean) * c_invvar * sum_loss2; - f_grad_input *= term1; - k_grad_input[l] = static_cast(f_grad_input); - } - } else { - for (int l = thrx; l < n2; l+=numx) { - const U c_h = static_cast(k_input[l]); - const U c_loss = static_cast(k_dout[l]); - U f_grad_input = fH * c_loss; - f_grad_input -= sum_loss1; - f_grad_input -= (c_h - c_mean) * c_invvar * sum_loss2; - f_grad_input *= term1; - k_grad_input[l] = static_cast(f_grad_input); - } - } - } -} - - - - -template -void HostApplyLayerNorm( - V* output, - U* mean, - U* invvar, - const T* input, - int n1, - int n2, - double epsilon, - const V* gamma, - const V* beta - ) -{ - auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA().stream(); - const dim3 threads(32,4,1); - const uint64_t maxGridY = - at::cuda::getCurrentDeviceProperties()->maxGridSize[1]; - const dim3 blocks(1, ::min((uint64_t)n1, maxGridY), 1); - int nshared = - threads.y > 1 ? - threads.y*sizeof(U)+(threads.y/2)*sizeof(U) : - 0; - hipLaunchKernelGGL(( cuApplyLayerNorm), dim3(blocks), dim3(threads), nshared, stream, - output, - mean, - invvar, - input, - n1,n2, - U(epsilon), - gamma,beta); -} - - -void cuda_layer_norm( - at::Tensor* output, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - #ifdef VERSION_GE_1_1 - at::IntArrayRef normalized_shape, - #else - at::IntList normalized_shape, - #endif - at::Tensor* gamma, - at::Tensor* beta, - double epsilon) -{ - using namespace at; - DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES( - input->scalar_type(), output->scalar_type(), "cuda_layer_norm_kernel", - HostApplyLayerNorm( - output->DATA_PTR(), - mean->DATA_PTR(), - invvar->DATA_PTR(), - input->DATA_PTR(), - n1,n2, - epsilon, - gamma != NULL ? gamma->DATA_PTR() : NULL, - beta != NULL ? beta->DATA_PTR() : NULL); - ) -} - - -template -void HostLayerNormGradient( - const V* dout, - const U* mean, - const U* invvar, - at::Tensor* input, - int n1, - int n2, - const V* gamma, - const V* beta, - double epsilon, - T* grad_input, - V* grad_gamma, - V* grad_beta - ) -{ - auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA().stream(); - - if (gamma != NULL && beta != NULL) { - // compute grad_gamma(j) and grad_beta(j) - const int part_size = 16; - const dim3 threads2(32,4,1); - const dim3 blocks2((n2+threads2.x-1)/threads2.x,part_size,1); - const int nshared2_a = 2 * sizeof(U) * threads2.y * threads2.y * - (threads2.x + 1); - const int nshared2_b = threads2.x * threads2.y * sizeof(U); - const int nshared2 = nshared2_a > nshared2_b ? 
nshared2_a : nshared2_b; - at::Tensor part_grad_gamma = at::empty( - {part_size,n2}, input->options().dtype(at::ScalarType::Float)); - at::Tensor part_grad_beta = at::empty_like(part_grad_gamma); - hipLaunchKernelGGL(( cuComputePartGradGammaBeta), dim3(blocks2), dim3(threads2), nshared2, stream, - dout, - input->DATA_PTR(), - n1,n2, - mean, - invvar, - U(epsilon), - part_grad_gamma.DATA_PTR(), - part_grad_beta.DATA_PTR()); - - const dim3 threads3(32,8,1); - const dim3 blocks3((n2+threads2.x-1)/threads2.x,1,1); - const int nshared3 = threads3.x * threads3.y * sizeof(U); - hipLaunchKernelGGL(( cuComputeGradGammaBeta), dim3(blocks3), dim3(threads3), nshared3, stream, - part_grad_gamma.DATA_PTR(), - part_grad_beta.DATA_PTR(), - part_size, - n1,n2, - grad_gamma, - grad_beta); - } - - // compute grad_input - const uint64_t maxGridY = - at::cuda::getCurrentDeviceProperties()->maxGridSize[1]; - const dim3 blocks1(1, ::min((uint64_t)n1, maxGridY), 1); - const dim3 threads1(32,4,1); - int nshared = - threads1.y > 1 ? - threads1.y*threads1.x*sizeof(U) : - 0; - hipLaunchKernelGGL(( cuComputeGradInput), dim3(blocks1), dim3(threads1), nshared, stream, - dout, - input->DATA_PTR(), - n1,n2, - mean, - invvar, - U(epsilon), - gamma, - grad_input); -} - - -void cuda_layer_norm_gradient( - at::Tensor* dout, - at::Tensor* mean, - at::Tensor* invvar, - at::Tensor* input, - int n1, - int n2, - #ifdef VERSION_GE_1_1 - at::IntArrayRef normalized_shape, - #else - at::IntList normalized_shape, - #endif - at::Tensor* gamma, - at::Tensor* beta, - double epsilon, - at::Tensor* grad_input, - at::Tensor* grad_gamma, - at::Tensor* grad_beta) -{ - using namespace at; - DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES( - input->scalar_type(), gamma->scalar_type(), - "cuda_layer_norm_gradient_kernel", - HostLayerNormGradient( - dout->DATA_PTR(), - mean->DATA_PTR(), - invvar->DATA_PTR(), - input, - n1,n2, - // TMJ pass NULL argument for gamma, beta, grad_gamma and grad_beta - // if gamma Tensor is NULL on input. - gamma != NULL ? gamma->DATA_PTR() : NULL, - gamma != NULL ? beta->DATA_PTR() : NULL, - epsilon, - grad_input->DATA_PTR(), - gamma != NULL ? grad_gamma->DATA_PTR() : NULL, - gamma != NULL ? grad_beta->DATA_PTR() : NULL); - ) -} diff --git a/colossalai/kernel/hip_native/csrc/multi_tensor_adam.hip b/colossalai/kernel/hip_native/csrc/multi_tensor_adam.hip deleted file mode 100644 index 000e280ef5dbe8c41b95880c8c5e7ad91d01cec4..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multi_tensor_adam.hip +++ /dev/null @@ -1,178 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_adam.cu -#include -#include -#include -#include -// Another possibility: -// #include - -#include - -#include "../../hip_native/csrc/type_shim.h" -#include "../../hip_native/csrc/multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -typedef enum -{ - ADAM_MODE_0 = 0, // L2 regularization mode - ADAM_MODE_1 = 1 // Decoupled weight decay mode(AdamW) -} adamMode_t; - -using MATH_T = float; - -template -struct AdamFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<4> &tl, - const float beta1, - const float beta2, - const float beta1_correction, - const float beta2_correction, - const float epsilon, - const float lr, - adamMode_t mode, - const float decay) - { - // I'd like this kernel to propagate infs/nans. 
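The per-element update inside `AdamFunctor` implements the two `adamMode_t` variants declared above. A scalar reference sketch, assuming the bias corrections `1 - beta^step` are computed on the host as in `multi_tensor_adam_cuda` below (the function name is illustrative):

```cpp
#include <cmath>

// One Adam step. Mode 0 folds weight decay into the gradient (classic L2);
// mode 1 applies it to the parameter after the moment update (decoupled AdamW).
void adam_step(float &p, float &m, float &v, float g, float lr, float beta1,
               float beta2, float beta1_corr, float beta2_corr, float eps,
               float decay, int mode) {
  if (mode == 0) g += decay * p;          // ADAM_MODE_0: L2 regularization
  m = beta1 * m + (1.f - beta1) * g;      // first moment
  v = beta2 * v + (1.f - beta2) * g * g;  // second moment
  float m_hat = m / beta1_corr;           // bias-corrected moments
  float v_hat = v / beta2_corr;
  float update = m_hat / (std::sqrt(v_hat) + eps);
  if (mode == 1) update += decay * p;     // ADAM_MODE_1: decoupled decay
  p -= lr * update;
}
```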
- // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - - // potentially use to pass in list of scalar - // int tensor_num = tl.start_tensor_this_launch + tensor_loc; - - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - T *g = (T *)tl.addresses[0][tensor_loc]; - g += chunk_idx * chunk_size; - - T *p = (T *)tl.addresses[1][tensor_loc]; - p += chunk_idx * chunk_size; - - T *m = (T *)tl.addresses[2][tensor_loc]; - m += chunk_idx * chunk_size; - - T *v = (T *)tl.addresses[3][tensor_loc]; - v += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - // see note in multi_tensor_scale_kernel.cu - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { - MATH_T r_g[ILP]; - MATH_T r_p[ILP]; - MATH_T r_m[ILP]; - MATH_T r_v[ILP]; -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - r_g[ii] = g[i]; - r_p[ii] = p[i]; - r_m[ii] = m[i]; - r_v[ii] = v[i]; - } - else - { - r_g[ii] = MATH_T(0); - r_p[ii] = MATH_T(0); - r_m[ii] = MATH_T(0); - r_v[ii] = MATH_T(0); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - if (mode == ADAM_MODE_0) - { // L2 - r_g[ii] = r_g[ii] + (decay * r_p[ii]); - r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii]; - r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii]; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - MATH_T update = next_m_unbiased / denom; - r_p[ii] = r_p[ii] - (lr * update); - } - else - { // weight decay - r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii]; - r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii]; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - MATH_T update = (next_m_unbiased / denom) + (decay * r_p[ii]); - r_p[ii] = r_p[ii] - (lr * update); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - p[i] = r_p[ii]; - m[i] = r_m[ii]; - v[i] = r_v[ii]; - } - } - } - } -}; - -void multi_tensor_adam_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - const float lr, - const float beta1, - const float beta2, - const float epsilon, - const int step, - const int mode, - const int bias_correction, - const float weight_decay) -{ - using namespace at; - - // Handle bias correction mode - float bias_correction1 = 1.0f, bias_correction2 = 1.0f; - if (bias_correction == 1) - { - bias_correction1 = 1 - ::pow(beta1, step); - bias_correction2 = 1 - ::pow(beta2, step); - } - - // Assume single type across p,g,m1,m2 now - DISPATCH_DOUBLE_FLOAT_AND_HALF( - tensor_lists[0][0].scalar_type(), 0, "adam", - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - AdamFunctor(), - beta1, - beta2, - bias_correction1, - bias_correction2, - epsilon, - lr, - (adamMode_t)mode, - weight_decay);) - - AT_CUDA_CHECK(hipGetLastError()); -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh b/colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh deleted file mode 100644 index 24ff519a763943487c0542833a06e1e0a2d8403d..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh +++ /dev/null @@ 
-1,135 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_apply.cuh -#include -#include -#include -#include -#include -#include "../../hip_native/csrc/compat.h" - -#include - -// #include - -// This header is the one-stop shop for all your multi-tensor apply needs. - -// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson) -constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30}; -constexpr int depth_to_max_blocks[5] = {320, 320, 320, 320, 320}; - -template -struct TensorListMetadata -{ - void *addresses[n][depth_to_max_tensors[n - 1]]; - int sizes[depth_to_max_tensors[n - 1]]; - unsigned char block_to_tensor[depth_to_max_blocks[n - 1]]; - int block_to_chunk[depth_to_max_blocks[n - 1]]; // I fear this needs to be a full int. - int start_tensor_this_launch; -}; - -template -__global__ void multi_tensor_apply_kernel( - int chunk_size, - volatile int *noop_flag, - T tl, - U callable, - ArgTypes... args) -{ - // Hand the chunk information to the user-supplied functor to process however it likes. - callable(chunk_size, noop_flag, tl, args...); -} - -template -void multi_tensor_apply( - int block_size, - int chunk_size, - const at::Tensor &noop_flag, - const std::vector> &tensor_lists, - T callable, - ArgTypes... args) -{ - TORCH_CHECK(tensor_lists.size() == depth, "tensor_lists.size() != depth"); - int len0 = tensor_lists[0].size(); - TORCH_CHECK(len0 > 0, "tensor_lists[0].size() is not > 0"); - auto ref_device = tensor_lists[0][0].device(); - TORCH_CHECK(ref_device.type() == at::kCUDA, "expected input to be on cuda"); - for (int l = 0; l < tensor_lists.size(); l++) // No range-based for because I need indices - { - TORCH_CHECK(tensor_lists[l].size() == len0, "Size mismatch among tensor lists"); - for (int t = 0; t < tensor_lists[l].size(); t++) - { - // TODO: Print which tensor fails. 
- bool contiguous_memory = tensor_lists[l][t].is_contiguous(); -#ifdef VERSION_GE_1_5 - contiguous_memory = (contiguous_memory || tensor_lists[l][t].is_contiguous(at::MemoryFormat::ChannelsLast)); -#endif - TORCH_CHECK(contiguous_memory, "A tensor was not contiguous."); - TORCH_CHECK(tensor_lists[l][t].device() == ref_device, "A tensor was not on the same device as the first tensor"); - TORCH_CHECK(tensor_lists[l][t].numel() == tensor_lists[0][t].numel(), "Size mismatch"); - } - } - - int ntensors = tensor_lists[0].size(); - - TensorListMetadata tl; - - const at::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(device_of(tensor_lists[0][0])); - auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA(); - - tl.start_tensor_this_launch = 0; - int loc_block_info = 0; - int loc_tensor_info = 0; - for (int t = 0; t < ntensors; t++) - { - tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel(); - for (int d = 0; d < depth; d++) - tl.addresses[d][loc_tensor_info] = tensor_lists[d][t].data_ptr(); - loc_tensor_info++; - - int chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size; - - for (int chunk = 0; chunk < chunks_this_tensor; chunk++) - { - // std::cout << chunks_this_tensor << std::endl; - tl.block_to_tensor[loc_block_info] = loc_tensor_info - 1; - tl.block_to_chunk[loc_block_info] = chunk; - loc_block_info++; - - bool tensors_full = (loc_tensor_info == depth_to_max_tensors[depth - 1] && - chunk == chunks_this_tensor - 1); - bool blocks_full = (loc_block_info == depth_to_max_blocks[depth - 1]); - bool last_chunk = (t == ntensors - 1 && chunk == chunks_this_tensor - 1); - if (tensors_full || blocks_full || last_chunk) - { - // using accscalar_t = acc_type; - hipLaunchKernelGGL(( multi_tensor_apply_kernel), dim3(loc_block_info), dim3(block_size), 0, stream, - chunk_size, - noop_flag.DATA_PTR(), - tl, - callable, - args...); - - AT_CUDA_CHECK(hipGetLastError()); - - // Reset. The control flow possibilities here make my brain hurt. - loc_block_info = 0; - if (chunk == chunks_this_tensor - 1) - { - // std::cout << "Hit case 1 " << cond1 << " " << cond2 << " " << cond3 << std::endl; - loc_tensor_info = 0; - tl.start_tensor_this_launch = t + 1; - } - else - { - // std::cout << "Hit case 2 " << cond1 << " " << cond2 << " " << cond3 << std::endl; - tl.sizes[0] = tl.sizes[loc_tensor_info - 1]; - for (int d = 0; d < depth; d++) - tl.addresses[d][0] = tl.addresses[d][loc_tensor_info - 1]; - loc_tensor_info = 1; - tl.start_tensor_this_launch = t; - } - } - } - } -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.hip b/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.hip deleted file mode 100644 index 28be4f49921447ca57d595df2293bd0ea2de6f6b..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.hip +++ /dev/null @@ -1,457 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
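The scheduling loop in `multi_tensor_apply` above packs (tensor, chunk) pairs into the metadata struct until a capacity is hit, then fires one kernel launch for the packed batch. A simplified host-side sketch of just that packing logic (capacities and names are illustrative; it models only the block-count limit, not the tensor-slot limit or the address table):

```cpp
#include <cstddef>
#include <vector>

// Mirrors block_to_tensor / block_to_chunk from TensorListMetadata.
struct ChunkBatch {
  std::vector<int> block_to_tensor;
  std::vector<int> block_to_chunk;
};

// Cut each tensor into fixed-size chunks; flush one launch whenever the
// block table fills up or the final chunk is reached.
template <typename LaunchFn>
void schedule_chunks(const std::vector<std::size_t> &numels, int chunk_size,
                     int max_blocks, LaunchFn launch) {
  ChunkBatch batch;
  for (std::size_t t = 0; t < numels.size(); ++t) {
    int chunks = (int)((numels[t] + chunk_size - 1) / chunk_size);
    for (int c = 0; c < chunks; ++c) {
      batch.block_to_tensor.push_back((int)t);
      batch.block_to_chunk.push_back(c);
      bool blocks_full = (int)batch.block_to_tensor.size() == max_blocks;
      bool last_chunk = (t + 1 == numels.size()) && (c + 1 == chunks);
      if (blocks_full || last_chunk) {
        launch(batch);         // one multi_tensor_apply_kernel launch
        batch = ChunkBatch{};  // reset and keep packing
      }
    }
  }
}
```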
-#include "hip/hip_runtime.h" -// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_l2norm_kernel.cu -#include -#include -#include -#include -#include -// Another possibility: -// #include - -#include - -#include "../../hip_native/csrc/type_shim.h" -#include "../../hip_native/csrc/multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -template -__device__ __forceinline__ bool is_aligned(T *p) -{ - return ((uint64_t)p) % (ILP * sizeof(T)) == 0; -} - -template -__device__ __forceinline__ void load_store(T *dst, T *src, int dst_offset, int src_offset) -{ - typedef typename std::aligned_storage::type LT; - ((LT *)dst)[dst_offset] = ((LT *)src)[src_offset]; -} - -template -struct L2NormFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<1> &tl, - float *output, - float *output_per_tensor, - bool per_tensor, - int max_chunks_per_tensor) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - x_t *x = (x_t *)tl.addresses[0][tensor_loc]; - x += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - __shared__ float s_vals[512]; - - float vals[ILP]; // = {0}; // this probably works too but I want to be sure... - x_t r_x[ILP]; - for (int i = 0; i < ILP; i++) - { - vals[i] = 0.f; - r_x[i] = 0; - } - - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(x)) - { - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_x, x, 0, i_start); -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - float next = static_cast(r_x[ii]); - vals[ii] += next * next; - } - } - } - else - { - for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) - { -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - float next = static_cast(x[i]); - vals[ii] += next * next; - } - } - } - } - - float val = 0.f; - for (int i = 0; i < ILP; i++) - val += vals[i]; - - float final = reduce_block_into_lanes(s_vals, val); - - if (threadIdx.x == 0) - { - if (!isfinite(final)) - *noop_gmem = 1; // Blindly fire off a write. These will race but that's ok. - output[blockIdx.x] += final; - if (per_tensor) - output_per_tensor[(tl.start_tensor_this_launch + tensor_loc) * max_chunks_per_tensor + chunk_idx] = final; - } - } -}; - -// Probably better to template, but since we are not likely to support other norm -template -struct MaxNormFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<1> &tl, - float *output, - float *output_per_tensor, - bool per_tensor, - int max_chunks_per_tensor) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - x_t *x = (x_t *)tl.addresses[0][tensor_loc]; - x += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - __shared__ float s_vals[512]; - - float vals[ILP]; // = {0}; // this probably works too but I want to be sure... 
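// NOTE: vals[] and r_x[] below give each thread ILP (== 4) private
// accumulators. The aligned fast path pulls ILP elements per iteration with
// one vectorized load_store(); the unaligned fallback strides through the
// chunk by blockDim.x instead. The ILP partial results are folded into a
// single value after the loop, then block-reduced.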
- x_t r_x[ILP]; - for (int i = 0; i < ILP; i++) - { - vals[i] = 0.f; - r_x[i] = 0; - } - - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(x)) - { - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_x, x, 0, i_start); -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - float next = static_cast(r_x[ii]); - vals[ii] = fmaxf(fabsf(vals[ii]), fabsf(next)); - } - } - } - else - { - for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) - { -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - float next = static_cast(x[i]); - vals[ii] = fmaxf(fabsf(vals[ii]), fabsf(next)); - } - } - } - } - - float val = 0.f; - for (int i = 0; i < ILP; i++) - val = fmaxf(fabsf(val), fabsf(vals[i])); - - float final = reduce_block_into_lanes_max_op(s_vals, val); - - if (threadIdx.x == 0) - { - if (!isfinite(final)) - *noop_gmem = 1; // Blindly fire off a write. These will race but that's ok. - output[blockIdx.x] = fmaxf(fabsf(output[blockIdx.x]), fabsf(final)); - if (per_tensor) - output_per_tensor[(tl.start_tensor_this_launch + tensor_loc) * max_chunks_per_tensor + chunk_idx] = final; - } - } -}; - -__global__ void cleanup( - float *output, - float *output_per_tensor, - float *ret, - float *ret_per_tensor, - bool per_tensor, - int max_chunks_per_tensor) -{ - __shared__ float vals[512]; - - if (blockIdx.x == 0) - { - float val = 0; - if (threadIdx.x < 320) - val = output[threadIdx.x]; - - float final = reduce_block_into_lanes(vals, val); - - if (threadIdx.x == 0) - *ret = sqrt(final); - } - - if (per_tensor) - { - float *output_this_tensor = output_per_tensor + blockIdx.x * max_chunks_per_tensor; - - float val = 0; - for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) - val += output_this_tensor[i]; - - float final = reduce_block_into_lanes(vals, val); - - if (threadIdx.x == 0) - ret_per_tensor[blockIdx.x] = sqrt(final); - } -} - -__global__ void cleanup_v2( - float *output, - float *output_per_tensor, - float *ret, - float *ret_per_tensor, - bool per_tensor, - int max_chunks_per_tensor, - int norm_type, - float alpha, - float beta) -{ - __shared__ float vals[512]; - - if (blockIdx.x == 0) - { - float val = 0; - if (threadIdx.x < 320) - val = output[threadIdx.x]; - - if (norm_type == 0) - { - float final = reduce_block_into_lanes_max_op(vals, val); - if (threadIdx.x == 0) - *ret = alpha * (*ret) + beta * final; - } - else - { - float final = reduce_block_into_lanes(vals, val); - if (threadIdx.x == 0) - *ret = sqrt(alpha * (*ret) * (*ret) + beta * final); - } - } - - if (per_tensor) - { - float *output_this_tensor = output_per_tensor + blockIdx.x * max_chunks_per_tensor; - - if (norm_type == 0) - { - float val = 0; - for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) - val = fmaxf(fabsf(val), fabsf(output_this_tensor[i])); - - float final = reduce_block_into_lanes_max_op(vals, val); - - if (threadIdx.x == 0) - ret_per_tensor[blockIdx.x] = alpha * ret_per_tensor[blockIdx.x] + beta * final; - } - else - { - float val = 0; - for (int i = threadIdx.x; i < max_chunks_per_tensor; i += blockDim.x) - val += output_this_tensor[i]; - - float final = reduce_block_into_lanes(vals, val); - - if (threadIdx.x == 0) - ret_per_tensor[blockIdx.x] = sqrt(alpha * ret_per_tensor[blockIdx.x] * 
ret_per_tensor[blockIdx.x] + beta * final); - } - } -} - -std::tuple multi_tensor_l2norm_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::optional per_tensor_python) -{ - bool per_tensor = per_tensor_python.has_value() ? per_tensor_python.value() : false; - - auto float_options = tensor_lists[0][0].options().dtype(at::kFloat); - auto output = at::zeros({320}, float_options); - - at::Tensor output_per_tensor; - at::Tensor ret_per_tensor; - - int ntensors = tensor_lists[0].size(); - int max_chunks_per_tensor = -1; - - if (per_tensor) - { - for (int t = 0; t < ntensors; t++) - { - int max_chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size; - if (max_chunks_this_tensor > max_chunks_per_tensor) - max_chunks_per_tensor = max_chunks_this_tensor; - } - output_per_tensor = at::zeros({ntensors * max_chunks_per_tensor}, float_options); - ret_per_tensor = at::empty({ntensors}, float_options); - } - else - { - ret_per_tensor = at::empty({0}, float_options); - } - - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "multi_tensor_l2norm_cuda", - multi_tensor_apply<1>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - L2NormFunctor(), - output.DATA_PTR(), - per_tensor ? output_per_tensor.DATA_PTR() : nullptr, - per_tensor, - max_chunks_per_tensor);) - - AT_CUDA_CHECK(hipGetLastError()); - // AT_CUDA_CHECK(hipDeviceSynchronize()); - - // This involves one more small kernel launches, but will be negligible end to end. - // I could get rid of these by hacking the functor + multi tensor harness with persistence - // logic, but keeping it simple for now - auto ret = at::empty({1}, output.options()); - const at::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(device_of(output)); - auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA(); - hipLaunchKernelGGL(( cleanup), dim3(per_tensor ? ntensors : 1), dim3(512), 0, stream, - output.DATA_PTR(), - per_tensor ? output_per_tensor.DATA_PTR() : nullptr, - ret.DATA_PTR(), - per_tensor ? 
ret_per_tensor.DATA_PTR() : nullptr, - per_tensor, - max_chunks_per_tensor); - - return std::tuple(ret, ret_per_tensor); -} - -// Compute and update grad norm -// Here use a per tensor norm, and blend new norm(n) and old norm(gn) by -// L-2: gn = sqrt(a * gn^2 + b * n^2) -// L-inf: gn = a * gn + b * n -void multi_tensor_norm_out_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::Tensor out, - const float alpha, - const float beta, - const int norm_type) -{ - auto float_options = tensor_lists[0][0].options().dtype(at::kFloat); - TORCH_CHECK(tensor_lists[0][0].device() == noop_flag.device(), "noop flag should be on the same device as tensors"); - // we don't need global thus uses empty here - auto output = at::empty({320}, float_options); - - at::Tensor output_per_tensor; - at::Tensor ret_per_tensor; - - int ntensors = tensor_lists[0].size(); - int max_chunks_per_tensor = -1; - - for (int t = 0; t < ntensors; t++) - { - int max_chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size; - if (max_chunks_this_tensor > max_chunks_per_tensor) - max_chunks_per_tensor = max_chunks_this_tensor; - } - - // Although it is single write then read, still need to be zero - // Since tailing element also participate cleanup - output_per_tensor = at::zeros({ntensors * max_chunks_per_tensor}, float_options); - - if (norm_type == 0) - { - DISPATCH_FLOAT_AND_HALF( - tensor_lists[0][0].scalar_type(), 0, "multi_tensor_maxnorm_cuda", - multi_tensor_apply<1>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - MaxNormFunctor(), - output.DATA_PTR(), - output_per_tensor.DATA_PTR(), - true, - max_chunks_per_tensor);) - } - else - { - DISPATCH_FLOAT_AND_HALF( - tensor_lists[0][0].scalar_type(), 0, "multi_tensor_l2norm_cuda", - multi_tensor_apply<1>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - L2NormFunctor(), - output.DATA_PTR(), - output_per_tensor.DATA_PTR(), - true, - max_chunks_per_tensor);) - } - AT_CUDA_CHECK(hipGetLastError()); - - // AT_CUDA_CHECK(hipDeviceSynchronize()); - - // This involves one more small kernel launches, but will be negligible end to end. - // I could get rid of these by hacking the functor + multi tensor harness with persistence - // logic, but keeping it simple for now - auto ret = at::empty({1}, output.options()); - - // Adding the following device guard since it happens sometimes that the - // tensors are on one device and the cuda stream is on another device which - // results in ILLEGAL MEM ACCESS error. - const at::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(device_of(output)); - auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA(); - hipLaunchKernelGGL(( cleanup_v2), dim3(ntensors), dim3(512), 0, stream, - output.DATA_PTR(), - output_per_tensor.DATA_PTR(), - ret.DATA_PTR(), - out.DATA_PTR(), - true, - max_chunks_per_tensor, - norm_type, - alpha, - beta); - - return; -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.hip b/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.hip deleted file mode 100644 index a7e6bde616cf84516e3080e0e8dfbb0da88fa375..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.hip +++ /dev/null @@ -1,428 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
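The blending rule documented above for `multi_tensor_norm_out_cuda` and applied in `cleanup_v2` can be written as a one-line host function. A minimal sketch (the helper name and sample values are illustrative, not from the deleted file):

```cpp
#include <cmath>
#include <cstdio>

// Blend a freshly computed norm n into the running norm gn.
// norm_type == 0 selects the L-inf rule, anything else the L-2 rule,
// matching the norm_type dispatch in cleanup_v2 above.
float blend_norm(float gn, float n, float alpha, float beta, int norm_type) {
    if (norm_type == 0)
        return alpha * gn + beta * n;                       // L-inf: a*gn + b*n
    return std::sqrt(alpha * gn * gn + beta * n * n);       // L-2: sqrt(a*gn^2 + b*n^2)
}

int main() {
    // alpha = 0, beta = 1 replaces the old norm entirely
    printf("%f\n", blend_norm(3.0f, 4.0f, 0.0f, 1.0f, 2));  // 4.000000
    // alpha = beta = 1 blends the squared L2 norms: sqrt(9 + 16) = 5
    printf("%f\n", blend_norm(3.0f, 4.0f, 1.0f, 1.0f, 2));  // 5.000000
    return 0;
}
```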
-// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_lamb.cu -#include -#include -#include -#include -// Another possibility: -// #include - -#include - -#include "../../hip_native/csrc/type_shim.h" -#include "../../hip_native/csrc/multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -template -__device__ __forceinline__ bool is_aligned(T *p) -{ - return ((uint64_t)p) % (ILP * sizeof(T)) == 0; -} - -template -__device__ __forceinline__ void load_store(T *dst, T *src, int dst_offset, int src_offset) -{ - typedef typename std::aligned_storage::type LT; - ((LT *)dst)[dst_offset] = ((LT *)src)[src_offset]; -} - -typedef enum -{ - MOMENT_MODE_0 = 0, // L2 regularization mode - MOMENT_MODE_1 = 1 // Decoupled weight decay mode -} adamMode_t; - -std::tuple multi_tensor_l2norm_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - at::optional per_tensor_python); - -using MATH_T = float; - -template -struct LAMBStage1Functor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<4> &tl, - const float beta1, - const float beta2, - const float beta3, - const float beta1_correction, - const float beta2_correction, - const float epsilon, - adamMode_t mode, - const float decay, - const float *global_grad_norm, - const float max_global_grad_norm) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - float clipped_global_grad_norm = (*global_grad_norm) > max_global_grad_norm ? (*global_grad_norm) / max_global_grad_norm : 1.0f; - - T *g = (T *)tl.addresses[0][tensor_loc]; - g += chunk_idx * chunk_size; - - T *p = (T *)tl.addresses[1][tensor_loc]; - p += chunk_idx * chunk_size; - - T *m = (T *)tl.addresses[2][tensor_loc]; - m += chunk_idx * chunk_size; - - T *v = (T *)tl.addresses[3][tensor_loc]; - v += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - MATH_T r_g[ILP]; - MATH_T r_p[ILP]; - MATH_T r_m[ILP]; - MATH_T r_v[ILP]; - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && - chunk_size % ILP == 0 && - is_aligned(g) && - is_aligned(p) && - is_aligned(m) && - is_aligned(v)) - { - T l_g[ILP]; - T l_p[ILP]; - T l_m[ILP]; - T l_v[ILP]; - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(l_g, g, 0, i_start); - if (decay != 0) - load_store(l_p, p, 0, i_start); - load_store(l_m, m, 0, i_start); - load_store(l_v, v, 0, i_start); - // unpack -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - r_g[ii] = l_g[ii]; - if (decay == 0) - { - r_p[ii] = MATH_T(0); - } - else - { - r_p[ii] = l_p[ii]; - } - r_m[ii] = l_m[ii]; - r_v[ii] = l_v[ii]; - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - if (mode == MOMENT_MODE_0) - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - // L2 on scaled grad - scaled_grad = scaled_grad + decay * r_p[ii]; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = next_m_unbiased / denom; - } - else - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - r_m[ii] = r_m[ii] * beta1 + beta3 * 
scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = (next_m_unbiased / denom) + (decay * r_p[ii]); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - l_p[ii] = r_p[ii]; - l_m[ii] = r_m[ii]; - l_v[ii] = r_v[ii]; - } - // store - load_store(g, l_p, i_start, 0); - load_store(m, l_m, i_start, 0); - load_store(v, l_v, i_start, 0); - } - } - else - { - // see note in multi_tensor_scale_kernel.cu - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { - MATH_T r_g[ILP]; - MATH_T r_p[ILP]; - MATH_T r_m[ILP]; - MATH_T r_v[ILP]; -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - r_g[ii] = g[i]; - // special ?optimization? for lamb stage 1 - if (decay == 0) - { - r_p[ii] = MATH_T(0); - } - else - { - r_p[ii] = p[i]; - } - r_m[ii] = m[i]; - r_v[ii] = v[i]; - } - else - { - r_g[ii] = MATH_T(0); - r_p[ii] = MATH_T(0); - r_m[ii] = MATH_T(0); - r_v[ii] = MATH_T(0); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - if (mode == MOMENT_MODE_0) - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - // L2 on scaled grad - scaled_grad = scaled_grad + decay * r_p[ii]; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = next_m_unbiased / denom; - } - else - { - MATH_T scaled_grad = r_g[ii] / clipped_global_grad_norm; - r_m[ii] = r_m[ii] * beta1 + beta3 * scaled_grad; - r_v[ii] = r_v[ii] * beta2 + (1 - beta2) * scaled_grad * scaled_grad; - MATH_T next_m_unbiased = r_m[ii] / beta1_correction; - MATH_T next_v_unbiased = r_v[ii] / beta2_correction; - MATH_T denom = sqrtf(next_v_unbiased) + epsilon; - r_p[ii] = (next_m_unbiased / denom) + (decay * r_p[ii]); - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - g[i] = r_p[ii]; - m[i] = r_m[ii]; - v[i] = r_v[ii]; - } - } - } - } - } -}; - -// Step 2 reads in 'update' value and per-tensor param_norm and update_norm. -// It computes new parameter value. -template -struct LAMBStage2Functor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata<2> &tl, - const float *per_tensor_param_norm, - const float *per_tensor_update_norm, - const float learning_rate, - const float decay, - bool use_nvlamb) - { - // I'd like this kernel to propagate infs/nans. - // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int tensor_num = tl.start_tensor_this_launch + tensor_loc; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - MATH_T ratio = learning_rate; - // nvlamb: apply adaptive learning rate to all parameters - // otherwise, only apply to those with non-zero weight decay - if (use_nvlamb || (decay != 0.0)) - { - float param_norm = per_tensor_param_norm[tensor_num]; - float update_norm = per_tensor_update_norm[tensor_num]; - ratio = (update_norm != 0.0f && param_norm != 0.0f) ? 
learning_rate * (param_norm / update_norm) : learning_rate; - } - - T *update = (T *)tl.addresses[0][tensor_loc]; - update += chunk_idx * chunk_size; - - T *p = (T *)tl.addresses[1][tensor_loc]; - p += chunk_idx * chunk_size; - - n -= chunk_idx * chunk_size; - - // to make things simple, we put aligned case in a different code path - if (n % ILP == 0 && - chunk_size % ILP == 0 && - is_aligned(p) && - is_aligned(update)) - { - T r_p[ILP]; - T r_update[ILP]; - for (int i_start = threadIdx.x; i_start * ILP < n && i_start * ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_p, p, 0, i_start); - load_store(r_update, update, 0, i_start); -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - r_p[ii] = static_cast(r_p[ii]) - (ratio * static_cast(r_update[ii])); - } - load_store(p, r_p, i_start, 0); - } - } - else - { - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { - MATH_T r_p[ILP]; - MATH_T r_update[ILP]; -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - r_p[ii] = p[i]; - r_update[ii] = update[i]; - } - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - r_p[ii] = r_p[ii] - (ratio * r_update[ii]); - } -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - p[i] = r_p[ii]; - } - } - } - } - } -}; - -void multi_tensor_lamb_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - const float lr, - const float beta1, - const float beta2, - const float epsilon, - const int step, - const int bias_correction, - const float weight_decay, - const int grad_averaging, - const int mode, - at::Tensor global_grad_norm, - const float max_grad_norm, - at::optional use_nvlamb_python) -{ - using namespace at; - // Master weight and 32bit momentum(potentially changing) is not handled by this - // So we assume every tensor are all in the same type - - bool use_nvlamb = use_nvlamb_python.has_value() ? 
use_nvlamb_python.value() : false; - - // Handle bias correction mode - float bias_correction1 = 1.0f, bias_correction2 = 1.0f; - if (bias_correction == 1) - { - bias_correction1 = 1 - ::pow(beta1, step); - bias_correction2 = 1 - ::pow(beta2, step); - } - - // Handle grad averaging mode - float beta3 = 1.0f; - if (grad_averaging == 1) - beta3 = 1 - beta1; - - std::vector> grad_list(tensor_lists.begin(), tensor_lists.begin() + 1); - std::vector> param_list(tensor_lists.begin() + 1, tensor_lists.begin() + 2); - - // Compute per tensor param norm - auto param_norm_tuple = multi_tensor_l2norm_cuda(chunk_size, noop_flag, param_list, true); - - // We now in-place modify grad to store update before compute its norm - // Generally this is not a issue since people modify grad in step() method all the time - // We can also grab list of empty tensor to avoid this, but I'd like to save space/cpu code - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "lamb_stage_1", - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - LAMBStage1Functor(), - beta1, - beta2, - beta3, // 1-beta1 or 1 depends on averaging mode - bias_correction1, - bias_correction2, - epsilon, - (adamMode_t)mode, - weight_decay, - global_grad_norm.DATA_PTR(), - max_grad_norm);) - - // Compute update norms - auto update_norm_tuple = multi_tensor_l2norm_cuda(chunk_size, noop_flag, grad_list, true); - - std::vector> grad_param_list(tensor_lists.begin(), tensor_lists.begin() + 2); - - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "lamb_stage_2", - multi_tensor_apply<2>( - BLOCK_SIZE, - chunk_size, - noop_flag, - grad_param_list, - LAMBStage2Functor(), - std::get<1>(param_norm_tuple).DATA_PTR(), - std::get<1>(update_norm_tuple).DATA_PTR(), - lr, - weight_decay, - use_nvlamb);) - - AT_CUDA_CHECK(hipGetLastError()); -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.hip b/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.hip deleted file mode 100644 index e310ce939252d860d0c1a824073501289e51b131..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.hip +++ /dev/null @@ -1,137 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include -#include -#include -#include -// Another possibility: -// #include - -#include -// Stringstream is a big hammer, but I want to rely on operator<< for dtype. -#include - -#include "../../hip_native/csrc/type_shim.h" -#include "../../hip_native/csrc/multi_tensor_apply.cuh" - -#define BLOCK_SIZE 512 -#define ILP 4 - -template -__device__ __forceinline__ bool is_aligned(T* p){ - return ((uint64_t)p) % (ILP*sizeof(T)) == 0; -} - -template -__device__ __forceinline__ void load_store(T* dst, T* src, int dst_offset, int src_offset){ - typedef typename std::aligned_storage::type LT; - ((LT*)dst)[dst_offset] = ((LT*)src)[src_offset]; -} - -template -struct ScaleFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int* noop_gmem, - TensorListMetadata<2>& tl, - float scale) - { - // I'd like this kernel to propagate infs/nans. 
- // if(*noop_gmem == 1) - // return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - in_t* in = (in_t*)tl.addresses[0][tensor_loc]; - in += chunk_idx*chunk_size; - - out_t* out = (out_t*)tl.addresses[1][tensor_loc]; - out += chunk_idx*chunk_size; - - n -= chunk_idx*chunk_size; - - bool finite = true; - in_t r_in[ILP]; - out_t r_out[ILP]; - - // to make things simple, we put aligned case in a different code path - if(n % ILP == 0 && chunk_size % ILP == 0 && is_aligned(in) && is_aligned(out)) - { - for(int i_start = threadIdx.x; i_start*ILP < n && i_start*ILP < chunk_size; i_start += blockDim.x) - { - // load - load_store(r_in, in, 0 , i_start); -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - r_out[ii] = static_cast(r_in[ii]) * scale; - finite = finite && isfinite(r_in[ii]); - } - // store - load_store(out, r_out, i_start, 0); - } - } - else - { - // Non-divergent exit condition for __syncthreads, not necessary here - for(int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x*ILP) - { -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - r_in[ii] = 0; - int i = i_start + threadIdx.x + ii*blockDim.x; - if(i < n && i < chunk_size) - r_in[ii] = in[i]; - } - // note for clarification to future michael: - // From a pure memory dependency perspective, there's likely no point unrolling - // the write loop, since writes just fire off once their LDGs arrive. - // Put another way, the STGs are dependent on the LDGs, but not on each other. - // There is still compute ILP benefit from unrolling the loop though. -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - r_out[ii] = static_cast(r_in[ii]) * scale; - finite = finite && isfinite(r_in[ii]); - } -#pragma unroll - for(int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii*blockDim.x; - if(i < n && i < chunk_size) - out[i] = r_out[ii]; - } - } - } - if(!finite) - *noop_gmem = 1; // Blindly fire off a write. These will race but that's ok. - } -}; - -void multi_tensor_scale_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - float scale) -{ - using namespace at; - // The output (downscaled) type is always float. - // If build times suffer, think about where to put this dispatch, - // and what logic should be moved out of multi_tensor_apply. - - DISPATCH_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(), 0, "multi_tensor_scale_cuda", - DISPATCH_FLOAT_AND_HALF(tensor_lists[1][0].scalar_type(), 1, "multi_tensor_scale_cuda", - multi_tensor_apply<2>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - ScaleFunctor(), - scale); )) - AT_CUDA_CHECK(hipGetLastError()); - - // AT_CUDA_CHECK(hipDeviceSynchronize()); -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.hip b/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.hip deleted file mode 100644 index 89d5bcb57e5dd4d04339921c8093757b18abb618..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.hip +++ /dev/null @@ -1,283 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
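Before the fused SGD kernel below, it is worth pinning down what `ScaleFunctor` above computes per element. A scalar reference (a sketch of the semantics, not the kernel itself; types simplified to float):

```cpp
#include <cmath>

// Every element is multiplied by `scale`, and a shared no-op flag is raised
// if any input is inf/nan so the caller can discard the whole step.
void scale_reference(const float* in, float* out, int n, float scale,
                     int* noop_flag) {
    bool finite = true;
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * scale;
        finite = finite && std::isfinite(in[i]);
    }
    if (!finite)
        *noop_flag = 1;  // mirrors the "blindly fire off a write" in the kernel
}
```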
-// modified from https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_sgd_kernel.cu -#include -#include -#include -#include -#include "../../hip_native/csrc/multi_tensor_apply.cuh" -#include "../../hip_native/csrc/compat.h" - -#include -#include - -#define BLOCK_SIZE 512 -#define ILP 4 - -/** - * Perform fused SGD on multiple buffers - * N: number of tensors - * tl[0] : gradients - * tl[1] : weights - * tl[2] : momentum buffers - * tl[3] : fp16 weights (if appropriate) - * wd : weight_decay (scalar) - * momentum : momentum (scalar) - * dampening : momentum dampening (scalar) - * lr : learning rate (scalar) - * nesterov : enable nesterov (bool) - * first run : necessary for proper momentum handling & init - * wd_after_momentum : apply weight decay _after_ momentum instead of before - **/ -template -struct SGDFunctor -{ - __device__ __forceinline__ void operator()( - int chunk_size, - volatile int *noop_gmem, - TensorListMetadata &tl, - float wd, - float momentum, - float dampening, - float lr, - bool nesterov, - bool first_run, - bool wd_after_momentum, - float scale) - { - // Early exit if we don't need to do anything - if (*noop_gmem) - return; - - int tensor_loc = tl.block_to_tensor[blockIdx.x]; - int chunk_idx = tl.block_to_chunk[blockIdx.x]; - int n = tl.sizes[tensor_loc]; - - T_grad *grad_in = (T_grad *)tl.addresses[0][tensor_loc]; - grad_in += chunk_idx * chunk_size; - - T_weight *weight_in = (T_weight *)tl.addresses[1][tensor_loc]; - weight_in += chunk_idx * chunk_size; - - T_weight *mom_in = (T_weight *)tl.addresses[2][tensor_loc]; - mom_in += chunk_idx * chunk_size; - - at::Half *model_weights_out = nullptr; - if (N == 4) - { - model_weights_out = (at::Half *)tl.addresses[3][tensor_loc]; - model_weights_out += chunk_idx * chunk_size; - } - - n -= chunk_idx * chunk_size; - - // Non-divergent exit condition for the __syncthreads - float incoming_grads[ILP]; - float incoming_weights[ILP]; - float incoming_moms[ILP]; - for (int i_start = 0; - i_start < n && i_start < chunk_size; - i_start += blockDim.x * ILP) - { -#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - incoming_grads[ii] = 0; - incoming_weights[ii] = 0; - incoming_moms[ii] = 0; - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - incoming_grads[ii] = static_cast(grad_in[i]) * scale; - incoming_weights[ii] = static_cast(weight_in[i]); - incoming_moms[ii] = static_cast(mom_in[i]); - } - } - -// note for clarification to future michael: -// From a pure memory dependency perspective, there's likely no point unrolling -// the write loop, since writes just fire off once their LDGs arrive. -// Put another way, the STGs are dependent on the LDGs, but not on each other. -// There is still compute ILP benefit from unrolling the loop though. 
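// NOTE: in scalar form, the unrolled loop below applies, per element:
//   g += wd * w                      (weight decay before momentum, if chosen)
//   m  = first_run ? g : momentum * m + (1 - dampening) * g
//   g  = nesterov ? g + momentum * m : m
//   g += wd * w                      (weight decay after momentum, if chosen)
//   w -= lr * g
// then optionally writes an fp16 copy of w (N == 4) and the updated m.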
-#pragma unroll - for (int ii = 0; ii < ILP; ii++) - { - int i = i_start + threadIdx.x + ii * blockDim.x; - if (i < n && i < chunk_size) - { - // apply weight decay before momentum if necessary - if (wd != 0.f && !wd_after_momentum) - incoming_grads[ii] += wd * incoming_weights[ii]; - - if (momentum != 0.f) - { - if (!first_run) - incoming_moms[ii] = incoming_moms[ii] * momentum + (1.f - dampening) * incoming_grads[ii]; - else // initialize momentums to current incoming grads - incoming_moms[ii] = incoming_grads[ii]; - - if (nesterov) - incoming_grads[ii] += momentum * incoming_moms[ii]; - else - incoming_grads[ii] = incoming_moms[ii]; - } - - // Apply WD after momentum if desired - if (wd != 0.f && wd_after_momentum) - incoming_grads[ii] += wd * incoming_weights[ii]; - - // adjust the weight and write out - weight_in[i] += (-lr * incoming_grads[ii]); - - // if necessary, write out an fp16 copy of the weights - if (N == 4) - model_weights_out[i] = static_cast(weight_in[i]); - - // also write out the new momentum - if (momentum != 0.f) - mom_in[i] = incoming_moms[ii]; - } - } - } - } -}; - -void multi_tensor_sgd_cuda( - int chunk_size, - at::Tensor noop_flag, - std::vector> tensor_lists, - float wd, - float momentum, - float dampening, - float lr, - bool nesterov, - bool first_run, - bool wd_after_momentum, - float scale) -{ - auto num_tensors = tensor_lists.size(); - auto grad_type = tensor_lists[0][0].scalar_type(); - auto weight_type = tensor_lists[1][0].scalar_type(); - - if (num_tensors == 4) - for (int i = 0; i < tensor_lists[3].size(); i++) - TORCH_CHECK(tensor_lists[3][i].scalar_type() == at::ScalarType::Half, - "Additional output tensors should always be fp16."); - - TORCH_CHECK(noop_flag.device() == tensor_lists[0][0].device(), "expected noop flag to be on the same device as tensors"); - - // We have 3 possibilities to handle here, in terms of - // grad_type, param_type, momentum_type, requires_fp16_copy - // 1. fp16, fp16, fp16, No - // 2. fp32, fp32, fp32, No - // 3. fp16, fp32, fp32, Yes - // 4. fp32, fp32, fp32, Yes // this is the materialize_master_grads=True case - // It's easier to hardcode these possibilities than to use - // switches etc. to handle the cross-product of cases where - // we don't want the majority of them. - - // Case 1. fp16, fp16, fp16, No - if (grad_type == at::ScalarType::Half && - weight_type == at::ScalarType::Half && - num_tensors == 3) - { - multi_tensor_apply<3>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<3, at::Half, at::Half>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - // Case 2. fp16, fp32, fp32, No - // else if (grad_type == at::ScalarType::Half && - // weight_type == at::ScalarType::Float && - // num_tensors == 3) { - // multi_tensor_apply<3>( - // BLOCK_SIZE, - // chunk_size, - // noop_flag, - // tensor_lists, - // SGDFunctor<3, at::Half, float>(), - // wd, - // momentum, - // dampening, - // lr, - // nesterov, - // first_run, - // wd_after_momentum); - // } - // Case 2. fp32, fp32, fp32, No - else if (grad_type == at::ScalarType::Float && - weight_type == at::ScalarType::Float && - num_tensors == 3) - { - multi_tensor_apply<3>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<3, float, float>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - // Case 3. 
fp16, fp32, fp32, Yes - else if (grad_type == at::ScalarType::Half && - weight_type == at::ScalarType::Float && - num_tensors == 4) - { - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<4, at::Half, float>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - // Case 4. fp32, fp32, fp32, Yes - else if (grad_type == at::ScalarType::Float && - weight_type == at::ScalarType::Float && - num_tensors == 4) - { - multi_tensor_apply<4>( - BLOCK_SIZE, - chunk_size, - noop_flag, - tensor_lists, - SGDFunctor<4, float, float>(), - wd, - momentum, - dampening, - lr, - nesterov, - first_run, - wd_after_momentum, - scale); - } - else - { - AT_ERROR("multi_tensor_sgd only supports some combinations of gradient & weight types. Given: ", - "gradient: ", grad_type, ", weight: ", weight_type, ", num_lists: ", num_tensors); - } - - AT_CUDA_CHECK(hipGetLastError()); -} \ No newline at end of file diff --git a/colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp b/colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp deleted file mode 100644 index 22acc9f868429ecc9a7d48f3fbb9488afe4677a6..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp +++ /dev/null @@ -1,365 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "../../hip_native/csrc/multihead_attention_1d.h" - -#include -#include - -#include -#include - -#include "context.h" -#include "kernels.h" - -template -MultiHeadAttention::MultiHeadAttention(int layer_id, int max_batch_tokens, int max_seq_len, - int hidden_size, int num_heads, - float attn_prob_dropout_ratio, - float hidden_output_dropout_ratio, - bool pre_or_postLayerNorm) - : _layer_id(layer_id), - _max_batch_tokens(max_batch_tokens), - _max_seq_len(max_seq_len), - _hidden_size(hidden_size), - _heads(num_heads), - _training(true), - _pre_or_postLayerNorm(pre_or_postLayerNorm), - _qkv_linear(typename FeedForward::Config(3 * hidden_size, hidden_size)), - _attn_out_linear(typename FeedForward::Config(hidden_size, hidden_size)), - _attn_ln(typename Normalize_Layer::Config(hidden_size, false), _max_batch_tokens), - _softmax(typename Softmax::Config(num_heads)), - _attn_prob_dropout(typename Dropout::Config(attn_prob_dropout_ratio), - _max_batch_tokens * _heads * _max_seq_len), - _attn_dropout(typename Dropout::Config(hidden_output_dropout_ratio), - _max_batch_tokens * _hidden_size), - _attn_scores(typename StridedBatchGemm::Config((T(1.0) / T(sqrt(_hidden_size / _heads))), - T(0.0), rocblas_operation_transpose, rocblas_operation_none)), - _attn_context( - typename StridedBatchGemm::Config(T(1.0), T(0.0), rocblas_operation_none, rocblas_operation_none)) { - assert(_hidden_size % _heads == 0); -} - -template -MultiHeadAttention::~MultiHeadAttention() { - free_mem_buffer(); -} - -template -void MultiHeadAttention::attn_layer_fw(const T *input_ptr, const T *input_mask_ptr, - T *output_ptr, T *buffer) { - T *q_tf_ptr = _qkv_ptr; - T *k_tf_ptr = q_tf_ptr + _batch_dim / pg_size; - T *v_tf_ptr = k_tf_ptr + _batch_dim / pg_size; - - if (_pre_or_postLayerNorm) { - _attn_ln.Forward(_gemmQKV_inp_ptr, input_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, - _stream); - } - const T *gemmQKV_inp_ptr = _pre_or_postLayerNorm ? 
_gemmQKV_inp_ptr : input_ptr; - _qkv_linear.reset_size(3 * _hidden_size / pg_size, _hidden_size); - _qkv_linear.Forward(_batch_tokens, gemmQKV_inp_ptr, _attn_qkvw_ptr, buffer, _cublasHandle); - - launch_bias_add_transform_20314(q_tf_ptr, buffer, _attn_qkvb_ptr, _batch_size, _seq_len, 3, - _heads / pg_size, _hidden_size / _heads, _stream); - - // attention scores, q*k - _attn_scores.Forward(_batch_heads, _soft_out_ptr, k_tf_ptr, q_tf_ptr, _cublasHandle); - - // Softmax + Mask - _softmax.reset_size(_heads / pg_size); - _softmax.Forward(_soft_out_ptr, input_mask_ptr, _batch_size, _seq_len, _seq_len, _stream, true); - - // attn prob dropout. - _attn_prob_dropout.dropout(_ctx_bufB_ptr, _soft_out_ptr, _batch_heads * _seq_len * _seq_len, - _stream); - - // attention context, score * v - _attn_context.Forward(_batch_heads, buffer, v_tf_ptr, _ctx_bufB_ptr, _cublasHandle); - - // [b, nh, s, ad] -> [b, s, nh, ad] - launch_transform4d_0213(_attn_o_inp_ptr, buffer, _batch_size, _seq_len, _hidden_size / pg_size, - _heads / pg_size, 1, _stream); - - _attn_out_linear.reset_size(_hidden_size, _hidden_size / pg_size); - _attn_out_linear.Forward(_batch_tokens, _attn_o_inp_ptr, _attn_ow_ptr, output_ptr, _cublasHandle); - - // allreduce - if (pg == c10::detail::UniqueVoidPtr() || pg->getSize() == 1) { - } else { - auto data_type = torch::kFloat; - if (typeid(T) != typeid(float)) { - data_type = torch::kHalf; - } - auto output_tensor = - torch::from_blob(output_ptr, {int(_batch_size), int(_seq_len), int(_hidden_size)}, - torch::TensorOptions(torch::kCUDA).dtype(data_type)); - std::vector allreduce_tensors = {output_tensor}; - auto work = pg->allreduce(allreduce_tensors, c10d::AllreduceOptions()); - work->wait(); - } - - _attn_dropout.bias_dropout_residual(output_ptr, output_ptr, input_ptr, _attn_ob_ptr, - _batch_tokens, _hidden_size, _stream); - if (!_pre_or_postLayerNorm) { - // in-place ln since ln-input will not be used in post-ln mode - _attn_ln.Forward(output_ptr, output_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, _stream); - } -} - -template -void MultiHeadAttention::Forward(const T *input_ptr, const T *input_mask_ptr, T *out_ptr) { - _stream = Context::Instance().get_stream(); - _cublasHandle = Context::Instance().get_cublashandle(); - T *attn_buffer = _shared_mem_ptr; // 3 * _batch_dim - - attn_layer_fw(input_ptr, input_mask_ptr, out_ptr, attn_buffer); -} - -template -void MultiHeadAttention::attn_layer_bw(const T *input_ptr, const T *input_mask_ptr, const T *output_ptr, - const T *grad_output_ptr, T *grad_input_ptr, T *buffer) { - hipStream_t streams[2] = {_stream, _stream}; - - const T *q_tf_ptr = _qkv_ptr; - const T *k_tf_ptr = q_tf_ptr + _batch_dim / pg_size; - const T *v_tf_ptr = k_tf_ptr + _batch_dim / pg_size; - // batch_dim = batch_size * seq_len * hidden_size - // buffer size: batch_dim * 3 + max(batch_dim * 3, - // batch_size * head_num * seq_len * seq_len) - T *grad_residual_ptr = buffer; - buffer += _batch_dim; - - T *grad_input_buf_ptr = buffer; // batch_dim - T *grad_qkv_5d_ptr = buffer; // batch_dim * 3 - buffer += 3 * _batch_dim / pg_size; - - T *grad_qkv_4d_ptr = buffer; // batch_dim * 3 - T *grad_softmax_ptr = buffer; // batch_size * head_num * seq_len * seq_len - // buffer += max(3 * _batch_dim, - // batch_size * head_num * seq_len * seq_len); - - if (_pre_or_postLayerNorm) { - _attn_dropout.d_bias_dropout_residual(grad_input_ptr, _grad_attn_ob_ptr, grad_output_ptr, - _batch_tokens, _hidden_size, _stream); - } else { - _attn_ln.Backward(_grad_attn_nw_ptr, _grad_attn_nb_ptr, 
grad_residual_ptr, grad_output_ptr, - nullptr, output_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, streams); - _attn_dropout.d_bias_dropout_residual(grad_input_ptr, _grad_attn_ob_ptr, grad_residual_ptr, - _batch_tokens, _hidden_size, _stream); - } - - // bw of output project - _attn_out_linear.reset_size(_hidden_size, _hidden_size / pg_size); - _attn_out_linear.Backward(_batch_tokens, grad_input_ptr, _attn_o_inp_ptr, _attn_ow_ptr, - _grad_attn_ow_ptr, _grad_attn_ob_ptr, _cublasHandle, _stream, - grad_input_buf_ptr, nullptr, false); - launch_transform_0213(grad_input_ptr, grad_input_buf_ptr, _batch_size, _seq_len, - _hidden_size / pg_size, _heads / pg_size, _stream); - - // bw of score * v - _attn_context.Backward(_batch_heads, grad_input_ptr, v_tf_ptr, _ctx_bufB_ptr, _cublasHandle, - grad_qkv_5d_ptr + 2 * _batch_dim / pg_size, grad_softmax_ptr); - - _attn_prob_dropout.d_dropout(grad_softmax_ptr, _batch_heads * _seq_len * _seq_len, _stream); - - _softmax.reset_size(_heads / pg_size); - _softmax.Backward(grad_softmax_ptr, _soft_out_ptr, _batch_size, _seq_len, _seq_len, _stream); - - // bw of q * k - _attn_scores.Backward(_batch_heads, grad_softmax_ptr, k_tf_ptr, q_tf_ptr, _cublasHandle, - grad_qkv_5d_ptr + _batch_dim / pg_size, grad_qkv_5d_ptr); - - // [3, b, nh, s, ad] -> [b, s, 3, h] - launch_transform4d_0213(grad_qkv_4d_ptr, grad_qkv_5d_ptr, _batch_size, _seq_len, - _hidden_size / pg_size, _heads / pg_size, 3, _stream); - - const T *gemmQKV_inp_ptr = _pre_or_postLayerNorm ? _gemmQKV_inp_ptr : input_ptr; - _qkv_linear.reset_size(3 * _hidden_size / pg_size, _hidden_size); - _qkv_linear.Backward(_batch_tokens, grad_qkv_4d_ptr, gemmQKV_inp_ptr, _attn_qkvw_ptr, - _grad_attn_qkvw_ptr, _grad_attn_qkvb_ptr, _cublasHandle, _stream, - grad_input_buf_ptr, nullptr, true); - - // allreduce - if (pg == c10::detail::UniqueVoidPtr() || pg->getSize() == 1) { - } else { - auto data_type = torch::kFloat; - if (typeid(T) != typeid(float)) { - data_type = torch::kHalf; - } - auto grad_input_tensor = - torch::from_blob(grad_input_buf_ptr, {int(_batch_size), int(_seq_len), int(_hidden_size)}, - torch::TensorOptions(torch::kCUDA).dtype(data_type)); - std::vector allreduce_tensors = {grad_input_tensor}; - auto work = pg->allreduce(allreduce_tensors, c10d::AllreduceOptions()); - work->wait(); - } - - if (_pre_or_postLayerNorm) { - _attn_ln.Backward(_grad_attn_nw_ptr, _grad_attn_nb_ptr, grad_input_ptr, grad_input_buf_ptr, - grad_output_ptr, gemmQKV_inp_ptr, _attn_nw_ptr, _attn_nb_ptr, _batch_tokens, - streams); - } else { - // FIXME later - launch_fused_add2(grad_input_ptr, grad_input_buf_ptr, grad_residual_ptr, _batch_size, - _seq_len, _hidden_size, _stream); - } -} - -template -void MultiHeadAttention::Backward(const T *grad_output_ptr, const T *input_ptr, const T *output_ptr, - const T *input_mask_ptr, T *grad_input_ptr) { - _stream = Context::Instance().get_stream(); - _cublasHandle = Context::Instance().get_cublashandle(); - T *buffer = _shared_mem_ptr; - - /* - buffer size needed by attn bw: - 4 * _batch_dim + max(3 * _batch_dim, - _batch_size * _head_num * _seq_len * _seq_len); - */ - attn_layer_bw(input_ptr, input_mask_ptr, output_ptr, grad_output_ptr, grad_input_ptr, buffer); -} - -template -void MultiHeadAttention::SetTrainingMode(bool training) { - // Dropout will be skipped when not in training model. 
- _attn_prob_dropout.SetTrainingMode(training); - _attn_dropout.SetTrainingMode(training); -} - -template -T *MultiHeadAttention::_shared_mem_ptr = nullptr; - -template class MultiHeadAttention; -template class MultiHeadAttention<__half>; - -// x is torch::Tensor -#define CHECK_CUDA(x) AT_ASSERTM(x.is_cuda(), #x " must be a CUDA tensor") -#define CHECK_CONTIGUOUS(x) AT_ASSERTM(x.is_contiguous(), #x " must be contiguous") -#define CHECK_INPUT(x) \ - CHECK_CUDA(x); \ - CHECK_CONTIGUOUS(x) - -static std::unordered_map> s_multihead_attention; - -template -int create_multihead_attention(int layer_id, int max_batch_tokens, int max_seq_len, int hidden_dim, - int num_heads, float attn_prob_dropout_ratio, - float hidden_dropout_ratio, bool pre_or_postLayerNorm, - c10::intrusive_ptr pg_) { - hipStream_t stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA(); - Context::Instance().set_stream(stream); - auto layer = std::make_shared>( - layer_id, max_batch_tokens, max_seq_len, hidden_dim, num_heads, attn_prob_dropout_ratio, - hidden_dropout_ratio, pre_or_postLayerNorm); - - layer->SetPG(pg_); - - s_multihead_attention[layer_id] = layer; - - std::string dtype = (std::is_same::value) ? "half" : "float"; - - return 0; -} - -template -std::vector multihead_attention_fw(int layer_id, const torch::Tensor &input, - const torch::Tensor &input_mask, - const torch::Tensor &in_proj_weight, - const torch::Tensor &in_proj_bias, - const torch::Tensor &out_proj_weight, - const torch::Tensor &out_proj_bias, - const torch::Tensor &norm_weight, - const torch::Tensor &norm_bias, - bool training_mode, bool prelayernorm) { - CHECK_INPUT(input); - CHECK_INPUT(input_mask); - - const T *input_ptr = (const T *)input.data_ptr(); - const T *input_mask_ptr = (const T *)input_mask.data_ptr(); - - auto output = torch::empty_like(input); - T *out_ptr = (T *)output.data_ptr(); - - std::shared_ptr> layer = - std::static_pointer_cast>(s_multihead_attention[layer_id]); - layer->set_cur_batch_shape(input.size(0), input.size(1)); - layer->SetTrainingMode(training_mode); - - layer->_attn_qkvw_ptr = (const T *)in_proj_weight.data_ptr(); - layer->_attn_qkvb_ptr = (const T *)in_proj_bias.data_ptr(); - layer->_attn_ow_ptr = (const T *)out_proj_weight.data_ptr(); - layer->_attn_ob_ptr = (const T *)out_proj_bias.data_ptr(); - layer->_attn_nw_ptr = (const T *)norm_weight.data_ptr(); - layer->_attn_nb_ptr = (const T *)norm_bias.data_ptr(); - - layer->Forward(input_ptr, input_mask_ptr, out_ptr); - - return {output}; -} - -template -std::vector multihead_attention_bw(int layer_id, - const torch::Tensor &grad_dec_output, - const torch::Tensor &output, - const torch::Tensor &input, - const torch::Tensor &input_mask, - const torch::Tensor &in_proj_weight, - const torch::Tensor &in_proj_bias, - const torch::Tensor &out_proj_weight, - const torch::Tensor &out_proj_bias, - const torch::Tensor &norm_weight, - const torch::Tensor &norm_bias) { - auto g_output = grad_dec_output.contiguous(); - CHECK_INPUT(g_output); - CHECK_INPUT(output); - CHECK_INPUT(input); - CHECK_INPUT(input_mask); - - auto grad_input = torch::empty_like(input); - auto grad_in_proj_weight = torch::empty_like(in_proj_weight); - auto grad_in_proj_bias = torch::empty_like(in_proj_bias); - auto grad_out_proj_weight = torch::empty_like(out_proj_weight); - auto grad_out_proj_bias = torch::empty_like(out_proj_bias); - auto grad_norm_weight = torch::empty_like(norm_weight); - auto grad_norm_bias = torch::empty_like(norm_bias); - - // inputs. 
- const T *grad_dec_output_ptr = (const T *)g_output.data_ptr(); - const T *input_ptr = (const T *)input.data_ptr(); - const T *output_ptr = (const T *)output.data_ptr(); - const T *input_mask_ptr = (const T *)input_mask.data_ptr(); - - // outputs. - T *grad_input_ptr = (T *)grad_input.data_ptr(); - - std::shared_ptr> layer = - std::static_pointer_cast>(s_multihead_attention[layer_id]); - layer->set_cur_batch_shape(g_output.size(0), g_output.size(1)); - - layer->_grad_attn_qkvw_ptr = (T *)grad_in_proj_weight.data_ptr(); - layer->_grad_attn_qkvb_ptr = (T *)grad_in_proj_bias.data_ptr(); - layer->_grad_attn_ow_ptr = (T *)grad_out_proj_weight.data_ptr(); - layer->_grad_attn_ob_ptr = (T *)grad_out_proj_bias.data_ptr(); - layer->_grad_attn_nw_ptr = (T *)grad_norm_weight.data_ptr(); - layer->_grad_attn_nb_ptr = (T *)grad_norm_bias.data_ptr(); - - layer->Backward(grad_dec_output_ptr, input_ptr, output_ptr, input_mask_ptr, grad_input_ptr); - - return {grad_input, grad_in_proj_weight, grad_in_proj_bias, grad_out_proj_weight, - grad_out_proj_bias, grad_norm_weight, grad_norm_bias}; -} - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("multihead_attention_fw_fp32", &multihead_attention_fw, - "Multi-head Attention forward with fp32 (CUDA)"); - m.def("multihead_attention_fw_fp16", &multihead_attention_fw<__half>, - "Multi-head Attention forward with fp16 (CUDA)"); - m.def("multihead_attention_bw_fp32", &multihead_attention_bw, - "Multi-head Attention backward with fp32 (CUDA)"); - m.def("multihead_attention_bw_fp16", &multihead_attention_bw<__half>, - "Multi-head Attention backward with fp16 (CUDA)"); - m.def("create_multihead_attention_fp32", &create_multihead_attention, - "Create Multi-head Attention with fp32 (CUDA)"); - m.def("create_multihead_attention_fp16", &create_multihead_attention<__half>, - "Create Multi-head Attention with fp16 (CUDA)"); -} diff --git a/colossalai/kernel/hip_native/csrc/multihead_attention_1d.h b/colossalai/kernel/hip_native/csrc/multihead_attention_1d.h deleted file mode 100644 index 9a030623ac21b72a05f44904f19d123296b5ebbd..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/multihead_attention_1d.h +++ /dev/null @@ -1,159 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! 
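The bindings above manage layer state through a static registry keyed by `layer_id`: `create_multihead_attention` stores a type-erased layer once, and the forward/backward entry points recover it on every call. A condensed sketch of that pattern (types simplified; `Layer` stands in for `MultiHeadAttention<T>`):

```cpp
#include <memory>
#include <unordered_map>

template <typename T> struct Layer { /* weights, buffers, ... */ };

// One process-wide table of type-erased layers, as in s_multihead_attention.
static std::unordered_map<int, std::shared_ptr<void>> s_registry;

template <typename T>
int create_layer(int layer_id) {
    s_registry[layer_id] = std::make_shared<Layer<T>>();
    return 0;
}

template <typename T>
std::shared_ptr<Layer<T>> fetch_layer(int layer_id) {
    // static_pointer_cast back to the concrete type, the same recovery step
    // the fw/bw entry points above perform before each call.
    return std::static_pointer_cast<Layer<T>>(s_registry[layer_id]);
}
```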
-#pragma once - -#include -#include -#include -#include - -#include -#include -#include - -#ifdef COLOSSAL_HIP -#include "hip_util.h" -#else -#include "cuda_util.h" -#endif - -#include "dropout.h" -#include "feed_forward.h" -#include "normalize_layer.h" -#include "softmax.h" -#include "strided_batch_gemm.h" - -template -class MultiHeadAttention { - public: - MultiHeadAttention(int layer_id, int max_batch_tokens, int _max_seq_len, int hidden_size, - int num_heads, float attn_dropout_ratio, float hidden_output_dropout_ratio, - bool pre_or_postLayerNorm); - - virtual ~MultiHeadAttention(); - - void Forward(const T *input_ptr, const T *input_mask_ptr, T *out_ptr); - - void Backward(const T *grad_output_ptr, const T *input_ptr, const T *output_ptr, - const T *input_mask_ptr, T *grad_input_ptr); - - void attn_layer_fw(const T *input_ptr, const T *input_mask_ptr, T *output_ptr, T *buffer); - - void attn_layer_bw(const T *input_ptr, const T *input_mask_ptr, const T *output_ptr, - const T *grad_output_ptr, T *grad_input_attn_layer_bwptr, T *buffer); - - void set_cur_batch_shape(int batch_size, int seq_len) { - _batch_size = batch_size; - _seq_len = seq_len; - _batch_tokens = batch_size * seq_len; - _batch_heads = batch_size * _heads / pg_size; - _batch_dim = _batch_tokens * _hidden_size; - _attn_scores.SetConfig(_seq_len, _seq_len, _hidden_size / _heads); - _attn_context.SetConfig(_hidden_size / _heads, _seq_len, _seq_len); - } - - void SetTrainingMode(bool training); - inline bool IsTrainingMode() const { return _training; } - - void SetPG(c10::intrusive_ptr pg_) { - pg = pg_; - pg_size = 1; - if (pg != c10::detail::UniqueVoidPtr()) { - pg_size = pg->getSize(); - } - allocate_mem_buffer(); - } - - // weights ptr - const T *_attn_qkvw_ptr; - const T *_attn_qkvb_ptr; - const T *_attn_ow_ptr; - const T *_attn_ob_ptr; - const T *_attn_nw_ptr; - const T *_attn_nb_ptr; - - // grads ptr - T *_grad_attn_qkvw_ptr; - T *_grad_attn_qkvb_ptr; - T *_grad_attn_ow_ptr; - T *_grad_attn_ob_ptr; - T *_grad_attn_nw_ptr; - T *_grad_attn_nb_ptr; - - private: - void allocate_mem_buffer() { - // allocate local gpu memory - if (_pre_or_postLayerNorm) { - _gemmQKV_inp_ptr = cuda_malloc(_max_batch_tokens * _hidden_size); - } else { - _gemmQKV_inp_ptr = nullptr; - } - - _qkv_ptr = cuda_malloc(_max_batch_tokens * _hidden_size * 3); - _soft_out_ptr = cuda_malloc(_max_batch_tokens * _heads / pg_size * _max_seq_len); - _ctx_bufB_ptr = cuda_malloc(_max_batch_tokens * _heads / pg_size * _max_seq_len); - _attn_o_inp_ptr = cuda_malloc(_max_batch_tokens * _hidden_size); - - // buffer size needed by attn bw - size_t smem_size = 4 * _max_batch_tokens * _hidden_size / pg_size + - std::max(3 * _max_batch_tokens * _hidden_size / pg_size, - _max_batch_tokens * _heads / pg_size * _max_seq_len); - - if (!_shared_mem_ptr) { - cuda_free(_shared_mem_ptr); - _shared_mem_ptr = cuda_malloc(smem_size); - } - } - - void free_mem_buffer() { - // free local gpu memory - cuda_free(_gemmQKV_inp_ptr); - cuda_free(_qkv_ptr); - cuda_free(_soft_out_ptr); - cuda_free(_ctx_bufB_ptr); - cuda_free(_attn_o_inp_ptr); - - // free shared gpu memory between layers - cuda_free(_shared_mem_ptr); - _shared_mem_ptr = nullptr; - } - - // const parameter between batch - const size_t _layer_id; - const size_t _hidden_size; - const size_t _heads; - const size_t _max_batch_tokens; - const size_t _max_seq_len; - const bool _pre_or_postLayerNorm; - // dynamic parameter between batch - size_t _batch_size; - size_t _seq_len; - size_t _batch_tokens; - size_t _batch_heads; - 
size_t _batch_dim; - bool _training; - - rocblas_handle _cublasHandle; - hipStream_t _stream; - - // layers - FeedForward _qkv_linear; - FeedForward _attn_out_linear; - Normalize_Layer _attn_ln; - Softmax _softmax; - Dropout _attn_prob_dropout; - Dropout _attn_dropout; - StridedBatchGemm _attn_scores; - StridedBatchGemm _attn_context; - - // local GPU memory - T *_gemmQKV_inp_ptr; - T *_qkv_ptr; - T *_soft_out_ptr; - T *_ctx_bufB_ptr; - T *_attn_o_inp_ptr; - // shared GPU memory between layer - static T *_shared_mem_ptr; - - c10::intrusive_ptr pg; - int pg_size; -}; diff --git a/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp b/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp deleted file mode 100644 index 14af121c1642f7de57deb62522114bc2b721af19..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp +++ /dev/null @@ -1,85 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/*This code from NVIDIA Megatron: - * with minor changes. */ - -#include -#include -#include - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_masked_softmax { - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - torch::Tensor const& mask, - float scale_factor); - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor); - -int get_batch_per_block_cuda( - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads); - -torch::Tensor fwd( - torch::Tensor const& input, - torch::Tensor const& mask, - float scale_factor) { - AT_ASSERTM(input.dim() == 4, "expected 4D tensor"); - AT_ASSERTM((input.scalar_type() == at::ScalarType::Half) || - (input.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - AT_ASSERTM(mask.dim() == 4, "expected 4D tensor"); - - return fwd_cuda(input, mask, scale_factor); -} - -torch::Tensor bwd( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor) { - - AT_ASSERTM(output_grads.dim() == 4, "expected 3D tensor"); - AT_ASSERTM(softmax_results.dim() == 4, "expected 3D tensor"); - - AT_ASSERTM((output_grads.scalar_type() == at::ScalarType::Half) || - (output_grads.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - AT_ASSERTM((softmax_results.scalar_type() == at::ScalarType::Half) || - (softmax_results.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - - return bwd_cuda(output_grads, softmax_results, scale_factor); -} - -int get_batch_per_block( - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads) { - return get_batch_per_block_cuda(query_seq_len, key_seq_len, batches, attn_heads); -} - -} // end namespace scaled_masked_softmax -} // end namespace fused_softmax -} // end namespace multihead_attn - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("forward", - &multihead_attn::fused_softmax::scaled_masked_softmax::fwd, - "Self Multihead Attention scaled, time masked softmax -- Forward."); - - m.def("backward", - &multihead_attn::fused_softmax::scaled_masked_softmax::bwd, - "Self Multihead Attention scaled, time masked softmax -- Backward."); - - m.def("get_batch_per_block", - &multihead_attn::fused_softmax::scaled_masked_softmax::get_batch_per_block, - "Return Batch per block size." 
- ); -} diff --git a/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.h b/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.h deleted file mode 100644 index 77a7d2ab3865f4b69ca90f1c4137801d8ec59a85..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.h +++ /dev/null @@ -1,494 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -/*This code from NVIDIA Megatron: - * with minor changes. */ - -#pragma once - -#include -#include -#include -#include -#include -#include -#include - -namespace { - -template -__device__ __inline__ void copy_vector(Datatype *dst, const Datatype *src); - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) {*((half2*) dst) = *((half2*) src); } - -int log2_ceil(int value) { - int log2_value = 0; - while ((1 << log2_value) < value) ++log2_value; - return log2_value; -} - -template -struct Add { - __device__ __forceinline__ T operator()(T a, T b) const { - return a + b; - } -}; - -template -struct Max { - __device__ __forceinline__ T operator()(T a, T b) const { - return a < b ? b : a; - } -}; - -template -__device__ __forceinline__ T WARP_SHFL_XOR_NATIVE(T value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff) -{ -#if TORCH_HIP_VERSION >= 9000 - return __shfl_xor_sync(mask, value, laneMask, width); -#else - return __shfl_xor(value, laneMask, width); -#endif -} - -template class ReduceOp> -__device__ __forceinline__ void warp_reduce(acc_t* sum) { - ReduceOp r; - #pragma unroll - for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) { - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - acc_t b = WARP_SHFL_XOR_NATIVE(sum[i], offset, WARP_SIZE); - sum[i] = r(sum[i], b); - } - } -} - -/* - * Extended softmax (from native aten pytorch) with following additional features - * 1) input scaling - * 2) Explicit masking - */ -template -__global__ void scaled_masked_softmax_warp_forward( - output_t *dst, - const input_t *src, - const uint8_t *mask, - const acc_t scale, - int micro_batch_size, - int element_count, - int pad_batches) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_forward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 
1 : 4; - - // blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, ) - // gridDim/blockIdx = (seq_len, attn_heads, batches) - int first_batch = (blockDim.y * (blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z))+ threadIdx.y) * WARP_BATCH; - int pad_first_batch = 0; - if (pad_batches != 1) { // bert style - pad_first_batch = (blockDim.y * (blockIdx.x + gridDim.x * blockIdx.z) + threadIdx.y) * WARP_BATCH; - } else { // gpt2 style - pad_first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * WARP_BATCH; - } - - // micro_batch_size might not be a multiple of WARP_BATCH. Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - src += first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - dst += first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - mask += pad_first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - - // load data from global memory - acc_t elements[WARP_BATCH][WARP_ITERATIONS]; - input_t temp_data[ELEMENTS_PER_LDG_STG]; - uint8_t temp_mask[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : element_count; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - - if (element_index < batch_element_count) { - int itr_idx = i*element_count+it*WARP_SIZE; - copy_vector(temp_data, src + itr_idx); - copy_vector(temp_mask, mask + itr_idx); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (temp_mask[element] != 1) { - elements[i][it + element] = (acc_t)temp_data[element] * scale; - } else { - elements[i][it + element] = -10000.0; - } - } - } else { - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - elements[i][it + element] = -std::numeric_limits::infinity(); - } - } - } - } - - // compute max_value - acc_t max_value[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - max_value[i] = elements[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - max_value[i] = (max_value[i] > elements[i][it]) ? 
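For reference, the forward kernel above computes an ordinary softmax in float accumulation over the scaled scores, with masked positions (mask value 1) pinned to -10000 before the max subtraction. A hedged PyTorch equivalent, useful for validating the kernel rather than replacing it:

```python
import torch

def scaled_masked_softmax_ref(x, mask, scale):
    # x:    [batches, attn_heads, query_seq_len, key_seq_len], fp16/bf16
    # mask: [pad_batches, 1, query_seq_len, key_seq_len], uint8, 1 = masked out
    s = x.float() * scale
    s = torch.where(mask.bool(), torch.full_like(s, -10000.0), s)
    return torch.softmax(s, dim=-1).to(x.dtype)
```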
max_value[i] : elements[i][it]; - } - } - warp_reduce(max_value); - - acc_t sum[WARP_BATCH] { 0.0f }; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; ++it) { - elements[i][it] = std::exp((elements[i][it] - max_value[i])); - sum[i] += elements[i][it]; - } - } - warp_reduce(sum); - - // store result - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < element_count) { - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - out[element] = elements[i][it + element] / sum[i]; - } - copy_vector(dst + i * element_count + it * WARP_SIZE, out); - } else { - break; - } - } - } -} - -template -__global__ void scaled_masked_softmax_warp_backward( - output_t *gradInput, - input_t *grad, - const input_t *output, - acc_t scale, - int micro_batch_size, - int element_count) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_backward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4; - - // blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, ) - // gridDim/blockIdx = (seq_len, attn_heads, batches) - int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * WARP_BATCH; - - // micro_batch_size might not be a multiple of WARP_BATCH. Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - // the first element to process by the current thread - int thread_offset = first_batch * element_count + ELEMENTS_PER_LDG_STG * local_idx; - grad += thread_offset; - output += thread_offset; - gradInput += thread_offset; - - // load data from global memory - acc_t grad_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - acc_t output_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - input_t temp_grad[ELEMENTS_PER_LDG_STG]; - input_t temp_output[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 
0 : element_count; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < batch_element_count) { - copy_vector(temp_grad, grad + i * element_count + it * WARP_SIZE); - copy_vector(temp_output, output + i * element_count + it * WARP_SIZE); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - output_reg[i][it + element] = (acc_t)temp_output[element]; - } - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - grad_reg[i][it + element] = (acc_t)temp_grad[element] * output_reg[i][it + element]; - } - } - } - } - - acc_t sum[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - sum[i] = grad_reg[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - sum[i] += grad_reg[i][it]; - } - } - warp_reduce(sum); - - // store result - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < element_count) { - // compute gradients - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - out[element] = (output_t)(scale * (grad_reg[i][it + element] - output_reg[i][it + element] * sum[i])); - } - copy_vector(gradInput + i * element_count + it * WARP_SIZE, out); - } - } - } -} -} // end of anonymous namespace - -int get_batch_per_block(int query_seq_len, int key_seq_len, int batches, int attn_heads){ - int log2_elements = log2_ceil(key_seq_len); - const int next_power_of_two = 1 << log2_elements; - - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - constexpr int threads_per_block = 128; - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - - return batches_per_block; -} - -template -void dispatch_scaled_masked_softmax_forward( - output_t *dst, - const input_t *src, - const uint8_t *mask, - const input_t scale, - int query_seq_len, - int key_seq_len, - int batches, - int attn_heads, - int pad_batches) -{ - TORCH_INTERNAL_ASSERT(key_seq_len >= 0 && key_seq_len <= 2048 ); - if (key_seq_len == 0) { - return; - } else { - int log2_elements = log2_ceil(key_seq_len); - const int next_power_of_two = 1 << log2_elements; - int batch_count = batches * attn_heads * query_seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_forward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_forward. - int batches_per_warp = (next_power_of_two <= 128) ? 
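The backward kernel is the standard softmax Jacobian-vector product with the input scale applied by the chain rule: for y = softmax(s) and incoming gradient g, it produces scale * (g*y - y * sum(g*y)). A reference implementation under the same convention (a sketch for checking numerics):

```python
import torch

def scaled_masked_softmax_bwd_ref(grad_output, softmax_results, scale):
    g, y = grad_output.float(), softmax_results.float()
    gy = g * y
    # dL/ds = scale * (g*y - y * sum(g*y)) over the softmax dimension
    return (scale * (gy - y * gy.sum(dim=-1, keepdim=True))).to(grad_output.dtype)
```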
2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - TORCH_INTERNAL_ASSERT(query_seq_len%batches_per_block == 0); - dim3 blocks(query_seq_len/batches_per_block, attn_heads, batches); - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 1: // 2 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 2: // 4 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 3: // 8 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 4: // 16 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 5: // 32 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 6: // 64 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 7: // 128 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 8: // 256 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 9: // 512 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 10: // 1024 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - case 11: // 2048 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, mask, scale, batch_count, key_seq_len, pad_batches); - break; - default: - break; - } - } -} - -template -void dispatch_scaled_masked_softmax_backward( - output_t *grad_input, - input_t *grad, - const input_t *output, - const acc_t scale, - int query_seq_len, - int key_seq_len, - int batches, - int 
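The launch geometry used by `get_batch_per_block` and both dispatch helpers follows a fixed recipe: one warp covers the next power of two of the row length (capped at the hardware warp width), and rows of 128 elements or fewer are packed two per warp. A small sketch of that arithmetic; note that `C10_WARP_SIZE` is 32 on NVIDIA and 64 on most AMD GPUs, so the cap is hardware-dependent:

```python
def log2_ceil(value):
    log2_value = 0
    while (1 << log2_value) < value:
        log2_value += 1
    return log2_value

def batches_per_block(key_seq_len, threads_per_block=128, hw_warp_size=32):
    next_power_of_two = 1 << log2_ceil(key_seq_len)
    warp_size = min(next_power_of_two, hw_warp_size)   # threads per row chunk
    batches_per_warp = 2 if next_power_of_two <= 128 else 1
    warps_per_block = threads_per_block // warp_size
    return warps_per_block * batches_per_warp

print(batches_per_block(2048))  # 4 with the NVIDIA warp width, 2 with hw_warp_size=64
```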
attn_heads) -{ - TORCH_INTERNAL_ASSERT( key_seq_len >= 0 && key_seq_len <= 2048 ); - if (key_seq_len == 0) { - return; - } else { - int log2_elements = log2_ceil(key_seq_len); - const int next_power_of_two = 1 << log2_elements; - int batch_count = batches * attn_heads * query_seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_backward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_backward. - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - int blocks = batch_count/batches_per_block; - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 1: // 2 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 2: // 4 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 3: // 8 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 4: // 16 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 5: // 32 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 6: // 64 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 7: // 128 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 8: // 256 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 9: // 512 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - case 10: // 1024 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, 
scale, batch_count, key_seq_len); - break; - case 11: // 2048 - hipLaunchKernelGGL(( scaled_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, key_seq_len); - break; - default: - break; - } - } -} diff --git a/colossalai/kernel/hip_native/csrc/scaled_masked_softmax_hip.hip b/colossalai/kernel/hip_native/csrc/scaled_masked_softmax_hip.hip deleted file mode 100644 index d416999068781da8bd3aab435ecfa2f3132dbc84..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/scaled_masked_softmax_hip.hip +++ /dev/null @@ -1,109 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/*This code from NVIDIA Megatron: - * with minor changes. */ - -#include -#include -#include -#include - -#ifndef COLOSSAL_HIP -#include -#endif - -#include -#include -#include "../../hip_native/csrc/scaled_masked_softmax.h" -#include "../../hip_native/csrc/type_shim.h" - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_masked_softmax { - -int get_batch_per_block_cuda(int query_seq_len, int key_seq_len, int batches, int attn_heads){ - return get_batch_per_block(query_seq_len, key_seq_len, batches, attn_heads); -} - - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - torch::Tensor const& mask, - float scale_factor) -{ - // input is a 4d tensor with dimensions [batches, attn_heads, seq_len, seq_len] - const int batches = input.size(0); - const int pad_batches = mask.size(0); - const int attn_heads = input.size(1); - const int query_seq_len = input.size(2); - const int key_seq_len = input.size(3); - TORCH_INTERNAL_ASSERT(key_seq_len <= 2048); - TORCH_INTERNAL_ASSERT(query_seq_len > 1); - TORCH_INTERNAL_ASSERT(pad_batches == 1 || pad_batches == batches); - TORCH_INTERNAL_ASSERT(mask.size(1) == 1); - TORCH_INTERNAL_ASSERT(mask.size(2) == query_seq_len); - TORCH_INTERNAL_ASSERT(mask.size(3) == key_seq_len); - - // Output - auto act_options = input.options().requires_grad(false); - torch::Tensor softmax_results = - torch::empty({batches, attn_heads, query_seq_len, key_seq_len}, act_options); - - // Softmax Intermediate Result Ptr - void* input_ptr = static_cast(input.data_ptr()); - void* mask_ptr = static_cast(mask.data_ptr()); - void* softmax_results_ptr = static_cast(softmax_results.data_ptr()); - - DISPATCH_HALF_AND_BFLOAT( - input.scalar_type(), - "dispatch_scaled_masked_softmax_forward", - dispatch_scaled_masked_softmax_forward( - reinterpret_cast(softmax_results_ptr), - reinterpret_cast(input_ptr), - reinterpret_cast(mask_ptr), - scale_factor, - query_seq_len, - key_seq_len, - batches, - attn_heads, - pad_batches); - ); - return softmax_results; -} - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads_, - torch::Tensor const& softmax_results_, - float scale_factor) { - - auto output_grads = output_grads_.contiguous(); - auto softmax_results = softmax_results_.contiguous(); - - //output grads is a 4d tensor with dimensions [batches, attn_heads, seq_len, seq_len] - const int batches = output_grads.size(0); - const int attn_heads = output_grads.size(1); - const int query_seq_len = output_grads.size(2); - const int key_seq_len = output_grads.size(3); - - void* output_grads_ptr = static_cast(output_grads.data_ptr()); - - //Softmax Grad - DISPATCH_HALF_AND_BFLOAT( - output_grads_.scalar_type(), - "dispatch_scaled_masked_softmax_backward", - dispatch_scaled_masked_softmax_backward( - reinterpret_cast(output_grads_ptr), - 
reinterpret_cast(output_grads_ptr), - reinterpret_cast(softmax_results.data_ptr()), - scale_factor, - query_seq_len, - key_seq_len, - batches, - attn_heads); - ); - - //backward pass is completely in-place - return output_grads; -} -} -} -} diff --git a/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp b/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp deleted file mode 100644 index fc33125b6a7c04d83718f39410584f443856f0bf..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp +++ /dev/null @@ -1,60 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/*This code from NVIDIA Megatron: - * with minor changes. */ - -#include -#include -#include - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_upper_triang_masked_softmax { - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - float scale_factor); - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor); - -torch::Tensor fwd(torch::Tensor const& input, float scale_factor) { - AT_ASSERTM(input.dim() == 3, "expected 3D tensor"); - AT_ASSERTM((input.scalar_type() == at::ScalarType::Half) || - (input.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - - return fwd_cuda(input, scale_factor); -} - -torch::Tensor bwd( - torch::Tensor const& output_grads, - torch::Tensor const& softmax_results, - float scale_factor) { - - AT_ASSERTM(output_grads.dim() == 3, "expected 3D tensor"); - AT_ASSERTM(softmax_results.dim() == 3, "expected 3D tensor"); - - AT_ASSERTM((output_grads.scalar_type() == at::ScalarType::Half) || - (output_grads.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - AT_ASSERTM((softmax_results.scalar_type() == at::ScalarType::Half) || - (softmax_results.scalar_type() == at::ScalarType::BFloat16), - "Only fp16 and bf16 are supported"); - - return bwd_cuda(output_grads, softmax_results, scale_factor); -} - -} // end namespace scaled_upper_triang_masked_softmax -} // end namespace fused_softmax -} // end namespace multihead_attn - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { - m.def("forward", - &multihead_attn::fused_softmax::scaled_upper_triang_masked_softmax::fwd, - "Self Multihead Attention scaled, time masked softmax -- Forward."); - m.def("backward", - &multihead_attn::fused_softmax::scaled_upper_triang_masked_softmax::bwd, - "Self Multihead Attention scaled, time masked softmax -- Backward."); -} diff --git a/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h b/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h deleted file mode 100644 index 9ceda50a66fd5d2cb9fcf19019f13a41bee88677..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h +++ /dev/null @@ -1,502 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -/*This code from NVIDIA Megatron: - * with minor changes. 
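Because `bwd_cuda` above writes the score gradient into a contiguous copy of `output_grads` and returns it, callers must treat the incoming gradient buffer as consumed. The same in-place contract can be emulated in PyTorch (a sketch, not the extension's API):

```python
import torch

def masked_softmax_bwd_inplace(grad_output, softmax_results, scale):
    # Emulates the extension's in-place backward: the gradient buffer is
    # overwritten with the gradient w.r.t. the pre-softmax scores.
    g = grad_output.contiguous()          # aliases grad_output if already contiguous
    y = softmax_results
    gy_sum = (g * y).sum(dim=-1, keepdim=True)  # computed before mutating g
    g.mul_(y).sub_(y * gy_sum).mul_(scale)
    return g
```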
*/ - -#pragma once - -#include -#include -#include -#include -#include -#include - -namespace { - -template -__device__ __inline__ void copy_vector(Datatype *dst, const Datatype *src); - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::BFloat16 *dst, const c10::BFloat16 *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(c10::Half *dst, const c10::Half *src) { *((float2*) dst) = *((float2*) src); } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) { *dst = *src; } - -template <> -__device__ __inline__ void copy_vector(uint8_t *dst, const uint8_t *src) {*((half2*) dst) = *((half2*) src); } - -template -__device__ __inline__ void copy_zero_vector(Datatype *dst); - -template <> -__device__ __inline__ void copy_zero_vector(c10::BFloat16 *dst) { *dst = 0.0; } - -template <> -__device__ __inline__ void copy_zero_vector(c10::BFloat16 *dst) { *((float2*) dst) = make_float2(0.0f, 0.0f); } - -template <> -__device__ __inline__ void copy_zero_vector(c10::Half *dst) { *dst = 0.0; } - -template <> -__device__ __inline__ void copy_zero_vector(c10::Half *dst) { *((float2*) dst) = make_float2(0.0f, 0.0f); } - - -int log2_ceil(int value) { - int log2_value = 0; - while ((1 << log2_value) < value) ++log2_value; - return log2_value; -} - -template -struct Add { - __device__ __forceinline__ T operator()(T a, T b) const { - return a + b; - } -}; - -template -struct Max { - __device__ __forceinline__ T operator()(T a, T b) const { - return a < b ? b : a; - } -}; - -template -__device__ __forceinline__ T WARP_SHFL_XOR_NATIVE(T value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff) -{ -#if TORCH_HIP_VERSION >= 9000 - return __shfl_xor_sync(mask, value, laneMask, width); -#else - return __shfl_xor(value, laneMask, width); -#endif -} - -template class ReduceOp> -__device__ __forceinline__ void warp_reduce(acc_t* sum) { - ReduceOp r; - #pragma unroll - for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) { - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - acc_t b = WARP_SHFL_XOR_NATIVE(sum[i], offset, WARP_SIZE); - sum[i] = r(sum[i], b); - } - } -} - -/* - * Extended softmax (from native aten pytorch) with following additional features - * 1) input scaling - * 2) Implicit time (diagonal masking) - */ -template -__global__ void scaled_upper_triang_masked_softmax_warp_forward( - output_t *dst, - const input_t *src, - const acc_t scale, - int micro_batch_size, - int stride, - int element_count) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_forward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4; - - int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x; - int local_seq = blockIdx.x + 1; - int warp_iteration_limit = (local_seq + ELEMENTS_PER_LDG_STG * WARP_SIZE - 1)/ WARP_SIZE; - - // micro_batch_size might not be a multiple of WARP_BATCH. 
Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - src += first_batch * stride + ELEMENTS_PER_LDG_STG * local_idx; - dst += first_batch * stride + ELEMENTS_PER_LDG_STG * local_idx; - - // load data from global memory - acc_t elements[WARP_BATCH][WARP_ITERATIONS]; - input_t temp_data[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : local_seq; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - - if (element_index < batch_element_count) { - copy_vector(temp_data, src + i*element_count*stride + it*WARP_SIZE); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if ((element_index + element) < batch_element_count) { - elements[i][it+element] = (acc_t)temp_data[element] * scale; - } else { - elements[i][it + element] = -std::numeric_limits::infinity(); - } - } - } else { - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - elements[i][it + element] = -std::numeric_limits::infinity(); - } - } - } - } - - // compute max_value - acc_t max_value[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - max_value[i] = elements[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - max_value[i] = (max_value[i] > elements[i][it]) ? max_value[i] : elements[i][it]; - } - } - warp_reduce(max_value); - - acc_t sum[WARP_BATCH] { 0.0f }; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; ++it) { - if (it < warp_iteration_limit) { - elements[i][it] = std::exp((elements[i][it] - max_value[i])); - sum[i] += elements[i][it]; - } - } - } - warp_reduce(sum); - - // store result - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - - if (element_index < local_seq) { - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (element_index + element < local_seq) { - out[element] = elements[i][it + element] / sum[i]; - } else { - out[element] = 0; - } - } - copy_vector(dst + i * element_count * stride + it * WARP_SIZE, out); - } else if (element_index < element_count) { - copy_zero_vector(dst + i * element_count * stride + it * WARP_SIZE); - } else { - break; - } - } - } -} - -template -__global__ void scaled_upper_triang_masked_softmax_warp_backward( - output_t *gradInput, - input_t *grad, - const input_t *output, - acc_t scale, - int micro_batch_size, - int stride, - int element_count) -{ - // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and - // warp_size of method warp_softmax_backward_kernel. - constexpr int next_power_of_two = 1 << log2_elements; - constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; - constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 
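The upper-triangular variant needs no mask tensor: position (q, k) is masked implicitly whenever k > q, and fully masked tail elements are written out as exact zeros rather than NaNs. A hedged PyTorch reference of the forward pass:

```python
import torch

def scaled_upper_triang_masked_softmax_ref(x, scale):
    # x: [attn_batches, seq_len, seq_len]; causal mask, key index <= query index
    seq_len = x.size(-1)
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=x.device), diagonal=1)
    s = (x.float() * scale).masked_fill(causal, float("-inf"))
    # exp(-inf) = 0, so masked positions come out as exact zeros
    return torch.softmax(s, dim=-1).to(x.dtype)
```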
2 : 1; - constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4; - - int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x; - int local_seq = blockIdx.x + 1; - - // micro_batch_size might not be a multiple of WARP_BATCH. Check how - // many batches have to computed within this WARP. - int local_batches = micro_batch_size - first_batch; - if (local_batches > WARP_BATCH) - local_batches = WARP_BATCH; - - // there might be multiple batches per warp. compute the index within the batch - int local_idx = threadIdx.x; - - // the first element to process by the current thread - int thread_offset = first_batch * stride + ELEMENTS_PER_LDG_STG * local_idx; - grad += thread_offset; - output += thread_offset; - gradInput += thread_offset; - - // load data from global memory - acc_t grad_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - acc_t output_reg[WARP_BATCH][WARP_ITERATIONS] { 0.0f }; - input_t temp_grad[ELEMENTS_PER_LDG_STG]; - input_t temp_output[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - int batch_element_count = (i >= local_batches) ? 0 : local_seq; - - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < batch_element_count) { - copy_vector(temp_grad, grad + i * element_count * stride + it * WARP_SIZE); - copy_vector(temp_output, output + i * element_count * stride + it * WARP_SIZE); - - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (element_index + element < batch_element_count) { - output_reg[i][it + element] = (acc_t)temp_output[element]; - } - } - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - if (element_index + element < batch_element_count) { - grad_reg[i][it + element] = (acc_t)temp_grad[element] * output_reg[i][it + element]; - } - } - } - } - } - - acc_t sum[WARP_BATCH]; - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - sum[i] = grad_reg[i][0]; - #pragma unroll - for (int it = 1; it < WARP_ITERATIONS; ++it) { - sum[i] += grad_reg[i][it]; - } - } - warp_reduce(sum); - - // store result - #pragma unroll - for (int i = 0; i < WARP_BATCH; ++i) { - if (i >= local_batches) - break; - #pragma unroll - for (int it = 0; it < WARP_ITERATIONS; it+=ELEMENTS_PER_LDG_STG) { - int element_index = ELEMENTS_PER_LDG_STG * local_idx + it * WARP_SIZE; - if (element_index < element_count) { - // compute gradients - output_t out[ELEMENTS_PER_LDG_STG]; - #pragma unroll - for (int element = 0; element < ELEMENTS_PER_LDG_STG; ++element) { - out[element] = (output_t)(scale * (grad_reg[i][it + element] - output_reg[i][it + element] * sum[i])); - } - copy_vector(gradInput + i * element_count * stride + it * WARP_SIZE, out); - } - } - } -} - -} // end of anonymous namespace - -template -void dispatch_scaled_upper_triang_masked_softmax_forward( - output_t *dst, - const input_t *src, - const input_t scale, - int softmax_elements, - int softmax_elements_stride, - int attn_batches) -{ - TORCH_INTERNAL_ASSERT(softmax_elements >= 0 && softmax_elements <= 2048 ); - if (softmax_elements == 0) { - return; - } else { - int log2_elements = log2_ceil(softmax_elements); - const int next_power_of_two = 1 << log2_elements; - int seq_len = softmax_elements; - int batch_count = attn_batches * seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_forward. 
- int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_forward. - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - TORCH_INTERNAL_ASSERT(attn_batches % batches_per_block == 0); - - int blocks_per_seq = attn_batches / batches_per_block; - dim3 blocks(seq_len, blocks_per_seq, 1); - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 1: // 2 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 2: // 4 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 3: // 8 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 4: // 16 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 5: // 32 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 6: // 64 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 7: // 128 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 8: // 256 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 9: // 512 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 10: // 1024 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, 
batch_count, softmax_elements_stride, softmax_elements); - break; - case 11: // 2048 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_forward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), dst, src, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - default: - break; - } - } -} - -template -void dispatch_scaled_upper_triang_masked_softmax_backward( - output_t *grad_input, - input_t *grad, - const input_t *output, - const acc_t scale, - int softmax_elements, - int softmax_elements_stride, - int attn_batches) -{ - TORCH_INTERNAL_ASSERT( softmax_elements >= 0 && softmax_elements <= 2048 ); - if (softmax_elements == 0) { - return; - } else { - int log2_elements = log2_ceil(softmax_elements); - const int next_power_of_two = 1 << log2_elements; - int seq_len = softmax_elements; - int batch_count = attn_batches * seq_len; - - // This value must match the WARP_SIZE constexpr value computed inside softmax_warp_backward. - int warp_size = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; - - // This value must match the WARP_BATCH constexpr value computed inside softmax_warp_backward. - int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1; - - // use 128 threads per block to maximimize gpu utilization - constexpr int threads_per_block = 128; - - int warps_per_block = (threads_per_block / warp_size); - int batches_per_block = warps_per_block * batches_per_warp; - TORCH_INTERNAL_ASSERT(attn_batches % batches_per_block == 0); - - int blocks_per_seq = attn_batches / batches_per_block; - dim3 blocks(seq_len, blocks_per_seq, 1); - dim3 threads(warp_size, warps_per_block, 1); - // Launch code would be more elegant if C++ supported FOR CONSTEXPR - switch (log2_elements) { - case 0: // 1 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 1: // 2 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 2: // 4 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 3: // 8 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 4: // 16 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 5: // 32 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 6: // 64 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, 
at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 7: // 128 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 8: // 256 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 9: // 512 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 10: // 1024 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - case 11: // 2048 - hipLaunchKernelGGL(( scaled_upper_triang_masked_softmax_warp_backward) - , dim3(blocks), dim3(threads), 0, at::hip::getCurrentHIPStreamMasqueradingAsCUDA(), grad_input, grad, output, scale, batch_count, softmax_elements_stride, softmax_elements); - break; - default: - break; - } - } -} diff --git a/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax_hip.hip b/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax_hip.hip deleted file mode 100644 index 0cfc1137c37b5e77ca64382a08e35ea4fbcc2347..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax_hip.hip +++ /dev/null @@ -1,90 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -/*This code from NVIDIA Megatron: - * with minor changes. 
*/ - -#include -#include -#include -#include - -#ifndef COLOSSAL_HIP -#include -#endif - -#include -#include -#include "../../hip_native/csrc/scaled_upper_triang_masked_softmax.h" -#include "../../hip_native/csrc/type_shim.h" - -namespace multihead_attn { -namespace fused_softmax { -namespace scaled_upper_triang_masked_softmax { - -torch::Tensor fwd_cuda( - torch::Tensor const& input, - float scale_factor) -{ - // input is a 3d tensor with dimensions [attn_batches, seq_len, seq_len] - const int attn_batches = input.size(0); - const int seq_len = input.size(1); - TORCH_INTERNAL_ASSERT(seq_len <= 2048); - - // Output - auto act_options = input.options().requires_grad(false); - torch::Tensor softmax_results = - torch::empty({attn_batches, seq_len, seq_len}, act_options); - - // Softmax Intermediate Result Ptr - void* input_ptr = static_cast(input.data_ptr()); - void* softmax_results_ptr = static_cast(softmax_results.data_ptr()); - - DISPATCH_HALF_AND_BFLOAT( - input.scalar_type(), - "dispatch_scaled_upper_triang_masked_softmax_forward", - dispatch_scaled_upper_triang_masked_softmax_forward( - reinterpret_cast(softmax_results_ptr), - reinterpret_cast(input_ptr), - scale_factor, - seq_len, - seq_len, - attn_batches); - ); - return softmax_results; -} - - -torch::Tensor bwd_cuda( - torch::Tensor const& output_grads_, - torch::Tensor const& softmax_results_, - float scale_factor) { - - auto output_grads = output_grads_.contiguous(); - auto softmax_results = softmax_results_.contiguous(); - - //output grads is a 3d tensor with dimensions [attn_batches, seq_len, seq_len] - const int attn_batches = output_grads.size(0); - const int seq_len = output_grads.size(1); - TORCH_INTERNAL_ASSERT(output_grads.size(1) == output_grads.size(2)); - - void* output_grads_ptr = static_cast(output_grads.data_ptr()); - - //Softmax Grad - DISPATCH_HALF_AND_BFLOAT( - output_grads_.scalar_type(), - "dispatch_scaled_upper_triang_masked_softmax_backward", - dispatch_scaled_upper_triang_masked_softmax_backward( - reinterpret_cast(output_grads_ptr), - reinterpret_cast(output_grads_ptr), - reinterpret_cast(softmax_results.data_ptr()), - scale_factor, - seq_len, - seq_len, - attn_batches); - ); - - //backward pass is completely in-place - return output_grads; -} -} -} -} diff --git a/colossalai/kernel/hip_native/csrc/type_shim.h b/colossalai/kernel/hip_native/csrc/type_shim.h deleted file mode 100644 index f92933aad51642723691be76531a31de9454406e..0000000000000000000000000000000000000000 --- a/colossalai/kernel/hip_native/csrc/type_shim.h +++ /dev/null @@ -1,282 +0,0 @@ -// !!! This is a file automatically generated by hipify!!! -#include "hip/hip_runtime.h" -#include -#include "../../hip_native/csrc/compat.h" - - -#define DISPATCH_HALF_AND_BFLOAT(TYPE, NAME, ...) \ - switch(TYPE) \ - { \ - case at::ScalarType::Half: \ - { \ - using scalar_t = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::BFloat16: \ - { \ - using scalar_t = at::BFloat16; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - - - -#define DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES(TYPEIN, TYPEOUT, NAME, ...) 
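`DISPATCH_HALF_AND_BFLOAT` and its relatives below expand one copy of the code body per runtime scalar type, which is how the untyped `void*` tensor pointers reach the typed kernel templates. A rough Python analogue of the pattern, illustrative only:

```python
import torch

def dispatch_half_and_bfloat(dtype, name, body):
    # Choose a concrete element type at runtime, then run the same body
    # with that type bound, mirroring the C macro's switch statement.
    if dtype == torch.float16:
        return body(torch.float16)
    if dtype == torch.bfloat16:
        return body(torch.bfloat16)
    raise TypeError(f"{name} not implemented for '{dtype}'")

out = dispatch_half_and_bfloat(torch.bfloat16, "demo", lambda t: torch.zeros(4, dtype=t))
print(out.dtype)  # torch.bfloat16
```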
\ - switch(TYPEIN) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_in = float; \ - switch(TYPEOUT) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_out = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_out = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::BFloat16: \ - { \ - using scalar_t_out = at::BFloat16; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPEOUT), "'"); \ - } \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_in = at::Half; \ - using scalar_t_out = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::BFloat16: \ - { \ - using scalar_t_in = at::BFloat16; \ - using scalar_t_out = at::BFloat16; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPEIN), "'"); \ - } - -// Forward/backward compatiblity hack around -// https://github.com/pytorch/pytorch/commit/3aeb78079bcd68282fe9117088e138b77318e288 -// pending more future-proof guidance from upstream. -// struct TypeShim -// { -// const at::Type& payload; -// TypeShim(const at::Type& type) : payload(type) {} -// // Enable trivial conversion to a const at::Type& for pre-3aeb78 -// operator const at::Type&(){ return payload; }; -// // Enable dispatch switch statements to take *this directly for post-3aeb78 -// //operator at::ScalarType(){ return payload.; }; -// }; - -#define DISPATCH_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_##LEVEL = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -#define DISPATCH_FLOAT_HALF_AND_BYTE(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_##LEVEL = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Byte: \ - { \ - using scalar_t_##LEVEL = uint8_t; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -#define DISPATCH_DOUBLE_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Double: \ - { \ - using scalar_t_##LEVEL = double; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Half: \ - { \ - using scalar_t_##LEVEL = at::Half; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -#define DISPATCH_DOUBLE_AND_FLOAT(TYPE, LEVEL, NAME, ...) \ - switch (TYPE) \ - { \ - case at::ScalarType::Double: \ - { \ - using scalar_t_##LEVEL = double; \ - __VA_ARGS__; \ - break; \ - } \ - case at::ScalarType::Float: \ - { \ - using scalar_t_##LEVEL = float; \ - __VA_ARGS__; \ - break; \ - } \ - default: \ - AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \ - } - -template -__device__ __forceinline__ T reduce_block_into_lanes(T *x, - T val, - int lanes = 1, - bool share_result = false) // lanes is intended to be <= 32. 
-{ - int tid = threadIdx.x + threadIdx.y * blockDim.x; - int blockSize = blockDim.x * blockDim.y; // blockSize is intended to be a multiple of 32. - - if (blockSize >= 64) - { - x[tid] = val; - __syncthreads(); - } - -#pragma unroll - for (int i = (blockSize >> 1); i >= 64; i >>= 1) - { - if (tid < i) - x[tid] = x[tid] + x[tid + i]; - __syncthreads(); - } - - T final; - - if (tid < 32) - { - if (blockSize >= 64) - final = x[tid] + x[tid + 32]; - else - final = val; - // __SYNCWARP(); - -#pragma unroll - for (int i = 16; i >= lanes; i >>= 1) -#ifdef COLOSSAL_HIP - final = final + __shfl_down(final, i); -#else - final = final + __shfl_down_sync(0xffffffff, final, i); -#endif - } - - if (share_result) - { - if (tid < lanes) - x[tid] = final; // EpilogueOp - // Make sure the smem result is visible to all warps. - __syncthreads(); - } - - return final; -} - -template -__device__ __forceinline__ T reduce_block_into_lanes_max_op(T *x, - T val, - int lanes = 1, - bool share_result = false) // lanes is intended to be <= 32. -{ - int tid = threadIdx.x + threadIdx.y * blockDim.x; - int blockSize = blockDim.x * blockDim.y; // blockSize is intended to be a multiple of 32. - - if (blockSize >= 64) - { - x[tid] = val; - __syncthreads(); - } - -#pragma unroll - for (int i = (blockSize >> 1); i >= 64; i >>= 1) - { - if (tid < i) - x[tid] = fmaxf(fabsf(x[tid]), fabsf(x[tid + i])); - __syncthreads(); - } - - T final; - - if (tid < 32) - { - if (blockSize >= 64) - final = fmaxf(fabsf(x[tid]), fabsf(x[tid + 32])); - else - final = val; - // __SYNCWARP(); - -#pragma unroll - for (int i = 16; i >= lanes; i >>= 1) -#ifdef COLOSSAL_HIP - final = fmaxf(fabsf(final), fabsf(__shfl_down(final, i))); -#else - final = fmaxf(fabsf(final), fabsf(__shfl_down_sync(0xffffffff, final, i))); -#endif - } - - if (share_result) - { - if (tid < lanes) - x[tid] = final; // EpilogueOp - // Make sure the smem result is visible to all warps. 
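`reduce_block_into_lanes` tree-sums a thread block's values down to `lanes` partial sums in shared memory, switching to warp shuffles below 64 threads. A software model of the resulting value, assuming a power-of-two block size; lane i ends up holding the sum of the elements whose index is congruent to i modulo `lanes`:

```python
import numpy as np

def reduce_block_into_lanes_ref(vals, lanes=1):
    # Repeatedly fold the upper half onto the lower half, as the
    # shared-memory tree reduction and warp shuffles do combined.
    x = np.asarray(vals, dtype=np.float64).copy()
    n = x.size
    while n > lanes:
        half = n // 2
        x[:half] += x[half:n]
        n = half
    return x[:lanes]

print(reduce_block_into_lanes_ref(np.ones(256), lanes=1))  # [256.]
```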
- __syncthreads(); - } - - return final; -} diff --git a/colossalai/kernel/jit/__init__.py b/colossalai/kernel/jit/__init__.py deleted file mode 100644 index 57b8fb7b2e996ea0f0336dad1e42ea379d608b15..0000000000000000000000000000000000000000 --- a/colossalai/kernel/jit/__init__.py +++ /dev/null @@ -1,8 +0,0 @@ -from .option import set_jit_fusion_options -from .bias_dropout_add import bias_dropout_add_fused_train, bias_dropout_add_fused_inference -from .bias_gelu import bias_gelu_impl - -__all__ = [ - "bias_dropout_add_fused_train", "bias_dropout_add_fused_inference", "bias_gelu_impl", - "set_jit_fusion_options" -] diff --git a/colossalai/kernel/jit/bias_dropout_add.py b/colossalai/kernel/jit/bias_dropout_add.py deleted file mode 100644 index 3687dde79a08b7f8f192d6516694938828aae659..0000000000000000000000000000000000000000 --- a/colossalai/kernel/jit/bias_dropout_add.py +++ /dev/null @@ -1,24 +0,0 @@ -import torch - - -def bias_dropout_add(x, bias, residual, prob, training): - # type: (Tensor, Tensor, Tensor, float, bool) -> Tensor - out = torch.nn.functional.dropout(x + bias, p=prob, training=training) - out = residual + out - return out - - -@torch.jit.script -def bias_dropout_add_fused_train(x: torch.Tensor, - bias: torch.Tensor, - residual: torch.Tensor, - prob: float) -> torch.Tensor: - return bias_dropout_add(x, bias, residual, prob, True) - - -@torch.jit.script -def bias_dropout_add_fused_inference(x: torch.Tensor, - bias: torch.Tensor, - residual: torch.Tensor, - prob: float) -> torch.Tensor: - return bias_dropout_add(x, bias, residual, prob, False) diff --git a/colossalai/kernel/jit/bias_gelu.py b/colossalai/kernel/jit/bias_gelu.py deleted file mode 100644 index f7a425dd5400a5b230121b8a276d286068787acb..0000000000000000000000000000000000000000 --- a/colossalai/kernel/jit/bias_gelu.py +++ /dev/null @@ -1,41 +0,0 @@ -import torch - - -###### BIAS GELU FUSION/ NO AUTOGRAD ################ -# 1/sqrt(2*pi)-> 0.3989423 -# 1/sqrt(2) -> 0.70710678 -# sqrt(2/pi) -> 0.79788456 -# this function is tanh approximation of gelu -# actual gelu is: -# x * 0.5 * (1.0 + torch.erf(x * 0.70710678)) - -@torch.jit.script -def bias_gelu(bias, y): - x = bias + y - return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x))) - -# gradient of tanh approximation of gelu -# gradient of actual gelu is: -# 0.5 * (1. + torch.erf(x * 0.70710678)) + 0.3989423 * x * torch.exp(-0.5 * x * x) -@torch.jit.script -def bias_gelu_back(g, bias, y): - x = bias + y - tanh_out = torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)) - # sqrt(2/pi) * 3 * 0.044715 -> 0.1070322243 - ff = 0.5 * x * ((1 - tanh_out * tanh_out) * (0.79788456 + 0.1070322243 * x * x)) + 0.5 * (1 + tanh_out) - return ff*g - -class GeLUFunction(torch.autograd.Function): - @staticmethod - # bias is an optional argument - def forward(ctx, input, bias): - ctx.save_for_backward(input, bias) - return bias_gelu(bias, input) - - @staticmethod - def backward(ctx, grad_output): - input, bias = ctx.saved_tensors - tmp = bias_gelu_back(grad_output, bias, input) - return tmp, tmp - -bias_gelu_impl = GeLUFunction.apply \ No newline at end of file diff --git a/colossalai/kernel/jit/option.py b/colossalai/kernel/jit/option.py deleted file mode 100644 index d959058975078975fe7b5c33bf34c27fb791b770..0000000000000000000000000000000000000000 --- a/colossalai/kernel/jit/option.py +++ /dev/null @@ -1,32 +0,0 @@ -import torch - -JIT_OPTIONS_SET = False - - -def set_jit_fusion_options(): - """Set PyTorch JIT layer fusion options. 
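The fused `bias_gelu` above is the tanh approximation of GELU, so it can be sanity-checked against PyTorch's built-in approximate GELU (available in recent PyTorch releases). A quick numerical check; the small residual error comes from the truncated sqrt(2/pi) constant:

```python
import torch

x = torch.randn(64, dtype=torch.float64)
b = torch.randn(64, dtype=torch.float64)
y = x + b
fused = y * 0.5 * (1.0 + torch.tanh(0.79788456 * y * (1 + 0.044715 * y * y)))
ref = torch.nn.functional.gelu(y, approximate="tanh")
print(torch.allclose(fused, ref, atol=1e-6))  # True
```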
- """ - # LSG: the latest pytorch and CUDA versions may not support - # the following jit settings - global JIT_OPTIONS_SET - if JIT_OPTIONS_SET == False: - # flags required to enable jit fusion kernels - TORCH_MAJOR = int(torch.__version__.split('.')[0]) - TORCH_MINOR = int(torch.__version__.split('.')[1]) - if (TORCH_MAJOR > 1) or (TORCH_MAJOR == 1 and TORCH_MINOR >= 10): - # nvfuser - torch._C._jit_set_profiling_executor(True) - torch._C._jit_set_profiling_mode(True) - torch._C._jit_override_can_fuse_on_cpu(False) - torch._C._jit_override_can_fuse_on_gpu(False) - torch._C._jit_set_texpr_fuser_enabled(False) - torch._C._jit_set_nvfuser_enabled(True) - torch._C._debug_set_autodiff_subgraph_inlining(False) - else: - # legacy pytorch fuser - torch._C._jit_set_profiling_mode(False) - torch._C._jit_set_profiling_executor(False) - torch._C._jit_override_can_fuse_on_cpu(True) - torch._C._jit_override_can_fuse_on_gpu(True) - - JIT_OPTIONS_SET = True diff --git a/colossalai/logging/__init__.py b/colossalai/logging/__init__.py deleted file mode 100644 index 9a73099105f6eef23ca9431ba9785f9ada97cd1a..0000000000000000000000000000000000000000 --- a/colossalai/logging/__init__.py +++ /dev/null @@ -1,29 +0,0 @@ -from typing import List -from .logging import DistributedLogger -import logging - -__all__ = ['get_dist_logger', 'DistributedLogger'] - - -def get_dist_logger(name='colossalai'): - """Get logger instance based on name. The DistributedLogger will create singleton instances, - which means that only one logger instance is created per name. - - :param name: name of the logger, name must be unique - :type name: str - - :return: a distributed logger instance - :rtype: :class:`colossalai.logging.DistributedLogger` - """ - return DistributedLogger.get_instance(name=name) - - -def disable_existing_loggers(except_loggers: List[str] = ['colossalai']): - """Set the level of existing loggers to `WARNING`. 
- - :param except_loggers: loggers in this `list` will be ignored when disabling, defaults to ['colossalai'] - :type except_loggers: list, optional - """ - for log_name in logging.Logger.manager.loggerDict.keys(): - if log_name not in except_loggers: - logging.getLogger(log_name).setLevel(logging.WARNING) diff --git a/colossalai/logging/__pycache__/__init__.cpython-36.pyc b/colossalai/logging/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index bc5914482e2f4b111c9d602489e321ad84593876..0000000000000000000000000000000000000000 Binary files a/colossalai/logging/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/logging/__pycache__/__init__.cpython-37.pyc b/colossalai/logging/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index bd4f1a79a559e5d1fa87d50372a54cb8995484e7..0000000000000000000000000000000000000000 Binary files a/colossalai/logging/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/logging/__pycache__/logging.cpython-36.pyc b/colossalai/logging/__pycache__/logging.cpython-36.pyc deleted file mode 100644 index 80825bb6aed4d98c985aeaeb75fac88d70b1a0d0..0000000000000000000000000000000000000000 Binary files a/colossalai/logging/__pycache__/logging.cpython-36.pyc and /dev/null differ diff --git a/colossalai/logging/__pycache__/logging.cpython-37.pyc b/colossalai/logging/__pycache__/logging.cpython-37.pyc deleted file mode 100644 index ee1620c8ed84b8caf68aab70bc8d7001ebec480e..0000000000000000000000000000000000000000 Binary files a/colossalai/logging/__pycache__/logging.cpython-37.pyc and /dev/null differ diff --git a/colossalai/logging/logging.py b/colossalai/logging/logging.py deleted file mode 100644 index 0893081889fbe7347b07ce9a92688e9f9990bfef..0000000000000000000000000000000000000000 --- a/colossalai/logging/logging.py +++ /dev/null @@ -1,156 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import colossalai -import logging -from pathlib import Path -from typing import Union - -from colossalai.context.parallel_mode import ParallelMode - -_FORMAT = 'colossalai - %(name)s - %(asctime)s %(levelname)s: %(message)s' -logging.basicConfig(level=logging.INFO, format=_FORMAT) - - -class DistributedLogger: - """This is a distributed event logger class essentially based on :class:`logging`. - - :param name: The name of the logger - :type name: str - """ - - __instances = dict() - - @staticmethod - def get_instance(name: str): - """Get the unique single logger instance based on name. 
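A sketch of the `DistributedLogger` instance APIs defined in this class; the `./logs` directory and `train` suffix are hypothetical values:

```python
from colossalai.logging import get_dist_logger

logger = get_dist_logger('colossalai')
logger.set_level('DEBUG')
logger.log_to_file('./logs', suffix='train')       # -> ./logs/rank_<r>_train.log
logger.info('only rank 0 prints this', ranks=[0])  # rank-restricted logging
```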
- - :param name: The name of the logger - :type name: str - :return: A DistributedLogger object - :rtype: DistributedLogger - """ - if name in DistributedLogger.__instances: - return DistributedLogger.__instances[name] - else: - logger = DistributedLogger(name=name) - return logger - - def __init__(self, name): - if name in DistributedLogger.__instances: - raise Exception( - 'Logger with the same name has been created, you should use colossalai.logging.get_dist_logger') - else: - self._name = name - self._logger = logging.getLogger(name) - DistributedLogger.__instances[name] = self - - @staticmethod - def _check_valid_logging_level(level: str): - assert level in ['INFO', 'DEBUG', 'WARNING', 'ERROR'], 'found invalid logging level' - - def set_level(self, level: str): - """Set the logging level - - :param level: Can only be INFO, DEBUG, WARNING and ERROR - :type level: str - """ - self._check_valid_logging_level(level) - self._logger.setLevel(getattr(logging, level)) - - def log_to_file(self, path: Union[str, Path], mode: str = 'a', level: str = 'INFO', suffix: str = None): - """Save the logs to file - - :param path: The file to save the log - :type path: A string or pathlib.Path object - :param mode: The mode to write log into the file - :type mode: str - :param level: Can only be INFO, DEBUG, WARNING and ERROR - :type level: str - :param suffix: The suffix string of log's name - :type suffix: str - """ - assert isinstance(path, (str, Path)), \ - f'expected argument path to be type str or Path, but got {type(path)}' - self._check_valid_logging_level(level) - - if isinstance(path, str): - path = Path(path) - - # create log directory - path.mkdir(parents=True, exist_ok=True) - - # set the default file name if path is a directory - if not colossalai.core.global_context.is_initialized(ParallelMode.GLOBAL): - rank = 0 - else: - rank = colossalai.core.global_context.get_global_rank() - - if suffix is not None: - log_file_name = f'rank_{rank}_{suffix}.log' - else: - log_file_name = f'rank_{rank}.log' - path = path.joinpath(log_file_name) - - # add file handler - file_handler = logging.FileHandler(path, mode) - file_handler.setLevel(getattr(logging, level)) - formatter = logging.Formatter(_FORMAT) - file_handler.setFormatter(formatter) - self._logger.addHandler(file_handler) - - def _log(self, level, message: str, parallel_mode: ParallelMode = ParallelMode.GLOBAL, ranks: list = None): - if ranks is None: - getattr(self._logger, level)(message) - else: - local_rank = colossalai.core.global_context.get_local_rank(parallel_mode) - if local_rank in ranks: - getattr(self._logger, level)(message) - - def info(self, message: str, parallel_mode: ParallelMode = ParallelMode.GLOBAL, ranks: list = None): - """Log an info message. - - :param message: The message to be logged - :type message: str - :param parallel_mode: The parallel mode used for logging. Defaults to ParallelMode.GLOBAL - :type parallel_mode: :class:`colossalai.context.parallel_mode.ParallelMode` - :param ranks: List of parallel ranks - :type ranks: list - """ - self._log('info', message, parallel_mode, ranks) - - def warning(self, message: str, parallel_mode: ParallelMode = ParallelMode.GLOBAL, ranks: list = None): - """Log a warning message. - - :param message: The message to be logged - :type message: str - :param parallel_mode: The parallel mode used for logging. 
Defaults to ParallelMode.GLOBAL - :type parallel_mode: :class:`colossalai.context.parallel_mode.ParallelMode` - :param ranks: List of parallel ranks - :type ranks: list - """ - self._log('warning', message, parallel_mode, ranks) - - def debug(self, message: str, parallel_mode: ParallelMode = ParallelMode.GLOBAL, ranks: list = None): - """Log a debug message. - - :param message: The message to be logged - :type message: str - :param parallel_mode: The parallel mode used for logging. Defaults to ParallelMode.GLOBAL - :type parallel_mode: :class:`colossalai.context.parallel_mode.ParallelMode` - :param ranks: List of parallel ranks - :type ranks: list - """ - self._log('debug', message, parallel_mode, ranks) - - def error(self, message: str, parallel_mode: ParallelMode = ParallelMode.GLOBAL, ranks: list = None): - """Log an error message. - - :param message: The message to be logged - :type message: str - :param parallel_mode: The parallel mode used for logging. Defaults to ParallelMode.GLOBAL - :type parallel_mode: :class:`colossalai.context.parallel_mode.ParallelMode` - :param ranks: List of parallel ranks - :type ranks: list - """ - self._log('error', message, parallel_mode, ranks) diff --git a/colossalai/nn/__init__.py b/colossalai/nn/__init__.py deleted file mode 100644 index 3991e3bfb9488a9c893e75b3859f1df79cca45c2..0000000000000000000000000000000000000000 --- a/colossalai/nn/__init__.py +++ /dev/null @@ -1,6 +0,0 @@ -from .layer import * -from .loss import * -from .lr_scheduler import * -from .metric import * -from .model import * -from .optimizer import * diff --git a/colossalai/nn/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index fe14da1ff3f0b1b0c2915437c306d5877183c7bc..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 1696de1e17804f976e36239df5896b0e36956178..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/__pycache__/init.cpython-36.pyc b/colossalai/nn/__pycache__/init.cpython-36.pyc deleted file mode 100644 index c0f5699124f4cdd8e26c85e1eded545c662a2369..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/__pycache__/init.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/__pycache__/init.cpython-37.pyc b/colossalai/nn/__pycache__/init.cpython-37.pyc deleted file mode 100644 index c2fe6733e32cd638de7063ee06df09926d5da86a..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/__pycache__/init.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/init.py b/colossalai/nn/init.py deleted file mode 100644 index 2aeff7c5268f5bb974db2c7544a00323e307e708..0000000000000000000000000000000000000000 --- a/colossalai/nn/init.py +++ /dev/null @@ -1,140 +0,0 @@ -import math -import warnings - -from torch import Tensor -import torch.nn as nn - - -def zeros_(): - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - return nn.init.zeros_(tensor) - - return initializer - - -def ones_(): - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - return nn.init.ones_(tensor) - - return initializer - - -def uniform_(a: float = 0., b: float = 1.): - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = 
None): - return nn.init.uniform_(tensor, a, b) - - return initializer - - -def normal_(mean: float = 0., std: float = 1.): - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - return nn.init.normal_(tensor, mean, std) - - return initializer - - -def trunc_normal_(mean: float = 0., std: float = 1., a: float = -2., b: float = 2.): - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - return nn.init.trunc_normal_(tensor, mean, std, a, b) - - return initializer - - -def kaiming_uniform_(a=0, mode='fan_in', nonlinearity='leaky_relu'): - # adapted from torch.nn.init - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - if 0 in tensor.shape: - warnings.warn("Initializing zero-element tensors is a no-op") - return tensor - - if mode == 'fan_in': - assert fan_in is not None, 'Fan_in is not provided.' - fan = fan_in - elif mode == 'fan_out': - assert fan_out is not None, 'Fan_out is not provided.' - fan = fan_out - else: - raise ValueError(f'Invalid initialization mode \'{mode}\'') - - std = nn.init.calculate_gain(nonlinearity, a) / math.sqrt(fan) - bound = math.sqrt(3.) * std - return nn.init.uniform_(tensor, -bound, bound) - - return initializer - - -def kaiming_normal_(a=0, mode='fan_in', nonlinearity='leaky_relu'): - # adapted from torch.nn.init - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - if 0 in tensor.shape: - warnings.warn("Initializing zero-element tensors is a no-op") - return tensor - - if mode == 'fan_in': - assert fan_in is not None, 'Fan_in is not provided.' - fan = fan_in - elif mode == 'fan_out': - assert fan_out is not None, 'Fan_out is not provided.' - fan = fan_out - else: - raise ValueError(f'Invalid initialization mode \'{mode}\'') - - std = nn.init.calculate_gain(nonlinearity, a) / math.sqrt(fan) - return nn.init.normal_(tensor, 0, std) - - return initializer - - -def xavier_uniform_(a: float = math.sqrt(3.), scale: float = 2., gain: float = 1.): - # adapted from torch.nn.init - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - assert fan_in is not None, 'Fan_in is not provided.' - - fan = fan_in - if fan_out is not None: - fan += fan_out - - std = gain * math.sqrt(scale / float(fan)) - bound = a * std - return nn.init.uniform_(tensor, -bound, bound) - - return initializer - - -def xavier_normal_(scale: float = 2., gain: float = 1.): - # adapted from torch.nn.init - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - assert fan_in is not None, 'Fan_in is not provided.' - - fan = fan_in - if fan_out is not None: - fan += fan_out - - std = gain * math.sqrt(scale / float(fan)) - - return nn.init.normal_(tensor, 0., std) - - return initializer - - -def lecun_uniform_(): - # adapted from jax.nn.initializers - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - assert fan_in is not None, 'Fan_in is not provided.' - - var = 1.0 / fan_in - bound = math.sqrt(3 * var) - return nn.init.uniform_(tensor, -bound, bound) - - return initializer - - -def lecun_normal_(): - # adapted from jax.nn.initializers - def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None): - assert fan_in is not None, 'Fan_in is not provided.' 
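- # e.g. fan_in = 400 gives std = sqrt(1/400) = 0.05; the division by - # .87962566103423978 below compensates for the variance removed when - # trunc_normal_ clips the distribution at +/-2 standard deviations.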
- - std = math.sqrt(1.0 / fan_in) - return nn.init.trunc_normal_(tensor, std=std / .87962566103423978) - - return initializer diff --git a/colossalai/nn/layer/__init__.py b/colossalai/nn/layer/__init__.py deleted file mode 100644 index 86961dd933a73f292da722fe76467657a20e950a..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/__init__.py +++ /dev/null @@ -1,9 +0,0 @@ -from .colossalai_layer import * -from .parallel_1d import * -from .parallel_2d import * -from .parallel_2p5d import * -from .parallel_3d import * -from .parallel_sequence import * -from .utils import * -from .vanilla import * -from .wrapper import * diff --git a/colossalai/nn/layer/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 4e2e0a6df0b00c663e83e642024fa90aa35b57b7..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index f6ae8aea207d207f5f2fac47ca2812821b98ff27..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/__pycache__/base_layer.cpython-36.pyc b/colossalai/nn/layer/__pycache__/base_layer.cpython-36.pyc deleted file mode 100644 index 14ad2d081033ed4a33e22376b7e0fdfcdb1da09b..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/__pycache__/base_layer.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/__pycache__/base_layer.cpython-37.pyc b/colossalai/nn/layer/__pycache__/base_layer.cpython-37.pyc deleted file mode 100644 index 946121aeee5dca55d63ea71b7e3393c4a083257f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/__pycache__/base_layer.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/base_layer.py b/colossalai/nn/layer/base_layer.py deleted file mode 100644 index fd0d6ef5ea0fc56410b32c829377776e2bdf7ddb..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/base_layer.py +++ /dev/null @@ -1,27 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn - -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc - - -class ParallelLayer(nn.Module): - - def __init__(self): - super().__init__() - self.data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank( - ParallelMode.DATA) - self.data_parallel_size = 1 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_world_size( - ParallelMode.DATA) - - self.tensor_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.TENSOR) else gpc.get_local_rank( - ParallelMode.TENSOR) - self.tensor_parallel_size = 1 if not gpc.is_initialized(ParallelMode.TENSOR) else gpc.get_world_size( - ParallelMode.TENSOR) - - self.pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - self.pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) diff --git a/colossalai/nn/layer/colossalai_layer/__init__.py b/colossalai/nn/layer/colossalai_layer/__init__.py deleted file mode 100644 index 2ae1b07a75b2e7a231fc3512e8f46bccd0e9d4c6..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/colossalai_layer/__init__.py 
+++ /dev/null @@ -1,7 +0,0 @@ -from ._utils import partition_batch -from .dropout import Dropout -from .embedding import Embedding, PatchEmbedding -from .linear import Classifier, Linear -from .normalization import LayerNorm - -__all__ = ['Linear', 'Classifier', 'Embedding', 'PatchEmbedding', 'LayerNorm', 'Dropout', 'partition_batch'] diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 86dd55b71998c0b47a143a73cfe8dad57897ef6c..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 54c46279d844fa80e77ea3ba6142134b0d85d270..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/_utils.cpython-36.pyc deleted file mode 100644 index 5258ef17791576c8ea9219921d4fb778cd632f5a..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/_utils.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/_utils.cpython-37.pyc deleted file mode 100644 index 3a251ab91b4303f57654d6dafb678aa96281e0bc..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/_utils.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/dropout.cpython-36.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/dropout.cpython-36.pyc deleted file mode 100644 index 4bc9cf27961acb0bde34c22c335027c46ca44fe1..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/dropout.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/dropout.cpython-37.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/dropout.cpython-37.pyc deleted file mode 100644 index 1e8d35952b707d5e864e6882e84e62c59b4c5b8d..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/dropout.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/embedding.cpython-36.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/embedding.cpython-36.pyc deleted file mode 100644 index f5c375e1b0cf9af298fa9856afb701698d909f37..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/embedding.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/embedding.cpython-37.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/embedding.cpython-37.pyc deleted file mode 100644 index 8faecc341b57587323c25bf6c1e2935e4e054b2b..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/embedding.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/linear.cpython-36.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/linear.cpython-36.pyc 
deleted file mode 100644 index b416b83fac13f9888a87482fd38ba5f1c44535dc..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/linear.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/linear.cpython-37.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/linear.cpython-37.pyc deleted file mode 100644 index cec16111b07a9a55952a69e5c13985cc0976799f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/linear.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/normalization.cpython-36.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/normalization.cpython-36.pyc deleted file mode 100644 index 2955865a0400b9913711af05d676616c88b97d5f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/normalization.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/__pycache__/normalization.cpython-37.pyc b/colossalai/nn/layer/colossalai_layer/__pycache__/normalization.cpython-37.pyc deleted file mode 100644 index 8765606a2cc08e8f301275a0462cc95d13e2afcb..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/colossalai_layer/__pycache__/normalization.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/colossalai_layer/_utils.py b/colossalai/nn/layer/colossalai_layer/_utils.py deleted file mode 100644 index 6271667cc1e419a673c622765880840bd831e877..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/colossalai_layer/_utils.py +++ /dev/null @@ -1,19 +0,0 @@ -from torch import Tensor - -from ..parallel_2d._operation import split_tensor_2d -from ..parallel_2p5d._operation import split_tensor_2p5d -from ..parallel_3d._operation import split_batch_3d -from ..utils import get_tensor_parallel_mode - -_parallel_split_batch = {'2d': split_tensor_2d, '2.5d': split_tensor_2p5d, '3d': split_batch_3d} - - -def partition_batch(input_) -> Tensor: - tensor_parallel_mode = get_tensor_parallel_mode() - if tensor_parallel_mode in _parallel_split_batch: - if isinstance(input_, dict): - return {k: _parallel_split_batch[tensor_parallel_mode](v) for k, v in input_.items()} - else: - return _parallel_split_batch[tensor_parallel_mode](input_) - else: - return input_ diff --git a/colossalai/nn/layer/colossalai_layer/dropout.py b/colossalai/nn/layer/colossalai_layer/dropout.py deleted file mode 100644 index 8921b0884611ca918241b460ec763a85eb9f2b2e..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/colossalai_layer/dropout.py +++ /dev/null @@ -1,30 +0,0 @@ -import torch.nn as nn -from colossalai.context import ParallelMode, seed - -from ..parallel_1d import * -from ..utils import get_tensor_parallel_mode - - -class Dropout(nn.Module): - """ - Dropout layer of colossalai - - :param p: dropout rate, defaults to 0.5 - :type p: float, optional - :param inplace: If set to ``True``, will do this operation in-place, defaults to ``False`` - :type inplace: bool, optional - """ - def __init__(self, p: float = 0.5, inplace: bool = False) -> None: - super().__init__() - self.tensor_parallel = get_tensor_parallel_mode() - if self.tensor_parallel == '1d': - self.drop = Dropout1D(p, inplace) - else: - self.drop = nn.Dropout(p, inplace) - - def forward(self, *args): - if self.tensor_parallel in [None, '1d']: - return self.drop(*args) - else: - with seed(ParallelMode.TENSOR): - return 
self.drop(*args) diff --git a/colossalai/nn/layer/colossalai_layer/embedding.py b/colossalai/nn/layer/colossalai_layer/embedding.py deleted file mode 100644 index daa74e8ae239337810ac5eb7191e9e31eed83106..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/colossalai_layer/embedding.py +++ /dev/null @@ -1,166 +0,0 @@ -import math -from typing import Callable - -from colossalai.utils import get_current_device -from torch import dtype, nn - -from ... import init as init -from ..parallel_1d import * -from ..parallel_2d import * -from ..parallel_2p5d import * -from ..parallel_3d import * -from ..utils import get_tensor_parallel_mode -from ..vanilla import * - -_parallel_embedding = { - '2d': Embedding2D, - '2.5d': Embedding2p5D, - '3d': Embedding3D, -} - -_vocab_parallel_embedding = { - '1d': VocabParallelEmbedding1D, - '2d': VocabParallelEmbedding2D, - '2.5d': VocabParallelEmbedding2p5D, - '3d': VocabParallelEmbedding3D -} - -_parallel_patchembedding = { - None: VanillaPatchEmbedding, - '1d': VanillaPatchEmbedding, - '2d': PatchEmbedding2D, - '2.5d': PatchEmbedding2p5D, - '3d': PatchEmbedding3D -} - - -class Embedding(nn.Module): - """ - Embedding for colossalai - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: dtype = None, - weight_initializer: Callable = init.normal_(), - vocab_parallel_limit: int = 2048, - *args, - **kwargs) -> None: - super().__init__() - tensor_parallel = get_tensor_parallel_mode() - if tensor_parallel is None or (tensor_parallel == '1d' and num_embeddings <= vocab_parallel_limit): - self.embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx, *args, - **kwargs).to(dtype).to(get_current_device()) - weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim) - elif num_embeddings <= vocab_parallel_limit: - self.embed = _parallel_embedding[tensor_parallel]( - num_embeddings, - embedding_dim, - padding_idx=padding_idx, - dtype=dtype, - weight_initializer=weight_initializer, - *args, - **kwargs, - ) - else: - self.embed = _vocab_parallel_embedding[tensor_parallel]( - num_embeddings, - embedding_dim, - padding_idx=padding_idx, - dtype=dtype, - weight_initializer=weight_initializer, - *args, - **kwargs, - ) - - @property - def weight(self): - return self.embed.weight - - def forward(self, *args): - return self.embed(*args) - - -class PatchEmbedding(nn.Module): - """ - 2D Image to Patch Embedding - - :param img_size: image size - :type img_size: int - :param patch_size: patch size - :type patch_size: int - :param in_chans: number of channels of input image - :type in_chans: int - :param embed_size: size of embedding - :type embed_size: int - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param flatten: whether to flatten output tensor, defaults to True - :type flatten: bool, optional - :param weight_initializer: The intializer of weight, defaults to kaiming 
uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - :param position_embed_initializer: The initializer of position embedding, defaults to zero - :type position_embed_initializer: typing.Callable, optional - """ - - def __init__( - self, - img_size: int, - patch_size: int, - in_chans: int, - embed_size: int, - dtype: dtype = None, - flatten: bool = True, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - position_embed_initializer: Callable = init.zeros_() - ) -> None: - super().__init__() - tensor_parallel = get_tensor_parallel_mode() - self.embed = _parallel_patchembedding[tensor_parallel]( - img_size, - patch_size, - in_chans, - embed_size, - dtype=dtype, - flatten=flatten, - weight_initializer=weight_initializer, - bias_initializer=bias_initializer, - position_embed_initializer=position_embed_initializer, - ) - - @property - def weight(self): - return self.embed.weight - - @property - def bias(self): - return self.embed.bias - - @property - def pos_embed(self): - return self.embed.pos_embed - - @property - def cls_token(self): - return self.embed.cls_token - - def forward(self, *args): - return self.embed(*args) diff --git a/colossalai/nn/layer/colossalai_layer/linear.py b/colossalai/nn/layer/colossalai_layer/linear.py deleted file mode 100644 index baa2abf7c1046a51e91c5892d120cce27cbf68d2..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/colossalai_layer/linear.py +++ /dev/null @@ -1,149 +0,0 @@ -import math -from typing import Callable - -from colossalai.utils import get_current_device -from torch import dtype, nn - -from ... 
import init as init -from ..parallel_1d import * -from ..parallel_2d import * -from ..parallel_2p5d import * -from ..parallel_3d import * -from ..utils import get_tensor_parallel_mode -from ..vanilla import * - -_parallel_linear = {'1d': Linear1D, '2d': Linear2D, '2.5d': Linear2p5D, '3d': Linear3D} - -_parallel_classifier = { - None: VanillaClassifier, - '1d': Classifier1D, - '2d': Classifier2D, - '2.5d': Classifier2p5D, - '3d': Classifier3D -} - -_vocab_parallel_classifier = { - '1d': VocabParallelClassifier1D, - '2d': VocabParallelClassifier2D, - '2.5d': VocabParallelClassifier2p5D, - '3d': VocabParallelClassifier3D -} - - -class Linear(nn.Module): - """ - Linear layer of colossalai - - :param in_features: size of each input sample - :type in_features: int - :param out_features: size of each output sample - :type out_features: int - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - :param kwargs: Kwargs used for particular parallelisms - """ - - def __init__(self, - in_features: int, - out_features: int, - bias: bool = True, - dtype: dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - **kwargs) -> None: - super().__init__() - tensor_parallel = get_tensor_parallel_mode() - if tensor_parallel is None: - self.layer = nn.Linear(in_features, out_features, bias=bias).to(dtype).to(get_current_device()) - weight_initializer(self.layer.weight, fan_in=in_features, fan_out=out_features) - if self.layer.bias is not None: - bias_initializer(self.layer.bias, fan_in=in_features) - else: - self.layer = _parallel_linear[tensor_parallel]( - in_features, - out_features, - bias=bias, - dtype=dtype, - weight_initializer=weight_initializer, - bias_initializer=bias_initializer, - **kwargs, - ) - - @property - def weight(self): - return self.layer.weight - - @property - def bias(self): - return self.layer.bias - - def forward(self, *args): - return self.layer(*args) - - -class Classifier(nn.Module): - """ - Classifier layer of colossalai - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of total classes for the dataset - :type num_classes: int - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - - def __init__(self, - in_features: int, - num_classes: int, - weight: nn.Parameter = None, - bias: bool = True, - dtype: dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - vocab_parallel_limit: int = 2048) -> None: - super().__init__() - tensor_parallel 
= get_tensor_parallel_mode() - if num_classes <= vocab_parallel_limit or tensor_parallel is None: - self.layer = _parallel_classifier[tensor_parallel]( - in_features, - num_classes, - weight=weight, - bias=bias, - dtype=dtype, - weight_initializer=weight_initializer, - bias_initializer=bias_initializer, - ) - else: - self.layer = _vocab_parallel_classifier[tensor_parallel]( - in_features, - num_classes, - weight=weight, - bias=bias, - dtype=dtype, - weight_initializer=weight_initializer, - bias_initializer=bias_initializer, - ) - - @property - def weight(self): - return self.layer.weight - - @property - def bias(self): - return self.layer.bias - - def forward(self, *args): - return self.layer(*args) diff --git a/colossalai/nn/layer/colossalai_layer/normalization.py b/colossalai/nn/layer/colossalai_layer/normalization.py deleted file mode 100644 index 1f9277214910a131226017ad7656e4c3e9daf3eb..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/colossalai_layer/normalization.py +++ /dev/null @@ -1,53 +0,0 @@ -from colossalai.utils import get_current_device -from torch import nn -from colossalai import kernel - -from ... import init as init -from ..parallel_1d import * -from ..parallel_2d import * -from ..parallel_2p5d import * -from ..parallel_3d import * -from ..utils import get_tensor_parallel_mode -from ..vanilla import * - -_parallel_layernorm = { - '1d': kernel.LayerNorm, - '2d': LayerNorm2D, - '2.5d': LayerNorm2p5D, - '3d': LayerNorm3D -} - - -class LayerNorm(nn.Module): - r""" - Layer Normalization for colossalai - - :param normalized_shape: input shape from an expected input - of size. :math:`[* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]` - If a single integer is used, it is treated as a singleton list, and this module will - normalize over the last dimension which is expected to be of that specific size. 
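Taken together, these wrapper modules (`Linear`, `Classifier`, `Embedding`, `PatchEmbedding`, `LayerNorm`, `Dropout`) share one pattern: look up `get_tensor_parallel_mode()` and construct the matching parallel implementation from a registry dict. A sketch, assuming `colossalai.launch` has already initialised a `'2d'` tensor-parallel context; the `.layer`/`.norm` attributes are the ones defined above:

```python
from colossalai.nn import Linear, LayerNorm

# With tensor parallel mode '2d' already set up, the wrappers pick the
# 2D implementations from their registry dicts.
fc = Linear(1024, 4096)
norm = LayerNorm(1024)
print(type(fc.layer).__name__, type(norm.norm).__name__)   # Linear2D LayerNorm2D
```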
- :type normalized_shape: int - :param eps: a value added to the denominator for numerical stability, defaults to 1e-05 - :type eps: float, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - """ - - def __init__(self, normalized_shape: int, eps=1e-05, dtype=None) -> None: - super().__init__() - tensor_parallel = get_tensor_parallel_mode() - if tensor_parallel is None: - self.norm = nn.LayerNorm(normalized_shape, eps=eps).to(dtype).to(get_current_device()) - else: - self.norm = _parallel_layernorm[tensor_parallel](normalized_shape, eps=eps, dtype=dtype) - - @property - def weight(self): - return self.norm.weight - - @property - def bias(self): - return self.norm.bias - - def forward(self, *args): - return self.norm(*args) diff --git a/colossalai/nn/layer/moe/__init__.py b/colossalai/nn/layer/moe/__init__.py deleted file mode 100644 index e75aff6edd89672077e9a10c27d4b40b513c71ec..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/moe/__init__.py +++ /dev/null @@ -1,8 +0,0 @@ -from ._operation import AllToAll -from .layers import Experts, MoeLayer, \ - NormalNoiseGenerator, Top1Router, Top2Router - -__all__ = [ - 'AllToAll', 'Experts', 'Top1Router', 'Top2Router', - 'MoeLayer', 'NormalNoiseGenerator' -] \ No newline at end of file diff --git a/colossalai/nn/layer/moe/_operation.py b/colossalai/nn/layer/moe/_operation.py deleted file mode 100644 index fd2720fb90186ad59725283a147780af34e7e332..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/moe/_operation.py +++ /dev/null @@ -1,29 +0,0 @@ -import torch -import torch.distributed as dist -from torch import Tensor - -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from typing import Any, Tuple - - -class AllToAll(torch.autograd.Function): - """Dispatches the input tensor [e, c, h] to all experts via the all_to_all_single - operation in torch.distributed. - """ - @staticmethod - def forward(ctx: Any, - inputs: Tensor, - parallel_mode: ParallelMode) -> Tensor: - ctx.parallel_mode = parallel_mode - if not inputs.is_contiguous(): - inputs = inputs.contiguous() - - output = torch.empty_like(inputs) - dist.all_to_all_single(output, inputs, - group=gpc.get_group(parallel_mode)) - return output - - @staticmethod - def backward(ctx: Any, *grad_outputs: Tensor) -> Tuple[Tensor, None]: - return AllToAll.apply(*grad_outputs, ctx.parallel_mode), None diff --git a/colossalai/nn/layer/moe/layers.py b/colossalai/nn/layer/moe/layers.py deleted file mode 100644 index ab9c7239556e251e675cb99630b7a15ba02fe0b1..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/moe/layers.py +++ /dev/null @@ -1,276 +0,0 @@ -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F -from torch.cuda.amp import autocast -from colossalai.global_variables import moe_env -from colossalai.context import ParallelMode, seed -from colossalai.utils import get_current_device -from ._operation import AllToAll - - -class NormalNoiseGenerator: - """Generates a random noisy mask for the logits tensor. - - All noise is generated from a normal distribution (0, 1 / E^2), where - E = the number of experts. 
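A device-free sketch of the same noising idea in plain torch (CPU, no Colossal-AI context needed):

```python
import torch

num_experts = 8
sampler = torch.distributions.normal.Normal(0.0, 1.0 / num_experts ** 2).rsample

logits = torch.randn(16, num_experts)      # [tokens, experts]
noisy = logits + sampler(logits.shape)     # jitter only perturbs the routing decision
```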
- - :param num_experts: The number of experts - :type num_experts: int - """ - - def __init__(self, num_experts: int): - self.normal = torch.distributions.normal.Normal( - loc=torch.tensor(0.0, device=get_current_device()), - scale=torch.tensor(1.0 / num_experts ** 2, device=get_current_device()) - ).rsample - - def __call__(self, inputs: torch.Tensor): - noisy = self.normal(inputs.shape) - return inputs + noisy - - -class Experts(nn.Module): - """A wrapper class to create experts. It will create E experts across the - moe model parallel group, where E is the number of experts. Every expert - is an instance of the class ``expert`` given in the initialization parameters. - - :param expert: The class of all experts - :param num_experts: The number of experts - :param expert_args: Args used to initialize experts - - :type num_experts: int - """ - - def __init__(self, expert, num_experts, **expert_args): - super().__init__() - - assert num_experts % moe_env.model_parallel_size == 0, \ - "The number of experts must be divisible by the MoE model parallel size" - - num_local_experts = num_experts // moe_env.model_parallel_size - with seed(ParallelMode.MOE_MODEL): - self.experts = nn.ModuleList([ - expert(**expert_args) for _ in range(num_local_experts)]) - self.num_local_experts = num_local_experts - for exp in self.experts: - for param in exp.parameters(): - param.__setattr__('moe_param', 1) - - def forward(self, inputs): - expert_input = torch.chunk(inputs, self.num_local_experts, dim=0) - expert_output = [] - - for i in range(self.num_local_experts): - expert_output.append(self.experts[i](expert_input[i])) - - output = torch.cat(expert_output, dim=0) - return output - - -class Top1Router(nn.Module): - """Top1 router that returns the dispatch mask [s, e, c] and combine weight [s, e, c] - for routing usage. More details can be found in Google's Switch Transformer paper. 
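A worked example of the `get_capacity` computation below, with made-up numbers:

```python
import math

capacity_factor, min_capacity = 1.2, 4
tokens, experts = 1024, 16                  # logits shape [s, e]
capacity = max(math.ceil(capacity_factor * tokens / experts), min_capacity)
print(capacity)                             # 77 tokens kept per expert
```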
- - :param capacity_factor: Capacity factor in routing - :param min_capacity: The minimum number of the capacity of each expert - :param noisy_func: Noisy function used in logits - - :type capacity_factor: float - :type min_capacity: int - :type noisy_func: Callable, optional - """ - - def __init__(self, - capacity_factor: float, - min_capacity: int, - noisy_func=None): - super().__init__() - self.capacity_factor = capacity_factor - self.min_capacity = min_capacity - self.noisy_func = noisy_func - self.uniform = torch.distributions.uniform.Uniform( - low=torch.tensor(0.0, device=get_current_device()), - high=torch.tensor(1.0, device=get_current_device())).rsample - - def get_capacity(self, logits_shape): - capacity = math.ceil(self.capacity_factor * - logits_shape[0] / logits_shape[1]) - if capacity < self.min_capacity: - capacity = self.min_capacity - return capacity - - def forward(self, inputs): - - if self.noisy_func is not None: - inputs_noisy = self.noisy_func(inputs) - else: - inputs_noisy = inputs - - logits = F.softmax(inputs, dim=1) - - num_experts = logits.shape[1] - capacity = self.get_capacity(logits.shape) - - expert_idx = torch.argmax(inputs_noisy, dim=1) - expert_mask = F.one_hot(expert_idx, num_classes=num_experts) - expert_mask_f = expert_mask.float() - - exp_counts = torch.sum(expert_mask, dim=0).detach().to('cpu') - - me = torch.mean(logits, dim=0) - ce = torch.mean(expert_mask_f, dim=0) - l_aux = torch.sum(me * ce) * num_experts - moe_env.add_loss(l_aux) - - rand_mask = expert_mask * self.uniform(logits.shape) - _, dispatch_idx = torch.topk(rand_mask, k=capacity, dim=0) - - dispatch_mask = \ - expert_mask * torch.zeros_like(expert_mask).scatter_(0, dispatch_idx, 1) - - locations = torch.cumsum(dispatch_mask, dim=0) - 1 - locations = torch.sum(dispatch_mask * locations, dim=1) - locations = F.one_hot(locations, num_classes=capacity) - - logits = logits * dispatch_mask - combine_weights = logits.unsqueeze(2) * locations.unsqueeze(1) - - sec_mask = combine_weights.bool() - return combine_weights, sec_mask, exp_counts - - -class Top2Router(nn.Module): - """Top2 router that returns the dispatch mask [s, e, c] and combine weight [s, e, c] - for routing usage. More details can be found in the ViT-MoE paper. 
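The cumsum trick used by both routers assigns each token a slot in its expert's queue; a tiny sketch with 3 tokens and 2 experts:

```python
import torch
import torch.nn.functional as F

top1 = torch.tensor([0, 1, 0])               # top-1 expert of each token
mask1 = F.one_hot(top1, num_classes=2)
locations1 = torch.cumsum(mask1, dim=0) - 1  # queue position per token
print(locations1[:, 0].tolist())             # [0, 0, 1]: tokens 0 and 2 queue at expert 0
```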
- - :param capacity_factor: Capacity factor in routing - :param noisy_func: Noisy function used in logits - - :type capacity_factor: float - :type noisy_func: Callable, optional - """ - - def __init__(self, capacity_factor: float, noisy_func=None): - super().__init__() - self.capacity_factor = capacity_factor - self.noisy_func = noisy_func - - def get_capacity(self, logits_shape): - capacity = math.ceil(2 * self.capacity_factor * - logits_shape[0] / logits_shape[1]) - return capacity - - def forward(self, inputs): - if self.noisy_func is not None: - inputs = self.noisy_func(inputs) - - logits = F.softmax(inputs, dim=-1) - num_experts = logits.size(-1) - capacity = self.get_capacity(logits.shape) - - _, expert_idx = torch.topk(logits, k=2, dim=-1, largest=True, sorted=True) - top1_idx = expert_idx[:, 0] - top2_idx = expert_idx[:, 1] - - mask1 = F.one_hot(top1_idx, num_classes=num_experts) - mask2 = F.one_hot(top2_idx, num_classes=num_experts) - - loss_mask = (mask1 + mask2) - exp_counts = torch.sum(loss_mask, dim=0).detach().to('cpu') - me = torch.mean(logits, dim=0) - ce = torch.mean(loss_mask.float(), dim=0) - l_aux = num_experts * torch.sum(me * ce) / 2.0 - moe_env.add_loss(l_aux) - - locations1 = torch.cumsum(mask1, dim=0) - 1 - locations2 = torch.cumsum(mask2, dim=0) - 1 - locations2 += torch.sum(mask1, dim=0, keepdim=True) - - mask1 *= torch.lt(locations1, capacity) - mask2 *= torch.lt(locations2, capacity) - - weight1 = mask1 * logits - weight2 = mask2 * logits - - locations1 = torch.sum(mask1 * locations1, dim=1) - locations2 = torch.sum(mask2 * locations2, dim=1) - locations1_sc = F.one_hot(locations1, num_classes=capacity) - locations2_sc = F.one_hot(locations2, num_classes=capacity) - - combine_weights1 = weight1.unsqueeze(2) * locations1_sc.unsqueeze(1) - combine_weights2 = weight2.unsqueeze(2) * locations2_sc.unsqueeze(1) - combine_weights = combine_weights1 + combine_weights2 - sec_mask = combine_weights.bool() - - return combine_weights, sec_mask, exp_counts - - -class MoeLayer(nn.Module): - """A MoE layer. It feeds its input tensor to the gate and uses the resulting logits - to route all tokens, exchanging tokens among experts across the MoE tensor group via - all-to-all communication. It then gathers the expert outputs, exchanges them back, - and finally returns the combined output of the MoE system. 
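The dispatch step in `MoeLayer.forward` below is a single matmul between the permuted [e, c, s] mask and the [s, h] token matrix; a shape-only sketch:

```python
import torch

s, e, c, h = 12, 4, 3, 8                      # tokens, experts, capacity, hidden
tokens = torch.randn(s, h)
sec_mask_f = torch.zeros(s, e, c)             # dispatch mask from the router

dispatch = torch.matmul(sec_mask_f.permute(1, 2, 0), tokens)
print(dispatch.shape)                         # torch.Size([4, 3, 8]) == [e, c, h]
```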
- - :param dim_model: Dimension of model - :param num_experts: The number of experts - :param router: Instance of router used in routing - :param experts: Instance of experts generated by Expert - - :type dim_model: int - :type num_experts: int - :type router: nn.Module - :type experts: nn.Module - """ - - def __init__(self, - dim_model: int, - num_experts: int, - router: nn.Module, - experts: nn.Module): - super().__init__() - self.d_model = dim_model - self.num_experts = num_experts - self.gate = nn.Linear(dim_model, num_experts, device=get_current_device()) - self.router = router - self.experts = experts - - def _router_part(self, tokens: torch.Tensor): - gate_output = self.gate(tokens) - return self.router(gate_output) - - def router_part(self, tokens: torch.Tensor): - autocast_context = torch.is_autocast_enabled() - if not autocast_context: - return self._router_part(tokens) - else: - with autocast(enabled=False): - if tokens.dtype == torch.float16: - input_tokens = tokens.float() - else: - input_tokens = tokens - return self._router_part(input_tokens) - - def forward(self, inputs: torch.Tensor) -> torch.Tensor: - tokens = inputs.reshape(-1, self.d_model) - - combine_weights, sec_mask, exp_counts = self.router_part(tokens) - - sec_mask_f = sec_mask.type_as(inputs) - dispatch_data = torch.matmul(sec_mask_f.permute(1, 2, 0), tokens) - - dispatch_data = AllToAll.apply(dispatch_data, ParallelMode.MOE_MODEL) - - expert_output = self.experts(dispatch_data) - - expert_output = AllToAll.apply(expert_output, ParallelMode.MOE_MODEL) - - combine_weights = combine_weights.view(combine_weights.shape[0], -1) - expert_output = expert_output.view(-1, expert_output.shape[-1]) - - ret = torch.matmul(combine_weights, expert_output) - ret = ret.reshape(inputs.shape) - - return ret diff --git a/colossalai/nn/layer/parallel_1d/__init__.py b/colossalai/nn/layer/parallel_1d/__init__.py deleted file mode 100644 index fddeedd7df8e71d879fadc9cd526baaca57b1567..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_1d/__init__.py +++ /dev/null @@ -1,7 +0,0 @@ -from .layers import (Classifier1D, Dropout1D, Embedding1D, Linear1D, Linear1D_Col, Linear1D_Row, - VocabParallelClassifier1D, VocabParallelEmbedding1D) - -__all__ = [ - 'Linear1D', 'Linear1D_Col', 'Linear1D_Row', 'Embedding1D', 'Dropout1D', 'Classifier1D', 'VocabParallelClassifier1D', - 'VocabParallelEmbedding1D' -] diff --git a/colossalai/nn/layer/parallel_1d/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/parallel_1d/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 42f675851e814a5672f62a31be0f897da2136443..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_1d/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_1d/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/parallel_1d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 64c7abdb584a517c5411925c1a95aa16ee3fc653..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_1d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_1d/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/layer/parallel_1d/__pycache__/_utils.cpython-36.pyc deleted file mode 100644 index f6629d148bfe372f52d011116810184c6402b234..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_1d/__pycache__/_utils.cpython-36.pyc and /dev/null differ diff --git 
a/colossalai/nn/layer/parallel_1d/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/layer/parallel_1d/__pycache__/_utils.cpython-37.pyc deleted file mode 100644 index 142b1aa2a6eacb0187955e529270e7d2deec2deb..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_1d/__pycache__/_utils.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_1d/__pycache__/layers.cpython-36.pyc b/colossalai/nn/layer/parallel_1d/__pycache__/layers.cpython-36.pyc deleted file mode 100644 index 90215068168d553373d688ddda1db25187616c01..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_1d/__pycache__/layers.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_1d/__pycache__/layers.cpython-37.pyc b/colossalai/nn/layer/parallel_1d/__pycache__/layers.cpython-37.pyc deleted file mode 100644 index 287c19668fbbd589973cc0225e72038028929fd1..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_1d/__pycache__/layers.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_1d/_operation.py b/colossalai/nn/layer/parallel_1d/_operation.py deleted file mode 100644 index d6b851e923f1e2ddc745d6b7c3eea39210cd58d9..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_1d/_operation.py +++ /dev/null @@ -1,46 +0,0 @@ -import torch - -try: - import fused_mix_prec_layer_norm_cuda -except ImportError: - fused_mix_prec_layer_norm_cuda = None - - -class FusedLayerNormAffineFunction1D(torch.autograd.Function): - r""" - Fused affine LayerNorm - - :param input: input matrix - :param weight: weight matrix - :param bias: bias matrix - :param normalized_shape: input shape from an expected input - of size. :math:`[* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]` - If a single integer is used, it is treated as a singleton list, and this module will - normalize over the last dimension which is expected to be of that specific size. 
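For reference, the affine forward computed by the fused kernel is equivalent to the following plain-PyTorch sketch, assuming a single trailing normalized dimension:

```python
import torch

def layer_norm_affine(x, weight, bias, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    invvar = torch.rsqrt(var + eps)           # the kernel also returns mean/invvar
    return (x - mean) * invvar * weight + bias
```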
- :param eps: a value added to the denominator for numerical stability - """ - - @staticmethod - def forward(ctx, input, weight, bias, normalized_shape, eps): - ctx.normalized_shape = normalized_shape - ctx.eps = eps - input_ = input.contiguous() - weight_ = weight.contiguous() - bias_ = bias.contiguous() - output, mean, invvar = fused_mix_prec_layer_norm_cuda.forward_affine( - input_, ctx.normalized_shape, weight_, bias_, ctx.eps) - ctx.save_for_backward(input_, weight_, bias_, mean, invvar) - return output - - - @staticmethod - def backward(ctx, grad_output): - input_, weight_, bias_, mean, invvar = ctx.saved_tensors - grad_input = grad_weight = grad_bias = None - grad_input, grad_weight, grad_bias \ - = fused_mix_prec_layer_norm_cuda.backward_affine( - grad_output.contiguous(), mean, invvar, - input_, ctx.normalized_shape, - weight_, bias_, ctx.eps) - - return grad_input, grad_weight, grad_bias, None, None \ No newline at end of file diff --git a/colossalai/nn/layer/parallel_1d/_utils.py b/colossalai/nn/layer/parallel_1d/_utils.py deleted file mode 100644 index cc1967f1126b7b8bfb6d11a26259225a19c9ed3e..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_1d/_utils.py +++ /dev/null @@ -1,177 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -import torch.distributed as dist -from colossalai.core import global_context as gpc -from colossalai.global_variables import tensor_parallel_env as env - -from ..utils import divide - - -def set_parallel_input(input_parallel: bool): - env.parallel_input_1d = input_parallel - - -def get_parallel_input(): - return env.parallel_input_1d - - -def vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank): - index_f = rank * per_partition_vocab_size - index_l = index_f + per_partition_vocab_size - return index_f, index_l - - -def vocab_range_from_global_vocab_size(global_vocab_size, rank, world_size): - per_partition_vocab_size = divide(global_vocab_size, world_size) - return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank) - - -def _reduce(input_, parallel_mode): - # skip if only one rank involved - if gpc.get_world_size(parallel_mode) == 1: - return input_ - dist.all_reduce(input_, group=gpc.get_group(parallel_mode)) - - return input_ - - -def _split(input_, parallel_mode, dim=-1): - # skip if only one rank involved - world_size = gpc.get_world_size(parallel_mode) - if world_size == 1: - return input_ - - # Split along last dimension. - dim_size = input_.size(dim) - assert dim_size % world_size == 0, \ - f'The dimension to split ({dim_size}) is not a multiple of world size ({world_size}), ' \ - f'cannot split tensor evenly' - - tensor_list = torch.split(input_, dim_size // world_size, dim=dim) - rank = gpc.get_local_rank(parallel_mode) - output = tensor_list[rank].contiguous() - - return output - - -def _gather(input_, parallel_mode, dim=-1): - # skip if only one rank involved - world_size = gpc.get_world_size(parallel_mode) - if world_size == 1: - return input_ - - # all gather - rank = gpc.get_local_rank(parallel_mode) - tensor_list = [torch.empty_like(input_) for _ in range(world_size)] - tensor_list[rank] = input_ - torch.distributed.all_gather(tensor_list, input_, group=gpc.get_group(parallel_mode)) - - # concat - output = torch.cat(tensor_list, dim=dim).contiguous() - - return output - - -class _ReduceGrad(torch.autograd.Function): - """ - Pass the input to the model parallel region. 
- - :param input_: input matrix - :param parallel_mode: parallel mode - """ - @staticmethod - def symbolic(graph, input_): - return input_ - - @staticmethod - def forward(ctx, input_, parallel_mode): - ctx.mode = parallel_mode - return input_ - - @staticmethod - def backward(ctx, grad_output): - return _reduce(grad_output, ctx.mode), None - - -class _ReduceInput(torch.autograd.Function): - """ - All-reduce the input from the model parallel region. - - :param input_: input matrix - :param parallel_mode: parallel mode - """ - @staticmethod - def symbolic(graph, input_): - return _reduce(input_) - - @staticmethod - def forward(ctx, input_, parallel_mode): - return _reduce(input_, parallel_mode) - - @staticmethod - def backward(ctx, grad_output): - return grad_output, None - - -class _SplitForwardGatherBackward(torch.autograd.Function): - """ - Split the input and keep only the chunk corresponding to the rank. - - :param input_: input matrix - :param parallel_mode: parallel mode - :param dim: dimension - """ - @staticmethod - def symbolic(graph, input_): - return _split(input_) - - @staticmethod - def forward(ctx, input_, parallel_mode, dim): - ctx.mode = parallel_mode - ctx.dim = dim - return _split(input_, parallel_mode, dim) - - @staticmethod - def backward(ctx, grad_output): - return _gather(grad_output, ctx.mode, ctx.dim), None, None - - -class _GatherForwardSplitBackward(torch.autograd.Function): - """ - Gather the input from the model parallel region and concatenate. - - :param input_: input matrix - :param parallel_mode: parallel mode - :param dim: dimension - """ - @staticmethod - def symbolic(graph, input_): - return _gather(input_) - - @staticmethod - def forward(ctx, input_, parallel_mode, dim): - ctx.mode = parallel_mode - ctx.dim = dim - return _gather(input_, parallel_mode, dim) - - @staticmethod - def backward(ctx, grad_output): - return _split(grad_output, ctx.mode, ctx.dim), None, None - - -def reduce_grad(input_, parallel_mode): - return _ReduceGrad.apply(input_, parallel_mode) - - -def reduce_input(input_, parallel_mode): - return _ReduceInput.apply(input_, parallel_mode) - - -def split_forward_gather_backward(input_, parallel_mode, dim): - return _SplitForwardGatherBackward.apply(input_, parallel_mode, dim) - - -def gather_forward_split_backward(input_, parallel_mode, dim): - return _GatherForwardSplitBackward.apply(input_, parallel_mode, dim) diff --git a/colossalai/nn/layer/parallel_1d/layers.py b/colossalai/nn/layer/parallel_1d/layers.py deleted file mode 100644 index daf54c12616cb1cc7ed51743a4c972817faa83d1..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_1d/layers.py +++ /dev/null @@ -1,598 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import math -from typing import Callable, Tuple - -import torch -import torch.nn.functional as F -from colossalai.communication import broadcast -from colossalai.context import ParallelMode, seed -from colossalai.core import global_context as gpc -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.nn import init as init -from colossalai.registry import LAYERS -from colossalai.utils.cuda import get_current_device -from torch import Tensor -from torch.nn.parameter import Parameter - -from ..base_layer import ParallelLayer -from ..utils import divide, set_tensor_parallel_attribute_by_partition -from ._utils import (gather_forward_split_backward, get_parallel_input, reduce_grad, - reduce_input, set_parallel_input, split_forward_gather_backward) - - 
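The four autograd functions above come in conjugate pairs: whatever collective runs in forward, its inverse runs in backward (identity paired with all-reduce, split paired with gather). A single-process sketch of the tensor manipulation behind `_split` and `_gather`, assuming world_size = 2 and rank = 0:

```python
import torch

world_size, rank, dim = 2, 0, -1
x = torch.arange(8.).reshape(2, 4)

# _split: keep this rank's chunk of the last dimension
shard = torch.split(x, x.size(dim) // world_size, dim=dim)[rank]

# _gather: all_gather then concatenate restores the full tensor
full = torch.cat([shard, x[:, 2:]], dim=dim)   # x[:, 2:] stands in for the other rank's shard
print(shard.shape, torch.equal(full, x))       # torch.Size([2, 2]) True
```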
-@LAYERS.register_module
-class Linear1D(torch.nn.Module):
-    """
-    Linear layer for 1D parallelism
-
-    :param in_features: size of each input sample
-    :type in_features: int
-    :param out_features: size of each output sample
-    :type out_features: int
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param gather_output: If set to ``True``, the full output is gathered onto every GPU, defaults to False
-    :type gather_output: bool, optional
-    :param skip_bias_add: If set to ``True``, it will skip bias add for linear layer, which is preserved for kernel fusion, defaults to False
-    :type skip_bias_add: bool, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-
-    def __init__(self,
-                 in_features: int,
-                 out_features: int,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 gather_output: bool = False,
-                 skip_bias_add: bool = False,
-                 weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)),
-                 bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)):
-        super().__init__()
-        parallel_input = get_parallel_input()
-        if not parallel_input:
-            self.layer = Linear1D_Col(in_features,
-                                      out_features,
-                                      bias=bias,
-                                      dtype=dtype,
-                                      gather_output=gather_output,
-                                      skip_bias_add=skip_bias_add,
-                                      weight_initializer=weight_initializer,
-                                      bias_initializer=bias_initializer)
-        else:
-            self.layer = Linear1D_Row(in_features,
-                                      out_features,
-                                      bias=bias,
-                                      dtype=dtype,
-                                      parallel_input=parallel_input,
-                                      skip_bias_add=skip_bias_add,
-                                      weight_initializer=weight_initializer,
-                                      bias_initializer=bias_initializer)
-
-    @property
-    def weight(self):
-        return self.layer.weight
-
-    @property
-    def bias(self):
-        return self.layer.bias
-
-    def forward(self, input_: Tensor) -> Tensor:
-        return self.layer(input_)
-
-
-@LAYERS.register_module
-class Classifier1D(ParallelLayer):
-    """Classifier of 1D parallelism (a row-parallel linear layer with an optionally given weight)
-
-    :param in_features: size of input features
-    :type in_features: int
-    :param num_classes: number of classes in the dataset
-    :type num_classes: int
-    :param weight: weight of the classifier, defaults to None
-    :type weight: torch.nn.Parameter, optional
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True``
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-
-    def __init__(self,
-                 in_features: int,
-                 num_classes: int,
-                 weight: Parameter = None,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)),
-                 bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)):
-        super().__init__()
-        self.in_features = in_features
-        self.num_classes = num_classes
-        self.parallel_input = get_parallel_input()
-
-        # Divide the weight matrix along the last dimension.
-        self.input_size_per_partition = divide(in_features, gpc.tensor_parallel_size)
-
-        # Parameters.
-        # Initialize weight.
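-        # Each rank keeps all num_classes rows but only in_features / tensor_parallel_size
-        # columns of the weight, so forward() finishes with an all-reduce over the 1D group.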
-        factory_kwargs = {'device': get_current_device(), 'dtype': dtype}
-        if weight is not None:
-            self.weight = weight
-            self.has_weight = False
-        else:
-            self.weight = Parameter(torch.empty(self.num_classes, self.input_size_per_partition, **factory_kwargs))
-            self.has_weight = True
-        if bias:
-            self.bias = Parameter(torch.empty(self.num_classes, **factory_kwargs))
-        else:
-            self.bias = None
-        with seed(ParallelMode.TENSOR):
-            self.reset_parameters(weight_initializer, bias_initializer)
-        self._set_tensor_parallel_attributes()
-        set_parallel_input(False)
-        env.vocab_parallel = False
-
-    def reset_parameters(self, weight_initializer, bias_initializer) -> None:
-        fan_in, fan_out = self.in_features, self.num_classes
-        if self.has_weight:
-            weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out)
-        if self.bias is not None:
-            bias_initializer(self.bias, fan_in=fan_in)
-            broadcast(self.bias, gpc.get_ranks_in_group(ParallelMode.PARALLEL_1D)[0], ParallelMode.PARALLEL_1D)
-
-    def _set_tensor_parallel_attributes(self):
-        if self.has_weight:
-            num_partition = gpc.get_world_size(ParallelMode.TENSOR)
-            set_tensor_parallel_attribute_by_partition(self.weight, num_partition)
-
-    def forward(self, input_: Tensor) -> Tensor:
-        # Set up backprop all-reduce.
-        if not self.parallel_input:
-            input_ = split_forward_gather_backward(input_, ParallelMode.PARALLEL_1D, dim=-1)
-
-        output_parallel = F.linear(input_, self.weight)
-        output = reduce_input(output_parallel, ParallelMode.PARALLEL_1D)
-        if self.bias is not None:
-            output = output + self.bias
-        return output
-
-
-@LAYERS.register_module
-class VocabParallelClassifier1D(ParallelLayer):
-    """Vocab-parallel classifier of 1D parallelism (a column-parallel linear layer with an optionally given weight)
-
-    :param in_features: size of input features
-    :type in_features: int
-    :param num_classes: number of classes in the dataset
-    :type num_classes: int
-    :param weight: weight of the classifier, defaults to None
-    :type weight: torch.nn.Parameter, optional
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True``
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-
-    def __init__(self,
-                 in_features: int,
-                 num_classes: int,
-                 weight: Parameter = None,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)),
-                 bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)):
-        super().__init__()
-        self.in_features = in_features
-        self.num_classes = num_classes
-        self.parallel_input = get_parallel_input()
-
-        # Divide the weight matrix along the vocab (class) dimension.
-        self.num_classes_per_partition = divide(num_classes, gpc.tensor_parallel_size)
-
-        # Parameters.
-        # Initialize weight.
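-        # Here the class (vocab) dimension is sharded instead: each rank holds
-        # num_classes / tensor_parallel_size rows, so no output all-reduce is needed.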
-        factory_kwargs = {'device': get_current_device(), 'dtype': dtype}
-        if weight is not None:
-            self.weight = weight
-            self.has_weight = False
-        else:
-            self.weight = Parameter(torch.empty(self.num_classes_per_partition, self.in_features, **factory_kwargs))
-            self.has_weight = True
-        if bias:
-            self.bias = Parameter(torch.empty(self.num_classes_per_partition, **factory_kwargs))
-        else:
-            self.bias = None
-        with seed(ParallelMode.TENSOR):
-            self.reset_parameters(weight_initializer, bias_initializer)
-        self._set_tensor_parallel_attributes()
-        set_parallel_input(False)
-        env.vocab_parallel = True
-
-    def reset_parameters(self, weight_initializer, bias_initializer) -> None:
-        fan_in, fan_out = self.in_features, self.num_classes
-        if self.has_weight:
-            weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out)
-        if self.bias is not None:
-            bias_initializer(self.bias, fan_in=fan_in)
-
-    def _set_tensor_parallel_attributes(self):
-        num_partition = gpc.get_world_size(ParallelMode.TENSOR)
-        if self.has_weight:
-            set_tensor_parallel_attribute_by_partition(self.weight, num_partition)
-        if self.bias is not None:
-            set_tensor_parallel_attribute_by_partition(self.bias, num_partition)
-
-    def forward(self, input_: Tensor) -> Tensor:
-        # Set up backprop all-reduce.
-        input_parallel = reduce_grad(input_, ParallelMode.PARALLEL_1D)
-        # Matrix multiply.
-        output = F.linear(input_parallel, self.weight, self.bias)
-        return output
-
-
-@LAYERS.register_module
-class Linear1D_Col(ParallelLayer):
-    """Linear layer with column parallelism.
-
-    The linear layer is defined as :math:`Y = XA + b`. A is parallelized along
-    its second dimension as :math:`A = [A_1, ..., A_p]`.
-
-    :param in_features: first dimension of matrix A.
-    :type in_features: int
-    :param out_features: second dimension of matrix A.
-    :type out_features: int
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True``
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param gather_output: If set to ``True``, call all-gather on output and make Y available
-                          to all GPUs; otherwise, every GPU will have its own output
-                          which is :math:`Y_i = XA_i`, defaults to False
-    :type gather_output: bool, optional
-    :param skip_bias_add: If set to ``True``, it will skip bias add for linear layer, which is preserved for kernel fusion, defaults to False
-    :type skip_bias_add: bool, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-
-    def __init__(self,
-                 in_features: int,
-                 out_features: int,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 gather_output: bool = False,
-                 skip_bias_add: bool = False,
-                 weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)),
-                 bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)):
-        super().__init__()
-
-        # Keep input parameters
-        self.in_features = in_features
-        self.out_features = out_features
-        self.gather_output = gather_output
-        self.skip_bias_add = skip_bias_add
-
-        if skip_bias_add and not bias:
-            raise ValueError('cannot skip bias addition if bias is None')
-
-        self.out_features_per_partition = divide(out_features, gpc.tensor_parallel_size)
-
-        # Parameters.
-        # Initialize weight.
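-        # Column parallelism: the weight shard is [out_features / tensor_parallel_size, in_features],
-        # so each rank produces its own slice of the output features.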
-        factory_kwargs = {'device': get_current_device(), 'dtype': dtype}
-        self.weight = Parameter(torch.empty(self.out_features_per_partition, self.in_features, **factory_kwargs))
-
-        if bias:
-            self.bias = Parameter(torch.empty(self.out_features_per_partition, **factory_kwargs))
-        else:
-            self.bias = None
-        with seed(ParallelMode.TENSOR):
-            self.reset_parameters(weight_initializer, bias_initializer)
-        self._set_tensor_parallel_attributes()
-        set_parallel_input(True)
-
-    def reset_parameters(self, weight_initializer, bias_initializer) -> None:
-        fan_in, fan_out = self.in_features, self.out_features
-        weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out)
-        if self.bias is not None:
-            bias_initializer(self.bias, fan_in=fan_in)
-
-    def _set_tensor_parallel_attributes(self):
-        num_partition = gpc.get_world_size(ParallelMode.TENSOR)
-        set_tensor_parallel_attribute_by_partition(self.weight, num_partition)
-        if self.bias is not None:
-            set_tensor_parallel_attribute_by_partition(self.bias, num_partition)
-
-    def forward(self, input_: Tensor) -> Tuple[Tensor, Tensor]:
-        # Set up backprop all-reduce.
-        input_parallel = reduce_grad(input_, ParallelMode.PARALLEL_1D)
-        # Matrix multiply.
-        bias = self.bias if not self.skip_bias_add else None
-        output_parallel = F.linear(input_parallel, self.weight, bias)
-        if self.gather_output:
-            # All-gather across the partitions.
-            output = gather_forward_split_backward(output_parallel, ParallelMode.PARALLEL_1D, dim=-1)
-        else:
-            output = output_parallel
-        if self.skip_bias_add:
-            return output, self.bias
-        else:
-            return output
-
-
-@LAYERS.register_module
-class Linear1D_Row(ParallelLayer):
-    """Linear layer with row parallelism
-
-    :param in_features: size of each input sample
-    :type in_features: int
-    :param out_features: size of each output sample
-    :type out_features: int
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True``
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param parallel_input: If set to ``True``, it is assumed that the input is already split across ranks, defaults to True
-    :type parallel_input: bool, optional
-    :param skip_bias_add: If set to ``True``, it will skip bias add for linear layer, which is preserved for kernel fusion, defaults to False
-    :type skip_bias_add: bool, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-
-    def __init__(self,
-                 in_features: int,
-                 out_features: int,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 parallel_input: bool = True,
-                 skip_bias_add: bool = False,
-                 weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)),
-                 bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)):
-        super().__init__()
-
-        # Keep input parameters
-        self.in_features = in_features
-        self.out_features = out_features
-        self.parallel_input = parallel_input
-        self.skip_bias_add = skip_bias_add
-
-        if skip_bias_add and not bias:
-            raise ValueError('cannot skip bias addition if bias is None')
-
-        # Divide the weight matrix along the last dimension.
-        self.input_size_per_partition = divide(in_features, gpc.tensor_parallel_size)
-
-        # Parameters.
-        # Initialize weight.
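-        # Row parallelism: the weight shard is [out_features, in_features / tensor_parallel_size];
-        # the partial products are summed with an all-reduce in forward().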
- factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - self.weight = Parameter(torch.empty(self.out_features, self.input_size_per_partition, **factory_kwargs)) - - if bias: - self.bias = Parameter(torch.empty(self.out_features, **factory_kwargs)) - else: - self.bias = None - with seed(ParallelMode.TENSOR): - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - set_parallel_input(False) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - fan_in, fan_out = self.in_features, self.out_features - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - broadcast(self.bias, gpc.get_ranks_in_group(ParallelMode.PARALLEL_1D)[0], ParallelMode.PARALLEL_1D) - - def _set_tensor_parallel_attributes(self): - num_partition = gpc.get_world_size(ParallelMode.TENSOR) - set_tensor_parallel_attribute_by_partition(self.weight, num_partition) - - def forward(self, input_: Tensor) -> Tensor: - # Set up backprop all-reduce. - if self.parallel_input: - input_ = input_ - else: - input_ = split_forward_gather_backward(input_, ParallelMode.PARALLEL_1D, dim=-1) - - output_parallel = F.linear(input_, self.weight) - output = reduce_input(output_parallel, ParallelMode.PARALLEL_1D) - - if not self.skip_bias_add: - if self.bias is not None: - output = output + self.bias - return output - else: - return output, self.bias - - -@LAYERS.register_module -class Embedding1D(ParallelLayer): - """ - Embedding for 1D parallelism - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - embed_dim_per_partition = divide(embedding_dim, gpc.tensor_parallel_size) - - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - self.weight = Parameter( - torch.empty((num_embeddings, embed_dim_per_partition), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - set_parallel_input(False) - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, gpc.tensor_parallel_size) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - - output_parallel = F.embedding(input_, self.weight, self.padding_idx, *self.embed_args, **self.embed_kwargs) - - 
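-        # each rank looked up embeddings with only its embed_dim / tensor_parallel_size
-        # slice of the hidden size, so gather along the last dimension to rebuild the full vector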
output = gather_forward_split_backward(output_parallel, ParallelMode.PARALLEL_1D, dim=-1) - - return output - - -@LAYERS.register_module -class VocabParallelEmbedding1D(torch.nn.Module): - """Embedding parallelized in the vocabulary dimension. - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - tensor_parallel_size = gpc.get_world_size(ParallelMode.PARALLEL_1D) - tensor_parallel_rank = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - self.num_embeddings_per_partition = divide(num_embeddings, tensor_parallel_size) - self.vocab_start_index = tensor_parallel_rank * self.num_embeddings_per_partition - self.vocab_end_index = self.vocab_start_index + self.num_embeddings_per_partition - - self.weight = Parameter( - torch.empty((self.num_embeddings_per_partition, self.embed_dim), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - set_parallel_input(False) - env.vocab_parallel = True - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, gpc.tensor_parallel_size) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - # Build the mask. - input_mask = (input_ < self.vocab_start_index) | (input_ >= self.vocab_end_index) - # Mask the input. - masked_input = input_.clone() - self.vocab_start_index - masked_input[input_mask] = 0 - - output_parallel = F.embedding(masked_input, self.weight, self.padding_idx, *self.embed_args, - **self.embed_kwargs) - - # Mask the output embedding. - output_parallel[input_mask, :] = 0. - # Reduce across all the model parallel GPUs. 
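-        # rows masked to zero above contribute nothing, so the all-reduce leaves exactly
-        # one rank's embedding for every token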
-        output = reduce_input(output_parallel, ParallelMode.PARALLEL_1D)
-        return output
-
-
-@LAYERS.register_module
-class Dropout1D(ParallelLayer):
-    """
-    Dropout layer of 1D parallelism
-
-    :param p: dropout rate, defaults to 0.5
-    :type p: float, optional
-    :param inplace: If set to ``True``, will do this operation in-place, defaults to ``False``
-    :type inplace: bool, optional
-    """
-
-    def __init__(self, p: float = 0.5, inplace: bool = False):
-        super().__init__()
-        self.parallel_input = get_parallel_input()
-        self.p = p
-        self.inplace = inplace
-
-    def forward(self, input_: Tensor) -> Tensor:
-        if self.parallel_input:
-            with seed(ParallelMode.TENSOR):
-                output = F.dropout(input_, self.p, self.training, self.inplace)
-        else:
-            output = F.dropout(input_, self.p, self.training, self.inplace)
-        return output
diff --git a/colossalai/nn/layer/parallel_2d/__init__.py b/colossalai/nn/layer/parallel_2d/__init__.py
deleted file mode 100644
index 9bb62b4565fc520b07f02eba24e0888eced40f64..0000000000000000000000000000000000000000
--- a/colossalai/nn/layer/parallel_2d/__init__.py
+++ /dev/null
@@ -1,8 +0,0 @@
-from ._operation import reduce_by_batch_2d, split_tensor_2d
-from .layers import (Classifier2D, Embedding2D, LayerNorm2D, Linear2D, PatchEmbedding2D, VocabParallelClassifier2D,
-                     VocabParallelEmbedding2D)
-
-__all__ = [
-    'split_tensor_2d', 'reduce_by_batch_2d', 'Linear2D', 'LayerNorm2D', 'Classifier2D', 'PatchEmbedding2D',
-    'Embedding2D', 'VocabParallelEmbedding2D', 'VocabParallelClassifier2D'
-]
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/__init__.cpython-36.pyc
deleted file mode 100644
index 6ef9ee2d2351d3b23839c94ba76493af993a9ea5..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/__init__.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/__init__.cpython-37.pyc
deleted file mode 100644
index f32fed2168c8526ddd3c434e6f9ee7f565970c06..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/__init__.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/_operation.cpython-36.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/_operation.cpython-36.pyc
deleted file mode 100644
index e697e034bf4f1a4d61dc650560d3bb1f507711fd..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/_operation.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/_operation.cpython-37.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/_operation.cpython-37.pyc
deleted file mode 100644
index bab98cbe602e15213bbf54519dec28b640fcbfd8..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/_operation.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/_utils.cpython-36.pyc
deleted file mode 100644
index 9b00c61b854592d33d2a4730da9d416a51ecc4e1..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/_utils.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/_utils.cpython-37.pyc
deleted file mode 100644
index 687dc50ea519ef41598c192ccce3ff22736800bf..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/_utils.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/layers.cpython-36.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/layers.cpython-36.pyc
deleted file mode 100644
index 871017dc3f5d034006f82ba6624bbd96ef60d16d..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/layers.cpython-36.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/__pycache__/layers.cpython-37.pyc b/colossalai/nn/layer/parallel_2d/__pycache__/layers.cpython-37.pyc
deleted file mode 100644
index 2b053f02b85e58c7b4c9349512abf23f774156f7..0000000000000000000000000000000000000000
Binary files a/colossalai/nn/layer/parallel_2d/__pycache__/layers.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/nn/layer/parallel_2d/_operation.py b/colossalai/nn/layer/parallel_2d/_operation.py
deleted file mode 100644
index f5c16671a8ea5c68ea117633873bcb627d5b9714..0000000000000000000000000000000000000000
--- a/colossalai/nn/layer/parallel_2d/_operation.py
+++ /dev/null
@@ -1,859 +0,0 @@
-from typing import Any, Optional, Tuple
-
-import torch
-import torch.distributed as dist
-from colossalai.communication.collective import (all_gather, all_reduce, reduce, reduce_scatter)
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device
-from torch import Tensor
-from torch.cuda.amp import custom_bwd, custom_fwd
-from colossalai.global_variables import tensor_parallel_env as env
-
-
-def matmul_2d(
-    a,
-    b,
-    summa_dim,
-    out_shape,
-    row_rank=None,
-    col_rank=None,
-    row_parallel_mode=ParallelMode.PARALLEL_2D_ROW,
-    col_parallel_mode=ParallelMode.PARALLEL_2D_COL,
-):
-    """
-    Matrix multiplication for 2D parallelism
-
-    :param a: matrix :math:`A`
-    :type a: torch.tensor
-    :param b: matrix :math:`B`
-    :type b: torch.tensor
-    :param summa_dim: dimension of SUMMA for 2D parallelism
-    :type summa_dim: int
-    :param out_shape: shape of output tensor
-    :type out_shape: tuple
-    :param row_rank: the rank of row, defaults to None
-    :type row_rank: int, optional
-    :param col_rank: the rank of column, defaults to None
-    :type col_rank: int, optional
-    :param row_parallel_mode: row parallel mode, defaults to ParallelMode.PARALLEL_2D_ROW
-    :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode, optional
-    :param col_parallel_mode: column parallel mode, defaults to ParallelMode.PARALLEL_2D_COL
-    :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode, optional
-    :return: :math:`C = AB`
-    :rtype: torch.tensor
-    """
-    if row_rank is None:
-        row_rank = gpc.get_local_rank(col_parallel_mode)
-    if col_rank is None:
-        col_rank = gpc.get_local_rank(row_parallel_mode)
-
-    data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA)
-    pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank(
-        ParallelMode.PIPELINE)
-    pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size(
-        ParallelMode.PIPELINE)
-    tensor_parallel_size = summa_dim**2
-    return Matmul_AB_2D.apply(a, b, summa_dim, out_shape, row_rank, col_rank, row_parallel_mode, col_parallel_mode,
-                              data_parallel_rank, pipeline_parallel_rank, pipeline_parallel_size, tensor_parallel_size)
-
-
-class _Classifier2D(torch.autograd.Function):
-    @staticmethod
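-    # Computes C = A @ B^T for the classifier head: B is all-gathered along the column
-    # group, the local product is all-reduced over the row group, then bias is added.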
@custom_fwd(cast_inputs=torch.float16) - def forward( - ctx: Any, - A: Tensor, - B: Tensor, - bias: Optional[Tensor], - summa_dim: int, - out_shape: Tuple[int, ...], - row_rank: int, - col_rank: int, - row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode, - data_parallel_rank: int, - pipeline_parallel_rank: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - ) -> Tensor: - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - B_temp = all_gather(B, -1, col_parallel_mode) - if ctx: - ctx.save_for_backward(A, B_temp) - - C = torch.matmul(A, B_temp.transpose(0, 1)) - - C = all_reduce(C, row_parallel_mode) - - ctx.use_bias = bias is not None - if bias is not None: - C = C + bias - - out = C.reshape(out_shape) - - if ctx: - ctx.summa_dim = summa_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - - with torch.no_grad(): - A_grad = torch.matmul(output_grad, B) - A_grad = A_grad.reshape(ctx.A_shape) - B_grad = torch.matmul(output_grad.reshape(-1, output_grad.shape[-1]).transpose(0, 1), A) - B_grad = reduce_scatter(B_grad, -1, ctx.col_parallel_mode) - B_grad = B_grad.reshape(ctx.B_shape) - if ctx.use_bias: - bias_grad = torch.sum(output_grad, dim=tuple(range(output_grad.ndim - 1))) - bias_grad = all_reduce(bias_grad, ctx.col_parallel_mode) - else: - bias_grad = None - - return A_grad, B_grad, bias_grad, None, None, None, None, None, None, None, None, None, None - - -def classifier_2d(A: Tensor, B: Tensor, bias: Optional[Tensor], summa_dim: int, out_shape: Tuple[int, ...], - row_rank: int, col_rank: int, row_parallel_mode: ParallelMode, col_parallel_mode: ParallelMode, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int) -> Tensor: - """ - 2D parallel classifier - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param bias: matrix of bias - :type bias: torch.tensor, optional - :param summa_dim: dimension of SUMMA fo 2D parallelism - :type summa_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - return _Classifier2D.apply(A, B, bias, summa_dim, out_shape, row_rank, col_rank, row_parallel_mode, - col_parallel_mode, data_parallel_rank, pipeline_parallel_rank, pipeline_parallel_size, - 
tensor_parallel_size) - - -class Matmul_AB_2D(torch.autograd.Function): - """ - Matrix multiplication for :math:`C = AB` - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param summa_dim: dimension of SUMMA fo 2D parallelism - :type summa_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward( - ctx: Any, - A: Tensor, - B: Tensor, - summa_dim: int, - out_shape: Tuple[int, ...], - row_rank: int, - col_rank: int, - row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode, - data_parallel_rank: int, - pipeline_parallel_rank: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - ) -> Tensor: - # A: [b / q, s, h / q] -> [(b * s) / q, h / q] - # B: [h / q, s / q] - # C: [b / q, s, s / q] -> [(b * s) / q, s / q] - - assert A.shape[-1] == B.shape[-2], \ - 'Invalid shapes: A={}, B={} for AB.'.format(A.shape, B.shape) - - if ctx: - ctx.save_for_backward(A, B) - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - C_shape = (A.shape[0], B.shape[-1]) - C = torch.zeros(C_shape, dtype=A.dtype, device=get_current_device()) - - # use circular buffer to store the communication tensor - # 2 is enough for all cases - A_list = [torch.empty_like(A) for _ in range(2)] - B_list = [torch.empty_like(B) for _ in range(2)] - - row_group = gpc.get_group(row_parallel_mode) - col_group = gpc.get_group(col_parallel_mode) - - src_a = summa_dim * row_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - src_b = col_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - - opa = [None] * 2 - opb = [None] * 2 - - A_list[0].copy_(A) - B_list[0].copy_(B) - opa[0] = dist.broadcast(A_list[0], src=src_a, group=row_group, async_op=True) - opb[0] = dist.broadcast(B_list[0], src=src_b, group=col_group, async_op=True) - cur = 0 - - for i in range(summa_dim): - if i != summa_dim - 1: - A_list[1 - cur].copy_(A) - opa[1 - cur] = dist.broadcast(A_list[1 - cur], src=src_a + 1, group=row_group, async_op=True) - B_list[1 - cur].copy_(B) - opb[1 - cur] = dist.broadcast(B_list[1 - cur], src=src_b + summa_dim, group=col_group, async_op=True) - - if opa[cur] is not None: - opa[cur].wait() - if opb[cur] is not None: - opb[cur].wait() - - torch.addmm(C, A_list[cur], B_list[cur], out=C) - cur = 1 - cur - src_a += 1 - src_b += summa_dim - - out = C.reshape(out_shape) - - if ctx: - ctx.summa_dim = summa_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - 
ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - with torch.no_grad(): - A_grad = Matmul_ABT_2D.apply(output_grad, B, ctx.summa_dim, ctx.A_shape, ctx.row_rank, ctx.col_rank, - ctx.row_parallel_mode, ctx.col_parallel_mode, ctx.data_parallel_rank, - ctx.pipeline_parallel_rank, ctx.pipeline_parallel_size, - ctx.tensor_parallel_size) - B_grad = Matmul_ATB_2D.apply(A, output_grad, ctx.summa_dim, ctx.B_shape, ctx.row_rank, ctx.col_rank, - ctx.row_parallel_mode, ctx.col_parallel_mode, ctx.data_parallel_rank, - ctx.pipeline_parallel_rank, ctx.pipeline_parallel_size, - ctx.tensor_parallel_size) - return A_grad, B_grad, None, None, None, None, None, None, None, None, None, None - - -class Matmul_ABT_2D(torch.autograd.Function): - """ - Matrix multiplication for :math:`C = AB^T` - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param summa_dim: dimension of SUMMA fo 2D parallelism - :type summa_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward( - ctx: Any, - A: Tensor, - B: Tensor, - summa_dim: int, - out_shape: Tuple[int, ...], - row_rank: int, - col_rank: int, - row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode, - data_parallel_rank: int, - pipeline_parallel_rank: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - ) -> Tensor: - - assert A.shape[-1] == B.shape[-1], \ - 'Invalid shapes: A={}, B={} for ABT.'.format(A.shape, B.shape) - - if ctx: - ctx.save_for_backward(A, B) - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - C_shape = (A.shape[0], B.shape[0]) - C = torch.empty(C_shape, dtype=A.dtype, device=get_current_device()) - - # use circular buffer to store the communication tensor - # 2 is enough for all cases - B_list = [torch.empty_like(B) for _ in range(2)] - C_list = [torch.empty_like(C) for _ in range(2)] - - row_group = gpc.get_group(row_parallel_mode) - col_group = gpc.get_group(col_parallel_mode) - - src_b = col_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - src_c = summa_dim * row_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - - opb = [None] * 2 - opr = [None] * 2 - - B_list[0].copy_(B) - opb[0] = dist.broadcast(B_list[0], src=src_b, group=col_group, 
async_op=True) - cur = 0 - - for i in range(summa_dim): - if i != summa_dim - 1: - B_list[1 - cur].copy_(B) - opb[1 - cur] = dist.broadcast(B_list[1 - cur], src=src_b + summa_dim, group=col_group, async_op=True) - - if opr[cur] is not None: - opr[cur].wait() - if i - 2 == col_rank: - C.copy_(C_list[cur]) - - if opb[cur] is not None: - opb[cur].wait() - - torch.matmul(A, B_list[cur].transpose(0, 1), out=C_list[cur]) - opr[cur] = dist.reduce(C_list[cur], dst=src_c, group=row_group, async_op=True) - cur = 1 - cur - src_b += summa_dim - src_c += 1 - - for op in opr: - op.wait() - - if summa_dim - 2 == col_rank: - C.copy_(C_list[cur]) - if summa_dim - 1 == col_rank: - C.copy_(C_list[1 - cur]) - out = C.reshape(out_shape) - - if ctx: - ctx.summa_dim = summa_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - - with torch.no_grad(): - A_grad = Matmul_AB_2D.apply(output_grad, B, ctx.summa_dim, ctx.A_shape, ctx.row_rank, ctx.col_rank, - ctx.row_parallel_mode, ctx.col_parallel_mode, ctx.data_parallel_rank, - ctx.pipeline_parallel_rank, ctx.pipeline_parallel_size, - ctx.tensor_parallel_size) - B_grad = Matmul_ATB_2D.apply(output_grad, A, ctx.summa_dim, ctx.B_shape, ctx.row_rank, ctx.col_rank, - ctx.row_parallel_mode, ctx.col_parallel_mode, ctx.data_parallel_rank, - ctx.pipeline_parallel_rank, ctx.pipeline_parallel_size, - ctx.tensor_parallel_size) - return A_grad, B_grad, None, None, None, None, None, None, None, None, None, None - - -class Matmul_ATB_2D(torch.autograd.Function): - """ - Matrix multiplication for :math:`C = A^TB` - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param summa_dim: dimension of SUMMA fo 2D parallelism - :type summa_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward( - ctx: Any, - A: Tensor, - B: Tensor, - summa_dim: int, - out_shape: Tuple[int, ...], - row_rank: int, - col_rank: int, - row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode, - data_parallel_rank: int, - pipeline_parallel_rank: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - ) -> Tensor: - - assert A.shape[-2] == B.shape[-2], \ - 'Invalid shapes: A={}, B={} for ATB.'.format(A.shape, B.shape) - - if ctx: - ctx.save_for_backward(A, B) - - 
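-        # SUMMA-style pass for C = A^T B: panels of A are broadcast across the row group,
-        # each rank computes a local A_i^T @ B, and partial results are reduced to the
-        # owning rank in the column group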
A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - C_shape = (A.shape[-1], B.shape[-1]) - C = torch.empty(C_shape, dtype=A.dtype, device=get_current_device()) - - # use circular buffer to store the communication tensor - # 2 is enough for all cases - A_list = [torch.empty_like(A) for _ in range(2)] - C_list = [torch.empty_like(C) for _ in range(2)] - - row_group = gpc.get_group(row_parallel_mode) - col_group = gpc.get_group(col_parallel_mode) - - src_a = summa_dim * row_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - src_c = col_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - - opa = [None] * 2 - opr = [None] * 2 - - A_list[0].copy_(A) - opa[0] = dist.broadcast(A_list[0], src=src_a, group=row_group, async_op=True) - cur = 0 - - for i in range(summa_dim): - if i != summa_dim - 1: - A_list[1 - cur].copy_(A) - opa[1 - cur] = dist.broadcast(A_list[1 - cur], src=src_a + 1, group=row_group, async_op=True) - - if opr[cur] is not None: - opr[cur].wait() - if i - 2 == row_rank: - C.copy_(C_list[cur]) - - if opa[cur] is not None: - opa[cur].wait() - - torch.matmul(A_list[cur].transpose(0, 1), B, out=C_list[cur]) - opr[cur] = dist.reduce(C_list[cur], dst=src_c, group=col_group, async_op=True) - cur = 1 - cur - src_a += 1 - src_c += summa_dim - - for op in opr: - op.wait() - - if summa_dim - 2 == row_rank: - C.copy_(C_list[cur]) - if summa_dim - 1 == row_rank: - C.copy_(C_list[1 - cur]) - out = C.reshape(out_shape) - - if ctx: - ctx.summa_dim = summa_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - - with torch.no_grad(): - A_grad = Matmul_ABT_2D.apply(B, output_grad, ctx.summa_dim, ctx.A_shape, ctx.row_rank, ctx.col_rank, - ctx.row_parallel_mode, ctx.col_parallel_mode, ctx.data_parallel_rank, - ctx.pipeline_parallel_rank, ctx.pipeline_parallel_size, - ctx.tensor_parallel_size) - B_grad = Matmul_AB_2D.apply(A, output_grad, ctx.summa_dim, ctx.B_shape, ctx.row_rank, ctx.col_rank, - ctx.row_parallel_mode, ctx.col_parallel_mode, ctx.data_parallel_rank, - ctx.pipeline_parallel_rank, ctx.pipeline_parallel_size, - ctx.tensor_parallel_size) - return A_grad, B_grad, None, None, None, None, None, None, None, None, None, None - - -class _Add_Bias_2D(torch.autograd.Function): - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward( - ctx: Any, - input_: Tensor, - bias: Tensor, - output_size_per_partition: int, - row_rank: int, - col_rank: int, - row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode, - skip_bias_add: bool, - data_parallel_rank: int, - pipeline_parallel_rank: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - ) -> Tensor: - bias_temp = all_gather(bias, -1, col_parallel_mode) - - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.bias = skip_bias_add - 
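-        # note: ctx.bias stores the skip_bias_add flag (not the bias tensor); backward
-        # branches on it to decide which gradients to emit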
ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - if skip_bias_add: - return bias_temp - else: - output = input_ + bias_temp - return output - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - col_parallel_mode = ctx.col_parallel_mode - - if ctx.bias: - grad = reduce_scatter(output_grad, -1, col_parallel_mode) - return None, grad, None, None, None, None, None, None, None, None, None, None - else: - reduce_dim = tuple(range(output_grad.ndim - 1)) - reduce = torch.sum(output_grad, dim=reduce_dim) - grad = reduce_scatter(reduce, -1, col_parallel_mode) - return output_grad, grad, None, None, None, None, None, None, None, None, None, None - - -def add_bias_2d(input_: Tensor, bias: Tensor, output_size_per_partition: int, row_rank: int, col_rank: int, - row_parallel_mode: ParallelMode, col_parallel_mode: ParallelMode, skip_bias_add: bool, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int) -> Tensor: - """ - Matrix add bias: :math:`C = A + b` - - :param input_: matrix :math:`A` - :type input_: torch.tensor - :param bias: matrix :math:`b` - :type bias: torch.tensor - :param output_size_per_partition: size of ouput per partition - :type output_size_per_partition: int - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param skip_bias_add: If set to ``True``, it will skip bias add for linear layer, which is preserved for kernel fusion - :type skip_bias_add: bool - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - return _Add_Bias_2D.apply(input_, bias, output_size_per_partition, row_rank, col_rank, row_parallel_mode, - col_parallel_mode, skip_bias_add, data_parallel_rank, pipeline_parallel_rank, - pipeline_parallel_size, tensor_parallel_size) - - -class _Layernorm_2D(torch.autograd.Function): - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx: Any, input_: Tensor, E_x: Tensor, Var_x: Tensor, hidden_size: int, row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode) -> Tensor: - input_ = input_ - E_x - # in here, input = x - E[x], Var_x = 1 / sqrt(Var[x] + eps) - ctx.normalized_shape = hidden_size - output = input_ * Var_x - ctx.save_for_backward(output, Var_x) - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - return output - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - row_parallel_mode = ctx.row_parallel_mode - col_parallel_mode = ctx.col_parallel_mode - x, Var_x = ctx.saved_tensors - # in here, Var_x = 1 / sqrt(Var[x] + eps), x = (x - E[x]) * Var_x - output_grad_sum = torch.sum(output_grad, dim=-1, keepdim=True) - torch.distributed.all_reduce(output_grad_sum, group=gpc.get_group(row_parallel_mode)) - 
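-        # dividing the all-reduced sum by the full hidden size yields the mean of the
-        # output gradient over the normalized dimension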
-        output_grad_sum /= ctx.normalized_shape
-
-        output_grad_mul_x_sum = torch.sum(output_grad * x, dim=-1, keepdim=True)
-        torch.distributed.all_reduce(output_grad_mul_x_sum, group=gpc.get_group(row_parallel_mode))
-        output_grad_mul_x_sum /= ctx.normalized_shape
-
-        input_grad = output_grad.clone()
-        input_grad -= x * output_grad_mul_x_sum
-        input_grad -= output_grad_sum
-        input_grad *= Var_x
-
-        return input_grad, None, None, None, None, None
-
-
-def layernorm_2d(input_: Tensor, E_x: Tensor, Var_x: Tensor, hidden_size: int, row_parallel_mode: ParallelMode,
-                 col_parallel_mode: ParallelMode) -> Tensor:
-    """
-    Layernorm
-
-    :param input_: input matrix
-    :type input_: torch.tensor
-    :param E_x: mean
-    :type E_x: torch.tensor
-    :param Var_x: variance
-    :type Var_x: torch.tensor
-    :param hidden_size: hidden size
-    :type hidden_size: int
-    :param row_parallel_mode: row parallel mode
-    :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param col_parallel_mode: column parallel mode
-    :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    """
-    return _Layernorm_2D.apply(input_, E_x, Var_x, hidden_size, row_parallel_mode, col_parallel_mode)
-
-
-class _AllGatherTensor2D(torch.autograd.Function):
-    @staticmethod
-    @custom_fwd(cast_inputs=torch.float16)
-    def forward(ctx: Any, inputs: Tensor, dim: int, parallel_mode: ParallelMode) -> Tensor:
-        ctx.dim = dim
-        ctx.parallel_mode = parallel_mode
-
-        outputs = all_gather(inputs, dim, parallel_mode)
-        return outputs
-
-    @staticmethod
-    @custom_bwd
-    def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]:
-        grad = reduce_scatter(output_grad, ctx.dim, ctx.parallel_mode)
-        return grad.contiguous(), None, None
-
-
-def all_gather_tensor_2d(tensor: Tensor, dim: int, parallel_mode: ParallelMode) -> Tensor:
-    """
-    All gather the tensor of 2D parallelism
-
-    :param tensor: input matrix
-    :type tensor: torch.tensor
-    :param dim: dimension to gather
-    :type dim: int
-    :param parallel_mode: parallel mode
-    :type parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    """
-    return _AllGatherTensor2D.apply(tensor, dim, parallel_mode)
-
-
-def split_tensor_2d(input_: Tensor, dim: int = 0) -> Tensor:
-    """Splits a 2D-parallel tensor in the specified dimension across the column group
-
-    :param input_: input tensor
-    :type input_: torch.Tensor
-    :param dim: dimension in which to split, defaults to 0
-    :type dim: int, optional
-    :return: split tensor
-    :rtype: torch.Tensor
-    """
-    if input_.size(dim) <= 1:
-        return input_
-    return torch.chunk(input_, gpc.get_world_size(ParallelMode.PARALLEL_2D_COL),
-                       dim=dim)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)].contiguous()
-
-
-class _ReduceTensor2D(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, input_, parallel_mode):
-        return all_reduce(input_, parallel_mode)
-
-    @staticmethod
-    def backward(ctx, output_grad):
-        return output_grad, None
-
-
-def reduce_tensor_2d(input_: Tensor, parallel_mode: ParallelMode) -> Tensor:
-    """
-    All-reduce the input.
- - :param input_: input tensor - :param parallel_mode: parallel mode - """ - return _ReduceTensor2D.apply(input_, parallel_mode) - - -class _ReduceScatterTensor2D(torch.autograd.Function): - @staticmethod - def forward(ctx, input_, dim, parallel_mode): - ctx.dim = dim - ctx.parallel_mode = parallel_mode - return reduce_scatter(input_, dim, parallel_mode) - - @staticmethod - def backward(ctx, output_grad): - return all_gather(output_grad, ctx.dim, ctx.parallel_mode), None, None - - -def reduce_scatter_tensor_2d(tensor: Tensor, dim: int, parallel_mode: ParallelMode) -> Tensor: - """ - Reduce-scatter the input. - - :param tensor: Input tensor - :param dim: Dimension to scatter - :param parallel_mode: Parallel mode - """ - return _ReduceScatterTensor2D.apply(tensor, dim, parallel_mode) - - -class _ReduceByBatch2D(torch.autograd.Function): - @staticmethod - def symbolic(graph, input_, reduce_mean: bool = False): - output = all_reduce(input_, ParallelMode.PARALLEL_2D_COL) - if reduce_mean: - reduce_size = gpc.get_world_size(ParallelMode.PARALLEL_2D_COL) - return output / reduce_size - return output - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, input_, reduce_mean: bool = False): - output = all_reduce(input_, ParallelMode.PARALLEL_2D_COL) - ctx.reduce_mean = reduce_mean - if reduce_mean: - reduce_size = gpc.get_world_size(ParallelMode.PARALLEL_2D_COL) - ctx.reduce_size = reduce_size - return output.clone() / reduce_size - return output.clone() - - @staticmethod - @custom_bwd - def backward(ctx, output_grad): - if ctx.reduce_mean: - return output_grad / ctx.reduce_size, None - else: - return output_grad, None - - -def reduce_by_batch_2d(input_, reduce_mean: bool = False) -> Tensor: - """All-reduce the input from the model parallel region. 
-
-    :param input_: input matrix
-    :type input_: torch.tensor
-    :param reduce_mean: If set to ``True``, it will divide the output by the column parallel size, defaults to False
-    :type reduce_mean: bool, optional
-    """
-    return _ReduceByBatch2D.apply(input_, reduce_mean)
\ No newline at end of file
diff --git a/colossalai/nn/layer/parallel_2d/_utils.py b/colossalai/nn/layer/parallel_2d/_utils.py
deleted file mode 100644
index 012fec41c80231165ceb92e57e2f449e61fdb8b2..0000000000000000000000000000000000000000
--- a/colossalai/nn/layer/parallel_2d/_utils.py
+++ /dev/null
@@ -1,20 +0,0 @@
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.global_variables import tensor_parallel_env as env
-
-
-def get_summa_dim_from_env() -> int:
-    try:
-        summa_dim = env.summa_dim
-        assert summa_dim > 0, 'SUMMA_DIM must be larger than zero'
-        return summa_dim
-
-    except AttributeError:
-        raise EnvironmentError('SUMMA_DIM is not found in the current environment, '
-                               'please make sure that you have used the correct process group initializer')
-
-
-def assert_summa_initialization():
-    assert gpc.is_initialized(ParallelMode.PARALLEL_2D_COL) and \
-        gpc.is_initialized(ParallelMode.PARALLEL_2D_ROW), \
-        'Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer'
diff --git a/colossalai/nn/layer/parallel_2d/layers.py b/colossalai/nn/layer/parallel_2d/layers.py
deleted file mode 100644
index b6adbcecd7e29d6033fbce900a05dab3c13c579c..0000000000000000000000000000000000000000
--- a/colossalai/nn/layer/parallel_2d/layers.py
+++ /dev/null
@@ -1,607 +0,0 @@
-import math
-from typing import Callable
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from colossalai.communication import broadcast
-from colossalai.context import ParallelMode, seed
-from colossalai.core import global_context as gpc
-from colossalai.global_variables import tensor_parallel_env as env
-from colossalai.nn import init as init
-from colossalai.registry import LAYERS
-from colossalai.utils.cuda import get_current_device
-from torch import Tensor
-from torch.nn import Parameter
-
-from ..base_layer import ParallelLayer
-from ..utils import divide, set_tensor_parallel_attribute_by_partition, to_2tuple
-from ._operation import *
-from ._utils import assert_summa_initialization, get_summa_dim_from_env
-
-
-@LAYERS.register_module
-class Linear2D(ParallelLayer):
-    """
-    Linear layer for 2D parallelism
-
-    :param in_features: size of each input sample
-    :type in_features: int
-    :param out_features: size of each output sample
-    :type out_features: int
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param skip_bias_add: If set to ``True``, it will skip bias add for linear layer, which is preserved for kernel fusion, defaults to False
-    :type skip_bias_add: bool, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-    def __init__(self,
-                 in_features: int,
-                 out_features: int,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 skip_bias_add: bool = False,
-                 weight_initializer: Callable =
init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - - self.in_features = in_features - self.out_features = out_features - self.skip_bias_add = skip_bias_add - - # parallel settings - assert_summa_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - self.summa_dim = get_summa_dim_from_env() - - # partitioning dimension - self.input_size_per_partition = divide(self.in_features, self.summa_dim) - self.hidden_size_per_partition = divide(self.out_features, self.summa_dim) - - # create weight, shape: [k/q, h/q] - factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - self.weight = Parameter( - torch.empty(self.input_size_per_partition, self.hidden_size_per_partition, **factory_kwargs)) - - # create bias, shape: [h/q] - if bias: - self.bias = Parameter(torch.empty(divide(self.out_features, self.summa_dim**2), **factory_kwargs)) - else: - self.register_parameter('bias', None) - - # initialize parameters - with seed(ParallelMode.TENSOR): - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.summa_dim**2) - if self.bias is not None: - set_tensor_parallel_attribute_by_partition(self.bias, self.summa_dim**2) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - fan_in, fan_out = self.in_features, self.out_features - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - - def forward(self, x: Tensor) -> Tensor: - # input: [m/q, n/q, k/q] - # output: [m/q, n/q, h/q] - out_shape = x.shape[:-1] + (self.hidden_size_per_partition, ) - - output = Matmul_AB_2D.apply(x, self.weight, self.summa_dim, out_shape, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, self.data_parallel_rank, - self.pipeline_parallel_rank, self.pipeline_parallel_size, self.tensor_parallel_size) - - if self.bias is not None: - if self.skip_bias_add: - bias = add_bias_2d(None, self.bias, self.hidden_size_per_partition, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, True, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - return output, bias - else: - output = add_bias_2d(output, self.bias, self.hidden_size_per_partition, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, False, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - return output - else: - return output - - -@LAYERS.register_module -class LayerNorm2D(ParallelLayer): - r""" - Layer Normalization for 2D parallelism - - :param normalized_shape: input shape from an expected input - of size. :math:`[* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]` - If a single integer is used, it is treated as a singleton list, and this module will - normalize over the last dimension which is expected to be of that specific size. 
- :type normalized_shape: int - :param eps: a value added to the denominator for numerical stability, defaults to 1e-05 - :type eps: float, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - """ - def __init__(self, normalized_shape: int, eps: float = 1e-05, dtype=None): - super().__init__() - - # layer norm config - self.normalized_shape = normalized_shape - self.variance_epsilon = eps - - # parallel setting - assert_summa_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - self.summa_dim = get_summa_dim_from_env() - - # partitioning dimension - self.partitioned_partition = divide(normalized_shape, self.summa_dim**2) - - # create parameters - factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - - self.gamma = Parameter(torch.ones(self.partitioned_partition, **factory_kwargs)) - self.beta = Parameter(torch.zeros(self.partitioned_partition, **factory_kwargs)) - - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.gamma, self.summa_dim**2) - set_tensor_parallel_attribute_by_partition(self.beta, self.summa_dim**2) - - def forward(self, x: Tensor) -> Tensor: - with torch.no_grad(): - E_x = torch.sum(x, dim=-1, keepdim=True) # [b/q, s, 1] - torch.distributed.all_reduce(E_x, group=gpc.get_group(ParallelMode.PARALLEL_2D_ROW)) - E_x /= self.normalized_shape - - # Var_x in the block below is the sum of input^2 - Var_x = torch.sum(x * x, dim=-1, keepdim=True) # [b/q, s, 1] - torch.distributed.all_reduce(Var_x, group=gpc.get_group(ParallelMode.PARALLEL_2D_ROW)) - Var_x /= self.normalized_shape - - Var_x = Var_x - E_x * E_x # variance of x [b/q, s, 1] - # this time 1/sqrt(Var_x + epsilon) - Var_x = 1.0 / torch.sqrt(Var_x + self.variance_epsilon) - - output = layernorm_2d(x, E_x, Var_x, self.normalized_shape, ParallelMode.PARALLEL_2D_ROW, - ParallelMode.PARALLEL_2D_COL) - bias = add_bias_2d(None, self.beta, self.partitioned_partition, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, True, self.data_parallel_rank, - self.pipeline_parallel_rank, self.pipeline_parallel_size, self.tensor_parallel_size) - scale = add_bias_2d(None, self.gamma, self.partitioned_partition, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, True, self.data_parallel_rank, - self.pipeline_parallel_rank, self.pipeline_parallel_size, self.tensor_parallel_size) - output = torch.addcmul(bias, scale, output) - return output - - -@LAYERS.register_module -class PatchEmbedding2D(ParallelLayer): - """ - 2D Image to Patch Embedding - - :param img_size: image size - :type img_size: int - :param patch_size: patch size - :type patch_size: int - :param in_chans: number of channels of input image - :type in_chans: int - :param embed_size: size of embedding - :type embed_size: int - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param flatten: whether to flatten output tensor, defaults to True - :type flatten: bool, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - :param position_embed_initializer: The intializer of 
position embedding, defaults to zero - :type position_embed_initializer: typing.Callable, optional - """ - def __init__(self, - img_size: int, - patch_size: int, - in_chans: int, - embed_size: int, - flatten: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - position_embed_initializer: Callable = init.zeros_()): - super().__init__() - img_size = to_2tuple(img_size) - patch_size = to_2tuple(patch_size) - - assert_summa_initialization() - self.summa_dim = get_summa_dim_from_env() - self.img_size = img_size - self.patch_size = patch_size - self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1]) - self.num_patches = self.grid_size[0] * self.grid_size[1] - self.flatten = flatten - self.embed_size = embed_size - self.embed_size_per_partition = embed_size // (self.summa_dim**2) - - with seed(ParallelMode.TENSOR): - self.weight = Parameter( - torch.empty((self.embed_size_per_partition, in_chans, *self.patch_size), - device=get_current_device(), - dtype=dtype)) - self.bias = Parameter(torch.empty(self.embed_size_per_partition, device=get_current_device(), dtype=dtype)) - - self.cls_token = Parameter( - torch.zeros((1, 1, self.embed_size_per_partition), device=get_current_device(), dtype=dtype)) - self.pos_embed = Parameter( - torch.zeros((1, self.num_patches + 1, self.embed_size_per_partition), - device=get_current_device(), - dtype=dtype)) - - self.reset_parameters(weight_initializer, bias_initializer, position_embed_initializer) - self._set_tensor_parallel_attribute() - - def _set_tensor_parallel_attribute(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.summa_dim**2) - set_tensor_parallel_attribute_by_partition(self.bias, self.summa_dim**2) - set_tensor_parallel_attribute_by_partition(self.cls_token, self.summa_dim**2) - set_tensor_parallel_attribute_by_partition(self.pos_embed, self.summa_dim**2) - - def reset_parameters(self, weight_initializer, bias_initializer, position_embed_initializer): - with seed(ParallelMode.TENSOR): - fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight) - fan_out = self.embed_size - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - bias_initializer(self.bias, fan_in=fan_in) - position_embed_initializer(self.pos_embed) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_2d(input_) - - B, C, H, W = input_.shape - assert H == self.img_size[0] and W == self.img_size[1], \ - f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." 
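- # the weight and bias shards are all-gathered along the column parallel group
- # below, so the convolution projects each patch onto a contiguous embed_size/q
- # slice of the embedding dimension (each rank itself only stores a shard of
- # size embed_size/q**2)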
- - weight = all_gather_tensor_2d(self.weight, 0, ParallelMode.PARALLEL_2D_COL) - bias = all_gather_tensor_2d(self.bias, 0, ParallelMode.PARALLEL_2D_COL) - - output = F.conv2d(input_, weight, bias, stride=self.patch_size) - if self.flatten: - output = output.flatten(2).transpose(1, 2) # BCHW -> BNC - - cls_token = all_gather_tensor_2d(self.cls_token, -1, ParallelMode.PARALLEL_2D_COL) - pos_embed = all_gather_tensor_2d(self.pos_embed, -1, ParallelMode.PARALLEL_2D_COL) - cls_token = cls_token.expand(output.shape[0], -1, -1) - output = torch.cat((cls_token, output), dim=1) - output = output + pos_embed - - return output - - -@LAYERS.register_module -class Embedding2D(ParallelLayer): - """ - Embedding for 2D parallelism - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - - assert_summa_initialization() - self.summa_dim = get_summa_dim_from_env() - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - embed_dim_per_partition = divide(embedding_dim, self.summa_dim**2) - - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - self.weight = Parameter( - torch.empty((num_embeddings, embed_dim_per_partition), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.summa_dim**2) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_2d(input_) - - weight = all_gather_tensor_2d(self.weight, -1, ParallelMode.PARALLEL_2D_COL) - output = F.embedding(input_, weight, self.padding_idx, *self.embed_args, **self.embed_kwargs) - - return output - - -@LAYERS.register_module -class VocabParallelEmbedding2D(torch.nn.Module): - """Embedding parallelized in the vocabulary dimension. 
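- The vocabulary is partitioned along the column parallel group, so each rank stores a ``[num_embeddings / q, embedding_dim / q]`` shard of the table and performs the lookup only for token ids that fall inside its own vocabulary range.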
- - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - assert_summa_initialization() - self.summa_dim = get_summa_dim_from_env() - self.num_embeddings_per_partition = divide(self.num_embeddings, self.summa_dim) - self.embed_dim_per_partition = divide(self.embed_dim, self.summa_dim) - tensor_parallel_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - self.vocab_start_index = tensor_parallel_rank * self.num_embeddings_per_partition - self.vocab_end_index = self.vocab_start_index + self.num_embeddings_per_partition - - self.weight = Parameter( - torch.empty((self.num_embeddings_per_partition, self.embed_dim_per_partition), - device=get_current_device(), - dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - env.vocab_parallel = True - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.summa_dim**2) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - input_mask = (input_ < self.vocab_start_index) | (input_ >= self.vocab_end_index) - masked_input = input_.clone() - self.vocab_start_index - masked_input[input_mask] = 0 - - output_parallel = F.embedding(masked_input, self.weight, self.padding_idx, *self.embed_args, - **self.embed_kwargs) - - output_parallel[input_mask, :] = 0. 
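- # rows for tokens outside this rank's vocabulary range were zeroed above, so
- # summing the partial embeddings over the column parallel group recovers the
- # full lookup; the reduce-scatter below performs that sum while re-splitting
- # the output along dim 0 across the group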
- output = reduce_scatter_tensor_2d(output_parallel, 0, ParallelMode.PARALLEL_2D_COL) - return output - - -@LAYERS.register_module -class Classifier2D(ParallelLayer): - """ - Classifier for 2D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of classes - :type num_classes: int - :param weight: weight of the classifier, defaults to True - :type weight: torch.nn.Parameter, optional - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True`` - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - def __init__(self, - in_features: int, - num_classes: int, - weight: Parameter = None, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - self.in_features = in_features - self.num_classes = num_classes - assert_summa_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - self.summa_dim = get_summa_dim_from_env() - - # partitioning dimension - self.input_size_per_partition = divide(self.in_features, self.summa_dim**2) - - if weight is not None: - self.weight = weight - self.has_weight = False - else: - self.weight = Parameter( - torch.empty(self.num_classes, self.input_size_per_partition, device=get_current_device(), dtype=dtype)) - self.has_weight = True - if bias: - self.bias = Parameter(torch.zeros(self.num_classes, device=get_current_device(), dtype=dtype)) - else: - self.bias = None - - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - if self.has_weight: - set_tensor_parallel_attribute_by_partition(self.weight, self.summa_dim**2) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.in_features, self.num_classes - col_src_rank = gpc.get_ranks_in_group(ParallelMode.PARALLEL_2D_COL)[0] - row_src_rank = gpc.get_ranks_in_group(ParallelMode.PARALLEL_2D_ROW)[0] - - if self.has_weight: - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - broadcast(self.bias, col_src_rank, ParallelMode.PARALLEL_2D_COL) - broadcast(self.bias, row_src_rank, ParallelMode.PARALLEL_2D_ROW) - - def forward(self, input_: Tensor) -> Tensor: - out_shape = input_.shape[:-1] + (self.num_classes, ) - - return classifier_2d(input_, self.weight, self.bias, self.summa_dim, out_shape, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, self.data_parallel_rank, - self.pipeline_parallel_rank, self.pipeline_parallel_size, self.tensor_parallel_size) - - -@LAYERS.register_module -class VocabParallelClassifier2D(ParallelLayer): - """ - Vocab parallel classifier layer for 2D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of classes - :type 
num_classes: int - :param weight: weight of the classifier, defaults to True - :type weight: torch.nn.Parameter, optional - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True`` - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - def __init__(self, - in_features: int, - num_classes: int, - weight: Parameter = None, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - - self.in_features = in_features - self.num_classes = num_classes - - # parallel setting - assert_summa_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - self.summa_dim = get_summa_dim_from_env() - - # partitioning dimension - self.input_size_per_partition = divide(in_features, self.summa_dim) - self.output_size_per_partition = divide(num_classes, self.summa_dim) - - # create weight, shape: [k/q, h/q] - factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - if weight is not None: - self.weight = weight - self.has_weight = False - else: - self.weight = Parameter( - torch.empty(self.output_size_per_partition, self.input_size_per_partition, **factory_kwargs)) - self.has_weight = True - # create bias, shape: [h/q] - if bias: - self.bias = Parameter(torch.empty(divide(self.num_classes, self.summa_dim**2), **factory_kwargs)) - else: - self.bias = None - - # initialize parameters - with seed(ParallelMode.TENSOR): - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - env.vocab_parallel = True - - def _set_tensor_parallel_attributes(self): - if self.has_weight: - set_tensor_parallel_attribute_by_partition(self.weight, self.summa_dim**2) - if self.bias is not None: - set_tensor_parallel_attribute_by_partition(self.bias, self.summa_dim**2) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - fan_in, fan_out = self.in_features, self.num_classes - if self.has_weight: - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - - def forward(self, x: Tensor) -> Tensor: - # input: [m/q, n/q, k/q] - # output: [m/q, n/q, h/q] - out_shape = x.shape[:-1] + (self.output_size_per_partition, ) - - output = Matmul_ABT_2D.apply(x, self.weight, self.summa_dim, out_shape, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - - if self.bias is not None: - output = add_bias_2d(output, self.bias, self.output_size_per_partition, self.row_rank, self.col_rank, - ParallelMode.PARALLEL_2D_ROW, ParallelMode.PARALLEL_2D_COL, False, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - return output diff --git a/colossalai/nn/layer/parallel_2p5d/__init__.py b/colossalai/nn/layer/parallel_2p5d/__init__.py deleted file 
mode 100644 index 5ca3516054967920a753a65faddd151fcf79cf7b..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_2p5d/__init__.py +++ /dev/null @@ -1,8 +0,0 @@ -from ._operation import reduce_by_batch_2p5d, split_tensor_2p5d -from .layers import (Classifier2p5D, Embedding2p5D, LayerNorm2p5D, Linear2p5D, PatchEmbedding2p5D, - VocabParallelClassifier2p5D, VocabParallelEmbedding2p5D) - -__all__ = [ - 'split_tensor_2p5d', 'reduce_by_batch_2p5d', 'Linear2p5D', 'LayerNorm2p5D', 'Classifier2p5D', 'PatchEmbedding2p5D', - 'Embedding2p5D', 'VocabParallelClassifier2p5D', 'VocabParallelEmbedding2p5D' -] diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index be8c55328f3ca5938262b09a75d9edf7c5939d1f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index a38dbb2610f6ebcce3f8461b2ecb7ec72ffb22cd..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/_operation.cpython-36.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/_operation.cpython-36.pyc deleted file mode 100644 index 260049ba33fd98db60cb3f6cc810103bcd76763b..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/_operation.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/_operation.cpython-37.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/_operation.cpython-37.pyc deleted file mode 100644 index 2879144afbfdad5d79055f375d6cfefc5c3272d6..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/_operation.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/_utils.cpython-36.pyc deleted file mode 100644 index 8424b202f564ca34dfb77133a890051930d76948..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/_utils.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/_utils.cpython-37.pyc deleted file mode 100644 index bde3557b9bbe078edaf213bedb4497096703105b..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/_utils.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/layers.cpython-36.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/layers.cpython-36.pyc deleted file mode 100644 index 2b9791b48115ee4bbf007c19da32ac36ba3bf1fa..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_2p5d/__pycache__/layers.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/__pycache__/layers.cpython-37.pyc b/colossalai/nn/layer/parallel_2p5d/__pycache__/layers.cpython-37.pyc deleted file mode 100644 index a0af1ad92e1f46e599accad790b14eae929987d2..0000000000000000000000000000000000000000 Binary files 
a/colossalai/nn/layer/parallel_2p5d/__pycache__/layers.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_2p5d/_operation.py b/colossalai/nn/layer/parallel_2p5d/_operation.py deleted file mode 100644 index 8974ff377a642157f4978e0e16121f4bde652c8d..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_2p5d/_operation.py +++ /dev/null @@ -1,871 +0,0 @@ -from typing import Any, Tuple - -import torch -import torch.distributed as dist -from colossalai.communication.collective import (all_gather, all_reduce, reduce_scatter) -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.utils import get_current_device -from torch import Tensor -from torch.cuda.amp import custom_bwd, custom_fwd - - -def get_parallel_group(parallel_mode: ParallelMode): - return gpc.get_group(parallel_mode) - - -def get_global_rank(): - return gpc.get_global_rank() - - -def get_parallel_rank(parallel_mode: ParallelMode): - return gpc.get_local_rank(parallel_mode) - - -class _Classifier2p5D(torch.autograd.Function): - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward( - ctx: Any, - A: Tensor, - B: Tensor, - bias, - tesseract_dim: int, - out_shape: Tuple[int, ...], - row_rank: int, - col_rank: int, - row_parallel_mode: ParallelMode, - col_parallel_mode: ParallelMode, - data_parallel_rank: int, - pipeline_parallel_rank: int, - pipeline_parallel_size: int, - tensor_parallel_size: int, - ) -> Tensor: - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - B_temp = all_gather(B, -1, col_parallel_mode) - if ctx: - ctx.save_for_backward(A, B_temp) - - C = torch.matmul(A, B_temp.transpose(0, 1)) - - C = all_reduce(C, row_parallel_mode) - - ctx.use_bias = bias is not None - if bias is not None: - C = C + bias - - out = C.reshape(out_shape) - - if ctx: - ctx.tesseract_dim = tesseract_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - - with torch.no_grad(): - A_grad = torch.matmul(output_grad, B) - A_grad = A_grad.reshape(ctx.A_shape) - B_grad = torch.matmul(output_grad.reshape(-1, output_grad.shape[-1]).transpose(0, 1), A) - B_grad = reduce_scatter(B_grad, -1, ctx.col_parallel_mode) - B_grad = B_grad.reshape(ctx.B_shape) - - if ctx.use_bias: - bias_grad = torch.sum(output_grad, dim=tuple(range(output_grad.ndim - 1))) - bias_grad = all_reduce(bias_grad, ctx.col_parallel_mode) - else: - bias_grad = None - - return A_grad, B_grad, bias_grad, None, None, None, None, None, None, None, None, None, None - - -def classifier_2p5d(A: Tensor, B: Tensor, bias, tesseract_dim: int, out_shape: Tuple[int, - ...], row_rank: int, col_rank: int, - row_parallel_mode: ParallelMode, col_parallel_mode: ParallelMode, data_parallel_rank: int, - pipeline_parallel_rank: int, pipeline_parallel_size: int, tensor_parallel_size: int) -> Tensor: - """ - Classifier - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param bias: 
matrix of bias - :type bias: torch.tensor, optional - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - return _Classifier2p5D.apply(A, B, bias, tesseract_dim, out_shape, row_rank, col_rank, row_parallel_mode, - col_parallel_mode, data_parallel_rank, pipeline_parallel_rank, pipeline_parallel_size, - tensor_parallel_size) - - -class Matmul_AB_2p5D(torch.autograd.Function): - """ - Matrix multiplication for :math:`C = AB` - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param dep_rank: the rank of depth - :type dep_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx: Any, A: Tensor, B: Tensor, tesseract_dim: int, out_shape: Tuple[int, ...], row_rank: int, - col_rank: int, dep_rank: int, row_parallel_mode: ParallelMode, col_parallel_mode: ParallelMode, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int) -> Tensor: - # A: [b / dq, s, h / q] -> [(b * s) / dq, h / q] - # B: [h / dq, s / q] - # C: [b / dq, s, s / q] -> [(b * s) / dq, s / q] - - assert A.shape[-1] == B.shape[-2], \ - 'Invalid shapes: A={}, B={} for AB.'.format(A.shape, B.shape) - - if ctx: - ctx.save_for_backward(A, B) - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - C_shape = (A.shape[0], B.shape[-1]) - C = torch.zeros(C_shape, dtype=A.dtype, device=get_current_device()) - - # use circular buffer to store the communication tensor - # 2 is enough for all cases - A_list = [torch.empty_like(A) for _ in range(2)] - B_list = [torch.empty_like(B) for _ in range(2)] - - row_group = gpc.get_group(row_parallel_mode) - col_group = gpc.get_group(col_parallel_mode) - - src_a = tesseract_dim * row_rank + tesseract_dim ** 2 * dep_rank + data_parallel_rank * 
pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - src_b = col_rank + tesseract_dim ** 2 * dep_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - - opa = [None] * 2 - opb = [None] * 2 - - A_list[0].copy_(A) - B_list[0].copy_(B) - opa[0] = dist.broadcast(A_list[0], src=src_a, group=row_group, async_op=True) - opb[0] = dist.broadcast(B_list[0], src=src_b, group=col_group, async_op=True) - cur = 0 - - for i in range(tesseract_dim): - if i != tesseract_dim - 1: - A_list[1 - cur].copy_(A) - opa[1 - cur] = dist.broadcast(A_list[1 - cur], src=src_a + 1, group=row_group, async_op=True) - B_list[1 - cur].copy_(B) - opb[1 - cur] = dist.broadcast(B_list[1 - cur], - src=src_b + tesseract_dim, - group=col_group, - async_op=True) - - if opa[cur] is not None: - opa[cur].wait() - if opb[cur] is not None: - opb[cur].wait() - - torch.addmm(C, A_list[cur], B_list[cur], out=C) - cur = 1 - cur - src_a += 1 - src_b += tesseract_dim - out = C.reshape(out_shape) - - if ctx: - ctx.tesseract_dim = tesseract_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.dep_rank = dep_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - with torch.no_grad(): - A_grad = Matmul_ABT_2p5D.apply(output_grad, B, ctx.tesseract_dim, ctx.A_shape, ctx.row_rank, ctx.col_rank, - ctx.dep_rank, ctx.row_parallel_mode, ctx.col_parallel_mode, - ctx.data_parallel_rank, ctx.pipeline_parallel_rank, - ctx.pipeline_parallel_size, ctx.tensor_parallel_size) - B_grad = Matmul_ATB_2p5D.apply(A, output_grad, ctx.tesseract_dim, ctx.B_shape, ctx.row_rank, ctx.col_rank, - ctx.dep_rank, ctx.row_parallel_mode, ctx.col_parallel_mode, - ctx.data_parallel_rank, ctx.pipeline_parallel_rank, - ctx.pipeline_parallel_size, ctx.tensor_parallel_size) - return A_grad, B_grad, None, None, None, None, None, None, None, None, None, None, None, None, None - - -class Matmul_ABT_2p5D(torch.autograd.Function): - """ - Matrix multiplication for :math:`C = AB^T` - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param dep_rank: the rank of depth - :type dep_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - @staticmethod - 
@custom_fwd(cast_inputs=torch.float16) - def forward(ctx: Any, A: Tensor, B: Tensor, tesseract_dim: int, out_shape: Tuple[int, ...], row_rank: int, - col_rank: int, dep_rank: int, row_parallel_mode: ParallelMode, col_parallel_mode: ParallelMode, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int) -> Tensor: - - assert A.shape[-1] == B.shape[-1], \ - 'Invalid shapes: A={}, B={} for ABT.'.format(A.shape, B.shape) - - if ctx: - ctx.save_for_backward(A, B) - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - C_shape = (A.shape[0], B.shape[0]) - C = torch.empty(C_shape, dtype=A.dtype, device=get_current_device()) - - # use circular buffer to store the communication tensor - # 2 is enough for all cases - B_list = [torch.empty_like(B) for _ in range(2)] - C_list = [torch.empty_like(C) for _ in range(2)] - - row_group = gpc.get_group(row_parallel_mode) - col_group = gpc.get_group(col_parallel_mode) - - src_b = col_rank + tesseract_dim ** 2 * dep_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - src_c = tesseract_dim * row_rank + tesseract_dim ** 2 * dep_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - - opb = [None] * 2 - opr = [None] * 2 - - B_list[0].copy_(B) - opb[0] = dist.broadcast(B_list[0], src=src_b, group=col_group, async_op=True) - cur = 0 - - for i in range(tesseract_dim): - if i != tesseract_dim - 1: - B_list[1 - cur].copy_(B) - opb[1 - cur] = dist.broadcast(B_list[1 - cur], - src=src_b + tesseract_dim, - group=col_group, - async_op=True) - - if opr[cur] is not None: - opr[cur].wait() - if i - 2 == col_rank: - C.copy_(C_list[cur]) - - if opb[cur] is not None: - opb[cur].wait() - - torch.matmul(A, B_list[cur].transpose(0, 1), out=C_list[cur]) - opr[cur] = dist.reduce(C_list[cur], dst=src_c, group=row_group, async_op=True) - cur = 1 - cur - src_b += tesseract_dim - src_c += 1 - - for op in opr: - op.wait() - - if tesseract_dim - 2 == col_rank: - C.copy_(C_list[cur]) - if tesseract_dim - 1 == col_rank: - C.copy_(C_list[1 - cur]) - out = C.reshape(out_shape) - - if ctx: - ctx.tesseract_dim = tesseract_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.dep_rank = dep_rank - ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - with torch.no_grad(): - A_grad = Matmul_AB_2p5D.apply(output_grad, B, ctx.tesseract_dim, ctx.A_shape, ctx.row_rank, ctx.col_rank, - ctx.dep_rank, ctx.row_parallel_mode, ctx.col_parallel_mode, - ctx.data_parallel_rank, ctx.pipeline_parallel_rank, - ctx.pipeline_parallel_size, ctx.tensor_parallel_size) - B_grad = Matmul_ATB_2p5D.apply(output_grad, A, ctx.tesseract_dim, ctx.B_shape, ctx.row_rank, ctx.col_rank, - ctx.dep_rank, ctx.row_parallel_mode, ctx.col_parallel_mode, - ctx.data_parallel_rank, ctx.pipeline_parallel_rank, - ctx.pipeline_parallel_size, ctx.tensor_parallel_size) - return A_grad, B_grad, None, None, None, None, None, None, None, None, None, None, 
None, None, None - - -class Matmul_ATB_2p5D(torch.autograd.Function): - """ - Matrix multiplication for :math:`C = A^TB` - - :param a: matrix :math:`A` - :type a: torch.tensor - :param b: matrix :math:`B` - :type b: torch.tensor - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param out_shape: shape of output tensor - :type out_shape: tuple - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param dep_rank: the rank of depth - :type dep_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx: Any, A: Tensor, B: Tensor, tesseract_dim: int, out_shape: Tuple[int, ...], row_rank: int, - col_rank: int, dep_rank: int, row_parallel_mode: ParallelMode, col_parallel_mode: ParallelMode, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int): - - assert A.shape[-2] == B.shape[-2], \ - 'Invalid shapes: A={}, B={} for ATB.'.format(A.shape, B.shape) - - if ctx: - ctx.save_for_backward(A, B) - - A_shape = A.shape - A = A.reshape((-1, A_shape[-1])) - B_shape = B.shape - B = B.reshape((-1, B_shape[-1])) - C_shape = (A.shape[-1], B.shape[-1]) - C = torch.empty(C_shape, dtype=A.dtype, device=get_current_device()) - - # use circular buffer to store the communication tensor - # 2 is enough for all cases - A_list = [torch.empty_like(A) for _ in range(2)] - C_list = [torch.empty_like(C) for _ in range(2)] - - row_group = gpc.get_group(row_parallel_mode) - col_group = gpc.get_group(col_parallel_mode) - - src_a = tesseract_dim * row_rank + tesseract_dim ** 2 * dep_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - src_c = col_rank + tesseract_dim ** 2 * dep_rank + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - - opa = [None] * 2 - opr = [None] * 2 - - A_list[0].copy_(A) - opa[0] = dist.broadcast(A_list[0], src=src_a, group=row_group, async_op=True) - cur = 0 - - for i in range(tesseract_dim): - if i != tesseract_dim - 1: - A_list[1 - cur].copy_(A) - opa[1 - cur] = dist.broadcast(A_list[1 - cur], src=src_a + 1, group=row_group, async_op=True) - - if opr[cur] is not None: - opr[cur].wait() - if i - 2 == row_rank: - C.copy_(C_list[cur]) - - if opa[cur] is not None: - opa[cur].wait() - - torch.matmul(A_list[cur].transpose(0, 1), B, out=C_list[cur]) - opr[cur] = dist.reduce(C_list[cur], dst=src_c, group=col_group, async_op=True) - cur = 1 - cur - src_a += 1 - src_c += tesseract_dim - - for op in opr: - op.wait() - - if tesseract_dim - 2 == row_rank: - C.copy_(C_list[cur]) - if tesseract_dim - 1 == row_rank: - C.copy_(C_list[1 - cur]) - out = C.reshape(out_shape) - - if ctx: - ctx.tesseract_dim = tesseract_dim - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.dep_rank = dep_rank 
- ctx.row_parallel_mode = row_parallel_mode - ctx.col_parallel_mode = col_parallel_mode - ctx.A_shape = A_shape - ctx.B_shape = B_shape - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - return out - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - A, B = ctx.saved_tensors - with torch.no_grad(): - A_grad = Matmul_ABT_2p5D.apply(B, output_grad, ctx.tesseract_dim, ctx.A_shape, ctx.row_rank, ctx.col_rank, - ctx.dep_rank, ctx.row_parallel_mode, ctx.col_parallel_mode, - ctx.data_parallel_rank, ctx.pipeline_parallel_rank, - ctx.pipeline_parallel_size, ctx.tensor_parallel_size) - B_grad = Matmul_AB_2p5D.apply(A, output_grad, ctx.tesseract_dim, ctx.B_shape, ctx.row_rank, ctx.col_rank, - ctx.dep_rank, ctx.row_parallel_mode, ctx.col_parallel_mode, - ctx.data_parallel_rank, ctx.pipeline_parallel_rank, - ctx.pipeline_parallel_size, ctx.tensor_parallel_size) - return A_grad, B_grad, None, None, None, None, None, None, None, None, None, None, None, None, None - - -class _Add_Bias_2p5D(torch.autograd.Function): - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx: Any, input: Tensor, bias: Tensor, output_size_per_partition: int, tesseract_dim: int, - row_rank: int, col_rank: int, dep_rank: int, col_parallel_mode: ParallelMode, skip_bias_add: bool, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int) -> Tensor: - if row_rank == 0: - bias_temp = bias.clone() - else: - bias_temp = torch.zeros(output_size_per_partition, dtype=bias.dtype, device=get_current_device()) - src_rank = col_rank + dep_rank * tesseract_dim ** 2 + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - dist.broadcast(bias_temp, src=src_rank, group=get_parallel_group(col_parallel_mode)) - - ctx.row_rank = row_rank - ctx.col_rank = col_rank - ctx.dep_rank = dep_rank - ctx.tesseract_dim = tesseract_dim - ctx.col_parallel_mode = col_parallel_mode - ctx.bias = skip_bias_add - ctx.data_parallel_rank = data_parallel_rank - ctx.pipeline_parallel_rank = pipeline_parallel_rank - ctx.pipeline_parallel_size = pipeline_parallel_size - ctx.tensor_parallel_size = tensor_parallel_size - - if skip_bias_add: - return bias_temp - else: - output = input + bias_temp - return output - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - row_rank = ctx.row_rank - col_rank = ctx.col_rank - dep_rank = ctx.dep_rank - tesseract_dim = ctx.tesseract_dim - col_parallel_mode = ctx.col_parallel_mode - data_parallel_rank = ctx.data_parallel_rank - pipeline_parallel_rank = ctx.pipeline_parallel_rank - pipeline_parallel_size = ctx.pipeline_parallel_size - tensor_parallel_size = ctx.tensor_parallel_size - - if ctx.bias: - dst_rank = col_rank + dep_rank * ( - tesseract_dim ** 2) + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - dist.reduce(output_grad, dst=dst_rank, group=get_parallel_group(col_parallel_mode)) - if row_rank == 0: - return None, output_grad, None, None, None, None, None, None, None, None, None, None, None, None, None, None - else: - grad_tmp = torch.zeros_like(output_grad) - return None, grad_tmp, None, None, None, None, None, None, None, None, None, None, None, None, None, None - 
else: - reduce_dim = tuple(range(output_grad.ndim - 1)) - reduce = torch.sum(output_grad, dim=reduce_dim) - dst_rank = col_rank + dep_rank * ( - tesseract_dim ** 2) + data_parallel_rank * pipeline_parallel_size * tensor_parallel_size + \ - pipeline_parallel_rank * tensor_parallel_size - dist.reduce(reduce, dst=dst_rank, group=get_parallel_group(col_parallel_mode)) - if row_rank == 0: - return output_grad, reduce, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None - else: - reduce_tmp = torch.zeros_like(reduce) - return output_grad, reduce_tmp, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None - - -def add_bias_2p5d(input: Tensor, bias: Tensor, output_size_per_partition: int, tesseract_dim: int, row_rank: int, - col_rank: int, dep_rank: int, col_parallel_mode: ParallelMode, skip_bias_add: bool, - data_parallel_rank: int, pipeline_parallel_rank: int, pipeline_parallel_size: int, - tensor_parallel_size: int) -> Tensor: - """ - Matrix add bias: :math:`C = A + b` - - :param input: matrix :math:`A` - :type input: torch.tensor - :param bias: matrix :math:`b` - :type bias: torch.tensor - :param output_size_per_partition: output size in each partition - :type output_size_per_partition: int - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param row_rank: the rank of row - :type row_rank: int - :param col_rank: the rank of column - :type col_rank: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param skip_bias_add: If set to ``True``, it will skip bias add for linear layer, which is preserved for kernel fusion - :type skip_bias_add: bool - :param data_parallel_rank: data parallel rank - :type data_parallel_rank: int - :param pipeline_parallel_rank: pipeline parallel rank - :type pipeline_parallel_rank: int - :param pipeline_parallel_size: pipeline parallel size - :type pipeline_parallel_size: int - :param tensor_parallel_size: tensor parallel size - :type tensor_parallel_size: int - """ - return _Add_Bias_2p5D.apply(input, bias, output_size_per_partition, tesseract_dim, row_rank, col_rank, dep_rank, - col_parallel_mode, skip_bias_add, data_parallel_rank, pipeline_parallel_rank, - pipeline_parallel_size, tensor_parallel_size) - - -class _Layernorm2p5D(torch.autograd.Function): - """ - Layernorm - - :param input: input maxtrix - :type input: torch.tensor - :param E_x: mean - :type E_x: torch.tensor - :param Var_x: variance - :type Var_x: torch.tensor - :param hidden_size: hidden size - :type hidden_size: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx: Any, input: Tensor, E_x: Tensor, Var_x: Tensor, hidden_size: int, - row_parallel_mode: ParallelMode) -> Tensor: - input = input - E_x - # in here, input = x - E[x], Var_x = 1 / sqrt(Var[x] + eps) - ctx.hidden_size = hidden_size - output = input * Var_x - ctx.save_for_backward(output, Var_x) - ctx.row_parallel_mode = row_parallel_mode - return output - - @staticmethod - @custom_bwd - def backward(ctx, output_grad): - row_parallel_mode = ctx.row_parallel_mode - x, Var_x = ctx.saved_tensors - # in here, Var_x = 1 / sqrt(Var[x] + eps), x = (x - E[x]) * Var_x - with torch.no_grad(): - 
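- # standard layernorm backward identity: with y = (x - E[x]) * r and
- # r = 1 / sqrt(Var[x] + eps), the input gradient is
- #   dx = r * (dy - mean(dy) - y * mean(dy * y)),
- # where both means run over the hidden dimension; the all-reduces below
- # accumulate the partial means across the row parallel group that splits it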
output_grad_sum = torch.sum(output_grad, dim=-1, keepdim=True) - torch.distributed.all_reduce(output_grad_sum, group=get_parallel_group(row_parallel_mode)) - output_grad_sum /= ctx.hidden_size - - output_grad_mul_x_sum = torch.sum(output_grad * x, dim=-1, keepdim=True) - torch.distributed.all_reduce(output_grad_mul_x_sum, group=get_parallel_group(row_parallel_mode)) - output_grad_mul_x_sum /= ctx.hidden_size - - input_grad = output_grad.clone() - input_grad -= x * output_grad_mul_x_sum - input_grad -= output_grad_sum - input_grad *= Var_x - - return input_grad, None, None, None, None, None, None - - -def layernorm_2p5d(input: Tensor, E_x: Tensor, Var_x: Tensor, hidden_size: int, - row_parallel_mode: ParallelMode) -> Tensor: - """ - Layernorm - - :param input: input maxtrix - :type input: torch.tensor - :param E_x: mean - :type E_x: torch.tensor - :param Var_x: variance - :type Var_x: torch.tensor - :param hidden_size: hidden size - :type hidden_size: int - :param row_parallel_mode: row parallel mode - :type row_parallel_mode: colossalai.context.parallel_mode.ParallelMode - """ - return _Layernorm2p5D.apply(input, E_x, Var_x, hidden_size, row_parallel_mode) - - -class _AllGatherTensor2p5D(torch.autograd.Function): - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx: Any, inputs: Tensor, dim: int, col_parallel_mode: ParallelMode) -> Tensor: - ctx.dim = dim - ctx.col_parallel_mode = col_parallel_mode - - outputs = all_gather(inputs, dim, col_parallel_mode) - return outputs - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - grad = reduce_scatter(output_grad, ctx.dim, ctx.col_parallel_mode) - return grad.contiguous(), None, None - - -def all_gather_tensor_2p5d(inputs: Tensor, dim: int, col_parallel_mode: ParallelMode) -> Tensor: - """ - all gather the weight of 2.5D parallelism - - :param inputs: input maxtrix - :type inputs: torch.tensor - :param dim: dimension of all gather - :type dim: int - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - """ - return _AllGatherTensor2p5D.apply(inputs, dim, col_parallel_mode) - - -class SplitFirst(torch.autograd.Function): - """ - :param inputs: input maxtrix - :type inputs: torch.tensor - :param tesseract_dim: dimension of TESSERACT fo 2.5D parallelism - :type tesseract_dim: int - :param col_parallel_mode: column parallel mode - :type col_parallel_mode: colossalai.context.parallel_mode.ParallelMode - """ - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx: Any, inputs: Tensor, tesseract_dim: int, col_parallel_mode: ParallelMode) -> Tensor: - ctx.tesseract_dim = tesseract_dim - ctx.batch_size = inputs.size(0) - ctx.para_mode = col_parallel_mode - row_rank = gpc.get_local_rank(col_parallel_mode) - - outputs = inputs.chunk(tesseract_dim, dim=0)[row_rank] - return outputs - - @staticmethod - @custom_bwd - def backward(ctx: Any, output_grad: Tensor) -> Tuple[Tensor, ...]: - grad_shape = (ctx.batch_size, ) + output_grad.shape[1:] - grad = torch.empty(grad_shape, dtype=output_grad.dtype, device=get_current_device()) - dist.all_gather(list(grad.chunk(ctx.tesseract_dim, dim=0)), - output_grad.contiguous(), - group=gpc.get_group(ctx.para_mode)) - return grad, None, None - - -def split_tensor_2p5d(input_: Tensor, dim: int = 0) -> Tensor: - """Splits 2P5D tensor in specified dimension across cols - - :param 
input_: Input tensor - :param dim: Specified dimension in which to split - - :type input_: torch.Tensor - :type dim: int, optional - - :return output: Splitted tensor - :rtype output: torch.Tensor - """ - if input_.size(dim) <= 1: - return input_ - return torch.chunk(input_, gpc.get_world_size(ParallelMode.PARALLEL_2P5D_COL), - dim=dim)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL)].contiguous() - - -class _ReduceTensor2p5D(torch.autograd.Function): - @staticmethod - def forward(ctx, input_, parallel_mode): - return all_reduce(input_, parallel_mode) - - @staticmethod - def backward(ctx, output_grad): - return output_grad, None - - -def reduce_tensor_2p5d(input_: Tensor, parallel_mode: ParallelMode) -> Tensor: - """ - All-reduce the input. - - :param input_: input tensor - :param parallel_mode: parallel mode - """ - return _ReduceTensor2p5D.apply(input_, parallel_mode) - - -class _ReduceScatterTensor2p5D(torch.autograd.Function): - @staticmethod - def forward(ctx, input_, dim, parallel_mode): - ctx.dim = dim - ctx.parallel_mode = parallel_mode - return reduce_scatter(input_, dim, parallel_mode) - - @staticmethod - def backward(ctx, output_grad): - return all_gather(output_grad, ctx.dim, ctx.parallel_mode), None, None - - -def reduce_scatter_tensor_2p5d(input_: Tensor, dim: int, parallel_mode: ParallelMode) -> Tensor: - """ - Reduce-scatter the input. - - :param input_: input tensor - :param parallel_mode: parallel mode - """ - return _ReduceScatterTensor2p5D.apply(input_, dim, parallel_mode) - - -class _RreduceByBatch2p5D(torch.autograd.Function): - @staticmethod - def symbolic(graph, input_, reduce_mean: bool = False): - output = all_reduce(input_, ParallelMode.PARALLEL_2P5D_COL) - if reduce_mean: - reduce_size = gpc.get_world_size(ParallelMode.PARALLEL_2P5D_COL) - return output / reduce_size - return output - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, input_, reduce_mean: bool = False): - output = all_reduce(input_, ParallelMode.PARALLEL_2P5D_COL) - ctx.reduce_mean = reduce_mean - if reduce_mean: - reduce_size = gpc.get_world_size(ParallelMode.PARALLEL_2P5D_COL) - ctx.reduce_size = reduce_size - return output.clone() / reduce_size - return output.clone() - - @staticmethod - @custom_bwd - def backward(ctx, output_grad): - if ctx.reduce_mean: - return output_grad / ctx.reduce_size, None - else: - return output_grad, None - - -def reduce_by_batch_2p5d(input_, reduce_mean: bool = False) -> Tensor: - """ - All-reduce the input from the model parallel region. 
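- 
- This is typically paired with ``split_tensor_2p5d``: once the batch has been scattered across the column parallel group, batch-dependent scalars such as losses or metrics must be all-reduced (and optionally averaged) to recover their full-batch values. A minimal usage sketch, assuming the 2.5D process groups are initialized:
- 
- .. code-block:: python
- 
-     loss = criterion(logits, labels)                     # loss on the local batch shard
-     loss = reduce_by_batch_2p5d(loss, reduce_mean=True)  # mean over the full batch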
- - :param input_: input maxtrix - :type input_: torch.tensor - :param reduce_mean: If set to ``True``, it will divide the output by column parallel size, default to False - :type reduce_mean: bool, optional - """ - return _RreduceByBatch2p5D.apply(input_, reduce_mean) \ No newline at end of file diff --git a/colossalai/nn/layer/parallel_2p5d/_utils.py b/colossalai/nn/layer/parallel_2p5d/_utils.py deleted file mode 100644 index bcab619ca0a183b7def84ce0dd8fe9c419292f39..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_2p5d/_utils.py +++ /dev/null @@ -1,24 +0,0 @@ -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.global_variables import tensor_parallel_env as env - - -def get_tesseract_dim_dep_from_env(): - try: - tesseract_dim = env.tesseract_dim - tesseract_dep = env.tesseract_dep - assert tesseract_dim > 0, 'TESSERACT_DIM must be larger than zero' - assert tesseract_dep > 0, 'TESSERACT_DEP must be larger than zero' - return tesseract_dim, tesseract_dep - - except KeyError as e: - raise EnvironmentError('TESSERACT_DIM or TESSERACT_DEP is not found in the current environment, ' - 'please make sure that you have used the correct process group initializer') - - -def assert_tesseract_initialization(): - assert gpc.is_initialized(ParallelMode.PARALLEL_2P5D_COL) and \ - gpc.is_initialized(ParallelMode.PARALLEL_2P5D_ROW) and \ - gpc.is_initialized(ParallelMode.PARALLEL_2P5D_DEP) and \ - gpc.is_initialized(ParallelMode.PARALLEL_2P5D_XZ), \ - 'Both PARALLEL_2P5D_COL, PARALLEL_2P5D_ROW, PARALLEL_2P5D_DEP and PARALLEL_2P5D_XZ must be initialized by the process group initializer' diff --git a/colossalai/nn/layer/parallel_2p5d/layers.py b/colossalai/nn/layer/parallel_2p5d/layers.py deleted file mode 100644 index 7dd17f21b27332dbaa3a225d64449c66c2c664cf..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_2p5d/layers.py +++ /dev/null @@ -1,640 +0,0 @@ -import math -from typing import Callable - -import torch -import torch.nn as nn -import torch.nn.functional as F -from colossalai.communication import broadcast -from colossalai.context import ParallelMode, seed -from colossalai.core import global_context as gpc -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.nn import init as init -from colossalai.registry import LAYERS -from colossalai.utils.cuda import get_current_device -from torch import Tensor -from torch.nn import Parameter - -from ..base_layer import ParallelLayer -from ..utils import divide, set_tensor_parallel_attribute_by_partition, to_2tuple -from ._operation import (add_bias_2p5d, Matmul_AB_2p5D, Matmul_ABT_2p5D, all_gather_tensor_2p5d, classifier_2p5d, - layernorm_2p5d, reduce_scatter_tensor_2p5d, split_tensor_2p5d) -from ._utils import assert_tesseract_initialization, get_tesseract_dim_dep_from_env - - -@LAYERS.register_module -class Linear2p5D(ParallelLayer): - """ - Linear layer for 2.5D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param out_features: size of each output sample - :type out_features: int - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param 
bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - def __init__(self, - in_features: int, - out_features: int, - bias: bool = True, - dtype: torch.dtype = None, - skip_bias_add: bool = False, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - - self.in_features = in_features - self.out_features = out_features - self.skip_bias_add = skip_bias_add - - # parallel setting - assert_tesseract_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - self.dep_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - self.tesseract_dim, _ = get_tesseract_dim_dep_from_env() - - # partitioning dimension - self.input_size_per_partition = divide(in_features, self.tesseract_dim) - self.hidden_size_per_partition = divide(out_features, self.tesseract_dim) - - # create weight, shape: [k/q, h/q] - factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - self.weight = Parameter( - torch.empty(self.input_size_per_partition, self.hidden_size_per_partition, **factory_kwargs)) - - # create bias, shape: [h/q] - if bias: - self.bias = Parameter(torch.empty(self.hidden_size_per_partition, **factory_kwargs)) - else: - self.register_parameter('bias', None) - - # initialize parameters - with seed(ParallelMode.TENSOR): - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.tesseract_dim**2) - if self.bias is not None: - set_tensor_parallel_attribute_by_partition(self.bias, self.tesseract_dim) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - fan_in, fan_out = self.in_features, self.out_features - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - - def forward(self, x: Tensor) -> Tensor: - # input: [m/dq, n/q, k/q] - # output: [m/dq, n/q, h/q] - out_shape = x.shape[:-1] + (self.hidden_size_per_partition, ) - - output = Matmul_AB_2p5D.apply( - x, - self.weight, - self.tesseract_dim, - out_shape, - self.row_rank, - self.col_rank, - self.dep_rank, - ParallelMode.PARALLEL_2P5D_ROW, - ParallelMode.PARALLEL_2P5D_COL, - self.data_parallel_rank, - self.pipeline_parallel_rank, - self.pipeline_parallel_size, - self.tensor_parallel_size, - ) - - if self.bias is not None: - if self.skip_bias_add: - bias = add_bias_2p5d(None, self.bias, self.hidden_size_per_partition, self.tesseract_dim, self.row_rank, - self.col_rank, self.dep_rank, ParallelMode.PARALLEL_2P5D_COL, True, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - return output, bias - else: - output = add_bias_2p5d(output, self.bias, self.hidden_size_per_partition, self.tesseract_dim, - self.row_rank, self.col_rank, self.dep_rank, ParallelMode.PARALLEL_2P5D_COL, - False, self.data_parallel_rank, self.pipeline_parallel_rank, - self.pipeline_parallel_size, self.tensor_parallel_size) - return output - else: - return output - - -@LAYERS.register_module -class LayerNorm2p5D(ParallelLayer): - r""" - Layer Normalization for 2.5D parallelism - - :param normalized_shape: input shape from an expected input - of size. 
:math:`[* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]` - If a single integer is used, it is treated as a singleton list, and this module will - normalize over the last dimension which is expected to be of that specific size. - :type normalized_shape: int - :param eps: a value added to the denominator for numerical stability, defaults to 1e-05 - :type eps: float, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - """ - def __init__(self, normalized_shape: int, eps: float = 1e-05, dtype=None): - super().__init__() - - # layer norm config - self.normalized_shape = normalized_shape - self.variance_epsilon = eps - - # parallel setting - assert_tesseract_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - self.dep_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - self.tesseract_dim, _ = get_tesseract_dim_dep_from_env() - - # partitioning dimension - self.partitioned_partition = divide(normalized_shape, self.tesseract_dim) # * - - # create parameters - factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - - self.gamma = Parameter(torch.ones(self.partitioned_partition, **factory_kwargs)) - self.beta = Parameter(torch.zeros(self.partitioned_partition, **factory_kwargs)) - - self._set_tensor_parallel_attribute() - - def _set_tensor_parallel_attribute(self): - set_tensor_parallel_attribute_by_partition(self.gamma, self.tesseract_dim) - set_tensor_parallel_attribute_by_partition(self.beta, self.tesseract_dim) - - def forward(self, x: Tensor) -> Tensor: - with torch.no_grad(): - E_x = torch.sum(x, dim=-1, keepdim=True) # [b/q, s, 1] - torch.distributed.all_reduce(E_x, group=gpc.get_group(ParallelMode.PARALLEL_2P5D_ROW)) - E_x /= self.normalized_shape - - # Var_x in the block below is the sum of input^2 - Var_x = torch.sum(x * x, dim=-1, keepdim=True) # [b/q, s, 1] - torch.distributed.all_reduce(Var_x, group=gpc.get_group(ParallelMode.PARALLEL_2P5D_ROW)) - Var_x /= self.normalized_shape - - Var_x = Var_x - E_x * E_x # variance of x [b/q, s, 1] - # this time 1/sqrt(Var_x + epsilon) - Var_x = 1.0 / torch.sqrt(Var_x + self.variance_epsilon) - - output = layernorm_2p5d(x, E_x, Var_x, self.normalized_shape, ParallelMode.PARALLEL_2P5D_ROW) - bias = add_bias_2p5d(None, self.beta, self.partitioned_partition, self.tesseract_dim, self.row_rank, - self.col_rank, self.dep_rank, ParallelMode.PARALLEL_2P5D_COL, True, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - scale = add_bias_2p5d(None, self.gamma, self.partitioned_partition, self.tesseract_dim, self.row_rank, - self.col_rank, self.dep_rank, ParallelMode.PARALLEL_2P5D_COL, True, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - output = torch.addcmul(bias, scale, output) - return output - - -@LAYERS.register_module -class PatchEmbedding2p5D(ParallelLayer): - """ - 2D Image to Patch Embedding - - :param img_size: image size - :type img_size: int - :param patch_size: patch size - :type patch_size: int - :param in_chans: number of channels of input image - :type in_chans: int - :param embed_size: size of embedding - :type embed_size: int - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param flatten: whether to flatten output 
tensor, defaults to True - :type flatten: bool, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - :param position_embed_initializer: The intializer of position embedding, defaults to zero - :type position_embed_initializer: typing.Callable, optional - """ - def __init__(self, - img_size: int, - patch_size: int, - in_chans: int, - embed_size: int, - flatten: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - position_embed_initializer: Callable = init.zeros_()): - super().__init__() - img_size = to_2tuple(img_size) - patch_size = to_2tuple(patch_size) - - assert_tesseract_initialization() - self.tesseract_dim, self.tesseract_dep = get_tesseract_dim_dep_from_env() - self.img_size = img_size - self.patch_size = patch_size - self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1]) - self.num_patches = self.grid_size[0] * self.grid_size[1] - self.flatten = flatten - self.embed_size = embed_size - self.embed_size_per_partition = embed_size // self.tesseract_dim**2 - - with seed(ParallelMode.TENSOR): - self.weight = Parameter( - torch.empty((self.embed_size_per_partition, in_chans, *self.patch_size), - device=get_current_device(), - dtype=dtype)) - self.bias = Parameter(torch.empty(self.embed_size_per_partition, device=get_current_device(), dtype=dtype)) - - self.cls_token = Parameter( - torch.zeros((1, 1, self.embed_size_per_partition), device=get_current_device(), dtype=dtype)) - self.pos_embed = Parameter( - torch.zeros((1, self.num_patches + 1, self.embed_size_per_partition), - device=get_current_device(), - dtype=dtype)) - - self.reset_parameters(weight_initializer, bias_initializer, position_embed_initializer) - self._set_tensor_parallel_attribute() - - def _set_tensor_parallel_attribute(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.tesseract_dim**2) - set_tensor_parallel_attribute_by_partition(self.bias, self.tesseract_dim**2) - set_tensor_parallel_attribute_by_partition(self.cls_token, self.tesseract_dim**2) - set_tensor_parallel_attribute_by_partition(self.pos_embed, self.tesseract_dim**2) - - def reset_parameters(self, weight_initializer, bias_initializer, position_embed_initializer): - with seed(ParallelMode.TENSOR): - fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight) - fan_out = self.embed_size - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - bias_initializer(self.bias, fan_in=fan_in) - position_embed_initializer(self.pos_embed) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_2p5d(input_, 0) - - B, C, H, W = input_.shape - assert H == self.img_size[0] and W == self.img_size[1], \ - f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." 
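-        # NOTE: the convolution weight/bias below exist only as column-group
-        # shards; they are re-assembled with a differentiable all-gather
-        # before the dense convolution, so gradients flow back to each shard.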
- - weight = all_gather_tensor_2p5d(self.weight, 0, ParallelMode.PARALLEL_2P5D_COL) - bias = all_gather_tensor_2p5d(self.bias, 0, ParallelMode.PARALLEL_2P5D_COL) - - output = F.conv2d(input_, weight, bias, stride=self.patch_size) - if self.flatten: - output = output.flatten(2).transpose(1, 2) # BCHW -> BNC - - cls_token = all_gather_tensor_2p5d(self.cls_token, -1, ParallelMode.PARALLEL_2P5D_COL) - pos_embed = all_gather_tensor_2p5d(self.pos_embed, -1, ParallelMode.PARALLEL_2P5D_COL) - cls_token = cls_token.expand(output.shape[0], -1, -1) - output = torch.cat((cls_token, output), dim=1) - output = output + pos_embed - - return output - - -@LAYERS.register_module -class Embedding2p5D(ParallelLayer): - """ - Embedding for 2.5D parallelism - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - - assert_tesseract_initialization() - self.tesseract_dim, self.tesseract_dep = get_tesseract_dim_dep_from_env() - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - embed_dim_per_partition = embedding_dim // self.tesseract_dim**2 - - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - self.weight = Parameter( - torch.empty((num_embeddings, embed_dim_per_partition), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.tesseract_dim**2) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_2p5d(input_, 0) - - weight = all_gather_tensor_2p5d(self.weight, -1, ParallelMode.PARALLEL_2P5D_COL) - - output = F.embedding(input_, weight, self.padding_idx, *self.embed_args, **self.embed_kwargs) - - return output - - -@LAYERS.register_module -class VocabParallelEmbedding2p5D(torch.nn.Module): - """Embedding parallelized in the vocabulary dimension. 
- - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - assert_tesseract_initialization() - self.tesseract_dim, self.tesseract_dep = get_tesseract_dim_dep_from_env() - self.num_embeddings_per_partition = divide(self.num_embeddings, self.tesseract_dim) - self.embed_dim_per_partition = divide(self.embed_dim, self.tesseract_dim) - tensor_parallel_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - self.vocab_start_index = tensor_parallel_rank * self.num_embeddings_per_partition - self.vocab_end_index = self.vocab_start_index + self.num_embeddings_per_partition - - self.weight = Parameter( - torch.empty((self.num_embeddings_per_partition, self.embed_dim_per_partition), - device=get_current_device(), - dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - env.vocab_parallel = True - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.tesseract_dim**2) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - # Build the mask. - input_mask = (input_ < self.vocab_start_index) | (input_ >= self.vocab_end_index) - # Mask the input. - masked_input = input_.clone() - self.vocab_start_index - masked_input[input_mask] = 0 - - output_parallel = F.embedding(masked_input, self.weight, self.padding_idx, *self.embed_args, - **self.embed_kwargs) - - # Mask the output embedding. - output_parallel[input_mask, :] = 0. - # Reduce across all the model parallel GPUs. 
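-        # The reduce-scatter below sums the partial embeddings contributed by
-        # every vocabulary shard and simultaneously re-splits the result along
-        # dim 0 across the column group.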
- output = reduce_scatter_tensor_2p5d(output_parallel, 0, ParallelMode.PARALLEL_2P5D_COL) - return output - - -@LAYERS.register_module -class Classifier2p5D(ParallelLayer): - """ - Classifier for 2.5D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of classes - :type num_classes: int - :param weight: weight of the classifier, defaults to True - :type weight: torch.nn.Parameter, optional - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - def __init__(self, - in_features: int, - num_classes: int, - weight: Parameter = None, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - self.in_features = in_features - self.num_classes = num_classes - assert_tesseract_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - self.dep_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - self.tesseract_dim, self.tesseract_dep = get_tesseract_dim_dep_from_env() - - # partitioning dimension - self.input_size_per_partition = divide(self.in_features, self.tesseract_dim**2) - - if weight is not None: - self.weight = weight - self.has_weight = False - else: - self.weight = Parameter( - torch.empty(self.num_classes, self.input_size_per_partition, device=get_current_device(), dtype=dtype)) - self.has_weight = True - if bias: - self.bias = Parameter(torch.zeros(self.num_classes, device=get_current_device(), dtype=dtype)) - else: - self.bias = None - - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self): - if self.has_weight: - set_tensor_parallel_attribute_by_partition(self.weight, self.tesseract_dim**2) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.in_features, self.num_classes - col_src_rank = gpc.get_ranks_in_group(ParallelMode.PARALLEL_2P5D_COL)[0] - row_src_rank = gpc.get_ranks_in_group(ParallelMode.PARALLEL_2P5D_ROW)[0] - - if self.has_weight: - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - broadcast(self.bias, col_src_rank, ParallelMode.PARALLEL_2P5D_COL) - broadcast(self.bias, row_src_rank, ParallelMode.PARALLEL_2P5D_ROW) - - def forward(self, input_: Tensor) -> Tensor: - out_shape = input_.shape[:-1] + (self.num_classes, ) - - return classifier_2p5d(input_, self.weight, self.bias, self.tesseract_dim, out_shape, self.row_rank, - self.col_rank, ParallelMode.PARALLEL_2P5D_ROW, ParallelMode.PARALLEL_2P5D_COL, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - - -@LAYERS.register_module -class VocabParallelClassifier2p5D(ParallelLayer): - """ - Vocab parallel classifier layer for 
2.5D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of classes - :type num_classes: int - :param weight: weight of the classifier, defaults to True - :type weight: torch.nn.Parameter, optional - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True`` - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - def __init__(self, - in_features: int, - num_classes: int, - weight: Parameter = None, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - - self.in_features = in_features - self.num_classes = num_classes - - # parallel setting - assert_tesseract_initialization() - self.row_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - self.col_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - self.dep_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - self.tesseract_dim, _ = get_tesseract_dim_dep_from_env() - - # partitioning dimension - self.input_size_per_partition = divide(in_features, self.tesseract_dim) - self.hidden_size_per_partition = divide(num_classes, self.tesseract_dim) - - # create weight, shape: [k/q, h/q] - factory_kwargs = {'device': get_current_device(), 'dtype': dtype} - if weight is not None: - self.weight = weight - self.has_weight = False - else: - self.weight = Parameter( - torch.empty(self.hidden_size_per_partition, self.input_size_per_partition, **factory_kwargs)) - self.has_weight = True - # create bias, shape: [h/q] - if bias: - self.bias = Parameter(torch.empty(self.hidden_size_per_partition, **factory_kwargs)) - else: - self.bias = None - - # initialize parameters - with seed(ParallelMode.TENSOR): - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - env.vocab_parallel = True - - def _set_tensor_parallel_attributes(self): - if self.has_weight: - set_tensor_parallel_attribute_by_partition(self.weight, self.tesseract_dim**2) - if self.bias is not None: - set_tensor_parallel_attribute_by_partition(self.bias, self.tesseract_dim) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - fan_in, fan_out = self.in_features, self.num_classes - if self.has_weight: - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - - def forward(self, x: Tensor) -> Tensor: - # input: [m/dq, n/q, k/q] - # output: [m/dq, n/q, h/q] - out_shape = x.shape[:-1] + (self.hidden_size_per_partition, ) - - output = Matmul_ABT_2p5D.apply( - x, - self.weight, - self.tesseract_dim, - out_shape, - self.row_rank, - self.col_rank, - self.dep_rank, - ParallelMode.PARALLEL_2P5D_ROW, - ParallelMode.PARALLEL_2P5D_COL, - self.data_parallel_rank, - self.pipeline_parallel_rank, - self.pipeline_parallel_size, - self.tensor_parallel_size, - ) - - if self.bias is not None: - output = add_bias_2p5d(output, self.bias, self.hidden_size_per_partition, self.tesseract_dim, self.row_rank, - self.col_rank, 
self.dep_rank, ParallelMode.PARALLEL_2P5D_COL, False, - self.data_parallel_rank, self.pipeline_parallel_rank, self.pipeline_parallel_size, - self.tensor_parallel_size) - return output diff --git a/colossalai/nn/layer/parallel_3d/__init__.py b/colossalai/nn/layer/parallel_3d/__init__.py deleted file mode 100644 index 9ae255b449ee7f57a08a3bb596102860bb1b60d3..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_3d/__init__.py +++ /dev/null @@ -1,8 +0,0 @@ -from ._operation import reduce_by_batch_3d, split_batch_3d, split_tensor_3d -from .layers import (Classifier3D, Embedding3D, LayerNorm3D, Linear3D, PatchEmbedding3D, VocabParallelClassifier3D, - VocabParallelEmbedding3D) - -__all__ = [ - 'reduce_by_batch_3d', 'split_tensor_3d', 'split_batch_3d', 'Linear3D', 'LayerNorm3D', 'PatchEmbedding3D', - 'Classifier3D', 'Embedding3D', 'VocabParallelEmbedding3D', 'VocabParallelClassifier3D' -] diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 781c0bbd8129b033216eb7ea06b212f399b1030b..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 4629dcc713411ffa5be8ea9cfe309e6c4d3a27d9..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/_operation.cpython-36.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/_operation.cpython-36.pyc deleted file mode 100644 index 5d7c25373426580c9aafd067e9ff65e917a767d9..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/_operation.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/_operation.cpython-37.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/_operation.cpython-37.pyc deleted file mode 100644 index 6a2849f09cb4344323a4ea04b7831b915a291a4d..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/_operation.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/_utils.cpython-36.pyc deleted file mode 100644 index a75e957acc78a894a371caf4de032ac883e81ea8..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/_utils.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/_utils.cpython-37.pyc deleted file mode 100644 index 2b5e762b1783455ff2d9bf746e800c7352cf8e35..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/_utils.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/__pycache__/layers.cpython-36.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/layers.cpython-36.pyc deleted file mode 100644 index a924f8c57a87d4c25872cfe2700d0677fc64227e..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/layers.cpython-36.pyc and /dev/null differ diff --git 
a/colossalai/nn/layer/parallel_3d/__pycache__/layers.cpython-37.pyc b/colossalai/nn/layer/parallel_3d/__pycache__/layers.cpython-37.pyc deleted file mode 100644 index 07fe9887689b939f76335f51518d7b227ce82f75..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_3d/__pycache__/layers.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_3d/_operation.py b/colossalai/nn/layer/parallel_3d/_operation.py deleted file mode 100644 index 26e30d8cfe3caa3cd260c9b8bb9652d7debbc5c1..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_3d/_operation.py +++ /dev/null @@ -1,481 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from typing import Optional, Tuple - -import torch -from colossalai.communication import (all_gather, all_reduce, broadcast, reduce, reduce_scatter) -from colossalai.context import parallel_mode -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from torch import Tensor -from torch.cuda.amp import custom_bwd, custom_fwd -from ._utils import get_parallel_mode_from_env -from colossalai.constants import INPUT_GROUP_3D, WEIGHT_GROUP_3D - -from colossalai.nn.layer.base_layer import ParallelLayer - - -class _Linear3D(torch.autograd.Function): - - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx, - input_: Tensor, - weight: Tensor, - bias: Optional[Tensor], - input_parallel_mode: ParallelMode, - weight_parallel_mode: ParallelMode, - output_parallel_mode: ParallelMode, - input_dim: int = 0, - weight_dim: int = -1, - output_dim: int = 0) -> Tensor: - ctx.use_bias = bias is not None - - input_ = all_gather(input_, input_dim, input_parallel_mode) - ctx.save_for_backward(input_, weight) - - output = torch.matmul(input_, weight) - output = reduce_scatter(output, output_dim, output_parallel_mode) - - if bias is not None: - output += bias - - ctx.input_parallel_mode = input_parallel_mode - ctx.weight_parallel_mode = weight_parallel_mode - ctx.output_parallel_mode = output_parallel_mode - ctx.input_dim = input_dim - ctx.weight_dim = weight_dim - ctx.output_dim = output_dim - return output - - @staticmethod - @custom_bwd - def backward(ctx, output_grad: Tensor) -> Tuple[Tensor, ...]: - input_, weight = ctx.saved_tensors - with torch.no_grad(): - output_grad = all_gather(output_grad, ctx.output_dim, ctx.output_parallel_mode) - - async_ops = list() - - input_grad = torch.matmul(output_grad, weight.transpose(0, 1)) - input_grad, op = reduce_scatter(input_grad, ctx.input_dim, ctx.input_parallel_mode, async_op=True) - async_ops.append(op) - - weight_grad = torch.matmul( - input_.reshape(-1, input_.shape[-1]).transpose(0, 1), output_grad.reshape(-1, output_grad.shape[-1])) - weight_grad, op = all_reduce(weight_grad, ctx.weight_parallel_mode, async_op=True) - async_ops.append(op) - - if ctx.use_bias: - bias_grad = torch.sum(output_grad, dim=tuple(range(len(output_grad.shape))[:-1])) - bias_grad, op = all_reduce(bias_grad, ctx.weight_parallel_mode, async_op=True) - async_ops.append(op) - else: - bias_grad = None - - for op in async_ops: - if op is not None: - op.wait() - - return input_grad, weight_grad, bias_grad, None, None, None, None, None, None - - -def linear_3d(input_: Tensor, - weight: Tensor, - bias: Optional[Tensor], - input_parallel_mode: ParallelMode, - weight_parallel_mode: ParallelMode, - output_parallel_mode: ParallelMode, - input_dim: int = 0, - weight_dim: int = -1, - output_dim: int = 0) -> Tensor: - """ - Linear 
layer for 3D parallelism - - :param input_: matrix of input - :type input_: torch.tensor - :param weight: matrix of weight - :type weight: torch.tensor - :param bias: matrix of bias - :type bias: torch.tensor, optional - :param input_parallel_mode: input parallel mode - :type input_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param weight_parallel_mode: weight parallel mode - :type weight_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param output_parallel_mode: output parallel mode - :type output_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param input_dim: dimension of input, defaults to 0 - :type input_dim: int, optional - :param weight_dim: dimension of weight, defaults to -1 - :type weight_dim: int, optional - :param output_dim: dimension of output, defaults to 0 - :type output_dim: int, optional - """ - return _Linear3D.apply(input_, weight, bias, input_parallel_mode, weight_parallel_mode, output_parallel_mode, - input_dim, weight_dim, output_dim) - - -class _Classifier3D(torch.autograd.Function): - - @staticmethod - @custom_fwd(cast_inputs=torch.float16) - def forward(ctx, input_: Tensor, weight: Tensor, bias: Optional[Tensor], input_parallel_mode: ParallelMode, - weight_parallel_mode: ParallelMode, output_parallel_mode: ParallelMode) -> Tensor: - ctx.use_bias = bias is not None - - ranks_in_group = gpc.get_ranks_in_group(input_parallel_mode) - src_rank = ranks_in_group[gpc.get_local_rank(output_parallel_mode)] - weight = broadcast(weight, src_rank, input_parallel_mode) - ctx.save_for_backward(input_, weight) - - output = torch.matmul(input_, weight.transpose(0, 1)) - output = all_reduce(output, output_parallel_mode) - - if bias is not None: - output += bias - - ctx.src_rank = src_rank - ctx.input_parallel_mode = input_parallel_mode - ctx.weight_parallel_mode = weight_parallel_mode - ctx.output_parallel_mode = output_parallel_mode - return output - - @staticmethod - @custom_bwd - def backward(ctx, output_grad: Tensor) -> Tuple[Tensor, ...]: - input_, weight = ctx.saved_tensors - with torch.no_grad(): - async_ops = list() - - weight_grad = torch.matmul( - output_grad.reshape(-1, output_grad.shape[-1]).transpose(0, 1), input_.reshape(-1, input_.shape[-1])) - weight_grad = reduce(weight_grad, ctx.src_rank, ctx.input_parallel_mode) - if gpc.get_local_rank(ctx.input_parallel_mode) == gpc.get_local_rank(ctx.output_parallel_mode): - weight_grad, op = all_reduce(weight_grad, ctx.weight_parallel_mode, async_op=True) - async_ops.append(op) - else: - weight_grad = None - - if ctx.use_bias: - bias_grad = torch.sum(output_grad, dim=tuple(range(len(output_grad.shape))[:-1])) - bias_grad = all_reduce(bias_grad, ctx.input_parallel_mode) - bias_grad, op = all_reduce(bias_grad, ctx.weight_parallel_mode, async_op=True) - async_ops.append(op) - else: - bias_grad = None - - input_grad = torch.matmul(output_grad, weight) - - for op in async_ops: - if op is not None: - op.wait() - - return input_grad, weight_grad, bias_grad, None, None, None, None, None, None - - -def classifier_3d(input_: Tensor, weight: Tensor, bias: Optional[Tensor], input_parallel_mode: ParallelMode, - weight_parallel_mode: ParallelMode, output_parallel_mode: ParallelMode) -> Tensor: - """ - 3D parallel classifier - - :param input_: matrix of input - :type input_: torch.tensor - :param weight: matrix of weight - :type weight: torch.tensor - :param bias: matrix of bias - :type bias: torch.tensor, optional - :param input_parallel_mode: input parallel mode - :type 
input_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param weight_parallel_mode: weight parallel mode - :type weight_parallel_mode: colossalai.context.parallel_mode.ParallelMode - :param output_parallel_mode: output parallel mode - :type output_parallel_mode: colossalai.context.parallel_mode.ParallelMode - """ - return _Classifier3D.apply(input_, weight, bias, input_parallel_mode, weight_parallel_mode, output_parallel_mode) - - -class _Layernorm3D(torch.autograd.Function): - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, input_: Tensor, weight: Tensor, bias: Tensor, normalized_shape: int, eps: float, - input_parallel_mode: ParallelMode, weight_parallel_mode: ParallelMode, - output_parallel_mode: ParallelMode) -> Tensor: - mean = all_reduce(torch.sum(input_, dim=-1, keepdim=True), output_parallel_mode) / normalized_shape - mu = input_ - mean - var = all_reduce(torch.sum(mu**2, dim=-1, keepdim=True), output_parallel_mode) / normalized_shape - sigma = torch.sqrt(var + eps) - - ctx.save_for_backward(mu, sigma, weight) - - z = mu / sigma - output = weight * z + bias - - ctx.normalized_shape = normalized_shape - ctx.input_parallel_mode = input_parallel_mode - ctx.weight_parallel_mode = weight_parallel_mode - ctx.output_parallel_mode = output_parallel_mode - - return output - - @staticmethod - @custom_bwd - def backward(ctx, output_grad: Tensor) -> Tuple[Tensor, ...]: - mu, sigma, weight = ctx.saved_tensors - with torch.no_grad(): - bias_grad, weight_grad = output_grad, output_grad * mu / sigma - grads = torch.stack([bias_grad, weight_grad]).contiguous() - grads = torch.sum(grads, dim=tuple(range(len(grads.shape))[1:-1])) - grads = all_reduce(grads, ctx.weight_parallel_mode) - grads = all_reduce(grads, ctx.input_parallel_mode) - bias_grad, weight_grad = grads[0], grads[1] - - dz = output_grad * weight - dvar = dz * mu * (-0.5) * sigma**(-3) - dvar = all_reduce(torch.sum(dvar, dim=-1, keepdim=True), ctx.output_parallel_mode) - dmean = dz * (-1 / sigma) + dvar * -2 * mu / ctx.normalized_shape - dmean = all_reduce(torch.sum(dmean, dim=-1, keepdim=True), ctx.output_parallel_mode) - - input_grad = dz / sigma + dvar * 2 * mu / \ - ctx.normalized_shape + dmean / ctx.normalized_shape - - return input_grad, weight_grad, bias_grad, None, None, None, None, None - - -def layernorm_3d(input_: Tensor, weight: Tensor, bias: Tensor, normalized_shape: int, eps: float, - input_parallel_mode: ParallelMode, weight_parallel_mode: ParallelMode, - output_parallel_mode: ParallelMode) -> Tensor: - """ - 3D parallel Layernorm - - :param input_: input maxtrix - :type input_: torch.tensor - :param weight: matrix of weight - :type weight: torch.tensor - :param bias: matrix of bias - :type bias: torch.tensor - :param normalized_shape: input shape from an expected input - of size. :math:`[* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]` - If a single integer is used, it is treated as a singleton list, and this module will - normalize over the last dimension which is expected to be of that specific size. 
-    :type normalized_shape: int
-    :param eps: a value added to the denominator for numerical stability
-    :type eps: float
-    :param input_parallel_mode: input parallel mode
-    :type input_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param weight_parallel_mode: weight parallel mode
-    :type weight_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param output_parallel_mode: output parallel mode
-    :type output_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    """
-    return _Layernorm3D.apply(input_, weight, bias, normalized_shape, eps, input_parallel_mode, weight_parallel_mode,
-                              output_parallel_mode)
-
-
-def split_tensor_3d(tensor: Tensor, dim: int, parallel_mode: ParallelMode) -> Tensor:
-    """Splits a 3D parallel tensor in the specified dimension
-
-    :param tensor: Input tensor
-    :param dim: Specified dimension in which to split
-    :param parallel_mode: Parallel mode
-
-    :type tensor: torch.Tensor
-    :type dim: int
-    :type parallel_mode: colossalai.context.parallel_mode.ParallelMode
-
-    :return output: Split tensor
-    :rtype output: torch.Tensor
-    """
-    if tensor.size(dim) <= 1:
-        return tensor
-    output = torch.chunk(tensor, gpc.get_world_size(parallel_mode),
-                         dim=dim)[gpc.get_local_rank(parallel_mode)].contiguous()
-    return output
-
-
-def split_batch_3d(input_: Tensor,
-                   dim: int = 0,
-                   input_parallel_mode: ParallelMode = ParallelMode.PARALLEL_3D_INPUT,
-                   weight_parallel_mode: ParallelMode = ParallelMode.PARALLEL_3D_WEIGHT) -> Tensor:
-    """Splits a 3D tensor along the batch dimension
-    :param input_: Input tensor
-    :param dim: Specified dimension in which to split
-    :param input_parallel_mode: Input parallel mode
-    :param weight_parallel_mode: Weight parallel mode
-    :type input_: torch.Tensor
-    :type dim: int, optional
-    :type input_parallel_mode: colossalai.context.parallel_mode.ParallelMode, optional
-    :type weight_parallel_mode: colossalai.context.parallel_mode.ParallelMode, optional
-    :return output: Split tensor
-    :rtype output: torch.Tensor
-    """
-    if input_.size(dim) <= 1:
-        return input_
-    # the groups are re-read from the environment rather than the defaults,
-    # since swap_in_out_group() may have exchanged the input and output groups
-    weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D)
-    input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D)
-    output = torch.chunk(input_, gpc.get_world_size(weight_parallel_mode),
-                         dim=dim)[gpc.get_local_rank(weight_parallel_mode)].contiguous()
-    output = torch.chunk(output, gpc.get_world_size(input_parallel_mode),
-                         dim=dim)[gpc.get_local_rank(input_parallel_mode)].contiguous()
-    return output
-
-
-class _ReduceTensor3D(torch.autograd.Function):
-
-    @staticmethod
-    def forward(ctx, input_, parallel_mode):
-        return all_reduce(input_, parallel_mode)
-
-    @staticmethod
-    def backward(ctx, output_grad):
-        return output_grad, None
-
-
-def reduce_tensor_3d(tensor: Tensor, parallel_mode: ParallelMode) -> Tensor:
-    """
-    All-reduce the input.
-
-    :param tensor: Input tensor
-    :param parallel_mode: Parallel mode
-    """
-    return _ReduceTensor3D.apply(tensor, parallel_mode)
-
-
-class _ReduceGrad3D(torch.autograd.Function):
-
-    @staticmethod
-    def forward(ctx, input_, parallel_mode):
-        ctx.parallel_mode = parallel_mode
-        return input_
-
-    @staticmethod
-    def backward(ctx, output_grad):
-        input_grad = all_reduce(output_grad, ctx.parallel_mode)
-        return input_grad, None
-
-
-def reduce_grad_3d(tensor: Tensor, parallel_mode: ParallelMode) -> Tensor:
-    """
-    All-reduce the gradient in backward pass.
- - :param tensor: Input tensor - :param parallel_mode: Parallel mode - """ - return _ReduceGrad3D.apply(tensor, parallel_mode) - - -class _ReduceScatterTensor3D(torch.autograd.Function): - - @staticmethod - def forward(ctx, input_, dim, parallel_mode): - ctx.dim = dim - ctx.parallel_mode = parallel_mode - return reduce_scatter(input_, dim, parallel_mode) - - @staticmethod - def backward(ctx, output_grad): - input_grad = all_gather(output_grad, ctx.dim, ctx.parallel_mode) - return input_grad, None, None - - -def reduce_scatter_tensor_3d(tensor: Tensor, dim: int, parallel_mode: ParallelMode) -> Tensor: - """ - Reduce-scatter the input. - - :param tensor: Input tensor - :param dim: Dimension to scatter - :param parallel_mode: Parallel mode - """ - return _ReduceScatterTensor3D.apply(tensor, dim, parallel_mode) - - -class _ReduceByBatch3D(torch.autograd.Function): - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, - input_: Tensor, - input_parallel_mode: ParallelMode, - weight_parallel_mode: ParallelMode, - reduce_mean: bool = False) -> Tensor: - output = all_reduce(input_, input_parallel_mode) - output = all_reduce(output, weight_parallel_mode) - ctx.reduce_mean = reduce_mean - if reduce_mean: - reduce_size = gpc.get_world_size(input_parallel_mode) * gpc.get_world_size(weight_parallel_mode) - ctx.reduce_size = reduce_size - return output.clone() / reduce_size - return output.clone() - - @staticmethod - @custom_bwd - def backward(ctx, output_grad: Tensor) -> Tuple[Tensor, ...]: - if ctx.reduce_mean: - return output_grad / ctx.reduce_size, None, None, None - else: - return output_grad, None, None, None - - -def reduce_by_batch_3d(tensor: Tensor, - input_parallel_mode: ParallelMode, - weight_parallel_mode: ParallelMode, - reduce_mean: bool = False) -> Tensor: - """ - All-reduce the input from the model parallel region. 
-
-    :param tensor: input matrix
-    :type tensor: torch.tensor
-    :param input_parallel_mode: input parallel mode
-    :type input_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param weight_parallel_mode: weight parallel mode
-    :type weight_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param reduce_mean: If set to ``True``, it will divide the output by (input parallel size * weight parallel size), defaults to False
-    :type reduce_mean: bool, optional
-    """
-    return _ReduceByBatch3D.apply(tensor, input_parallel_mode, weight_parallel_mode, reduce_mean)
-
-
-class _BroadcastWeight3D_FromDiagonal(torch.autograd.Function):
-    """
-    broadcast weight from diagonal
-
-    :param input_: input matrix
-    :type input_: torch.tensor
-    :param input_parallel_mode: input parallel mode
-    :type input_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param weight_parallel_mode: weight parallel mode
-    :type weight_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    :param output_parallel_mode: output parallel mode
-    :type output_parallel_mode: colossalai.context.parallel_mode.ParallelMode
-    """
-
-    @staticmethod
-    @custom_fwd(cast_inputs=torch.float16)
-    def forward(ctx, input_: Tensor, input_parallel_mode: ParallelMode, weight_parallel_mode: ParallelMode,
-                output_parallel_mode: ParallelMode) -> Tensor:
-        ranks_in_group = gpc.get_ranks_in_group(input_parallel_mode)
-        src_rank = ranks_in_group[gpc.get_local_rank(output_parallel_mode)]
-        output = broadcast(input_, src_rank, input_parallel_mode)
-        ctx.src_rank = src_rank
-        ctx.input_parallel_mode = input_parallel_mode
-        ctx.weight_parallel_mode = weight_parallel_mode
-        ctx.output_parallel_mode = output_parallel_mode
-        return output
-
-    @staticmethod
-    @custom_bwd
-    def backward(ctx, output_grad: Tensor) -> Tuple[Tensor, ...]:
-        input_grad = reduce(output_grad, ctx.src_rank, ctx.input_parallel_mode)
-        if gpc.get_local_rank(ctx.input_parallel_mode) == gpc.get_local_rank(ctx.output_parallel_mode):
-            input_grad = all_reduce(input_grad, ctx.weight_parallel_mode)
-        else:
-            input_grad = None
-        return input_grad, None, None, None
-
-
-def broadcast_weight_3d_from_diagonal(tensor: Tensor, input_parallel_mode: ParallelMode,
-                                      weight_parallel_mode: ParallelMode, output_parallel_mode: ParallelMode) -> Tensor:
-    return _BroadcastWeight3D_FromDiagonal.apply(tensor, input_parallel_mode, weight_parallel_mode,
-                                                 output_parallel_mode)
diff --git a/colossalai/nn/layer/parallel_3d/_utils.py b/colossalai/nn/layer/parallel_3d/_utils.py
deleted file mode 100644
index 0622164cdf1c77a8fcff42ff2c01c34f3dfc4c07..0000000000000000000000000000000000000000
--- a/colossalai/nn/layer/parallel_3d/_utils.py
+++ /dev/null
@@ -1,51 +0,0 @@
-from colossalai.constants import INPUT_GROUP_3D, WEIGHT_GROUP_3D, OUTPUT_GROUP_3D
-from colossalai.context.parallel_mode import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.global_variables import tensor_parallel_env as env
-from torch import Tensor
-
-
-def get_depth_from_env() -> int:
-    try:
-        depth = env.depth_3d
-        assert depth > 0, 'DEPTH must be greater than zero'
-        return depth
-
-    except AttributeError:
-        raise EnvironmentError('DEPTH is not found in the current environment, '
-                               'please make sure that you have used the correct process group initializer')
-
-
-def get_parallel_mode_from_env(group):
-    assert group in [INPUT_GROUP_3D, WEIGHT_GROUP_3D, OUTPUT_GROUP_3D], \
-        f'{group} is not valid for 3D tensor parallelism.'
- return getattr(env, group) - - -def get_last_group(a, b): - mapping = { - ParallelMode.PARALLEL_3D_INPUT: 'A', - ParallelMode.PARALLEL_3D_WEIGHT: 'B', - ParallelMode.PARALLEL_3D_OUTPUT: 'C', - } - - res = chr(ord('A') + ord('B') + ord('C') - ord(mapping[a]) - ord(mapping[b])) - - if res == 'A': - return ParallelMode.PARALLEL_3D_INPUT - elif res == 'B': - return ParallelMode.PARALLEL_3D_WEIGHT - elif res == 'C': - return ParallelMode.PARALLEL_3D_OUTPUT - - -def swap_in_out_group(): - env.input_group_3d, env.output_group_3d = env.output_group_3d, env.input_group_3d - - -def dbg_check_shape(tensor: Tensor, shape: tuple): - rank = gpc.get_global_rank() - if rank == 0: - print(tensor.shape) - assert tensor.shape == shape, \ - '{} does not match {}'.format(tensor.shape, shape) diff --git a/colossalai/nn/layer/parallel_3d/layers.py b/colossalai/nn/layer/parallel_3d/layers.py deleted file mode 100644 index da8a50995c483706256f1e10e8517b69533f966a..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_3d/layers.py +++ /dev/null @@ -1,571 +0,0 @@ -import math -from typing import Callable - -import torch -import torch.nn as nn -import torch.nn.functional as F -from colossalai.communication import all_reduce, broadcast -from colossalai.constants import INPUT_GROUP_3D, WEIGHT_GROUP_3D -from colossalai.context import ParallelMode, seed -from colossalai.core import global_context as gpc -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.nn import init as init -from colossalai.nn.layer.base_layer import ParallelLayer -from colossalai.registry import LAYERS -from colossalai.utils.cuda import get_current_device -from torch import Tensor -from torch.nn import Parameter - -from ..utils import divide, set_tensor_parallel_attribute_by_partition, to_2tuple -from ._operation import * -from ._utils import get_depth_from_env, get_last_group, get_parallel_mode_from_env, swap_in_out_group - - -@LAYERS.register_module -class LayerNorm3D(ParallelLayer): - r""" - Layer Normalization for 3D parallelism - - :param normalized_shape: input shape from an expected input - of size. :math:`[* \times \text{normalized_shape}[0] \times \text{normalized_shape}[1] \times \ldots \times \text{normalized_shape}[-1]]` - If a single integer is used, it is treated as a singleton list, and this module will - normalize over the last dimension which is expected to be of that specific size. 
- :type normalized_shape: int - :param eps: a value added to the denominator for numerical stability, defaults to 1e-12 - :type eps: float, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - """ - - def __init__(self, normalized_shape: int, eps: float = 1e-12, dtype=None): - super().__init__() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode) - self.depth = get_depth_from_env() - self.normalized_shape = normalized_shape - self.normalized_shape_per_partition = divide(normalized_shape, self.depth) - - self.weight = Parameter( - torch.ones(self.normalized_shape_per_partition, device=get_current_device(), dtype=dtype)) - self.bias = Parameter(torch.zeros(self.normalized_shape_per_partition, device=get_current_device(), - dtype=dtype)) - self.variance_epsilon = eps - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self) -> None: - set_tensor_parallel_attribute_by_partition(self.weight, self.depth) - set_tensor_parallel_attribute_by_partition(self.bias, self.depth) - - def reset_parameters(self) -> None: - init.zeros_()(self.bias) - init.ones_()(self.weight) - - def forward(self, input_: Tensor) -> Tensor: - return layernorm_3d(input_, self.weight, self.bias, self.normalized_shape, self.variance_epsilon, - self.input_parallel_mode, self.weight_parallel_mode, self.output_parallel_mode) - - -@LAYERS.register_module -class Linear3D(ParallelLayer): - """ - Linear layer for 3D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param out_features: size of each output sample - :type out_features: int - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - - def __init__(self, - in_features: int, - out_features: int, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - self.in_features = in_features - self.out_features = out_features - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode) - self.depth = get_depth_from_env() - self.in_features_per_partition = divide(in_features, self.depth) - self.out_features_per_partition = divide(out_features, self.depth) - - self.weight = Parameter( - torch.empty(self.in_features_per_partition, - self.out_features_per_partition, - device=get_current_device(), - dtype=dtype)) - if bias: - self.bias = Parameter(torch.zeros(self.out_features_per_partition, device=get_current_device(), - dtype=dtype)) - else: - self.bias = None - - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - swap_in_out_group() - - def 
_set_tensor_parallel_attributes(self) -> None: - set_tensor_parallel_attribute_by_partition(self.weight, self.depth**2) - if self.bias is not None: - set_tensor_parallel_attribute_by_partition(self.bias, self.depth) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.in_features, self.out_features - weight_src_rank = gpc.get_ranks_in_group(self.weight_parallel_mode)[0] - output_src_rank = gpc.get_ranks_in_group(self.output_parallel_mode)[0] - - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - broadcast(self.weight, weight_src_rank, self.weight_parallel_mode) - - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - broadcast(self.bias, weight_src_rank, self.weight_parallel_mode) - broadcast(self.bias, output_src_rank, self.output_parallel_mode) - - def forward(self, input_: Tensor) -> Tensor: - return linear_3d(input_, self.weight, self.bias, self.input_parallel_mode, self.weight_parallel_mode, - self.output_parallel_mode) - - -@LAYERS.register_module -class Classifier3D(ParallelLayer): - """ - Classifier for 3D parallelism - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of classes - :type num_classes: int - :param weight: weight of the classifier, defaults to True - :type weight: torch.nn.Parameter, optional - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - - def __init__(self, - in_features: int, - num_classes: int, - weight: Parameter = None, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - self.in_features = in_features - self.num_classes = num_classes - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode) - self.depth = get_depth_from_env() - self.in_features_per_partition = divide(in_features, self.depth) - - if weight is not None: - self.weight = weight - self.has_weight = False - else: - self.weight = Parameter( - torch.empty(self.num_classes, self.in_features_per_partition, device=get_current_device(), dtype=dtype)) - self.has_weight = True - if bias: - self.bias = Parameter(torch.zeros(self.num_classes, device=get_current_device(), dtype=dtype)) - else: - self.bias = None - - self.reset_parameters(weight_initializer, bias_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self) -> None: - if self.has_weight: - set_tensor_parallel_attribute_by_partition(self.weight, self.depth) - - def reset_parameters(self, weight_initializer, bias_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.in_features, self.num_classes - weight_src_rank = gpc.get_ranks_in_group(self.weight_parallel_mode)[0] - output_src_rank = 
gpc.get_ranks_in_group(self.output_parallel_mode)[0]
-            input_src_rank = gpc.get_ranks_in_group(self.input_parallel_mode)[0]
-
-            if self.has_weight:
-                weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out)
-                broadcast(self.weight, weight_src_rank, self.weight_parallel_mode)
-
-            if self.bias is not None:
-                bias_initializer(self.bias, fan_in=fan_in)
-                broadcast(self.bias, weight_src_rank, self.weight_parallel_mode)
-                broadcast(self.bias, output_src_rank, self.output_parallel_mode)
-                broadcast(self.bias, input_src_rank, self.input_parallel_mode)
-
-    def forward(self, input_: Tensor) -> Tensor:
-        return classifier_3d(input_, self.weight, self.bias, self.input_parallel_mode, self.weight_parallel_mode,
-                             self.output_parallel_mode)
-
-
-@LAYERS.register_module
-class VocabParallelClassifier3D(ParallelLayer):
-    """
-    Vocab parallel classifier layer for 3D parallelism
-
-    :param in_features: size of each input sample
-    :type in_features: int
-    :param num_classes: number of classes
-    :type num_classes: int
-    :param weight: weight of the classifier, defaults to None
-    :type weight: torch.nn.Parameter, optional
-    :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to ``True``
-    :type bias: bool, optional
-    :param dtype: The dtype of parameters, defaults to None
-    :type dtype: torch.dtype, optional
-    :param weight_initializer: The initializer of weight, defaults to kaiming uniform initializer
-    :type weight_initializer: typing.Callable, optional
-    :param bias_initializer: The initializer of bias, defaults to xavier uniform initializer
-    :type bias_initializer: typing.Callable, optional
-    """
-
-    def __init__(self,
-                 in_features: int,
-                 num_classes: int,
-                 weight: Parameter = None,
-                 bias: bool = True,
-                 dtype: torch.dtype = None,
-                 weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)),
-                 bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)):
-        super().__init__()
-        self.in_features = in_features
-        self.num_classes = num_classes
-        self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D)
-        self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D)
-        self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode)
-        self.depth = get_depth_from_env()
-        self.in_features_per_partition = divide(in_features, self.depth)
-        self.out_features_per_partition = divide(num_classes, self.depth)
-
-        if weight is not None:
-            self.weight = weight
-            self.has_weight = False
-        else:
-            self.weight = Parameter(
-                torch.empty(self.out_features_per_partition,
-                            self.in_features_per_partition,
-                            device=get_current_device(),
-                            dtype=dtype))
-            self.has_weight = True
-        if bias:
-            self.bias = Parameter(torch.zeros(self.out_features_per_partition, device=get_current_device(),
-                                              dtype=dtype))
-        else:
-            self.bias = None
-
-        self.reset_parameters(weight_initializer, bias_initializer)
-        self._set_tensor_parallel_attributes()
-        swap_in_out_group()
-        env.vocab_parallel = True
-
-    def _set_tensor_parallel_attributes(self) -> None:
-        if self.has_weight:
-            set_tensor_parallel_attribute_by_partition(self.weight, self.depth**2)
-        if self.bias is not None:
-            set_tensor_parallel_attribute_by_partition(self.bias, self.depth)
-
-    def reset_parameters(self, weight_initializer, bias_initializer) -> None:
-        with seed(ParallelMode.TENSOR):
-            fan_in, fan_out = self.in_features, self.num_classes
-            weight_src_rank = gpc.get_ranks_in_group(self.weight_parallel_mode)[0]
-            output_src_rank =
gpc.get_ranks_in_group(self.output_parallel_mode)[0] - - if self.has_weight: - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - broadcast(self.weight, weight_src_rank, self.weight_parallel_mode) - - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - broadcast(self.bias, weight_src_rank, self.weight_parallel_mode) - broadcast(self.bias, output_src_rank, self.output_parallel_mode) - - def forward(self, input_: Tensor) -> Tensor: - return linear_3d(input_, self.weight.transpose(0, 1), self.bias, self.input_parallel_mode, - self.weight_parallel_mode, self.output_parallel_mode) - - -@LAYERS.register_module -class PatchEmbedding3D(ParallelLayer): - """ - 2D Image to Patch Embedding - - :param img_size: image size - :type img_size: int - :param patch_size: patch size - :type patch_size: int - :param in_chans: number of channels of input image - :type in_chans: int - :param embed_size: size of embedding - :type embed_size: int - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param flatten: whether to flatten output tensor, defaults to True - :type flatten: bool, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - :param position_embed_initializer: The intializer of position embedding, defaults to zero - :type position_embed_initializer: typing.Callable, optional - """ - - def __init__(self, - img_size: int, - patch_size: int, - in_chans: int, - embed_size: int, - flatten: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - position_embed_initializer: Callable = init.zeros_()): - super().__init__() - self.depth = get_depth_from_env() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode) - self.patch_size = to_2tuple(patch_size) - grid_size = to_2tuple(img_size // patch_size) - num_patches = grid_size[0] * grid_size[1] - self.embed_size = embed_size - embed_size_per_partition = divide(embed_size, self.depth) - self.flatten = flatten - - self.weight = nn.Parameter( - torch.empty((embed_size_per_partition, in_chans, *self.patch_size), - device=get_current_device(), - dtype=dtype)) - self.bias = nn.Parameter(torch.empty(embed_size_per_partition, device=get_current_device(), dtype=dtype)) - - self.cls_token = nn.Parameter( - torch.zeros((1, 1, embed_size_per_partition), device=get_current_device(), dtype=dtype)) - self.pos_embed = nn.Parameter( - torch.zeros((1, num_patches + 1, embed_size_per_partition), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer, bias_initializer, position_embed_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self) -> None: - set_tensor_parallel_attribute_by_partition(self.weight, self.depth) - set_tensor_parallel_attribute_by_partition(self.bias, self.depth) - set_tensor_parallel_attribute_by_partition(self.cls_token, self.depth) - set_tensor_parallel_attribute_by_partition(self.pos_embed, self.depth) - - def _sync_grad_hook(self, grad) -> Tensor: 
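-        # The patch-embedding parameters are replicated across the input and
-        # weight groups, so each replica sees only a partial gradient;
-        # all-reducing over both groups restores the full gradient.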
- grad = all_reduce(grad.clone(), self.input_parallel_mode) - grad = all_reduce(grad, self.weight_parallel_mode) - return grad - - def reset_parameters(self, weight_initializer, bias_initializer, position_embed_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight) - fan_out = self.embed_size - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - bias_initializer(self.bias, fan_in=fan_in) - position_embed_initializer(self.pos_embed) - - weight_src_rank = gpc.get_ranks_in_group(self.weight_parallel_mode)[0] - input_src_rank = gpc.get_ranks_in_group(self.input_parallel_mode)[0] - broadcast(self.weight, weight_src_rank, self.weight_parallel_mode) - broadcast(self.bias, weight_src_rank, self.weight_parallel_mode) - broadcast(self.pos_embed, weight_src_rank, self.weight_parallel_mode) - broadcast(self.weight, input_src_rank, self.input_parallel_mode) - broadcast(self.bias, input_src_rank, self.input_parallel_mode) - broadcast(self.pos_embed, input_src_rank, self.input_parallel_mode) - - self.weight.register_hook(self._sync_grad_hook) - self.bias.register_hook(self._sync_grad_hook) - self.cls_token.register_hook(self._sync_grad_hook) - self.pos_embed.register_hook(self._sync_grad_hook) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode) - input_ = split_tensor_3d(input_, 0, self.input_parallel_mode) - output = F.conv2d(input_, self.weight, self.bias, stride=self.patch_size) - if self.flatten: - output = output.flatten(2).transpose(1, 2) # BCHW -> BNC - - cls_token = self.cls_token.expand(output.shape[0], -1, -1) - output = torch.cat((cls_token, output), dim=1) - output = output + self.pos_embed - - return output - - -@LAYERS.register_module -class Embedding3D(ParallelLayer): - """ - Embedding for 3D parallelism - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - self.depth = get_depth_from_env() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode) - - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - embed_dim_per_partition = divide(embedding_dim, self.depth) - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - self.weight = nn.Parameter( - torch.empty((num_embeddings, embed_dim_per_partition), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - - def _set_tensor_parallel_attributes(self) -> None: - set_tensor_parallel_attribute_by_partition(self.weight, self.depth) - - def reset_parameters(self, 
weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - weight_src_rank = gpc.get_ranks_in_group(self.weight_parallel_mode)[0] - broadcast(self.weight, weight_src_rank, self.weight_parallel_mode) - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode) - input_ = split_tensor_3d(input_, 0, self.input_parallel_mode) - weight = broadcast_weight_3d_from_diagonal(self.weight, self.input_parallel_mode, self.weight_parallel_mode, - self.output_parallel_mode) - output = F.embedding(input_, weight, self.padding_idx, *self.embed_args, **self.embed_kwargs) - - return output - - -@LAYERS.register_module -class VocabParallelEmbedding3D(torch.nn.Module): - """Embedding parallelized in the vocabulary dimension. - - :param num_embeddings: number of embeddings - :type num_embeddings: int - :param embedding_dim: dimension of embedding - :type embedding_dim: int - :param padding_idx: index of padding, defaults to None - :type padding_idx: int, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to normal initializer - :type weight_initializer: typing.Callable, optional - :param args: Args used in F.embedding - :param kwargs: Kwargs used in F.embedding - """ - - def __init__(self, - num_embeddings: int, - embedding_dim: int, - padding_idx: int = None, - dtype: torch.dtype = None, - weight_initializer: Callable = init.normal_(), - *args, - **kwargs): - super().__init__() - self.num_embeddings = num_embeddings - self.embed_dim = embedding_dim - self.padding_idx = padding_idx - self.embed_args = args - self.embed_kwargs = kwargs - - self.depth = get_depth_from_env() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_last_group(self.input_parallel_mode, self.weight_parallel_mode) - self.num_embeddings_per_partition = divide(self.num_embeddings, self.depth) - self.embed_dim_per_partition = divide(self.embed_dim, self.depth) - vocab_parallel_rank = gpc.get_local_rank(self.input_parallel_mode) - self.vocab_start_index = vocab_parallel_rank * self.num_embeddings_per_partition - self.vocab_end_index = self.vocab_start_index + self.num_embeddings_per_partition - - self.weight = Parameter( - torch.empty((self.num_embeddings_per_partition, self.embed_dim_per_partition), - device=get_current_device(), - dtype=dtype)) - - self.reset_parameters(weight_initializer) - self._set_tensor_parallel_attributes() - env.vocab_parallel = True - - def _set_tensor_parallel_attributes(self): - set_tensor_parallel_attribute_by_partition(self.weight, self.depth**2) - - def reset_parameters(self, weight_initializer) -> None: - with seed(ParallelMode.TENSOR): - fan_in, fan_out = self.num_embeddings, self.embed_dim - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - self._fill_padding_idx_with_zero() - weight_src_rank = gpc.get_ranks_in_group(self.weight_parallel_mode)[0] - broadcast(self.weight, weight_src_rank, self.weight_parallel_mode) - - def _fill_padding_idx_with_zero(self) -> None: - if self.padding_idx is 
not None: - with torch.no_grad(): - self.weight[self.padding_idx].fill_(0) - - def forward(self, input_: Tensor) -> Tensor: - input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode) - - input_mask = (input_ < self.vocab_start_index) | (input_ >= self.vocab_end_index) - masked_input = input_.clone() - self.vocab_start_index - masked_input[input_mask] = 0 - - weight = reduce_grad_3d(self.weight, self.weight_parallel_mode) - - output_parallel = F.embedding(masked_input, weight, self.padding_idx, *self.embed_args, **self.embed_kwargs) - - output_parallel[input_mask, :] = 0. - output = reduce_scatter_tensor_3d(output_parallel, 0, self.input_parallel_mode) - - return output diff --git a/colossalai/nn/layer/parallel_sequence/__init__.py b/colossalai/nn/layer/parallel_sequence/__init__.py deleted file mode 100644 index 4fa9eed6f34b8ccdcf03935337bc96ba705530d0..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_sequence/__init__.py +++ /dev/null @@ -1,4 +0,0 @@ -from ._operation import RingQK, RingAV -from .layers import TransformerSelfAttentionRing - -__all__ = ['TransformerSelfAttentionRing', 'RingAV', 'RingQK'] diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 19ba720697835fb54013c981871443565ffa4364..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 0d060ef0fb75d8e7578e8d42e256379a434f8e1a..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/_operation.cpython-36.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/_operation.cpython-36.pyc deleted file mode 100644 index b639753328a7042e1bdda3f0e6a52f5f9528b945..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/_operation.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/_operation.cpython-37.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/_operation.cpython-37.pyc deleted file mode 100644 index 324bc4074573250a366b8241099188acbd37dd85..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/_operation.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/_utils.cpython-36.pyc deleted file mode 100644 index e684203dc1b73de54169529109495d88052fda60..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/_utils.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/_utils.cpython-37.pyc deleted file mode 100644 index b335829051c4a2f07af0e6819798ad903cfa333e..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/_utils.cpython-37.pyc and /dev/null differ diff --git 
a/colossalai/nn/layer/parallel_sequence/__pycache__/layers.cpython-36.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/layers.cpython-36.pyc deleted file mode 100644 index 7f65f0d6b5ad5a06dde1415ff3ec0d352f0069d6..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/layers.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/__pycache__/layers.cpython-37.pyc b/colossalai/nn/layer/parallel_sequence/__pycache__/layers.cpython-37.pyc deleted file mode 100644 index c377376a6b4bc779dcc5a63ffff5b9e331275035..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/parallel_sequence/__pycache__/layers.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/parallel_sequence/_operation.py b/colossalai/nn/layer/parallel_sequence/_operation.py deleted file mode 100644 index 119302a0976da3a9597f21fc64f9248adb0406f8..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_sequence/_operation.py +++ /dev/null @@ -1,175 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -from torch import distributed as dist - -from colossalai.communication import ring_forward -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_sequence._utils import _calc_incoming_device_range, _calc_current_device_range -from colossalai.utils import get_current_device -from torch.cuda.amp import custom_bwd, custom_fwd - - -class RingQK(torch.autograd.Function): - """ - Calculate QK in a ring-exchange style - """ - - @staticmethod - @custom_fwd - def forward(ctx, - sub_q, - sub_k, - batch_size, - num_attention_heads, - sub_seq_length): - # save tensor for backward - ctx.save_for_backward(sub_q, sub_k) - ctx.sub_seq_length = sub_seq_length - - # create local segment of attention score - attention_score = torch.empty( - batch_size * num_attention_heads, - sub_seq_length, - sub_seq_length * gpc.get_world_size(ParallelMode.SEQUENCE), - dtype=sub_q.dtype, - device=get_current_device() - ) - - # compute local QK^T - part_a = torch.matmul(sub_q, sub_k.transpose(2, 1)) - local_rank = gpc.get_local_rank(ParallelMode.SEQUENCE) - local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE) - start_idx = local_rank * sub_seq_length - end_idx = (local_rank + 1) * sub_seq_length - attention_score[:, :, start_idx: end_idx] = part_a - - # compute QK^T in ring-all-reduce style - for i in range(local_world_size - 1): - sub_k = ring_forward(sub_k, ParallelMode.SEQUENCE) - start_idx, end_idx = _calc_incoming_device_range(i, local_rank, local_world_size, sub_seq_length) - part_a = torch.matmul(sub_q, sub_k.transpose(2, 1)) - attention_score[:, :, start_idx:end_idx] = part_a - - return attention_score - - @staticmethod - @custom_bwd - def backward(ctx, grad_output): - sub_q, sub_k, = ctx.saved_tensors - local_rank = gpc.get_local_rank(ParallelMode.SEQUENCE) - local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE) - - # calculate gradient of sub_k - grad_k = torch.matmul( - grad_output.transpose(2, 1), - sub_q - ) - - dist.all_reduce(grad_k, group=gpc.get_group(ParallelMode.SEQUENCE)) - grad_k = grad_k[:, local_rank * ctx.sub_seq_length: (local_rank + 1) * ctx.sub_seq_length] - grad_k /= local_world_size - - # calculate gradient for sub_q - grad_q = torch.zeros_like(sub_q, - dtype=sub_q.dtype, - device=get_current_device(), ) - - # compute with local sub_k - start_idx, end_idx = 
_calc_current_device_range(local_rank, ctx.sub_seq_length)
-        grad_q += torch.matmul(grad_output[:, :, start_idx:end_idx], sub_k)
-
-        # accumulate grad_q in ring-all-reduce style
-        for i in range(local_world_size - 1):
-            sub_k = ring_forward(sub_k, ParallelMode.SEQUENCE)
-            start_idx, end_idx = _calc_incoming_device_range(i, local_rank, local_world_size, ctx.sub_seq_length)
-            grad_q += torch.matmul(grad_output[:, :, start_idx: end_idx], sub_k)
-
-        grad_q /= local_world_size
-
-        return grad_q, grad_k, None, None, None
-
-
-class RingAV(torch.autograd.Function):
-    """
-    Calculate AV in a ring-exchange style
-    """
-
-    @staticmethod
-    @custom_fwd
-    def forward(ctx,
-                attention_score,
-                sub_v,
-                batch_size,
-                num_attention_heads,
-                attention_head_size,
-                sub_seq_length):
-        local_rank = gpc.get_local_rank(ParallelMode.SEQUENCE)
-        local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE)
-        local_start_idx, local_end_idx = _calc_current_device_range(local_rank, sub_seq_length)
-
-        sub_attention_result = torch.zeros(
-            batch_size * num_attention_heads,
-            sub_seq_length,
-            attention_head_size,
-            device=get_current_device(),
-            dtype=attention_score.dtype)
-
-        # save tensors for backward
-        ctx.save_for_backward(attention_score, sub_v)
-        ctx.sub_seq_length = sub_seq_length
-
-        # compute local AV
-        part_av = torch.matmul(attention_score[:, :, local_start_idx:local_end_idx], sub_v)
-        sub_attention_result += part_av
-
-        # compute AV in ring-all-reduce style
-        for i in range(local_world_size - 1):
-            sub_v = ring_forward(sub_v, ParallelMode.SEQUENCE)
-            start_idx, end_idx = _calc_incoming_device_range(i, local_rank, local_world_size, sub_seq_length)
-
-            # compute AV for the incoming value segment
-            part_av = torch.matmul(attention_score[:, :, start_idx:end_idx], sub_v)
-            sub_attention_result += part_av
-        return sub_attention_result
-
-    @staticmethod
-    @custom_bwd
-    def backward(ctx, grad_output):
-        local_rank = gpc.get_local_rank(ParallelMode.SEQUENCE)
-        local_world_size = gpc.get_world_size(ParallelMode.SEQUENCE)
-        local_start_idx, local_end_idx = _calc_current_device_range(local_rank, ctx.sub_seq_length)
-        attention_scores, sub_v = ctx.saved_tensors
-
-        # calculate gradient of v
-        grad_v = torch.matmul(
-            attention_scores.transpose(2, 1),
-            grad_output
-        )
-        dist.all_reduce(grad_v, group=gpc.get_group(ParallelMode.SEQUENCE))
-        grad_v = grad_v[:, local_start_idx:local_end_idx]
-        grad_v /= local_world_size
-
-        # calculate gradient for attention score
-        grad_attention_score = torch.zeros_like(attention_scores,
-                                                dtype=grad_output.dtype,
-                                                device=get_current_device())
-
-        # compute with local sub_v
-        grad_attention_score[:, :, local_start_idx:local_end_idx] += torch.matmul(
-            grad_output,
-            sub_v.transpose(2, 1))
-
-        # accumulate grad_attention_score in ring-all-reduce style
-        for i in range(local_world_size - 1):
-            sub_v = ring_forward(sub_v, ParallelMode.SEQUENCE)
-            start_idx, end_idx = _calc_incoming_device_range(i, local_rank, local_world_size, ctx.sub_seq_length)
-
-            # compute grad for this block of the attention score
-            grad_attention_score[:, :, start_idx:end_idx] += torch.matmul(
-                grad_output,
-                sub_v.transpose(2, 1))
-
-        return grad_attention_score, grad_v, None, None, None, None
diff --git a/colossalai/nn/layer/parallel_sequence/_utils.py b/colossalai/nn/layer/parallel_sequence/_utils.py
deleted file mode 100644
index 9fad8fab23d2e89d70ef2d82789107db78ebaf08..0000000000000000000000000000000000000000
--- a/colossalai/nn/layer/parallel_sequence/_utils.py
+++ /dev/null
@@ -1,15 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-
-def _calc_incoming_device_range(i, rank,
world_size, sub_seq_length): - device_of_incoming_k = (rank - i - 1) % world_size - start_idx = sub_seq_length * device_of_incoming_k - end_idx = sub_seq_length * (device_of_incoming_k + 1) - return start_idx, end_idx - - -def _calc_current_device_range(rank, sub_seq_length): - start_idx = sub_seq_length * rank - end_idx = sub_seq_length * (rank + 1) - return start_idx, end_idx diff --git a/colossalai/nn/layer/parallel_sequence/layers.py b/colossalai/nn/layer/parallel_sequence/layers.py deleted file mode 100644 index 3e87f10f049d1850ea040de3d095e0be46e9f08b..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/parallel_sequence/layers.py +++ /dev/null @@ -1,265 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import math -import colossalai - -import torch -import torch.nn as nn -import torch.nn.functional as F -from torch.nn import Parameter - -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_sequence._operation import RingQK, RingAV -from colossalai.registry import LAYERS -from colossalai.kernel.cuda_native.scaled_softmax import AttnMaskType -from colossalai.kernel import FusedScaleMaskSoftmax -from colossalai.context import seed - - -@LAYERS.register_module -class TransformerSelfAttentionRing(nn.Module): - """Parallel self-attention layer abstract class. - Self-attention layer takes input with size [b, s, h] - and returns output of the same size. - - :param hidden_size: hidden size - :type hidden_size: int - :param kv_channels: channels of key/value tensor - :type kv_channels: int - :param num_attention_heads: number of attention heads - :type num_attention_heads: int - :param attention_dropout: dropout probability for attention layer - :type attention_dropout: float - """ - - def __init__(self, - hidden_size, - num_attention_heads, - attention_dropout, - attention_mask_func, - layer_number, - apply_query_key_layer_scaling: bool = False, - convert_fp16_to_fp32_in_softmax: bool = False, - attn_mask_type=AttnMaskType.padding, - masked_softmax_fusion=True, - fp16=False, - bf16=False - ): - super().__init__() - self.convert_fp16_to_fp32_in_softmax = convert_fp16_to_fp32_in_softmax - self.apply_query_key_layer_scaling = apply_query_key_layer_scaling - self.attention_mask_func = attention_mask_func - self.layer_number = layer_number - self.hidden_size = hidden_size - self.num_attention_heads = num_attention_heads - self.attn_mask_type = attn_mask_type - assert self.layer_number > 0 - self.attention_dropout = attention_dropout - - if self.apply_query_key_layer_scaling: - self.convert_fp16_to_fp32_in_softmax = True - - assert self.hidden_size % self.num_attention_heads == 0, \ - 'hidden size is not divisible by the number of attention heads' - - self.hidden_size_per_attention_head = self.hidden_size // num_attention_heads - - self.world_size = gpc.get_world_size(ParallelMode.SEQUENCE) - - # Strided linear layer. - self.query_key_value = _Linear( - hidden_size, - 3 * self.hidden_size, - ) - - self.coeff = None - self.norm_factor = math.sqrt(self.hidden_size) - - if self.apply_query_key_layer_scaling: - self.coeff = layer_number - self.norm_factor *= self.coeff - - self.scale_mask_softmax = FusedScaleMaskSoftmax( - fp16, bf16, - self.attn_mask_type, - masked_softmax_fusion, - self.attention_mask_func, - self.convert_fp16_to_fp32_in_softmax, - self.coeff) - - self.attention_dropout = nn.Dropout(attention_dropout) - - # Output. 
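-        # skip_bias_add=True makes the projection return its bias separately,
-        # so the caller can fuse the bias addition with later elementwise ops
-        # (see the _Linear docstring below).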
- self.dense = _Linear(hidden_size, - hidden_size, - bias=True, - skip_bias_add=True) - - def forward(self, hidden_states, attention_mask): - # hidden_states: [sub_seq_len, batch_size, hidden_size] - # attention_mask: [batch_size, 1, sub_seq_len, seq_len] - sub_seq_length, batch_size, hidden_size = hidden_states.size() - - # ===================== - # Query, Key, and Value - # ===================== - - # Attention heads shape change: - # [sub_seq_len, batch_size, hidden_size] --> [sub_seq_len, batch_size, (3 * head_size * num_heads)] - mixed_x_layer = self.query_key_value(hidden_states) - - # [sub_seq_len, batch_size, num_heads, 3 * head_size] --> 3 [sub_seq_len, batch_size, num_heads, head_size] - new_tensor_shape = mixed_x_layer.size()[:-1] + (self.num_attention_heads, - 3 * self.hidden_size_per_attention_head) - mixed_x_layer = mixed_x_layer.view(*new_tensor_shape) - - # split into query, key and value - last_dim = mixed_x_layer.dim() - 1 - last_dim_value = mixed_x_layer.size(-1) - assert last_dim_value % 3 == 0, 'the last dimension is not a multiple of 3, ' \ - 'cannot be divided into query, key and value' - partition_size = last_dim_value // 3 - (query_layer, key_layer, value_layer) = torch.split( - mixed_x_layer, partition_size, dim=last_dim) - - # attention scores: [batch_size, num_heads, sub_seq_len, seq_len] - output_size = (query_layer.size(1), - query_layer.size(2), - query_layer.size(0), - key_layer.size(0) * self.world_size) - - # [sub_seq_len, batch_size, num_heads, head_size] -> [sub_seq_len, batch_size * num_heads, head_size] - query_layer = query_layer.view(output_size[2], - output_size[0] * output_size[1], -1) - # [sub_seq_len, batch_size, num_heads, head_size] -> [sub_seq_len, batch_size * num_heads, head_size] - key_layer = key_layer.view(key_layer.size(0), - output_size[0] * output_size[1], -1) - - # attention_scores: [batch_size * num_heads, sub_seq_len, seq_len] - attention_scores = RingQK.apply( - query_layer.transpose(0, 1).contiguous(), # [batch_size * num_heads, sub_seq_len, head_size] - key_layer.transpose(0, 1).contiguous(), # [batch_size * num_heads, sub_seq_len, head_size], - batch_size, - self.num_attention_heads, - sub_seq_length - ) - - attention_scores /= self.norm_factor - - # change view to [batch_size, num_heads, sub_seq_len, seq_len] - attention_scores = attention_scores.view(*output_size) - - # change shape to [batch_size, num_heads, sub_seq_len, seq_len] - attention_probs = self.scale_mask_softmax(attention_scores, attention_mask) - # This is actually dropping out entire tokens to attend to, which might - # seem a bit unusual, but is taken from the original Transformer paper. 
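-        # The mask is drawn under the TENSOR seed context, i.e. from the RNG
-        # state the seed manager keeps for tensor-parallel regions rather than
-        # from the default CUDA generator.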
- with seed(ParallelMode.TENSOR): - attention_probs = self.attention_dropout(attention_probs) - - # context layer shape: [batch_size, num_heads, sub_seq_len, head_size] - output_size = (value_layer.size(1), - value_layer.size(2), - query_layer.size(0), - value_layer.size(3)) - - # change view [sub_seq_len, batch_size * num_heads, head_size] - value_layer = value_layer.contiguous().view(value_layer.size(0), - output_size[0] * output_size[1], -1) - - # # change view [b * num_heads, sub_seq_len, seq_len] - attention_probs = attention_probs.view(attention_probs.size(0) * attention_probs.size(1), - attention_probs.size(2), - attention_probs.size(3)) - - # matmul: [batch_size * num_heads, sub_seq_len, head_size] - context_layer = RingAV.apply( - attention_probs, - value_layer.transpose(0, 1).contiguous(), - batch_size, - self.num_attention_heads, - self.hidden_size_per_attention_head, - sub_seq_length - ) - - # change view [batch_size, num_heads, sub_seq_len, head_size] - context_layer = context_layer.view(*output_size) - - # [batch_size, num_heads, sub_seq_len, head_size] -> [sub_seq_len, batch_size, num_heads, head_size] - context_layer = context_layer.permute(2, 0, 1, 3).contiguous() - - # [sub_seq_len, batch_size, num_heads, head_size] -> [sub_seq_len, batch_size, hidden_size] - new_context_layer_shape = context_layer.size()[:-2] + ( - self.hidden_size_per_attention_head * self.num_attention_heads,) - context_layer = context_layer.view(*new_context_layer_shape) - - output, bias = self.dense(context_layer) - - return output, bias - - def __repr__(self): - return f'TransformerSelfAttentionRing(apply_query_key_layer_scaling={self.apply_query_key_layer_scaling}, ' \ - f'layer_number={self.layer_number}, hidden_size:{self.hidden_size}, attention_dropout={self.attention_dropout}, ' \ - f'attn_mask_type={self.attn_mask_type}, num_attention_heads={self.num_attention_heads}, ' \ - f'hidden_size_per_attention_head={self.hidden_size_per_attention_head}, coeff={self.coeff}, norm_factor={self.norm_factor}, ' \ - f'convert_fp16_to_fp32_in_softmax={self.convert_fp16_to_fp32_in_softmax})' - - -class _Linear(nn.Module): - """Linear layer with column parallelism. - The linear layer is defined as Y = XA + b. A is parallelized along - its second dimension as A = [A_1, ..., A_p]. - Arguments: - input_size: first dimension of matrix A. - output_size: second dimension of matrix A. - bias: If true, add bias - init_method: method to initialize weights. Note that bias is always set - to zero. - stride: For the strided linear layers. - keep_master_weight_for_test: This was added for testing and should be - set to False. It returns the master weights - used for initialization. - skip_bias_add: This was added to enable performance optimations where bias - can be fused with other elementwise operations. we skip - adding bias but instead return it. - """ - - def __init__(self, - input_size, - output_size, - bias=True, - skip_bias_add=False): - super(_Linear, self).__init__() - - # Keep input parameters - self.input_size = input_size - self.output_size = output_size - self.skip_bias_add = skip_bias_add - - self.weight = Parameter(torch.empty(self.output_size, - self.input_size, - )) - nn.init.xavier_normal_(self.weight) - - if bias: - self.bias = Parameter(torch.empty(self.output_size)) - # Always initialize bias to zero. - with torch.no_grad(): - self.bias.zero_() - else: - self.register_parameter('bias', None) - - def forward(self, input_): - # Matrix multiply. 
- bias = self.bias if not self.skip_bias_add else None - output = F.linear(input_, self.weight, bias) - - if self.skip_bias_add: - return output, self.bias - else: - return output - - def __repr__(self): - return f'Linear(in_features={self.input_size}, out_features={self.output_size}, ' + \ - f'bias={self.bias is not None}, skip_bias_add={self.skip_bias_add})' diff --git a/colossalai/nn/layer/utils/__init__.py b/colossalai/nn/layer/utils/__init__.py deleted file mode 100644 index 7e999ee8214916d9d2b5465333262d05cad198ec..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/utils/__init__.py +++ /dev/null @@ -1,7 +0,0 @@ -from .common import (ACT2FN, CheckpointModule, _ntuple, divide, get_tensor_parallel_mode, - set_tensor_parallel_attribute_by_partition, set_tensor_parallel_attribute_by_size, to_2tuple) - -__all__ = [ - 'CheckpointModule', 'divide', 'ACT2FN', 'set_tensor_parallel_attribute_by_size', - 'set_tensor_parallel_attribute_by_partition', 'get_tensor_parallel_mode', '_ntuple', 'to_2tuple' -] diff --git a/colossalai/nn/layer/utils/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/utils/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 7a543040065e68a22a6b30c603f370ab4d3e107c..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/utils/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/utils/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/utils/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 1feff2f90a5e3336b620dff2712c6c576910921a..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/utils/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/utils/__pycache__/common.cpython-36.pyc b/colossalai/nn/layer/utils/__pycache__/common.cpython-36.pyc deleted file mode 100644 index 077b0fc56ecc176b682f5d344e782d586e06beb6..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/utils/__pycache__/common.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/utils/__pycache__/common.cpython-37.pyc b/colossalai/nn/layer/utils/__pycache__/common.cpython-37.pyc deleted file mode 100644 index ece9ae9ade0274195e0f8b761cc3aaa210636079..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/utils/__pycache__/common.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/utils/common.py b/colossalai/nn/layer/utils/common.py deleted file mode 100644 index c1d88d2fcfcb56f63d770deadf603153754bfc99..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/utils/common.py +++ /dev/null @@ -1,83 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import collections.abc -from itertools import repeat - -import numpy as np -import torch -from colossalai.constants import IS_TENSOR_PARALLEL, NUM_PARTITIONS -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.utils import checkpoint -from torch import Tensor, nn - - -class CheckpointModule(nn.Module): - def __init__(self, checkpoint: bool = True): - super().__init__() - self.checkpoint = checkpoint - self._use_checkpoint = checkpoint - - def _forward(self, *args, **kwargs): - raise NotImplementedError('CheckpointModule should implement _forward method instead of origin forward') - - def forward(self, *args, **kwargs): - if self._use_checkpoint: - return checkpoint(self._forward, *args, **kwargs) - else: - return self._forward(*args, **kwargs) - - def train(self, 
mode: bool = True): - self._use_checkpoint = self.checkpoint - return super().train(mode=mode) - - def eval(self): - self._use_checkpoint = False - return super().eval() - - -def divide(numerator, denominator): - """Only allow exact division - - :param numerator: Numerator of the division - :param denominator: Denominator of the division - """ - assert numerator % denominator == 0, \ - '{} is not divisible by {}'.format(numerator, denominator) - return numerator // denominator - - -def swish(x: Tensor) -> Tensor: - return x * torch.sigmoid(x) - - -ACT2FN = {"gelu": torch.nn.functional.gelu, "relu": torch.nn.functional.relu, "swish": swish} - - -def set_tensor_parallel_attribute_by_size(param, size): - setattr(param, IS_TENSOR_PARALLEL, True) - setattr(param, NUM_PARTITIONS, size // np.prod(param.shape)) - - -def set_tensor_parallel_attribute_by_partition(param, num_partitions): - setattr(param, IS_TENSOR_PARALLEL, True) - setattr(param, NUM_PARTITIONS, num_partitions) - - -def get_tensor_parallel_mode(): - return env.mode - - -# From PyTorch internals - - -def _ntuple(n): - def parse(x): - if isinstance(x, collections.abc.Iterable): - return x - return tuple(repeat(x, n)) - - return parse - - -to_2tuple = _ntuple(2) diff --git a/colossalai/nn/layer/vanilla/__init__.py b/colossalai/nn/layer/vanilla/__init__.py deleted file mode 100644 index 14af800272826030a04cb8172ba2fd1c6558d103..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/vanilla/__init__.py +++ /dev/null @@ -1,5 +0,0 @@ -from .layers import DropPath, VanillaClassifier, VanillaPatchEmbedding, \ - WrappedDropout, WrappedDropPath - -__all__ = ['VanillaPatchEmbedding', 'VanillaClassifier', 'DropPath', - 'WrappedDropout', 'WrappedDropPath'] diff --git a/colossalai/nn/layer/vanilla/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/vanilla/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 73d717425b12b8e8bc3ac96e417cd1295555a6fe..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/vanilla/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/vanilla/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/vanilla/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index f586a44a11fffbf1f857fc475b6ce06ec88545f3..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/vanilla/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/vanilla/__pycache__/layers.cpython-36.pyc b/colossalai/nn/layer/vanilla/__pycache__/layers.cpython-36.pyc deleted file mode 100644 index 32a3a9a99ae86fd7bbcde322e00118dc7b550593..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/vanilla/__pycache__/layers.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/vanilla/__pycache__/layers.cpython-37.pyc b/colossalai/nn/layer/vanilla/__pycache__/layers.cpython-37.pyc deleted file mode 100644 index 7fa4f7ac14940d675509b17032b91f21bb04c472..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/vanilla/__pycache__/layers.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/vanilla/layers.py b/colossalai/nn/layer/vanilla/layers.py deleted file mode 100644 index e5c9fd074addbd944f4e6322eafabd9432eaefb7..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/vanilla/layers.py +++ /dev/null @@ -1,232 +0,0 @@ -import math -from typing import Callable - -import torch -import torch.nn.functional as F -from colossalai.context 
import seed -from colossalai.nn import init as init -from colossalai.registry import LAYERS -from colossalai.utils.cuda import get_current_device -from torch import Tensor -from torch import nn as nn - -from ..utils import to_2tuple - - -def drop_path(x, drop_prob: float = 0., training: bool = False): - """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). - This is the same as the DropConnect impl I created for EfficientNet, etc networks, however, - the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper... - See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for - changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use - 'survival rate' as the argument. - """ - if drop_prob == 0. or not training: - return x - keep_prob = 1 - drop_prob - shape = (x.shape[0], ) + (1, ) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets - random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device) - random_tensor.floor_() # binarize - output = x.div(keep_prob) * random_tensor - return output - - -class DropPath(nn.Module): - """ - Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). - Adapted from https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/drop.py - """ - - def __init__(self, drop_prob=None): - super(DropPath, self).__init__() - self.drop_prob = drop_prob - - def forward(self, x): - return drop_path(x, self.drop_prob, self.training) - - -class WrappedDropout(nn.Module): - """Same as torch.nn.Dropout. But it is wrapped with the context of seed manager. - """ - - def __init__(self, p: float = 0.5, inplace: bool = False, mode=None): - super().__init__() - if p < 0 or p > 1: - raise ValueError("dropout probability has to be between 0 and 1, " - "but got {}".format(p)) - self.p = p - self.inplace = inplace - if mode is None: - self.func = self.nonefunc - else: - self.func = self.normalfunc - self.mode = mode - - def nonefunc(self, inputs): - return F.dropout(inputs, self.p, self.training, self.inplace) - - def normalfunc(self, inputs): - with seed(self.mode): - return F.dropout(inputs, self.p, self.training, self.inplace) - - def forward(self, inputs): - return self.func(inputs) - - -class WrappedDropPath(nn.Module): - """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). - Here, it is wrapped with the context of seed manager. 
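-    Paths are dropped with probability p and survivors are rescaled by
-    1 / (1 - p), so activations match in expectation at evaluation time.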
- """ - - def __init__(self, p: float = 0., mode=None): - super().__init__() - self.p = p - self.mode = mode - if self.mode is None: - self.func = self.nonefunc - else: - self.func = self.normalfunc - self.mode = mode - - def nonefunc(self, inputs): - return drop_path(inputs, self.p, self.training) - - def normalfunc(self, inputs): - with seed(self.mode): - return drop_path(inputs, self.p, self.training) - - def forward(self, inputs): - return self.func(inputs) - - -@LAYERS.register_module -class VanillaPatchEmbedding(nn.Module): - """ - 2D Image to Patch Embedding - - :param img_size: image size - :type img_size: int - :param patch_size: patch size - :type patch_size: int - :param in_chans: number of channels of input image - :type in_chans: int - :param embed_size: size of embedding - :type embed_size: int - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param flatten: whether to flatten output tensor, defaults to True - :type flatten: bool, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - :param position_embed_initializer: The intializer of position embedding, defaults to zero - :type position_embed_initializer: typing.Callable, optional - """ - - def __init__(self, - img_size: int, - patch_size: int, - in_chans: int, - embed_size: int, - flatten: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1), - position_embed_initializer: Callable = init.zeros_()): - super().__init__() - img_size = to_2tuple(img_size) - patch_size = to_2tuple(patch_size) - self.img_size = img_size - self.patch_size = patch_size - self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1]) - self.num_patches = self.grid_size[0] * self.grid_size[1] - self.flatten = flatten - - self.weight = nn.Parameter( - torch.empty((embed_size, in_chans, *self.patch_size), device=get_current_device(), dtype=dtype)) - self.bias = nn.Parameter(torch.empty(embed_size, device=get_current_device(), dtype=dtype)) - self.cls_token = nn.Parameter(torch.zeros((1, 1, embed_size), device=get_current_device(), dtype=dtype)) - self.pos_embed = nn.Parameter( - torch.zeros((1, self.num_patches + 1, embed_size), device=get_current_device(), dtype=dtype)) - - self.reset_parameters(weight_initializer, bias_initializer, position_embed_initializer) - - def reset_parameters(self, weight_initializer, bias_initializer, position_embed_initializer): - fan_in, fan_out = nn.init._calculate_fan_in_and_fan_out(self.weight) - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - bias_initializer(self.bias, fan_in=fan_in) - position_embed_initializer(self.pos_embed) - - def forward(self, input_: Tensor) -> Tensor: - B, C, H, W = input_.shape - assert H == self.img_size[0] and W == self.img_size[1], \ - f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." 
- output = F.conv2d(input_, self.weight, self.bias, stride=self.patch_size) - if self.flatten: - output = output.flatten(2).transpose(1, 2) # BCHW -> BNC - - cls_token = self.cls_token.expand(output.shape[0], -1, -1) - output = torch.cat((cls_token, output), dim=1) - output = output + self.pos_embed - return output - - -@LAYERS.register_module -class VanillaClassifier(nn.Module): - """ - Dense linear classifier - - :param in_features: size of each input sample - :type in_features: int - :param num_classes: number of classes - :type num_classes: int - :param weight: weight of the classifier, defaults to True - :type weight: torch.nn.Parameter, optional - :param bias: If set to ``False``, the layer will not learn an additive bias, defaults to True - :type bias: bool, optional - :param dtype: The dtype of parameters, defaults to None - :type dtype: torch.dtype, optional - :param weight_initializer: The intializer of weight, defaults to kaiming uniform initializer - :type weight_initializer: typing.Callable, optional - :param bias_initializer: The intializer of bias, defaults to xavier uniform initializer - :type bias_initializer: typing.Callable, optional - """ - - def __init__(self, - in_features: int, - num_classes: int, - weight: nn.Parameter = None, - bias: bool = True, - dtype: torch.dtype = None, - weight_initializer: Callable = init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer: Callable = init.xavier_uniform_(a=1, scale=1)): - super().__init__() - self.in_features = in_features - self.num_classes = num_classes - - if weight is not None: - self.weight = weight - self.has_weight = False - else: - self.weight = nn.Parameter( - torch.empty(self.num_classes, self.in_features, device=get_current_device(), dtype=dtype)) - self.has_weight = True - if bias: - self.bias = nn.Parameter(torch.zeros(self.num_classes, device=get_current_device(), dtype=dtype)) - else: - self.bias = None - - self.reset_parameters(weight_initializer, bias_initializer) - - def reset_parameters(self, weight_initializer, bias_initializer): - fan_in, fan_out = self.in_features, self.num_classes - - if self.has_weight: - weight_initializer(self.weight, fan_in=fan_in, fan_out=fan_out) - - if self.bias is not None: - bias_initializer(self.bias, fan_in=fan_in) - - def forward(self, input_: Tensor) -> Tensor: - return F.linear(input_, self.weight, self.bias) diff --git a/colossalai/nn/layer/wrapper/__init__.py b/colossalai/nn/layer/wrapper/__init__.py deleted file mode 100644 index 01f746f65748d5daf56b967bde352073195cef4e..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/wrapper/__init__.py +++ /dev/null @@ -1,4 +0,0 @@ -from .lambda_wrapper import LambdaWrapper -from .pipeline_wrapper import PipelineSharedModuleWrapper - -__all__ = ['LambdaWrapper', 'PipelineSharedModuleWrapper'] diff --git a/colossalai/nn/layer/wrapper/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/layer/wrapper/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index db689e76aeeb79e74808ab3a211078d3f4a839a6..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/wrapper/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/wrapper/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/layer/wrapper/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 94bbd78c8d878b1064b697c4c888a7c962e56fb8..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/wrapper/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git 
a/colossalai/nn/layer/wrapper/__pycache__/lambda_wrapper.cpython-36.pyc b/colossalai/nn/layer/wrapper/__pycache__/lambda_wrapper.cpython-36.pyc deleted file mode 100644 index ce20a191f0a9d9c9e7c2a295f2e6be135a8ab2e3..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/wrapper/__pycache__/lambda_wrapper.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/wrapper/__pycache__/lambda_wrapper.cpython-37.pyc b/colossalai/nn/layer/wrapper/__pycache__/lambda_wrapper.cpython-37.pyc deleted file mode 100644 index 28c5553b72e95c378a739e0c2b3f6d3cde016608..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/wrapper/__pycache__/lambda_wrapper.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/wrapper/__pycache__/pipeline_wrapper.cpython-36.pyc b/colossalai/nn/layer/wrapper/__pycache__/pipeline_wrapper.cpython-36.pyc deleted file mode 100644 index 93398247f5303a914de6678060aa27fff4e3109e..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/wrapper/__pycache__/pipeline_wrapper.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/layer/wrapper/__pycache__/pipeline_wrapper.cpython-37.pyc b/colossalai/nn/layer/wrapper/__pycache__/pipeline_wrapper.cpython-37.pyc deleted file mode 100644 index 98161eb2639d7fd76d3d1c040b251225e0184f2f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/layer/wrapper/__pycache__/pipeline_wrapper.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/layer/wrapper/lambda_wrapper.py b/colossalai/nn/layer/wrapper/lambda_wrapper.py deleted file mode 100644 index f40ed7297da6fa31e00928699330622335e04efd..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/wrapper/lambda_wrapper.py +++ /dev/null @@ -1,37 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn - -from colossalai.builder import build_layer -from colossalai.registry import LAYERS - - -@LAYERS.register_module -class LambdaWrapper(nn.Module): - """Wrap a function to nn.Module, which takes a config of layers and can fully access them - - :param func: User customed function - :type func: Callable - :param layers_cfg: Config of layers, defaults to None - :type layers_cfg: dict, optional - """ - - def __init__(self, func, layers_cfg: dict = None): - super().__init__() - self.func = func - self.layers = self._build_layers(layers_cfg) - - def _build_layers(self, layers_cfg: dict): - if layers_cfg is None: - return None - else: - layers = [] - - for cfg in layers_cfg: - layer = build_layer(cfg) - layers.append(layer) - return layers - - def forward(self, *args, **kwargs): - return self.func(self, *args, **kwargs) diff --git a/colossalai/nn/layer/wrapper/pipeline_wrapper.py b/colossalai/nn/layer/wrapper/pipeline_wrapper.py deleted file mode 100644 index dd422a75f9793daa5697e24044670d644a64d792..0000000000000000000000000000000000000000 --- a/colossalai/nn/layer/wrapper/pipeline_wrapper.py +++ /dev/null @@ -1,46 +0,0 @@ -import torch.nn as nn -import torch.distributed as dist -from typing import List, Tuple, Union -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc - - -class PipelineSharedModuleWrapper: - def __init__(self, pipeline_ranks: Union[List[int], Tuple[int]]) -> None: - assert len(pipeline_ranks) > 1, f'Expect len(pipeline_ranks) > 1, got {len(pipeline_ranks)}' - self.pipeline_ranks = pipeline_ranks - self.group = None - self.ranks_in_group = None - self._init_group() - - def _init_group(self): 
- world_size = gpc.get_world_size(ParallelMode.GLOBAL) - dp_size = gpc.get_world_size(ParallelMode.DATA) - pp_size = gpc.get_world_size(ParallelMode.PIPELINE) - rank = gpc.get_global_rank() - num_dp_groups = world_size // dp_size - num_pp_stages = num_dp_groups // pp_size - for i in range(dp_size): - for j in range(num_pp_stages): - pipeline_ranks = list( - range(i * num_dp_groups + j, - (i + 1) * num_dp_groups, - num_pp_stages)) - sub_ranks = [pipeline_ranks[idx] for idx in self.pipeline_ranks] - group = dist.new_group(sub_ranks) - if rank in sub_ranks: - self.group = group - self.ranks_in_group = sub_ranks - - def register_module(self, module: nn.Module): - assert self.ranks_in_group is not None, f'Rank {gpc.get_local_rank(ParallelMode.PIPELINE)} is not in pipeline_ranks {self.pipeline_ranks}' - src = self.ranks_in_group[self.pipeline_ranks[0]] - for p in module.parameters(): - setattr(p, 'pipeline_shared_module_pg', self.group) - dist.broadcast(p, src, group=self.group) - - def register_parameter(self, param: nn.Parameter): - assert self.ranks_in_group is not None, f'Rank {gpc.get_local_rank(ParallelMode.PIPELINE)} is not in pipeline_ranks {self.pipeline_ranks}' - src = self.ranks_in_group[self.pipeline_ranks[0]] - setattr(param, 'pipeline_shared_module_pg', self.group) - dist.broadcast(param, src, group=self.group) diff --git a/colossalai/nn/loss/__init__.py b/colossalai/nn/loss/__init__.py deleted file mode 100644 index 373e4ec9468bc13317d74c19b5922073a5cb8c0c..0000000000000000000000000000000000000000 --- a/colossalai/nn/loss/__init__.py +++ /dev/null @@ -1,41 +0,0 @@ -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.nn.layer.utils import get_tensor_parallel_mode -from torch import nn -from torch.nn.modules.loss import * -from torch.nn.modules.loss import _Loss - -from .loss_1d import VocabParallelCrossEntropyLoss1D -from .loss_2d import CrossEntropyLoss2D, VocabParallelCrossEntropyLoss2D -from .loss_2p5d import CrossEntropyLoss2p5D, VocabParallelCrossEntropyLoss2p5D -from .loss_3d import CrossEntropyLoss3D, VocabParallelCrossEntropyLoss3D -from .loss_moe import MoeCrossEntropyLoss, MoeLoss - -_parallel_cross_entropy = { - '2d': CrossEntropyLoss2D, - '2.5d': CrossEntropyLoss2p5D, - '3d': CrossEntropyLoss3D, -} - -_vocab_parallel_cross_entropy = { - '1d': VocabParallelCrossEntropyLoss1D, - '2d': VocabParallelCrossEntropyLoss2D, - '2.5d': VocabParallelCrossEntropyLoss2p5D, - '3d': VocabParallelCrossEntropyLoss3D, -} - - -class CrossEntropyLoss(_Loss): - - def __init__(self, reduction: bool = True, *args, **kwargs): - super().__init__() - tensor_parallel = get_tensor_parallel_mode() - if tensor_parallel is not None and env.vocab_parallel: - self.loss = _vocab_parallel_cross_entropy[tensor_parallel](reduction=reduction, *args, **kwargs) - elif tensor_parallel is None or tensor_parallel == '1d': - reduction = 'mean' if reduction else 'none' - self.loss = nn.CrossEntropyLoss(reduction=reduction, *args, **kwargs) - else: - self.loss = _parallel_cross_entropy[tensor_parallel](reduction=reduction, *args, **kwargs) - - def forward(self, *args): - return self.loss(*args) diff --git a/colossalai/nn/loss/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/loss/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 48b1e0d28719ba38172fef7e4c43a3e82d0233a1..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/__init__.cpython-37.pyc 
b/colossalai/nn/loss/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index d172438bc76e91746d116d55dbadaf2bff16b17b..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_1d.cpython-36.pyc b/colossalai/nn/loss/__pycache__/loss_1d.cpython-36.pyc deleted file mode 100644 index 8609c9b914966507d1d804a32d6a7a94898b5c81..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_1d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_1d.cpython-37.pyc b/colossalai/nn/loss/__pycache__/loss_1d.cpython-37.pyc deleted file mode 100644 index 7eeded7a5e13348a96c1ffad4387cdf5cf245cb1..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_1d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_2d.cpython-36.pyc b/colossalai/nn/loss/__pycache__/loss_2d.cpython-36.pyc deleted file mode 100644 index ba379556c271e8886b9147719c772bc8ad00ba1d..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_2d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_2d.cpython-37.pyc b/colossalai/nn/loss/__pycache__/loss_2d.cpython-37.pyc deleted file mode 100644 index 266bb7f44f5730c867f6880e7edb1abdf370ce6f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_2d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_2p5d.cpython-36.pyc b/colossalai/nn/loss/__pycache__/loss_2p5d.cpython-36.pyc deleted file mode 100644 index e29747878c34ae2268f92988abb94e9d896eabb8..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_2p5d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_2p5d.cpython-37.pyc b/colossalai/nn/loss/__pycache__/loss_2p5d.cpython-37.pyc deleted file mode 100644 index 1d26267744ea8e493c9f6cf5f0d82516f4cc14a3..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_2p5d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_3d.cpython-36.pyc b/colossalai/nn/loss/__pycache__/loss_3d.cpython-36.pyc deleted file mode 100644 index 838b989193cbf5f728066dd2c31018771c974946..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_3d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_3d.cpython-37.pyc b/colossalai/nn/loss/__pycache__/loss_3d.cpython-37.pyc deleted file mode 100644 index 2aa82f6465d8cfbd3a3596a343f1596f91806384..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_3d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_moe.cpython-36.pyc b/colossalai/nn/loss/__pycache__/loss_moe.cpython-36.pyc deleted file mode 100644 index 84f0ef7cf3901754708cf2b9a645ec80f08faa39..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_moe.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/loss/__pycache__/loss_moe.cpython-37.pyc b/colossalai/nn/loss/__pycache__/loss_moe.cpython-37.pyc deleted file mode 100644 index 3fc2070f19b9251d7afae4edda6a1833dbfac232..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/loss/__pycache__/loss_moe.cpython-37.pyc and /dev/null differ diff 
--git a/colossalai/nn/loss/loss_1d.py b/colossalai/nn/loss/loss_1d.py deleted file mode 100644 index d0e1ec2a4bcdb139f3f32531f2792e2dd63ee444..0000000000000000000000000000000000000000 --- a/colossalai/nn/loss/loss_1d.py +++ /dev/null @@ -1,110 +0,0 @@ -import torch -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.registry import LOSSES -from torch.cuda.amp import custom_bwd, custom_fwd -from torch.nn.modules.loss import _Loss - - -class _VocabParallelCrossEntropy1D(torch.autograd.Function): - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, vocab_parallel_logits, targets): - - # Maximum value along vocab dimension across all GPUs. - logits_max = torch.max(vocab_parallel_logits, dim=-1)[0] - torch.distributed.all_reduce(logits_max, - op=torch.distributed.ReduceOp.MAX, - group=gpc.get_group(ParallelMode.PARALLEL_1D)) - # Subtract the maximum value. - vocab_parallel_logits.sub_(logits_max.unsqueeze(dim=-1)) - - # Get the partition's vocab indices - partition_vocab_size = vocab_parallel_logits.size()[-1] - rank = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - vocab_start_index = partition_vocab_size * rank - vocab_end_index = vocab_start_index + partition_vocab_size - - # Create a mask for target ids outside this partition (1 means it needs to be masked). - target_mask = (targets < vocab_start_index) | (targets >= vocab_end_index) - masked_target = targets.clone() - vocab_start_index - masked_target[target_mask] = 0 - - # Get predicted-logits = logits[target]. - # For simplicity, we convert logits to a 2-D tensor with size - # [*, partition-vocab-size] and target to a 1-D tensor of size [*]. - logits_2d = vocab_parallel_logits.view(-1, partition_vocab_size) - masked_target_1d = masked_target.view(-1) - arange_1d = torch.arange(start=0, end=logits_2d.size()[0], device=logits_2d.device) - predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] - predicted_logits_1d = predicted_logits_1d.clone().contiguous() - predicted_logits = predicted_logits_1d.view_as(targets) - predicted_logits[target_mask] = 0.0 - # All reduce is needed to get the chunks from other GPUs. - torch.distributed.all_reduce(predicted_logits, - op=torch.distributed.ReduceOp.SUM, - group=gpc.get_group(ParallelMode.PARALLEL_1D)) - - # Sum of exponential of logits along vocab dimension across all GPUs. - exp_logits = vocab_parallel_logits - torch.exp(vocab_parallel_logits, out=exp_logits) - sum_exp_logits = exp_logits.sum(dim=-1) - torch.distributed.all_reduce(sum_exp_logits, - op=torch.distributed.ReduceOp.SUM, - group=gpc.get_group(ParallelMode.PARALLEL_1D)) - - # Loss = log(sum(exp(logits))) - predicted-logit. - loss = torch.log(sum_exp_logits) - predicted_logits - # Store softmax, target-mask and masked-target for backward pass. - exp_logits.div_(sum_exp_logits.unsqueeze(dim=-1)) - ctx.save_for_backward(exp_logits, target_mask, masked_target_1d) - return loss - - @staticmethod - @custom_bwd - def backward(ctx, grad_output): - - # Retrieve tensors from the forward path. - softmax, target_mask, masked_target_1d = ctx.saved_tensors - - # All the inputs have softmax as their gradient. - grad_input = softmax - # For simplicity, work with the 2D gradient. - partition_vocab_size = softmax.size()[-1] - grad_2d = grad_input.view(-1, partition_vocab_size) - - # Add the gradient from matching classes.
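# Editorial sketch of the math, not part of the original file: the lines below apply
# the standard softmax cross-entropy gradient, d(loss)/d(logits) = softmax - onehot(target).
# grad_input already holds the softmax, and subtracting (1.0 - target_mask) takes 1 off
# the target column only on the rank whose vocab partition owns that target; all other
# ranks leave their softmax chunk untouched.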
- arange_1d = torch.arange(start=0, end=grad_2d.size()[0], device=grad_2d.device) - grad_2d[arange_1d, masked_target_1d] -= (1.0 - target_mask.view(-1).float()) - - # Finally elementwise multiplication with the output gradients. - grad_input.mul_(grad_output.unsqueeze(dim=-1)) - - return grad_input, None - - -@LOSSES.register_module -class VocabParallelCrossEntropyLoss1D(_Loss): - """ - Vocab parallel cross entropy loss for 1D parallelism - - :param reduction: whether to average the loss, defaults to True - - :type reduction: bool, optional - """ - - def __init__(self, reduction=True): - super().__init__() - self.reduction_mean = reduction - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - loss = _VocabParallelCrossEntropy1D.apply(logits, targets) - if self.reduction_mean: - loss = loss.mean() - return loss diff --git a/colossalai/nn/loss/loss_2d.py b/colossalai/nn/loss/loss_2d.py deleted file mode 100644 index a2ad8f435c3400825058682d1cd01d4ed863f61e..0000000000000000000000000000000000000000 --- a/colossalai/nn/loss/loss_2d.py +++ /dev/null @@ -1,145 +0,0 @@ -import torch -import torch.distributed as dist -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_2d import reduce_by_batch_2d, split_tensor_2d -from colossalai.nn.layer.parallel_2d._utils import assert_summa_initialization -from colossalai.registry import LOSSES -from colossalai.utils import get_current_device -from torch.cuda.amp import custom_bwd, custom_fwd -from torch.nn.functional import cross_entropy -from torch.nn.modules.loss import _Loss - - -@LOSSES.register_module -class CrossEntropyLoss2D(_Loss): - """ - Cross entropy loss for 2D parallelism - - :param reduction: whether to average the loss, defaults to True - :param args: Args for loss function - :param kwargs: Kwargs for loss function - - :type reduction: bool, optional - """ - - def __init__(self, reduction=True, *args, **kwargs): - super().__init__() - assert_summa_initialization() - self.reduction_mean = reduction - self.loss_args = args - self.loss_kwargs = kwargs - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - targets = split_tensor_2d(targets) - loss = cross_entropy(logits, targets, reduction='none', *self.loss_args, **self.loss_kwargs) - if self.reduction_mean: - loss = loss.mean() - loss = reduce_by_batch_2d(loss, True) - return loss - - -class _VocabParallelCrossEntropy2D(torch.autograd.Function): - ### Modified based on megatron.mpu.cross_entropy ### - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, logits, targets): - # logits: [b/q, h/q] - # labels: [b/q] - # loss: [b/q] - # vocab_parallel_logits: [b/q, s, v/q] - # target: [b/q, s] - logits_max = torch.max(logits, dim=-1)[0] - torch.distributed.all_reduce(logits_max, - op=torch.distributed.ReduceOp.MAX, - group=gpc.get_group(ParallelMode.PARALLEL_2D_ROW)) - # Subtract the maximum value. 
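# Editorial note, not original code: log-sum-exp is shift-invariant,
# log(sum(exp(x - m))) = log(sum(exp(x))) - m, so subtracting the all-reduced row
# maximum changes neither the loss nor the gradient; it only keeps torch.exp from
# overflowing in float32 (exp(90.) is already inf there, while exp(90. - 90.) == 1).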
- # vocab_parallel_logits.sub_(logits_max.unsqueeze(dim=-1)) - logits = logits - logits_max.unsqueeze(dim=-1) - - vocab_size = logits.size(-1) - rank = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - vocab_start = rank * (vocab_size) - vocab_end = (rank + 1) * (vocab_size) - 1 - - target_mask = (targets < vocab_start) | (targets > vocab_end) - - masked_target = targets.clone() - vocab_start - masked_target[target_mask] = 0 - arange_1d = torch.arange( - start=0, - end=logits.size()[0], - device=get_current_device(), - ) - predicted_logits = logits[arange_1d, masked_target] - predicted_logits[target_mask] = 0. - dist.all_reduce(predicted_logits, group=gpc.get_group(ParallelMode.PARALLEL_2D_ROW)) - - exp_logits = torch.exp(logits) - sum_exp_logits = exp_logits.sum(dim=1) - dist.all_reduce(sum_exp_logits, group=gpc.get_group(ParallelMode.PARALLEL_2D_ROW)) - - loss = torch.log(sum_exp_logits) - predicted_logits - - exp_logits.div_(sum_exp_logits.unsqueeze(dim=-1)) - ctx.save_for_backward(exp_logits, target_mask, masked_target) - - return loss - - @staticmethod - @custom_bwd - def backward(ctx, output_grad): - # Retrieve tensors from the forward path. - softmax, target_mask, masked_target = ctx.saved_tensors - - # All the inputs have softmax as their gradient. - grad_input = softmax - - # For simplicity, work with the 2D gradient. - partition_vocab_size = softmax.size()[-1] - grad_2d = grad_input.view(-1, partition_vocab_size) - - # Add the gradient from matching classes. - arange_1d = torch.arange(start=0, end=grad_2d.size()[0], device=get_current_device()) - grad_2d[arange_1d, masked_target] -= (1.0 - target_mask.view(-1).float()) - - # Finally elementwise multiplication with the output gradients. - grad_input.mul_(output_grad.unsqueeze(dim=-1)) - - return grad_input, None - - -@LOSSES.register_module -class VocabParallelCrossEntropyLoss2D(_Loss): - """ - Vocab parallel cross entropy loss for 2D parallelism - - :param reduction: whether to average the loss, defaults to True - - :type reduction: bool, optional - """ - - def __init__(self, reduction=True): - super().__init__() - self.reduction_mean = reduction - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - targets = split_tensor_2d(targets) - loss = _VocabParallelCrossEntropy2D.apply( - logits, - targets, - ) - if self.reduction_mean: - loss = loss.mean() - loss = reduce_by_batch_2d(loss, True) - return loss diff --git a/colossalai/nn/loss/loss_2p5d.py b/colossalai/nn/loss/loss_2p5d.py deleted file mode 100644 index b5379776b9c53e5dc390d0be5da90cd59f9d99fc..0000000000000000000000000000000000000000 --- a/colossalai/nn/loss/loss_2p5d.py +++ /dev/null @@ -1,138 +0,0 @@ -import torch -import torch.distributed as dist -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_2p5d import reduce_by_batch_2p5d, split_tensor_2p5d -from colossalai.nn.layer.parallel_2p5d._utils import assert_tesseract_initialization -from colossalai.registry import LOSSES -from colossalai.utils import get_current_device -from torch.cuda.amp import custom_bwd, custom_fwd -from torch.nn.functional import cross_entropy -from torch.nn.modules.loss import _Loss - - -@LOSSES.register_module -class CrossEntropyLoss2p5D(_Loss): - """ - Cross entropy loss for 2.5D parallelism - - :param reduction: whether to average the loss, defaults to True - :param args: Args for loss function - :param kwargs: Kwargs for
loss function - - :type reduction: bool, optional - """ - def __init__(self, reduction=True, *args, **kwargs): - super().__init__() - assert_tesseract_initialization() - self.reduction_mean = reduction - self.loss_args = args - self.loss_kwargs = kwargs - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - targets = split_tensor_2p5d(targets) - loss = cross_entropy(logits, targets, reduction='none', *self.loss_args, **self.loss_kwargs) - if self.reduction_mean: - loss = loss.mean() - loss = reduce_by_batch_2p5d(loss, True) - return loss - - -class _VocabParallelCrossEntropy2p5D(torch.autograd.Function): - ### Modified based on megatron.mpu.cross_entropy ### - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, logits, targets): - # logits: [b/dq, h/q] - # loss: [b/dq] - # targets: [b/dq, h/q] - logits_max = torch.max(logits, dim=-1)[0] - torch.distributed.all_reduce(logits_max, - op=torch.distributed.ReduceOp.MAX, - group=gpc.get_group(ParallelMode.PARALLEL_2P5D_ROW)) - # Subtract the maximum value. - logits = logits - logits_max.unsqueeze(dim=-1) - - vocab_size = logits.size(-1) - rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - vocab_start = rank * (vocab_size) - vocab_end = (rank + 1) * (vocab_size) - 1 - - target_mask = (targets < vocab_start) | (targets > vocab_end) - - masked_target = targets.clone() - vocab_start - masked_target[target_mask] = 0 - arange_1d = torch.arange( - start=0, - end=logits.size()[0], - device=get_current_device(), - ) - predicted_logits = logits[arange_1d, masked_target] - predicted_logits[target_mask] = 0. - dist.all_reduce(predicted_logits, group=gpc.get_group(ParallelMode.PARALLEL_2P5D_ROW)) - - exp_logits = torch.exp(logits) - sum_exp_logits = exp_logits.sum(dim=1) - dist.all_reduce(sum_exp_logits, group=gpc.get_group(ParallelMode.PARALLEL_2P5D_ROW)) - - loss = torch.log(sum_exp_logits) - predicted_logits - - exp_logits.div_(sum_exp_logits.unsqueeze(dim=-1)) - ctx.save_for_backward(exp_logits, target_mask, masked_target) - - return loss - - @staticmethod - @custom_bwd - def backward(ctx, output_grad): - # Retrieve tensors from the forward path. - softmax, target_mask, masked_target = ctx.saved_tensors - - # All the inputs have softmax as their gradient. - grad_input = softmax - - # For simplicity, work with the 2D gradient. - partition_vocab_size = softmax.size()[-1] - grad_2d = grad_input.view(-1, partition_vocab_size) - - # Add the gradient from matching classes. - arange_1d = torch.arange(start=0, end=grad_2d.size()[0], device=get_current_device()) - grad_2d[arange_1d, masked_target] -= (1.0 - target_mask.view(-1).float()) - - # Finally elementwise multiplication with the output gradients.
- grad_input.mul_(output_grad.unsqueeze(dim=-1)) - - return grad_input, None - - -@LOSSES.register_module -class VocabParallelCrossEntropyLoss2p5D(_Loss): - """ - Vocab parallel cross entropy loss for 2.5D parallelism - - :param reduction: whether to average the loss, defaults to True - - :type reduction: bool, optional - """ - def __init__(self, reduction=True): - super().__init__() - self.reduction_mean = reduction - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - targets = split_tensor_2p5d(targets) - loss = _VocabParallelCrossEntropy2p5D.apply(logits, targets) - if self.reduction_mean: - loss = loss.mean() - loss = reduce_by_batch_2p5d(loss, True) - - return loss diff --git a/colossalai/nn/loss/loss_3d.py b/colossalai/nn/loss/loss_3d.py deleted file mode 100644 index 0835d277097bbbd36a91a13379cfe8c97fe1b21f..0000000000000000000000000000000000000000 --- a/colossalai/nn/loss/loss_3d.py +++ /dev/null @@ -1,139 +0,0 @@ -import torch -import torch.distributed as dist -from colossalai.constants import INPUT_GROUP_3D, WEIGHT_GROUP_3D, OUTPUT_GROUP_3D -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_3d import reduce_by_batch_3d, split_tensor_3d -from colossalai.nn.layer.parallel_3d._utils import get_parallel_mode_from_env -from colossalai.registry import LOSSES -from colossalai.utils import get_current_device -from torch.cuda.amp import custom_bwd, custom_fwd -from torch.nn.functional import cross_entropy -from torch.nn.modules.loss import _Loss - - -@LOSSES.register_module -class CrossEntropyLoss3D(_Loss): - """ - Cross entropy loss for 3D parallelism - - :param reduction: whether to average the loss, defaults to True - :param args: Args for loss function - :param kwargs: Kwargs for loss function - - :type reduction: bool, optional - """ - - def __init__(self, reduction=True, *args, **kwargs): - super().__init__() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.reduction_mean = reduction - self.loss_args = args - self.loss_kwargs = kwargs - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - targets = split_tensor_3d(targets, 0, self.weight_parallel_mode) - targets = split_tensor_3d(targets, 0, self.input_parallel_mode) - loss = cross_entropy(logits, targets, reduction='none', *self.loss_args, **self.loss_kwargs) - if self.reduction_mean: - loss = loss.mean() - loss = reduce_by_batch_3d(loss, self.input_parallel_mode, self.weight_parallel_mode, True) - return loss - - -class _VocabParallelCrossEntropy3D(torch.autograd.Function): - # Adapted from megatron.mpu.cross_entropy - # loss[i] = -logits[i][targets] + log(sum(exp(logits[i]))) - - @staticmethod - @custom_fwd(cast_inputs=torch.float32) - def forward(ctx, logits, targets, output_parallel_mode): - # logits: [b/q^2, c/q] - # labels: [b/q^2] - # loss: [b/q^2] - logits_max = torch.max(logits, dim=-1)[0] - dist.all_reduce(logits_max, op=torch.distributed.ReduceOp.MAX, group=gpc.get_group(output_parallel_mode)) - # Subtract the maximum value. 
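# Editorial micro-example of the masking below, not original code: with a global
# vocab of 8 split over 2 ranks, rank 1 owns ids [4, 8). For target id 5 it gathers
# the local logit at masked_target = 5 - 4 = 1; for target id 2 the target_mask entry
# is True, the gathered logit is zeroed, and the later all-reduce sums in the
# contribution from rank 0, which owns id 2.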
- logits = logits - logits_max.unsqueeze(dim=-1) - - vocab_size_per_partition = logits.size()[-1] - rank = gpc.get_local_rank(output_parallel_mode) - vocab_start = rank * vocab_size_per_partition - vocab_end = (rank + 1) * vocab_size_per_partition - 1 - - # loss[i] = 0 if targets[i] < vocab_start or targets[i] > vocab_end - target_mask = (targets < vocab_start) | (targets > vocab_end) - masked_target = targets.clone() - vocab_start - masked_target[target_mask] = 0 - arange_1d = torch.arange(start=0, end=logits.size()[0], device=get_current_device()) - predicted_logits = logits[arange_1d, masked_target] - predicted_logits = predicted_logits.clone().contiguous().view_as(targets) - predicted_logits[target_mask] = 0. - dist.all_reduce(predicted_logits, group=gpc.get_group(output_parallel_mode)) - - # Loss = log(sum(exp(logits))) - predicted-logit. - exp_logits = torch.exp(logits) - sum_exp_logits = exp_logits.sum(dim=-1) - dist.all_reduce(sum_exp_logits, group=gpc.get_group(output_parallel_mode)) - loss = torch.log(sum_exp_logits) - predicted_logits - - exp_logits.div_(sum_exp_logits.unsqueeze(dim=-1)) - ctx.save_for_backward(exp_logits, target_mask, masked_target) - - return loss - - @staticmethod - @custom_bwd - def backward(ctx, output_grad): - # Retrieve tensors from the forward path. - softmax, target_mask, masked_target = ctx.saved_tensors - - # All the inputs have softmax as their gradient. - input_grad = softmax - # For simplicity, work with the 2D gradient. - partition_vocab_size = softmax.size()[-1] - grad_2d = input_grad.view(-1, partition_vocab_size) - - # Add the gradient from matching classes. - arange_1d = torch.arange(start=0, end=grad_2d.size()[0], device=get_current_device()) - grad_2d[arange_1d, masked_target] -= (1.0 - target_mask.view(-1).float()) - input_grad.mul_(output_grad.unsqueeze(dim=-1)) - - return input_grad, None, None, None - - -@LOSSES.register_module -class VocabParallelCrossEntropyLoss3D(_Loss): - """ - Vocab parallel cross entropy loss for 3D parallelism - - :param reduction: whether to average the loss, defaults to True - - :type reduction: bool, optional - """ - - def __init__(self, reduction=True): - super().__init__() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - self.output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - self.reduction_mean = reduction - - def forward(self, logits, targets): - """Calculate loss between logits and targets - - :param logits: Output logits of model - :param targets: True targets from data - """ - targets = split_tensor_3d(targets, 0, self.weight_parallel_mode) - targets = split_tensor_3d(targets, 0, self.input_parallel_mode) - loss = _VocabParallelCrossEntropy3D.apply(logits, targets, self.output_parallel_mode) - if self.reduction_mean: - loss = loss.mean() - loss = reduce_by_batch_3d(loss, self.input_parallel_mode, self.weight_parallel_mode, True) - return loss diff --git a/colossalai/nn/loss/loss_moe.py b/colossalai/nn/loss/loss_moe.py deleted file mode 100644 index 50f42fcd32e7ae139afd7763ac8f7a0b28bc0764..0000000000000000000000000000000000000000 --- a/colossalai/nn/loss/loss_moe.py +++ /dev/null @@ -1,48 +0,0 @@ -import torch.nn as nn -from colossalai.registry import LOSSES -from torch.nn.modules.loss import _Loss -from colossalai.global_variables import moe_env - - -@LOSSES.register_module -class MoeCrossEntropyLoss(_Loss): - """torch.nn.CrossEntropyLoss with the MoE auxiliary loss added.
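(Editorial worked example, not from the original docstring: the value returned by forward is main + aux_weight * aux, so with the default aux_weight=0.01, a cross entropy of 2.3 and a router auxiliary loss of 0.5 give 2.3 + 0.01 * 0.5 = 2.305.)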
- - :param aux_weight: Weight of auxiliary loss in total loss - :param args: Args in CrossEntropyLoss - :param kwargs: Kwargs in CrossEntropyLoss - - :type aux_weight: float, optional - """ - def __init__(self, aux_weight: float = 0.01, *args, **kwargs): - super().__init__() - self.loss = nn.CrossEntropyLoss(*args, **kwargs) - self.aux_weight = aux_weight - - def forward(self, *args): - main_loss = self.loss(*args) - aux_loss = moe_env.get_loss() - return main_loss + self.aux_weight * aux_loss - - -@LOSSES.register_module -class MoeLoss(_Loss): - """A wrapper class that adds the MoE auxiliary loss to any loss function. - - :param aux_weight: Weight of auxiliary loss in total loss - :param loss_fn: Loss function - :param args: Args in loss function - :param kwargs: Kwargs in loss function - - :type aux_weight: float - :type loss_fn: Callable - """ - def __init__(self, aux_weight: float, loss_fn, *args, **kwargs): - super().__init__() - self.loss_fn = loss_fn(*args, **kwargs) - self.aux_weight = aux_weight - - def forward(self, *args, **kwargs): - main_loss = self.loss_fn(*args, **kwargs) - aux_loss = moe_env.get_loss() - return main_loss + self.aux_weight * aux_loss diff --git a/colossalai/nn/lr_scheduler/__init__.py b/colossalai/nn/lr_scheduler/__init__.py deleted file mode 100644 index fd44686f0e373766b837a32b2b92f5a37ba6822a..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -from .cosine import CosineAnnealingLR, CosineAnnealingWarmupLR, FlatAnnealingLR, FlatAnnealingWarmupLR -from .linear import LinearWarmupLR -from .multistep import MultiStepLR, MultiStepWarmupLR -from .onecycle import OneCycleLR -from .poly import PolynomialLR, PolynomialWarmupLR -from .torch import LambdaLR, MultiplicativeLR, StepLR, ExponentialLR - -__all__ = [ - 'CosineAnnealingLR', 'CosineAnnealingWarmupLR', 'FlatAnnealingLR', 'FlatAnnealingWarmupLR', 'LinearWarmupLR', - 'MultiStepLR', 'MultiStepWarmupLR', 'OneCycleLR', 'PolynomialLR', 'PolynomialWarmupLR', 'LambdaLR', - 'MultiplicativeLR', 'StepLR', - 'ExponentialLR' -] diff --git a/colossalai/nn/lr_scheduler/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 227dfdf8817672b2b980265d12d0ee9821ff6933..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index d33c95002a5fcc712cae48dcb85f5d363e217fbe..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/cosine.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/cosine.cpython-36.pyc deleted file mode 100644 index 69ba54d70b027a2b810df446253c9c4684b64d5f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/cosine.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/cosine.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/cosine.cpython-37.pyc deleted file mode 100644 index 6d671076fde9c134c3a2653de6c9d168e4f8e2c7..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/cosine.cpython-37.pyc and /dev/null differ diff --git
a/colossalai/nn/lr_scheduler/__pycache__/delayed.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/delayed.cpython-36.pyc deleted file mode 100644 index da80f28b888e902544fd2e64b20b3b088cb2b814..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/delayed.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/delayed.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/delayed.cpython-37.pyc deleted file mode 100644 index 223fc183cdd63d4d3cb85681444df4d0585688f9..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/delayed.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/linear.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/linear.cpython-36.pyc deleted file mode 100644 index aa50356a28e705754b28758f4a871b5f294925de..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/linear.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/linear.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/linear.cpython-37.pyc deleted file mode 100644 index 8fc060b2a28aa002694dbf5e048375415c59df17..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/linear.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/multistep.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/multistep.cpython-36.pyc deleted file mode 100644 index 86bf52779df47786911ec58f61dfebe40e1442ab..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/multistep.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/multistep.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/multistep.cpython-37.pyc deleted file mode 100644 index 8f037c6170990def5f702c98b29386895e3a2933..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/multistep.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/onecycle.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/onecycle.cpython-36.pyc deleted file mode 100644 index 71c3621d1e3eb20324948699cdb737ab7c34b8b3..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/onecycle.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/onecycle.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/onecycle.cpython-37.pyc deleted file mode 100644 index b9def7ef925cefe541781a1b6eeae0e0f20f21b0..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/onecycle.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/poly.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/poly.cpython-36.pyc deleted file mode 100644 index 278ba078a6c17276d645c6f8a1b46170bb88ddeb..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/poly.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/poly.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/poly.cpython-37.pyc deleted file mode 100644 index 68cd0dfa6f7947b91fb4144f9bf6492fe40c2fde..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/poly.cpython-37.pyc and /dev/null differ diff --git 
a/colossalai/nn/lr_scheduler/__pycache__/torch.cpython-36.pyc b/colossalai/nn/lr_scheduler/__pycache__/torch.cpython-36.pyc deleted file mode 100644 index bcde25b2e0a5e9f23fd1c66ce93bb46d50887c5d..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/torch.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/__pycache__/torch.cpython-37.pyc b/colossalai/nn/lr_scheduler/__pycache__/torch.cpython-37.pyc deleted file mode 100644 index 4a5fe9656c60124ec79d349f2d4b3b8461a01784..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/lr_scheduler/__pycache__/torch.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/lr_scheduler/cosine.py b/colossalai/nn/lr_scheduler/cosine.py deleted file mode 100644 index 6e14bf05bc2f74955868b1fdefcbf389214cf2a6..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/cosine.py +++ /dev/null @@ -1,129 +0,0 @@ -from torch.optim.lr_scheduler import CosineAnnealingLR as _CosineAnnealingLR - -from colossalai.registry import LR_SCHEDULERS -from .delayed import DelayerScheduler, WarmupDelayerScheduler, WarmupScheduler - - -@LR_SCHEDULERS.register_module -class CosineAnnealingLR(_CosineAnnealingLR): - r"""Set the learning rate of each parameter group using a cosine annealing - schedule, where :math:`\eta_{max}` is set to the initial lr and - :math:`T_{cur}` is the number of epochs since the last restart in SGDR: - - .. math:: - \begin{aligned} - \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 - + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), - & T_{cur} \neq (2k+1)T_{max}; \\ - \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) - \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), - & T_{cur} = (2k+1)T_{max}. - \end{aligned} - - When last_epoch=-1, sets initial lr as lr. Notice that because the schedule - is defined recursively, the learning rate can be simultaneously modified - outside this scheduler by other operators. If the learning rate is set - solely by this scheduler, the learning rate at each step becomes: - - .. math:: - \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + - \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right) - - It has been proposed in - `SGDR: Stochastic Gradient Descent with Warm Restarts`_. Note that this only - implements the cosine annealing part of SGDR, and not the restarts. - - .. _SGDR\: Stochastic Gradient Descent with Warm Restarts: - https://arxiv.org/abs/1608.03983 - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param eta_min: Minimum learning rate, defaults to 0 - :type eta_min: int, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, eta_min: int = 0, last_epoch: int = -1, **kwargs): - super().__init__(optimizer, total_steps, eta_min=eta_min, last_epoch=last_epoch) - - -@LR_SCHEDULERS.register_module -class CosineAnnealingWarmupLR(WarmupScheduler): - """Cosine annealing learning rate scheduler with learning rate warmup. A linear warmup schedule will be applied. 
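A hedged usage sketch for CosineAnnealingWarmupLR (editorial; the one-parameter model and SGD settings are assumptions, not from the repo):

import torch
from torch.optim import SGD

param = torch.nn.Parameter(torch.zeros(1))
optimizer = SGD([param], lr=0.1)
scheduler = CosineAnnealingWarmupLR(optimizer, total_steps=1000, warmup_steps=100)
for _ in range(1000):
    optimizer.step()
    scheduler.step()  # lr climbs linearly to 0.1 over the first 100 steps, then cosine-anneals toward eta_min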
- - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param warmup_steps: Number of warmup steps, defaults to 0 - :type warmup_steps: int, optional - :param eta_min: Minimum learning rate, defaults to 0 - :type eta_min: int, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, warmup_steps: int = 0, eta_min: float = 0., last_epoch: int = -1): - base_scheduler = _CosineAnnealingLR( - optimizer, total_steps - warmup_steps, eta_min=eta_min, last_epoch=last_epoch) - super().__init__(optimizer, warmup_steps, base_scheduler) - - -@LR_SCHEDULERS.register_module -class FlatAnnealingLR(DelayerScheduler): - """Flat and cosine annealing learning rate scheduler. The learning rate will be a fixed value before starting decay. - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param pct_start: Percent of steps before starting learning rate decay - :type pct_start: float - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, pct_start: float = 0.72, last_epoch: int = -1, **kwargs): - if not (0.0 <= pct_start <= 1.0): - raise ValueError( - f'pct_start must >= 0.0 and <= 1.0, got {pct_start}') - flat_steps = int(total_steps * pct_start) - anneal_steps = total_steps - flat_steps - base_scheduler = _CosineAnnealingLR( - optimizer, anneal_steps) - super().__init__(optimizer, flat_steps, base_scheduler, last_epoch=last_epoch) - - -@LR_SCHEDULERS.register_module -class FlatAnnealingWarmupLR(WarmupDelayerScheduler): - """Flat and cosine annealing learning rate scheduler with learning rate warmup. A linear warmup schedule will be - applied, and then the learning rate will be a fixed value before starting decay. 
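(Editorial worked example of the phase split implemented below: with total_steps=1000, warmup_steps=100 and the default pct_start=0.72, training warms up for 100 steps, holds the base lr flat for int(900 * 0.72) = 648 steps, and cosine-anneals over the remaining 252 steps.)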
- - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param warmup_steps: Number of warmup steps, defaults to 0 - :type warmup_steps: int, optional - :param pct_start: Percent of steps before starting learning rate decay - :type pct_start: float - :param eta_min: Minimum learning rate, defaults to 0 - :type eta_min: int, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, warmup_steps: int = 0, pct_start: float = 0.72, eta_min: int = 0, - last_epoch: int = -1, **kwargs): - if not (0.0 <= pct_start <= 1.0): - raise ValueError( - f'pct_start must >= 0.0 and <= 1.0, got {pct_start}') - flat_steps = int((total_steps - warmup_steps) * pct_start) - anneal_steps = total_steps - warmup_steps - flat_steps - base_scheduler = _CosineAnnealingLR( - optimizer, anneal_steps, eta_min=eta_min) - super().__init__(optimizer, warmup_steps, flat_steps, - base_scheduler, last_epoch=last_epoch) diff --git a/colossalai/nn/lr_scheduler/delayed.py b/colossalai/nn/lr_scheduler/delayed.py deleted file mode 100644 index daaeb81dddf7c0d986ddc9440810c98c7f4ee19d..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/delayed.py +++ /dev/null @@ -1,149 +0,0 @@ -from torch.optim.lr_scheduler import _LRScheduler - - -class _enable_get_lr_call: - def __init__(self, o): - self.o = o - - def __enter__(self): - self.o._get_lr_called_within_step = True - return self - - def __exit__(self, type, value, traceback): - self.o._get_lr_called_within_step = False - - -class DelayerScheduler(_LRScheduler): - """ Starts with a flat lr schedule until it reaches N epochs, then applies a scheduler - - :param optimizer: Wrapped optimizer. - :type optimizer: torch.optim.Optimizer - :param delay_epochs: Number of epochs to keep the initial lr before starting to apply the scheduler - :type delay_epochs: int - :param after_scheduler: After target_epoch, use this scheduler (e.g. ReduceLROnPlateau) - :type after_scheduler: torch.optim.lr_scheduler - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, delay_epochs, after_scheduler, last_epoch=-1): - if delay_epochs < 0: - raise ValueError(f'delay_epochs must >= 0, got {delay_epochs}') - self.delay_epochs = delay_epochs - self.after_scheduler = after_scheduler - self.finished = False - super().__init__(optimizer, last_epoch) - - def get_lr(self): - if self.last_epoch >= self.delay_epochs: - if not self.finished: - self.after_scheduler.base_lrs = self.base_lrs - self.finished = True - with _enable_get_lr_call(self.after_scheduler): - return self.after_scheduler.get_lr() - - return self.base_lrs - - def step(self, epoch=None): - if self.finished: - if epoch is None: - self.after_scheduler.step(None) - self._last_lr = self.after_scheduler.get_last_lr() - else: - self.after_scheduler.step(epoch - self.delay_epochs) - self._last_lr = self.after_scheduler.get_last_lr() - else: - return super(DelayerScheduler, self).step(epoch) - - -class WarmupScheduler(_LRScheduler): - """ Starts with a linear warmup lr schedule until it reaches N epochs, then applies a scheduler - - :param optimizer: Wrapped optimizer.
- :type optimizer: torch.optim.Optimizer - :param warmup_epochs: Number of epochs to linearly warm up the lr before starting to apply the scheduler - :type warmup_epochs: int - :param after_scheduler: After target_epoch, use this scheduler (e.g. ReduceLROnPlateau) - :type after_scheduler: torch.optim.lr_scheduler - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, warmup_epochs, after_scheduler, last_epoch=-1): - self.warmup_epochs = int(warmup_epochs) - self.after_scheduler = after_scheduler - self.finished = False - super().__init__(optimizer, last_epoch) - - def get_lr(self): - if self.last_epoch >= self.warmup_epochs: - if not self.finished: - self.after_scheduler.base_lrs = self.base_lrs - self.finished = True - return self.after_scheduler.get_lr() - - return [(self.last_epoch + 1) / self.warmup_epochs * lr for lr in self.base_lrs] - - def step(self, epoch=None): - if self.finished: - if epoch is None: - self.after_scheduler.step(None) - self._last_lr = self.after_scheduler.get_last_lr() - else: - self.after_scheduler.step(epoch - self.warmup_epochs) - self._last_lr = self.after_scheduler.get_last_lr() - else: - return super().step(epoch) - - -class WarmupDelayerScheduler(_LRScheduler): - """ Starts with a linear warmup lr schedule until it reaches N epochs and a flat lr schedule until it reaches M epochs, then applies a scheduler - - :param optimizer: Wrapped optimizer. - :type optimizer: torch.optim.Optimizer - :param warmup_epochs: Number of epochs to linearly warm up the lr before starting to apply the scheduler - :type warmup_epochs: int - :param delay_epochs: Number of epochs to keep the initial lr before starting to apply the scheduler - :type delay_epochs: int - :param after_scheduler: After target_epoch, use this scheduler (e.g. ReduceLROnPlateau) - :type after_scheduler: torch.optim.lr_scheduler - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, warmup_epochs, delay_epochs, after_scheduler, last_epoch=-1): - if delay_epochs < 0: - raise ValueError(f'delay_epochs must >= 0, got {delay_epochs}') - if warmup_epochs < 0: - raise ValueError(f'warmup_epochs must >= 0, got {warmup_epochs}') - self.warmup_epochs = warmup_epochs - self.delay_epochs = delay_epochs - self.after_scheduler = after_scheduler - self.finished = False - super().__init__(optimizer, last_epoch) - - def get_lr(self): - if self.last_epoch >= self.warmup_epochs + self.delay_epochs: - if not self.finished: - self.after_scheduler.base_lrs = self.base_lrs - # reset lr to base_lr - for group, base_lr in zip(self.optimizer.param_groups, self.base_lrs): - group['lr'] = base_lr - self.finished = True - with _enable_get_lr_call(self.after_scheduler): - return self.after_scheduler.get_lr() - elif self.last_epoch >= self.warmup_epochs: - return self.base_lrs - - return [(self.last_epoch + 1) / self.warmup_epochs * lr for lr in self.base_lrs] - - def step(self, epoch=None): - if self.finished: - if epoch is None: - self.after_scheduler.step(None) - self._last_lr = self.after_scheduler.get_last_lr() - else: - self.after_scheduler.step(epoch - self.warmup_epochs) - self._last_lr = self.after_scheduler.get_last_lr() - else: - return super().step(epoch) diff --git a/colossalai/nn/lr_scheduler/linear.py b/colossalai/nn/lr_scheduler/linear.py deleted file mode 100644 index 826e36ce15da1ef7fbc69290d9d18248a917c7c4..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/linear.py +++ /dev/null @@ -1,30 +0,0 @@ -from torch.optim.lr_scheduler import _LRScheduler - -from colossalai.registry import LR_SCHEDULERS - - -@LR_SCHEDULERS.register_module -class LinearWarmupLR(_LRScheduler): - """Linearly warms up the learning rate and then linearly decays it - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param warmup_steps: Number of warmup steps, defaults to 0 - :type warmup_steps: int, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, warmup_steps: int = 0, last_epoch: int = -1, **kwargs): - self.warmup_steps = warmup_steps - self.total_steps = total_steps - super().__init__(optimizer, last_epoch=last_epoch) - - def get_lr(self): - if self.last_epoch < self.warmup_steps: - return [(self.last_epoch + 1) / (self.warmup_steps + 1) * lr for lr in self.base_lrs] - else: - return [(self.total_steps - self.last_epoch) / (self.total_steps - self.warmup_steps) * lr for lr in - self.base_lrs] diff --git a/colossalai/nn/lr_scheduler/multistep.py b/colossalai/nn/lr_scheduler/multistep.py deleted file mode 100644 index e9a672b720d61e05f725eb28cc84e648e23a2046..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/multistep.py +++ /dev/null @@ -1,62 +0,0 @@ -from typing import List - -from torch.optim.lr_scheduler import MultiStepLR as _MultiStepLR - -from colossalai.registry import LR_SCHEDULERS -from .delayed import WarmupScheduler - - -@LR_SCHEDULERS.register_module -class MultiStepLR(_MultiStepLR): - """Decays the learning rate of each parameter group by gamma once the - number of epochs reaches one of the milestones.
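(Editorial worked example for LinearWarmupLR above: with total_steps=100, warmup_steps=10 and base lr 0.1, step t < 10 uses lr = (t + 1) / 11 * 0.1, and step t >= 10 uses lr = (100 - t) / 90 * 0.1, ending near 0.0011 at the final step.)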
Notice that such decay can - happen simultaneously with other changes to the learning rate from outside - this scheduler. When last_epoch=-1, sets initial lr as lr. - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param milestones: List of epoch indices. Must be increasing, defaults to None - :type milestones: List[int], optional - :param gamma: Multiplicative factor of learning rate decay, defaults to 0.1 - :type gamma: float, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, milestones: List[int] = None, gamma: float = 0.1, last_epoch: int = -1, **kwargs): - super().__init__(optimizer, milestones, gamma=gamma, last_epoch=last_epoch) - - -@LR_SCHEDULERS.register_module -class MultiStepWarmupLR(WarmupScheduler): - """Multi-step learning rate scheduler with warmup. - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param warmup_steps: Number of warmup steps, defaults to 0 - :type warmup_steps: int, optional - :param milestones: List of epoch indices. Must be increasing, defaults to None - :type milestones: List[int], optional - :param gamma: Multiplicative factor of learning rate decay, defaults to 0.1 - :type gamma: float, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, warmup_steps: int = 0, milestones: List[int] = None, - gamma: float = 0.1, last_epoch: int = -1, **kwargs): - if milestones is None or len(milestones) == 0: - raise ValueError('milestones cannot be None or empty') - milestones = [ - v - warmup_steps for v in milestones if v >= warmup_steps] - base_scheduler = _MultiStepLR(optimizer, milestones=milestones, - gamma=gamma) - super().__init__(optimizer, warmup_steps, base_scheduler, last_epoch=last_epoch) diff --git a/colossalai/nn/lr_scheduler/onecycle.py b/colossalai/nn/lr_scheduler/onecycle.py deleted file mode 100644 index 2c25647effc040f5572ce428412e8000c9d47ac0..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/onecycle.py +++ /dev/null @@ -1,91 +0,0 @@ -from torch.optim.lr_scheduler import OneCycleLR as _OneCycleLR - -from colossalai.registry import LR_SCHEDULERS - - -@LR_SCHEDULERS.register_module -class OneCycleLR(_OneCycleLR): - r"""Sets the learning rate of each parameter group according to the - 1cycle learning rate policy. The 1cycle policy anneals the learning - rate from an initial learning rate to some maximum learning rate and then - from that maximum learning rate to some minimum learning rate much lower - than the initial learning rate. - This policy was initially described in the paper `Super-Convergence: - Very Fast Training of Neural Networks Using Large Learning Rates`_. - The 1cycle learning rate policy changes the learning rate after every batch. - `step` should be called after a batch has been used for training. - This scheduler is not chainable.
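(Editorial note on MultiStepWarmupLR above: milestones are given as global step indices and shifted into the wrapped scheduler's frame, so warmup_steps=100 with milestones=[300, 800] hands _MultiStepLR the list [200, 700], and gamma is applied 300 and 800 steps into training as expected.)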
- Note also that the total number of steps in the cycle can be determined in one - of two ways (listed in order of precedence): - - * A value for total_steps is explicitly provided. - * A number of epochs (epochs) and a number of steps per epoch (steps_per_epoch) are provided. - In this case, the number of total steps is inferred by total_steps = epochs * steps_per_epoch - - You must either provide a value for total_steps or provide a value for both - epochs and steps_per_epoch. - The default behaviour of this scheduler follows the fastai implementation of 1cycle, which - claims that "unpublished work has shown even better results by using only two phases". To - mimic the behaviour of the original paper instead, set ``three_phase=True``. - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param pct_start: The percentage of the cycle (in number of steps) spent increasing the learning rate, defaults to 0.3 - :type pct_start: float, optional - :param anneal_strategy: {'cos', 'linear'} - Specifies the annealing strategy: "cos" for cosine annealing, "linear" for - linear annealing, defaults to 'cos' - :type anneal_strategy: str, optional - :param cycle_momentum: If ``True``, momentum is cycled inversely - to learning rate between 'base_momentum' and 'max_momentum', defaults to True - :type cycle_momentum: bool, optional - :param base_momentum: Lower momentum boundaries in the cycle - for each parameter group. Note that momentum is cycled inversely - to learning rate; at the peak of a cycle, momentum is - 'base_momentum' and learning rate is 'max_lr', defaults to 0.85 - :type base_momentum: float, optional - :param max_momentum: Upper momentum boundaries in the cycle - for each parameter group. Functionally, - it defines the cycle amplitude (max_momentum - base_momentum). - Note that momentum is cycled inversely - to learning rate; at the start of a cycle, momentum is 'max_momentum' - and learning rate is 'base_lr', defaults to 0.95 - :type max_momentum: float, optional - :param div_factor: Determines the initial learning rate via - initial_lr = max_lr/div_factor, defaults to 25.0 - :type div_factor: float, optional - :param final_div_factor: Determines the minimum learning rate via - min_lr = initial_lr/final_div_factor, defaults to 10000.0 - :type final_div_factor: float, optional - :param last_epoch: The index of the last batch. This parameter is used when - resuming a training job. Since `step()` should be invoked after each - batch instead of after each epoch, this number represents the total - number of *batches* computed, not the total number of epochs computed. - When last_epoch=-1, the schedule is started from the beginning, defaults to -1 - :type last_epoch: int, optional - - .. 
_Super-Convergence\: Very Fast Training of Neural Networks Using Large Learning Rates: - https://arxiv.org/abs/1708.07120 - """ - - def __init__(self, optimizer, total_steps: int, - pct_start=0.3, - anneal_strategy='cos', - cycle_momentum=True, - base_momentum=0.85, - max_momentum=0.95, - div_factor=25.0, - final_div_factor=10000.0, - last_epoch=-1, **kwargs): - max_lrs = list(map(lambda group: group['lr'], optimizer.param_groups)) - super().__init__(optimizer, max_lrs, total_steps=total_steps, - pct_start=pct_start, - anneal_strategy=anneal_strategy, - cycle_momentum=cycle_momentum, - base_momentum=base_momentum, - max_momentum=max_momentum, - div_factor=div_factor, - final_div_factor=final_div_factor, - last_epoch=last_epoch) diff --git a/colossalai/nn/lr_scheduler/poly.py b/colossalai/nn/lr_scheduler/poly.py deleted file mode 100644 index 8347a83dfdd2d9e903a81d0bdcd3cd6aca228a0f..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/poly.py +++ /dev/null @@ -1,65 +0,0 @@ -from torch.optim.lr_scheduler import _LRScheduler - -from colossalai.registry import LR_SCHEDULERS -from .delayed import WarmupScheduler - - -@LR_SCHEDULERS.register_module -class PolynomialLR(_LRScheduler): - """Polynomial learning rate scheduler. - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param end_lr: Minimum learning rate, defaults to 0.0001 - :type end_lr: float, optional - :param power: The power of polynomial, defaults to 1.0 - :type power: float, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, end_lr: float = 0.0001, power: float = 1.0, last_epoch: int = -1, - **kwargs): - if end_lr < 0: - raise ValueError(f'end_lr must >= 0, got {end_lr}') - self.total_steps = total_steps - self.end_lr = end_lr - self.power = power - super().__init__(optimizer, last_epoch=last_epoch) - - def get_lr(self): - return self._get_closed_form_lr() - - def _get_closed_form_lr(self): - return [ - (base_lr - self.end_lr) * ((1 - min(self.last_epoch, self.total_steps) / - self.total_steps) ** self.power) + self.end_lr - for base_lr in self.base_lrs - ] - - -@LR_SCHEDULERS.register_module -class PolynomialWarmupLR(WarmupScheduler): - """Polynomial learning rate scheduler with warmup. 
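(Editorial worked example for PolynomialLR above: its closed form is lr(t) = (base_lr - end_lr) * (1 - t / total_steps) ** power + end_lr, so base_lr=0.1, end_lr=0.0, power=2.0 and total_steps=100 give lr(50) = 0.1 * 0.25 = 0.025 and lr(100) = 0.)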
- - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param warmup_steps: Number of warmup steps, defaults to 0 - :type warmup_steps: int, optional - :param end_lr: Minimum learning rate, defaults to 0.0001 - :type end_lr: float, optional - :param power: The power of polynomial, defaults to 1.0 - :type power: float, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps: int, warmup_steps: int = 0, end_lr: float = 0.0001, power: float = 1.0, - last_epoch: int = -1, **kwargs): - base_scheduler = PolynomialLR( - optimizer, total_steps - warmup_steps, end_lr=end_lr, power=power) - super().__init__(optimizer, warmup_steps, base_scheduler, last_epoch=last_epoch) diff --git a/colossalai/nn/lr_scheduler/torch.py b/colossalai/nn/lr_scheduler/torch.py deleted file mode 100644 index c02297aff34d7c6a55f66a92fa22ac6326d7d908..0000000000000000000000000000000000000000 --- a/colossalai/nn/lr_scheduler/torch.py +++ /dev/null @@ -1,92 +0,0 @@ -from torch.optim.lr_scheduler import LambdaLR as _LambdaLR -from torch.optim.lr_scheduler import MultiplicativeLR as _MultiplicativeLR -from torch.optim.lr_scheduler import StepLR as _StepLR -from torch.optim.lr_scheduler import ExponentialLR as _ExponentialLR - -from colossalai.registry import LR_SCHEDULERS - - -@LR_SCHEDULERS.register_module -class LambdaLR(_LambdaLR): - """Sets the learning rate of each parameter group to the initial lr - times a given function. When last_epoch=-1, sets initial lr as lr. - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param lr_lambda: A function which computes a multiplicative - factor given an integer parameter epoch, or a list of such - functions, one for each group in optimizer.param_groups, defaults to None - :type lr_lambda: function or list, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps, lr_lambda=None, last_epoch: int = -1) -> None: - super().__init__(optimizer, lr_lambda, last_epoch=last_epoch) - - -@LR_SCHEDULERS.register_module -class MultiplicativeLR(_MultiplicativeLR): - """Multiply the learning rate of each parameter group by the factor given - in the specified function. When last_epoch=-1, sets initial lr as lr - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param lr_lambda: A function which computes a multiplicative - factor given an integer parameter epoch, or a list of such - functions, one for each group in optimizer.param_groups, defaults to None - :type lr_lambda: function or list, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps, lr_lambda=None, last_epoch: int = -1) -> None: - super().__init__(optimizer, lr_lambda, last_epoch=last_epoch) - - -@LR_SCHEDULERS.register_module -class StepLR(_StepLR): - """Decays the learning rate of each parameter group by gamma every - step_size epochs. Notice that such decay can happen simultaneously with - other changes to the learning rate from outside this scheduler. 
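A hedged usage sketch for the LambdaLR wrapper above (editorial; the dummy optimizer is an assumption, and total_steps appears to be accepted only so configs stay uniform, as it is not forwarded to torch):

import torch
from torch.optim import SGD

optimizer = SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
scheduler = LambdaLR(optimizer, total_steps=1000, lr_lambda=lambda epoch: 0.95 ** epoch)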
When - last_epoch=-1, sets initial lr as lr - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param step_size: Period of learning rate decay, defaults to 1 - :type step_size: int, optional - :param gamma: Multiplicative factor of learning rate decay, defaults to 0.1 - :type gamma: float, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps, step_size: int = 1, gamma: float = 0.1, last_epoch: int = -1) -> None: - super().__init__(optimizer, step_size, - gamma=gamma, last_epoch=last_epoch) - - -@LR_SCHEDULERS.register_module -class ExponentialLR(_ExponentialLR): - """Decays the learning rate of each parameter group by gamma every epoch. - When last_epoch=-1, sets initial lr as lr - - :param optimizer: Wrapped optimizer - :type optimizer: torch.optim.Optimizer - :param total_steps: Number of total training steps - :type total_steps: int - :param gamma: Multiplicative factor of learning rate decay, defaults to 1.0 - :type gamma: float, optional - :param last_epoch: The index of last epoch, defaults to -1 - :type last_epoch: int, optional - """ - - def __init__(self, optimizer, total_steps, gamma: float = 1.0, - last_epoch: int = -1) -> None: - super().__init__(optimizer, gamma, last_epoch=last_epoch) diff --git a/colossalai/nn/metric/__init__.py b/colossalai/nn/metric/__init__.py deleted file mode 100644 index 00833b6119c161be0bd2855a1b44b333f2b93f66..0000000000000000000000000000000000000000 --- a/colossalai/nn/metric/__init__.py +++ /dev/null @@ -1,26 +0,0 @@ -from torch import nn - -from ._utils import calc_acc -from .accuracy_2d import Accuracy2D -from .accuracy_2p5d import Accuracy2p5D -from .accuracy_3d import Accuracy3D -from colossalai.nn.layer.utils import get_tensor_parallel_mode - -_parallel_accuracy = { - '2d': Accuracy2D, - '2.5d': Accuracy2p5D, - '3d': Accuracy3D, -} - - -class Accuracy(nn.Module): - def __init__(self): - super().__init__() - tensor_parallel = get_tensor_parallel_mode() - if tensor_parallel not in _parallel_accuracy: - self.acc = calc_acc - else: - self.acc = _parallel_accuracy[tensor_parallel]() - - def forward(self, *args): - return self.acc(*args) diff --git a/colossalai/nn/metric/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/metric/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index dbe3d12405da55a7147aee0a6e9912ff8a3cf027..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/metric/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 2d225ccf976965b5140eae7074cab0537616b660..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/_utils.cpython-36.pyc b/colossalai/nn/metric/__pycache__/_utils.cpython-36.pyc deleted file mode 100644 index 48219dc4f7044701038f30837dafbb0d8c312aaf..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/_utils.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/_utils.cpython-37.pyc b/colossalai/nn/metric/__pycache__/_utils.cpython-37.pyc deleted file mode 100644 index 
db17d09060a6887018805756c4ff02b657cc966a..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/_utils.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/accuracy_2d.cpython-36.pyc b/colossalai/nn/metric/__pycache__/accuracy_2d.cpython-36.pyc deleted file mode 100644 index e5243d133278a6536435d2bb29a06a8cebfd5f53..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/accuracy_2d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/accuracy_2d.cpython-37.pyc b/colossalai/nn/metric/__pycache__/accuracy_2d.cpython-37.pyc deleted file mode 100644 index 8620aa60579b22f39651b16ed212792ee984d782..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/accuracy_2d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/accuracy_2p5d.cpython-36.pyc b/colossalai/nn/metric/__pycache__/accuracy_2p5d.cpython-36.pyc deleted file mode 100644 index 4fc88562f6bd1d00c9edc1b78d408865b33c49ee..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/accuracy_2p5d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/accuracy_2p5d.cpython-37.pyc b/colossalai/nn/metric/__pycache__/accuracy_2p5d.cpython-37.pyc deleted file mode 100644 index b9c5fd793473025b7275e726e3201865141857a7..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/accuracy_2p5d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/accuracy_3d.cpython-36.pyc b/colossalai/nn/metric/__pycache__/accuracy_3d.cpython-36.pyc deleted file mode 100644 index 2d79c03509a2e7496f5159a55f4a91d7d2fe5c21..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/accuracy_3d.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/metric/__pycache__/accuracy_3d.cpython-37.pyc b/colossalai/nn/metric/__pycache__/accuracy_3d.cpython-37.pyc deleted file mode 100644 index 3e983002ea65b489666fb63a0a221e6bd40cb22f..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/metric/__pycache__/accuracy_3d.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/metric/_utils.py b/colossalai/nn/metric/_utils.py deleted file mode 100644 index d4a69f9430205b8362ec896936f3356e8141b9c0..0000000000000000000000000000000000000000 --- a/colossalai/nn/metric/_utils.py +++ /dev/null @@ -1,6 +0,0 @@ -import torch - -def calc_acc(logits, targets): - preds = torch.argmax(logits, dim=-1) - correct = torch.sum(targets == preds) - return correct diff --git a/colossalai/nn/metric/accuracy_2d.py b/colossalai/nn/metric/accuracy_2d.py deleted file mode 100644 index 4a3eb6f7ab3c02b9ce6bdab65523318cd74c0631..0000000000000000000000000000000000000000 --- a/colossalai/nn/metric/accuracy_2d.py +++ /dev/null @@ -1,24 +0,0 @@ -import torch -from colossalai.nn.layer.parallel_2d import reduce_by_batch_2d, split_tensor_2d -from torch import nn - -from ._utils import calc_acc - - -class Accuracy2D(nn.Module): - """Accuracy for 2D parallelism - """ - def __init__(self): - super().__init__() - - def forward(self, logits, targets): - """Calculate the accuracy of predicted labels. 
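(Editorial note on calc_acc above: it returns the count of correct predictions as a tensor rather than a ratio, e.g. calc_acc(torch.tensor([[2., 1.], [0., 3.]]), torch.tensor([0, 1])) is tensor(2); the parallel Accuracy wrappers reduce this count over the batch group, and the caller is expected to divide by the global batch size.)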
- - :param logits: Predicted labels - :param targets: True labels from data - """ - with torch.no_grad(): - targets = split_tensor_2d(targets) - correct = calc_acc(logits, targets) - correct = reduce_by_batch_2d(correct) - return correct diff --git a/colossalai/nn/metric/accuracy_2p5d.py b/colossalai/nn/metric/accuracy_2p5d.py deleted file mode 100644 index 0eeedd46fa3718de26b9bcb4a80044fb7d8dfb33..0000000000000000000000000000000000000000 --- a/colossalai/nn/metric/accuracy_2p5d.py +++ /dev/null @@ -1,24 +0,0 @@ -import torch -from colossalai.nn.layer.parallel_2p5d import reduce_by_batch_2p5d, split_tensor_2p5d -from torch import nn - -from ._utils import calc_acc - - -class Accuracy2p5D(nn.Module): - """Accuracy for 2p5D parallelism - """ - def __init__(self): - super().__init__() - - def forward(self, logits, targets): - """Calculate the accuracy of predicted labels. - - :param logits: Predicted labels - :param targets: True labels from data - """ - with torch.no_grad(): - targets = split_tensor_2p5d(targets) - correct = calc_acc(logits, targets) - correct = reduce_by_batch_2p5d(correct) - return correct diff --git a/colossalai/nn/metric/accuracy_3d.py b/colossalai/nn/metric/accuracy_3d.py deleted file mode 100644 index e24219e64dd183d31def0b8af5a94b2999d8fe0a..0000000000000000000000000000000000000000 --- a/colossalai/nn/metric/accuracy_3d.py +++ /dev/null @@ -1,29 +0,0 @@ -import torch -from colossalai.constants import INPUT_GROUP_3D, WEIGHT_GROUP_3D -from colossalai.nn.layer.parallel_3d import reduce_by_batch_3d, split_tensor_3d -from colossalai.nn.layer.parallel_3d._utils import get_parallel_mode_from_env -from torch import nn - -from ._utils import calc_acc - - -class Accuracy3D(nn.Module): - """Accuracy for 3D parallelism - """ - def __init__(self): - super().__init__() - self.input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - self.weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - - def forward(self, logits, targets): - """Calculate the accuracy of predicted labels.
- - :param logits: Predicted labels - :param targets: True labels from data - """ - with torch.no_grad(): - targets = split_tensor_3d(targets, 0, self.weight_parallel_mode) - targets = split_tensor_3d(targets, 0, self.input_parallel_mode) - correct = calc_acc(logits, targets) - correct = reduce_by_batch_3d(correct, self.input_parallel_mode, self.weight_parallel_mode) - return correct diff --git a/colossalai/nn/model/__init__.py b/colossalai/nn/model/__init__.py deleted file mode 100644 index 6ced1705408edccf73ab37ec8752dae3cd6b8bff..0000000000000000000000000000000000000000 --- a/colossalai/nn/model/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from .model_from_config import ModelFromConfig - -__all__ = ['ModelFromConfig'] diff --git a/colossalai/nn/model/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/model/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 5bf02f7a4bee1b850c8eb0a30c8ff0e6bdffc940..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/model/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/model/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/model/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 4cf46c81bcf1e07625645f349d4a5a8d026f7da6..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/model/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/model/__pycache__/model_from_config.cpython-36.pyc b/colossalai/nn/model/__pycache__/model_from_config.cpython-36.pyc deleted file mode 100644 index 6a80f8589827698f706c3c6a21db20167f9995a8..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/model/__pycache__/model_from_config.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/model/__pycache__/model_from_config.cpython-37.pyc b/colossalai/nn/model/__pycache__/model_from_config.cpython-37.pyc deleted file mode 100644 index 0c4df6d677adfe66758ab322fa5229a03984f8d0..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/model/__pycache__/model_from_config.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/model/model_from_config.py b/colossalai/nn/model/model_from_config.py deleted file mode 100644 index 24903ca3607d7ca2c0e2e0e2cf4aa54e9a273472..0000000000000000000000000000000000000000 --- a/colossalai/nn/model/model_from_config.py +++ /dev/null @@ -1,37 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC, abstractmethod - -import torch.nn as nn - -from colossalai.builder import build_layer - - -class ModelFromConfig(nn.Module, ABC): - - def __init__(self): - super(ModelFromConfig, self).__init__() - self.layers = nn.ModuleList() - self.layers_cfg = [] - - def build_from_cfg(self, start=None, end=None): - assert hasattr(self, 'layers_cfg'), 'Cannot find attribute layers_cfg from the module, please check the ' \ - 'spelling and if you have initialized this variable' - if start is None: - start = 0 - if end is None: - end = len(self.layers_cfg) - for cfg in self.layers_cfg[start: end]: - layer = build_layer(cfg) - self.layers.append(layer) - - @abstractmethod - def init_weights(self): - pass - - def state_dict_for_save_checkpoint(self, destination=None, prefix='', - keep_vars=False): - """Use this function to override the state dict for - saving checkpoints.""" - return self.state_dict(destination, prefix, keep_vars) diff --git a/colossalai/nn/optimizer/__init__.py b/colossalai/nn/optimizer/__init__.py deleted file mode 100644 index 
c084c5c8671ddcde96135478ae41be12d2159f4c..0000000000000000000000000000000000000000 --- a/colossalai/nn/optimizer/__init__.py +++ /dev/null @@ -1,10 +0,0 @@ -from .colossalai_optimizer import ColossalaiOptimizer -from .fused_adam import FusedAdam -from .fused_lamb import FusedLAMB -from .fused_sgd import FusedSGD -from .lamb import Lamb -from .lars import Lars - -__all__ = [ - 'ColossalaiOptimizer', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'Lamb', 'Lars' -] diff --git a/colossalai/nn/optimizer/__pycache__/__init__.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 52b6a0fb757485cd4567e248be8421af683c0ab4..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/__init__.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 420bc09692657f3f07a4c2e9cf4324a4b5487f08..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/colossalai_optimizer.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/colossalai_optimizer.cpython-36.pyc deleted file mode 100644 index ac65278a3aedb1f6f3849c107958948f6d0eb777..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/colossalai_optimizer.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/colossalai_optimizer.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/colossalai_optimizer.cpython-37.pyc deleted file mode 100644 index 669b1f65fa4b72283ae6b7f44ff49f47567fabe5..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/colossalai_optimizer.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/fused_adam.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/fused_adam.cpython-36.pyc deleted file mode 100644 index 0616922151b226f1cf2d1229534541d281e794e9..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/fused_adam.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/fused_adam.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/fused_adam.cpython-37.pyc deleted file mode 100644 index d798b63d80840c322946c187eead9b013e779694..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/fused_adam.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/fused_lamb.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/fused_lamb.cpython-36.pyc deleted file mode 100644 index 93dd8260dce2ed1211fc46756681255ad124e560..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/fused_lamb.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/fused_lamb.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/fused_lamb.cpython-37.pyc deleted file mode 100644 index ef350d697e04199d249929c30cca0b0530622c61..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/fused_lamb.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/fused_sgd.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/fused_sgd.cpython-36.pyc deleted file mode 100644 index 
9773e9d834faa89b954c2c338766bfa63b9e3175..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/fused_sgd.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/fused_sgd.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/fused_sgd.cpython-37.pyc deleted file mode 100644 index 6ad56ea8b4dc58d0bed2ab4a323cba83a3dd7ece..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/fused_sgd.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/lamb.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/lamb.cpython-36.pyc deleted file mode 100644 index 4f8200306885169f6248ce69fa09b6d3ccb538db..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/lamb.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/lamb.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/lamb.cpython-37.pyc deleted file mode 100644 index 1b6a722502688c72f7fb4bc5d3748858fe4eebc7..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/lamb.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/lars.cpython-36.pyc b/colossalai/nn/optimizer/__pycache__/lars.cpython-36.pyc deleted file mode 100644 index e5502389de46cb8efd1de6cc94943364c2dc7d94..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/lars.cpython-36.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/__pycache__/lars.cpython-37.pyc b/colossalai/nn/optimizer/__pycache__/lars.cpython-37.pyc deleted file mode 100644 index f1076b648ac91f83dbb6a0ae7eb9d42a85f59d9c..0000000000000000000000000000000000000000 Binary files a/colossalai/nn/optimizer/__pycache__/lars.cpython-37.pyc and /dev/null differ diff --git a/colossalai/nn/optimizer/colossalai_optimizer.py b/colossalai/nn/optimizer/colossalai_optimizer.py deleted file mode 100644 index fb0c439035098b4e017830f870275fa66a7a1e8c..0000000000000000000000000000000000000000 --- a/colossalai/nn/optimizer/colossalai_optimizer.py +++ /dev/null @@ -1,47 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -import torch.nn as nn -from torch import Tensor -from torch.optim import Optimizer -from colossalai.utils import clip_grad_norm_fp32 - - -class ColossalaiOptimizer(Optimizer): - - def __init__(self, optim: Optimizer): - self.optim = optim - - @property - def param_groups(self): - return self.optim.param_groups - - @property - def defaults(self): - return self.optim.defaults - - def add_param_group(self, *args, **kwargs): - return self.optim.add_param_group(*args, **kwargs) - - def step(self, *args, **kwargs): - return self.optim.step(*args, **kwargs) - - def zero_grad(self, *args, **kwargs): - self.optim.zero_grad(*args, **kwargs) - - def load_state_dict(self, *args, **kwargs): - self.optim.load_state_dict(*args, **kwargs) - - def state_dict(self): - return self.optim.state_dict() - - def backward(self, loss: Tensor): - loss.backward() - - def backward_by_grad(self, tensor: Tensor, grad: Tensor): - torch.autograd.backward(tensors=tensor, grad_tensors=grad) - - def clip_grad_norm(self, model: nn.Module, max_norm: float): - if max_norm > 0.0: - clip_grad_norm_fp32(model.parameters(), max_norm) diff --git a/colossalai/nn/optimizer/fused_adam.py b/colossalai/nn/optimizer/fused_adam.py deleted file mode 100644 index cb75d073b685bda5883155bb7ba587a3b58decf0..0000000000000000000000000000000000000000 --- 
a/colossalai/nn/optimizer/fused_adam.py +++ /dev/null @@ -1,162 +0,0 @@ -# modified from https://github.com/NVIDIA/apex/blob/master/apex/optimizers/fused_adam.py -import torch - -from colossalai.registry import OPTIMIZERS -from colossalai.utils import multi_tensor_applier - - -@OPTIMIZERS.register_module -class FusedAdam(torch.optim.Optimizer): - """Implements Adam algorithm. - - Currently GPU-only. Requires ColossalAI to be installed via - ``pip install -v --no-cache-dir --global-option="--cuda_ext" ./``. - - This version of fused Adam implements 2 fusions. - - * Fusion of the Adam update's elementwise operations - * A multi-tensor apply launch that batches the elementwise updates applied to all the model's parameters into one or a few kernel launches. - - :class:`colossalai.nn.optimizer.FusedAdam` may be used as a drop-in replacement for ``torch.optim.AdamW``, - or ``torch.optim.Adam`` with ``adam_w_mode=False``. - - :class:`colossalai.nn.optimizer.FusedAdam` may be used with or without Amp. - - Adam was proposed in `Adam: A Method for Stochastic Optimization`_. - - Arguments: - params (iterable): iterable of parameters to optimize or dicts defining - parameter groups. - lr (float, optional): learning rate. (default: 1e-3) - betas (Tuple[float, float], optional): coefficients used for computing - running averages of gradient and its square. (default: (0.9, 0.999)) - eps (float, optional): term added to the denominator to improve - numerical stability. (default: 1e-8) - weight_decay (float, optional): weight decay (L2 penalty) (default: 0) - amsgrad (boolean, optional): whether to use the AMSGrad variant of this - algorithm from the paper `On the Convergence of Adam and Beyond`_ - (default: False) NOT SUPPORTED in FusedAdam! - adam_w_mode (boolean, optional): Apply L2 regularization or weight decay - True for decoupled weight decay (also known as AdamW) (default: True) - set_grad_none (bool, optional): whether to set grad to None when zero_grad() - method is called. (default: True) - - .. _Adam\: A Method for Stochastic Optimization: - https://arxiv.org/abs/1412.6980 - .. _On the Convergence of Adam and Beyond: - https://openreview.net/forum?id=ryQu7f-RZ - """ - - def __init__(self, params, lr=1e-3, bias_correction=True, - betas=(0.9, 0.999), eps=1e-8, adam_w_mode=True, - weight_decay=0., amsgrad=False, set_grad_none=True): - - if amsgrad: - raise RuntimeError( - 'FusedAdam does not support the AMSGrad variant.') - defaults = dict(lr=lr, bias_correction=bias_correction, - betas=betas, eps=eps, weight_decay=weight_decay) - super(FusedAdam, self).__init__(params, defaults) - self.adam_w_mode = 1 if adam_w_mode else 0 - self.set_grad_none = set_grad_none - if multi_tensor_applier.available: - import colossal_C - # Skip buffer - self._dummy_overflow_buf = torch.cuda.IntTensor([0]) - self.multi_tensor_adam = colossal_C.multi_tensor_adam - else: - raise RuntimeError('FusedAdam requires cuda extensions') - - def zero_grad(self): - if self.set_grad_none: - for group in self.param_groups: - for p in group['params']: - p.grad = None - else: - super(FusedAdam, self).zero_grad() - - def step(self, closure=None, grads=None, output_params=None, scale=None, grad_norms=None): - """Performs a single optimization step. - - Arguments: - closure (callable, optional): A closure that reevaluates the model - and returns the loss. - - The remaining arguments are deprecated, and are only retained (for the moment) for error-checking purposes.
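For reference, a minimal usage sketch of ``FusedAdam`` as an AdamW-style drop-in (this assumes the ``colossal_C`` CUDA extension was built at install time; the model and sizes below are placeholders):

```python
import torch
from colossalai.nn.optimizer import FusedAdam

# placeholder model; FusedAdam is GPU-only, so parameters must live on CUDA
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3,
                      weight_decay=0.01, adam_w_mode=True)  # ~ torch.optim.AdamW

loss = model(torch.randn(8, 1024, device='cuda')).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()  # call with no arguments; the deprecated ones raise RuntimeError
```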
- """ - if any(p is not None for p in [grads, output_params, scale, grad_norms]): - raise RuntimeError( - 'FusedAdam has been updated. Simply initialize it identically to torch.optim.Adam, and call step() with no arguments.') - loss = None - if closure is not None: - loss = closure() - - for group in self.param_groups: - bias_correction = 1 if group['bias_correction'] else 0 - beta1, beta2 = group['betas'] - - # assume same step across group now to simplify things - # per parameter step can be easily support by making it tensor, or pass list into kernel - if 'step' in group: - group['step'] += 1 - else: - group['step'] = 1 - - # create lists for multi-tensor apply - g_16, p_16, m_16, v_16 = [], [], [], [] - g_32, p_32, m_32, v_32 = [], [], [], [] - - for p in group['params']: - if p.grad is None: - continue - if p.grad.data.is_sparse: - raise RuntimeError( - 'FusedAdam does not support sparse gradients, please consider SparseAdam instead') - - state = self.state[p] - # State initialization - if len(state) == 0: - # Exponential moving average of gradient values - state['exp_avg'] = torch.zeros_like(p.data) - # Exponential moving average of squared gradient values - state['exp_avg_sq'] = torch.zeros_like(p.data) - - if p.dtype == torch.float16: - g_16.append(p.grad.data) - p_16.append(p.data) - m_16.append(state['exp_avg']) - v_16.append(state['exp_avg_sq']) - elif p.dtype == torch.float32: - g_32.append(p.grad.data) - p_32.append(p.data) - m_32.append(state['exp_avg']) - v_32.append(state['exp_avg_sq']) - else: - raise RuntimeError('FusedAdam only support fp16 and fp32.') - - if (len(g_16) > 0): - multi_tensor_applier(self.multi_tensor_adam, - self._dummy_overflow_buf, - [g_16, p_16, m_16, v_16], - group['lr'], - beta1, - beta2, - group['eps'], - group['step'], - self.adam_w_mode, - bias_correction, - group['weight_decay']) - if (len(g_32) > 0): - multi_tensor_applier(self.multi_tensor_adam, - self._dummy_overflow_buf, - [g_32, p_32, m_32, v_32], - group['lr'], - beta1, - beta2, - group['eps'], - group['step'], - self.adam_w_mode, - bias_correction, - group['weight_decay']) - - return loss diff --git a/colossalai/nn/optimizer/fused_lamb.py b/colossalai/nn/optimizer/fused_lamb.py deleted file mode 100644 index dfbcff71781903c28addaee1b3d182ea837b527d..0000000000000000000000000000000000000000 --- a/colossalai/nn/optimizer/fused_lamb.py +++ /dev/null @@ -1,211 +0,0 @@ -# modified from https://github.com/NVIDIA/apex/blob/master/apex/optimizers/fused_lamb.py -import torch - -from colossalai.registry import OPTIMIZERS -from colossalai.utils import multi_tensor_applier - - -@OPTIMIZERS.register_module -class FusedLAMB(torch.optim.Optimizer): - """Implements LAMB algorithm. - - Currently GPU-only. Requires ColossalAI to be installed via - ``pip install -v --no-cache-dir --global-option="--cuda_ext" ./``. - - This version of fused LAMB implements 2 fusions. - - * Fusion of the LAMB update's elementwise operations - * A multi-tensor apply launch that batches the elementwise updates applied to all the model's parameters into one or a few kernel launches. - - :class:`colossalai.nn.optimizer.FusedLAMB`'s usage is identical to any ordinary Pytorch optimizer - - :class:`colossalai.nn.optimizer.FusedLAMB` may be used with or without Amp. - - LAMB was proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_. - - Arguments: - params (iterable): iterable of parameters to optimize or dicts defining - parameter groups. - lr (float, optional): learning rate. 
(default: 1e-3) - betas (Tuple[float, float], optional): coefficients used for computing - running averages of gradient and its norm. (default: (0.9, 0.999)) - eps (float, optional): term added to the denominator to improve - numerical stability. (default: 1e-6) - weight_decay (float, optional): weight decay (L2 penalty) (default: 0.01) - amsgrad (boolean, optional): whether to use the AMSGrad variant of this - algorithm from the paper `On the Convergence of Adam and Beyond`_ - NOT SUPPORTED now! (default: False) - adam_w_mode (boolean, optional): Apply L2 regularization or weight decay - True for decoupled weight decay(also known as AdamW) (default: True) - grad_averaging (bool, optional): whether apply (1-beta2) to grad when - calculating running averages of gradient. (default: True) - set_grad_none (bool, optional): whether set grad to None when zero_grad() - method is called. (default: True) - max_grad_norm (float, optional): value used to clip global grad norm - (default: 1.0) - use_nvlamb (boolean, optional): Apply adaptive learning rate to 0.0 - weight decay parameter (default: False) - - .. _Large Batch Optimization for Deep Learning\: Training BERT in 76 minutes: - https://arxiv.org/abs/1904.00962 - .. _On the Convergence of Adam and Beyond: - https://openreview.net/forum?id=ryQu7f-RZ - """ - - def __init__(self, params, lr=1e-3, bias_correction=True, - betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01, - amsgrad=False, adam_w_mode=True, - grad_averaging=True, set_grad_none=True, - max_grad_norm=1.0, use_nvlamb=False): - if amsgrad: - raise RuntimeError( - 'FusedLAMB does not support the AMSGrad variant.') - defaults = dict(lr=lr, bias_correction=bias_correction, - betas=betas, eps=eps, weight_decay=weight_decay, - grad_averaging=grad_averaging, - max_grad_norm=max_grad_norm) - super(FusedLAMB, self).__init__(params, defaults) - if multi_tensor_applier.available: - import colossal_C - self.multi_tensor_l2norm = colossal_C.multi_tensor_l2norm - # Skip buffer - self._dummy_overflow_buf = torch.tensor( - [0], dtype=torch.int, device=self.param_groups[0]["params"][0].device) - self.multi_tensor_lamb = colossal_C.multi_tensor_lamb - else: - raise RuntimeError('FusedLAMB requires cuda extensions') - - self.adam_w_mode = 1 if adam_w_mode else 0 - self.set_grad_none = set_grad_none - self.use_nvlamb = use_nvlamb - - def zero_grad(self): - if self.set_grad_none: - for group in self.param_groups: - for p in group['params']: - p.grad = None - else: - super(FusedLAMB, self).zero_grad() - - def step(self, closure=None): - """Performs a single optimization step. - - Arguments: - closure (callable, optional): A closure that reevaluates the model - and returns the loss. 
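The step body that follows first computes separate L2 norms for the fp16 and fp32 gradient lists and then blends them into one global norm; this works because the L2 norm of a concatenation equals the L2 norm of the per-group norms. A plain-PyTorch sketch of that reduction (``global_grad_norm`` is a hypothetical helper, for illustration only):

```python
import torch

def global_grad_norm(params):
    # L2 norm over all gradients; equivalent to blending per-dtype norms,
    # since ||concat(gs)||_2 == ||[||g_0||_2, ||g_1||_2, ...]||_2
    norms = [p.grad.detach().float().norm(2) for p in params if p.grad is not None]
    return torch.norm(torch.stack(norms), 2) if norms else torch.zeros(1)
```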
- """ - loss = None - if closure is not None: - loss = closure() - - # create separate grad lists for fp32 and fp16 params - g_all_32, g_all_16 = [], [] - for group in self.param_groups: - for p in group['params']: - if p.grad is None: - continue - if p.dtype == torch.float32: - g_all_32.append(p.grad.data) - elif p.dtype == torch.float16: - g_all_16.append(p.grad.data) - else: - raise RuntimeError('FusedLAMB only support fp16 and fp32.') - - device = self.param_groups[0]["params"][0].device - g_norm_32, g_norm_16 = torch.zeros( - 1, device=device), torch.zeros(1, device=device) - # compute grad norm for two lists - if len(g_all_32) > 0: - g_norm_32 = multi_tensor_applier(self.multi_tensor_l2norm, - self._dummy_overflow_buf, - [g_all_32], False)[0] - if len(g_all_16) > 0: - g_norm_16 = multi_tensor_applier(self.multi_tensor_l2norm, - self._dummy_overflow_buf, - [g_all_16], False)[0] - - # blend two grad norms to get global grad norm - global_grad_norm = multi_tensor_applier(self.multi_tensor_l2norm, - self._dummy_overflow_buf, - [[g_norm_32, g_norm_16]], - False)[0] - max_grad_norm = self.defaults['max_grad_norm'] - - for group in self.param_groups: - bias_correction = 1 if group['bias_correction'] else 0 - beta1, beta2 = group['betas'] - grad_averaging = 1 if group['grad_averaging'] else 0 - - # assume same step across group now to simplify things - # per parameter step can be easily support by making it tensor, or pass list into kernel - if 'step' in group: - group['step'] += 1 - else: - group['step'] = 1 - - # create lists for multi-tensor apply - g_16, p_16, m_16, v_16 = [], [], [], [] - g_32, p_32, m_32, v_32 = [], [], [], [] - - for p in group['params']: - if p.grad is None: - continue - if p.grad.data.is_sparse: - raise RuntimeError( - 'FusedLAMB does not support sparse gradients, please consider SparseAdam instead') - - state = self.state[p] - # State initialization - if len(state) == 0: - # Exponential moving average of gradient values - state['exp_avg'] = torch.zeros_like(p.data) - # Exponential moving average of gradient values - state['exp_avg_sq'] = torch.zeros_like(p.data) - - if p.dtype == torch.float16: - g_16.append(p.grad.data) - p_16.append(p.data) - m_16.append(state['exp_avg']) - v_16.append(state['exp_avg_sq']) - elif p.dtype == torch.float32: - g_32.append(p.grad.data) - p_32.append(p.data) - m_32.append(state['exp_avg']) - v_32.append(state['exp_avg_sq']) - else: - raise RuntimeError('FusedLAMB only support fp16 and fp32.') - - if (len(g_16) > 0): - multi_tensor_applier(self.multi_tensor_lamb, - self._dummy_overflow_buf, - [g_16, p_16, m_16, v_16], - group['lr'], - beta1, - beta2, - group['eps'], - group['step'], - bias_correction, - group['weight_decay'], - grad_averaging, - self.adam_w_mode, - global_grad_norm, - max_grad_norm, - self.use_nvlamb) - if (len(g_32) > 0): - multi_tensor_applier(self.multi_tensor_lamb, - self._dummy_overflow_buf, - [g_32, p_32, m_32, v_32], - group['lr'], - beta1, - beta2, - group['eps'], - group['step'], - bias_correction, - group['weight_decay'], - grad_averaging, - self.adam_w_mode, - global_grad_norm, - max_grad_norm, - self.use_nvlamb) - - return loss diff --git a/colossalai/nn/optimizer/fused_sgd.py b/colossalai/nn/optimizer/fused_sgd.py deleted file mode 100644 index 9e29f67f7ecfc0ed28ddbfe89dacd8c8adec2e86..0000000000000000000000000000000000000000 --- a/colossalai/nn/optimizer/fused_sgd.py +++ /dev/null @@ -1,226 +0,0 @@ -# modified from https://github.com/NVIDIA/apex/blob/master/apex/optimizers/fused_sgd.py -import torch -from 
torch.optim.optimizer import Optimizer, required - -from colossalai.registry import OPTIMIZERS -from colossalai.utils import multi_tensor_applier - - -@OPTIMIZERS.register_module -class FusedSGD(Optimizer): - r"""Implements stochastic gradient descent (optionally with momentum). - - Currently GPU-only. Requires ColossalAI to be installed via - ``pip install -v --no-cache-dir --global-option="--cuda_ext" ./``. - - This version of fused SGD implements 2 fusions. - - * Fusion of the SGD update's elementwise operations - * A multi-tensor apply launch that batches the elementwise updates applied to all the model's parameters into one or a few kernel launches. - - :class:`colossalai.nn.optimizer.FusedSGD` may be used as a drop-in replacement for ``torch.optim.SGD`` - - :class:`colossalai.nn.optimizer.FusedSGD` may be used with or without Amp. - - Nesterov momentum is based on the formula from - `On the importance of initialization and momentum in deep learning`__. - - Args: - params (iterable): iterable of parameters to optimize or dicts defining - parameter groups - lr (float): learning rate - momentum (float, optional): momentum factor (default: 0) - weight_decay (float, optional): weight decay (L2 penalty) (default: 0) - dampening (float, optional): dampening for momentum (default: 0) - nesterov (bool, optional): enables Nesterov momentum (default: False) - - __ http://www.cs.toronto.edu/%7Ehinton/absps/momentum.pdf - - .. note:: - The implementation of SGD with Momentum/Nesterov subtly differs from - Sutskever et. al. and implementations in some other frameworks. - Considering the specific case of Momentum, the update can be written as - - .. math:: - v = \rho * v + g \\ - p = p - lr * v - - where p, g, v and :math:`\rho` denote the parameters, gradient, - velocity, and momentum respectively. - This is in contrast to Sutskever et. al. and - other frameworks which employ an update of the form - - .. math:: - v = \rho * v + lr * g \\ - p = p - v - - The Nesterov version is analogously modified. 
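A toy scalar sketch of the two momentum conventions described above; with a constant learning rate they trace identical parameter trajectories, and they only diverge once the learning rate changes between steps:

```python
rho, lr = 0.9, 0.1
v1 = p1 = 0.0  # this implementation: v = rho*v + g,    p -= lr*v
v2 = p2 = 0.0  # Sutskever et al.:    v = rho*v + lr*g, p -= v

for g in (1.0, 1.0, 1.0):  # toy gradient stream
    v1 = rho * v1 + g
    p1 -= lr * v1
    v2 = rho * v2 + lr * g
    p2 -= v2

assert abs(p1 - p2) < 1e-12  # equal while lr stays constant
```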
- """ - - def __init__(self, params, lr=required, momentum=0, dampening=0, - weight_decay=0, nesterov=False, - wd_after_momentum=False, - materialize_master_grads=True, - set_grad_none=False): - if lr is not required and lr < 0.0: - raise ValueError("Invalid learning rate: {}".format(lr)) - if momentum < 0.0: - raise ValueError("Invalid momentum value: {}".format(momentum)) - if weight_decay < 0.0: - raise ValueError( - "Invalid weight_decay value: {}".format(weight_decay)) - - defaults = dict(lr=lr, momentum=momentum, dampening=dampening, - weight_decay=weight_decay, nesterov=nesterov) - if nesterov and (momentum <= 0 or dampening != 0): - raise ValueError( - "Nesterov momentum requires a momentum and zero dampening") - super(FusedSGD, self).__init__(params, defaults) - - self.wd_after_momentum = wd_after_momentum - self.materialize_master_grads = materialize_master_grads - self.most_recent_scale = 1.0 - self.scale_set_by_backward = False - self.set_grad_none = set_grad_none - - if multi_tensor_applier.available: - import colossal_C - # Skip buffer - self._dummy_overflow_buf = torch.tensor( - [0], dtype=torch.int, device=self.param_groups[0]["params"][0].device) - self.multi_tensor_sgd = colossal_C.multi_tensor_sgd - else: - raise RuntimeError('FusedSGD requires cuda extensions') - - def __setstate__(self, state): - super(FusedSGD, self).__setstate__(state) - for group in self.param_groups: - group.setdefault('nesterov', False) - - def zero_grad(self): - if self.set_grad_none: - for group in self.param_groups: - for p in group['params']: - p.grad = None - else: - super(FusedSGD, self).zero_grad() - - def get_momentums(self, params): - momentums = [] - first_run = True - for p in params: - param_state = self.state[p] - # torch.optim.SGD initializes momentum in the main loop, we have - # to do it here, and track whether or not we've done so, so that - # momentum application can be skipped in the main kernel. - if 'momentum_buffer' not in param_state: - first_run = True - buf = param_state['momentum_buffer'] = torch.zeros_like(p.data) - momentums.append(buf) - else: - first_run = False - momentums.append(param_state['momentum_buffer']) - return momentums, first_run - - def step(self, closure=None): - """Performs a single optimization step. - - Arguments: - closure (callable, optional): A closure that reevaluates the model - and returns the loss. - """ - loss = None - if closure is not None: - loss = closure() - - explicit_master_params = (hasattr(self, "_amp_stash") and - hasattr(self._amp_stash, "fp32_from_fp16_groups")) - - for gid, group in enumerate(self.param_groups): - weight_decay = group['weight_decay'] - momentum = group['momentum'] - dampening = group['dampening'] - nesterov = group['nesterov'] - - # For each group, there are 3 possible combinations we need to consider: - # grad_type, param_to_update_type, momentum_type, requires_fp16_model_copy - # 1. fp16, fp16, fp16, No - # 2. fp32, fp32, fp32, No - # 3. 
fp16, fp32, fp32, Yes - - first_runs = [True, True] - - # I think a bit of code divergence in exchange for naming clarity is worthwhile - if explicit_master_params: - stash = self._amp_stash - - fp32_params = [ - p for p in stash.fp32_from_fp32_groups[gid] if p.grad is not None] - fp32_grads = [ - p.grad for p in stash.fp32_from_fp32_groups[gid] if p.grad is not None] - fp32_momentums, first_runs[1] = self.get_momentums(fp32_params) - - if self.materialize_master_grads: - fp16_model_params = [p for i, p in enumerate( - stash.fp16_groups[gid]) if stash.fp32_from_fp16_groups[gid][i].grad is not None] - fp32_from_fp16_grads = [ - p.grad for p in stash.fp32_from_fp16_groups[gid] if p.grad is not None] - fp32_from_fp16_params = [ - p for p in stash.fp32_from_fp16_groups[gid] if p.grad is not None] - fp32_from_fp16_momentums, first_runs[0] = self.get_momentums( - fp32_from_fp16_params) - - fp16_set = [fp32_from_fp16_grads, fp32_from_fp16_params, - fp32_from_fp16_momentums, fp16_model_params] - else: - fp16_model_params = [ - p for p in stash.fp16_groups[gid] if p.grad is not None] - fp16_model_grads = [ - p.grad for p in stash.fp16_groups[gid] if p.grad is not None] - fp32_from_fp16_params = [p for i, p in enumerate( - stash.fp32_from_fp16_groups[gid]) if stash.fp16_groups[gid][i].grad is not None] - fp32_from_fp16_momentums, first_runs[0] = self.get_momentums( - fp32_from_fp16_params) - - fp16_set = [fp16_model_grads, fp32_from_fp16_params, - fp32_from_fp16_momentums, fp16_model_params] - - launch_sets = [fp16_set, [ - fp32_grads, fp32_params, fp32_momentums]] - else: - fp16_params = [p for p in group['params'] if ( - p.dtype == torch.float16 and p.grad is not None)] - fp16_grads = [p.grad for p in group['params'] if ( - p.dtype == torch.float16 and p.grad is not None)] - fp16_momentums, first_runs[0] = self.get_momentums(fp16_params) - - fp32_params = [p for p in group['params'] if ( - p.dtype == torch.float32 and p.grad is not None)] - fp32_grads = [p.grad for p in group['params'] if ( - p.dtype == torch.float32 and p.grad is not None)] - fp32_momentums, first_runs[1] = self.get_momentums(fp32_params) - - launch_sets = [[fp16_grads, fp16_params, fp16_momentums], - [fp32_grads, fp32_params, fp32_momentums]] - - for s, (launch_set, first_run) in enumerate(zip(launch_sets, first_runs)): - assert len(launch_set[0]) == len(launch_set[1]) - assert len(launch_set[0]) == len(launch_set[2]) - if len(launch_set[0]) > 0: - multi_tensor_applier( - self.multi_tensor_sgd, - self._dummy_overflow_buf, - launch_set, - weight_decay, - momentum, - dampening, - group['lr'], - nesterov, - first_run, - self.wd_after_momentum, - 1.0 / self.most_recent_scale) - - self.most_recent_scale = 1.0 - self.scale_set_by_backward = False - - return loss diff --git a/colossalai/nn/optimizer/lamb.py b/colossalai/nn/optimizer/lamb.py deleted file mode 100644 index aa137098a7e81e7583e8d0bde16edb74a6371c3f..0000000000000000000000000000000000000000 --- a/colossalai/nn/optimizer/lamb.py +++ /dev/null @@ -1,116 +0,0 @@ -""" -Adapted from the pytorch-lamb library at https://github.com/cybertronai/pytorch-lamb -""" - -import torch -from torch.optim import Optimizer - -from colossalai.registry import OPTIMIZERS - - -@OPTIMIZERS.register_module -class Lamb(Optimizer): - r"""Implements Lamb algorithm. - It has been proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_. 
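The update implemented in ``step()`` below scales each parameter's Adam-style step by a layer-wise trust ratio ``||p|| / ||adam_step||``; a toy illustration of that ratio with made-up tensors:

```python
import torch

p = torch.randn(1000)                  # hypothetical parameter tensor
adam_step = torch.randn(1000) * 1e-3   # hypothetical Adam-style update

weight_norm = p.norm(2)
adam_norm = adam_step.norm(2)
# fall back to 1 when either norm vanishes, as the optimizer does
trust_ratio = 1.0 if weight_norm == 0 or adam_norm == 0 else weight_norm / adam_norm
update = -0.001 * trust_ratio * adam_step  # lr * trust_ratio * adam_step
```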
- - Arguments: - params (iterable): iterable of parameters to optimize or dicts defining - parameter groups - lr (float, optional): learning rate (default: 1e-3) - betas (Tuple[float, float], optional): coefficients used for computing - running averages of gradient and its square (default: (0.9, 0.999)) - eps (float, optional): term added to the denominator to improve - numerical stability (default: 1e-6) - weight_decay (float, optional): weight decay (L2 penalty) (default: 0) - adam (bool, optional): always use trust ratio = 1, which turns this into - Adam. Useful for comparison purposes. - - .. _Large Batch Optimization for Deep Learning\: Training BERT in 76 minutes: - https://arxiv.org/abs/1904.00962 - """ - - def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, - weight_decay=0, adam=False): - if not 0.0 <= lr: - raise ValueError("Invalid learning rate: {}".format(lr)) - if not 0.0 <= eps: - raise ValueError("Invalid epsilon value: {}".format(eps)) - if not 0.0 <= betas[0] < 1.0: - raise ValueError( - "Invalid beta parameter at index 0: {}".format(betas[0])) - if not 0.0 <= betas[1] < 1.0: - raise ValueError( - "Invalid beta parameter at index 1: {}".format(betas[1])) - defaults = dict(lr=lr, betas=betas, eps=eps, - weight_decay=weight_decay) - self.adam = adam - super(Lamb, self).__init__(params, defaults) - - def step(self, closure=None): - """Performs a single optimization step. - - Arguments: - closure (callable, optional): A closure that reevaluates the model - and returns the loss. - """ - loss = None - if closure is not None: - loss = closure() - - for group in self.param_groups: - for p in group['params']: - if p.grad is None: - continue - grad = p.grad.data - if grad.is_sparse: - raise RuntimeError( - 'Lamb does not support sparse gradients, consider SparseAdam instead.') - - state = self.state[p] - - # State initialization - if len(state) == 0: - state['step'] = 0 - # Exponential moving average of gradient values - state['exp_avg'] = torch.zeros_like(p.data) - # Exponential moving average of squared gradient values - state['exp_avg_sq'] = torch.zeros_like(p.data) - - exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] - beta1, beta2 = group['betas'] - - state['step'] += 1 - - # Decay the first and second moment running average coefficient - # m_t - exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) - # v_t - exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) - - # Paper v3 does not use debiasing. - # bias_correction1 = 1 - beta1 ** state['step'] - # bias_correction2 = 1 - beta2 ** state['step'] - # Apply bias to lr to avoid broadcast.
- # * math.sqrt(bias_correction2) / bias_correction1 - step_size = group['lr'] - - weight_norm = p.data.pow(2).sum().sqrt() - - adam_step = exp_avg / exp_avg_sq.sqrt().add(group['eps']) - if group['weight_decay'] != 0: - adam_step.add_(p.data, alpha=group['weight_decay']) - - adam_norm = adam_step.pow(2).sum().sqrt() - if weight_norm == 0 or adam_norm == 0: - trust_ratio = 1 - else: - trust_ratio = weight_norm / adam_norm - state['weight_norm'] = weight_norm - state['adam_norm'] = adam_norm - state['trust_ratio'] = trust_ratio - if self.adam: - trust_ratio = 1 - - p.data.add_(adam_step, alpha=-step_size * trust_ratio) - - return loss diff --git a/colossalai/nn/optimizer/lars.py b/colossalai/nn/optimizer/lars.py deleted file mode 100644 index 212f66671a0db9f580d99758823fdf78e3e54106..0000000000000000000000000000000000000000 --- a/colossalai/nn/optimizer/lars.py +++ /dev/null @@ -1,102 +0,0 @@ -"""Adapted from https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch/blob/main/lars.py""" - -from typing import Iterable - -import torch -from torch.optim import Optimizer - -from colossalai.registry import OPTIMIZERS - - -@OPTIMIZERS.register_module -class Lars(Optimizer): - r"""Implements the LARS optimizer from `"Large batch training of convolutional networks" - `_. - - Args: - params (iterable): iterable of parameters to optimize or dicts defining - parameter groups - lr (float, optional): learning rate (default: 1e-3) - momentum (float, optional): momentum factor (default: 0) - eeta (float, optional): LARS coefficient as used in the paper (default: 1e-3) - weight_decay (float, optional): weight decay (L2 penalty) (default: 0) - """ - - def __init__( - self, - params: Iterable[torch.nn.Parameter], - lr=1e-3, - momentum=0, - eeta=1e-3, - weight_decay=0, - epsilon=0.0 - ) -> None: - if not isinstance(lr, float) or lr < 0.0: - raise ValueError("Invalid learning rate: {}".format(lr)) - if momentum < 0.0: - raise ValueError("Invalid momentum value: {}".format(momentum)) - if weight_decay < 0.0: - raise ValueError( - "Invalid weight_decay value: {}".format(weight_decay)) - if eeta <= 0 or eeta > 1: - raise ValueError("Invalid eeta value: {}".format(eeta)) - if epsilon < 0: - raise ValueError("Invalid epsilon value: {}".format(epsilon)) - defaults = dict(lr=lr, momentum=momentum, - weight_decay=weight_decay, eeta=eeta, epsilon=epsilon, lars=True) - - super().__init__(params, defaults) - - @torch.no_grad() - def step(self, closure=None): - """Performs a single optimization step. - - Arguments: - closure (callable, optional): A closure that reevaluates the model - and returns the loss. 
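The loop that follows computes a local learning-rate scaling per parameter; the same arithmetic pulled out into a standalone toy snippet (all values here are made up):

```python
import torch

eeta, weight_decay, eps = 1e-3, 1e-4, 0.0
p, g = torch.randn(1000), torch.randn(1000)  # hypothetical weight and gradient

w_norm, g_norm = torch.norm(p), torch.norm(g)
trust_ratio = eeta * w_norm / (g_norm + weight_decay * w_norm + eps)
trust_ratio = trust_ratio.clamp(0.0, 50.0)   # same clamp as in step()
scaled_lr = 0.1 * trust_ratio.item()         # effective lr = lr * trust_ratio
```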
- """ - loss = None - if closure is not None: - with torch.enable_grad(): - loss = closure() - - for group in self.param_groups: - weight_decay = group['weight_decay'] - momentum = group['momentum'] - eeta = group['eeta'] - lr = group['lr'] - lars = group['lars'] - eps = group['epsilon'] - - for p in group['params']: - if p.grad is None: - continue - decayed_grad = p.grad - scaled_lr = lr - if lars: - w_norm = torch.norm(p) - g_norm = torch.norm(p.grad) - trust_ratio = torch.where( - w_norm > 0 and g_norm > 0, - eeta * w_norm / (g_norm + weight_decay * w_norm + eps), - torch.ones_like(w_norm) - ) - trust_ratio.clamp_(0.0, 50) - scaled_lr *= trust_ratio.item() - if weight_decay != 0: - decayed_grad = decayed_grad.add(p, alpha=weight_decay) - decayed_grad = torch.clamp(decayed_grad, -10.0, 10.0) - - if momentum != 0: - param_state = self.state[p] - if 'momentum_buffer' not in param_state: - buf = param_state['momentum_buffer'] = torch.clone( - decayed_grad).detach() - else: - buf = param_state['momentum_buffer'] - buf.mul_(momentum).add_(decayed_grad) - decayed_grad = buf - - p.add_(decayed_grad, alpha=-scaled_lr) - - return loss diff --git a/colossalai/registry/__init__.py b/colossalai/registry/__init__.py deleted file mode 100644 index 62b0bb08fae37a8a135265cd3ab26895a3068494..0000000000000000000000000000000000000000 --- a/colossalai/registry/__init__.py +++ /dev/null @@ -1,23 +0,0 @@ -import torch.distributed.optim as dist_optim -import torch.nn as nn -import torch.optim as optim -import torchvision.models as tv_models -import torchvision.datasets as tv_datasets -from torchvision import transforms - -from .registry import Registry - -LAYERS = Registry("layers", third_party_library=[nn]) -LOSSES = Registry("losses") -MODELS = Registry("models", third_party_library=[tv_models]) -OPTIMIZERS = Registry("optimizers", third_party_library=[optim, dist_optim]) -DATASETS = Registry("datasets", third_party_library=[tv_datasets]) -DIST_GROUP_INITIALIZER = Registry("dist_group_initializer") -GRADIENT_HANDLER = Registry("gradient_handler") -LOSSES = Registry("losses", third_party_library=[nn]) -HOOKS = Registry("hooks") -TRANSFORMS = Registry("transforms", third_party_library=[transforms]) -DATA_SAMPLERS = Registry("data_samplers") -LR_SCHEDULERS = Registry("lr_schedulers") -SCHEDULE = Registry("schedules") -OPHOOKS = Registry("ophooks") diff --git a/colossalai/registry/__pycache__/__init__.cpython-36.pyc b/colossalai/registry/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 930afb7640cb0de369afa5a12cafa9a83bb11290..0000000000000000000000000000000000000000 Binary files a/colossalai/registry/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/registry/__pycache__/__init__.cpython-37.pyc b/colossalai/registry/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 4c26b57e329f595767b285143f958ad67e38bf73..0000000000000000000000000000000000000000 Binary files a/colossalai/registry/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/registry/__pycache__/registry.cpython-36.pyc b/colossalai/registry/__pycache__/registry.cpython-36.pyc deleted file mode 100644 index 959116d44d1013b9d3a33c2e8d82a3b78c338120..0000000000000000000000000000000000000000 Binary files a/colossalai/registry/__pycache__/registry.cpython-36.pyc and /dev/null differ diff --git a/colossalai/registry/__pycache__/registry.cpython-37.pyc b/colossalai/registry/__pycache__/registry.cpython-37.pyc deleted file mode 100644 index 
e4fa1094f941937191364a55a60820686525c189..0000000000000000000000000000000000000000 Binary files a/colossalai/registry/__pycache__/registry.cpython-37.pyc and /dev/null differ diff --git a/colossalai/registry/registry.py b/colossalai/registry/registry.py deleted file mode 100644 index 3ea858b7ebef2df98efb109db8c15a8d6ddad88b..0000000000000000000000000000000000000000 --- a/colossalai/registry/registry.py +++ /dev/null @@ -1,82 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from types import ModuleType -from typing import List - - -class Registry: - """This is a registry class used to register classes and modules so that a universal - object builder can be enabled. - - :param name: The name of the registry - :type name: str - :param third_party_library: List of third party libraries which are used in the - initialization of the register module - :type third_party_library: list, optional - """ - - def __init__(self, name: str, third_party_library: List[ModuleType] = None): - self._name = name - self._registry = dict() - self._third_party_lib = third_party_library - - @property - def name(self): - return self._name - - def register_module(self, module_class): - """Registers a module represented in `module_class`. - - :param module_class: The module to be registered - :type module_class: class - :raises AssertionError: Raises an AssertionError if the module has already been - registered before - :return: The registered module, returned so that the class can still be used - normally when imported - :rtype: class - """ - module_name = module_class.__name__ - assert module_name not in self._registry - self._registry[module_name] = module_class - - # return the class so that it can still be used normally after registration - return module_class - - def get_module(self, module_name: str): - """Retrieves a module with name `module_name` and returns the module if it has - already been registered before. - - :param module_name: The name of the module to be retrieved - :type module_name: str - :raises NameError: Raises a NameError if the module to be retrieved has neither been - registered directly nor found in a third-party library - :return: The retrieved module or None - :rtype: :class:`object` - """ - if module_name in self._registry: - return self._registry[module_name] - elif self._third_party_lib is not None: - for lib in self._third_party_lib: - if hasattr(lib, module_name): - return getattr(lib, module_name) - raise NameError(f'Module {module_name} not found in the registry {self.name}') - - def has(self, module_name: str): - """Searches for a module with name `module_name` and returns a boolean value indicating - whether the module has been registered directly or found in a third-party library.
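Taken together, the registry API sketches out as follows (``MyLoss`` is a placeholder class for illustration):

```python
from colossalai.registry import LOSSES

@LOSSES.register_module
class MyLoss:  # hypothetical loss class, registered under its class name
    pass

assert LOSSES.has('MyLoss')
loss_cls = LOSSES.get_module('MyLoss')
# third-party fallback: LOSSES is built with torch.nn, so names such as
# 'CrossEntropyLoss' resolve through the registered libraries
ce_cls = LOSSES.get_module('CrossEntropyLoss')
```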
- - :param module_name: The name of the module to be searched for - :type module_name: str - :return: A boolean value indicating whether the module has been registered directly or - found in a third-party library - :rtype: bool - """ - found_flag = module_name in self._registry - - if self._third_party_lib: - for lib in self._third_party_lib: - if hasattr(lib, module_name): - found_flag = True - break - - return found_flag diff --git a/colossalai/trainer/__init__.py b/colossalai/trainer/__init__.py deleted file mode 100644 index 84e53dc4e87ac5b10a93aacc0fce975cc49c66eb..0000000000000000000000000000000000000000 --- a/colossalai/trainer/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from ._trainer import Trainer - -__all__ = ['Trainer'] diff --git a/colossalai/trainer/__pycache__/__init__.cpython-37.pyc b/colossalai/trainer/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 80ad09e0fb33ead4ce212b611ff802fcc1f698e8..0000000000000000000000000000000000000000 Binary files a/colossalai/trainer/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/__pycache__/_trainer.cpython-37.pyc b/colossalai/trainer/__pycache__/_trainer.cpython-37.pyc deleted file mode 100644 index 5dcc57357d7679706f540587d99e9604425288fe..0000000000000000000000000000000000000000 Binary files a/colossalai/trainer/__pycache__/_trainer.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/_trainer.py b/colossalai/trainer/_trainer.py deleted file mode 100644 index ebb3ac893884b76e04038e714fc049f3a8e067e1..0000000000000000000000000000000000000000 --- a/colossalai/trainer/_trainer.py +++ /dev/null @@ -1,441 +0,0 @@ -from typing import Union, List -from colossalai.context.parallel_mode import ParallelMode - -import torch -from torch import Tensor -from torch.utils.data import DataLoader -from tqdm import tqdm - -from colossalai.core import global_context as gpc - -from colossalai.engine import Engine -from colossalai.engine.schedule import NonPipelineSchedule, BaseSchedule -from colossalai.logging import DistributedLogger -from colossalai.utils import MultiTimer -from colossalai.utils import is_dp_rank_0, is_tp_rank_0, is_no_pp_or_last_stage -from colossalai.trainer.hooks import BaseHook - - -class Trainer: - """This class streamlines the deployment of users' training and evaluation, saving - them from writing their own training scripts. It is similar to ``ignite.engine`` and - ``keras.engine``, but is called ``Trainer``.
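A condensed usage sketch of the class (``engine``, ``logger`` and the dataloaders are assumed to come from an earlier ``colossalai.initialize`` call; all values are placeholders):

```python
from colossalai.trainer import Trainer

# engine, logger, train_dataloader and test_dataloader are assumed to exist
trainer = Trainer(engine=engine, logger=logger)
trainer.fit(train_dataloader=train_dataloader,
            epochs=10,
            test_dataloader=test_dataloader,
            test_interval=1,
            display_progress=True)
```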
- - :param engine: Engine responsible for the process function - :type engine: :class:`Engine` - :param schedule: Schedule responsible for forward and backward steps - :type schedule: :class:`BaseSchedule`, optional - :param timer: Timer used to monitor the whole training - :type timer: :class:`MultiTimer`, optional - :param logger: Logger used to record the whole training - :type logger: :class:`colossalai.logging.DistributedLogger`, optional - """ - def __init__( - self, - engine: Engine, - schedule: BaseSchedule = None, - timer: MultiTimer = None, - logger: DistributedLogger = None, - ): - # training-related params - self._engine = engine - self._max_epochs = 0 - self._cur_epoch = 0 - self._max_steps = 0 - self._cur_step = 0 - self._steps_per_epoch = 0 - - # misc params - self._logger = logger - self._verbose = logger is not None - - # hooks can store states in this dict, and could be consumed by other hooks - self.states = dict() - - # build hooks - self.hooks = list() - - # multi-timer for time benchmarking - self._timer = timer - - # set schedule which specifies the training iteration for the engine - if schedule is None: - schedule = NonPipelineSchedule() - if (gpc.is_initialized(ParallelMode.PIPELINE) - and gpc.get_world_size(ParallelMode.PIPELINE) > 1): - assert not isinstance( - schedule, NonPipelineSchedule - ), "NonPipelineSchedule cannot be used for pipeline parallel training, please use PipelineSchedule instead." - self._schedule = schedule - self._schedule.pre_processing(engine) - - @property - def cur_epoch(self): - """Returns the index of the current epoch.""" - return self._cur_epoch - - @cur_epoch.setter - def cur_epoch(self, epoch: int): - """Set how many epochs have been processed.""" - # allow setter for training resumption - self._cur_epoch = epoch - - @property - def cur_step(self): - """Returns how many iteration steps have been processed.""" - return self._cur_step - - @property - def max_epochs(self): - return self._max_epochs - - @property - def max_steps(self): - return self._max_steps - - @property - def steps_per_epoch(self): - return self._steps_per_epoch - - @property - def engine(self): - return self._engine - - @property - def schedule(self): - return self._schedule - - def _set_current_step(self, epoch: int): - """Sets the current step number from an epoch index. - - :param epoch: Epoch index used to derive the current step number - :type epoch: int - """ - self._cur_step = epoch * self._steps_per_epoch - - def _call_timer(self, action: str, item: str, *args, **kwargs) -> None: - """Call a timer function with a given timer name. - - :param action: Function to be called on timer - :type action: str - :param item: Name of the timer - :type item: str - :param args: args used for action function - :param kwargs: kwargs used for action function - """ - - if self._timer is not None: - getattr(self._timer, action)(item, *args, **kwargs) - - def _reset_states(self) -> None: - """Clear trainer states""" - self.states = dict() - - def _call_hooks(self, func, output=None): - """Calls specific hooks at the current time point.
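Hook callbacks are resolved by name via ``getattr`` and invoked either with the trainer alone or with the unpacked iteration output; a sketch of a custom hook this mechanism would drive (``LossPrinterHook`` is hypothetical):

```python
from colossalai.trainer.hooks import BaseHook

class LossPrinterHook(BaseHook):  # hypothetical example hook
    def __init__(self, priority: int = 10):
        super().__init__(priority)

    def after_train_iter(self, trainer, output, label, loss):
        # receives the unpacked (logits, label, loss) tuple from _call_hooks
        print(f"step {trainer.cur_step}: loss = {loss.item():.4f}")
```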
- - :param func: A string representing the time point - :param output: Output of the model after running an iteration, or None at any other time point - :type func: str - :type output: optional - """ - # Only after-iteration hooks receive the output - for hook in self.hooks: - if output is None: - getattr(hook, func)(self) - else: - getattr(hook, func)(self, *output) - - @staticmethod - def _should_display_progress(display_progress: bool): - """Only display progress on DP rank 0, TP rank 0 and PP last rank""" - return (display_progress and is_dp_rank_0() and is_tp_rank_0() - and is_no_pp_or_last_stage()) - - def _train_epoch( - self, - train_dataloader: DataLoader, - epoch: int = None, - display_progress: bool = False, - return_output_label: bool = True, - ): - # set training state - self._engine.train() - data_iter = iter(train_dataloader) - progress = range(self._steps_per_epoch) - if display_progress: - if epoch is None: - progress = tqdm(progress, desc="[Train]") - else: - progress = tqdm(progress, desc=f"[Epoch {epoch} / Train]") - - self._call_hooks("before_train_epoch") - self._call_timer(action="start", item="Train-epoch") - for i in progress: - self._call_hooks("before_train_iter") - self._call_timer(action="start", item="Train-step") - - # run 1 training step - self.engine.zero_grad() - logits, label, loss = self.schedule.forward_backward_step( - self.engine, - data_iter, - forward_only=False, - return_loss=True, - return_output_label=return_output_label, - ) - self.engine.step() - self._call_timer(action="stop", - item="Train-step", - keep_in_history=True) - self._call_hooks("after_train_iter", output=(logits, label, loss)) - - self._cur_step += 1 - - if display_progress: - if "step_metrics" in self.states: - progress.set_postfix(**self.states["step_metrics"]) - - # stop when max iter is reached - if self._exceed_max_step(): - break - - self._call_timer(action="stop", - item="Train-epoch", - keep_in_history=True) - self._call_hooks("after_train_epoch") - self._call_timer(action="reset", item="Train-epoch") - - def _eval( - self, - test_dataloader: DataLoader, - epoch: int = None, - display_progress: bool = False, - return_output_label: bool = True, - ): - # switch engine status - self._engine.eval() - - data_iter = iter(test_dataloader) - num_steps = len(test_dataloader) - - self._call_hooks("before_test") - # prepare progress bar - progress = range(num_steps) - if display_progress: - desc = "Evaluation" - if epoch is not None: - desc = "[Epoch %d / Test]" % epoch - progress = tqdm(progress, desc=desc) - - self._call_hooks("before_test_epoch") - self._call_timer(action="start", item="Test-epoch") - with torch.no_grad(): - for _ in progress: - self._call_hooks("before_test_iter") - self._call_timer(action="start", item="Test-step") - logits, label, loss = self.schedule.forward_backward_step( - self.engine, - data_iter, - forward_only=True, - return_loss=True, - return_output_label=return_output_label, - ) - self._call_timer(action="stop", - item="Test-step", - keep_in_history=True) - self._call_hooks("after_test_iter", - output=(logits, label, loss)) - - if display_progress: - if "step_metrics" in self.states: - progress.set_postfix(**self.states["step_metrics"]) - - self._call_timer(action="stop", - item="Test-epoch", - keep_in_history=True) - self._call_hooks("after_test_epoch") - self._call_hooks("after_test") - self._call_timer(action="reset", item="Test-step") - self._call_timer(action="reset", item="Test-epoch") - - def _exceed_max_step(self): - return self._max_steps is not None
and self._cur_step >= self._max_steps - - def fit( - self, - train_dataloader: DataLoader, - epochs: int, - max_steps: int = None, - test_dataloader: DataLoader = None, - test_interval: int = 1, - hooks: List[BaseHook] = None, - display_progress: bool = False, - return_output_label: bool = True, - ): - """Trains the model to fit training data. - - :param train_dataloader: DataLoader in training - :param epochs: Maximum number of epochs - :param max_steps: Maximum number of running iterations - :param test_dataloader: DataLoader in testing - :param test_interval: Interval of testing - :param hooks: A list of hooks used in training - :param display_progress: If True, the training progress will be printed - :param return_output_label: If True, the output of model and the label will be returned - - :type train_dataloader: DataLoader - :type epochs: int - :type max_steps: int, optional - :type test_dataloader: DataLoader, optional - :type test_interval: int, optional - :type hooks: list, optional - :type display_progress: bool, optional - :type return_output_label: bool, optional - """ - - # set epochs and steps, consider gradient accumulation - self._steps_per_epoch = len(train_dataloader) - self._max_steps = max_steps - self._max_epochs = epochs - - # check if testing is required - should_test = False - if test_dataloader is not None: - should_test = True - - display_progress = self._should_display_progress(display_progress) - - # reset hooks - self._reset_states() - if hooks is not None: - assert isinstance( - hooks, list - ), f"expected argument hooks to be a list, but got {type(hooks)}" - else: - hooks = [] - self.hooks = hooks - self.hooks.sort(key=lambda hook: hook.priority) - if self._verbose: - for hook in self.hooks: - self._logger.info( - f"Using {hook.__class__.__name__} for training, priority = {hook.priority}", - ranks=[0], - ) - self._logger.info( - "Lower value means higher priority for calling hook function", - ranks=[0]) - self._call_hooks("after_hook_is_attached") - - self._engine.train() - self._call_hooks("before_train") - - # recover step value if resuming training - last_epoch = self._cur_epoch - if self.cur_epoch != 0: - self._set_current_step(last_epoch) - - for epoch in range(last_epoch, epochs): - # train for one epoch - self._train_epoch( - train_dataloader=train_dataloader, - epoch=epoch, - display_progress=display_progress, - return_output_label=return_output_label, - ) - - # start eval - if should_test and epoch % test_interval == 0: - self._eval( - test_dataloader=test_dataloader, - display_progress=display_progress, - epoch=epoch, - return_output_label=return_output_label, - ) - - self._cur_epoch += 1 - - # check for termination - if self._exceed_max_step(): - self._logger.info( - f"Max number of steps {max_steps} has been reached, training is stopped automatically", - ranks=[0], - ) - break - self._call_hooks("after_train") - self._call_timer("reset", "Train-epoch") - - def evaluate( - self, - test_dataloader: DataLoader, - hooks: List[BaseHook] = None, - display_progress: bool = False, - return_output_label: bool = True, - ): - """Evaluates the model with testing data.
-
-        :param test_dataloader: DataLoader in testing
-        :param hooks: A list of hooks used in evaluation
-        :param display_progress: If True, the evaluation progress will be printed
-        :param return_output_label: If True, the output of model and the label will be returned
-
-        :type test_dataloader: DataLoader
-        :type hooks: list, optional
-        :type display_progress: bool, optional
-        :type return_output_label: bool, optional
-        """
-        # set display
-        display_progress = self._should_display_progress(display_progress)
-
-        # reset hooks
-        self._reset_states()
-        if hooks is not None:
-            assert isinstance(
-                hooks, list
-            ), f"expected argument hooks to be a list, but got {type(hooks)}"
-        else:
-            hooks = []
-        self.hooks = hooks
-        self.hooks.sort(key=lambda hook: hook.priority)
-        if self._verbose:
-            for hook in self.hooks:
-                self._logger.info(
-                    f"Using {hook.__class__.__name__} for evaluation, priority = {hook.priority}",
-                    ranks=[0],
-                )
-            self._logger.info(
-                "Lower value means higher priority for calling hook function",
-                ranks=[0])
-        self._call_hooks("after_hook_is_attached")
-
-        # eval
-        self._eval(
-            test_dataloader=test_dataloader,
-            display_progress=display_progress,
-            return_output_label=return_output_label,
-        )
-
-    def predict(self, data: Union[Tensor, List[Tensor]]):
-        """Uses the trained model to make a prediction for a tensor or a list of tensors.
-
-        :param data: Data as the input
-        :type data: Union[Tensor, List[Tensor]]
-        :return: The output of the model as the prediction
-        :rtype: Tensor
-        """
-        # predict without labels
-        if isinstance(data, (list, tuple)):
-            assert isinstance(data[0], Tensor)
-        else:
-            assert isinstance(data, Tensor)
-        self._engine.eval()
-
-        # prepare a list of (data, label) to make it iterable
-        # for compatibility with the schedule
-        simple_dataloader = [(data, None)]
-        data_iter = iter(simple_dataloader)
-        output, _, _ = self.schedule.forward_backward_step(self.engine,
-                                                           data_iter,
-                                                           forward_only=True,
-                                                           return_loss=False)
-        return output
diff --git a/colossalai/trainer/hooks/__init__.py b/colossalai/trainer/hooks/__init__.py
deleted file mode 100644
index ab5ef9df9153cf480eb0040ed1eef3b43d2ec040..0000000000000000000000000000000000000000
--- a/colossalai/trainer/hooks/__init__.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from ._base_hook import BaseHook
-from ._checkpoint_hook import LoadCheckpointHook, SaveCheckpointHook
-from ._log_hook import (LogMemoryByEpochHook, LogMetricByEpochHook, LogMetricByStepHook, LogTimingByEpochHook,
-                        TensorboardHook)
-from ._lr_scheduler_hook import LRSchedulerHook
-from ._metric_hook import AccuracyHook, LossHook, MetricHook, ThroughputHook
-
-__all__ = [
-    'BaseHook', 'MetricHook', 'LoadCheckpointHook', 'SaveCheckpointHook', 'LossHook', 'AccuracyHook',
-    'LogMetricByEpochHook', 'TensorboardHook', 'LogTimingByEpochHook', 'LogMemoryByEpochHook', 'LRSchedulerHook',
-    'ThroughputHook', 'LogMetricByStepHook'
-]
diff --git a/colossalai/trainer/hooks/__pycache__/__init__.cpython-37.pyc b/colossalai/trainer/hooks/__pycache__/__init__.cpython-37.pyc
deleted file mode 100644
index 6eb47fd2cd0ec912d3a6f6890d37c75ccb120ab8..0000000000000000000000000000000000000000
Binary files a/colossalai/trainer/hooks/__pycache__/__init__.cpython-37.pyc and /dev/null differ
diff --git a/colossalai/trainer/hooks/__pycache__/_base_hook.cpython-37.pyc b/colossalai/trainer/hooks/__pycache__/_base_hook.cpython-37.pyc
deleted file mode 100644
index bddd5a74e83c3e5bd85b049f460c92e8d7f5578a..0000000000000000000000000000000000000000
Binary files
a/colossalai/trainer/hooks/__pycache__/_base_hook.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/hooks/__pycache__/_checkpoint_hook.cpython-37.pyc b/colossalai/trainer/hooks/__pycache__/_checkpoint_hook.cpython-37.pyc deleted file mode 100644 index e044c516f51a3d76ee9d132f023b1e493fdc23bb..0000000000000000000000000000000000000000 Binary files a/colossalai/trainer/hooks/__pycache__/_checkpoint_hook.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/hooks/__pycache__/_log_hook.cpython-37.pyc b/colossalai/trainer/hooks/__pycache__/_log_hook.cpython-37.pyc deleted file mode 100644 index aef9cf49d0df549ed5388072263c23d9b82cc36a..0000000000000000000000000000000000000000 Binary files a/colossalai/trainer/hooks/__pycache__/_log_hook.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/hooks/__pycache__/_lr_scheduler_hook.cpython-37.pyc b/colossalai/trainer/hooks/__pycache__/_lr_scheduler_hook.cpython-37.pyc deleted file mode 100644 index b79f6df36bdf7934b2aa3e7c69ff71c249dcb673..0000000000000000000000000000000000000000 Binary files a/colossalai/trainer/hooks/__pycache__/_lr_scheduler_hook.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/hooks/__pycache__/_metric_hook.cpython-37.pyc b/colossalai/trainer/hooks/__pycache__/_metric_hook.cpython-37.pyc deleted file mode 100644 index d06a0b566aeed5e04a28d3e2ed6e623c9a2dfeed..0000000000000000000000000000000000000000 Binary files a/colossalai/trainer/hooks/__pycache__/_metric_hook.cpython-37.pyc and /dev/null differ diff --git a/colossalai/trainer/hooks/_base_hook.py b/colossalai/trainer/hooks/_base_hook.py deleted file mode 100644 index 03c3614813b4e050ed37192c7e7cbd8334d7a640..0000000000000000000000000000000000000000 --- a/colossalai/trainer/hooks/_base_hook.py +++ /dev/null @@ -1,112 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC - -from torch import Tensor - - -class BaseHook(ABC): - """This class allows users to add desired actions in specific time points - during training or evaluation. - - :param priority: Priority in the printing, hooks with small priority will be printed in front - :type priority: int - """ - - def __init__(self, priority: int) -> None: - self.priority = priority - - def after_hook_is_attached(self, trainer): - """Actions after hooks are attached to trainer. - """ - pass - - def before_train(self, trainer): - """Actions before training. - """ - pass - - def after_train(self, trainer): - """Actions after training. - """ - pass - - def before_train_iter(self, trainer): - """Actions before running a training iteration. - """ - pass - - def after_train_iter(self, trainer, output: Tensor, label: Tensor, loss: Tensor): - """Actions after running a training iteration. - - :param trainer: Trainer which is using this hook - :type trainer: :class:`Trainer` - :param output: Output of the model - :type output: torch.Tensor - :param label: Labels of the input data - :type label: torch.Tensor - :param loss: Loss between the output and input data - :type loss: torch.Tensor - """ - pass - - def before_train_epoch(self, trainer): - """Actions before starting a training epoch. - """ - pass - - def after_train_epoch(self, trainer): - """Actions after finishing a training epoch. - """ - pass - - def before_test(self, trainer): - """Actions before evaluation. - """ - pass - - def after_test(self, trainer): - """Actions after evaluation. - """ - pass - - def before_test_epoch(self, trainer): - """Actions before starting a testing epoch. 
- """ - pass - - def after_test_epoch(self, trainer): - """Actions after finishing a testing epoch. - """ - pass - - def before_test_iter(self, trainer): - """Actions before running a testing iteration. - """ - pass - - def after_test_iter(self, trainer, output: Tensor, label: Tensor, loss: Tensor): - """Actions after running a testing iteration. - - :param trainer: Trainer which is using this hook - :type trainer: :class:`Trainer` - :param output: Output of the model - :type output: Tensor - :param label: Labels of the input data - :type label: Tensor - :param loss: Loss between the output and input data - :type loss: Tensor - """ - pass - - def init_runner_states(self, trainer, key, val): - """Initializes trainer's state. - - :param trainer: Trainer which is using this hook - :type trainer: :class:`Trainer` - :param key: Key of reseting state - :param val: Value of reseting state - """ - if key not in trainer.states: - trainer.states[key] = val diff --git a/colossalai/trainer/hooks/_checkpoint_hook.py b/colossalai/trainer/hooks/_checkpoint_hook.py deleted file mode 100644 index 4eedf85b3e770f5c0ea381eacf403dc1ee770bab..0000000000000000000000000000000000000000 --- a/colossalai/trainer/hooks/_checkpoint_hook.py +++ /dev/null @@ -1,134 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os.path as osp -from colossalai.logging import get_dist_logger - -from colossalai.registry import HOOKS -from colossalai.trainer.hooks import BaseHook -from colossalai.utils import is_dp_rank_0 -from colossalai.utils.checkpointing import get_latest_checkpoint_path, get_checkpoint_path -from colossalai.utils.checkpointing import save_checkpoint, load_checkpoint -from ._lr_scheduler_hook import LRSchedulerHook - - -@HOOKS.register_module -class SaveCheckpointHook(BaseHook): - """Saves the model by interval in training process. - - :param interval: Saving interval, defaults to 1 - :type interval: int, optional - :param checkpoint_dir: Directory of saving checkpoint, defaults to None - :type checkpoint_dir: str, optional - :param suffix: Saving suffix of the file, defaults to '' - :type suffix: str, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 10 - :type priority: int, optional - """ - - def __init__(self, - interval: int = 1, - checkpoint_dir: str = None, - suffix: str = '', - priority: int = 10): - super().__init__(priority=priority) - self.interval = interval - self.checkpoint_dir = checkpoint_dir - self.suffix = suffix - self.logger = get_dist_logger() - - # get lr scheduler from the LRSchedulerHook before train - self._lr_scheduler = None - - def after_hook_is_attached(self, trainer): - # check if lr scheduler is present in LRSchedulerHook - for hook in trainer.hooks: - if isinstance(hook, LRSchedulerHook): - self._lr_scheduler = hook.lr_scheduler - break - - def after_train_epoch(self, trainer): - """Saves the model after a training epoch. 
-        """
-        # save by interval
-        if trainer.cur_epoch % self.interval == 0:
-            # only GPUs whose data parallel rank equals 0 write to the disk
-            if is_dp_rank_0():
-                save_path = get_checkpoint_path(self.checkpoint_dir,
-                                                trainer.cur_epoch,
-                                                suffix=self.suffix)
-
-                save_checkpoint(save_path,
-                                trainer.cur_epoch,
-                                trainer.engine.model,
-                                trainer.engine.optimizer,
-                                self._lr_scheduler)
-            self.logger.info(
-                f'checkpoint for epoch {trainer.cur_epoch} is saved to {self.checkpoint_dir}', ranks=[0])
-
-
-@HOOKS.register_module
-class LoadCheckpointHook(BaseHook):
-    """Loads the model before the training process.
-
-    :param checkpoint_dir: Directory of saving checkpoint, defaults to None
-    :type checkpoint_dir: str, optional
-    :param epoch: Epoch number to be set, defaults to -1
-    :type epoch: int, optional
-    :param finetune: Whether to allow loading only part of the model, defaults to False
-    :type finetune: bool, optional
-    :param strict: Whether to strictly enforce that the loaded parameters match the shapes of the model, defaults to False
-    :type strict: bool, optional
-    :param suffix: Suffix of the checkpoint file, defaults to ''
-    :type suffix: str, optional
-    :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 0
-    :type priority: int, optional
-    """
-
-    def __init__(self,
-                 checkpoint_dir: str = None,
-                 epoch: int = -1,
-                 finetune: bool = False,
-                 strict: bool = False,
-                 suffix: str = '',
-                 priority: int = 0) -> None:
-        super().__init__(priority=priority)
-        self.epoch = epoch
-        self.checkpoint_dir = checkpoint_dir
-        self.finetune = finetune
-        self.suffix = suffix
-        self.strict = strict
-        self.logger = get_dist_logger()
-
-    def before_train(self, trainer):
-        """Loads parameters to the model before training.
-        """
-        # check if an lr scheduler is present in LRSchedulerHook
-        lr_scheduler = None
-        for hook in trainer.hooks:
-            if isinstance(hook, LRSchedulerHook):
-                lr_scheduler = hook.lr_scheduler
-                break
-
-        # use the latest checkpoint if epoch = -1
-        if self.epoch == -1:
-            path = get_latest_checkpoint_path(self.checkpoint_dir, suffix=self.suffix)
-        else:
-            path = get_checkpoint_path(self.checkpoint_dir, epoch=self.epoch, suffix=self.suffix)
-
-        if osp.exists(path):
-            last_epoch, _ = load_checkpoint(path,
-                                            trainer.engine.model,
-                                            trainer.engine.optimizer,
-                                            lr_scheduler,
-                                            finetune=self.finetune,
-                                            strict=self.strict)
-            if self.finetune:
-                trainer.cur_epoch = 0
-            else:
-                trainer.cur_epoch = last_epoch
-
-            self.logger.info(
-                f'loaded checkpoint from {path}', ranks=[0])
-        else:
-            raise FileNotFoundError(f'checkpoint is not found at {path}')
diff --git a/colossalai/trainer/hooks/_log_hook.py b/colossalai/trainer/hooks/_log_hook.py
deleted file mode 100644
index 2a081e088c138acf297078681b38b6d58b102747..0000000000000000000000000000000000000000
--- a/colossalai/trainer/hooks/_log_hook.py
+++ /dev/null
@@ -1,310 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import os
-import os.path as osp
-
-import torch
-from typing import List
-from decimal import Decimal
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.registry import HOOKS
-from colossalai.logging import DistributedLogger
-from colossalai.utils import report_memory_usage, is_dp_rank_0, \
-    is_tp_rank_0, is_no_pp_or_last_stage, MultiTimer
-from ._base_hook import BaseHook
-
-
-def _format_number(val, prec=5):
-    if isinstance(val, float):
-        return f'{val:.{prec}g}'
-    elif torch.is_tensor(val) and torch.is_floating_point(val):
-        return f'{val.item():.{prec}g}'
-    return val
-
-
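For reference, here is a hedged sketch of how the two checkpoint hooks above plug into a training run. It reuses the `trainer`, `train_loader` and `test_loader` names from the earlier sketch, and the directory and interval values are illustrative, not prescribed by the library.

```python
# Hedged sketch: adding checkpointing to the trainer from the earlier example.
from colossalai.trainer import hooks

ckpt_hooks = [
    # restores the latest checkpoint (epoch=-1); note that before_train raises
    # FileNotFoundError if no checkpoint exists at the resolved path
    hooks.LoadCheckpointHook(checkpoint_dir='./checkpoints', epoch=-1),
    # writes a checkpoint on data-parallel rank 0 every 5 epochs
    hooks.SaveCheckpointHook(interval=5, checkpoint_dir='./checkpoints'),
]

trainer.fit(
    train_dataloader=train_loader,
    epochs=100,
    test_dataloader=test_loader,
    hooks=ckpt_hooks,
)
```

Since a lower priority value means a hook is called earlier, `LoadCheckpointHook` (priority 0 by default) restores state before most other hooks run, while `SaveCheckpointHook` (priority 10 by default) saves after metrics and logs have been updated.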
-class LogByEpochHook(BaseHook): - """Hook to log by epoch - - :param logger: Logger for the log - :param interval: Recording interval, defaults to 1 - :type interval: int, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 1 - :type priority: int, optional - """ - - def __init__(self, - logger, - interval: int = 1, - priority: int = 1): - super().__init__(priority) - self.logger = logger - self._interval = interval - - def _is_epoch_to_log(self, trainer): - return trainer.cur_epoch % self._interval == 0 - - -@HOOKS.register_module -class LogMetricByStepHook(BaseHook): - """Hook to log metric by step - - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 10 - :type priority: int, optional - """ - - def __init__(self, priority: int = 10): - super().__init__(priority) - - def after_train_iter(self, trainer, *args): - trainer.states['step_metrics'] = dict() - for metric_name, metric_calculator in trainer.states['metrics']['train'].items(): - trainer.states['step_metrics'][metric_name.lower()] = \ - f'{_format_number(metric_calculator.get_last_step_value())}' - - def after_test_iter(self, trainer, *args): - trainer.states['step_metrics'] = dict() - for metric_name, metric_calculator in trainer.states['metrics']['test'].items(): - trainer.states['step_metrics'][metric_name.lower()] = \ - f'{_format_number(metric_calculator.get_last_step_value())}' - - -@HOOKS.register_module -class LogMetricByEpochHook(LogByEpochHook): - """Specialized hook to record the metric to log. - - :param logger: Logger for the log - :param interval: Recording interval, defaults to 1 - :type interval: int, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 10 - :type priority: int, optional - """ - - def __init__(self, - logger, - interval: int = 1, - priority: int = 10) -> None: - super().__init__(logger, interval, priority) - self._is_rank_to_log = is_dp_rank_0() and is_tp_rank_0() and is_no_pp_or_last_stage() - - def _get_str(self, trainer, mode): - msg = [] - for metric_name, metric_calculator in trainer.states['metrics'][mode].items(): - msg.append( - f'{metric_name} = {_format_number(metric_calculator.get_accumulated_value())}') - msg = ' | '.join(msg) - return msg - - def after_train_epoch(self, trainer): - if self._is_epoch_to_log(trainer): - msg = self._get_str(trainer=trainer, mode='train') - - if self._is_rank_to_log: - self.logger.info(f'[Epoch {trainer.cur_epoch} / Train]: {msg}') - # f'Training - Epoch {trainer.cur_epoch} - {self.__class__.__name__}: {msg}') - - def after_test_epoch(self, trainer): - if self._is_epoch_to_log(trainer): - msg = self._get_str(trainer=trainer, mode='test') - if self._is_rank_to_log: - self.logger.info(f'[Epoch {trainer.cur_epoch} / Test]: {msg}') - # f'Testing - Epoch {trainer.cur_epoch} - {self.__class__.__name__}: {msg}') - - -@HOOKS.register_module -class TensorboardHook(BaseHook): - """Specialized hook to record the metric to Tensorboard. 
- - :param log_dir: Directory of log - :type log_dir: str - :param ranks: Ranks of processors - :type ranks: typing.List - :param parallel_mode: Parallel mode, defaults to colossalai.context.parallel_mode.ParallelMode.GLOBAL - :type parallel_mode: :class:`colossalai.context.parallel_mode.ParallelMode`, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 10 - :type priority: int, optional - """ - - def __init__(self, - log_dir: str, - ranks: List = None, - parallel_mode: ParallelMode = ParallelMode.GLOBAL, - priority: int = 10, - ) -> None: - super().__init__(priority=priority) - from torch.utils.tensorboard import SummaryWriter - - # create log dir - if not gpc.is_initialized(ParallelMode.GLOBAL) or gpc.get_global_rank() == 0: - os.makedirs(log_dir, exist_ok=True) - - # determine the ranks to generate tensorboard logs - self._is_valid_rank_to_log = False - if not gpc.is_initialized(parallel_mode): - self._is_valid_rank_to_log = True - else: - local_rank = gpc.get_local_rank(parallel_mode) - - if ranks is None or local_rank in ranks: - self._is_valid_rank_to_log = True - - # check for - if gpc.is_initialized(ParallelMode.PIPELINE) and \ - not gpc.is_last_rank(ParallelMode.PIPELINE) and self._is_valid_rank_to_log: - raise ValueError("Tensorboard hook can only log on the last rank of pipeline process group") - - if self._is_valid_rank_to_log: - # create workspace on only one rank - if gpc.is_initialized(parallel_mode): - rank = gpc.get_local_rank(parallel_mode) - else: - rank = 0 - - # create workspace - log_dir = osp.join(log_dir, f'{parallel_mode}_rank_{rank}') - os.makedirs(log_dir, exist_ok=True) - - self.writer = SummaryWriter(log_dir=log_dir, filename_suffix=f'_rank_{rank}') - - def _log_by_iter(self, trainer, mode: str): - for metric_name, metric_calculator in trainer.states['metrics'][mode].items(): - if metric_calculator.epoch_only: - continue - val = metric_calculator.get_last_step_value() - - if self._is_valid_rank_to_log: - self.writer.add_scalar(f'{metric_name}/{mode}', val, trainer.cur_step) - - def _log_by_epoch(self, trainer, mode: str): - for metric_name, metric_calculator in trainer.states['metrics'][mode].items(): - if metric_calculator.epoch_only: - val = metric_calculator.get_accumulated_value() - if self._is_valid_rank_to_log: - self.writer.add_scalar(f'{metric_name}/{mode}', val, trainer.cur_step) - - def after_test_iter(self, trainer, *args): - self._log_by_iter(trainer, mode='test') - - def after_test_epoch(self, trainer): - self._log_by_epoch(trainer, mode='test') - - def after_train_iter(self, trainer, *args): - self._log_by_iter(trainer, mode='train') - - def after_train_epoch(self, trainer): - self._log_by_epoch(trainer, mode='train') - - -@HOOKS.register_module -class LogTimingByEpochHook(LogByEpochHook): - """Specialized hook to write timing record to log. 
- - :param timer: Timer for the hook - :type timer: :class:`colossalai.utils.MultiTimer` - :param logger: Logger for the log - :type logger: :class:`colossalai.logging.DistributedLogger` - :param interval: Recording interval, defaults to 1 - :type interval: int, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 10 - :type priority: int, optional - :param log_eval: Whether writes in evaluation, defaults to True - :type log_eval: bool, optional - :param ignore_num_train_steps: Number of training steps to ignore, defaults to 0 - :type ignore_num_train_steps: int, optional - """ - - def __init__(self, - timer: MultiTimer, - logger: DistributedLogger, - interval: int = 1, - priority: int = 10, - log_eval: bool = True, - ignore_num_train_steps: int = 0) -> None: - super().__init__(logger=logger, interval=interval, priority=priority) - self._timer = timer - self._log_eval = log_eval - self._is_rank_to_log = is_dp_rank_0() and is_tp_rank_0() and is_no_pp_or_last_stage() - - # extra handling to avoid the unstable readings of the first - # few training steps to affect the history mean time - self._ignore_num_train_steps = ignore_num_train_steps - self._is_train_step_history_trimmed = False - - def _get_message(self, mode): - msg = [] - for timer_name, timer in self._timer: - if timer_name.startswith(mode): - last_elapsed_time = timer.get_elapsed_time() - if timer.has_history: - if timer_name == 'Train-step' and not self._is_train_step_history_trimmed: - timer._history = timer._history[self._ignore_num_train_steps:] - self._is_train_step_history_trimmed = True - history_mean = timer.get_history_mean() - history_sum = timer.get_history_sum() - msg.append( - f'{timer_name}: last = {_format_number(last_elapsed_time)} s, mean = {_format_number(history_mean)} s' - ) - else: - msg.append(f'{timer_name}: last = {_format_number(last_elapsed_time)} s') - - msg = ' | '.join(msg) - return msg - - def after_train_epoch(self, trainer): - """Writes log after finishing a training epoch. - """ - if self._is_epoch_to_log(trainer) and self._is_rank_to_log: - msg = self._get_message('Train') - self.logger.info(f'[Epoch {trainer.cur_epoch} / Train]: {msg} | #steps/epoch = {trainer.steps_per_epoch}') - - def after_test_epoch(self, trainer): - """Writes log after finishing a testing epoch. - """ - if self._is_epoch_to_log(trainer) and self._is_rank_to_log and self._log_eval: - msg = self._get_message('Test') - self.logger.info(f'[Epoch {trainer.cur_epoch} / Test]: {msg}') - - -@HOOKS.register_module -class LogMemoryByEpochHook(LogByEpochHook): - """Specialized Hook to write memory usage record to log. - - :param logger: Logger for the log - :type logger: colossalai.logging.DistributedLogger - :param interval: Recording interval, defaults to 1 - :type interval: int, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 10 - :type priority: int, optional - :param log_eval: Whether writes in evaluation, defaults to True - :type log_eval: bool, optional - """ - - def __init__(self, - logger: DistributedLogger, - interval: int = 1, - priority: int = 10, - log_eval: bool = True, - report_cpu: bool = False, # no reference - ) -> None: - super().__init__(logger=logger, interval=interval, priority=priority) - self._log_eval = log_eval - self._is_rank_to_log = is_dp_rank_0() and is_tp_rank_0() - - def before_train(self, trainer): - """Resets before training. 
- """ - if self._is_epoch_to_log(trainer) and self._is_rank_to_log: - report_memory_usage('Before-train', self.logger) - - def after_train_epoch(self, trainer): - """Writes log after finishing a training epoch. - """ - if self._is_epoch_to_log(trainer) and self._is_rank_to_log: - report_memory_usage(f'[Epoch {trainer.cur_epoch} / Train]', self.logger) - - def after_test(self, trainer): - """Reports after testing. - """ - if self._is_epoch_to_log(trainer) and self._is_rank_to_log and self._log_eval: - report_memory_usage(f'[Epoch {trainer.cur_epoch} / Test]', self.logger) diff --git a/colossalai/trainer/hooks/_lr_scheduler_hook.py b/colossalai/trainer/hooks/_lr_scheduler_hook.py deleted file mode 100644 index 726db3bf5cd8457b4205b7483f5c9257fe3cf15f..0000000000000000000000000000000000000000 --- a/colossalai/trainer/hooks/_lr_scheduler_hook.py +++ /dev/null @@ -1,43 +0,0 @@ -from colossalai.registry import HOOKS -from torch import Tensor - -from ._metric_hook import LearningRateMetric, MetricHook - - -@HOOKS.register_module -class LRSchedulerHook(MetricHook): - """Build LR scheduler - - :param lr_scheduler: LR scheduler - :param by_epoch: If `True`, the LR will be scheduled every epoch. Else, the LR will be scheduled every batch - :type by_epoch: bool - :param store_lr_in_state: If `True`, store the learning rate in each state, defaults to `True` - :type store_lr_in_state: bool, optional - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 1 - :type priority: int, optional - """ - def __init__( - self, - lr_scheduler, - by_epoch: bool, - store_lr_in_state: bool = True, - priority: int = 1, - ): - super().__init__(priority=priority) - self.by_epoch = by_epoch - self.lr_scheduler = lr_scheduler - self.store_lr_in_state = store_lr_in_state - - def after_hook_is_attached(self, trainer): - trainer.states['metrics']['train']['LR'] = LearningRateMetric(epoch_only=self.by_epoch, - initial_lr=self.lr_scheduler.get_last_lr()[0]) - - def after_train_epoch(self, trainer): - if self.by_epoch: - self.lr_scheduler.step() - trainer.states['metrics']['train']['LR'].update(self.lr_scheduler.get_last_lr()[0]) - - def after_train_iter(self, trainer, output: Tensor, label: Tensor, loss: Tensor): - if not self.by_epoch: - self.lr_scheduler.step() - trainer.states['metrics']['train']['LR'].update(self.lr_scheduler.get_last_lr()[0]) diff --git a/colossalai/trainer/hooks/_metric_hook.py b/colossalai/trainer/hooks/_metric_hook.py deleted file mode 100644 index 3faf8b438003887c95f7f9187323acbd0f96f0c7..0000000000000000000000000000000000000000 --- a/colossalai/trainer/hooks/_metric_hook.py +++ /dev/null @@ -1,395 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC, abstractmethod -from typing import Callable - -import torch -import torch.distributed as dist -from colossalai.communication import all_reduce -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.registry import HOOKS -from colossalai.utils import get_current_device, is_no_pp_or_last_stage - -from ._base_hook import BaseHook - - -class Metric(ABC): - """A basic class of metric collectors. It collects a specific - metric during training or evaluation and it's always used with - :class:`MetricHook` to help it update its states and show the - metric. So please use corresponding hook class to make the metric - collector works. 
- - :param epoch_only: Whether the metric only read for the full epoch - :type epoch_only: bool - """ - - def __init__(self, epoch_only: bool): - # is the metric only read for the full epoch - self._epoch_only = epoch_only - - @property - def epoch_only(self): - """Returns :attr:`epoch_only`. - """ - return self._epoch_only - - @abstractmethod - def reset(self) -> None: - """Resets the metric to it's initial state. - By default, this is called at the start of each epoch. - """ - pass - - @abstractmethod - def update(self, *args, **kwargs) -> None: - """Updates the metric's state using the passed batch output. - By default, this is called once for each batch. - """ - pass - - @abstractmethod - def get_last_step_value(self): - """Returns the metric value in the last iteration. - """ - pass - - @abstractmethod - def get_accumulated_value(self): - """Computes the metric based on it's accumulated state. - By default, this is called at the end of each epoch. - - :return: the actual quantity of interest - :rtype: Any - """ - pass - - @staticmethod - @abstractmethod - def is_better(a, b) -> bool: - """Compares a and b, and returns whether a is better than b - - :return: The result of comparison - :rtype: bool - """ - pass - - -class LossMetric(Metric): - """A metric collector for loss. - - :param epoch_only: Whether the metric only read for the full epoch - :type epoch_only: bool - """ - - def __init__(self, epoch_only): - super().__init__(epoch_only=epoch_only) - self.last_step_loss = torch.zeros(1, device=get_current_device()) - self.accum_loss = torch.zeros(1, device=get_current_device()) - self.count = 0 - - def reset(self) -> None: - """Sets :attr:`last_step_loss` and :attr:`accum_loss` to zero. - """ - self.last_step_loss.zero_() - self.accum_loss.zero_() - self.count = 0 - - def update(self, loss) -> None: - """Updates :attr:`last_step_loss` and :attr:`accum_loss` with current loss. - It expects the output has loss. - - :param loss: Current loss of the output - """ - # expect output to be logits, label and loss - loss_ = loss.detach() - self.last_step_loss.copy_(loss_) - self.accum_loss.add_(loss_) - self.count += 1 - - def get_accumulated_value(self): - """Returns accumulated loss. - """ - if gpc.is_initialized(ParallelMode.DATA): - dist.all_reduce(self.accum_loss, op=dist.ReduceOp.SUM, group=gpc.get_group(ParallelMode.DATA)) - self.accum_loss.div_(gpc.get_world_size(ParallelMode.DATA)) - - self.accum_loss.div_(self.count) - return self.accum_loss.item() - - def get_last_step_value(self): - """Returns :attr:`last_step_loss`. - """ - return self.last_step_loss - - @staticmethod - def is_better(a, b): - return a < b - - -class LearningRateMetric(Metric): - """A metric collector for learning rate. - - :param epoch_only: Whether the metric only read for the full epoch - :type epoch_only: bool - :param initial_lr: Initial learning rate, defaults to 0.0 - :type initial_lr: float, optional - """ - - def __init__(self, epoch_only: bool, initial_lr: float = 0.): - super().__init__(epoch_only=epoch_only) - self.lr = initial_lr - - def reset(self) -> None: - pass - - def update(self, lr) -> None: - self.lr = lr - - def get_last_step_value(self): - return self.lr - - def get_accumulated_value(self): - return self.lr - - @staticmethod - def is_better(a, b) -> bool: - pass - - -class AccuracyMetric(Metric): - """A metric collector for accuracy. It only works for classification - tasks. 
- - :param epoch_only: Whether the metric only read for the full epoch - :type epoch_only: bool - :param accuracy_func: Accuracy function for the classification task - :type accuracy_func: :class:`typing.Callable` - """ - - def __init__(self, epoch_only: bool, accuracy_func: Callable): - super().__init__(epoch_only=epoch_only) - self.acc = accuracy_func - self.last_step_sum = torch.zeros(1, device=get_current_device()) - self.last_step_correct = torch.zeros(1, device=get_current_device()) - self.accumulated_sum = torch.zeros(1, device=get_current_device()) - self.accumulated_correct = torch.zeros(1, device=get_current_device()) - - def reset(self) -> None: - self.last_step_sum.zero_() - self.last_step_correct.zero_() - self.accumulated_sum.zero_() - self.accumulated_correct.zero_() - - def update(self, logits, targets, batch_size) -> None: - """Updates last step accuracy and accumulated accuracy with current logits - and labels. It expects the output has logits and labels. - - :param logits: The logits output of the model - :param targets: Real labels of the dataset - :param batch_size: Batch size of the task - """ - if isinstance(logits, (list, tuple)): - logits = logits[0] - if isinstance(targets, (list, tuple)): - targets = targets[0] - # update - correct = self.acc(logits, targets) - - self.last_step_sum.fill_(batch_size) - self.last_step_correct.fill_(correct) - self.accumulated_sum += self.last_step_sum - self.accumulated_correct += self.last_step_correct - - def get_last_step_value(self): - self.last_step_sum = all_reduce(self.last_step_sum, ParallelMode.DATA) - self.last_step_correct = all_reduce(self.last_step_correct, ParallelMode.DATA) - return (self.last_step_correct / self.last_step_sum).item() - - def get_accumulated_value(self): - self.accumulated_sum = all_reduce(self.accumulated_sum, ParallelMode.DATA) - self.accumulated_correct = all_reduce(self.accumulated_correct, ParallelMode.DATA) - return (self.accumulated_correct / self.accumulated_sum).item() - - @staticmethod - def is_better(a, b) -> bool: - return a > b - - -class MetricHook(BaseHook): - """Specialized hook classes for :class:`Metric`. - Some help metric collectors initialize, reset and - update their states. Others are used to display and - record the metric. - - :param priority: Priority in the printing, hooks with small priority will be printed in front - :type priority: int - """ - - def __init__( - self, - priority: int, - ): - super().__init__(priority) - self._is_stage_to_compute = is_no_pp_or_last_stage() - - def _check_metric_states_initialization(self, trainer): - if 'metrics' not in trainer.states: - self.init_runner_states(trainer, 'metrics', dict(train={}, test={})) - - -@HOOKS.register_module -class LossHook(MetricHook): - """Specialized hook class for :class:`Loss`. 
- - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 0 - :type priority: int, optional - """ - - def __init__(self, priority: int = 0): - super().__init__(priority) - - def after_hook_is_attached(self, trainer): - self._check_metric_states_initialization(trainer) - - if self._is_stage_to_compute: - self.train_loss = LossMetric(epoch_only=False) - self.test_loss = LossMetric(epoch_only=True) - - # register the metric calculator - trainer.states['metrics']['train']['Loss'] = self.train_loss - trainer.states['metrics']['test']['Loss'] = self.test_loss - - def before_train_epoch(self, trainer): - if self._is_stage_to_compute: - self.train_loss.reset() - - def after_train_iter(self, trainer, logits, label, loss): - if self._is_stage_to_compute: - self.train_loss.update(loss) - - def before_test_epoch(self, trainer): - if self._is_stage_to_compute: - self.test_loss.reset() - - def after_test_iter(self, trainer, logits, label, loss): - if self._is_stage_to_compute: - self.test_loss.update(loss) - - -@HOOKS.register_module -class AccuracyHook(MetricHook): - """Specialized hook class for :class:`Accuracy`. - - :param accuracy_func: Priority in the printing, hooks with small priority will be printed in front - :type accuracy_func: typing.Callable - :param priority: Priority in the printing, hooks with small priority will be printed in front, defaults to 0 - :type priority: int, optional - """ - - def __init__(self, accuracy_func: Callable, priority: int = 0): - super().__init__(priority) - self.accuracy_func = accuracy_func - - def after_hook_is_attached(self, trainer): - self._check_metric_states_initialization(trainer) - if self._is_stage_to_compute: - self.metric = AccuracyMetric(epoch_only=True, accuracy_func=self.accuracy_func) - - # register the metric - trainer.states['metrics']['test']['Accuracy'] = self.metric - - def before_test(self, trainer): - if self._is_stage_to_compute: - self.metric.reset() - - def after_test_iter(self, trainer, logits, targets, *args): - if self._is_stage_to_compute: - batch_size = trainer.schedule.batch_size - self.metric.update(logits, targets, batch_size) - - -class ThroughputMetric(Metric): - """Metric for :class:`Throughput`. 
- - :param epoch_only: epoch only - :type epoch_only: bool - """ - def __init__(self, epoch_only: bool, ignored_steps: int = 0): - super().__init__(epoch_only=epoch_only) - self.ignored_steps = ignored_steps - self.cur_steps = 0 - self.accumulated_num_samples = torch.zeros(1, device=get_current_device()) - self.accumulated_used_time = torch.zeros(1, device=get_current_device()) - self.last_step_num_samples = torch.zeros(1, device=get_current_device()) - self.last_step_used_time = torch.zeros(1, device=get_current_device()) - - def reset(self) -> None: - # self.cur_steps = 0 - self.accumulated_num_samples.zero_() - self.accumulated_used_time.zero_() - self.last_step_num_samples.zero_() - self.last_step_used_time.zero_() - - def update(self, num_samples, time) -> None: - self.cur_steps += 1 - self.last_step_num_samples.fill_(num_samples) - self.last_step_used_time.fill_(time) - if self.cur_steps >= self.ignored_steps: - self.accumulated_num_samples += self.last_step_num_samples - self.accumulated_used_time += self.last_step_used_time - - def get_last_step_value(self): - self.last_step_used_time = all_reduce(self.last_step_used_time, ParallelMode.DATA) / \ - gpc.get_world_size(ParallelMode.DATA) - self.last_step_num_samples = all_reduce(self.last_step_num_samples, ParallelMode.DATA) - return (self.last_step_num_samples / (self.last_step_used_time + 1e-12)).item() - - def get_accumulated_value(self): - self.accumulated_used_time = all_reduce(self.accumulated_used_time, ParallelMode.DATA) / \ - gpc.get_world_size(ParallelMode.DATA) - self.accumulated_num_samples = all_reduce(self.accumulated_num_samples, ParallelMode.DATA) - return (self.accumulated_num_samples / (self.accumulated_used_time + 1e-12)).item() - - @staticmethod - def is_better(a, b) -> bool: - pass - - -@HOOKS.register_module -class ThroughputHook(MetricHook): - """Specialized hook class for :class:`Throughput`. 
- - :param priority: priority of throughput hook, defaults to 10 - :type priority: int, optional - """ - def __init__(self, ignored_steps: int = 0, priority: int = 10): - super().__init__(priority) - self.ignored_steps = ignored_steps - - def after_hook_is_attached(self, trainer): - self._check_metric_states_initialization(trainer) - if self._is_stage_to_compute: - self.metric = ThroughputMetric(epoch_only=True, ignored_steps=self.ignored_steps) - - # register the metric - trainer.states['metrics']['train']['Throughput'] = self.metric - trainer.states['metrics']['test']['Throughput'] = self.metric - - def before_train_epoch(self, trainer): - if self._is_stage_to_compute: - self.metric.reset() - - def after_train_iter(self, trainer, *args): - if self._is_stage_to_compute: - self.metric.update(trainer.schedule.batch_size, trainer._timer.get_timer('Train-step').get_elapsed_time()) - - def before_test(self, trainer): - if self._is_stage_to_compute: - self.metric.reset() - - def after_test_iter(self, trainer, *args): - if self._is_stage_to_compute: - self.metric.update(trainer.schedule.batch_size, trainer._timer.get_timer('Test-step').get_elapsed_time()) diff --git a/colossalai/utils/__init__.py b/colossalai/utils/__init__.py deleted file mode 100644 index c769022a5c72017a805deb10c1a0b415e46259a1..0000000000000000000000000000000000000000 --- a/colossalai/utils/__init__.py +++ /dev/null @@ -1,20 +0,0 @@ -from .activation_checkpoint import checkpoint -from .common import (clip_grad_norm_fp32, conditional_context, copy_tensor_parallel_attributes, count_zeros_fp32, - free_port, is_dp_rank_0, is_model_parallel_parameter, is_no_pp_or_last_stage, is_tp_rank_0, - is_using_ddp, is_using_pp, is_using_sequence, model_branch_context, multi_tensor_applier, - param_is_not_tensor_parallel_duplicate, print_rank_0, switch_virtual_pipeline_parallel_rank, - sync_model_param) -from .cuda import empty_cache, get_current_device, set_to_cuda, synchronize -from .data_sampler import DataParallelSampler, get_dataloader -from .gradient_accumulation import accumulate_gradient -from .memory import report_memory_usage -from .timer import MultiTimer, Timer - -__all__ = [ - 'checkpoint', 'free_port', 'print_rank_0', 'sync_model_param', 'is_dp_rank_0', 'is_tp_rank_0', - 'is_no_pp_or_last_stage', 'is_using_ddp', 'is_using_pp', 'is_using_sequence', 'model_branch_context', - 'conditional_context', 'is_model_parallel_parameter', 'clip_grad_norm_fp32', 'count_zeros_fp32', - 'copy_tensor_parallel_attributes', 'param_is_not_tensor_parallel_duplicate', 'get_current_device', 'synchronize', - 'empty_cache', 'set_to_cuda', 'report_memory_usage', 'Timer', 'MultiTimer', 'multi_tensor_applier', - 'accumulate_gradient', 'DataParallelSampler', 'get_dataloader', 'switch_virtual_pipeline_parallel_rank' -] diff --git a/colossalai/utils/__pycache__/__init__.cpython-36.pyc b/colossalai/utils/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index cecdeaaaf031b4efb7b1ee88750c80fea4f7a82a..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/__init__.cpython-37.pyc b/colossalai/utils/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 59dc3d3394a5ab776586bda1b17fec1b821e7b53..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/activation_checkpoint.cpython-36.pyc 
b/colossalai/utils/__pycache__/activation_checkpoint.cpython-36.pyc deleted file mode 100644 index 9f0f8059f8507449c9cf5c8b6faea0b63a06b148..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/activation_checkpoint.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/activation_checkpoint.cpython-37.pyc b/colossalai/utils/__pycache__/activation_checkpoint.cpython-37.pyc deleted file mode 100644 index e9cbe9cbf716b60f2c670594635af985a172b9b0..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/activation_checkpoint.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/checkpointing.cpython-37.pyc b/colossalai/utils/__pycache__/checkpointing.cpython-37.pyc deleted file mode 100644 index f4411eaa28c86f47a370edc460db5f590167ab10..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/checkpointing.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/common.cpython-36.pyc b/colossalai/utils/__pycache__/common.cpython-36.pyc deleted file mode 100644 index f0b3b7fb8bca38ee736f43b3263600788e73410b..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/common.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/common.cpython-37.pyc b/colossalai/utils/__pycache__/common.cpython-37.pyc deleted file mode 100644 index d8d6fd9fb91f844b5a83551e3ee0911a2eac6918..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/common.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/cuda.cpython-36.pyc b/colossalai/utils/__pycache__/cuda.cpython-36.pyc deleted file mode 100644 index 9b6d1a6e13db7bb6b70adb467b2be76fd037ee18..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/cuda.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/cuda.cpython-37.pyc b/colossalai/utils/__pycache__/cuda.cpython-37.pyc deleted file mode 100644 index 993317b57d933c6aa5da699707665e788cb659cc..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/cuda.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/memory.cpython-36.pyc b/colossalai/utils/__pycache__/memory.cpython-36.pyc deleted file mode 100644 index 1589256d885154b89e9fe0c1bfcaecd112ed27c5..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/memory.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/memory.cpython-37.pyc b/colossalai/utils/__pycache__/memory.cpython-37.pyc deleted file mode 100644 index 12f3ba3170bd7024f99b7e5910a2ad363de09c45..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/memory.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/timer.cpython-36.pyc b/colossalai/utils/__pycache__/timer.cpython-36.pyc deleted file mode 100644 index 2040110463533d9b34f1d449f8d2c8d49794aa50..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/timer.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/__pycache__/timer.cpython-37.pyc b/colossalai/utils/__pycache__/timer.cpython-37.pyc deleted file mode 100644 index e0abf841c96f7ff1f965482d324e1b1527a195bb..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/__pycache__/timer.cpython-37.pyc and /dev/null differ diff --git 
a/colossalai/utils/activation_checkpoint.py b/colossalai/utils/activation_checkpoint.py deleted file mode 100644 index f50211614e4b6af921b80d4a75d808312ec14e4b..0000000000000000000000000000000000000000 --- a/colossalai/utils/activation_checkpoint.py +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -from torch.utils.checkpoint import check_backward_validity, detach_variable - -from colossalai.context.random import get_states, get_current_mode, set_seed_states, set_mode, sync_states - - -class CheckpointFunction(torch.autograd.Function): - - @staticmethod - def forward(ctx, run_function, *args): - check_backward_validity(args) - ctx.run_function = run_function - - # preserve rng states - ctx.fwd_cpu_rng_state = torch.get_rng_state() - sync_states() - ctx.fwd_seed_states = get_states(copy=True) - ctx.fwd_current_mode = get_current_mode() - - if hasattr(torch, 'is_autocast_enabled'): - ctx.had_autocast_in_fwd = torch.is_autocast_enabled() - else: - ctx.had_autocast_in_fwd = False - - # Save non-tensor inputs in ctx, keep a placeholder None for tensors - # to be filled out during the backward. - ctx.inputs = [] - ctx.tensor_indices = [] - tensor_inputs = [] - for i, arg in enumerate(args): - if torch.is_tensor(arg): - tensor_inputs.append(arg) - ctx.tensor_indices.append(i) - ctx.inputs.append(None) - else: - ctx.inputs.append(arg) - - ctx.save_for_backward(*tensor_inputs) - - with torch.no_grad(): - outputs = run_function(*args) - return outputs - - @staticmethod - def backward(ctx, *args): - if not torch.autograd._is_checkpoint_valid(): - raise RuntimeError( - "Checkpointing is not compatible with .grad() or when an `inputs` parameter" - " is passed to .backward(). Please use .backward() and do not pass its `inputs`" - " argument.") - # Copy the list to avoid modifying original list. - inputs = list(ctx.inputs) - tensor_indices = ctx.tensor_indices - tensors = ctx.saved_tensors - - # store the current states - bwd_cpu_rng_state = torch.get_rng_state() - sync_states() - bwd_seed_states = get_states(copy=True) - bwd_current_mode = get_current_mode() - - # set the states to what it used to be - torch.set_rng_state(ctx.fwd_cpu_rng_state) - for parallel_mode, state in ctx.fwd_seed_states.items(): - set_seed_states(parallel_mode, state) - set_mode(ctx.fwd_current_mode) - - # Fill in inputs with appropriate saved tensors. 
- for i, idx in enumerate(tensor_indices): - inputs[idx] = tensors[i] - - detached_inputs = detach_variable(tuple(inputs)) - if ctx.had_autocast_in_fwd: - with torch.enable_grad(), torch.cuda.amp.autocast(): - outputs = ctx.run_function(*detached_inputs) - else: - with torch.enable_grad(): - outputs = ctx.run_function(*detached_inputs) - - if isinstance(outputs, torch.Tensor): - outputs = (outputs,) - - # recover the rng states - torch.set_rng_state(bwd_cpu_rng_state) - for parallel_mode, state in bwd_seed_states.items(): - set_seed_states(parallel_mode, state) - set_mode(bwd_current_mode) - - # run backward() with only tensor that requires grad - outputs_with_grad = [] - args_with_grad = [] - for i in range(len(outputs)): - if torch.is_tensor(outputs[i]) and outputs[i].requires_grad: - outputs_with_grad.append(outputs[i]) - args_with_grad.append(args[i]) - if len(outputs_with_grad) == 0: - raise RuntimeError( - "none of output has requires_grad=True," - " this checkpoint() is not necessary") - torch.autograd.backward(outputs_with_grad, args_with_grad) - grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None - for inp in detached_inputs) - - return (None,) + grads - - -def checkpoint(function, *args): - """Checkpoint the computation while preserve the rng states, modified from Pytorch torch.utils.checkpoint - - :param function: Describe the forward pass function. It should know how to handle the input tuples. - :param args: Tuple containing the parameters of the function - :return: Output of running function with provided args - """ - return CheckpointFunction.apply(function, *args) diff --git a/colossalai/utils/checkpointing.py b/colossalai/utils/checkpointing.py deleted file mode 100644 index bb39c07d205f3e01feaa58b31601a80fe261e8ab..0000000000000000000000000000000000000000 --- a/colossalai/utils/checkpointing.py +++ /dev/null @@ -1,211 +0,0 @@ -import os -import os.path as osp -import re -from typing import Tuple -from pathlib import Path - -import torch - -from colossalai.context import Config -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc - -__all__ = [ - 'get_checkpoint_path', 'get_latest_checkpoint_path', 'get_latest_checkpoint_pattern', 'save_checkpoint', - 'load_checkpoint' -] - - -def unwrap_config(config: Config): - """Unwrap Config objects to normal dicts - """ - config_dict = dict() - for k, v in config.items(): - if isinstance(v, dict): - config_dict[k] = unwrap_config(v) - else: - config_dict[k] = v - - return config_dict - - -def _get_ranks_name(): - # tensor parallel - tp_local_rank = 0 - if gpc.is_initialized(ParallelMode.TENSOR): - tp_local_rank = gpc.get_local_rank(ParallelMode.TENSOR) - - # pipeline parallel - pp_local_rank = 0 - if gpc.is_initialized(ParallelMode.PIPELINE): - pp_local_rank = gpc.get_local_rank(ParallelMode.PIPELINE) - - ranks_name = f'tp{tp_local_rank}-pp{pp_local_rank}' - return ranks_name - - -def _get_standard_checkpoint_filename(epoch: int, suffix: str = ''): - ranks_name = _get_ranks_name() - return f'epoch{epoch}-{ranks_name}{suffix}.pt' - - -def get_checkpoint_path(checkpoint_dir: str, epoch: int, suffix: str = ''): - """This is a function to generate the checkpoint path from the (checkpoint_dir, epoch, suffix, gpu_parallel_rank) tuple. - This is useful during generation and recuperation of the checkpoint. 
- - :param checkpoint_dir: Set up a directory for saving checkpoints - :type checkpoint_dir: str - :param epoch: Epoch number (indicate how many epochs have you trained this model) - :type epoch: int - :param suffix: Additional notation to specify the model or checkpoint, defaults to '' - :type suffix: str, optional - :return: Checkpoint path to be generated - :rtype: path - """ - ckpt_filename = _get_standard_checkpoint_filename(epoch, suffix) - return os.path.join(checkpoint_dir, ckpt_filename) - - -def _ensure_directory_exists(filename: str): - # ensure the directory exists - dirpath = os.path.dirname(filename) - if not os.path.exists(dirpath): - Path(dirpath).mkdir(parents=True, exist_ok=True) - - -def get_latest_checkpoint_pattern(suffix: str = ''): - """Generate Regular expression of latest checkpoint's pattern - - :param suffix: Additional notation to specify the model or checkpoint, defaults to '' - :type suffix: str, optional - :return: Checkpoint pattern - :rtype: regular expression - """ - ranks_name = _get_ranks_name() - pattern = r'epoch(\d+)-{}{}\.pt'.format(ranks_name, suffix) - ckpt_pattern = re.compile(pattern) - return ckpt_pattern - - -def get_latest_checkpoint_path(checkpoint_dir: str, suffix: str = ''): - """This is a function to retrieve the latest checkpoint path from the (checkpoint_dir, suffix, gpu_parallel_rank) tuple. - This is useful during recuperation of the checkpoint, especially when you do not know the epoch number. - - :param checkpoint_dir: Directory for saving checkpoints - :type checkpoint_dir: str - :param suffix: Additional notation to specify the model or checkpoint, defaults to '' - :type suffix: str, optional - :raises FileNotFoundError: Raise error when we cannot find the latest checkpoint file with inputs given - :return: The latest checkpoint path to be retrieved - :rtype: path - """ - CKPT_NAME_PAT = get_latest_checkpoint_pattern(suffix=suffix) - - last_epoch = -1 - assert osp.isdir(checkpoint_dir), f'{checkpoint_dir} is not a directory' - - for filename in os.listdir(checkpoint_dir): - ret = CKPT_NAME_PAT.match(filename) - if ret: - epoch = int(ret[0].split('-')[0].lstrip('epoch')) - if epoch > last_epoch: - last_epoch = epoch - - if last_epoch == -1: - ranks_name = _get_ranks_name() - raise FileNotFoundError(f"Cannot find the latest checkpoint file for {ranks_name} in {checkpoint_dir}") - else: - target_file = _get_standard_checkpoint_filename(last_epoch, suffix=suffix) - path = osp.join(checkpoint_dir, target_file) - return path - - -def save_checkpoint(checkpoint_path: str, - epoch: int, - model: torch.nn.Module, - optimizer: torch.optim.Optimizer, - lr_scheduler: torch.optim.lr_scheduler._LRScheduler = None, - **kwargs): - """Given a directory to store the checkpoints, saves all the training components' parameters or buffers, such as model, - optimizer, lr_scheduler and etc. into a checkpoint dictionary. - - This method can be used for both colosalai nn.BaseModel and normal pytorch nn.Module. 
- - - :param checkpoint_path: Set up a directory for saving checkpoints - :type checkpoint_path: str - :param epoch: Epoch number (indicate how many epochs have you trained this model) - :type epoch: int - :param model: Model to be registered - :type model: torch.nn.Module - :param optimizer: Optimizer to be registered - :type optimizer: torch.optim.Optimizer - :param lr_scheduler: lr_scheduler to be registered, defaults to None - :type lr_scheduler: torch.optim.lr_scheduler._LRScheduler, optional - """ - # for compatibility with normal pytorch nn.Module - if hasattr(model, 'state_dict_for_save_checkpoint'): - model_sd = model.state_dict_for_save_checkpoint() - else: - model_sd = model.state_dict() - - # ckpt container - checkpoint = {'epoch': epoch, 'model': model_sd, 'optimizer': optimizer.state_dict(), **kwargs} - if lr_scheduler is not None: - checkpoint['lr_scheduler'] = lr_scheduler.state_dict() - - _ensure_directory_exists(checkpoint_path) - torch.save(checkpoint, checkpoint_path) - - -def load_checkpoint(checkpoint_path: str, - model: torch.nn.Module, - optimizer: torch.optim.Optimizer, - lr_scheduler: torch.optim.lr_scheduler._LRScheduler = None, - finetune: bool = False, - strict: bool = True) -> Tuple: - """Loads the checkpoint file. - If finetune is False, then we intend to continue/resume the training process from the checkpoint given. - So we copy parameters and buffers from state_dict into these modules(model, optimizer,lr_scheduler) - and its descendants. - If finetune is True, then only the weights and buffers of model should be reload. - If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s - state_dict() function. - - :param checkpoint_path: The exact and matched checkpoint_path directory to retrieve appropriate state_dict - :type checkpoint_path: str - :param model: Model to reload parameters and buffers - :type model: torch.nn.Module - :param optimizer: Optimizer to recuperate - :type optimizer: torch.optim.Optimizer - :param lr_scheduler: lr_scheduler to recuperate, defaults to None - :type lr_scheduler: torch.optim.lr_scheduler._LRScheduler, optional - :param finetune: Whether to finetune the model with new dataset or continue the pre-training, defaults to False - :type finetune: bool, optional - :param strict: Whether to strictly enforce that the keys in - :attr:`state_dict` of the checkpoint match the names of - parameters and buffers in model., defaults to True - :type strict: bool, optional - :raises ValueError: Raise error if the model/optimizer cannot successfully be recuperated - :return: (the epoch number of the checkpoint retrieved, the checkpoint retrieved) - :rtype: Tuple - - """ - # Load the checkpoint. 
- checkpoint = torch.load(checkpoint_path, map_location='cpu') - try: - last_epoch = checkpoint.pop('epoch') if not finetune else 0 - model.load_state_dict(checkpoint.pop('model'), strict=strict) - except KeyError: - raise ValueError('Checkpoint is corrupted') - - if not finetune: - try: - optimizer.load_state_dict(checkpoint.pop('optimizer')) - except KeyError: - raise ValueError('Checkpoint is corrupted') - - if lr_scheduler is not None and 'lr_scheduler' in checkpoint: - lr_scheduler.load_state_dict(checkpoint.pop('lr_scheduler')) - - return last_epoch, checkpoint diff --git a/colossalai/utils/common.py b/colossalai/utils/common.py deleted file mode 100644 index 942801018adb62cbd98950d5ba47e87b8b75eb27..0000000000000000000000000000000000000000 --- a/colossalai/utils/common.py +++ /dev/null @@ -1,289 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import random -import socket - -import torch -from torch._six import inf - -try: - import colossal_C -except: - pass - -from contextlib import contextmanager - -import torch.distributed as dist -from colossalai.constants import IS_TENSOR_PARALLEL, NUM_PARTITIONS, TENSOR_PARALLEL_ATTRIBUTES -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.global_variables import moe_env -from colossalai.global_variables import tensor_parallel_env as env - -from .multi_tensor_apply import multi_tensor_applier - - -def print_rank_0(msg: str, logger=None): - """Print messages and save logs(optional). This is executed only if you are the rank-0 gpu. - - :param msg: A string message to output - :type msg: str - :param logger: Python logger object, defaults to None - :type logger: optional - """ - if gpc.get_global_rank() == 0: - if logger is None: - print(msg, flush=True) - else: - logger.info(msg) - - -def free_port(): - while True: - try: - sock = socket.socket() - port = random.randint(20000, 65000) - sock.bind(('localhost', port)) - sock.close() - return port - except Exception: - continue - - -def sync_model_param(model, parallel_mode): - """Make sure data parameters are consistent during Data Parallel Mode - - :param model: A pyTorch nn.model on whose parameters you check the consistency - :param parallel_mode: Parallel mode to be checked - :type model: torch.nn.Module - :type parallel_mode: colossalai.context.ParallelMode - """ - if gpc.is_initialized(parallel_mode) and gpc.get_world_size(parallel_mode) > 1: - for param in model.parameters(): - ranks = gpc.get_ranks_in_group(parallel_mode) - dist.broadcast(param, src=ranks[0], group=gpc.get_group(parallel_mode)) - - -def is_dp_rank_0(): - return not gpc.is_initialized(ParallelMode.DATA) or gpc.is_first_rank(ParallelMode.DATA) - - -def is_tp_rank_0(): - return not gpc.is_initialized(ParallelMode.TENSOR) or gpc.is_first_rank(ParallelMode.TENSOR) - - -def is_no_pp_or_last_stage(): - return not gpc.is_initialized(ParallelMode.PIPELINE) or gpc.is_last_rank(ParallelMode.PIPELINE) - - -def is_using_ddp(): - return gpc.is_initialized(ParallelMode.DATA) and gpc.get_world_size(ParallelMode.DATA) > 1 - - -def is_using_pp(): - return gpc.is_initialized(ParallelMode.PIPELINE) and gpc.get_world_size(ParallelMode.PIPELINE) > 1 - - -def is_using_sequence(): - return gpc.is_initialized(ParallelMode.SEQUENCE) and gpc.get_world_size(ParallelMode.SEQUENCE) > 1 - - -@contextmanager -def conditional_context(context_manager, enable=True): - if enable: - with context_manager: - yield - else: - yield - - -class model_branch_context(object): - - 
def __enter__(self): - self.env_status = env.save() - - def __exit__(self, *exc_info): - env.load(**self.env_status) - - -def is_model_parallel_parameter(p): - return hasattr(p, IS_TENSOR_PARALLEL) and getattr(p, IS_TENSOR_PARALLEL) - - -def is_moe_parallel_parameter(p): - return hasattr(p, 'moe_param') and moe_env.data_parallel_size > 1 - - -def _calc_l2_norm(grads): - norm = 0.0 - if len(grads) > 0: - dummy_overflow_buf = torch.cuda.IntTensor([0]) - norm, _ = multi_tensor_applier( - colossal_C.multi_tensor_l2norm, - dummy_overflow_buf, - [grads], - False # no per-parameter norm - ) - return norm - - -def _calc_lp(grads, norm_type): - norm = 0.0 - for grad in grads: - grad_norm = torch.norm(grad, norm_type) - norm += grad_norm**norm_type - return norm - - -# ======== Gradient Clipping ========= - - -def clip_grad_norm_fp32(parameters, max_norm, norm_type=2): - """Clips gradient norm of an iterable of parameters whose gradients are in fp32. - - This is adapted from :func:`torch.nn.utils.clip_grad.clip_grad_norm_` and - added functionality to handle model parallel parameters. Note that - the gradients are modified in place. - - :param parameters: An iterable of Tensors or a single Tensor that will have gradients normalized - :type parameters: (Iterable[Tensor] or Tensor) - :param max_norm: Max norm of the gradients - :type max_norm: float or int - :param norm_type: Type of the used p-norm. Can be ``'inf'`` for infinity norm. - :type norm_type: float or int - - :return: Total norm of the parameters (viewed as a single vector). - :rtype: float - """ - - if isinstance(parameters, torch.Tensor): - parameters = [parameters] - - # Filter parameters based on: - # - grad should not be none - # - parameter should not be shared - # - should not be a replica due to tensor model parallelism - params = [] - for param in parameters: - if param.grad is not None: - # Make sure the grads are in fp32 - assert param.grad.type() == 'torch.cuda.FloatTensor', \ - f'expected gradient to be dtype torch.cuda.FloatTensor, but got {param.grad.type()}' - params.append(param) - # Norm parameters. - max_norm = float(max_norm) - norm_type = float(norm_type) - - # Calculate norm. - if norm_type == inf: - total_norm = max(p.grad.data.abs().max() for p in params) - total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)]) - # Take max across all model-parallel GPUs. 
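The inf-norm branch just below takes a max across ranks; the Lp branch after it instead sums per-bucket norms raised to the p-th power, relying on the identity ||g||_p^p = Σ_i ||g_i||_p^p. A self-contained, single-process sketch of that combination (hypothetical buckets, no distributed calls):

```python
import torch

norm_type = 2.0
tensor_parallel_grads = [torch.randn(10), torch.randn(5)]   # stand-ins for sharded grads
no_tensor_parallel_grads = [torch.randn(7)]                 # stand-ins for replicated grads

def calc_lp(grads, p):
    # Sum of |g|^p over a bucket, mirroring _calc_lp above.
    return sum(torch.norm(g, p) ** p for g in grads)

total_norm = (calc_lp(tensor_parallel_grads, norm_type)
              + calc_lp(no_tensor_parallel_grads, norm_type)) ** (1.0 / norm_type)

flat = torch.cat([g.flatten() for g in tensor_parallel_grads + no_tensor_parallel_grads])
assert torch.allclose(total_norm, torch.norm(flat, norm_type))  # bucketed == direct
```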
- if gpc.is_initialized(ParallelMode.MODEL) and gpc.get_world_size(ParallelMode.MODEL) > 1: - dist.all_reduce(total_norm_cuda, - op=dist.ReduceOp.MAX, - group=gpc.get_group(ParallelMode.MODEL), - async_op=False) - total_norm = total_norm_cuda[0].item() - else: - tensor_parallel_grads = [] - no_tensor_parallel_grads = [] - moe_parallel_grads = [] # used to collect moe tensor parallel gradients - for p in params: - if is_model_parallel_parameter(p): - reductor = (gpc.get_world_size(ParallelMode.TENSOR) / getattr(p, NUM_PARTITIONS))**(1 / norm_type) - tensor_parallel_grads.append(p.grad.data / reductor) - elif is_moe_parallel_parameter(p): - moe_parallel_grads.append(p.grad.data) - else: - no_tensor_parallel_grads.append(p.grad.data) - - if norm_type == 2.0: - tensor_parallel_norm = _calc_l2_norm(tensor_parallel_grads)**norm_type - no_tensor_parallel_norm = _calc_l2_norm(no_tensor_parallel_grads)**norm_type - moe_parallel_norm = _calc_l2_norm(moe_parallel_grads)**norm_type - else: - tensor_parallel_norm = _calc_lp(tensor_parallel_grads, norm_type) - no_tensor_parallel_norm = _calc_lp(no_tensor_parallel_grads, norm_type) - moe_parallel_norm = _calc_lp(moe_parallel_grads, norm_type) - # Sum across all model-parallel GPUs. - if gpc.is_initialized(ParallelMode.TENSOR) and len(tensor_parallel_grads) > 0: - dist.all_reduce(tensor_parallel_norm, op=dist.ReduceOp.SUM, group=gpc.get_group(ParallelMode.TENSOR)) - # Sum across all moe-tensor-parallel GPUs - if len(moe_parallel_grads) > 0: - dist.all_reduce(moe_parallel_norm, group=gpc.get_group(ParallelMode.MOE_MODEL)) - no_tensor_parallel_norm += moe_parallel_norm - total_norm = tensor_parallel_norm + no_tensor_parallel_norm - if gpc.is_initialized(ParallelMode.PIPELINE) and gpc.get_world_size(ParallelMode.PIPELINE) > 1: - dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=gpc.get_group(ParallelMode.PIPELINE)) - total_norm = total_norm**(1.0 / norm_type) - if isinstance(total_norm, torch.Tensor): - total_norm = total_norm.item() - - # Scale. - clip_coeff = max_norm / (total_norm + 1.0e-6) - if clip_coeff < 1.0: - grads = [p.grad.detach() for p in params] - dummy_overflow_buf = torch.cuda.IntTensor([0]) - multi_tensor_applier(colossal_C.multi_tensor_scale, dummy_overflow_buf, [grads, grads], clip_coeff) - - return total_norm - - -def count_zeros_fp32(parameters): - if isinstance(parameters, torch.Tensor): - parameters = [parameters] - - # Filter parameters based on: - # - grad should not be none - # - parameter should not be shared - # - should not be a replica due to tensor model parallelism - total_num_zeros = 0.0 - for param in parameters: - grad_not_none = param.grad is not None - is_not_tp_duplicate = param_is_not_tensor_parallel_duplicate(param) - if grad_not_none and is_not_tp_duplicate: - grad = param.grad.detach() - num_zeros = grad.numel() - torch.count_nonzero(grad) - total_num_zeros = num_zeros + total_num_zeros - - total_num_zeros = torch.IntTensor([int(total_num_zeros)]).cuda() - - # Sum across all model-parallel GPUs. 
- ops = [] - ops.append( - dist.all_reduce(total_num_zeros, op=dist.ReduceOp.SUM, group=gpc.get_group(ParallelMode.TENSOR), async_op=True)) - if gpc.is_initialized(ParallelMode.PIPELINE): - ops.append( - dist.all_reduce(total_num_zeros, - op=dist.ReduceOp.SUM, - group=gpc.get_group(ParallelMode.PIPELINE), - async_op=True)) - - for req in ops: - req.wait() - total_num_zeros = total_num_zeros.item() - - return total_num_zeros - - -def copy_tensor_parallel_attributes(src_tensor, dst_tensor): - for attr in TENSOR_PARALLEL_ATTRIBUTES: - if hasattr(src_tensor, attr): - val = getattr(src_tensor, attr) - setattr(dst_tensor, attr, val) - - -def param_is_not_tensor_parallel_duplicate(param): - return (hasattr(param, IS_TENSOR_PARALLEL) and getattr(param, IS_TENSOR_PARALLEL)) or (gpc.get_local_rank( - ParallelMode.TENSOR) == 0) - - -@contextmanager -def switch_virtual_pipeline_parallel_rank(rank): - prev_rank = gpc.virtual_pipeline_parallel_rank - try: - gpc.set_virtual_pipeline_parallel_rank(rank) - yield - finally: - gpc.set_virtual_pipeline_parallel_rank(prev_rank) diff --git a/colossalai/utils/cuda.py b/colossalai/utils/cuda.py deleted file mode 100644 index b287fa276e75591385824980de3e8b5e1a7439a3..0000000000000000000000000000000000000000 --- a/colossalai/utils/cuda.py +++ /dev/null @@ -1,45 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - - -def set_to_cuda(models): - """Send model to gpu. - - :param models: nn.module or a list of module - """ - if isinstance(models, list) and len(models) > 1: - ret = [] - for model in models: - ret.append(model.to(get_current_device())) - return ret - elif isinstance(models, list): - return models[0].to(get_current_device()) - else: - return models.to(get_current_device()) - - -def get_current_device(): - """Returns the index of a currently selected device (gpu/cpu). - """ - if torch.cuda.is_available(): - return torch.cuda.current_device() - else: - return 'cpu' - - -def synchronize(): - """Similar to cuda.synchronize(). - Waits for all kernels in all streams on a CUDA device to complete. - """ - if torch.cuda.is_available(): - torch.cuda.synchronize() - - -def empty_cache(): - """Similar to cuda.empty_cache() - Releases all unoccupied cached memory currently held by the caching allocator. 
- """ - if torch.cuda.is_available(): - torch.cuda.empty_cache() diff --git a/colossalai/utils/data_sampler/__init__.py b/colossalai/utils/data_sampler/__init__.py deleted file mode 100644 index 12798a94c2d063bb120f805967e748c5a1059a3a..0000000000000000000000000000000000000000 --- a/colossalai/utils/data_sampler/__init__.py +++ /dev/null @@ -1,4 +0,0 @@ -from .base_sampler import BaseSampler -from .data_parallel_sampler import DataParallelSampler, get_dataloader - -__all__ = ['BaseSampler', 'DataParallelSampler', 'get_dataloader'] diff --git a/colossalai/utils/data_sampler/__pycache__/__init__.cpython-36.pyc b/colossalai/utils/data_sampler/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index e3a5441539ad0d311742d04a457552c9bb285d0f..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/data_sampler/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/data_sampler/__pycache__/__init__.cpython-37.pyc b/colossalai/utils/data_sampler/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index ec4eb53e7c85f6e749bdbb79694cca44ae6abf57..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/data_sampler/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/data_sampler/__pycache__/base_sampler.cpython-36.pyc b/colossalai/utils/data_sampler/__pycache__/base_sampler.cpython-36.pyc deleted file mode 100644 index b61361175e9e397d3110150b710a59a81891e324..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/data_sampler/__pycache__/base_sampler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/data_sampler/__pycache__/base_sampler.cpython-37.pyc b/colossalai/utils/data_sampler/__pycache__/base_sampler.cpython-37.pyc deleted file mode 100644 index 755f466d9ce37b12d9b80fe7766c9a1eb046a82a..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/data_sampler/__pycache__/base_sampler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/data_sampler/__pycache__/data_parallel_sampler.cpython-36.pyc b/colossalai/utils/data_sampler/__pycache__/data_parallel_sampler.cpython-36.pyc deleted file mode 100644 index b403413bf8044bb6a827113cf5b890d383e4371f..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/data_sampler/__pycache__/data_parallel_sampler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/data_sampler/__pycache__/data_parallel_sampler.cpython-37.pyc b/colossalai/utils/data_sampler/__pycache__/data_parallel_sampler.cpython-37.pyc deleted file mode 100644 index c5695970f17614937be5b98289d529d51eebf8ab..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/data_sampler/__pycache__/data_parallel_sampler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/data_sampler/base_sampler.py b/colossalai/utils/data_sampler/base_sampler.py deleted file mode 100644 index 89f3bca5b1b51925ef7b32e4a08f1df301776fcb..0000000000000000000000000000000000000000 --- a/colossalai/utils/data_sampler/base_sampler.py +++ /dev/null @@ -1,19 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from abc import ABC, abstractmethod - - -class BaseSampler(ABC): - - def __init__(self, dataset, batch_size): - self.dataset = dataset - self.batch_size = batch_size - - @abstractmethod - def __len__(self): - pass - - @abstractmethod - def __iter__(self): - pass diff --git a/colossalai/utils/data_sampler/data_parallel_sampler.py 
b/colossalai/utils/data_sampler/data_parallel_sampler.py deleted file mode 100644 index a6061e3cdd9a93610c46041b4784869900df094c..0000000000000000000000000000000000000000 --- a/colossalai/utils/data_sampler/data_parallel_sampler.py +++ /dev/null @@ -1,174 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -# adapted from torch.utils.data.DistributedSampler - -import math -import random -import numpy as np -from typing import TypeVar, Iterator - -import torch -from torch.utils.data import Sampler, Dataset, DataLoader - -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.registry import DATA_SAMPLERS - -T_co = TypeVar('T_co', covariant=True) - - -@DATA_SAMPLERS.register_module -class DataParallelSampler(Sampler): - """A data sampler for distributed data parallelism - - :param dataset: A Dataset instance - :type dataset: torch.utils.data.Dataset - :param shuffle: Whether to shuffle data, defaults to False - :type shuffle: bool, optional - :param seed: The random seed, defaults to 0 - :type seed: int, optional - :param drop_last: Set to True to drop the last incomplete batch, if the dataset size is not divisible by the number - of replicas. If False and the dataset size is not divisible by the number of replicas, then the last batch will be smaller, - defaults to False - :type drop_last: bool, optional - """ - - def __init__(self, - dataset: Dataset, - shuffle: bool = False, - seed: int = 0, - drop_last: bool = False) -> None: - self.dataset = dataset - self.num_replicas = gpc.get_world_size(ParallelMode.DATA) - self.rank = gpc.get_local_rank(ParallelMode.DATA) - self.epoch = 0 - self.drop_last = drop_last - # If the dataset length is evenly divisible by # of replicas, then there - # is no need to drop any data, since the dataset will be split equally. - # type: ignore[arg-type] - if self.drop_last and len(self.dataset) % self.num_replicas != 0: - # Split to nearest available length that is evenly divisible. - # This is to ensure each rank receives the same amount of data when - # using this Sampler. - self.num_samples = math.ceil( - # `type:ignore` is required because Dataset cannot provide a default __len__ - # see NOTE in pytorch/torch/utils/data/sampler.py - (len(self.dataset) - self.num_replicas) / \ - self.num_replicas # type: ignore[arg-type] - ) - else: - self.num_samples = math.ceil( - len(self.dataset) / self.num_replicas) # type: ignore[arg-type] - self.total_size = self.num_samples * self.num_replicas - self.shuffle = shuffle - self.seed = seed - - def __iter__(self) -> Iterator[T_co]: - if self.shuffle: - # deterministically shuffle based on epoch and seed - g = torch.Generator() - g.manual_seed(self.seed + self.epoch) - # type: ignore[arg-type] - indices = torch.randperm(len(self.dataset), generator=g).tolist() - - # update for next epoch so that there is no need to call - # set_epoch manually - self.epoch += 1 - else: - indices = list(range(len(self.dataset))) # type: ignore[arg-type] - - if not self.drop_last: - # add extra samples to make it evenly divisible - padding_size = self.total_size - len(indices) - if padding_size <= len(indices): - indices += indices[:padding_size] - else: - indices += (indices * math.ceil(padding_size / - len(indices)))[:padding_size] - else: - # remove tail of data to make it evenly divisible. - indices = indices[:self.total_size] - assert len(indices) == self.total_size - - # subsample - indices = indices[self.rank:self.total_size:self.num_replicas] - assert len(indices) == self.num_samples - - return iter(indices)
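To make the padding and subsampling in `__iter__` concrete: with a hypothetical 10-sample dataset, 4 replicas, and drop_last=False, the index list is padded to 12 entries and rank r takes every 4th index starting at r. A toy version of the same arithmetic:

```python
import math

dataset_len, num_replicas, rank = 10, 4, 1           # hypothetical sizes
num_samples = math.ceil(dataset_len / num_replicas)  # 3 indices per rank
total_size = num_samples * num_replicas              # 12 after padding

indices = list(range(dataset_len))
indices += indices[:total_size - len(indices)]       # pad: [0..9, 0, 1]
shard = indices[rank:total_size:num_replicas]        # rank 1 -> [1, 5, 9]
assert len(shard) == num_samples
```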
 - - def __len__(self) -> int: - return self.num_samples - - def set_epoch(self, epoch: int) -> None: - r"""Sets the epoch for this sampler. When :attr:`shuffle=True`, this ensures all replicas - use a different random ordering for each epoch. Otherwise, the next iteration of this - sampler will yield the same ordering. - - :param epoch: Epoch number. - :type epoch: int - """ - self.epoch = epoch - - -def get_dataloader(dataset, - shuffle=False, - seed=1024, - add_sampler=True, - drop_last=False, - pin_memory=False, - num_workers=0, - **kwargs): - """Set up a deterministic dataloader (configuring worker seeds, the sampler, and whether to shuffle) - - .. note:: When pipeline parallel is enabled, shuffle cannot be True as it will result in a mismatch between the input data - on the 1st stage and the labels on the last stage - - :param dataset: A :class:`torch.utils.data.Dataset` object - :param shuffle: Whether to shuffle the dataset - :param seed: Random worker seed, defaults to 1024 - :param add_sampler: Whether to add a DataParallelSampler to the dataset - :param drop_last: Drop the last incomplete batch of data - :param pin_memory: Whether to pin memory address in CPU memory - :param num_workers: Number of worker threads for this dataloader - - :type dataset: :class:`torch.utils.data.Dataset` - :type shuffle: bool, optional. Default is False - :type seed: int, optional. Default is 1024 - :type add_sampler: bool, optional. Default is True - :type drop_last: bool, optional. Default is False - :type pin_memory: bool, optional. Default is False - :type num_workers: int, optional. 
Default is 0 - - :return: An object of :class:`torch.utils.data.DataLoader` - :rtype: :class:`torch.utils.data.DataLoader` - """ - _kwargs = kwargs.copy() - - if add_sampler and gpc.is_initialized(ParallelMode.DATA) and gpc.get_world_size(ParallelMode.DATA) > 1: - sampler = DataParallelSampler(dataset, shuffle=shuffle) - else: - sampler = None - - # Deterministic dataloader - def seed_worker(worker_id): - worker_seed = seed - np.random.seed(worker_seed) - torch.manual_seed(worker_seed) - random.seed(worker_seed) - - if sampler is None: - return DataLoader(dataset, - worker_init_fn=seed_worker, - shuffle=shuffle, - drop_last=drop_last, - pin_memory=pin_memory, - num_workers=num_workers, - **_kwargs) - else: - return DataLoader(dataset, - sampler=sampler, - worker_init_fn=seed_worker, - drop_last=drop_last, - pin_memory=pin_memory, - num_workers=num_workers, - **_kwargs) diff --git a/colossalai/utils/gradient_accumulation/__init__.py b/colossalai/utils/gradient_accumulation/__init__.py deleted file mode 100644 index 4c4bf343855285834255e0565ed995e6e6999299..0000000000000000000000000000000000000000 --- a/colossalai/utils/gradient_accumulation/__init__.py +++ /dev/null @@ -1,43 +0,0 @@ -import torch.nn as nn -from typing import List -from colossalai.engine import BaseGradientHandler -from typing import Iterable -from torch.optim import Optimizer -from torch.optim.lr_scheduler import _LRScheduler -from ._gradient_accumulation import GradAccumDataloader, GradAccumOptimizer, GradAccumLrSchedulerByStep, GradAccumGradientHandler - - -def accumulate_gradient(model: nn.Module, - optimizer: Optimizer, - dataloader: Iterable, - accumulate_size: int, - gradient_handlers: List[BaseGradientHandler] = None, - lr_scheduler: _LRScheduler = None): - """ - :param model: your model object - :type model: :class:`torch.nn.Module` - :param optimizer: your optimizer object - :type optimizer: :class:`torch.optim.Optimizer` - :param dataloader: your dataloader object - :type dataloader: Iterable - :param accumulate_size: the number of steps to accumulate gradients - :type accumulate_size: int - :param gradient_handlers: list of gradient handler objects. Default is None - :type gradient_handlers: List[:class:`colossalai.engine.BaseGradientHandler`] - :param lr_scheduler: your lr scheduler object. 
Default is None - :type lr_scheduler: `torch.optim.lr_scheduler._LRScheduler` - """ - optimizer = GradAccumOptimizer(optimizer, accumulate_size=accumulate_size, model=model) - dataloader = GradAccumDataloader(dataloader, accumulate_size=accumulate_size) - - if gradient_handlers is not None: - gradient_handlers = [GradAccumGradientHandler(handler, accumulate_size) for handler in gradient_handlers] - - if lr_scheduler is not None: - lr_scheduler = GradAccumLrSchedulerByStep(lr_scheduler, accumulate_size=accumulate_size) - - return optimizer, dataloader, gradient_handlers, lr_scheduler - - -__all__ = ['accumulate_gradient', 'GradAccumDataloader', 'GradAccumOptimizer', - 'GradAccumLrSchedulerByStep', 'GradAccumGradientHandler'] diff --git a/colossalai/utils/gradient_accumulation/__pycache__/__init__.cpython-36.pyc b/colossalai/utils/gradient_accumulation/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 105fc757a620474789f619112148109c2d0f5322..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/gradient_accumulation/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/gradient_accumulation/__pycache__/__init__.cpython-37.pyc b/colossalai/utils/gradient_accumulation/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 672eccd5c23fa69439b1790553e036e110749ac3..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/gradient_accumulation/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/gradient_accumulation/__pycache__/_gradient_accumulation.cpython-36.pyc b/colossalai/utils/gradient_accumulation/__pycache__/_gradient_accumulation.cpython-36.pyc deleted file mode 100644 index 563ce7708d283903af5120e71aece26196f7cbbb..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/gradient_accumulation/__pycache__/_gradient_accumulation.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/gradient_accumulation/__pycache__/_gradient_accumulation.cpython-37.pyc b/colossalai/utils/gradient_accumulation/__pycache__/_gradient_accumulation.cpython-37.pyc deleted file mode 100644 index 5448956085cdc3df7444f663698a7f96462a60b8..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/gradient_accumulation/__pycache__/_gradient_accumulation.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/gradient_accumulation/_gradient_accumulation.py b/colossalai/utils/gradient_accumulation/_gradient_accumulation.py deleted file mode 100644 index 136c46c98e7d81dcc41b802b8529fdd015b4b155..0000000000000000000000000000000000000000 --- a/colossalai/utils/gradient_accumulation/_gradient_accumulation.py +++ /dev/null @@ -1,197 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn -from torch import Tensor -from typing import Iterable, Any -from colossalai.nn.optimizer import ColossalaiOptimizer -from torch.nn.parallel.distributed import DistributedDataParallel -from torch.optim import Optimizer -from torch.optim.lr_scheduler import _LRScheduler -from torch.utils.data import DataLoader -from colossalai.utils import conditional_context -from colossalai.engine import BaseGradientHandler - - -class GradAccumOptimizer(ColossalaiOptimizer): - """A wrapper for the optimizer to enable gradient accumulation by skipping the steps - before accumulation size is reached - - :param optim: Your optimizer object - :type optim: :class:`torch.optim.Optimizer` - :param accumulate_size: The number of steps to accumulate 
gradients - :type accumulate_size: int - :param model: Your model object to check if it is DDP for special handling of the no_sync() context - :type model: :class:`torch.nn.Module` - - """ - - def __init__(self, optim: Optimizer, accumulate_size: int, model: nn.Module = None): - super().__init__(optim) - self.accumulate_size = accumulate_size - self.accumulate_step = 0 - - # handle pytorch ddp auto all reduce - self.model = model - self.is_torch_ddp = isinstance(self.model, DistributedDataParallel) - - def zero_grad(self, *args, **kwargs): - if self.accumulate_step == 0: - self.optim.zero_grad(*args, **kwargs) - - def step(self, *args, **kwargs): - if self.accumulate_step < self.accumulate_size: - return None - else: - self.accumulate_step = 0 - return self.optim.step(*args, **kwargs) - - def clip_grad_norm(self, model: nn.Module, max_norm: float): - if self.accumulate_step < self.accumulate_size: - pass - else: - self.optim.clip_grad_norm(model, max_norm) - - def backward(self, loss: Tensor): - self.accumulate_step += 1 - - if self.is_torch_ddp: - no_sync = self.accumulate_step < self.accumulate_size - with conditional_context(self.model.no_sync(), enable=no_sync): - scaled_loss = loss / self.accumulate_size - self.optim.backward(scaled_loss) - else: - scaled_loss = loss / self.accumulate_size - self.optim.backward(scaled_loss) - - def backward_by_grad(self, tensor: Tensor, grad: Tensor): - self.accumulate_step += 1 - no_sync = self.is_torch_ddp and self.accumulate_step < self.accumulate_size - - if no_sync: - with self.model.no_sync(): - self.optim.backward_by_grad(tensor, grad) - else: - self.optim.backward_by_grad(tensor, grad) - -
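Before the dataloader wrapper is defined, a sketch of how these pieces compose in a training loop. It assumes the `accumulate_gradient` helper from this package and wraps a stock SGD in `ColossalaiOptimizer` (imported by this file) so that `optim.backward` exists; the model and data are hypothetical. With accumulate_size=4, `step()` only updates parameters on every 4th backward.

```python
import torch
from colossalai.nn.optimizer import ColossalaiOptimizer          # assumed to provide .backward(loss)
from colossalai.utils.gradient_accumulation import accumulate_gradient

model = torch.nn.Linear(8, 2)
optimizer = ColossalaiOptimizer(torch.optim.SGD(model.parameters(), lr=0.1))
dataloader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(10)]

optimizer, dataloader, _, _ = accumulate_gradient(model, optimizer, dataloader,
                                                  accumulate_size=4)
criterion = torch.nn.CrossEntropyLoss()
for data, label in dataloader:            # wrapper yields only 8 of the 10 batches
    optimizer.zero_grad()                 # a no-op except at the start of a 4-step cycle
    loss = criterion(model(data), label)
    optimizer.backward(loss)              # scales the loss by 1/4 before backward
    optimizer.step()                      # updates weights only on steps 4 and 8
```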
-class GradAccumDataloader: - """A wrapper for the dataloader to enable gradient accumulation by dropping the last incomplete steps. - - For example, if a dataloader has 10 batches of data and the accumulate size is 4, the model parameters will - be updated only twice, at step 4 and step 8. The last two batches of data do not form a complete 4-step cycle, - so they will be automatically skipped by this class. If the dataloader is not a standard PyTorch dataloader - (e.g. a DALI dataloader), this class will automatically consume (load and discard) the remaining 2 batches. - - :param dataloader: Your dataloader object - :type dataloader: Iterable - :param accumulate_size: The number of steps to accumulate gradients - :type accumulate_size: int - - """ - - def __init__(self, dataloader: Iterable, accumulate_size: int) -> None: - self.dataloader = dataloader - self.consume_remain_data = not isinstance(dataloader, DataLoader) - self.steps_per_epoch = len(dataloader) - len(dataloader) % accumulate_size - - def __getattr__(self, __name: str) -> Any: - return getattr(self.dataloader, __name) - - def __len__(self): - return self.steps_per_epoch - - def __iter__(self): - self._cur_step = 0 - self._dataiter = iter(self.dataloader) - return self - - def __next__(self) -> Any: - if self._cur_step < self.steps_per_epoch: - self._cur_step += 1 - data = next(self._dataiter) - - if self._cur_step == self.steps_per_epoch and self.consume_remain_data: - # this is to handle non-standard pytorch dataloaders - # such as the dali dataloader - while True: - try: - _ = next(self._dataiter) - except StopIteration: - break - return data - else: - raise StopIteration - - -class GradAccumLrSchedulerByStep(_LRScheduler): - """A wrapper for the LR scheduler to enable gradient accumulation by skipping the steps - before accumulation size is reached - - :param lr_scheduler: Your lr scheduler object - :type lr_scheduler: :class:`torch.optim.lr_scheduler._LRScheduler` - :param accumulate_size: The number of steps to accumulate gradients - :type accumulate_size: int - - """ - - def __init__(self, lr_scheduler: _LRScheduler, accumulate_size: int) -> None: - self.lr_scheduler = lr_scheduler - self.accumulate_size = accumulate_size - self.accumulate_step = 0 - - @staticmethod - def compute_effective_steps_per_epoch(dataloader: Iterable, accumulate_size: int): - return len(dataloader) // accumulate_size - - def __getattr__(self, __name: str) -> Any: - return getattr(self.lr_scheduler, __name) - - def step(self, *args, **kwargs): - self.accumulate_step += 1 - if self.accumulate_step < self.accumulate_size: - pass - else: - self.accumulate_step = 0 - self.lr_scheduler.step(*args, **kwargs) - - def get_lr(self): - return self.lr_scheduler.get_lr() - - def get_last_lr(self): - return self.lr_scheduler.get_last_lr() - - def print_lr(self, *args, **kwargs): - self.lr_scheduler.print_lr(*args, **kwargs) - - def state_dict(self) -> dict: - return self.lr_scheduler.state_dict() - - def load_state_dict(self, state_dict: dict) -> None: - self.lr_scheduler.load_state_dict(state_dict) - - -class GradAccumGradientHandler: - """A wrapper for the gradient handler to enable gradient accumulation by skipping the steps - before accumulation size is reached - - :param grad_handler: Your gradient handler object - :type grad_handler: :class:`colossalai.engine.BaseGradientHandler` - :param accumulate_size: The number of steps to accumulate gradients - :type accumulate_size: int - - """ - - def __init__(self, grad_handler: BaseGradientHandler, accumulate_size: int) -> None: - assert isinstance(grad_handler, BaseGradientHandler), \ - f'expected grad_handler to be type BaseGradientHandler, but got {type(grad_handler)}' - self.grad_handler = grad_handler - self.accumulate_size = accumulate_size - self.accumulate_step = 0 - - def handle_gradient(self): - self.accumulate_step += 1 - if self.accumulate_step < self.accumulate_size: - pass - else: - self.accumulate_step = 0 - self.grad_handler.handle_gradient() diff --git a/colossalai/utils/memory.py b/colossalai/utils/memory.py deleted file mode 100644 index 
21c5a5145409c01b33ad20078e84efb20ee5afde..0000000000000000000000000000000000000000 --- a/colossalai/utils/memory.py +++ /dev/null @@ -1,67 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import gc - -import psutil -import torch - -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger - - -def bytes_to_GB(val, decimal=2): - """A byte-to-Gigabyte converter, using binary notation by default. - - :param val: X bytes to convert - :return: The value converted to GB - """ - return round(val / (1024 * 1024 * 1024), decimal) - - -def bytes_to_MB(val, decimal=2): - """A byte-to-Megabyte converter, using binary notation by default. - - :param val: X bytes to convert - :return: The value converted to MB - """ - return round(val / (1024 * 1024), decimal) - - -def report_memory_usage(message, logger=None, report_cpu=False): - """Calculate and print RAM usage (in MB) - - :param message: A prefix message to add in the log - :type message: str - :param logger: An instance of :class:`colossalai.logging.DistributedLogger` - :type logger: :class:`colossalai.logging.DistributedLogger`, optional - :param report_cpu: Whether to report CPU memory - :type report_cpu: bool, optional - :raises EnvironmentError: Raise error if no distributed environment has been initialized - """ - if not gpc.is_initialized(ParallelMode.GLOBAL): - raise EnvironmentError("No distributed environment is initialized") - - gpu_allocated = bytes_to_MB(torch.cuda.memory_allocated()) - gpu_max_allocated = bytes_to_MB(torch.cuda.max_memory_allocated()) - gpu_cached = bytes_to_MB(torch.cuda.memory_reserved()) - gpu_max_cached = bytes_to_MB(torch.cuda.max_memory_reserved()) - - full_log = f"{message}: GPU: allocated {gpu_allocated} MB, max allocated {gpu_max_allocated} MB, " \ - + f"cached: {gpu_cached} MB, max cached: {gpu_max_cached} MB" - - if report_cpu: - # python doesn't do real-time garbage collection so do it explicitly to get the correct RAM reports - gc.collect() - vm_stats = psutil.virtual_memory() - vm_used = bytes_to_MB(vm_stats.total - vm_stats.available) - full_log += f", CPU Virtual Memory: used = {vm_used} MB, percent = {vm_stats.percent}%" - - if logger is None: - logger = get_dist_logger() - logger.info(full_log) - - # get the peak memory to report correct data, so reset the counter for the next call - if hasattr(torch.cuda, "reset_peak_memory_stats"): # pytorch 1.4+ - torch.cuda.reset_peak_memory_stats()
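A short sketch of how `report_memory_usage` might be called; it assumes a distributed context has already been initialized by colossalai's launcher, otherwise the function raises EnvironmentError:

```python
from colossalai.utils.memory import report_memory_usage

# Somewhere inside an initialized training run (gpc set up by colossalai.launch):
# logs GPU allocated/cached memory in MB, plus CPU virtual-memory stats when report_cpu=True.
report_memory_usage('after forward pass', report_cpu=True)
```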
diff --git a/colossalai/utils/multi_tensor_apply/__init__.py b/colossalai/utils/multi_tensor_apply/__init__.py deleted file mode 100644 index 94d13b339a0de106334fee99a8b73ea6e70f60dd..0000000000000000000000000000000000000000 --- a/colossalai/utils/multi_tensor_apply/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from .multi_tensor_apply import MultiTensorApply - -multi_tensor_applier = MultiTensorApply(2048 * 32) diff --git a/colossalai/utils/multi_tensor_apply/__pycache__/__init__.cpython-36.pyc b/colossalai/utils/multi_tensor_apply/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index daba5ff0e1ce269e2800fa53b88a91a5d7b4de1b..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/multi_tensor_apply/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/multi_tensor_apply/__pycache__/__init__.cpython-37.pyc b/colossalai/utils/multi_tensor_apply/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index f7df4c75ea6a75a1d960b566e607d1c519e335ab..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/multi_tensor_apply/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/multi_tensor_apply/__pycache__/multi_tensor_apply.cpython-36.pyc b/colossalai/utils/multi_tensor_apply/__pycache__/multi_tensor_apply.cpython-36.pyc deleted file mode 100644 index 15b1d0d255e7334f1fcb46cbf71abc05f2f53151..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/multi_tensor_apply/__pycache__/multi_tensor_apply.cpython-36.pyc and /dev/null differ diff --git a/colossalai/utils/multi_tensor_apply/__pycache__/multi_tensor_apply.cpython-37.pyc b/colossalai/utils/multi_tensor_apply/__pycache__/multi_tensor_apply.cpython-37.pyc deleted file mode 100644 index 95f6fa06ff43a65cd8ff32cedea6bc20f575aa72..0000000000000000000000000000000000000000 Binary files a/colossalai/utils/multi_tensor_apply/__pycache__/multi_tensor_apply.cpython-37.pyc and /dev/null differ diff --git a/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py b/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py deleted file mode 100644 index 70a8491869417d7daf2bdbe5062513da7c37bf73..0000000000000000000000000000000000000000 --- a/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py +++ /dev/null @@ -1,38 +0,0 @@ -# modified from https://github.com/NVIDIA/apex/blob/master/apex/multi_tensor_apply/multi_tensor_apply.py - - -class MultiTensorApply(object): - """ - Apply an operation to a list of tensors efficiently - - :param chunk_size: Size of a chunk - :type chunk_size: int - """ - - available = False - warned = False - - def __init__(self, chunk_size): - try: - import colossal_C - MultiTensorApply.available = True - self.chunk_size = chunk_size - except ImportError as err: - MultiTensorApply.available = False - MultiTensorApply.import_err = err - - def check_avail(self): - if not MultiTensorApply.available: - raise RuntimeError( - "Attempted to call MultiTensorApply method, but MultiTensorApply " - "is not available, possibly because colossalai was installed without " - "CUDA extensions. Original import error message:", - MultiTensorApply.import_err) - - def __call__(self, op, noop_flag_buffer, tensor_lists, *args): - self.check_avail() - - return op(self.chunk_size, - noop_flag_buffer, - tensor_lists, - *args) diff --git a/colossalai/utils/timer.py b/colossalai/utils/timer.py deleted file mode 100644 index 1c1b440eb745e1dd6c0d8b4aa5c8492209774c47..0000000000000000000000000000000000000000 --- a/colossalai/utils/timer.py +++ /dev/null @@ -1,147 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import time -from typing import Iterator, Tuple -from .cuda import synchronize - - -class Timer: - """A timer which records execution times and provides tools to analyse them. - """ - - def __init__(self): - self._started = False - self._start_time = time.time() - self._elapsed = 0 - self._history = [] - - @property - def has_history(self): - return len(self._history) != 0 - - def start(self): - """First synchronize CUDA, reset the clock, and then start the timer. - """ - self._elapsed = 0 - synchronize() - self._start_time = time.time() - self._started = True - - def stop(self, keep_in_history: bool = False): - """Stop the timer and record the start-stop time interval. 
- - :param keep_in_history: Whether to record this interval in the history, defaults to False - :type keep_in_history: bool, optional - :return: Start-stop interval - :rtype: float - """ - synchronize() - end_time = time.time() - elapsed = end_time - self._start_time - if keep_in_history: - self._history.append(elapsed) - self._elapsed = elapsed - self._started = False - return elapsed - - def get_history_mean(self): - """Mean of all history start-stop time intervals. - - :return: Mean of time intervals - :rtype: float - """ - return sum(self._history) / len(self._history) - - def get_history_sum(self): - """Add up all the start-stop time intervals. - - :return: Sum of time intervals - :rtype: float - """ - return sum(self._history) - - def get_elapsed_time(self): - """Return the last start-stop time interval. - - .. note:: Use it only when the timer is not in progress - - :return: The last time interval - :rtype: float - """ - assert not self._started, 'Timer is still in progress' - return self._elapsed - - def reset(self): - """Clear up the timer and its history - """ - self._history = [] - self._started = False - self._elapsed = 0 - - -class MultiTimer: - """A container that manages multiple named timers - - :param on: Whether the timer is enabled. Default is True - :type on: bool, optional - """ - - def __init__(self, on: bool = True): - self._on = on - self._timers = dict() - - def start(self, name: str): - """Start the timer with the given name - - :param name: Timer's key - :type name: str - """ - if self._on: - if name not in self._timers: - self._timers[name] = Timer() - return self._timers[name].start() - - def stop(self, name: str, keep_in_history: bool): - """Stop the timer with the given name. - - :param name: Timer's key - :type name: str - :param keep_in_history: Whether to record this interval in the history - :type keep_in_history: bool - """ - if self._on: - return self._timers[name].stop(keep_in_history) - else: - return None - - def get_timer(self, name): - """Get a timer by its name - - :param name: Timer's key - :return: The timer matching the given name - :rtype: Timer - """ - return self._timers[name] -
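As context for the remaining methods, a small usage sketch of the Timer and MultiTimer APIs in this file (CPU-only here, so the internal CUDA synchronize calls are no-ops):

```python
import time
from colossalai.utils.timer import MultiTimer, Timer

timer = Timer()
timer.start()
time.sleep(0.1)
elapsed = timer.stop(keep_in_history=True)    # ~0.1s, also appended to history
print(elapsed, timer.get_history_mean())

timers = MultiTimer(on=True)
timers.start('forward')
time.sleep(0.05)
timers.stop('forward', keep_in_history=True)
for name, t in timers:                        # MultiTimer iterates (name, Timer) pairs
    print(name, t.get_history_sum())
```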
- def reset(self, name=None): - """Reset timers. - - :param name: If a name is given, only that timer is reset; otherwise all timers are reset, defaults to None - :type name: optional - """ - if self._on: - if name is not None: - self._timers[name].reset() - else: - for timer in self._timers.values(): - timer.reset() - - def is_on(self): - return self._on - - def set_status(self, mode: bool): - self._on = mode - - def __iter__(self) -> Iterator[Tuple[str, Timer]]: - for name, timer in self._timers.items(): - yield name, timer diff --git a/colossalai/zero/__init__.py b/colossalai/zero/__init__.py deleted file mode 100644 index 02c210c0b97850c2050fabc7689468b9c0b19213..0000000000000000000000000000000000000000 --- a/colossalai/zero/__init__.py +++ /dev/null @@ -1,110 +0,0 @@ -import torch -import torch.nn as nn -from torch.optim import Optimizer -from colossalai.amp.naive_amp import NaiveAMPModel -from colossalai.utils import is_no_pp_or_last_stage -from colossalai.core import global_context as gpc -from colossalai.context.parallel_mode import ParallelMode - -from .zero_redundancy_optimizer_level_2 import ZeroRedundancyOptimizer_Level_2 -from .zero_redundancy_optimizer_level_3 import ZeroRedundancyOptimizer_Level_3 - - -def convert_to_zero(model: nn.Module, - optimizer: Optimizer, - level: int, - zero_config: dict): - """ - A helper function to integrate the model and optimizer with the ZeRO optimizer and off-loading - - :param model: Your model object - :type model: :class:`torch.nn.Module` - :param optimizer: Your optimizer object - :type optimizer: :class:`torch.optim.Optimizer` - :param level: Optimizer level, can be 2 or 3 - :type level: int - :param zero_config: Configuration for zero - :type zero_config: dict - - :return: (model, optimizer) - :rtype: Tuple - """ - import deepspeed - assert level == 2 or level == 3, 'Only ZeRO Optimizer Levels 2 and 3 are supported' - model = NaiveAMPModel(model, output_to_fp32=False) - - if level == 2: - optimizer = ZeroRedundancyOptimizer_Level_2(init_optimizer=optimizer, **zero_config) - else: - optimizer = ZeroRedundancyOptimizer_Level_3(init_optimizer=optimizer, module=model, **zero_config) - return model, optimizer - - -def zero3_model_context(dtype=torch.half): - """A context to enable massive model construction for training with - ZeRO-3. Models are automatically partitioned (or, sharded) across the - system and converted to half precision. Note that the config of ZeRO-3 will be loaded automatically from `gpc.config`. - - Args: - dtype (``dtype``, optional): Can be used to change the data type of the parameters. - Supported options are ``torch.half`` and ``torch.float``. Defaults to ``torch.half`` - - This context accelerates model initialization and enables models that - are too large to allocate in their entirety in CPU memory. It has the - following effects: - - #. allocates tensors to either GPU or CPU memory or NVMe - #. converts floating point tensors to half precision - #. immediately partitions tensors among the group of data-parallel devices - #. (*optional*) replaces ``torch.nn.functional.linear`` with a more - memory-efficient implementation - - These modifications allow for models that exceed the size of local CPU/GPU - memory/NVMe, but fit within the total NVMe capacity (*i.e.*, aggregate CPU - or GPU memory or NVMe) across all nodes. Consider initializing a model with one - trillion parameters, whose weights occupy two terabytes (TB) in half - precision. 
The initial CPU allocation in full precision requires 4TB of - memory *per process*, and so a system with 8 GPUs per node would need 32TB of - CPU memory due to data-parallel redundancies. Instead, by immediately - partitioning tensors we remove the redundancies. The result is that - regardless of the number of GPUs, we still only require the original 4TB. This - allows for a linear increase in model size with the aggregate system memory. - For example, if a node has 1TB of memory and 8 GPUs, we could fit a trillion - parameter model with 4 nodes and 32 GPUs. - - Important: If the fp16 weights of the model can't fit into a single GPU's memory, - this feature must be used. - - Examples - -------- - - #. Allocate a model and partition it among all processes: - - .. code-block:: python - - with zero3_model_context(): - model = MyLargeModel() - - """ - assert dtype == torch.half or dtype == torch.float, f'Invalid dtype, expected torch.half or torch.float, got {dtype}' - import deepspeed - ds_config = { - "train_micro_batch_size_per_gpu": 1, - "gradient_accumulation_steps": 1, - "zero_optimization": { - "offload_param": getattr(gpc.config.zero, 'offload_param_config', None), - "offload_optimizer": getattr(gpc.config.zero, 'offload_optimizer_config'), - }, - "aio": getattr(gpc.config.zero, 'aio_config', None) - } - remote_device = getattr(ds_config['zero_optimization']['offload_param'], 'device', None) - pin_memory = getattr(ds_config['zero_optimization']['offload_param'], 'pin_memory', False) - return deepspeed.zero.Init(data_parallel_group=gpc.get_group(ParallelMode.DATA), - remote_device=remote_device, - config_dict_or_path=ds_config, - pin_memory=pin_memory, - dtype=dtype) - - -__all__ = ['convert_to_zero', 'ZeroRedundancyOptimizer_Level_2', - 'ZeroRedundancyOptimizer_Level_3', 'zero3_model_context'] diff --git a/colossalai/zero/__pycache__/__init__.cpython-36.pyc b/colossalai/zero/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 78d79f82ad0274118163abbd29c0b99c17970de7..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/colossalai/zero/__pycache__/__init__.cpython-37.pyc b/colossalai/zero/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 1e6a99caedbf668de9b40b22f3f3c276ee28155b..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/colossalai/zero/__pycache__/loss_scaler.cpython-36.pyc b/colossalai/zero/__pycache__/loss_scaler.cpython-36.pyc deleted file mode 100644 index f5d112185237219d28be743af9d2247d628f73a4..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/loss_scaler.cpython-36.pyc and /dev/null differ diff --git a/colossalai/zero/__pycache__/loss_scaler.cpython-37.pyc b/colossalai/zero/__pycache__/loss_scaler.cpython-37.pyc deleted file mode 100644 index a03b38e846a05da2895456763f1839dc88b30e3b..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/loss_scaler.cpython-37.pyc and /dev/null differ diff --git a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_2.cpython-36.pyc b/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_2.cpython-36.pyc deleted file mode 100644 index d0c99b06fca60a292745f7aa3ac226da1b9fab81..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_2.cpython-36.pyc and /dev/null differ diff --git 
a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_2.cpython-37.pyc b/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_2.cpython-37.pyc deleted file mode 100644 index 083e247d8bfe4da255fb11d7c5d9f510c7e7ebaa..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_2.cpython-37.pyc and /dev/null differ diff --git a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_3.cpython-36.pyc b/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_3.cpython-36.pyc deleted file mode 100644 index ffaa1aa9da365998c5bff0f7f0945a1ff19a8466..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_3.cpython-36.pyc and /dev/null differ diff --git a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_3.cpython-37.pyc b/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_3.cpython-37.pyc deleted file mode 100644 index 7f8badd09525605ebeff8d80d1e62ae13f6a22b6..0000000000000000000000000000000000000000 Binary files a/colossalai/zero/__pycache__/zero_redundancy_optimizer_level_3.cpython-37.pyc and /dev/null differ diff --git a/colossalai/zero/loss_scaler.py b/colossalai/zero/loss_scaler.py deleted file mode 100644 index ebaaf2549f1424fe9d69a9d33249b61f57bfb33e..0000000000000000000000000000000000000000 --- a/colossalai/zero/loss_scaler.py +++ /dev/null @@ -1,169 +0,0 @@ -# Copyright 2019 The Microsoft DeepSpeed Team -# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# Taken and modified for DeepSpeed from: -# https://github.com/NVIDIA/Megatron-LM/blob/master/fp16/loss_scaler.py -# Commit: 93ab4bea59dc5cbf97c079d313741866af4deac9 - - -INITIAL_LOSS_SCALE = 'init_scale' -SCALE_WINDOW = 'scale_window' -DELAYED_SHIFT = 'delayed_shift' -MIN_LOSS_SCALE = 'min_scale' - - -# item() is a recent addition, so this helps with backward compatibility. -def to_python_float(t): - if hasattr(t, 'item'): - return t.item() - return t[0] - - -class LossScalerBase: - """LossScalarBase - Base class for a loss scaler - """ - - def __init__(self, cur_scale): - self.cur_scale = cur_scale - - @property - def loss_scale(self): - return self.cur_scale - - def scale_gradient(self, module, grad_in, grad_out): - return tuple(self.loss_scale * g for g in grad_in) - - def update_scale(self, overflow): - pass - - def backward(self, loss, retain_graph=False): - scaled_loss = loss * self.loss_scale - scaled_loss.backward(retain_graph=retain_graph) - - -class LossScaler(LossScalerBase): - """ - Class that manages a static loss scale. This class is intended to interact with - :class:`FP16_Optimizer`, and should not be directly manipulated by the user. - Use of :class:`LossScaler` is enabled via the ``static_loss_scale`` argument to - :class:`FP16_Optimizer`'s constructor. - Args: - scale (float, optional, default=1.0): The loss scale. 
- """ - - def __init__(self, scale=1): - super(LossScaler, self).__init__(scale) - - # `params` is a list / generator of torch.Variable - def has_overflow(self, params): - return False - - # `x` is a torch.Tensor - def _has_inf_or_nan(x): - return False - - -class DynamicLossScaler(LossScalerBase): - """ - Class that manages dynamic loss scaling. It is recommended to use :class:`DynamicLossScaler` - indirectly, by supplying ``dynamic_loss_scale=True`` to the constructor of - :class:`FP16_Optimizer`. However, it's important to understand how :class:`DynamicLossScaler` - operates, because the default options can be changed using the - the ``dynamic_loss_args`` argument to :class:`FP16_Optimizer`'s constructor. - Loss scaling is designed to combat the problem of underflowing gradients encountered at long - times when training fp16 networks. Dynamic loss scaling begins by attempting a very high loss - scale. Ironically, this may result in OVERflowing gradients. If overflowing gradients are - encountered, :class:`DynamicLossScaler` informs :class:`FP16_Optimizer` that an overflow has - occurred. - :class:`FP16_Optimizer` then skips the update step for this particular iteration/minibatch, - and :class:`DynamicLossScaler` adjusts the loss scale to a lower value. - If a certain number of iterations occur without overflowing gradients detected, - :class:`DynamicLossScaler` increases the loss scale once more. - In this way :class:`DynamicLossScaler` attempts to "ride the edge" of - always using the highest loss scale possible without incurring overflow. - Args: - init_scale (float, optional, default=2**32): Initial loss scale attempted by :class:`DynamicLossScaler.` - scale_factor (float, optional, default=2.0): Factor used when adjusting the loss scale. If an overflow is - encountered, the loss scale is readjusted to loss scale/``scale_factor``. If ``scale_window`` consecutive - iterations take place without an overflow, the loss scale is readjusted to loss_scale*``scale_factor``. - scale_window (int, optional, default=1000): Number of consecutive iterations without an overflow to wait before - increasing the loss scale. - """ - - def __init__(self, - init_scale=2 ** 32, - scale_factor=2., - scale_window=1000, - min_scale=1, - delayed_shift=1, - consecutive_hysteresis=False): - super(DynamicLossScaler, self).__init__(init_scale) - self.cur_iter = 0 - self.last_overflow_iter = -1 - self.scale_factor = scale_factor - self.scale_window = scale_window - self.min_scale = min_scale - self.delayed_shift = delayed_shift - self.cur_hysteresis = delayed_shift - self.consecutive_hysteresis = consecutive_hysteresis - - # `params` is a list / generator of torch.Variable - def has_overflow_serial(self, params): - for p in params: - if p.grad is not None and self._has_inf_or_nan(p.grad.data): - return True - - return False - - # `x` is a torch.Tensor - @staticmethod - def _has_inf_or_nan(x): - try: - # if x is half, the .float() incurs an additional deep copy, but it's necessary if - # Pytorch's .sum() creates a one-element tensor of the same type as x - # (which is true for some recent version of pytorch). - cpu_sum = float(x.float().sum()) - # More efficient version that can be used if .sum() returns a Python scalar - # cpu_sum = float(x.sum()) - except RuntimeError as instance: - # We want to check if inst is actually an overflow exception. - # RuntimeError could come from a different error. - # If so, we still want the exception to propagate. 
- if "value cannot be converted" not in instance.args[0]: - raise - return True - else: - if cpu_sum in [float('inf'), -float('inf')] or cpu_sum != cpu_sum: - return True - return False - - # `overflow` is boolean indicating whether the gradient overflowed - def update_scale(self, overflow): - if overflow: - # self.cur_scale /= self.scale_factor - if self.delayed_shift == 1 or self.cur_hysteresis == 1: - self.cur_scale = max( - self.cur_scale / self.scale_factor, self.min_scale) - else: - self.cur_hysteresis -= 1 - self.last_overflow_iter = self.cur_iter - else: - if self.consecutive_hysteresis: - self.cur_hysteresis = self.delayed_shift - if (self.cur_iter - self.last_overflow_iter) % self.scale_window == 0: - if not self.consecutive_hysteresis: - self.cur_hysteresis = self.delayed_shift - self.cur_scale *= self.scale_factor - self.cur_iter += 1 diff --git a/colossalai/zero/zero_redundancy_optimizer_level_2.py b/colossalai/zero/zero_redundancy_optimizer_level_2.py deleted file mode 100644 index f022aaa6fdca95604da63b31d9095c66bd25d882..0000000000000000000000000000000000000000 --- a/colossalai/zero/zero_redundancy_optimizer_level_2.py +++ /dev/null @@ -1,2347 +0,0 @@ -''' -Copyright 2019 The Microsoft DeepSpeed Team -''' - -import math - -import torch -import torch.distributed as dist - -try: - from deepspeed.git_version_info import version - from deepspeed.moe.utils import is_moe_param - from deepspeed.ops.adam import DeepSpeedCPUAdam - from deepspeed.ops.op_builder import UtilsBuilder - from deepspeed.runtime.zero.config import ZERO_OPTIMIZATION_GRADIENTS -except ImportError: - pass -from packaging import version as pkg_version -from torch._six import inf -from torch.distributed.distributed_c10d import _get_global_rank -from torch.optim import Optimizer - -from colossalai.core import global_context as gpc -from colossalai.utils import report_memory_usage -from colossalai.utils.common import is_model_parallel_parameter -from .loss_scaler import LossScaler, DynamicLossScaler -from colossalai.context import ParallelMode - -# Toggle this to true to enable correctness test -# with gradient partitioning and without -pg_correctness_test = False - - -def input(msg): - return - - -def split_half_float_double(tensors): - dtypes = [ - "torch.cuda.HalfTensor", - "torch.cuda.FloatTensor", - "torch.cuda.DoubleTensor" - ] - buckets = [] - for i, dtype in enumerate(dtypes): - bucket = [t for t in tensors if t.type() == dtype] - if bucket: - buckets.append(bucket) - return buckets - - -def isclose(a, b, rtol=1e-09, atol=0.0): - return abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol) - - -def lcm(x, y): - from fractions import gcd # or can import gcd from `math` in Python 3 - return x * y // gcd(x, y) - - -def get_alignment_padding(tensor_list, alignment): - num_elements = sum([tensor.numel() for tensor in tensor_list]) - remainder = num_elements % alignment - return (alignment - remainder) if remainder else remainder - - -def move_to_cpu(tensor_list): - for tensor in tensor_list: - tensor.data = tensor.data.cpu() - - -def print_rank_msg(msg): - print(f"rank {dist.get_rank()} - {msg}") - - -class ZeroRedundancyOptimizer_Level_2(Optimizer): - """ - ZeroRedundancyOptimizer_Level_2 designed to reduce the memory footprint - required for training large deep learning models. 
- - For more details please see ZeRO: Memory Optimizations Toward Training Trillion Parameter Models - https://arxiv.org/abs/1910.02054 - - """ - - def __init__(self, - init_optimizer, - dp_parallel_mode=ParallelMode.DATA, - static_loss_scale=1.0, - dynamic_loss_scale=False, - dynamic_loss_args=None, - verbose=False, - contiguous_gradients=True, - reduce_bucket_size=500000000, - allgather_bucket_size=5000000000, - reduce_scatter=True, - overlap_comm=False, - cpu_offload=False, - clip_grad=0.0, - allreduce_always_fp32=False, - postscale_gradients=True, - gradient_predivide_factor=1.0, - gradient_accumulation_steps=1, - ignore_unused_parameters=True, - round_robin_gradients=False, - fp16_master_weights_and_gradients=False): - # mpu = None is removed from the parameter list - # tensor parallel will be automatically detected later - - # LSG: default arguments for compatibility - has_moe_layers = False - partition_grads = True - expert_parallel_group = None - expert_data_parallel_group = None - self.timers = None - self.defaults = init_optimizer.defaults - - dp_process_group = gpc.get_group(dp_parallel_mode) - if gpc.get_world_size(dp_parallel_mode) == 1: - partition_grads = False # for compatibility with dp size = 1 - - self.verbose = verbose - - if dist.get_rank() == 0 and self.verbose: - print(f"Reduce bucket size {reduce_bucket_size}") - print(f"Allgather bucket size {allgather_bucket_size}") - print(f"CPU Offload: {cpu_offload}") - print( - f'Round robin gradient partitioning: {round_robin_gradients}') - # The fused optimizer does all the work. We need this layer for two reasons: - # 1. maintain same user API from apex.fp16_utils - # 2. keep common stuff here in case we need to add new fused optimizers later - - # differences from apex.fp16_utils: - # - assume all model params in fp16 - # - assume all params require grad - # - flat by groups, not keeping state. TODO: remove state explicitly? - # - master grad and unflat master weight never exist. TODO: a way to save out unflat master? 
- if not torch.cuda.is_available(): - raise SystemError("Cannot use fp16 without CUDA.") - self.optimizer = init_optimizer - - # Load pre-built or JIT compile (un)flatten ops - util_ops = UtilsBuilder().load() - self.flatten = util_ops.flatten - self.unflatten = util_ops.unflatten - - # ZeRO stage 1 (False) or 2 (True) - self.partition_gradients = partition_grads - - self.reduce_scatter = reduce_scatter - - self.overlap_comm = overlap_comm - - self.cpu_offload = cpu_offload - - self.deepspeed_adam_offload = cpu_offload - - self.device = torch.cuda.current_device() if not self.cpu_offload else 'cpu' - - self.dp_process_group = dp_process_group - - # expert parallel group - self.ep_process_group = expert_parallel_group - - # data parallel group for experts - self.expert_dp_process_group = expert_data_parallel_group - - # data parallel size for non-experts - dp_size = dist.get_world_size(group=self.dp_process_group) - - # For MoE models this maybe different for different param group - # It will be modified during MoE setup later in the init - self.real_dp_process_group = [ - dp_process_group for i in range(len(self.optimizer.param_groups)) - ] - self.partition_count = [dp_size for i in range( - len(self.optimizer.param_groups))] - - self.is_gradient_accumulation_boundary = True - - # CPU-Offload requires contiguous gradients - self.contiguous_gradients = contiguous_gradients or cpu_offload - - self.has_moe_layers = has_moe_layers - - if self.has_moe_layers: - self._configure_moe_settings() - - if not gpc.is_initialized(ParallelMode.TENSOR) or gpc.get_world_size(ParallelMode.TENSOR) == 1: - self.model_parallel_group = None - self.model_parallel_rank = 0 - else: - self.model_parallel_group = gpc.get_group(ParallelMode.TENSOR) - self.model_parallel_rank = gpc.get_local_rank(ParallelMode.TENSOR) - - self.overflow = False - self.clip_grad = clip_grad - self.allreduce_always_fp32 = allreduce_always_fp32 - self.gradient_predivide_factor = gradient_predivide_factor - self.postscale_gradients = postscale_gradients - self.gradient_accumulation_steps = gradient_accumulation_steps - self.micro_step_id = 0 - self.ignore_unused_parameters = ignore_unused_parameters - self.round_robin_gradients = round_robin_gradients - - self.extra_large_param_to_reduce = None - self.fp16_master_weights_and_gradients = fp16_master_weights_and_gradients - - if self.fp16_master_weights_and_gradients: - assert self.cpu_offload and type(self.optimizer) in [ - DeepSpeedCPUAdam], f"fp16_master_and_gradients requires optimizer to support keeping fp16 master and gradients while keeping the optimizer states in fp32. Currently only supported using ZeRO-Offload with DeepSpeedCPUAdam. But current setting is ZeRO-Offload:{self.cpu_offload} and optimizer type {type(self.optimizer)}. 
-        if self.reduce_scatter:
-            assert not self.allreduce_always_fp32, "allreduce_always_fp32 is not yet supported with ZeRO-2 with reduce scatter enabled"
-            assert self.gradient_predivide_factor == 1.0, "gradient_predivide_factor != 1.0 is not yet supported with ZeRO-2 with reduce scatter enabled"
-            assert self.postscale_gradients, "pre-scale gradients is not yet supported with ZeRO-2 with reduce scatter enabled"
-
-        # param flattened by groups
-        self.fp16_groups = []
-        self.fp16_groups_flat = []
-
-        # param partitioned by data parallel degree
-        # this will contain a list of equal sized tensors
-        # each of which will be updated by a different process
-        self.parallel_partitioned_fp16_groups = []
-
-        # a single 32-bit partition of the parallel partitioned parameters
-        # that this process will update
-        self.single_partition_of_fp32_groups = []
-
-        # param partition info
-
-        # These are the parameters in each group that will not be updated by this process directly
-        self.params_not_in_partition = []
-
-        # These are the parameters that will be updated by this process directly
-        self.params_in_partition = []
-
-        # Offset from the first parameter in self.params_in_partition;
-        # the parameter boundaries may not align with partition boundaries,
-        # so we need to keep track of the offset
-        self.first_offset = []
-
-        # number of elements per partition in each group
-        self.partition_size = []
-
-        # align nccl all-gather send buffers to 4-byte boundary
-        # 4-byte alignment / sizeof(fp16) = 2
-        self.nccl_start_alignment_factor = 2
-
-        assert (
-            allgather_bucket_size % self.nccl_start_alignment_factor == 0), f"allgather_bucket_size must be a multiple of nccl_start_alignment_factor, {self.nccl_start_alignment_factor}"
-
-        self.all_reduce_print = False
-        self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
-
-        self.round_robin_fp16_groups = []
-        self.round_robin_fp16_indices = []
-
-        # padding on each partition for alignment purposes
-        self.groups_padding = []
-        # loop to deal with groups
-        for i, param_group in enumerate(self.optimizer.param_groups):
-            partition_id = dist.get_rank(group=self.real_dp_process_group[i])
-
-            # push this group to the list before modifying it
-            # TODO: Explore simplification that avoids the extra book-keeping by pushing the reordered group
-            self.fp16_groups.append(param_group['params'])
-
-            # Record padding required to align group to world size
-            if partition_id == dist.get_world_size(
-                    group=self.real_dp_process_group[i]) - 1:
-                padding = get_alignment_padding(self.fp16_groups[i],
-                                                self.partition_count[i])
-            else:
-                padding = 0
-            self.groups_padding.append(padding)
-
-            # not sure why apex was cloning the weights before flattening
-            # removing cloning here
-
-            if self.verbose:
-                report_memory_usage(f"Before moving param group {i} to CPU")
-            # move all the parameters to cpu to free up GPU space for creating flat buffer
-            move_to_cpu(self.fp16_groups[i])
-            if self.verbose:
-                report_memory_usage(f"After moving param group {i} to CPU")
-
-            # Reorder group parameters for load balancing of gradient partitioning during backward among ranks.
-            # This ensures that gradients are reduced in a fashion such that ownership round robins among the ranks.
-            # For example, rather than 3 gradients (g_n+2, g_n+1, g_n) that are reduced consecutively belonging
-            # to the same rank, they will instead belong to 3 ranks (r_m+2, r_m+1, r_m). A toy sketch of this
-            # reordering follows below.
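The reordering described in the comment above can be modeled with a toy, standalone function (illustrative only; the real implementation is `_round_robin_reorder`, defined later in this class):

```python
# Toy model of round-robin reordering: tensor i is assigned to partition
# i % num_partitions, so consecutive gradients land on different ranks.
def round_robin_order(num_tensors, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for i in range(num_tensors):
        buckets[i % num_partitions].append(i)
    # concatenate buckets: all tensors owned by rank 0 first, then rank 1, ...
    return [i for bucket in buckets for i in bucket]

print(round_robin_order(6, 3))  # [0, 3, 1, 4, 2, 5]
```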
-            if self.round_robin_gradients:
-                round_robin_tensors, round_robin_indices = self._round_robin_reorder(
-                    self.fp16_groups[i],
-                    dist.get_world_size(group=self.real_dp_process_group[i])
-                )
-            else:
-                round_robin_tensors = self.fp16_groups[i]
-                round_robin_indices = list(range(len(self.fp16_groups[i])))
-
-            self.round_robin_fp16_groups.append(round_robin_tensors)
-            self.round_robin_fp16_indices.append(round_robin_indices)
-
-            # create flat buffer in CPU and move to GPU
-            self.fp16_groups_flat.append(
-                self.flatten_dense_tensors_aligned(
-                    self.round_robin_fp16_groups[i],
-                    self.nccl_start_alignment_factor *
-                    dist.get_world_size(group=self.real_dp_process_group[i])).cuda(
-                        torch.cuda.current_device()))
-
-            if self.verbose:
-                report_memory_usage(
-                    f"After flattening and moving param group {i} to GPU")
-
-            if dist.get_rank(group=self.real_dp_process_group[i]) == 0:
-                report_memory_usage(
-                    f"After flattening and after emptying param group {i} cache")
-
-            # set model fp16 weight to slices of flattened buffer
-            self._update_model_fp16_weights(i)
-
-            # divide the flat weights into near-equal partitions equal to the data parallel degree;
-            # each process will compute on a different part of the partition
-            data_parallel_partitions = self.get_data_parallel_partitions(
-                self.fp16_groups_flat[i],
-                i)
-            self.parallel_partitioned_fp16_groups.append(
-                data_parallel_partitions)
-
-            # verify that data partition start locations are 4-byte aligned
-            for partitioned_data in data_parallel_partitions:
-                assert (partitioned_data.data_ptr() %
-                        (2 * self.nccl_start_alignment_factor) == 0)
-
-            # a partition of the fp32 master weights that will be updated by this process
-            if not fp16_master_weights_and_gradients:
-                self.single_partition_of_fp32_groups.append(
-                    self.parallel_partitioned_fp16_groups[i][partition_id].to(
-                        self.device).clone().float().detach())
-            else:
-                self.single_partition_of_fp32_groups.append(
-                    self.parallel_partitioned_fp16_groups[i][partition_id].to(
-                        self.device).clone().half().detach())
-
-            # modify the optimizer to have a flat master weight
-            self.single_partition_of_fp32_groups[
-                i].requires_grad = True  # keep this in case internal optimizer uses it
-            param_group['params'] = [self.single_partition_of_fp32_groups[i]]
-
-            partition_size = len(self.fp16_groups_flat[i]) / dist.get_world_size(
-                group=self.real_dp_process_group[i])
-            params_in_partition, params_not_in_partition, first_offset = self.get_partition_info(
-                self.round_robin_fp16_groups[i],
-                partition_size,
-                partition_id)
-
-            self.partition_size.append(partition_size)
-            self.params_in_partition.append(params_in_partition)
-            self.params_not_in_partition.append(params_not_in_partition)
-            self.first_offset.append(first_offset)
-
-        for rank in range(dist.get_world_size()):
-            if dist.get_rank() == rank and self.verbose:
-                print(
-                    f"Rank: {rank} partition count {self.partition_count} and sizes{[(p.numel(), self.is_moe_param_group[i] if hasattr(self, 'is_moe_param_group') else False) for i, p in enumerate(self.single_partition_of_fp32_groups)]} "
-                )
-            dist.barrier()
-        # exit(0)
-        self.reduce_bucket_size = int(reduce_bucket_size)
-        self.allgather_bucket_size = int(allgather_bucket_size)
-
-        self.reduction_event = torch.cuda.Event(
-            enable_timing=False, blocking=False)
-        self.reduction_stream = torch.cuda.Stream()
-        self.cpu_computation_stream = torch.cuda.Stream()
-        self.copy_grad_stream = torch.cuda.Stream()
-        self.callback_queued = False
-
-        self.param_dict = {}
-
-        # map between param_id and bool to specify if a param is in this
partition - self.is_param_in_current_partition = {} - - self.grads_in_ipg_bucket = [] - self.params_in_ipg_bucket = [] - self.elements_in_ipg_bucket = 0 - self.params_already_reduced = [] - self._release_ipg_buffers() - self.previous_reduced_grads = None - self.ipg_bucket_has_moe_params = False - - # simplified param id - self.param_id = {} - - largest_param_numel = 0 - count = 0 - for i, params_group in enumerate(self.fp16_groups): - for param in params_group: - unique_id = id(param) - self.param_id[unique_id] = count - self.param_dict[count] = param - self.params_already_reduced.append(False) - if param.numel() > largest_param_numel: - largest_param_numel = param.numel() - count = count + 1 - - for param_group in self.params_in_partition: - for param in param_group: - self.is_param_in_current_partition[self.get_param_id( - param)] = True - - for param_group in self.params_not_in_partition: - for param in param_group: - self.is_param_in_current_partition[self.get_param_id( - param)] = False - - if self.cpu_offload: - self.accumulated_grads_in_cpu = {} - self.norm_for_param_grads = {} - self.local_overflow = False - self.grad_position = {} - self.temp_grad_buffer_for_cpu_offload = torch.zeros( - largest_param_numel, - device=self.device, - dtype=self.dtype).pin_memory() - self.temp_grad_buffer_for_gpu_offload = torch.zeros( - largest_param_numel, - device=torch.cuda.current_device(), - dtype=self.dtype) - - for i, params_group in enumerate(self.fp16_groups): - self.get_grad_position(i, - self.params_in_partition[i], - self.first_offset[i], - self.partition_size[i]) - - # mapping from parameter to partition that it belongs to - self.param_to_partition_ids = {} - - # stores if a partition has been reduced in this step - self.is_partition_reduced = {} - - # number of grads in partition that still need to be computed - self.remaining_grads_in_partition = {} - - # total number of grads in partition - self.total_grads_in_partition = {} - - # stores if a grad in a partition has been computed or not - self.is_grad_computed = {} - - # stores the offset at which a parameter gradient needs to be inserted in a partition - self.grad_partition_insertion_offset = {} - - # the offset in the gradient at which it must be inserted at the beginning of the partition - self.grad_start_offset = {} - - # will store the averaged gradients required by this partition - self.averaged_gradients = {} - - # store index of first parameter in each partition - self.first_param_index_in_partition = {} - - # initializes all data structures for implementing gradient partitioning - self.initialize_gradient_partitioning_data_structures() - - # resets the data structure value for the next backward propagation - self.reset_partition_gradient_structures() - - # creates backward hooks for gradient partitioning - if self.partition_gradients or self.overlap_comm: - self.create_reduce_and_remove_grad_hooks() - - # we may have a way of fusing dynamic scale. 
Not supported for now.
-        if self.dtype == torch.float or not dynamic_loss_scale:
-            loss_scale_value = 1.0 if self.dtype == torch.float else static_loss_scale
-
-            self.dynamic_loss_scale = False
-            self.loss_scaler = LossScaler(scale=loss_scale_value)
-            cur_iter = 0
-        else:
-            if dynamic_loss_args is None:
-                self.loss_scaler = DynamicLossScaler()
-            else:
-                self.loss_scaler = DynamicLossScaler(**dynamic_loss_args)
-
-            self.dynamic_loss_scale = True
-
-        if self.verbose:
-            report_memory_usage("Before initializing optimizer states")
-        self.initialize_optimizer_states()
-        if self.verbose:
-            report_memory_usage("After initializing optimizer states")
-
-        if dist.get_rank() == 0:
-            print("optimizer state initialized")
-
-        if dist.get_rank(group=self.dp_process_group) == 0:
-            report_memory_usage("After initializing ZeRO optimizer")
-
-    def _configure_moe_settings(self):
-        assert self.contiguous_gradients, "Contiguous Gradients in ZeRO Stage 2 must be set to True for MoE. Other code paths are not tested with MoE"
-        assert self.reduce_scatter, "Reduce Scatter in ZeRO Stage 2 must be set to True for MoE. Other code paths are not tested with MoE"
-
-        def is_moe_group(group):
-            return 'moe' in group and group['moe']
-
-        assert any([is_moe_group(group) for group in
-                    self.optimizer.param_groups]), "The model has moe layers, but none of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating the optimizer"
-        self.is_moe_param_group = []
-        for i, group in enumerate(self.optimizer.param_groups):
-            if is_moe_group(group):
-                assert all(
-                    [is_moe_param(param) for param in group['params']]), "All params in MoE group must be MoE params"
-                self.real_dp_process_group[i] = self.expert_dp_process_group
-                self.partition_count[i] = dist.get_world_size(
-                    group=self.expert_dp_process_group)
-                self.is_moe_param_group.append(True)
-            else:
-                self.is_moe_param_group.append(False)
-
-        assert self.expert_dp_process_group is not None, "Expert data parallel group should be configured with MoE"
-        assert self.ep_process_group is not None, "Expert parallel group should be configured with MoE"
-
-    def _update_model_fp16_weights(self, group_index):
-        updated_params = self.unflatten(self.fp16_groups_flat[group_index],
-                                        self.round_robin_fp16_groups[group_index])
-        for p, q in zip(self.round_robin_fp16_groups[group_index], updated_params):
-            p.data = q.data
-
-        # set model fp16 weight to slices of reordered flattened buffer
-        for param_index, param in enumerate(self.fp16_groups[group_index]):
-            new_index = self.round_robin_fp16_indices[group_index][param_index]
-            param.data = self.round_robin_fp16_groups[group_index][new_index].data
-
-    def _round_robin_reorder(self, tensor_list, num_partitions):
-
-        # disable round robin if we need to debug something
-        # return tensor_list, list(range(len(tensor_list)))
-
-        partition_tensors = {}
-
-        for i, tensor in enumerate(tensor_list):
-            j = i % num_partitions
-            if j not in partition_tensors:
-                partition_tensors[j] = []
-            partition_tensors[j].append((i, tensor))
-
-        reordered_tensors = []
-        reordered_indices = {}
-
-        for partition_index in partition_tensors.keys():
-            for i, (original_index, tensor) in enumerate(partition_tensors[partition_index]):
-                reordered_indices[original_index] = len(reordered_tensors)
-                reordered_tensors.append(tensor)
-
-        return reordered_tensors, reordered_indices
-
-    def _release_ipg_buffers(self):
-        if self.contiguous_gradients:
-            self.ipg_buffer = None
-            self.grads_in_partition = None
-            self.grads_in_partition_offset = 0
-
-    def
initialize_optimizer_states(self): - - for i, group in enumerate(self.fp16_groups): - single_grad_partition = torch.zeros( - int(self.partition_size[i]), - dtype=self.single_partition_of_fp32_groups[i].dtype, - device=self.device) - self.single_partition_of_fp32_groups[ - i].grad = single_grad_partition.pin_memory( - ) if self.cpu_offload else single_grad_partition - - self.optimizer.step() - - if not self.cpu_offload: - for group in self.single_partition_of_fp32_groups: - group.grad = None # class init - - return - - ######################################################################### - #################### ZeRO Stage 1 - reduce gradients #################### - ######################################################################### - - def reduce_gradients(self, pipeline_parallel=False): - world_size = dist.get_world_size(self.dp_process_group) - my_rank = dist.get_rank(self.dp_process_group) - - # with PP we must create ipg buffer, since backward is handled outside zero - if pipeline_parallel and self.contiguous_gradients: - self.ipg_buffer = [] - buf_0 = torch.empty(int(self.reduce_bucket_size), - dtype=self.dtype, - device=torch.cuda.current_device()) - self.ipg_buffer.append(buf_0) - self.ipg_index = 0 - - if not self.overlap_comm: - for i, group in enumerate(self.fp16_groups): - for param in group: - if param.grad is not None: - self.reduce_ready_partitions_and_remove_grads(param, i) - - # reduce any pending grads in either hook/non-hook case - self.overlapping_partition_gradients_reduce_epilogue() - - ######################################################################### - #########################ZeRO Partition Gradients######################## - ######################################################################### - - def get_first_param_index(self, group_id, param_group, partition_id): - for index, param in enumerate(param_group): - param_id = self.get_param_id(param) - if partition_id in self.param_to_partition_ids[group_id][param_id]: - return index - return None - - def initialize_gradient_partitioning_data_structures(self): - - for i, param_group in enumerate(self.round_robin_fp16_groups): - - total_partitions = dist.get_world_size( - group=self.real_dp_process_group[i]) - - self.param_to_partition_ids[i] = {} - self.is_partition_reduced[i] = {} - self.total_grads_in_partition[i] = {} - self.remaining_grads_in_partition[i] = {} - self.is_grad_computed[i] = {} - self.grad_partition_insertion_offset[i] = {} - self.grad_start_offset[i] = {} - self.first_param_index_in_partition[i] = {} - - for partition_id in range(total_partitions): - self.is_grad_computed[i][partition_id] = {} - self.grad_partition_insertion_offset[i][partition_id] = {} - self.grad_start_offset[i][partition_id] = {} - self.total_grads_in_partition[i][partition_id] = 0 - self.initialize_gradient_partition( - i, param_group, partition_id) - self.is_partition_reduced[i][partition_id] = False - self.first_param_index_in_partition[i][ - partition_id] = self.get_first_param_index( - i, - param_group, - partition_id) - - def independent_gradient_partition_epilogue(self): - if self.verbose: - self.report_ipg_memory_usage( - f"In ipg_epilogue before reduce_ipg_grads", 0) - self.reduce_ipg_grads() - if self.verbose: - self.report_ipg_memory_usage( - f"In ipg_epilogue after reduce_ipg_grads", 0) - - # if dist.get_rank() == 0: - # print()("Params already reduced %s", self.params_already_reduced) - for i in range(len(self.params_already_reduced)): - self.params_already_reduced[i] = False - - if 
self.overlap_comm: - torch.cuda.synchronize() - # It is safe to clear previously reduced grads of other partitions - self._clear_previous_reduced_grads() - - if self.cpu_offload is False: - for i, _ in enumerate(self.fp16_groups): - - if not i in self.averaged_gradients or self.averaged_gradients[i] is None: - self.averaged_gradients[i] = self.get_flat_partition( - self.params_in_partition[i], - self.first_offset[i], - self.partition_size[i], - dtype=self.dtype, - device=torch.cuda.current_device(), - return_tensor_list=True) - else: - avg_new = self.get_flat_partition(self.params_in_partition[i], - self.first_offset[i], - self.partition_size[i], - dtype=self.dtype, - device=torch.cuda.current_device(), - return_tensor_list=True) - - for accumulated_grad, new_avg_grad in zip(self.averaged_gradients[i], avg_new): - accumulated_grad.add_(new_avg_grad) - - self._release_ipg_buffers() - - # No need to keep the gradients anymore. - # All gradients required by the step - # are in self.averaged_gradients - self.zero_grad() - - if self.verbose: - report_memory_usage(f"End ipg_epilogue") - - # resets all partition to no reduced - # sets remaining grads to the total number of grads in each partition - # set is grad computed to false for all grads in partition - def reset_partition_gradient_structures(self): - for i, _ in enumerate(self.fp16_groups): - total_partitions = dist.get_world_size( - group=self.real_dp_process_group[i]) - for partition_id in range(total_partitions): - self.is_partition_reduced[i][partition_id] = False - self.remaining_grads_in_partition[i][ - partition_id] = self.total_grads_in_partition[i][partition_id] - - for param_id in self.is_grad_computed[i][partition_id]: - self.is_grad_computed[i][partition_id][param_id] = False - - def initialize_gradient_partition(self, i, param_group, partition_id): - def set_key_value_list(dictionary, key, value): - if key in dictionary: - dictionary[key].append(value) - else: - dictionary[key] = [value] - - def increment_value(dictionary, key): - if key in dictionary: - dictionary[key] += 1 - else: - dictionary[key] = 1 - - partition_size = self.partition_size[i] - - start_index = partition_size * partition_id - end_index = partition_size * (partition_id + 1) - - current_index = 0 - first_offset = 0 - - for param in param_group: - - param_size = param.numel() - param_id = self.get_param_id(param) - - if (current_index >= start_index and current_index < end_index): - set_key_value_list(self.param_to_partition_ids[i], - param_id, - partition_id) - increment_value(self.total_grads_in_partition[i], partition_id) - - self.is_grad_computed[i][partition_id][param_id] = False - - self.grad_partition_insertion_offset[i][partition_id][ - param_id] = current_index - start_index - self.grad_start_offset[i][partition_id][param_id] = 0 - - elif start_index > current_index and start_index < (current_index + - param_size): - assert ( - first_offset == 0), "This can happen either zero or only once as this must be the first tensor in the partition" - first_offset = start_index - current_index - - set_key_value_list(self.param_to_partition_ids[i], - param_id, - partition_id) - increment_value(self.total_grads_in_partition[i], partition_id) - - self.is_grad_computed[i][partition_id][param_id] = False - - self.grad_partition_insertion_offset[i][partition_id][param_id] = 0 - self.grad_start_offset[i][partition_id][param_id] = first_offset - - current_index = current_index + param_size - - def overlapping_partition_gradients_reduce_epilogue(self): - 
self.independent_gradient_partition_epilogue() - - def create_reduce_and_remove_grad_hooks(self): - self.grad_accs = [] - for i, param_group in enumerate(self.fp16_groups): - for param in param_group: - if param.requires_grad: - def wrapper(param, i): - param_tmp = param.expand_as(param) - grad_acc = param_tmp.grad_fn.next_functions[0][0] - - def reduce_partition_and_remove_grads(*notneeded): - self.reduce_ready_partitions_and_remove_grads( - param, i) - - grad_acc.register_hook( - reduce_partition_and_remove_grads) - self.grad_accs.append(grad_acc) - - wrapper(param, i) - - def get_param_id(self, param): - unique_id = id(param) - return self.param_id[unique_id] - - def report_ipg_memory_usage(self, tag, param_elems): - elem_count = self.elements_in_ipg_bucket + param_elems - percent_of_bucket_size = ( - 100.0 * elem_count) // self.reduce_bucket_size - if self.verbose: - report_memory_usage( - f"{tag}: elems in_bucket {self.elements_in_ipg_bucket} param {param_elems} max_percent {percent_of_bucket_size}" - ) - - # create a flat tensor aligned at the alignment boundary - def flatten_dense_tensors_aligned(self, tensor_list, alignment): - num_elements = 0 - for tensor in tensor_list: - num_elements = num_elements + tensor.numel() - - remaining = num_elements % alignment - - if remaining: - elements_to_add = alignment - remaining - pad_tensor = torch.zeros(elements_to_add, - device=tensor_list[0].device, - dtype=tensor_list[0].dtype) - padded_tensor_list = tensor_list + [pad_tensor] - - num_elements = num_elements + elements_to_add - else: - padded_tensor_list = tensor_list - - return self.flatten(padded_tensor_list) - - ############### Independent Partition Gradient ######################## - def reduce_independent_p_g_buckets_and_remove_grads(self, param, i): - if self.elements_in_ipg_bucket + param.numel() > self.reduce_bucket_size: - self.report_ipg_memory_usage("In ipg_remove_grads before reduce_ipg_grads", - param.numel()) - self.reduce_ipg_grads() - if self.contiguous_gradients and self.overlap_comm: - # Swap ipg_index between 0 and 1 - self.ipg_index = 1 - self.ipg_index - - self.report_ipg_memory_usage("In ipg_remove_grads after reduce_ipg_grads", - param.numel()) - - param_id = self.get_param_id(param) - assert self.params_already_reduced[param_id] == False, \ - f"The parameter {param_id} has already been reduced. \ - Gradient computed twice for this partition. 
\ - Multiple gradient reduction is currently not supported" - - if param.numel() > self.reduce_bucket_size: - self.extra_large_param_to_reduce = param - - elif self.contiguous_gradients: - # keeping the gradients contiguous to prevent memory fragmentation, and avoid flattening - new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow( - 0, - self.elements_in_ipg_bucket, - param.numel()) - new_grad_tensor.copy_(param.grad.view(-1)) - param.grad.data = new_grad_tensor.data.view_as(param.grad) - - self.elements_in_ipg_bucket += param.numel() - - assert param.grad is not None, f"rank {dist.get_rank()} - Invalid to reduce Param {param_id} with None gradient" - - self.grads_in_ipg_bucket.append(param.grad) - self.params_in_ipg_bucket.append((i, param, param_id)) - - # make sure the average tensor function knows how to average the gradients - if is_moe_param(param): - self.ipg_bucket_has_moe_params = True - - self.report_ipg_memory_usage("End ipg_remove_grads", 0) - - def print_rank_0(self, message): - if dist.get_rank() == 0 and self.verbose: - print(message) - - def gradient_reduction_w_predivide(self, tensor): - - dp_world_size = dist.get_world_size(group=self.dp_process_group) - - tensor_to_allreduce = tensor - - if self.allreduce_always_fp32: - tensor_to_allreduce = tensor.float() - - if self.postscale_gradients: - if self.gradient_predivide_factor != 1.0: - tensor_to_allreduce.mul_(1. / self.gradient_predivide_factor) - - dist.all_reduce(tensor_to_allreduce, group=self.dp_process_group) - - if self.gradient_predivide_factor != dp_world_size: - tensor_to_allreduce.mul_( - self.gradient_predivide_factor / dp_world_size) - else: - tensor_to_allreduce.div_(dp_world_size) - dist.all_reduce(tensor_to_allreduce, group=self.dp_process_group) - - if self.allreduce_always_fp32 and tensor is not tensor_to_allreduce: - tensor.copy_(tensor_to_allreduce) - - return tensor - - def average_tensor(self, tensor): - if self.overlap_comm: - torch.cuda.synchronize() - stream = self.reduction_stream - else: - stream = torch.cuda.current_stream() - - with torch.cuda.stream(stream): - if not self.reduce_scatter: - self.gradient_reduction_w_predivide(tensor) - return - - # Accumulate destination ranks and bucket offsets for each gradient slice. - # Note: potential future optimization, record access pattern of parameters - # in backward pass and partition gradients w.r.t. access pattern so that our - # bucket is guaranteed to be contiguous w.r.t. 
ranks - rank_and_offsets = [] - real_dp_process_group = [] - curr_size = 0 - prev_id = -1 - - process_group = self.dp_process_group - # count = 0 - for i, param, param_id in self.params_in_ipg_bucket: - - process_group = self.dp_process_group - # Averages gradients at parameter level if ipg has a moe param - # Otherwise averaging is done at the entire buffer level at the end of the loop - if self.ipg_bucket_has_moe_params: - process_group = self.expert_dp_process_group if is_moe_param( - param) else self.dp_process_group - param.grad.data.div_( - dist.get_world_size(group=process_group)) - - partition_ids = self.param_to_partition_ids[i][param_id] - partition_size = self.partition_size[i] - # Get all partition ids + their offsets - partition_ids_w_offsets = [] - for partition_id in partition_ids: - offset = self.grad_start_offset[i][partition_id][param_id] - partition_ids_w_offsets.append((partition_id, offset)) - partition_ids_w_offsets.sort(key=lambda t: t[1]) - - # Calculate rank and offsets for grad slices - for idx in range(len(partition_ids_w_offsets)): - partition_id, offset = partition_ids_w_offsets[idx] - - # if dist.get_rank() == 0 and count < 100: - # print(f"Rank {dist.get_rank()} rank offet id {idx} calculated dp size {dist.get_world_size(group=process_group)} real dp size {dist.get_world_size(self.real_dp_process_group[i])} and dst: {partition_id}") - # count += 1 - - # Calculate numel for grad slice depending on partition location - if idx == len(partition_ids_w_offsets) - 1: - # Last partition_id uses its own offset - numel = param.numel() - offset - else: - # Set numel to next partition's offset - numel = partition_ids_w_offsets[idx + 1][1] - offset - - # Merge bucket ranges if they belong to the same rank - if partition_id == prev_id: - prev_pid, prev_size, prev_numel = rank_and_offsets[-1] - rank_and_offsets[-1] = (prev_pid, - prev_size, prev_numel + numel) - else: - rank_and_offsets.append( - (partition_id, curr_size, numel)) - real_dp_process_group.append(process_group) - curr_size += numel - prev_id = partition_id - - if not self.ipg_bucket_has_moe_params: - tensor.div_(dist.get_world_size(group=self.dp_process_group)) - - async_handles = [] - for i, (dst, bucket_offset, numel) in enumerate(rank_and_offsets): - grad_slice = tensor.narrow(0, int(bucket_offset), int(numel)) - # if dist.get_rank() == 0: - # print(f"Rank {dist.get_rank()} rank offet id {i} real dp size {dist.get_world_size(group=real_dp_process_group[i])} and dst: {dst}") - # dist.barrier() - # dist.barrier() - dst_rank = _get_global_rank(real_dp_process_group[i], dst) - async_handle = dist.reduce(grad_slice, - dst=dst_rank, - group=real_dp_process_group[i], - async_op=True) - async_handles.append(async_handle) - - for handle in async_handles: - handle.wait() - - ############################################################################## - ############################# CPU Offload Methods############################# - ############################################################################## - def get_grad_position(self, group_id, tensor_list, first_offset, partition_size): - current_offset = 0 - - for i, tensor in enumerate(tensor_list): - param_id = self.get_param_id(tensor) - param_start_offset = 0 - - num_elements = tensor.numel() - tensor_offset = 0 - - # we need to offset to get to the right element - if i == 0 and first_offset > 0: - tensor_offset = first_offset - num_elements = num_elements - tensor_offset - param_start_offset = first_offset - - # we dont need all elements of the tensor - 
if num_elements > (partition_size - current_offset):
-                num_elements = partition_size - current_offset
-
-            self.grad_position[param_id] = [
-                int(group_id),
-                int(param_start_offset),
-                int(current_offset),
-                int(num_elements)
-            ]
-            current_offset += num_elements
-
-    def update_overflow_tracker_for_param_grad(self, param):
-        if param.grad is not None and self._has_inf_or_nan(param.grad.data):
-            self.local_overflow = True
-
-    def async_accumulate_grad_in_cpu_via_gpu(self, param):
-        param_id = self.get_param_id(param)
-
-        [i, source_offset, dest_offset, num_elements] = self.grad_position[param_id]
-
-        # copy to a preexisting buffer to avoid the memory allocation penalty
-        dest_buffer = self.temp_grad_buffer_for_gpu_offload.view(-1).narrow(
-            0,
-            0,
-            param.numel())
-
-        # buffer for storing gradients for this parameter in CPU
-        def buffer_to_accumulate_to_in_cpu():
-            if not self.fp16_master_weights_and_gradients:
-                return torch.zeros(param.numel(),
-                                   dtype=param.dtype,
-                                   device=self.device).pin_memory()
-            else:
-                return self.single_partition_of_fp32_groups[i].grad.view(-1).narrow(
-                    0,
-                    dest_offset,
-                    num_elements)
-
-        # accumulate gradients into param.grad, or the parts of it that belong to this partition
-        def accumulate_gradients():
-            if not self.fp16_master_weights_and_gradients:
-                dest_buffer.copy_(self.accumulated_grads_in_cpu[param_id].view(-1),
-                                  non_blocking=True)
-                param.grad.data.view(-1).add_(dest_buffer)
-            else:
-                dest_buffer.narrow(0,
-                                   source_offset,
-                                   num_elements).copy_(
-                    self.accumulated_grads_in_cpu[param_id].view(-1),
-                    non_blocking=True)
-                param.grad.data.view(-1).narrow(
-                    0,
-                    source_offset,
-                    num_elements).add_(dest_buffer.narrow(0,
-                                                          source_offset,
-                                                          num_elements))
-
-        # move accumulated gradients back to CPU
-        def copy_gradients_to_cpu():
-            if not self.fp16_master_weights_and_gradients:
-                self.accumulated_grads_in_cpu[param_id].data.copy_(
-                    param.grad.data.view(-1),
-                    non_blocking=True)
-            else:
-                self.accumulated_grads_in_cpu[param_id].data.copy_(
-                    param.grad.data.view(-1).narrow(0,
-                                                    source_offset,
-                                                    num_elements),
-                    non_blocking=True)
-
-        if param_id not in self.accumulated_grads_in_cpu:
-            self.accumulated_grads_in_cpu[param_id] = buffer_to_accumulate_to_in_cpu()
-
-        if self.micro_step_id > 0:
-            accumulate_gradients()
-
-        # at the boundary we will send 32bit directly
-        if not self.is_gradient_accumulation_boundary:
-            copy_gradients_to_cpu()
-
-    def set_norm_for_param_grad(self, param):
-        param_id = self.get_param_id(param)
-        accumulated_grad = self.accumulated_grads_in_cpu[
-            param_id] if self.gradient_accumulation_steps > 1 else param.grad
-
-        [i, source_offset, dest_offset, num_elements] = self.grad_position[param_id]
-
-        start = source_offset
-        accumulated_grad = accumulated_grad.view(
-            -1).narrow(0, start, num_elements)
-
-        self.norm_for_param_grads[param_id] = accumulated_grad.data.double().norm(2)
-
-    def set_norm_for_param_grad_in_gpu(self, param):
-        param_id = self.get_param_id(param)
-        accumulated_grad = param.grad
-
-        [i, source_offset, dest_offset, num_elements] = self.grad_position[param_id]
-
-        start = source_offset
-        accumulated_grad = accumulated_grad.view(
-            -1).narrow(0, start, num_elements)
-
-        self.norm_for_param_grads[param_id] = accumulated_grad.data.double().norm(2)
-
-    def async_inplace_copy_grad_to_fp32_buffer_from_gpu(self, param):
-        param_id = self.get_param_id(param)
-
-        [i, source_offset, dest_offset, num_elements] = self.grad_position[param_id]
-
-        dest_tensor = self.single_partition_of_fp32_groups[i].grad.view(-1).narrow(
-            0,
-            dest_offset,
-            num_elements)
-        src_tensor = param.grad.view(-1).narrow(0, source_offset, num_elements)
-        if not self.fp16_master_weights_and_gradients:
-            src_tensor = src_tensor.float()
-
-        dest_tensor.copy_(src_tensor, non_blocking=True)
-        param.grad = None  # offload only
-
-    def complete_grad_norm_calculation_for_cpu_offload(self, params):
-        total_norm = 0.0
-        norm_type = 2.0
-        for p in params:
-            if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
-                param_id = self.get_param_id(p)
-                # as some models have trainable parameters that are skipped in training,
-                # their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run,
-                # so they have no norm_for_param_grads
-                if param_id in self.norm_for_param_grads:
-                    param_norm = self.norm_for_param_grads[param_id]
-                    total_norm += param_norm.item() ** 2
-                else:
-                    # As unused parameters in modules may not be expected sometimes,
-                    # add an explicit error msg when it occurs and an option to
-                    # avoid the error
-                    assert self.ignore_unused_parameters, """
-                        This assert indicates that your module has parameters that
-                        were not used in producing loss.
-                        You can avoid this assert by
-                        (1) enable ignore_unused_parameters option in zero_optimization config;
-                        (2) making sure all trainable parameters and `forward` function
-                        outputs participate in calculating loss.
-                    """
-
-        # Sum across all model parallel GPUs.
-        total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)])
-
-        torch.distributed.all_reduce(total_norm_cuda,
-                                     op=torch.distributed.ReduceOp.SUM,
-                                     group=self.dp_process_group)
-
-        self._model_parallel_all_reduce(tensor=total_norm_cuda,
-                                        op=torch.distributed.ReduceOp.SUM)
-
-        total_norm = total_norm_cuda[0].item() ** (1. / norm_type)
-
-        if total_norm == float(
-                'inf') or total_norm == -float('inf') or total_norm != total_norm:
-            total_norm = -1
-
-        return total_norm
-
-    ############################################################################################
-
-    def copy_grads_in_partition(self, param):
-        if self.cpu_offload:
-
-            if self.gradient_accumulation_steps > 1:
-                self.async_accumulate_grad_in_cpu_via_gpu(param)
-
-            if self.is_gradient_accumulation_boundary:
-                self.set_norm_for_param_grad_in_gpu(param)
-
-                self.update_overflow_tracker_for_param_grad(param)
-
-                self.async_inplace_copy_grad_to_fp32_buffer_from_gpu(param)
-
-            return
-        # print(f"ID {self.get_param_id(param)} grad norm {param.grad.norm()}")
-        if self.grads_in_partition is None:
-            self.grads_in_partition_offset = 0
-            total_size = 0
-            for group in self.params_in_partition:
-                for param_in_partition in group:
-                    total_size += param_in_partition.numel()
-
-            if self.verbose:
-                report_memory_usage(
-                    f"before copying {total_size} gradients into partition")
-            self.grads_in_partition = torch.empty(int(total_size),
-                                                  dtype=self.dtype,
-                                                  device=torch.cuda.current_device())
-
-            if self.verbose:
-                report_memory_usage(
-                    f"after copying {total_size} gradients into partition")
-
-        # The allreduce buffer will be rewritten.
Copy the gradients in partition to a new buffer - new_grad_tensor = self.grads_in_partition.view(-1).narrow( - 0, - self.grads_in_partition_offset, - param.numel()) - new_grad_tensor.copy_(param.grad.view(-1)) - param.grad.data = new_grad_tensor.data.view_as(param.grad) - # print(f"Grad norm after copy to contiguous_buffer {param.grad.data.norm()}") - self.grads_in_partition_offset += param.numel() - - def reduce_ipg_grads(self): - if self.contiguous_gradients: - if self.extra_large_param_to_reduce is not None: - assert len( - self.params_in_ipg_bucket) == 1, "more than 1 param in ipg bucket, this shouldn't happen" - _, _, param_id = self.params_in_ipg_bucket[0] - assert self.get_param_id( - self.extra_large_param_to_reduce) == param_id, "param in ipg bucket does not match extra-large param" - self.average_tensor( - self.extra_large_param_to_reduce.grad.view(-1)) - self.extra_large_param_to_reduce = None - else: - self.average_tensor(self.ipg_buffer[self.ipg_index]) - else: - self.buffered_reduce_fallback( - None, - self.grads_in_ipg_bucket, - elements_per_buffer=self.elements_in_ipg_bucket) - - if self.overlap_comm: - stream = self.reduction_stream - elif self.cpu_offload: - # TODO: copy_grad_stream is disabled because of race with reduce. This hurts perf and should be fixed. - # torch.cuda.synchronize() - # stream = self.copy_grad_stream - stream = torch.cuda.current_stream() - else: - stream = torch.cuda.current_stream() - - with torch.cuda.stream(stream): - for _, param, param_id in self.params_in_ipg_bucket: - - assert self.params_already_reduced[param_id] == False, \ - f"The parameter {param_id} has already been reduced. \ - Gradient computed twice for this partition. \ - Multiple gradient reduction is currently not supported" - - self.params_already_reduced[param_id] = True - - if self.partition_gradients: - if not self.is_param_in_current_partition[param_id]: - if self.overlap_comm and self.contiguous_gradients is False: - # Clear grads of other partitions during the next reduction - # to avoid clearing them before the reduction is complete. 
-                        if self.previous_reduced_grads is None:
-                            self.previous_reduced_grads = []
-                        self.previous_reduced_grads.append(param)
-                    else:
-                        param.grad = None  # only if self.partition_gradients
-                elif self.contiguous_gradients:
-                    self.copy_grads_in_partition(param)
-
-        self.grads_in_ipg_bucket = []
-        self.params_in_ipg_bucket = []
-        self.ipg_bucket_has_moe_params = False
-        self.elements_in_ipg_bucket = 0
-        #####################################################################
-
-    def reduce_ready_partitions_and_remove_grads(self, param, i):
-        if self.partition_gradients or self.is_gradient_accumulation_boundary:
-            self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
-
-    def zero_reduced_gradients(self, partition_id, i):
-        def are_all_related_partitions_reduced(params_id):
-            for partition_id in self.param_to_partition_ids[i][params_id]:
-                if not self.is_partition_reduced[i][partition_id]:
-                    return False
-            return True
-
-        for params_id in self.is_grad_computed[i][partition_id]:
-            if are_all_related_partitions_reduced(params_id):
-                self.param_dict[params_id].grad = None  # dead code
-
-    def flatten_and_print(self, message, tensors, start=0, n=5):
-        flatten_tensor = self.flatten(tensors)
-
-        def print_func():
-            print(flatten_tensor.contiguous().view(-1).narrow(0, start, n))
-
-        self.sequential_execution(print_func, message)
-
-    def get_grads_to_reduce(self, i, partition_id):
-        def get_reducible_portion(key):
-            grad = self.param_dict[key].grad
-            total_elements = grad.numel()
-            start = self.grad_start_offset[i][partition_id][key]
-            num_elements = min(
-                total_elements - start,
-                self.partition_size[i] -
-                self.grad_partition_insertion_offset[i][partition_id][key])
-            if not pg_correctness_test:
-                if num_elements == total_elements:
-                    return grad
-                else:
-                    return grad.contiguous().view(-1).narrow(0,
-                                                             int(start),
-                                                             int(num_elements))
-            else:
-                if num_elements == total_elements:
-                    return grad.clone()
-                else:
-                    return grad.clone().contiguous().view(-1).narrow(
-                        0,
-                        int(start),
-                        int(num_elements))
-
-        grads_to_reduce = []
-        for key in self.is_grad_computed[i][partition_id]:
-            grad = get_reducible_portion(key)
-            grads_to_reduce.append(grad)
-        return grads_to_reduce
-
-    def sequential_execution(self, function, message, group=None):
-        if group is None:
-            group = self.dp_process_group
-        if dist.get_rank(group=group) == 0:
-            print(message)
-        for id in range(dist.get_world_size(group=group)):
-            if id == dist.get_rank(group=group):
-                function()
-            dist.barrier(group=group)
-
-    def set_none_gradients_to_zero(self, i, partition_id):
-        for param_id in self.is_grad_computed[i][partition_id]:
-            param = self.param_dict[param_id]
-            if param.grad is None:
-                param.grad = torch.zeros_like(param)
-
-    ######################Reduction Related Methods##############################
-
-    def allreduce_bucket(self, bucket, allreduce_always_fp32=False, rank=None, log=None):
-        rank = None
-        tensor = self.flatten(bucket)
-
-        tensor_to_allreduce = tensor
-
-        if pg_correctness_test:
-            allreduce_always_fp32 = True
-
-        if allreduce_always_fp32:
-            tensor_to_allreduce = tensor.float()
-
-        tensor_to_allreduce.div_(
-            dist.get_world_size(group=self.dp_process_group))
-
-        if rank is None:
-            # "All Reducing"
-            dist.all_reduce(tensor_to_allreduce, group=self.dp_process_group)
-        else:
-            global_rank = _get_global_rank(self.dp_process_group, rank)
-            dist.reduce(tensor_to_allreduce, global_rank,
-                        group=self.dp_process_group)
-
-        if allreduce_always_fp32 and tensor is not tensor_to_allreduce:
-            if rank is None or rank ==
dist.get_rank(group=self.dp_process_group): - tensor.copy_(tensor_to_allreduce) - - return tensor - - def _clear_previous_reduced_grads(self): - if self.previous_reduced_grads is not None: - for param in self.previous_reduced_grads: - param.grad = None # overlap enabled - self.previous_reduced_grads = None - - # if rank is specified do a reduction instead of an allreduce - def allreduce_and_copy(self, small_bucket, rank=None, log=None): - if self.overlap_comm: - torch.cuda.synchronize() - # It is safe to clear the previously reduced grads of other partitions - self._clear_previous_reduced_grads() - stream = self.reduction_stream - else: - stream = torch.cuda.current_stream() - - with torch.cuda.stream(stream): - allreduced = self.allreduce_bucket( - small_bucket, rank=rank, log=log) - if rank is None or rank == dist.get_rank(group=self.dp_process_group): - for buf, synced in zip(small_bucket, self.unflatten(allreduced, small_bucket)): - buf.copy_(synced) - - def allreduce_no_retain(self, - bucket, - numel_per_bucket=500000000, - rank=None, - log=None): - small_bucket = [] - numel = 0 - for tensor in bucket: - small_bucket.append(tensor) - numel = numel + tensor.numel() - if numel > numel_per_bucket: - self.allreduce_and_copy(small_bucket, rank=rank, log=None) - small_bucket = [] - - if len(small_bucket) > 0: - self.allreduce_and_copy(small_bucket, rank=rank, log=log) - - # allows using reduction of gradients instead of using all_reduce - - def buffered_reduce_fallback(self, - rank, - grads, - elements_per_buffer=500000000, - log=None): - split_buckets = split_half_float_double(grads) - - for i, bucket in enumerate(split_buckets): - self.allreduce_no_retain(bucket, - numel_per_bucket=elements_per_buffer, - rank=rank, - log=log) - - ############################################################################# - ############################################################################# - ############################################################################# - - # views the tensor as multiple partitions and returns - # those partitions - def get_data_parallel_partitions(self, tensor, group_id): - partitions = [] - - dp = dist.get_world_size(group=self.real_dp_process_group[group_id]) - dp_id = dist.get_rank(group=self.real_dp_process_group[group_id]) - - total_num_elements = tensor.numel() - - base_size = total_num_elements // dp - remaining = total_num_elements % dp - - start = 0 - for id in range(dp): - partition_size = base_size - if id < remaining: - partition_size = partition_size + 1 - partitions.append(tensor.narrow(0, start, partition_size)) - start = start + partition_size - return partitions - - def get_partition_info(self, tensor_list, partition_size, partition_id): - params_in_partition = [] - params_not_in_partition = [] - - start_index = partition_size * partition_id - end_index = partition_size * (partition_id + 1) - - current_index = 0 - first_offset = 0 - - for tensor in tensor_list: - - tensor_size = tensor.numel() - - if (current_index >= start_index and current_index < end_index): - params_in_partition.append(tensor) - - elif start_index > current_index and start_index < (current_index + - tensor_size): - params_in_partition.append(tensor) - - assert ( - first_offset == 0), "This can happen either zero or only once as this must be the first tensor in the partition" - first_offset = start_index - current_index - - else: - params_not_in_partition.append(tensor) - - current_index = current_index + tensor_size - - return params_in_partition, params_not_in_partition, 
first_offset - - def zero_grad(self, set_grads_to_None=True): - """ - Zero FP16 parameter grads. - """ - # FP32 grad should never exist. - # For speed, set model fp16 grad to None by default - for group in self.fp16_groups: - for p in group: - if set_grads_to_None: - p.grad = None # epilogue and in step - else: - if p.grad is not None: - p.grad.detach_() - p.grad.zero_() - - def _model_parallel_all_reduce(self, tensor, op): - """ Perform all reduce within model parallel group, if any. - """ - if self.model_parallel_group is None: - pass - else: - torch.distributed.all_reduce(tensor=tensor, - op=op, - group=self.model_parallel_group) - - def clip_grad_norm(self, *args, **kwargs): - # dummy function to retain the same function interface - # as ColossalaiOptimizer for compatibility - pass - - def get_grad_norm_direct(self, gradients, params, norm_type=2): - """Clips gradient norm of an iterable of parameters. - - This is adapted from ``torch.nn.utils.clip_grad.clip_grad_norm_`` and - added functionality to handle model parallel parameters. Note that - the gradients are modified in place. - - Arguments: - parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a - single Tensor that will have gradients normalized - max_norm (float or int): max norm of the gradients - norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for - infinity norm. - - Returns: - Total norm of the parameters (viewed as a single vector). - """ - norm_type = float(norm_type) - if norm_type == inf: - total_norm = max(g.data.abs().max() for g in gradients) - total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)]) - torch.distributed.all_reduce(total_norm_cuda, - op=torch.distributed.ReduceOp.MAX, - group=self.dp_process_group) - - # Take max across all GPUs. - self._model_parallel_all_reduce(tensor=total_norm_cuda, - op=torch.distributed.ReduceOp.MAX) - total_norm = total_norm_cuda[0].item() - else: - total_norm = 0.0 - # if dist.get_rank() == 0: - # print()(f"Total Norm begining {total_norm}") - for g, p in zip(gradients, params): - if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0): - param_norm = g.data.double().norm(2) - total_norm += param_norm.item() ** 2 - # Sum across all model parallel GPUs. - total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)]) - - torch.distributed.all_reduce(total_norm_cuda, - op=torch.distributed.ReduceOp.SUM, - group=self.dp_process_group) - - self._model_parallel_all_reduce(tensor=total_norm_cuda, - op=torch.distributed.ReduceOp.SUM) - - total_norm = total_norm_cuda[0].item() ** (1. / norm_type) - - if total_norm == float( - 'inf') or total_norm == -float('inf') or total_norm != total_norm: - total_norm = -1 - - return total_norm - - # creates a flat fused tensor from the tensor list starting at the first_offset - # in the first tensor of the list. 
If there are not enough elements in the tensor - # list then the flat tensor will be padded with zeros - def get_flat_partition(self, - tensor_list, - first_offset, - partition_size, - dtype, - device, - return_tensor_list=False): - flat_tensor_list = [] - current_size = 0 - for i, tensor in enumerate(tensor_list): - if tensor.grad is None: - tensor.grad = torch.zeros_like(tensor) - - tensor = tensor.grad - num_elements = tensor.numel() - tensor_offset = 0 - - # we need to offset to get to the right element - if i == 0 and first_offset > 0: - tensor_offset = first_offset - num_elements = num_elements - tensor_offset - - # we dont need all elements of the tensor - if num_elements > (partition_size - current_size): - num_elements = partition_size - current_size - - # we need a narrow view of the tensor based on the tensor offset and number of elements that - # we need from this tensor - if tensor_offset > 0 or num_elements < tensor.numel(): - flat_tensor_list.append(tensor.contiguous().view(-1).narrow( - 0, - int(tensor_offset), - int(num_elements))) - else: - flat_tensor_list.append(tensor) - - current_size = current_size + num_elements - - # this means its the last partition and does not align with the dp boundary. We need to pad before flattening - if current_size < partition_size: - flat_tensor_list.append( - torch.zeros(int(partition_size - current_size), - dtype=dtype, - device=device)) - - if return_tensor_list: - return flat_tensor_list - - return self.flatten(flat_tensor_list) - - def free_grad_in_param_list(self, param_list): - for p in param_list: - p.grad = None # in step - - def reset_cpu_buffers(self): - self.norm_for_param_grads = {} - self.local_overflow = False - - def log_timers(self, timer_names): - if self.timers is None: - return - - self.timers.log(names=list(timer_names)) - - def start_timers(self, timer_names): - if self.timers is None: - return - - for name in timer_names: - self.timers(name).start() - - def stop_timers(self, timer_names): - if self.timers is None: - return - - for name in timer_names: - self.timers(name).stop() - - def step(self, closure=None): - """ - Not supporting closure. - """ - self.micro_step_id = -1 - - if self.verbose: - report_memory_usage(f"In step before checking overflow") - - # First compute norm for all group so we know if there is overflow - self.check_overflow(self.partition_gradients) - - OPTIMIZER_ALLGATHER = 'optimizer_allgather' - OPTIMIZER_GRADIENTS = 'optimizer_gradients' - OPTIMIZER_STEP = 'optimizer_step' - timer_names = [OPTIMIZER_ALLGATHER, - OPTIMIZER_GRADIENTS, OPTIMIZER_STEP] - - prev_scale = self.loss_scale - self._update_scale(self.overflow) - if self.overflow: - if self.verbose: - report_memory_usage('After overflow before clearing gradients') - self.zero_grad() - if self.cpu_offload: - self.reset_cpu_buffers() - else: - self.averaged_gradients = {} - - if self.verbose: - report_memory_usage('After overflow after clearing gradients') - - print( - "[deepspeed] fp16 dynamic loss scale overflow! Rank {} Skipping step. 
Attempted loss scale: {}, "
-                "reducing to {}".format(dist.get_rank(),
-                                        prev_scale,
-                                        self.loss_scale))
-            self.start_timers(timer_names)
-            self.stop_timers(timer_names)
-            return
-
-        self.start_timers([OPTIMIZER_GRADIENTS])
-        norm_groups = []
-        single_partition_grad_groups = []
-        skip = False
-        for i, group in enumerate(self.fp16_groups):
-            partition_id = dist.get_rank(group=self.real_dp_process_group[i])
-            if self.cpu_offload:
-                norm_groups.append(
-                    self.complete_grad_norm_calculation_for_cpu_offload(
-                        self.params_in_partition[i]))
-                single_grad_partition = self.single_partition_of_fp32_groups[i].grad
-            else:
-                norm_groups.append(
-                    self.get_grad_norm_direct(self.averaged_gradients[i],
-                                              self.params_in_partition[i]))
-
-                # free gradients for all the parameters that are not updated by this process
-                self.free_grad_in_param_list(self.params_not_in_partition[i])
-
-                # create flat gradients for the parameters updated by this process
-                # If we are the last partition, ensure the grads have the same size as the partition; if not, pad with zero tensors
-                if partition_id == dist.get_world_size(
-                        group=self.real_dp_process_group[i]) - 1:
-                    single_grad_partition = self.flatten_dense_tensors_aligned(
-                        self.averaged_gradients[i],
-                        int(self.partition_size[i])).to(
-                        self.single_partition_of_fp32_groups[i].dtype)
-                else:
-                    single_grad_partition = self.flatten(self.averaged_gradients[i]).to(
-                        self.single_partition_of_fp32_groups[i].dtype)
-                assert single_grad_partition.numel() == self.partition_size[i], \
-                    "averaged gradients have a different number of elements than the partition size {} {} {} {}".format(
-                        single_grad_partition.numel(), self.partition_size[i], i, partition_id)
-
-                self.single_partition_of_fp32_groups[i].grad = single_grad_partition
-                # release all the gradients since we have already created a necessary copy in dp_grad_partition
-                self.free_grad_in_param_list(self.params_in_partition[i])
-
-                self.averaged_gradients[i] = None
-
-            single_partition_grad_groups.append(single_grad_partition)
-
-        if self.has_moe_layers:
-            self._average_expert_grad_norms(norm_groups)
-
-        self.unscale_and_clip_grads(single_partition_grad_groups, norm_groups)
-        self.stop_timers([OPTIMIZER_GRADIENTS])
-
-        self.start_timers([OPTIMIZER_STEP])
-        if self.deepspeed_adam_offload:
-            from deepspeed.ops.adam import DeepSpeedCPUAdam
-            if type(self.optimizer) == DeepSpeedCPUAdam and self.dtype == torch.half:
-                fp16_param_groups = [
-                    fp16_partitions[partition_id]
-                    for fp16_partitions in self.parallel_partitioned_fp16_groups
-                ]
-                self.optimizer.step(fp16_param_groups=fp16_param_groups)
-            else:
-                self.optimizer.step()
-                for fp16_partitions, fp32_partition in zip(self.parallel_partitioned_fp16_groups,
-                                                           self.single_partition_of_fp32_groups):
-                    fp16_partitions[partition_id].data.copy_(
-                        fp32_partition.data)
-        else:
-            self.optimizer.step()
-
-        # get rid of the fp32 gradients.
Not needed anymore - if not self.cpu_offload: - for group in self.single_partition_of_fp32_groups: - group.grad = None # in step - - for fp16_partitions, fp32_partition in zip(self.parallel_partitioned_fp16_groups, - self.single_partition_of_fp32_groups): - fp16_partitions[partition_id].data.copy_(fp32_partition.data) - - self.stop_timers([OPTIMIZER_STEP]) - - if self.cpu_offload: - self.reset_cpu_buffers() - - self.start_timers([OPTIMIZER_ALLGATHER]) - # gather the updated weights from everyone - for group_id, partitioned_params in enumerate(self.parallel_partitioned_fp16_groups): - - # Sequential AllGather Best of both worlds - dp_world_size = dist.get_world_size( - group=self.real_dp_process_group[group_id]) - num_shards = max( - 1, - partitioned_params[partition_id].numel() * dp_world_size // - self.allgather_bucket_size) - - shard_size = partitioned_params[partition_id].numel() // num_shards - num_elements = shard_size - - assert shard_size * \ - num_shards <= partitioned_params[partition_id].numel() - - for shard_id in range(num_shards): - - if shard_id == (num_shards - 1): - num_elements = partitioned_params[partition_id].numel( - ) - shard_id * shard_size - - shard_list = [] - for dp_id in range(dp_world_size): - curr_shard = partitioned_params[dp_id].narrow( - 0, - shard_id * shard_size, - num_elements).detach() - shard_list.append(curr_shard) - - dist.all_gather(shard_list, - shard_list[partition_id], - group=self.real_dp_process_group[group_id]) - self.stop_timers([OPTIMIZER_ALLGATHER]) - - # TODO: we probably don't need this? just to be safe - for i in range(len(norm_groups)): - self._update_model_fp16_weights(i) - - self.log_timers(timer_names) - if self.verbose: - report_memory_usage('After zero_optimizer step') - - return - - def _average_expert_grad_norms(self, norm_groups): - for i, norm in enumerate(norm_groups): - if self.is_moe_param_group[i]: - scaled_norm = norm * 1.0 / float( - dist.get_world_size(group=self.ep_process_group)) - scaled_norm_tensor = torch.tensor(scaled_norm, - device='cuda', - dtype=torch.float) - dist.all_reduce(scaled_norm_tensor, - group=self.ep_process_group) - norm_groups[i] = scaled_norm_tensor.item() - - def unscale_and_clip_grads(self, grad_groups_flat, norm_groups): - total_norm = 0.0 - for norm in norm_groups: - total_norm += norm ** 2.0 - total_norm = math.sqrt(total_norm) - - # compute combined scale factor for this group - combined_scale = self.loss_scale - if self.clip_grad > 0.: - # norm is in fact norm*scale - clip = ((total_norm / self.loss_scale) + 1e-6) / self.clip_grad - if clip > 1: - combined_scale = clip * self.loss_scale - - for grad in grad_groups_flat: - if isinstance(grad, list): - sub_partitions = grad - for g in sub_partitions: - g.data.mul_(1. / combined_scale) - else: - grad.data.mul_(1. 
/ combined_scale) - - def _check_overflow(self, partition_gradients=True): - self.overflow = self.has_overflow(partition_gradients) - - # `params` is a list / generator of torch.Variable - def has_overflow_serial(self, params, is_grad_list=False): - for p in params: - if p.grad is not None and self._has_inf_or_nan(p.grad.data): - return True - - return False - - def has_overflow_partitioned_grads_serial(self): - for i in range(len(self.fp16_groups)): - for j, grad in enumerate(self.averaged_gradients[i]): - if grad is not None and self._has_inf_or_nan(grad.data, j): - return True - return False - - def has_overflow(self, partition_gradients=True): - if partition_gradients: - overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial( - ) - overflow_gpu = torch.cuda.ByteTensor([overflow]) - '''This will capture overflow across all data parallel and expert parallel process - Since expert parallel process are a subset of data parallel process''' - torch.distributed.all_reduce(overflow_gpu, - op=torch.distributed.ReduceOp.MAX, - group=self.dp_process_group) - - else: - params = [] - for group in self.fp16_groups: - for param in group: - params.append(param) - - overflow = self.has_overflow_serial( - params, is_grad_list=partition_gradients) - overflow_gpu = torch.cuda.ByteTensor([overflow]) - - # Since each model parallel GPU carries only part of the model, - # make sure overflow flag is synced across all the model parallel GPUs - self._model_parallel_all_reduce(tensor=overflow_gpu, - op=torch.distributed.ReduceOp.MAX) - - overflow = overflow_gpu[0].item() - return bool(overflow) - - # `x` is a torch.Tensor - @staticmethod - def _has_inf_or_nan(x, j=None): - try: - # if x is half, the .float() incurs an additional deep copy, but it's necessary if - # Pytorch's .sum() creates a one-element tensor of the same type as x - # (which is true for some recent version of pytorch). - cpu_sum = float(x.float().sum()) - # More efficient version that can be used if .sum() returns a Python scalar - # cpu_sum = float(x.sum()) - except RuntimeError as instance: - # We want to check if inst is actually an overflow exception. - # RuntimeError could come from a different error. - # If so, we still want the exception to propagate. - if "value cannot be converted" not in instance.args[0]: - raise - return True - else: - if cpu_sum == float('inf') or cpu_sum == -float('inf') or cpu_sum != cpu_sum: - return True - return False - - def backward(self, loss, retain_graph=False): - """ - :attr:`backward` performs the following steps: - - 1. fp32_loss = loss.float() - 2. scaled_loss = fp32_loss*loss_scale - 3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's fp16 leaves - """ - self.micro_step_id += 1 - - if self.contiguous_gradients: - self.ipg_buffer = [] - buf_0 = torch.empty(int(self.reduce_bucket_size), - dtype=self.dtype, - device=torch.cuda.current_device()) - self.ipg_buffer.append(buf_0) - - # Use double buffers to avoid data access conflict when overlap_comm is enabled. 
- if self.overlap_comm: - buf_1 = torch.empty(int(self.reduce_bucket_size), - dtype=self.dtype, - device=torch.cuda.current_device()) - self.ipg_buffer.append(buf_1) - self.ipg_index = 0 - - self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) - - def check_overflow(self, partition_gradients=True): - self._check_overflow(partition_gradients) - - def _update_scale(self, has_overflow=False): - self.loss_scaler.update_scale(has_overflow) - - # Promote state so it can be retrieved or set via "fp16_optimizer_instance.state" - def _get_state(self): - return self.optimizer.state - - def _set_state(self, value): - self.optimizer.state = value - - state = property(_get_state, _set_state) - - # Promote param_groups so it can be retrieved or set via "fp16_optimizer_instance.param_groups" - # (for example, to adjust the learning rate) - def _get_param_groups(self): - return self.optimizer.param_groups - - def _set_param_groups(self, value): - self.optimizer.param_groups = value - - param_groups = property(_get_param_groups, _set_param_groups) - - # Promote loss scale so it can be retrieved or set via "fp16_optimizer_instance.loss_scale" - def _get_loss_scale(self): - return self.loss_scaler.loss_scale - - def _set_loss_scale(self, value): - self.loss_scaler.cur_scale = value - - loss_scale = property(_get_loss_scale, _set_loss_scale) - cur_scale = property(_get_loss_scale, _set_loss_scale) - - # Return group tensor after removing paddings that are added for alignment to DP world size. - # This method works on the assumption that each group contains a single flattened tensor. - def _get_groups_without_padding(self, groups_with_padding): - groups_without_padding = [] - for i, group in enumerate(groups_with_padding): - lean_length = group.numel() - self.groups_padding[i] - groups_without_padding.append(group[:lean_length]) - - return groups_without_padding - - # Return optimizer state after removing paddings that are added for alignment. - def _get_state_without_padding(self, state_with_padding, padding): - lean_state = {} - for key, value in state_with_padding.items(): - if torch.is_tensor(value): - lean_length = value.numel() - padding - lean_state[key] = value[:lean_length] - else: - lean_state[key] = value - - return lean_state - - # Return base optimizer states. - # This method assumes that each param group contains a single flattened tensor. - def _get_base_optimizer_state(self): - optimizer_groups_state = [] - for i, group in enumerate(self.optimizer.param_groups): - p = group['params'][0] - lean_optimizer_state = self._get_state_without_padding( - self.optimizer.state[p], - self.groups_padding[i]) - optimizer_groups_state.append(lean_optimizer_state) - - return optimizer_groups_state - - def state_dict(self): - """ - Returns a dict containing the current state of this :class:`FP16_Optimizer` instance. - This dict contains attributes of :class:`FP16_Optimizer`, as well as the state_dict - of the contained Pytorch optimizer. 
- - Example:: - - checkpoint = {} - checkpoint['model'] = model.state_dict() - checkpoint['optimizer'] = optimizer.state_dict() - torch.save(checkpoint, "saved.pth") - """ - state_dict = {} - state_dict['loss_scaler'] = self.loss_scaler - state_dict['dynamic_loss_scale'] = self.dynamic_loss_scale - state_dict['overflow'] = self.overflow - state_dict['base_optimizer_state'] = self._get_base_optimizer_state() - - state_dict['zero_stage'] = ZERO_OPTIMIZATION_GRADIENTS - state_dict['partition_count'] = self.partition_count - - state_dict['ds_version'] = version - - # Remove paddings for DP alignment to enable loading for other alignment values - fp32_groups_without_padding = self._get_groups_without_padding( - self.single_partition_of_fp32_groups) - state_dict['single_partition_of_fp32_groups'] = fp32_groups_without_padding - - # if self.cpu_offload: - # state_dict_tmp = async_copy_to(state_dict, - # 'cpu', - # torch.cuda.current_stream()) - # state_dict = state_dict_tmp - - return state_dict - - # Restore base optimizer fp32 weights from checkpoint by: - # 1) Merging fp32 weights from checkpoints of all partitions - # 2) Extracting fp32 weights for current partition from merged weights - # 3) Using extracted weights to update base optimizer weights directly. - def _restore_from_fp32_weights(self, all_state_dict): - merged_single_partition_of_fp32_groups = [] - for i in range(len(self.single_partition_of_fp32_groups)): - partition_id = dist.get_rank(group=self.real_dp_process_group[i]) - merged_partitions = [ - sd['single_partition_of_fp32_groups'][i] for sd in all_state_dict - ] - flat_merged_partitions = self.flatten_dense_tensors_aligned( - merged_partitions, - self.nccl_start_alignment_factor * - dist.get_world_size(group=self.real_dp_process_group[i])) - dp_partitions = self.get_data_parallel_partitions( - flat_merged_partitions, i) - merged_single_partition_of_fp32_groups.append( - dp_partitions[partition_id]) - - for current, saved in zip(self.single_partition_of_fp32_groups, merged_single_partition_of_fp32_groups): - current.data.copy_(saved.data) - - # Restore base optimizer fp32 weights from ZeRO fp16 weights - def _restore_from_fp16_weights(self): - for group_id, (fp16_partitions, fp32_partition) in enumerate( - zip(self.parallel_partitioned_fp16_groups, self.single_partition_of_fp32_groups)): - partition_id = dist.get_rank( - group=self.real_dp_process_group[group_id]) - fp32_partition.data.copy_(fp16_partitions[partition_id].data)
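The restore path above hinges on one idea: partitions saved by however many ranks wrote the checkpoint are flattened back into a single tensor and re-sliced for the number of ranks now loading it. A minimal standalone sketch of that merge-and-reslice step (the helper name and the zero-padding scheme are illustrative, not this class's API, which additionally aligns to the NCCL alignment factor):

```python
import torch

def reslice_partitions(saved_partitions, new_world_size):
    # merge the per-rank partitions back into one flat tensor
    flat = torch.cat([p.flatten() for p in saved_partitions])
    # pad so the flat tensor divides evenly across the new world size
    pad = (-flat.numel()) % new_world_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    # each loading rank keeps exactly one slice
    return list(flat.chunk(new_world_size))

# e.g. a checkpoint saved on 2 ranks, loaded on 4 ranks
saved = [torch.arange(6, dtype=torch.float), torch.arange(6, 12, dtype=torch.float)]
new_parts = reslice_partitions(saved, 4)  # four slices of 3 elements each
```

- # Refresh the fp32 master params from the fp16 copies.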
- def refresh_fp32_params(self): - self._restore_from_fp16_weights() - - # Extract optimizer state for current partition from merged states of all partitions - def _partition_base_optimizer_state(self, state_key, all_partition_states, group_id): - partition_id = dist.get_rank( - group=self.real_dp_process_group[group_id]) - alignment = dist.get_world_size( - group=self.real_dp_process_group[group_id]) - if torch.is_tensor(all_partition_states[0]): - flat_merged_partitions = self.flatten_dense_tensors_aligned( - all_partition_states, - alignment) - dp_partitions = self.get_data_parallel_partitions(flat_merged_partitions, - group_id) - return dp_partitions[partition_id] - else: - # Assume non-tensor states are not partitioned and equal across ranks, so return first one - return all_partition_states[0] - - # Restore base optimizer state from checkpoint by - # 1) Merging optimizer state from checkpoints of all partitions - # 2) Extracting optimizer state for current partition from the merged state - # 3) Using the extracted value to directly update the base optimizer. - def _restore_base_optimizer_state(self, all_state_dict): - base_optimizer_group_states = [] - for i in range(len(self.optimizer.param_groups)): - partition_states = {} - all_partition_group_states = [ - sd['base_optimizer_state'][i] for sd in all_state_dict - ] - for key in all_partition_group_states[0].keys(): - all_partition_states = [ - all_states[key] for all_states in all_partition_group_states - ] - partition_states[key] = self._partition_base_optimizer_state( - key, - all_partition_states, - i) - base_optimizer_group_states.append(partition_states) - - for i, group in enumerate(self.optimizer.param_groups): - p = group['params'][0] - for key, saved in base_optimizer_group_states[i].items(): - if torch.is_tensor(self.optimizer.state[p][key]): - self.optimizer.state[p][key].data.copy_(saved.data) - else: - self.optimizer.state[p][key] = saved - - def load_state_dict(self, - state_dict_list, - load_optimizer_states=True, - load_from_fp32_weights=False): - r"""Loading ZeRO checkpoint - - Arguments: - state_dict_list: List of all saved ZeRO checkpoints, one for each saved partition. - Note that the number of saved partitions may differ from number of loading partitions to support - changing GPU count, specifically DP world size, between saving and loading checkpoints. - load_optimizer_states: Boolean indicating whether or not to load base optimizer states - load_from_fp32_weights: Boolean indicating whether to initialize fp32 master weights from fp32 - copies in checkpoints (no precision loss) or from model's fp16 copies (with precision loss). - """ - """ - Loads a state_dict created by an earlier call to state_dict(). - If ``fp16_optimizer_instance`` was constructed from some ``init_optimizer``, - whose parameters in turn came from ``model``, it is expected that the user - will call ``model.load_state_dict()`` before - ``fp16_optimizer_instance.load_state_dict()`` is called. - - Example:: - - model = torch.nn.Linear(D_in, D_out).cuda().half() - optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) - optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0) - ... - checkpoint = torch.load("saved.pth") - model.load_state_dict(checkpoint['model']) - optimizer.load_state_dict(checkpoint['optimizer']) - """ - # I think it should actually be ok to reload the optimizer before the model. 
- self.loss_scaler = state_dict_list[0]['loss_scaler'] - self.dynamic_loss_scale = state_dict_list[0]['dynamic_loss_scale'] - self.overflow = state_dict_list[0]['overflow'] - - # zero stage 1 mode - if not self.partition_gradients: - required_version = pkg_version.parse("0.3.17") - ckpt_version = state_dict_list[0].get("ds_version", False) - error_str = f"ZeRO stage 1 changed in {required_version} and is not backwards compatible " \ - "with older stage 1 checkpoints. If you'd like to load an old ZeRO-1 checkpoint " \ - "please set 'legacy_stage1': true in your zero config json. This old version of " \ - "stage 1 will be removed in v0.4.0." - - assert ckpt_version, f"Empty ds_version! {error_str}" - assert required_version <= pkg_version.parse( - ckpt_version), f"Old version: {ckpt_version} {error_str}" - - if load_optimizer_states: - self._restore_base_optimizer_state(state_dict_list) - - # At this point, the optimizer's references to the model's fp32 parameters are up to date. - # The optimizer's hyperparameters and internal buffers are also up to date. - # However, the fp32 master copies of the model's fp16 params stored by the optimizer are still - # out of date. There are two options. - # 1: Refresh the master params from the model's fp16 params. - # This requires less storage but incurs precision loss. - # 2: Save and restore the fp32 master copies separately. - # We choose option 1 if changing DP degree and option 2 otherwise. - # - # Pytorch Optimizer.load_state_dict casts saved buffers (e.g. momentum) to the type and device - # of their associated parameters, because it's possible those buffers might not exist yet in - # the current optimizer instance. In our case, as long as the current FP16_Optimizer has been - # constructed in the same way as the one whose state_dict we are loading, the same master params - # are guaranteed to exist, so we can just copy_() from the saved master params. - - if load_from_fp32_weights: - self._restore_from_fp32_weights(state_dict_list) - else: - self._restore_from_fp16_weights() - - def allreduce_gradients(self): - self.overlapping_partition_gradients_reduce_epilogue() - - -def _handle_overflow(cpu_sum, x, i): - import math - rank = torch.distributed.get_rank() - if rank == 0: - t_i = -1 - for v_i, v in enumerate(x.data.contiguous().view(-1)): - if not math.isfinite(float(v)): - t_i = v_i - break - print( - f"rank {rank} detected overflow {cpu_sum} in tensor {i}:{t_i} shape {x.shape}" - ) - - -def estimate_zero2_model_states_mem_needs(total_params, - num_gpus_per_node=1, - num_nodes=1, - cpu_offload=True, - additional_buffer_factor=1.5): - total_gpus = num_nodes * num_gpus_per_node - - if cpu_offload: - gpu_mem = 2 * total_params - cpu_mem = total_params * \ - max(4 * total_gpus, 16) * additional_buffer_factor - else: - gpu_mem = 4 * total_params + int(16 * total_params / total_gpus) - cpu_mem = total_params * 4 * num_gpus_per_node * additional_buffer_factor - - return int(cpu_mem), int(gpu_mem) - - -def model_to_params(model): - # shared params calculated only once - total_params = sum( - dict((p.data_ptr(), - p.numel()) for p in model.parameters()).values()) - return total_params - - -def estimate_zero2_model_states_mem_needs_all_live(model, - num_gpus_per_node=1, - num_nodes=1, - additional_buffer_factor=1.5): - """ - Print out estimates on memory usage requirements for ZeRO 2 params, optim states and gradients - for a given ``model`` and hardware setup. 
- - If you have an actual model object, use this function and everything will be derived - automatically. - - If it's a hypothetical model, use ``estimate_zero2_model_states_mem_needs_all_cold`` where you have to pass - the ``total_params`` explicitly. - - Args: - - ``model``: ``nn.Module`` object - - ``num_gpus_per_node``: how many gpus per node (defaults to 1) - - ``num_nodes``: how many nodes (defaults to 1) - - ``additional_buffer_factor``: estimation factor (defaults to 1.5) - - """ - - total_params = model_to_params(model) - - estimate_zero2_model_states_mem_needs_all_cold( - total_params=total_params, - num_gpus_per_node=num_gpus_per_node, - num_nodes=num_nodes, - additional_buffer_factor=additional_buffer_factor) - - -def estimate_zero2_model_states_mem_needs_all_cold(total_params, - num_gpus_per_node=1, - num_nodes=1, - additional_buffer_factor=1.5): - """ - Print out estimates on memory usage requirements for ZeRO 2 params, optim states and gradients - for a given parameter count and hardware setup. - - If it's a hypothetical model, use this function where you have to pass - ``total_params`` explicitly. - - If you have an actual model object, use ``estimate_zero2_model_states_mem_needs_all_live`` and everything - will be derived automatically. - - Args: - - ``total_params``: total model params - - ``num_gpus_per_node``: how many gpus per node (defaults to 1) - - ``num_nodes``: how many nodes (defaults to 1) - - ``additional_buffer_factor``: estimation factor (defaults to 1.5) - - """ - - def format_options(cpu_offload): - enabled = [] - enabled.append(f"cpu_offload={1 if cpu_offload else 0}") - return ", ".join(enabled) - - nodes_str = "nodes" if num_nodes > 1 else "node" - gpus_str = "GPUs" if num_gpus_per_node > 1 else "GPU" - print( - "Estimated memory needed for params, optim states and gradients for a:\n" - f"HW: Setup with {num_nodes} {nodes_str}, {num_gpus_per_node} {gpus_str} per node.\n" - f"SW: Model with {int(total_params / 1e6)}M total params.") - print(" per CPU | per GPU | Options") - for cpu_offload in [True, False]: - cpu_mem, gpu_mem = estimate_zero2_model_states_mem_needs( - total_params=total_params, - num_gpus_per_node=num_gpus_per_node, - num_nodes=num_nodes, - cpu_offload=cpu_offload, - additional_buffer_factor=additional_buffer_factor - ) - - options_str = format_options(cpu_offload=cpu_offload) - print( - f" {cpu_mem / 2 ** 30:7.2f}GB | {gpu_mem / 2 ** 30:6.2f}GB | {options_str}")
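As a quick sanity check on the arithmetic above, here is what the estimator computes for a hypothetical 1B-parameter model on a single node with 8 GPUs (leaving `additional_buffer_factor` at its 1.5 default); the numbers follow directly from the formulas in `estimate_zero2_model_states_mem_needs`:

```python
cpu_mem, gpu_mem = estimate_zero2_model_states_mem_needs(
    total_params=int(1e9), num_gpus_per_node=8, num_nodes=1, cpu_offload=True)
print(f"offload=1: {cpu_mem / 2 ** 30:.1f}GB CPU | {gpu_mem / 2 ** 30:.1f}GB GPU")
# offload=1: 44.7GB CPU | 1.9GB GPU   -> params * max(4*8, 16) * 1.5 bytes on CPU, 2 bytes/param on GPU

cpu_mem, gpu_mem = estimate_zero2_model_states_mem_needs(
    total_params=int(1e9), num_gpus_per_node=8, num_nodes=1, cpu_offload=False)
print(f"offload=0: {cpu_mem / 2 ** 30:.1f}GB CPU | {gpu_mem / 2 ** 30:.1f}GB GPU")
# offload=0: 44.7GB CPU | 5.6GB GPU   -> (4 + 16/8) bytes per param on GPU
```

diff --git a/colossalai/zero/zero_redundancy_optimizer_level_3.py b/colossalai/zero/zero_redundancy_optimizer_level_3.py deleted file mode 100644 index 34051e638c1932e41450e55d54b89f8483cb1c87..0000000000000000000000000000000000000000 --- a/colossalai/zero/zero_redundancy_optimizer_level_3.py +++ /dev/null @@ -1,3624 +0,0 @@ -""" -Copyright 2020 The Microsoft DeepSpeed Team. -Licensed under the MIT license.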
-""" - -import math -from collections import OrderedDict - -import torch -import torch.distributed as dist - -try: - from deepspeed.utils.debug import debug_module2name_id, debug_param2name_id, debug_param2name_id_numel, \ - debug_param2name_id_shape_device, debug_module2name_class - from deepspeed.ops.adam import DeepSpeedCPUAdam - from deepspeed.ops.op_builder import UtilsBuilder - from deepspeed.runtime.swap_tensor.partitioned_optimizer_swapper import PartitionedOptimizerSwapper - from deepspeed.runtime.swap_tensor.pipelined_optimizer_swapper import PipelinedOptimizerSwapper - from deepspeed.runtime.utils import is_model_parallel_parameter - from deepspeed.runtime.zero.constants import ZERO_OPTIMIZATION_WEIGHTS - from deepspeed.runtime.zero.partition_parameters import * - from deepspeed.runtime.zero.partition_parameters import _init_external_params -except ImportError: - pass - -from torch._six import inf -from torch.distributed.distributed_c10d import _get_global_rank -from torch.optim import Optimizer - -from colossalai.core import global_context as gpc -from colossalai.utils import report_memory_usage -from .loss_scaler import LossScaler, DynamicLossScaler -from colossalai.context import ParallelMode - -# Toggle this to true to enable correctness test -# with gradient partitioning and without -pg_correctness_test = False - -FWD_MODULE_STACK = list() - - -def print_rank_0(message, debug=False, force=False): - rank = torch.distributed.get_rank() - if rank == 0 and (debug or force): - print(message) - # other variations - # - print for all ranks w/o interleaving - # printflock(f"[{rank}] {message}") - # - print to log file per rank - # log_rank_file(rank, message) - - -def input(msg): - return - - -def split_half_float_double(tensors): - dtypes = [ - "torch.cuda.HalfTensor", - "torch.cuda.FloatTensor", - "torch.cuda.DoubleTensor" - ] - buckets = [] - for i, dtype in enumerate(dtypes): - bucket = [t for t in tensors if t.type() == dtype] - if bucket: - buckets.append(bucket) - return buckets - - -def isclose(a, b, rtol=1e-09, atol=0.0): - return abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol) - - -def lcm(x, y): - from fractions import gcd # or can import gcd from `math` in Python 3 - return x * y // gcd(x, y) - - -def move_to_cpu(tensor_list): - for tensor in tensor_list: - tensor.data = tensor.data.cpu() - - -def get_all_parameters(sub_module, recurse=False): - return itertools.chain(sub_module.named_parameters(recurse=recurse), - sub_module.ds_external_parameters()) - - -# apply torch.autograd.Function that calls a backward_function to tensors in output -def _apply_to_tensors_only(module, functional, backward_function, outputs): - if type(outputs) is tuple: - touched_outputs = [] - for output in outputs: - touched_output = _apply_to_tensors_only(module, - functional, - backward_function, - output) - touched_outputs.append(touched_output) - return tuple(touched_outputs) - elif type(outputs) is torch.Tensor: - return functional.apply(module, backward_function, outputs) - else: - return outputs - - -# for each tensor in outputs run the forward_funciton and register backward_function as hook -def _apply_forward_and_backward_to_tensors_only(module, - forward_function, - backward_function, - outputs): - if type(outputs) is tuple: - touched_outputs = [] - for output in outputs: - touched_output = _apply_forward_and_backward_to_tensors_only( - module, - forward_function, - backward_function, - output) - touched_outputs.append(touched_output) - return tuple(touched_outputs) - elif 
type(outputs) is torch.Tensor: - forward_function(outputs) - if outputs.requires_grad: - outputs.register_hook(backward_function) - return outputs - else: - return outputs - - - class ZeROOrderedDict(OrderedDict): - def __init__(self, parent_module, *args, **kwargs): - """A replacement for ``collections.OrderedDict`` to detect external ZeRO params. - - Args: - parent_module (``torch.nn.Module``): the module whose ``_parameters`` collection is being replaced - """ - - super().__init__(*args, **kwargs) - self._parent_module = parent_module - self._in_forward = False - - def __getitem__(self, key): - param = super().__getitem__(key) - - # Params can be registered as None (e.g., bias) - if param is None: - return param - - if param.ds_status == ZeroParamStatus.NOT_AVAILABLE: - if self._parent_module._parameters._in_forward: - print_rank_0(f'Registering external parameter from getter {key}', - force=False) - register_external_parameter(FWD_MODULE_STACK[-1], param) - param.all_gather() - - return param - - - def _inject_parameters(module, cls): - for module in module.modules(): - if cls == ZeROOrderedDict: - new_param = cls(parent_module=module) - else: - new_param = cls() - - for key, param in module._parameters.items(): - new_param[key] = param - module._parameters = new_param - - - # TODO Needs to be implemented - class PrefetchCoordinator(object): - def __init__(self): - # step_id keeps track of the number of sub-modules invoked so far - # the step_id is tracking forward and backward sequence of sub-modules - self.step_id = 0 - - # stores the sequence of sub modules in forward+backward pass - self.sub_module_trace = [] - - # maps sub_module id to submodule objects - self.id_to_sub_module_map = {} - - # stores the total number of parameters in each sub_module - self.id_to_sub_module_size_map = {} - - self.trace_completed = False - - self.most_recent_sub_module_step = {} - - # reuse distances - self.reuse_numel_for_step_id = {} - - def record_trace(self, sub_module): - if not self.trace_completed: - self.sub_module_trace.append(sub_module.id) - self.id_to_sub_module_map[sub_module.id] = sub_module - - def print_trace(self): - print_rank_0( - f"The module trace is : {[self.id_to_sub_module_map[module_id].id for module_id in self.sub_module_trace]}" - ) - - def increment_step(self, sub_module): - self.most_recent_sub_module_step[sub_module.id] = self.step_id - self.step_id += 1 - - def reset_step(self): - self.step_id = 0 - - # returns the parameters (up to numel elements) that will be used next but are not yet available or in flight - def get_params_to_prefetch(self, sub_module, numel=2000000): - - # numel_in_sub_module = 0 - # for name, param in sub_module.named_parameters(recurse=False): - # numel_in_sub_module += param.ds_numel - - # #if numel_in_sub_module < (numel // 2): - # return [] - - # tracing failed. The sub_module passed at the step_id must match with the sub_module during tracing - if sub_module.id != self.sub_module_trace[self.step_id]: - print_rank_0( - f"Tracing failed.
Prefetching is disabled at sub-module: {debug_module2name_id(sub_module)}" - ) - return [] - - params_to_prefetch = [] - total_numel_to_prefetch = 0 - - for i in range(self.step_id, len(self.sub_module_trace)): - module_id = self.sub_module_trace[i] - for _, param in get_all_parameters(self.id_to_sub_module_map[module_id]): - if param.ds_status is ZeroParamStatus.NOT_AVAILABLE and ( - param.ds_id not in [p.ds_id for p in params_to_prefetch]): - params_to_prefetch.append(param) - total_numel_to_prefetch += param.ds_numel - # print_rank_0(f"Total numel to prefetch: {total_numel_to_prefetch}. Param: {param.ds_shape} and numel {param.ds_numel}, numel limit {numel}") - # and total_numel_to_prefetch > (numel_in_sub_module // 2): - if total_numel_to_prefetch >= numel: - return params_to_prefetch - - return params_to_prefetch - - # checks if this sub_module will be used again and if so then returns the number of elements - # in the parameters used between this sub_module and the reuse of this sub_module - def get_reuse_distance_in_numel(self, sub_module, sub_module_step_id=None): - # assert is_forward is not None, "is_forward must be set to True for Forward Propagation and False for backward Propagation" - is_there_reuse = False - reuse_distance_in_numel = 1000000000000 - - # set the appropriate trace - trace = self.sub_module_trace - total_steps = len(trace) - if sub_module_step_id is None: - sub_module_step_id = self.most_recent_sub_module_step[sub_module.id] - - # tracing failed. The sub_module passed at the step_id must match with the sub_module during tracing - if sub_module.id != trace[sub_module_step_id]: - print_rank_0( - f"Tracing failed. Cannot tell if the sub_module: {sub_module.id} is reused" - ) - return reuse_distance_in_numel - - # return cached value - if sub_module_step_id in self.reuse_numel_for_step_id: - return self.reuse_numel_for_step_id[sub_module_step_id] - - start_step = self.step_id - print_rank_0(f"Step id is {self.step_id} ") - for step_id in range(start_step, total_steps): - print_rank_0( - f"Trace id {trace[step_id]} and sub_module id {sub_module.id}") - if sub_module.id == trace[step_id]: - end_step = step_id - - is_there_reuse = True - reuse_distance_in_numel = self._distance_in_numel( - start_step, - end_step, - trace) - break - - self.reuse_numel_for_step_id[sub_module_step_id] = reuse_distance_in_numel - - return reuse_distance_in_numel - - def _distance_in_numel(self, start_step, end_step, trace): - distance_in_numel = 0 - for step_id in range(start_step, end_step): - module_id = trace[step_id] - for _, param in self.id_to_sub_module_map[module_id].named_parameters(recurse=False): - distance_in_numel += param.ds_numel - for _, param in self.id_to_sub_module_map[module_id].ds_external_parameters(): - distance_in_numel += param.ds_numel - return distance_in_numel - - -class PartitionedParameterCoordinator(object): - def __init__(self, - comm_stream=None, - max_reuse_distance_in_numel=500000000, - max_available_parameters_in_numel=700000000): - - self.in_flight_handles = [] - self.params_in_flight = [] - self.comm_stream = comm_stream if comm_stream is not None else torch.cuda.current_stream( - ) - self.prefetch_coordinator = PrefetchCoordinator() - self.hierarchy = 0 - - self.total_available_parameter_numel = 0 - self.max_available_parameters_in_numel = max_available_parameters_in_numel - - # max distance between two use of the module beyond which module is released - self.max_reuse_distance_in_numel = max_reuse_distance_in_numel - - def 
_increment_available_parameter_numel(self, increment): - self.total_available_parameter_numel += increment - - def _decrement_available_parameter_numel(self, decrement): - self.total_available_parameter_numel -= decrement - - '''-----------------------Tracing and Prefetching ---------------''' - - def record_trace(self, sub_module): - self.prefetch_coordinator.record_trace(sub_module) - - def finish_tracing(self, print_trace=False): - self.prefetch_coordinator.trace_completed = True - - if print_trace: - self.prefetch_coordinator.print_trace() - - # swap in parameter partitions from nvme for those parameters that will be used - # after the ones that are already being prefetched into full parameters - def _prefetch_nvme_param_partitions(self, sub_module, params_in_flight): - numel_in_flight = sum( - [param.ds_tensor.ds_numel for param in params_in_flight]) - upcoming_param_list = self.prefetch_coordinator.get_params_to_prefetch( - sub_module, - numel=2 * numel_in_flight) - swap_in_params = [] - for param in upcoming_param_list: - if len(swap_in_params) >= param.nvme_swapper.available_swap_in_buffers(): - break - if param.ds_tensor.status == PartitionedParamStatus.NOT_AVAILABLE: - swap_in_params.append(param) - - if len(swap_in_params) > 0: - swap_in_params[0].nvme_swapper.swap_in( - swap_in_params, async_op=True) - - # Prefetches the parameters for sub_modules that come after - # the current sub_module. This call is asynchronous - def prefetch_next_sub_modules(self, sub_module, numel=5000000, nvme=False): - - params_to_prefetch = [] - if not self.prefetch_coordinator.trace_completed: - return params_to_prefetch - - # prefetch if there is no current prefetching in flight - if not self.in_flight_handles and self.total_available_parameter_numel < self.max_available_parameters_in_numel: - params_to_prefetch = self.prefetch_coordinator.get_params_to_prefetch( - sub_module, - numel=numel) - - self._all_gather(params_to_prefetch, async_op=True) - for param in params_to_prefetch: - param.ds_status = ZeroParamStatus.INFLIGHT - - # keeping track of number of elements consumed by available parameters - self._increment_available_parameter_numel(param.ds_numel) - - if nvme: - self._prefetch_nvme_param_partitions( - sub_module, params_to_prefetch) - - self._print_prefetch_elements_info(sub_module, params_to_prefetch) - print_rank_0( - f"{'--' * self.hierarchy}--PreFetching parameters {[param.ds_id for param in params_to_prefetch]} and available {self.total_available_parameter_numel}, max limit {self.max_available_parameters_in_numel}", - force=False) - - def _print_prefetch_elements_info(self, sub_module, params_to_prefetch): - sub_module_numel = 0.0 - for name, param in sub_module.named_parameters(recurse=False): - sub_module_numel += param.ds_numel - numel_being_prefetched = 0 - for param in params_to_prefetch: - numel_being_prefetched += param.ds_numel - print_rank_0( - f"{'--' * self.hierarchy}--PreFetching {numel_being_prefetched} numels and number of numel in the next sub module is {sub_module_numel}", - force=False) - - def increment_step(self, sub_module): - self.prefetch_coordinator.increment_step(sub_module) - - def reset_step(self): - self.prefetch_coordinator.reset_step() - - '''----------------------------------------------------------------------''' - - # Fetches the parameters in the sub_module - # This call is blocking - def fetch_sub_module(self, sub_module): - partitioned_params = [] - params_in_flight = False - print_rank_0( - f"{'--' * self.hierarchy}Fetching params in module
{debug_module2name_class(sub_module)}" - ) - params_to_fetch = [ - param for _, - param in sub_module.named_parameters(recurse=False) - ] - # print([n for n,p in sub_module.named_parameters(recurse=False)]) - - if hasattr(sub_module, 'ds_external_parameters'): - print_rank_0( - f"{'--' * self.hierarchy}--Fetching external parameters {sub_module.ds_external_parameters()}" - ) - params_to_fetch += [ - param for _, - param in sub_module.ds_external_parameters() - ] - # for _, param in sub_module.named_parameters(recurse=False): - for param in params_to_fetch: - param.ds_active_sub_modules += 1 - print_rank_0( - f"{'--' * self.hierarchy}--Fetching parameters {debug_param2name_id_shape(param)} with active sub modules {param.ds_active_sub_modules}" - ) - - if param.ds_status == ZeroParamStatus.AVAILABLE: - print_rank_0( - f"{'--' * self.hierarchy}--Parameter {debug_param2name_id(param)} is already available" - ) - - if param.ds_status == ZeroParamStatus.NOT_AVAILABLE: - print_rank_0( - f"{'--' * self.hierarchy}--Parameter {debug_param2name_id(param)} is being fetched" - ) - partitioned_params.append(param) - - # keeping track of number of elements consumed by available parameters - self._increment_available_parameter_numel(param.ds_numel) - print_rank_0(f"Incrementing with parameter id {param.ds_id}") - - if param.ds_status == ZeroParamStatus.INFLIGHT: - params_in_flight = True - print_rank_0( - f"{'--' * self.hierarchy}--Parameter {debug_param2name_id(param)} is already in flight (prefetched)" - ) - self.hierarchy += 1 - - # parameters are partitioned and need to be allgathered - self._all_gather(partitioned_params, async_op=True) - - # parameters are inflight and communication needs to be completed - if partitioned_params or params_in_flight: - self._synchronize_communication() - - for _, param in sub_module.named_parameters(recurse=False): - param.ds_status = ZeroParamStatus.AVAILABLE - print_rank_0( - f"Param {debug_param2name_id_shape_device(param)} norm={param.norm()}", - force=False) - # print_rank_0(f"After fetching (id, shape, device): {[(param.ds_id, param.shape, param.device) for param in sub_module.named_parameters(recurse=False)]}") - - def release_sub_module(self, sub_module): - self.hierarchy -= 1 - print_rank_0( - f"{'--' * self.hierarchy}Releasing params in module {debug_module2name_class(sub_module)}" - ) - params_to_release = [ - param for _, - param in sub_module.named_parameters(recurse=False) - ] - - if hasattr(sub_module, 'ds_external_parameters'): - # print_rank_0(f"Releasing external parameters {sub_module.ds_external_parameters()}") - params_to_release += [ - param for _, - param in sub_module.ds_external_parameters() - ] - - # for _, param in sub_module.named_parameters(recurse=False): - for param in params_to_release: - param.ds_active_sub_modules -= 1 - if not param.ds_active_sub_modules and not self._keep_for_later( - sub_module) and not param.ds_persist: - - print_rank_0( - f"{'--' * self.hierarchy}--Releasing parameter {debug_param2name_id_numel(param)} active sub modules {param.ds_active_sub_modules} and keep for later {self._keep_for_later(sub_module)}", - force=False) - - # Keeping track of number of elements that are consumed by available parameters - self._decrement_available_parameter_numel(param.ds_numel) - - # report_memory_usage( - # f"Before releasing param {debug_param2name_id_numel(param)}", - # ) - param.partition(hierarchy=self.hierarchy) - - # report_memory_usage( - # f"After releasing param {debug_param2name_id_numel(param)}", - # ) - -
param.ds_status = ZeroParamStatus.NOT_AVAILABLE - else: - print_rank_0( - f"{'--' * self.hierarchy}--Did not release param {debug_param2name_id_numel(param)} with active sub modules {param.ds_active_sub_modules}, keep for later={self._keep_for_later(sub_module)} and persistence={param.ds_persist}", - force=False) - - def release_and_reset_parameter(self, param): - param.ds_active_sub_modules = 0 - if param.ds_status == ZeroParamStatus.AVAILABLE: - print_rank_0( - f"Releasing unpartitioned param {debug_param2name_id_numel(param)} active sub-modules {param.ds_active_sub_modules} and persistence {param.ds_persist}" - ) - self._decrement_available_parameter_numel(param.ds_numel) - param.partition() - - def _keep_for_later(self, sub_module): - if not self.prefetch_coordinator.trace_completed: - return False - if self.max_reuse_distance_in_numel == 0: - return False - reuse_distance_in_numel = self.prefetch_coordinator.get_reuse_distance_in_numel( - sub_module) - # print_rank_0(f"Reuse distance and numel for sub_module id {sub_module.id} is {reuse_distance_in_numel}") - return reuse_distance_in_numel < self.max_reuse_distance_in_numel - - def _all_gather(self, partitioned_params, async_op=False): - with torch.cuda.stream(self.comm_stream): - handles = partitioned_params[0].all_gather( - param_list=partitioned_params, - async_op=async_op, - hierarchy=self.hierarchy) if partitioned_params else None - - if handles is not None: - self.in_flight_handles.extend(handles) - self.params_in_flight.extend(partitioned_params) - - def _synchronize_communication(self, synchronize_streams=True): - assert len(self.params_in_flight) == len(self.in_flight_handles) - for handle, param in zip(self.in_flight_handles, self.params_in_flight): - if handle is not None: - with torch.cuda.stream(self.comm_stream): - handle.wait() - param.ds_status = ZeroParamStatus.AVAILABLE - self.comm_stream.synchronize() - torch.cuda.synchronize() if synchronize_streams else None - self.in_flight_handles = [] - self.params_in_flight = [] - - - class PreBackwardFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, module, pre_backward_function, outputs): - ctx.module = module - ctx.pre_backward_function = pre_backward_function - if not hasattr(module, "applied_pre_backward_ref_cnt"): - module.applied_pre_backward_ref_cnt = 0 - module.applied_pre_backward_ref_cnt += 1 - # print(f"After Forward: {ctx.module.__class__.__name__}") - outputs = outputs.detach() - return outputs - - @staticmethod - def backward(ctx, *args): - # print(f"Before Backward: {ctx.module.__class__.__name__}") - ctx.pre_backward_function(ctx.module) - return (None, None) + args - - - class PostBackwardFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, module, pre_backward_function, output): - ctx.module = module - if output.requires_grad: - # TODO: sometimes post backward does not seem to be triggered; debug in detail - # Should only cause an increase in memory, not a correctness issue - # if output.grad_fn.__class__.__name__ == 'ViewBackward': - # ctx.view=True - # print(f"Warning view tensor for input to module : {module.__class__.__name__}. Backward hooks may not trigger properly") - # assert len(module.parameters(recurse=False)), "The input tensor to the module is a view, and autograd Function or register_hook is not triggered with view tensors."
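- # ds_grads_remaining counts how many of this module's outputs still owe a gradient; the backward() below only fires the post-backward function once the count returns to zero, so the module's post-backward work runs exactly once per pass.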
- # if module.ds_grads_remaining == 0: - # print(f"Before Forward: {ctx.module.__class__.__name__}") - module.ds_grads_remaining += 1 - ctx.pre_backward_function = pre_backward_function - output = output.detach() - return output - - @staticmethod - def backward(ctx, *args): - ctx.module.ds_grads_remaining = ctx.module.ds_grads_remaining - 1 - if ctx.module.ds_grads_remaining == 0: - ctx.pre_backward_function(ctx.module) - # print(f"After Backward: {ctx.module.__class__.__name__}") - return (None, None) + args - - - INITIAL_MICRO_STEP_ID = -1 - - - class ZeroRedundancyOptimizer_Level_3(Optimizer): - """ - ZeroRedundancyOptimizer_Level_3 is designed to reduce the memory footprint - required for training large deep learning models. - - For more details please see ZeRO: Memory Optimization Towards Training A Trillion Parameter Models - https://arxiv.org/abs/1910.02054 - - """ - - def __init__(self, - module, - init_optimizer, - dp_parallel_mode=ParallelMode.DATA, - static_loss_scale=1.0, - dynamic_loss_scale=False, - dynamic_loss_args=None, - verbose=False, - contiguous_gradients=True, - reduce_bucket_size=500000000, - prefetch_bucket_size=50000000, - max_reuse_distance=1000000000, - max_live_parameters=1000000000, - param_persistence_threshold=100000, - reduce_scatter=True, - overlap_comm=False, - offload_optimizer_config=None, - offload_param_config=None, - sub_group_size=1000000000000, - clip_grad=0.0, - allreduce_always_fp32=False, - postscale_gradients=True, - gradient_predivide_factor=1.0, - gradient_accumulation_steps=1, - aio_config=None, - dtype=torch.half): - # mpu = None - # mpu is removed from the parameter list - # tensor parallel will be automatically detected later - - # LSG: default parameter for compatibility - elastic_checkpoint = False - timers = None - dp_process_group = gpc.get_group(dp_parallel_mode) - self.verbose = verbose - - # LSG: in deepspeed deepspeed/runtime/zero/partition_parameters.py, - # self.local_device = torch.device('cuda:{}'.format(os.environ["LOCAL_RANK"])) - # the local device is obtained by env var LOCAL_RANK, thus, need to change this - # env var on the spot as LOCAL_RANK may not be present - if 'LOCAL_RANK' not in os.environ: - device_id = gpc.get_global_rank() % torch.cuda.device_count() - os.environ['LOCAL_RANK'] = str(device_id) - - # self.local_device = torch.device('cuda:{}'.format(os.environ["LOCAL_RANK"])) - - if self.verbose: - report_memory_usage("Stage 3 initialize beginning") - - if dist.get_rank() == 0: - print(f"Reduce bucket size {reduce_bucket_size}") - print(f"Prefetch bucket size {prefetch_bucket_size}") - # The fused optimizer does all the work. We need this layer for two reasons: - # 1. maintain same user API from apex.fp16_utils - # 2. keep common stuff here in case we need to add new fused optimizer later - - # differences from apex.fp16_utils: - # - assume all model params in fp16 - # - assume all params require grad - # - flat by groups, not keeping state. TODO: remove state explicitly? - # - master grad and unflat master weight never exist. TODO: a way to save out unflat master?
if not torch.cuda.is_available(): - raise SystemError("Cannot use fp16 without CUDA.") - self.optimizer = init_optimizer - self.defaults = init_optimizer.defaults - - # Load pre-built or JIT compile (un)flatten ops - util_ops = UtilsBuilder().load() - self.flatten = util_ops.flatten - self.unflatten = util_ops.unflatten - self.dtype = dtype - - if not all(is_zero_param(p) for p in module.parameters()): - ds_config = { - "train_micro_batch_size_per_gpu": 1, - "gradient_accumulation_steps": 1, - "zero_optimization": { - "offload_param": offload_param_config, - "offload_optimizer": offload_optimizer_config, - }, - "aio": aio_config - } - - if offload_param_config is not None: - remote_device = offload_param_config['device'] - else: - remote_device = None - - if offload_optimizer_config is not None: - pin_memory = offload_optimizer_config.get(OFFLOAD_OPTIMIZER_PIN_MEMORY, False) - else: - pin_memory = False - - group = None - if gpc.is_initialized(ParallelMode.DATA): - group = gpc.get_group(ParallelMode.DATA) - Init(module=module, data_parallel_group=group, dtype=self.dtype, - remote_device=remote_device, config_dict_or_path=ds_config, - pin_memory=pin_memory) - - for m in module.modules(): - _init_external_params(m) - - self.module = module - self.elastic_checkpoint = elastic_checkpoint - self.overlap_comm = overlap_comm - - # Replace ._parameters with a new class to enable auto-registration of - # external parameters - _inject_parameters(module, ZeROOrderedDict) - - if self.overlap_comm: - self.gpu_sum = torch.zeros(1, dtype=torch.float).cuda() - - ###################### offload optimizer setup ################################## - self.optimizer_swapper = None - self.swap_optimizer = False - - self.offload_optimizer = False - self.offload_optimizer_pin_memory = False - self.offload_optimizer_fast_init = False - if offload_optimizer_config is not None: - self.offload_optimizer = True - self.offload_optimizer_pin_memory = offload_optimizer_config[ - OFFLOAD_OPTIMIZER_PIN_MEMORY] - self.swap_optimizer = offload_optimizer_config[ - OFFLOAD_OPTIMIZER_DEVICE] == OFFLOAD_NVME_DEVICE - self.offload_optimizer_fast_init = offload_optimizer_config[ - OFFLOAD_OPTIMIZER_FAST_INIT] - - ###################### offload param setup ################################## - self.offload_param = False - self.offload_param_pin_memory = False - self.params_in_nvme_and_cpu = False - self.max_params_in_cpu = 0 - if offload_param_config is not None: - assert self.offload_optimizer, "parameter offload is only available with optimizer state offload" - self.offload_param = True - self.offload_param_pin_memory = offload_param_config[ - OFFLOAD_PARAM_PIN_MEMORY] - self.params_in_nvme_and_cpu = offload_param_config[ - OFFLOAD_PARAM_DEVICE] == OFFLOAD_NVME_DEVICE - self.max_params_in_cpu = offload_param_config[OFFLOAD_PARAM_MAX_IN_CPU] - if self.verbose: - print_rank_0( - f"FP16 params swapping is {self.params_in_nvme_and_cpu}, Max params in CPU is {self.max_params_in_cpu}", - force=False) - - self.deepspeed_adam_offload = (self.offload_optimizer - and type(init_optimizer) == DeepSpeedCPUAdam) - - self.device = torch.cuda.current_device( - ) if not self.offload_optimizer else OFFLOAD_CPU_DEVICE - ############################################################################ - - if self.verbose: - report_memory_usage("Before Partitioned Parameter Coordinator") - - fetch_stream = torch.cuda.Stream() if self.overlap_comm else None - self.param_coordinator = PartitionedParameterCoordinator( - comm_stream=fetch_stream, -
max_reuse_distance_in_numel=int(max_reuse_distance), - max_available_parameters_in_numel=int(max_live_parameters)) - - if self.verbose: - report_memory_usage("After Partitioned Parameter Coordinator") - - # self.param_coordinator = PartitionedParameterCoordinator(comm_stream=torch.cuda.Stream()) - # -------------Stage 3 Setup-------------------# - # parameters smaller than the threshold will be collectively gathered at the - # end of the optimizer step and will be kept till the end of the backward pass - # TODO maybe worth just replicating these parameters and doing all reduce for them - self.persistence_threshold = int(param_persistence_threshold) - - self.persistent_parameters = self.persistent_parameters() - - self.setup_zero_stage3_hooks() - - # resetting ds_tensor just in case parameters have been changed after initialization - # example .half() or .to() - # self.reset_ds_tensor() - # ---------------------------------------------# - - self.timers = timers - - self.reduce_scatter = reduce_scatter - - self.dp_process_group = dp_process_group - - self.partition_count = dist.get_world_size(group=self.dp_process_group) - - if not gpc.is_initialized(ParallelMode.TENSOR): - self.model_parallel_group = None - self.model_parallel_rank = 0 - else: - self.model_parallel_group = gpc.get_group(ParallelMode.TENSOR) - self.model_parallel_rank = gpc.get_local_rank(ParallelMode.TENSOR) - - self.overflow = False - self.clip_grad = clip_grad - self.allreduce_always_fp32 = allreduce_always_fp32 - self.gradient_predivide_factor = gradient_predivide_factor - self.postscale_gradients = postscale_gradients - self.gradient_accumulation_steps = gradient_accumulation_steps - self.micro_step_id = INITIAL_MICRO_STEP_ID - - if self.reduce_scatter: - assert not self.allreduce_always_fp32, "allreduce_always_fp32 is not yet supported with ZeRO-2 with reduce scatter enabled" - assert self.gradient_predivide_factor == 1.0, "gradient_predivide_factor != 1.0 is not yet supported with ZeRO-2 with reduce scatter enabled" - assert self.postscale_gradients, "pre-scale gradients is not yet supported with ZeRO-2 with reduce scatter enabled" - - # Holds the model parameters - # The param.data may not hold any meaningful data - # when param's status is NOT_AVAILABLE or IN_FLIGHT - self.fp16_groups = [] - - # Hold partitioned parameters - self.fp16_partitioned_groups = [] - - # Holds a fused and flattened copy of the parameters - self.fp16_partitioned_groups_flat = [] - self.fp16_partitioned_groups_flat_numel = [] - - # defragmented pinned memory - self.param_groups_fp16_flat_cpu_memory = [] - - # a single 32-bit partition of the parallel partitioned parameters - # that this process will update - self.fp32_partitioned_groups_flat = [] - self.next_swappable_fp32_partitioned_groups = [] - - # number of elements per partition in each group - self.partition_size = [] - - self.all_reduce_print = False - - self.prefetch_elements = int(prefetch_bucket_size) - - # padding on each partition for alignment purposes - self.groups_padding = [] - - self.sub_group_size = sub_group_size - - self.sub_group_to_group_id = {} - - if self.verbose: - report_memory_usage("Before creating fp16 partitions") - self._create_fp16_partitions_with_defragmentation() - num_fp16_subgroups = len(self.fp16_partitioned_groups_flat) - if self.verbose: - report_memory_usage( - f"After creating fp16 partitions: {num_fp16_subgroups}") - - # Optimizer tensor swapping - if self.swap_optimizer: - self._configure_tensor_swapping( - offload_optimizer_config, aio_config) - -
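- # Note the ordering here: the fp16 shards and (optionally) the NVMe optimizer swapper are set up first; the fp32 master partitions and the base optimizer state created below are materialized from them.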
if self.verbose: - report_memory_usage("Before creating fp32 partitions") - self._create_fp32_partitions() - if self.verbose: - report_memory_usage("After creating fp32 partitions") - dist.barrier() - - # To support pipelined optimizer swapping - self._create_next_swappable_fp32_groups() - - if self.verbose: - report_memory_usage("Before initializing optimizer states") - self.initialize_optimizer_states() - if self.verbose: - report_memory_usage("After initializing optimizer states") - dist.barrier() - - if dist.get_rank() == 0 and self.verbose: - print("optimizer state initialized") - - self.reduce_bucket_size = int(reduce_bucket_size) - - self.reduction_event = torch.cuda.Event( - enable_timing=False, blocking=False) - - self.reduction_stream = torch.cuda.Stream( - ) if self.overlap_comm else torch.cuda.current_stream() - self.callback_queued = False - self.copy_grad_stream = torch.cuda.Stream() - - self.param_dict = {} - - # map between param_id and bool to specify if a param is in this partition - self.is_param_in_current_partition = {} - - self.contiguous_gradients = contiguous_gradients - self.extra_large_param_to_reduce = None - self.grads_in_ipg_bucket = [] - self.params_in_ipg_bucket = [] - self.elements_in_ipg_bucket = 0 - self.params_already_reduced = [] - self.is_gradient_accumulation_boundary = True - self._release_ipg_buffers() - self.previous_reduced_grads = None - - # simplified param id - self.param_id = {} - - count = 0 - for i, params_group in enumerate(self.fp16_groups): - for param in params_group: - unique_id = id(param) - self.param_id[unique_id] = count - self.param_dict[count] = param - self.params_already_reduced.append(False) - count = count + 1 - - # Largest partitioned param - largest_partitioned_param_numel = max([ - max([tensor.numel() for tensor in fp16_partitioned_group]) - for fp16_partitioned_group in self.fp16_partitioned_groups - ]) - if self.verbose: - print_rank_0( - f'Largest partitioned param numel = {largest_partitioned_param_numel}', - force=False) - - if self.verbose: - report_memory_usage(f"Before Set Grad positions") - - self.grad_position = {} - self.set_grad_positions() - if self.verbose: - report_memory_usage(f"Before CPU Offload initialization") - - self.grads_in_partition = None - - if self.offload_optimizer: - self.accumulated_grads_in_cpu = {} - self.norm_for_param_grads = {} - self.local_overflow = False - self.temp_grad_buffer_for_gpu_offload = torch.zeros( - largest_partitioned_param_numel, - device=torch.cuda.current_device(), - dtype=self.dtype) - self.temp_grad_gpu_buffer = torch.zeros(largest_partitioned_param_numel, - device=torch.cuda.current_device(), - dtype=self.dtype) - - if self.verbose: - report_memory_usage(f"After CPU Offload initialization") - - # stores if a partition has been reduced in this step - self.is_partition_reduced = {} - - # stores if a grad in a partition has been computed or not - self.is_grad_computed = {} - - # will store the averaged gradients required by this partition - self.averaged_gradients = {} - - # creates backward hooks for gradient partitioning - self.create_reduce_and_remove_grad_hooks() - - # exit(0)
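- # Dynamic loss scaling (the DynamicLossScaler constructed below) shrinks cur_scale when an overflow is detected and grows it back after a window of overflow-free steps; the static path keeps loss_scale fixed for the whole run.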
- # We may have a way of fusing dynamic scale. Not supported for now. - if self.dtype == torch.float or not dynamic_loss_scale: - loss_scale_value = 1.0 if self.dtype == torch.float else static_loss_scale - - self.dynamic_loss_scale = False - self.loss_scaler = LossScaler(scale=loss_scale_value) - cur_iter = 0 - else: - if dynamic_loss_args is None: - self.loss_scaler = DynamicLossScaler() - else: - self.loss_scaler = DynamicLossScaler(**dynamic_loss_args) - - self.dynamic_loss_scale = True - - self.debug_fp16_grads = [{} for _ in self.fp16_groups] - - if dist.get_rank(group=self.dp_process_group) == 0 and self.verbose: - report_memory_usage(f"After initializing ZeRO optimizer") - - def _configure_tensor_swapping(self, offload_optimizer_config, aio_config): - nvme_swap_folder = os.path.join( - offload_optimizer_config[OFFLOAD_OPTIMIZER_NVME_PATH], - 'zero_stage_3') - os.makedirs(nvme_swap_folder, exist_ok=True) - if torch.distributed.get_rank() == 0 and self.verbose: - print(f'Tensor Swapping: Adding optimizer tensors') - - swapper_type = PipelinedOptimizerSwapper if offload_optimizer_config[ - OFFLOAD_OPTIMIZER_PIPELINE] else PartitionedOptimizerSwapper - - self.optimizer_swapper = swapper_type( - swap_config=offload_optimizer_config, - aio_config=aio_config, - base_folder=nvme_swap_folder, - optimizer=self.optimizer, - largest_numel=max(self.fp16_partitioned_groups_flat_numel), - device=self.device, - dtype=torch.float32, - timers=self.timers) - - def _create_fp16_partitions(self): - dist.barrier() - partition_id = dist.get_rank(group=self.dp_process_group) - - # loop to deal with groups - for j, param_group in enumerate(self.optimizer.param_groups): - - sub_groups = self._create_fp16_sub_groups(param_group['params']) - for sub_group in sub_groups: - i = len(self.fp16_groups) - - # push this group to list before modify - self.fp16_groups.append(sub_group) - self.sub_group_to_group_id[i] = j - - # These are the list of the partitioned parameters - self.fp16_partitioned_groups.append( - [param.ds_tensor for param in self.fp16_groups[i]]) - - if self.verbose: - print_rank_0( - f"fp16 group {i} partitioned_param norms : {[param.ds_tensor.norm().item() for param in self.fp16_groups[i]]}" - ) - - # Record padding required to align group to world size (only applies to last rank) - if partition_id == dist.get_world_size(group=self.dp_process_group) - 1: - padding = [p.padding_size() for p in self.fp16_groups[i]] - else: - padding = [0] * len(self.fp16_groups[i]) - self.groups_padding.append(padding) - - # not sure why apex was cloning the weights before flattening - # removing cloning here - if self.verbose: - report_memory_usage(f"Before Flattening param group {i}") - - if not self.offload_param: - if self.verbose: - report_memory_usage( - f"Before moving param group {i} to CPU") - # move all the parameters to cpu to free up GPU space for creating flat buffer - move_to_cpu(self.fp16_partitioned_groups[i]) - if self.verbose: - report_memory_usage( - f"After moving param group {i} to CPU") - - # create flat buffer in CPU and move to GPU - self.fp16_partitioned_groups_flat.append( - self.flatten_dense_tensors_aligned( - self.fp16_partitioned_groups[i], - dist.get_world_size(group=self.dp_process_group)).cuda( - torch.cuda.current_device())) - - if self.verbose: - report_memory_usage( - f"After flattening and moving param group {i} to GPU" - ) - else: - # Without the detach, the flattening seems to become part of the - # model graph, causing errors downstream - self.fp16_partitioned_groups_flat.append( -
self.flatten_dense_tensors_aligned( - self.fp16_partitioned_groups[i], - dist.get_world_size( - group=self.dp_process_group)).detach().pin_memory()) - - if self.verbose: - report_memory_usage(f"After Flattening param group {i}") - - # set model fp16 weight to slices of flattened buffer - updated_params = self.unflatten(self.fp16_partitioned_groups_flat[i], - self.fp16_partitioned_groups[i]) - - for partitioned_param, q in zip(self.fp16_partitioned_groups[i], updated_params): - partitioned_param.data = q.data - - def _move_to_flat_buffer(self, param_list, flat_buffer, avoid_copy=False): - '''If flat buffer is None then the parameters in the param_list are - not copied to the flat buffer. This is because they exceed the number of max_params_in_cpu. - Some of these parameters may already be in CPU in unflattened buffers, - or they may be in GPU, or they may be in NVMe. If they are in NVMe, then - they will be marked as NOT_AVAILABLE, and will be moved to CPU when they are - needed during training.''' - if flat_buffer is None: - # this dst buffer is on NVMe, so skip this - return - - start = 0 - for param in param_list: - src = param.ds_tensor - dest = flat_buffer.narrow(0, start, src.ds_numel) - start = start + src.ds_numel - '''if the parameter was initialized in nvme then bring it to the destination buffer directly''' - if src.status == PartitionedParamStatus.NOT_AVAILABLE: - if self.verbose: - print_rank_0( - f"Swapping in {param.ds_id} with partition size {param.ds_tensor.ds_numel} permanently to CPU" - ) - param.nvme_swapper.swap_into_buffer(param, dest) - src.data = dest.data - src.status = PartitionedParamStatus.AVAILABLE - else: - assert src.status == PartitionedParamStatus.AVAILABLE, "Partitioned param must be available here" - if not avoid_copy: - dest.data.copy_(src.data) - src.data = dest.data - - # Final location must be gpu/cpu in this case - param.ds_tensor.final_location = 'not-nvme' - - def _create_param_groups_fp16_flat_cpu_memory(self): - - aggregate_params_count = 0 - - for j, param_group in enumerate(self.optimizer.param_groups): - params_in_group = sum( - [p.ds_tensor.ds_numel for p in param_group['params']]) - - flat_buffer_size = params_in_group - - if self.params_in_nvme_and_cpu and \ - aggregate_params_count + params_in_group > self.max_params_in_cpu: - flat_buffer_size = max(0, - self.max_params_in_cpu - aggregate_params_count) - - aggregate_params_count += params_in_group - - if flat_buffer_size > 0: - if self.verbose: - print_rank_0(f"group {j} flat buffer size {flat_buffer_size}", - force=False) - self.param_groups_fp16_flat_cpu_memory.append( - torch.empty(int(flat_buffer_size), - dtype=self.dtype, - pin_memory=True)) - else: - if self.verbose: - print_rank_0( - f"No flat buffer size.
Param group size was {params_in_group}", - force=False) - - self.param_groups_fp16_flat_cpu_memory.append( - torch.empty(1, - dtype=self.dtype)) - - def _create_fp16_partitions_with_defragmentation(self): - dist.barrier() - partition_id = dist.get_rank(group=self.dp_process_group) - create_fp16_flat_reuse_buffer = False - largest_partition_numel = [] - max_partition_numel = 0 - - # create a flat CPU memory allocation for each param group - if self.offload_param: - self._create_param_groups_fp16_flat_cpu_memory() - - # loop to deal with groups - for j, param_group in enumerate(self.optimizer.param_groups): - - sub_groups = self._create_fp16_sub_groups(param_group['params']) - - if self.verbose: - print_rank_0( - f'fp16 group {j} has {len(sub_groups)} subgroups', force=False) - - flat_offset = 0 - for sub_group in sub_groups: - i = len(self.fp16_groups) - - # push this group to list before modify - self.fp16_groups.append(sub_group) - self.sub_group_to_group_id[i] = j - - # comment out for zero_to_fp32 debug - # if torch.distributed.get_rank() == 0: - # for param in self.fp16_groups[i]: - # print(f"{debug_param2name_id_shape(param)} {param.ds_shape}") - - # These are the list of the partitioned parameters - self.fp16_partitioned_groups.append( - [param.ds_tensor for param in self.fp16_groups[i]]) - - total_elements = sum( - [t.ds_numel for t in self.fp16_partitioned_groups[i]]) - self.fp16_partitioned_groups_flat_numel.append(total_elements) - - if total_elements > max_partition_numel: - largest_partition_numel = [ - t.ds_numel for t in self.fp16_partitioned_groups[i] - ] - max_partition_numel = total_elements - - if self.verbose: - print_rank_0( - f"fp16 group {i} partitioned_param norms : {[param.ds_tensor.norm().item() for param in self.fp16_groups[i]]}" - ) - - # Record padding required to align group to world size (only applies to last rank) - if partition_id == dist.get_world_size(group=self.dp_process_group) - 1: - padding = [p.padding_size() for p in self.fp16_groups[i]] - else: - padding = [0] * len(self.fp16_groups[i]) - self.groups_padding.append(padding) - - # not sure why apex was cloning the weights before flattening - # removing cloning here - if self.verbose: - report_memory_usage( - f"Before Flattening param subgroup {i}") - - # all partitioned parameters remain in GPU during training - if not self.offload_param: - if self.verbose: - report_memory_usage( - f"Before moving param subgroup group {i} to CPU") - # move all the parameters to cpu to free up GPU space for creating flat buffer - move_to_cpu(self.fp16_partitioned_groups[i]) - if self.verbose: - report_memory_usage( - f"After moving param subgroup {i} to CPU") - - # create flat buffer in CPU and move to GPU - self.fp16_partitioned_groups_flat.append( - self.flatten_dense_tensors_aligned( - self.fp16_partitioned_groups[i], - 1).cuda(torch.cuda.current_device())) - if self.verbose: - report_memory_usage( - f"After flattening and moving param subgroup {i} to GPU") - - # all partitioned parameters are in CPU during training - else: - if self.verbose: - print_rank_0( - f"Params in nvme and cpu {self.params_in_nvme_and_cpu}") - # Flat buffer may not be available for parameters that reside in NVME - if not self.params_in_nvme_and_cpu or flat_offset + total_elements <= \ - self.param_groups_fp16_flat_cpu_memory[ - j].numel(): - fp16_partitioned_group_flat = self.param_groups_fp16_flat_cpu_memory[ - j].narrow(0, - flat_offset, - total_elements) - if self.verbose: - print_rank_0( - f"Creating a flat buffer for subgroup {i} 
requiring {total_elements} elements, and cumulative CPU elements {flat_offset + total_elements}", - force=False) - # these parameters reside on NVMe and are swapped in on demand - elif self.params_in_nvme_and_cpu: - fp16_partitioned_group_flat = None - if self.verbose: - print_rank_0( - f"No flat buffer for sub group {i} of {total_elements} elements", - force=False) - else: - assert False, "Either params are in nvme, or they are in CPU memory. This code path should not be triggered. Please check the max_params_in_cpu and params_in_nvme configs" - - self.fp16_partitioned_groups_flat.append( - fp16_partitioned_group_flat) - flat_offset += total_elements - - # move param to flat buffer for both param offload on/off - self._move_to_flat_buffer(self.fp16_groups[i], - self.fp16_partitioned_groups_flat[i], - avoid_copy=not self.offload_param) - if self.verbose: - report_memory_usage(f"After Flattening param group {i}") - - # create a pinned memory to be used for swapping out params to NVME after optimizer step - if self.fp16_partitioned_groups_flat[-1] is None: - create_fp16_flat_reuse_buffer = True - - if self.verbose: - report_memory_usage(f"After Flattening param subgroup {i}") - - if create_fp16_flat_reuse_buffer: - assert len( - largest_partition_numel) > 0, 'Unexpected that largest partition is empty' - self.fp16_groups[0][0].nvme_swapper.reserve_partitioned_swap_space( - largest_partition_numel) - - def _swap_in_sub_group_to_flat_buffer(self, flat_buffer, sub_group_id): - offset = 0 - elements_in_sub_group = sum( - [t.ds_numel for t in self.fp16_partitioned_groups[sub_group_id]]) - assert (flat_buffer.numel() == elements_in_sub_group) - for param, partitioned_param in zip(self.fp16_groups[sub_group_id], self.fp16_partitioned_groups[sub_group_id]): - dest = flat_buffer.narrow(0, offset, partitioned_param.ds_numel) - if partitioned_param.status == PartitionedParamStatus.NOT_AVAILABLE: - if self.verbose: - print_rank_0( - f"Swapping in {param.ds_id} with elements {param.ds_numel} and partition {param.ds_tensor.ds_numel}" - ) - param.nvme_swapper.swap_in([param], async_op=False) - dest.data.copy_(partitioned_param.data) - param.nvme_swapper.remove_partition_and_release_buffers([ - param]) - if self.verbose: - print_rank_0(f"Swapping in {param.ds_id} done") - else: - dest.data.copy_(partitioned_param.data) - offset += partitioned_param.ds_numel - - def _create_next_swappable_fp32_groups(self): - reverse_order_indices = [ - i for i in range(len(self.fp32_partitioned_groups_flat)) - ] - reverse_order_indices.reverse() - - next_group = None - for i in reverse_order_indices: - self.next_swappable_fp32_partitioned_groups.append(next_group) - if self._swappable_optimizer_subgroup(i): - next_group = self.fp32_partitioned_groups_flat[i] - - self.next_swappable_fp32_partitioned_groups.reverse() - - def _get_sub_group_partitions(self, sub_group_id): - sub_group_partitions = [] - for param, partitioned_param in zip(self.fp16_groups[sub_group_id], self.fp16_partitioned_groups[sub_group_id]): - if partitioned_param.status == PartitionedParamStatus.NOT_AVAILABLE: - swap_path = param.nvme_swapper.get_path(param, True) - sub_group_partitions.append((partitioned_param, - param.ds_tensor.ds_numel, - swap_path)) - else: - sub_group_partitions.append((partitioned_param, - partitioned_param.ds_numel, - None)) - - return sub_group_partitions - - def _create_fp32_partitions(self): - cpu_memory_usage = 0 - cpu_memory_sub_groups = 0 - nvme_memory_usage = 0 - num_swappable_partitions = 0 - num_swap_from_nvme_partitions = 0 - 
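The sizing rule in `_create_param_groups_fp16_flat_cpu_memory` above distills to: each param group gets a pinned CPU flat buffer, truncated once the running element count crosses `max_params_in_cpu`, with the overflow left on NVMe. A minimal, self-contained sketch of just that rule (the helper name `build_capped_pinned_buffers` is hypothetical, not part of this file):

```python
import torch

def build_capped_pinned_buffers(group_sizes, max_params_in_cpu, dtype=torch.float16):
    """Sketch: one pinned CPU buffer per group, truncated once the running
    element count exceeds max_params_in_cpu (the overflow stays on NVMe)."""
    buffers, aggregate = [], 0
    for numel in group_sizes:
        # remaining CPU budget available to this group (may be zero)
        capped = min(numel, max(0, max_params_in_cpu - aggregate))
        aggregate += numel
        if capped > 0:
            # pin_memory=True requires a CUDA-enabled build, as in the code above
            buffers.append(torch.empty(capped, dtype=dtype, pin_memory=True))
        else:
            # placeholder, mirroring the 1-element tensor allocated above
            buffers.append(torch.empty(1, dtype=dtype))
    return buffers

# two groups of 10M and 5M elements against a 12M-element CPU budget
bufs = build_capped_pinned_buffers([10_000_000, 5_000_000], 12_000_000)
print([b.numel() for b in bufs])  # [10000000, 2000000]
```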
num_swap_from_cpu_partitions = 0 - swap_from_nvme_memory_usage = 0 - swap_from_cpu_memory_usage = 0 - GIGA_BYTES = (1024 ** 3) - - swappable_fp32_tensors = [] - swappable_fp16_src_tensors = [] - nvme_fp16_partitions_info = [] - nvme_fp16_num_elems = [] - nvme_fp32_dest_tensors = [] - fp32_element_size = torch.tensor( - [], dtype=torch.float32).element_size() - - for i, tensor in enumerate(self.fp16_partitioned_groups_flat): - num_elements = self.fp16_partitioned_groups_flat_numel[i] - - # a partition of the fp32 master weights that will be updated by this process - if self._swappable_optimizer_subgroup(i): - self.fp32_partitioned_groups_flat.append(torch.Tensor()) - nvme_memory_usage += (fp32_element_size * num_elements) - num_swappable_partitions += 1 - - if self.params_in_nvme_and_cpu and tensor is None: - num_swap_from_nvme_partitions += 1 - swap_from_nvme_memory_usage += ( - fp32_element_size * num_elements) - if self.offload_optimizer_fast_init: - sub_group_partitions = self._get_sub_group_partitions( - i) - nvme_fp16_partitions_info.append(sub_group_partitions) - nvme_fp16_num_elems.append(num_elements) - nvme_fp32_dest_tensors.append( - self.fp32_partitioned_groups_flat[i]) - else: - unpinned_fp32_buffer = torch.empty(num_elements, - device=self.device, - dtype=torch.float) - self._swap_in_sub_group_to_flat_buffer( - unpinned_fp32_buffer, i) - self.optimizer_swapper.initialize_parameters( - parameters=[self.fp32_partitioned_groups_flat[i]], - src_tensors=[unpinned_fp32_buffer]) - else: - num_swap_from_cpu_partitions += 1 - swap_from_cpu_memory_usage += ( - fp32_element_size * num_elements) - swappable_fp32_tensors.append( - self.fp32_partitioned_groups_flat[i]) - swappable_fp16_src_tensors.append( - self.fp16_partitioned_groups_flat[i]) - else: - cpu_memory_usage += (fp32_element_size * num_elements) - cpu_memory_sub_groups += 1 - - if self.params_in_nvme_and_cpu and tensor is None: - unpinned_fp32_buffer = torch.empty(num_elements, - device=self.device, - dtype=torch.float) - self._swap_in_sub_group_to_flat_buffer( - unpinned_fp32_buffer, i) - self.fp32_partitioned_groups_flat.append( - unpinned_fp32_buffer) - else: - self.fp32_partitioned_groups_flat.append( - self.fp16_partitioned_groups_flat[i].to( - self.device).clone().float().detach()) - - self.fp32_partitioned_groups_flat[ - i].requires_grad = True # keep this in case internal optimizer uses it - - if len(swappable_fp32_tensors) > 0: - self.optimizer_swapper.initialize_parameters( - parameters=swappable_fp32_tensors, - src_tensors=swappable_fp16_src_tensors) - - if len(nvme_fp32_dest_tensors) > 0: - fp16_pinned_buffers = self.fp16_groups[0][ - 0].nvme_swapper.reserve_available_buffers() - assert len(fp16_pinned_buffers) > 0 - self.optimizer_swapper.initialize_from_swapped_fp16_params( - fp16_partitions_info=nvme_fp16_partitions_info, - fp16_num_elems=nvme_fp16_num_elems, - fp16_pinned_buffers=fp16_pinned_buffers, - fp32_parameters=nvme_fp32_dest_tensors) - self.fp16_groups[0][0].nvme_swapper.release_reserved_buffers() - - nvme_gigabytes = nvme_memory_usage / GIGA_BYTES - if self.verbose: - print_rank_0( - f'Swappable FP32 Partitions: count={num_swappable_partitions} size={nvme_gigabytes:5.2f} GB', - force=False) - if self.params_in_nvme_and_cpu: - if self.verbose: - print_rank_0( - f'Swap from NVMe Partitions: count = {num_swap_from_nvme_partitions}, size = {swap_from_nvme_memory_usage / GIGA_BYTES:5.2f}GB', - force=False) - print_rank_0( - f'Swap from CPU Partitions: count = {num_swap_from_cpu_partitions}, size = 
{swap_from_cpu_memory_usage / GIGA_BYTES:5.2f}GB', - force=False) - - cpu_memory_gigabytes = cpu_memory_usage / GIGA_BYTES - if self.verbose: - print_rank_0( - f'In-Memory FP32 Partitions: count={cpu_memory_sub_groups} size={cpu_memory_gigabytes:5.2f} GB', - force=False) - - # Clear for on-the-fly population before the optimizer step - for param_group in self.optimizer.param_groups: - param_group['params'] = [] - - def _create_fp16_sub_groups(self, params_group): - - params_group_numel = sum([param.partitioned_size() - for param in params_group]) - sub_group_size = self.sub_group_size - - if sub_group_size is None or sub_group_size >= params_group_numel: - return [params_group] - - sub_groups = [] - sub_group = [] - local_sub_group_size = 0 - for param in params_group: - - sub_group.append(param) - local_sub_group_size += param.partitioned_size() - - if local_sub_group_size >= sub_group_size or id(param) == id( - params_group[-1]): - sub_groups.append(sub_group) - - sub_group = [] - local_sub_group_size = 0 - - return sub_groups - - # def reset_ds_tensor(self): - # for name, param in self.module.named_parameters(recurse=True): - # assert hasattr(param,'ds_id'), "Parameters have not been converted to be Zero 3 compatible" - # assert (param.ds_status == ZeroParamStatus.NOT_AVAILABLE), "All the parameters must have been partitioned by now" - # param.ds_tensor.data = param.data - - def setup_zero_stage3_hooks(self): - self.hierarchy = 0 - self._register_hooks_recursively(self.module) - - # reset step at the beginning of forward - def _pre_forward_hook(module, *args): - self.param_coordinator.reset_step() - - # reset step if in inference mode - def _end_of_forward_hook(module, *args): - if not torch._C.is_grad_enabled(): - self.param_coordinator.reset_step() - - # likely one of them should be enough but just to be safe - self.module.register_forward_hook(_end_of_forward_hook) - self.module.register_forward_pre_hook(_pre_forward_hook) - - # Add top todule to stack trace - global FWD_MODULE_STACK - FWD_MODULE_STACK.append(self.module) - - def persistent_parameters(self): - persistent_params = [] - total_persistent_parameters = 0 - params_count = 0 - for _, param in self.module.named_parameters(recurse=True): - if param.ds_numel < self.persistence_threshold: - params_count += 1 - param.ds_persist = True - persistent_params.append(param) - total_persistent_parameters += param.ds_numel - - if self.verbose: - print_rank_0( - f"ZeRO 3: Total persistent parameters: {total_persistent_parameters} in {params_count} params", - force=False) - return persistent_params - - def _register_hooks_recursively(self, module, count=[0]): - my_count = count[0] - module.id = my_count - - # print(f"{module.__class__} : {module.id}") - - for child in module.children(): - count[0] = count[0] + 1 - self._register_hooks_recursively(child, count=count) - - def _pre_forward_module_hook(module, *args): - self.pre_sub_module_forward_function(module) - - def _post_forward_module_hook(module, input, output): - global FWD_MODULE_STACK - FWD_MODULE_STACK.pop() - if output is None: - output = [] - elif not isinstance(output, (list, tuple)): - if torch.is_tensor(output): - output = [output] - else: - # print(f'got UNKNOWN type {type(output)}') - outputs = [] - output = output if isinstance( - output, dict) else vars(output) - for name, val in output.items(): - if not name.startswith('__') and torch.is_tensor(val): - outputs.append(val) - output = outputs - # print(f'convert output to {output}') - - for item in filter(lambda item: 
is_zero_param(item), output): - if not any(id(item) in m._external_params for m in FWD_MODULE_STACK): - item.ds_active_sub_modules += 1 - module_to_register = FWD_MODULE_STACK[-1] - - if self.verbose: - print_rank_0( - f'Registering dangling parameter for module {module_to_register.__class__.__name__}.', - force=False) - register_external_parameter(module_to_register, item) - - # It's possible that the parameter was already external to the completed module. If so, remove it the - # registration as it will be covered by the outer module instead. - if id(item) in module._external_params: - if self.verbose: - print_rank_0( - f' Unregistering nested dangling parameter from module {module.__class__.__name__}', - force=False) - unregister_external_parameter(module, item) - - item.all_gather() - - self.post_sub_module_forward_function(module) - - def _pre_backward_module_hook(module, inputs, output): - def _run_before_backward_function(sub_module): - # some models (e.g. Albert) may run multiple forwards on the same layer in a loop - # before doing backwards, so each backward will need a pre-fetch - using reference - # counting to support this scenario - # print(f"COUNTER before: {sub_module.applied_pre_backward_ref_cnt}") - if sub_module.applied_pre_backward_ref_cnt > 0: - self.pre_sub_module_backward_function(sub_module) - sub_module.applied_pre_backward_ref_cnt -= 1 - # print(f"COUNTER after: {sub_module.applied_pre_backward_ref_cnt}") - - return _apply_to_tensors_only(module, - PreBackwardFunction, - _run_before_backward_function, - output) - - # This is an alternate to doing _post_backward_module_hook - # it uses tensor.register_hook instead of using torch.autograd.Function - def _alternate_post_backward_module_hook(module, inputs): - module.ds_grads_remaining = 0 - - # print(f"Before Forward {module.__class__.__name__}") - - def _run_after_backward_hook(*unused): - module.ds_grads_remaining = module.ds_grads_remaining - 1 - if module.ds_grads_remaining == 0: - # print(f"After backward {module.__class__.__name__}") - self.post_sub_module_backward_function(module) - - def _run_before_forward_function(input): - if input.requires_grad: - module.ds_grads_remaining += 1 - - return _apply_forward_and_backward_to_tensors_only( - module, - _run_before_forward_function, - _run_after_backward_hook, - inputs) - - def _post_backward_module_hook(module, inputs): - module.ds_grads_remaining = 0 - - def _run_after_backward_function(sub_module): - if sub_module.ds_grads_remaining == 0: - self.post_sub_module_backward_function(sub_module) - - return _apply_to_tensors_only(module, - PostBackwardFunction, - _run_after_backward_function, - inputs) - - # Pre forward hook - module.register_forward_pre_hook(_pre_forward_module_hook) - # Post forward hook - module.register_forward_hook(_post_forward_module_hook) - - # Pre backward hook - module.register_forward_hook(_pre_backward_module_hook) - - # post backward hook - module.register_forward_pre_hook(_post_backward_module_hook) - - def pre_sub_module_forward_function(self, sub_module): - if self.verbose: - report_memory_usage( - f"Before sub module function {sub_module.__class__.__name__}") - - global FWD_MODULE_STACK - FWD_MODULE_STACK.append(sub_module) - - self.param_coordinator.record_trace(sub_module) - - self.param_coordinator.fetch_sub_module(sub_module) - if self.verbose: - report_memory_usage( - f"Before sub module function {sub_module.__class__.__name__} after fetch") - - self.param_coordinator.prefetch_next_sub_modules( - sub_module, - 
numel=self.prefetch_elements, - nvme=self.params_in_nvme_and_cpu) - if self.verbose: - report_memory_usage( - f"Before sub module function {sub_module.__class__.__name__} after prefetch") - - self.param_coordinator.increment_step(sub_module) - - def post_sub_module_forward_function(self, sub_module): - if self.verbose: - report_memory_usage( - f"After sub module function {sub_module.__class__.__name__} {sub_module.id} before release") - - self.param_coordinator.release_sub_module(sub_module) - if self.verbose: - report_memory_usage( - f"After sub module function {sub_module.__class__.__name__} {sub_module.id} after release") - - def pre_sub_module_backward_function(self, sub_module): - self.param_coordinator.record_trace(sub_module) - - self.param_coordinator.fetch_sub_module(sub_module) - - self.param_coordinator.prefetch_next_sub_modules(sub_module, - numel=self.prefetch_elements) - - self.param_coordinator.increment_step(sub_module) - - def post_sub_module_backward_function(self, sub_module): - if self.verbose: - report_memory_usage( - f"After sub module backward function {sub_module.__class__.__name__} {sub_module.id} before release") - self.param_coordinator.release_sub_module(sub_module) - - if self.verbose: - report_memory_usage( - f"After sub module backward function {sub_module.__class__.__name__} {sub_module.id} after release") - - def _release_ipg_buffers(self): - if self.contiguous_gradients: - self.ipg_buffer = None - if not self.offload_optimizer and self.is_gradient_accumulation_boundary: - self.grads_in_partition = None - - self.grads_in_partition_offset = 0 - - def _optimizer_step(self, sub_group_id): - param_group_id = self.sub_group_to_group_id[sub_group_id] - fp32_param = self.fp32_partitioned_groups_flat[sub_group_id] - fp16_param = self.fp16_partitioned_groups_flat[sub_group_id] - self.optimizer.param_groups[param_group_id]['params'] = [fp32_param] - - self.optimizer.step() - self.optimizer.param_groups[param_group_id]['params'] = [] - - def _swappable_optimizer_subgroup(self, sub_group_id): - if not self.swap_optimizer: - return False - - return self.optimizer_swapper.swappable_tensor( - None, - numel=self.fp16_partitioned_groups_flat_numel[sub_group_id]) - - def _partitioned_params_swap_out(self, i): - offset = 0 - fp32_param = self.fp32_partitioned_groups_flat[i] - assert fp32_param is not None, \ - f'fp32 parameters of sub_group {i} is None' - - swap_fp16_params = [] - swap_fp32_params = [] - for param, partitioned_param in zip(self.fp16_groups[i], self.fp16_partitioned_groups[i]): - src = fp32_param.narrow(0, offset, partitioned_param.ds_numel) - if partitioned_param.status == PartitionedParamStatus.AVAILABLE: - partitioned_param.data.copy_(src.data) - else: - swap_fp32_params.append(src) - swap_fp16_params.append(param) - offset += partitioned_param.ds_numel - - if len(swap_fp16_params): - swap_fp16_params[0].nvme_swapper.swap_out_partitioned_params( - dst_fp16_params=swap_fp16_params, - src_fp32_params=swap_fp32_params) - - def initialize_optimizer_states(self): - num_subgroups = len(self.fp16_groups) - - largest_numel = max( - [sum([p.ds_numel for p in psg]) for psg in self.fp16_partitioned_groups]) - gradient_dtype = self.fp32_partitioned_groups_flat[0].dtype - gradient_buffer = torch.zeros(int(largest_numel), - dtype=gradient_dtype, - device=self.device) - - timers = self.timers - timer_names = set() - - if self.swap_optimizer: - self.optimizer_swapper.init_timers() - - INIT_OPTIMIZER_TIMER = 'init_optimizer_state' - timer_names.add(INIT_OPTIMIZER_TIMER) - 
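The `*_sub_module_*` methods above drive a fetch, prefetch, release lifecycle around each sub-module's forward and backward pass. A minimal sketch of that hook pattern, with `ToyCoordinator` as an illustrative stand-in for the real `param_coordinator` API:

```python
import torch
import torch.nn as nn

class ToyCoordinator:
    """Illustrative stand-in for the parameter coordinator used above."""

    def fetch(self, module):
        # in ZeRO-3 this would all-gather the module's partitioned params
        print(f"fetch    {module.__class__.__name__}")

    def prefetch_next(self, module):
        # kick off an asynchronous gather for the modules expected next
        print(f"prefetch {module.__class__.__name__}")

    def release(self, module):
        # re-partition the params once this module's forward is done
        print(f"release  {module.__class__.__name__}")

def attach_hooks(model, coord):
    def pre_hook(module, inputs):
        coord.fetch(module)
        coord.prefetch_next(module)

    def post_hook(module, inputs, output):
        coord.release(module)

    for m in model.modules():
        m.register_forward_pre_hook(pre_hook)
        m.register_forward_hook(post_hook)

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
attach_hooks(model, ToyCoordinator())
model(torch.randn(1, 4))  # prints the fetch/prefetch/release sequence per module
```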
self.start_timers([INIT_OPTIMIZER_TIMER]) - - for i, group in enumerate(self.fp16_groups): - swappable_optimizer_subgroup = self._swappable_optimizer_subgroup( - i) - swappable_param_subgroup = self.fp16_partitioned_groups_flat[i] is None - - num_elements = int(self.fp16_partitioned_groups_flat_numel[i]) - - if self.verbose: - report_memory_usage( - f'[Begin] Initialize optimizer states {i} / {num_subgroups} subgroups, num_elems: {num_elements}, swappable opt/param:{swappable_optimizer_subgroup}/{swappable_param_subgroup}') - - if swappable_optimizer_subgroup: - self._optimizer_states_and_gradient_swap_in(i, timer_names) - - if self.offload_optimizer and not swappable_optimizer_subgroup: - subgroup_gradient_buffer = torch.zeros(num_elements, - dtype=gradient_dtype, - device=self.device) - if self.offload_optimizer_pin_memory: - subgroup_gradient_buffer = subgroup_gradient_buffer.pin_memory() - - self.fp32_partitioned_groups_flat[i].grad = subgroup_gradient_buffer - else: - self.fp32_partitioned_groups_flat[i].grad = gradient_buffer.narrow( - 0, - 0, - num_elements) - - self._optimizer_step(i) - - if swappable_param_subgroup: - self._partitioned_params_swap_out(i) - - if swappable_optimizer_subgroup: - self._optimizer_states_and_gradient_swap_out(i, timer_names) - - if self.verbose: - report_memory_usage( - f'[End] Initialize optimizer states {i} / {num_subgroups} subgroups, num_elems: {num_elements}, swappable opt/param:{swappable_optimizer_subgroup}/{swappable_param_subgroup}') - - self.stop_timers([INIT_OPTIMIZER_TIMER]) - self.log_timers(timer_names) - - if self.swap_optimizer: - self.optimizer_swapper.log_timers() - - if not self.offload_optimizer: - for group in self.fp32_partitioned_groups_flat: - group.grad = None - - # Reset steps - return - - ######################################################################### - #########################ZeRO Partition Gradients######################## - ######################################################################### - - def get_first_param_index(self, group_id, param_group, partition_id): - for index, param in enumerate(param_group): - param_id = self.get_param_id(param) - if partition_id in self.param_to_partition_ids[group_id][param_id]: - return index - return None - - def initialize_gradient_partitioning_data_structures(self): - - total_partitions = dist.get_world_size(group=self.dp_process_group) - - for i, param_group in enumerate(self.fp16_groups): - - self.param_to_partition_ids[i] = {} - self.is_partition_reduced[i] = {} - self.total_grads_in_partition[i] = {} - self.remaining_grads_in_partition[i] = {} - self.is_grad_computed[i] = {} - self.grad_partition_insertion_offset[i] = {} - self.grad_start_offset[i] = {} - self.first_param_index_in_partition[i] = {} - - for partition_id in range(total_partitions): - self.is_grad_computed[i][partition_id] = {} - self.grad_partition_insertion_offset[i][partition_id] = {} - self.grad_start_offset[i][partition_id] = {} - self.initialize_gradient_partition( - i, param_group, partition_id) - self.is_partition_reduced[i][partition_id] = False - self.first_param_index_in_partition[i][ - partition_id] = self.get_first_param_index( - i, - param_group, - partition_id) - - def independent_gradient_partition_epilogue(self): - if self.verbose: - self.report_ipg_memory_usage( - f"In ipg_epilogue before reduce_ipg_grads", 0) - self.reduce_ipg_grads() - if self.verbose: - self.report_ipg_memory_usage( - f"In ipg_epilogue after reduce_ipg_grads", 0) - - if self.overlap_comm: - 
self.reduction_stream.synchronize() - - with torch.cuda.stream(self.reduction_stream): - self.partition_previous_reduced_grads() - - # if dist.get_rank() == 0: - # print()("Params already reduced %s", self.params_already_reduced) - for i in range(len(self.params_already_reduced)): - self.params_already_reduced[i] = False - - # in case of cpu offload, averaged gradients are already in fp32_partitioned_groups_flat.grad - # TODO: use a similar code path for both cpu_offload and non-cpu offload - if not self.offload_optimizer: - for i, sub_group in enumerate(self.fp16_groups): - self.averaged_gradients[i] = [ - torch.zeros_like(param.ds_tensor) if param.grad is None else - param.grad.data.narrow(0, - 0, - param.ds_tensor.numel()) - for param in sub_group - ] - # self.averaged_gradients[i] = self.get_flat_partition( - # self.fp16_groups[i], - # 0, - # self.fp32_partitioned_groups_flat[i].numel(), - # return_tensor_list=True) - - self._release_ipg_buffers() - - if self.verbose: - report_memory_usage(f"End ipg_epilogue") - - # resets all partitions to not reduced - # sets remaining grads to the total number of grads in each partition - # sets is_grad_computed to False for all grads in each partition - def reset_partition_gradient_structures(self): - total_partitions = dist.get_world_size(group=self.dp_process_group) - for i, _ in enumerate(self.fp16_groups): - for partition_id in range(total_partitions): - self.is_partition_reduced[i][partition_id] = False - self.remaining_grads_in_partition[i][ - partition_id] = self.total_grads_in_partition[i][partition_id] - - for param_id in self.is_grad_computed[i][partition_id]: - self.is_grad_computed[i][partition_id][param_id] = False - - def initialize_gradient_partition(self, i, param_group, partition_id): - def set_key_value_list(dictionary, key, value): - if key in dictionary: - dictionary[key].append(value) - else: - dictionary[key] = [value] - - def increment_value(dictionary, key): - if key in dictionary: - dictionary[key] += 1 - else: - dictionary[key] = 1 - - partition_size = self.partition_size[i] - - start_index = partition_size * partition_id - end_index = partition_size * (partition_id + 1) - - current_index = 0 - first_offset = 0 - - for param in param_group: - - param_size = param.numel() - param_id = self.get_param_id(param) - - if (current_index >= start_index and current_index < end_index): - set_key_value_list(self.param_to_partition_ids[i], - param_id, - partition_id) - increment_value(self.total_grads_in_partition[i], partition_id) - - self.is_grad_computed[i][partition_id][param_id] = False - - self.grad_partition_insertion_offset[i][partition_id][ - param_id] = current_index - start_index - self.grad_start_offset[i][partition_id][param_id] = 0 - - elif start_index > current_index and start_index < (current_index + - param_size): - assert ( - first_offset == 0), "This can happen either zero or only once as this must be the first tensor in the partition" - first_offset = start_index - current_index - - set_key_value_list(self.param_to_partition_ids[i], - param_id, - partition_id) - increment_value(self.total_grads_in_partition[i], partition_id) - - self.is_grad_computed[i][partition_id][param_id] = False - - self.grad_partition_insertion_offset[i][partition_id][param_id] = 0 - self.grad_start_offset[i][partition_id][param_id] = first_offset - - current_index = current_index + param_size - - def overlapping_partition_gradients_reduce_epilogue(self): - self.independent_gradient_partition_epilogue() - self.zero_grad() - - def 
create_reduce_and_remove_grad_hooks(self): - if self.verbose: - print_rank_0(f'[Begin] Create gradient reduction hooks') - self.grad_accs = [] - for i, param_group in enumerate(self.fp16_groups): - for param in param_group: - if param.requires_grad: - # print_rank_0(f" Before all gather {param.device}, {param.shape}") - - # The hook must be created in un-partitioned parameter - param.all_gather() - - # print(f"After all gather {param.device}, {param.shape}") - def wrapper(param, i): - param_tmp = param.expand_as(param) - grad_acc = param_tmp.grad_fn.next_functions[0][0] - - def reduce_partition_and_remove_grads(*notneeded): - self.reduce_ready_partitions_and_remove_grads( - param, i) - - grad_acc.register_hook( - reduce_partition_and_remove_grads) - self.grad_accs.append(grad_acc) - - # print(f"param grad fn {param.expand_as(param).grad_fn}") - wrapper(param, i) - - # Partition the parameter after creating the hook - param.partition() - if self.verbose: - print_rank_0(f'[End] Create gradient reduction hooks') - - def get_param_id(self, param): - unique_id = id(param) - return self.param_id[unique_id] - - def report_ipg_memory_usage(self, tag, param_elems): - elem_count = self.elements_in_ipg_bucket + param_elems - percent_of_bucket_size = ( - 100.0 * elem_count) // self.reduce_bucket_size - report_memory_usage( - f"{tag}: elems in_bucket {self.elements_in_ipg_bucket} param {param_elems} max_percent {percent_of_bucket_size}") - - ###############Idependent Partition Gradient ######################## - def reduce_independent_p_g_buckets_and_remove_grads(self, param, i): - # print_rank_0(f"Inside reduce ipg buckets. {debug_param2name_id_shape(param)}, ipg elements {self.elements_in_ipg_bucket}, reduce bucket size {self.reduce_bucket_size}", force=True) - - # Because the ipg bucket is initialized with a random place holder tensor, we must - # explicitly check that the bucket has any real data in it (self.elements_in_ipg_bucket > - # 0). Otherwise if the incoming param.ds_numel is large, this branch may get triggered on a - # garbage data and `self.average_tensor()` will crash because its params_to_reduce will be - # empty, while reduction_list will have that garbage data. - if self.elements_in_ipg_bucket > 0 and self.elements_in_ipg_bucket + param.ds_numel > self.reduce_bucket_size: - if self.verbose: - self.report_ipg_memory_usage("In ipg_remove_grads before reduce_ipg_grads", - param.ds_numel) - - self.reduce_ipg_grads() - - if self.contiguous_gradients and self.overlap_comm: - # Swap ipg_index between 0 and 1 - self.ipg_index = 1 - self.ipg_index - if self.verbose: - self.report_ipg_memory_usage("In ipg_remove_grads after reduce_ipg_grads", - param.ds_numel) - - param_id = self.get_param_id(param) - assert self.params_already_reduced[param_id] == False, \ - f"The parameter {param_id} has already been reduced. \ - Gradient computed twice for this partition. 
\ - Multiple gradient reduction is currently not supported" - - # keeping the gradients contiguous to prevent memory fragmentation, and avoid flattening - if param.ds_numel > self.reduce_bucket_size: - self.extra_large_param_to_reduce = param - - elif self.contiguous_gradients: - # print_rank_0("before new grad tensor move") - new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow( - 0, - self.elements_in_ipg_bucket, - param.ds_numel) - # print_rank_0("after new grad tensor move") - new_grad_tensor.copy_(param.grad.view(-1)) - param.grad.data = new_grad_tensor.data.view_as(param.grad) - - self.elements_in_ipg_bucket += param.ds_numel - self.grads_in_ipg_bucket.append(param.grad) - self.params_in_ipg_bucket.append((i, param, param_id)) - if self.verbose: - self.report_ipg_memory_usage("End ipg_remove_grads", 0) - - def gradient_reduction_w_predivide(self, tensor): - dp_world_size = dist.get_world_size(group=self.dp_process_group) - - tensor_to_allreduce = tensor - - if self.allreduce_always_fp32: - tensor_to_allreduce = tensor.float() - - if self.postscale_gradients: - if self.gradient_predivide_factor != 1.0: - tensor_to_allreduce.mul_(1. / self.gradient_predivide_factor) - - dist.all_reduce(tensor_to_allreduce, group=self.dp_process_group) - - if self.gradient_predivide_factor != dp_world_size: - tensor_to_allreduce.mul_( - self.gradient_predivide_factor / dp_world_size) - else: - tensor_to_allreduce.div_(dp_world_size) - dist.all_reduce(tensor_to_allreduce, group=self.dp_process_group) - - if self.allreduce_always_fp32 and tensor is not tensor_to_allreduce: - tensor.copy_(tensor_to_allreduce) - - return tensor - - def average_tensor(self, tensors, params_to_reduce): - with torch.cuda.stream(self.reduction_stream): - if not self.reduce_scatter: - for tensor in tensors: - self.gradient_reduction_w_predivide(tensor) - return - - for tensor in tensors: - tensor.div_(dist.get_world_size(group=self.dp_process_group)) - - # reduction resulting with each rank only holding the gradient partition it owns - # This could either be a reduce scatter or a reduce op depending on how - # parameters are partitionied. 
The method is implemented by the - # DeepSpeed param extensions to the pytorch parameter, so its up to - # the extension to define what happens here - params_to_reduce[0].reduce_gradients_at_owner( - param_list=params_to_reduce, - hierarchy=self.param_coordinator.hierarchy) - - def set_grad_positions(self): - for i, group in enumerate(self.fp16_groups): - current_offset = 0 - for param in group: - param_id = self.get_param_id(param) - num_elements = param.ds_tensor.ds_numel - - self.grad_position[param_id] = [ - int(i), - int(current_offset), - int(num_elements) - ] - # print(f"param id {param_id} i:{i}, ds_tensor {num_elements} numel {param.numel()}") - current_offset += num_elements - - def async_accumulate_grad_in_cpu_via_gpu(self, param, acc_grad_cpu_partition): - - # copy to a preexisiting buffer to avoid memory allocation penalty - dest_buffer = self.temp_grad_buffer_for_gpu_offload.view(-1).narrow( - 0, - 0, - param.ds_tensor.ds_numel) - - if self.micro_step_id > 0: - dest_buffer.copy_( - acc_grad_cpu_partition.view(-1), non_blocking=True) - param.grad.data.view(-1).add_(dest_buffer) - - # at the boundary we will send 32bit directly - if not self.is_gradient_accumulation_boundary: - acc_grad_cpu_partition.data.copy_(param.grad.data.view(-1), - non_blocking=True) - - def _constant_buffered_norm2(self, input, buffer_size=250000000): - norm = None - for part in input.view(-1).split(buffer_size): - if norm is None: - norm = part.data.double().norm(2) ** 2.0 - else: - norm += part.data.double().norm(2) ** 2.0 - return norm ** 0.5 - - def set_norm_for_param_grad_in_gpu(self, param): - param_id = self.get_param_id(param) - # self.norm_for_param_grads[param_id] = param.grad.data.double().norm(2) - # Using a more memory efficient version - self.norm_for_param_grads[param_id] = self._constant_buffered_norm2( - param.grad) - - def update_overflow_tracker_for_param_grad(self, param): - # Credit to our user David Minn - if param.grad is not None: - if self.overlap_comm: - self.gpu_sum = self.gpu_sum + param.grad.data.float().sum() - elif self._has_inf_or_nan(param.grad.data): - self.local_overflow = True - - def async_inplace_copy_grad_to_fp32_buffer_from_gpu(self, param, fp32_grad_tensor): - with torch.cuda.stream(self.copy_grad_stream): - param_id = self.get_param_id(param) - src_tensor = param.grad.view(-1).float() - # print(f"src_tensor {src_tensor.size()} and fp32 grad {fp32_grad_tensor.size()}") - fp32_grad_tensor.copy_(src_tensor, non_blocking=True) - param.grad = None - - def complete_grad_norm_calculation_for_cpu_offload(self, params): - total_norm = 0.0 - norm_type = 2.0 - for p in params: - if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0): - param_id = self.get_param_id(p) - if param_id in self.norm_for_param_grads.keys(): - param_norm = self.norm_for_param_grads[param_id] - total_norm += param_norm.item() ** 2 - - # Sum across all model parallel GPUs. - total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)]) - - torch.distributed.all_reduce(total_norm_cuda, - op=torch.distributed.ReduceOp.SUM, - group=self.dp_process_group) - - self._model_parallel_all_reduce(tensor=total_norm_cuda, - op=torch.distributed.ReduceOp.SUM) - - total_norm = total_norm_cuda[0].item() ** (1. 
/ norm_type) - - if total_norm == float( - 'inf') or total_norm == -float('inf') or total_norm != total_norm: - total_norm = -1 - - return total_norm - - def partition_previous_reduced_grads(self): - if not self.previous_reduced_grads: - return - - if self.offload_optimizer: - allocate_grads_in_partition = self.grads_in_partition is None \ - and self.gradient_accumulation_steps > 1 - else: - allocate_grads_in_partition = self.grads_in_partition is None - - if allocate_grads_in_partition: - self.grads_in_partition = [] - - for i, group in enumerate(self.fp16_groups): - total_size = 0 - for param_in_partition in group: - total_size += param_in_partition.ds_tensor.ds_numel - - if self.verbose: - report_memory_usage( - f"group {i} before creating {total_size} reduced gradients into partition") - if self.offload_param_pin_memory: - self.grads_in_partition.append( - torch.zeros(int(total_size), - dtype=self.dtype, - device=self.device).pin_memory()) - else: - self.grads_in_partition.append( - torch.zeros(int(total_size), - dtype=self.dtype, - device=self.device)) - if self.verbose: - report_memory_usage( - f"group {i} after creating {total_size} reduced gradients into partition") - - if self.offload_optimizer: - offload_fp32_gradients = {} - offload_fp32_offsets = {} - - with torch.cuda.stream(self.copy_grad_stream): - self.reduction_stream.synchronize() - for param in self.previous_reduced_grads: - - [i, - dest_offset, - num_elements] = self.grad_position[self.get_param_id(param)] - - if self.offload_optimizer: - param.partition_gradients( - partition_buffers=self.temp_grad_gpu_buffer) - # with torch.cuda.stream(self.copy_grad_stream): - # self.reduction_stream.synchronize() - - if self.gradient_accumulation_steps > 1: - # The allreduce buffer will be rewritted. Copy the gradients in partition to a new buffer - fp16_grad_tensor = self.grads_in_partition[i].narrow( - 0, - dest_offset, - num_elements) - self.async_accumulate_grad_in_cpu_via_gpu( - param, - fp16_grad_tensor) - - if self.is_gradient_accumulation_boundary: - - self.set_norm_for_param_grad_in_gpu(param) - - self.update_overflow_tracker_for_param_grad(param) - - if self._swappable_optimizer_subgroup(i): - if not i in offload_fp32_gradients.keys(): - offload_fp32_gradients[i] = [] - offload_fp32_offsets[i] = [] - - offload_fp32_gradients[i].append( - param.grad.view(-1).float()) - param.grad = None - offload_fp32_offsets[i].append(dest_offset) - else: - fp32_grad_tensor = self.fp32_partitioned_groups_flat[ - i].grad.narrow(0, - dest_offset, - num_elements) - - self.async_inplace_copy_grad_to_fp32_buffer_from_gpu( - param, - fp32_grad_tensor) - else: - # The allreduce buffer will be rewritted. 
Copy the gradients in partition to a new buffer - fp16_grad_tensor = self.grads_in_partition[i].narrow( - 0, - dest_offset, - num_elements) - param.partition_gradients( - partition_buffers=fp16_grad_tensor, - accumulate=True if self.micro_step_id > 0 else False) - - if self.offload_optimizer and self.swap_optimizer: - for i in offload_fp32_gradients.keys(): - self.optimizer_swapper.swap_out_gradients( - parameter=self.fp32_partitioned_groups_flat[i], - gradient_offsets=offload_fp32_offsets[i], - gradient_tensors=offload_fp32_gradients[i]) - - self.previous_reduced_grads = [] - - def reduce_ipg_grads(self, extra_param=None): - if self.overlap_comm: - self.reduction_stream.synchronize() - - with torch.cuda.stream(self.reduction_stream): - self.partition_previous_reduced_grads() - - params_to_reduce = [param for i, param, - param_id in self.params_in_ipg_bucket] - # print(f"Params in ipg bucket {self.params_in_ipg_bucket}") - # print(f"Reducing {[(debug_param2name_id_shape(param), param.grad) for param in params_to_reduce]}") - # exit(0) - if self.contiguous_gradients: - reduction_list = [self.ipg_buffer[self.ipg_index]] - if self.extra_large_param_to_reduce is not None: - reduction_list.append(self.extra_large_param_to_reduce.grad) - self.extra_large_param_to_reduce = None - self.average_tensor(reduction_list, params_to_reduce) - else: - self.buffered_reduce_fallback( - None, - self.grads_in_ipg_bucket, - elements_per_buffer=self.elements_in_ipg_bucket) - - for _, param, param_id in self.params_in_ipg_bucket: - self.params_already_reduced[param_id] = True - - self.previous_reduced_grads = params_to_reduce - - self.grads_in_ipg_bucket = [] - self.params_in_ipg_bucket = [] - self.elements_in_ipg_bucket = 0 - ##################################################################### - - def reduce_ready_partitions_and_remove_grads(self, param, i): - # print_rank_0(f"Backward {debug_param2name_id_shape(param)}", force=True) - self.reduce_independent_p_g_buckets_and_remove_grads(param, i) - - def zero_reduced_gradients(self, partition_id, i): - def are_all_related_partitions_reduced(params_id): - for partition_id in self.param_to_partition_ids[i][params_id]: - if not self.is_partition_reduced[i][partition_id]: - return False - return True - - for params_id in self.is_grad_computed[i][partition_id]: - if are_all_related_partitions_reduced(params_id): - self.param_dict[params_id].grad = None - - def flatten_and_print(self, message, tensors, start=0, n=5): - flatten_tensor = self.flatten(tensors) - - def print_func(): - print(flatten_tensor.contiguous().view(-1).narrow(0, start, n)) - - self.sequential_execution(print_func, message) - - def get_grads_to_reduce(self, i, partition_id): - def get_reducable_portion(key): - grad = self.param_dict[key].grad - total_elements = grad.numel() - start = self.grad_start_offset[i][partition_id][key] - num_elements = min( - total_elements - start, - self.partition_size[i] - - self.grad_partition_insertion_offset[i][partition_id][key]) - if not pg_correctness_test: - if num_elements == total_elements: - return grad - else: - return grad.contiguous().view(-1).narrow(0, - int(start), - int(num_elements)) - else: - if num_elements == total_elements: - return grad.clone() - else: - return grad.clone().contiguous().view(-1).narrow( - 0, - int(start), - int(num_elements)) - - grads_to_reduce = [] - for key in self.is_grad_computed[i][partition_id]: - grad = get_reducable_portion(key) - grads_to_reduce.append(grad) - return grads_to_reduce - - def sequential_execution(self, 
function, message, group=None): - if group is None: - group = self.dp_process_group - if dist.get_rank(group=group) == 0: - print(message) - for id in range(dist.get_world_size(group=group)): - if id == dist.get_rank(group=group): - function() - dist.barrier(group=group) - - def set_none_gradients_to_zero(self, i, partition_id): - for param_id in self.is_grad_computed[i][partition_id]: - param = self.param_dict[param_id] - if param.grad is None: - param.grad = torch.zeros_like(param) - - ######################Reduction Related Methods############################## - - def allreduce_bucket(self, bucket, allreduce_always_fp32=False, rank=None, log=None): - rank = None - tensor = self.flatten(bucket) - - tensor_to_allreduce = tensor - - if pg_correctness_test: - allreduce_always_fp32 = True - - if allreduce_always_fp32: - tensor_to_allreduce = tensor.float() - - tensor_to_allreduce.div_( - dist.get_world_size(group=self.dp_process_group)) - - if rank is None: - # "All Reducing" - dist.all_reduce(tensor_to_allreduce, group=self.dp_process_group) - else: - global_rank = _get_global_rank(self.dp_process_group, rank) - dist.reduce(tensor_to_allreduce, global_rank, - group=self.dp_process_group) - - if allreduce_always_fp32 and tensor is not tensor_to_allreduce: - if rank is None or rank == dist.get_rank(group=self.dp_process_group): - tensor.copy_(tensor_to_allreduce) - - return tensor - - # if rank is specified do a reduction instead of an allreduce - def allreduce_and_copy(self, small_bucket, rank=None, log=None): - with torch.cuda.stream(self.reduction_stream): - allreduced = self.allreduce_bucket( - small_bucket, rank=rank, log=log) - if rank is None or rank == dist.get_rank(group=self.dp_process_group): - for buf, synced in zip(small_bucket, self.unflatten(allreduced, small_bucket)): - buf.copy_(synced) - - def allreduce_no_retain(self, - bucket, - numel_per_bucket=500000000, - rank=None, - log=None): - small_bucket = [] - numel = 0 - for tensor in bucket: - small_bucket.append(tensor) - numel = numel + tensor.numel() - if numel > numel_per_bucket: - self.allreduce_and_copy(small_bucket, rank=rank, log=None) - small_bucket = [] - if len(small_bucket) > 0: - self.allreduce_and_copy(small_bucket, rank=rank, log=log) - - # allows using reduction of gradients instead of using all_reduce - def buffered_reduce_fallback(self, - rank, - grads, - elements_per_buffer=500000000, - log=None): - split_buckets = split_half_float_double(grads) - - for i, bucket in enumerate(split_buckets): - self.allreduce_no_retain(bucket, - numel_per_bucket=elements_per_buffer, - rank=rank, - log=log) - - ############################################################################# - ############################################################################# - ############################################################################# - - # views the tensor as multiple partitions and returns - # those partitions - def get_data_parallel_partitions(self, tensor): - partitions = [] - - dp = dist.get_world_size(group=self.dp_process_group) - dp_id = dist.get_rank(group=self.dp_process_group) - - total_num_elements = tensor.numel() - - base_size = total_num_elements // dp - remaining = total_num_elements % dp - - start = 0 - for id in range(dp): - partition_size = base_size - if id < remaining: - partition_size = partition_size + 1 - partitions.append(tensor.narrow(0, start, partition_size)) - start = start + partition_size - return partitions - - def get_partition_info(self, tensor_list, partition_size, partition_id): - 
params_in_partition = [] - params_not_in_partition = [] - - start_index = partition_size * partition_id - end_index = partition_size * (partition_id + 1) - - current_index = 0 - first_offset = 0 - - for tensor in tensor_list: - - tensor_size = tensor.numel() - - if (current_index >= start_index and current_index < end_index): - params_in_partition.append(tensor) - - elif start_index > current_index and start_index < (current_index + - tensor_size): - params_in_partition.append(tensor) - - assert ( - first_offset == 0), "This can happen either zero or only once as this must be the first tensor in the partition" - first_offset = start_index - current_index - - else: - params_not_in_partition.append(tensor) - - current_index = current_index + tensor_size - - return params_in_partition, params_not_in_partition, first_offset - - def zero_grad(self, set_grads_to_None=True): - """ - Zero FP16 parameter grads. - """ - # FP32 grad should never exist. - # For speed, set model fp16 grad to None by default - for group in self.fp16_groups: - for p in group: - if set_grads_to_None: - p.grad = None - else: - if p.grad is not None: - p.grad.detach_() - p.grad.zero_() - - def _model_parallel_all_reduce(self, tensor, op): - """ Perform all reduce within model parallel group, if any. - """ - if self.model_parallel_group is None: - pass - else: - torch.distributed.all_reduce(tensor=tensor, - op=op, - group=self.model_parallel_group) - - def clip_grad_norm(self, *args, **kwargs): - # dummy function to retain the same function interface - # as ColossalaiOptimizer for compatibility - pass - - def get_grad_norm_direct(self, gradients, params, norm_type=2): - """Clips gradient norm of an iterable of parameters. - - This is adapted from ``torch.nn.utils.clip_grad.clip_grad_norm_`` and - added functionality to handle model parallel parameters. Note that - the gradients are modified in place. - - Arguments: - parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a - single Tensor that will have gradients normalized - max_norm (float or int): max norm of the gradients - norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for - infinity norm. - - Returns: - Total norm of the parameters (viewed as a single vector). - """ - norm_type = float(norm_type) - if norm_type == inf: - total_norm = max(g.data.abs().max() for g in gradients) - total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)]) - torch.distributed.all_reduce(total_norm_cuda, - op=torch.distributed.ReduceOp.MAX, - group=self.dp_process_group) - - # Take max across all GPUs. - self._model_parallel_all_reduce(tensor=total_norm_cuda, - op=torch.distributed.ReduceOp.MAX) - total_norm = total_norm_cuda[0].item() - else: - total_norm = 0.0 - # if dist.get_rank() == 0: - # print()(f"Total Norm begining {total_norm}") - for g, p in zip(gradients, params): - if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0): - param_norm = g.data.double().norm(2) - total_norm += param_norm.item() ** 2 - # Sum across all model parallel GPUs. - total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)]) - - torch.distributed.all_reduce(total_norm_cuda, - op=torch.distributed.ReduceOp.SUM, - group=self.dp_process_group) - - self._model_parallel_all_reduce(tensor=total_norm_cuda, - op=torch.distributed.ReduceOp.SUM) - - total_norm = total_norm_cuda[0].item() ** (1. 
/ norm_type) - - if total_norm == float( - 'inf') or total_norm == -float('inf') or total_norm != total_norm: - total_norm = -1 - - return total_norm - - # creates a flat fused tensor from the tensor list starting at the first_offset - # in the first tensor of the list. If there are not enough elements in the tensor - # list then the flat tensor will be padded with zeros - def get_flat_partition(self, - tensor_list, - first_offset, - partition_size, - return_tensor_list=False): - flat_tensor_list = [] - current_size = 0 - for i, tensor in enumerate(tensor_list): - if tensor.grad is None: - tensor.grad = torch.zeros_like(tensor) - - tensor = tensor.grad - num_elements = tensor.numel() - tensor_offset = 0 - - # we need to offset to get to the right element - if i == 0 and first_offset > 0: - tensor_offset = first_offset - num_elements = num_elements - tensor_offset - - # we dont need all elements of the tensor - if num_elements > (partition_size - current_size): - num_elements = partition_size - current_size - - # we need a narrow view of the tensor based on the tensor offset and number of elements that - # we need from this tensor - if tensor_offset > 0 or num_elements < tensor.numel(): - flat_tensor_list.append(tensor.contiguous().view(-1).narrow( - 0, - int(tensor_offset), - int(num_elements))) - else: - flat_tensor_list.append(tensor) - - current_size = current_size + num_elements - - # this means its the last partition and does not align with the dp boundary. We need to pad before flattening - if current_size < partition_size: - flat_tensor_list.append( - torch.zeros(int(partition_size - current_size), - dtype=tensor_list[0].dtype, - device=tensor_list[0].device)) - - if return_tensor_list: - return flat_tensor_list - - return self.flatten(flat_tensor_list) - - def free_grad_in_param_list(self, param_list): - for p in param_list: - p.grad = None - - def reset_cpu_buffers(self): - self.norm_for_param_grads = {} - self.local_overflow = False - - def log_timers(self, timer_names): - if self.timers is None: - return - - self.timers.log(names=list(timer_names)) - - def start_timers(self, timer_names): - if self.timers is None: - return - - for name in timer_names: - self.timers(name).start() - - def stop_timers(self, timer_names): - if self.timers is None: - return - - for name in timer_names: - self.timers(name).stop() - - def _pre_step(self): - self.micro_step_id = INITIAL_MICRO_STEP_ID - - if self.verbose: - print_rank_0(f"Inside Step function") - report_memory_usage(f"In step before checking overflow") - print_rank_0("Finished Tracing at Beginning of Step") - self.param_coordinator.hierarchy = 0 - self.param_coordinator.finish_tracing(print_trace=True) - - self.param_coordinator.reset_step() - - if self.verbose: - print_rank_0("Finished Tracing at Beginning of Step") - - def _get_norm_groups(self): - norm_groups = [] - for i, group in enumerate(self.fp16_groups): - if self.offload_optimizer: - norm_groups.append( - self.complete_grad_norm_calculation_for_cpu_offload( - self.fp16_groups[i])) - else: - norm_groups.append( - self.get_grad_norm_direct(self.averaged_gradients[i], - self.fp16_groups[i])) - return norm_groups - - def _prepare_fp32_grad_for_sub_group(self, sub_group_id): - partition_id = dist.get_rank(group=self.dp_process_group) - - single_grad_partition = self.flatten(self.averaged_gradients[sub_group_id]).to( - self.fp32_partitioned_groups_flat[sub_group_id].dtype) - - assert single_grad_partition.numel() == self.fp32_partitioned_groups_flat[sub_group_id].numel(), \ - 
"averaged gradients have different number of elements that partition size {} {} {} {}".format( - single_grad_partition.numel( - ), self.fp32_partitioned_groups_flat[sub_group_id].numel(), sub_group_id, - partition_id) - - self.fp32_partitioned_groups_flat[sub_group_id].grad = single_grad_partition - - # release all the gradient since we have already created a necessary copy in dp_grad_partition - self.zero_grad() - - self.averaged_gradients[sub_group_id] = None - - def _prepare_sub_group(self, sub_group_id, timer_names=set()): - if self.verbose: - report_memory_usage( - f'Before prepare optimizer sub group {sub_group_id}') - if self._swappable_optimizer_subgroup(sub_group_id): - self._optimizer_states_and_gradient_swap_in( - sub_group_id, timer_names) - elif not self.offload_optimizer: - self._prepare_fp32_grad_for_sub_group(sub_group_id) - if self.verbose: - report_memory_usage( - f'After prepare optimizer sub group {sub_group_id}') - - def _optimizer_states_and_gradient_swap_in(self, sub_group_id, timer_names=set()): - param_length = self.fp16_partitioned_groups_flat_numel[sub_group_id] - fp32_param_id = id(self.fp32_partitioned_groups_flat[sub_group_id]) - assert self._swappable_optimizer_subgroup(sub_group_id), \ - f'Parameter {fp32_param_id} of numel={param_length} is not swappable' - - OPTIMIZER_SWAP_IN_STATE = 'optimizer_swap_in_state' - if self.verbose: - report_memory_usage( - f'pre-step Before swapping in optimizer tensors {sub_group_id}') - self.start_timers([OPTIMIZER_SWAP_IN_STATE]) - - self.optimizer_swapper.swap_in_optimizer_state( - parameter=self.fp32_partitioned_groups_flat[sub_group_id], - async_parameter=self.next_swappable_fp32_partitioned_groups[sub_group_id]) - - self.stop_timers([OPTIMIZER_SWAP_IN_STATE]) - timer_names.add(OPTIMIZER_SWAP_IN_STATE) - if self.verbose: - report_memory_usage( - f'pre-step After swapping in optimizer tensors {sub_group_id}') - - def _release_sub_group(self, sub_group_id, timer_names=set()): - if self.verbose: - report_memory_usage( - f'Before release optimizer sub group {sub_group_id}') - # get rid of the fp32 gradients. 
Not needed anymore - if not self.offload_optimizer: - self.fp32_partitioned_groups_flat[sub_group_id].grad = None - - if self._swappable_optimizer_subgroup(sub_group_id): - self._optimizer_states_and_gradient_swap_out( - sub_group_id, timer_names) - if self.verbose: - report_memory_usage( - f'After release optimizer sub group {sub_group_id}') - - # create a flat tensor aligned at the alignment boundary - def flatten_dense_tensors_aligned(self, tensor_list, alignment): - num_elements = 0 - for tens in tensor_list: - num_elements = num_elements + tens.numel() - - remaining = num_elements % alignment - - if remaining: - elements_to_add = alignment - remaining - pad_tensor = torch.zeros(elements_to_add, - device=tensor_list[0].device, - dtype=tensor_list[0].dtype) - padded_tensor_list = tensor_list + [pad_tensor] - - num_elements = num_elements + elements_to_add - else: - padded_tensor_list = tensor_list - - return self.flatten(padded_tensor_list) - - def _optimizer_states_and_gradient_swap_out(self, sub_group_id, timer_names=set()): - param_length = self.fp16_partitioned_groups_flat_numel[sub_group_id] - fp32_param_id = id(self.fp32_partitioned_groups_flat[sub_group_id]) - assert self._swappable_optimizer_subgroup(sub_group_id), \ - f'Parameter {fp32_param_id} of numel={param_length} is not swappable' - - OPTIMIZER_SWAP_OUT_STATE = 'optimizer_swap_out_state' - if self.verbose: - report_memory_usage( - f'post-step Before swapping out optimizer tensors {sub_group_id}') - self.start_timers([OPTIMIZER_SWAP_OUT_STATE]) - - self.optimizer_swapper.swap_out_optimizer_state( - parameter=self.fp32_partitioned_groups_flat[sub_group_id], - async_swap=self.next_swappable_fp32_partitioned_groups[sub_group_id] is - not None) - - self.stop_timers([OPTIMIZER_SWAP_OUT_STATE]) - if self.verbose: - report_memory_usage( - f'post-step After swapping out optimizer tensors {sub_group_id}') - timer_names.add(OPTIMIZER_SWAP_OUT_STATE) - - # get rid of the fp32 gradients. Not needed anymore - self.fp32_partitioned_groups_flat[sub_group_id].grad = None - - def _unflatten_partitioned_parameters(self, sub_group_id): - updated_params = self.unflatten(self.fp16_partitioned_groups_flat[sub_group_id], - self.fp16_partitioned_groups[sub_group_id]) - - for partitioned_param, q in zip(self.fp16_partitioned_groups[sub_group_id], updated_params): - partitioned_param.data = q.data - - def _overflow_clean_up(self, prev_scale): - if self.verbose: - report_memory_usage('After overflow before clearing gradients') - self.zero_grad() - - if self.offload_optimizer: - self.reset_cpu_buffers() - else: - self.averaged_gradients = {} - - if self.verbose: - report_memory_usage('After overflow after clearing gradients') - - if torch.distributed.get_rank() == 0: - print( - "[deepscale] OVERFLOW! Rank {} Skipping step. 
Attempted loss scale: {}, " - "reducing to {}".format(dist.get_rank(), - prev_scale, - self.loss_scale)) - - def _overflow_check_and_loss_scale_update(self): - - # First compute norm for all group so we know if there is overflow - self.check_overflow() - - # loss scaling related computation - prev_scale = self.loss_scale - self._update_scale(self.overflow) - - if self.overflow: - self._overflow_clean_up(prev_scale) - - return self.overflow - - def _post_step(self, timer_names=set()): - if self.offload_optimizer: - self.reset_cpu_buffers() - - # Gathering persisting parameters - if len(self.persistent_parameters) > 0: - self.persistent_parameters[0].all_gather( - self.persistent_parameters) - - if self.swap_optimizer: - self.optimizer_swapper.log_timers() - - self.log_timers(timer_names) - - if self.verbose: - report_memory_usage('After zero_optimizer step') - print_rank_0( - f"------------------Finishing Step-----------------------") - - def _reassign_or_swap_out_partitioned_parameters(self, sub_group_id): - if self.fp16_partitioned_groups_flat[sub_group_id] is not None: - self.fp16_partitioned_groups_flat[sub_group_id].data.copy_( - self.fp32_partitioned_groups_flat[sub_group_id].data) - - # unflatten fp16 parameter subgroup - self._unflatten_partitioned_parameters(sub_group_id) - else: - self._partitioned_params_swap_out(sub_group_id) - - def allreduce_gradients(self): - self.overlapping_partition_gradients_reduce_epilogue() - - def step(self, closure=None): - """ - Not supporting closure. - """ - self._pre_step() - - # checks for overflow, adjust the loss scale accordingly - if self._overflow_check_and_loss_scale_update(): - if self.swap_optimizer: - self.optimizer_swapper.log_timers() - return - - norm_groups = self._get_norm_groups() - - timer_names = set() - - timer_names.add('optimizer_step') - self.start_timers(['optimizer_step']) - - # update parameters one sub group at a time - for sub_group_id, group in enumerate(self.fp16_groups): - # prepare optimizer states, gradients and fp32 parameters for update - self._prepare_sub_group(sub_group_id, timer_names) - - # scale the fp32 gradients - self.unscale_and_clip_grads(sub_group_id, norm_groups) - - # apply the optimizer step on the sub group and copy fp32 parameters to fp16 - self._optimizer_step(sub_group_id) - - # put fp16 parameters in appropriate location - self._reassign_or_swap_out_partitioned_parameters(sub_group_id) - - # release memory or swap out optimizer states of fp32 parameters - self._release_sub_group(sub_group_id, timer_names) - - self.stop_timers(['optimizer_step']) - - self._post_step(timer_names) - return - - def dump_pre_step_gradients(self, debug_fp32_grads): - # Dump gradient norms for debbuging - for i, _ in enumerate(self.fp16_groups): - if self.verbose: - print( - f'Pre-Step Dump Norms for Group {i} FP16P, FP16G, FP32G, FP32GUC') - for fp16_param, fp32_grad in zip(self.fp16_groups[i], debug_fp32_grads[i]): - param_id = self.get_param_id(fp16_param) - fp16_grad_norm = self.debug_fp16_grads[i][param_id] - - fp32_grad_norm = [float(t.data.float().norm(2)) - for t in fp32_grad] - norm_list = [fp16_grad_norm, fp32_grad_norm] - if self.verbose: - print(f'Pre-Step Norms {i} {param_id} = {norm_list}') - - def dump_post_step_gradients(self): - # Dump gradient norms for debbuging - for i, group in enumerate(self.fp16_groups): - if self.verbose: - print( - f'Post-Step Dump Norms for Group {i} FP16P, FP16DS, FP16FLAT, FP32FLAT') - unflat_fp16 = self.unflatten( - self.fp16_groups_flat[i], self.fp16_groups[i]) - 
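`step()` above updates one flat sub-group at a time: prepare or swap in optimizer state, fold loss-scale unscaling and gradient clipping into a single multiply, run the wrapped optimizer on just that sub-group, and copy the fp32 master weights back into the fp16 partition. A reduced sketch of one iteration of that loop (`step_one_subgroup` and its arguments are hypothetical names):

```python
import torch

def step_one_subgroup(fp32_flat, fp16_flat, optimizer, loss_scale, clip_grad, total_norm):
    """Sketch of one iteration of the per-sub-group update loop above."""
    # fold unscaling and clipping into one combined scale, as in
    # unscale_and_clip_grads (total_norm is computed on the scaled grads)
    combined_scale = loss_scale
    if clip_grad > 0.0:
        clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
        if clip > 1:
            combined_scale = clip * loss_scale
    fp32_flat.grad.mul_(1.0 / combined_scale)

    # point the inner optimizer at just this sub-group, step, then detach it
    optimizer.param_groups[0]['params'] = [fp32_flat]
    optimizer.step()
    optimizer.param_groups[0]['params'] = []

    # copy updated fp32 master weights back into the working fp16 partition
    fp16_flat.data.copy_(fp32_flat.data)

# toy 8-element sub-group
fp32 = torch.randn(8, requires_grad=True)
fp32.grad = torch.randn(8) * 1024.0          # pretend these are loss-scaled grads
fp16 = fp32.detach().half()
opt = torch.optim.SGD([fp32], lr=0.1)
step_one_subgroup(fp32, fp16, opt, loss_scale=1024.0,
                  clip_grad=1.0, total_norm=fp32.grad.norm().item())
```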
unflat_fp32 = self.unflatten(self.fp32_partitioned_groups_flat[i], - self.fp16_groups[i]) - for j, p in enumerate(self.fp16_groups[i]): - param_id = self.get_param_id(p) - param_norm = float(p.data.float().norm(2)) - ds_norm = float(p.ds_tensor.data.float().norm(2)) - - unflat_norm = [ - float(t.data.float().norm(2)) - for t in [unflat_fp16[j], - unflat_fp32[j]] - ] - norm_list = [param_norm, ds_norm] + unflat_norm - if self.verbose: - print(f'Post-Step Norms {i} {param_id} = {norm_list}') - - def unscale_and_clip_grads(self, sub_group_id, norm_groups): - - grad_groups_flat = [ - self.fp32_partitioned_groups_flat[sub_group_id].grad] - - total_norm = 0.0 - for norm in norm_groups: - total_norm += norm ** 2.0 - total_norm = math.sqrt(total_norm) - - # compute combined scale factor for this group - combined_scale = self.loss_scale - if self.clip_grad > 0.: - # norm is in fact norm*scale - clip = ((total_norm / self.loss_scale) + 1e-6) / self.clip_grad - if clip > 1: - combined_scale = clip * self.loss_scale - - for grad in grad_groups_flat: - if isinstance(grad, list): - sub_partitions = grad - for g in sub_partitions: - g.data.mul_(1. / combined_scale) - else: - grad.data.mul_(1. / combined_scale) - - def _check_overflow(self, partition_gradients=True): - self.overflow = self.has_overflow(partition_gradients) - - # `params` is a list / generator of torch.Variable - def has_overflow_serial(self, params, is_grad_list=False): - for p in params: - if p.grad is not None and self._has_inf_or_nan(p.grad.data): - return True - - return False - - def has_overflow_partitioned_grads_serial(self): - for i in range(len(self.fp16_groups)): - for j, grad in enumerate(self.averaged_gradients[i]): - if grad is not None and self._has_inf_or_nan(grad.data, j): - return True - return False - - def has_overflow(self, partition_gradients=True): - if partition_gradients: - if self.overlap_comm: - self.local_overflow = self._has_inf_or_nan(self.gpu_sum) - self.gpu_sum = torch.zeros(1, dtype=torch.float).cuda() - - overflow = self.local_overflow if self.offload_optimizer else self.has_overflow_partitioned_grads_serial( - ) - # overflow = self.has_overflow_partitioned_grads_serial() - overflow_gpu = torch.cuda.ByteTensor([overflow]) - torch.distributed.all_reduce(overflow_gpu, - op=torch.distributed.ReduceOp.MAX, - group=self.dp_process_group) - - else: - params = [] - for group in self.fp16_groups: - for param in group: - params.append(param) - - overflow = self.has_overflow_serial( - params, is_grad_list=partition_gradients) - overflow_gpu = torch.cuda.ByteTensor([overflow]) - - # Since each model parallel GPU carries only part of the model, - # make sure overflow flag is synced across all the model parallel GPUs - self._model_parallel_all_reduce(tensor=overflow_gpu, - op=torch.distributed.ReduceOp.MAX) - - overflow = overflow_gpu[0].item() - return bool(overflow) - - # `x` is a torch.Tensor - @staticmethod - def _has_inf_or_nan(x, j=None): - try: - # if x is half, the .float() incurs an additional deep copy, but it's necessary if - # Pytorch's .sum() creates a one-element tensor of the same type as x - # (which is true for some recent version of pytorch). - cpu_sum = float(x.float().sum()) - # More efficient version that can be used if .sum() returns a Python scalar - # cpu_sum = float(x.sum()) - except RuntimeError as instance: - # We want to check if inst is actually an overflow exception. - # RuntimeError could come from a different error. - # If so, we still want the exception to propagate. 
-            if "value cannot be converted" not in instance.args[0]:
-                raise
-            return True
-        else:
-            if cpu_sum == float('inf') or cpu_sum == -float('inf') or cpu_sum != cpu_sum:
-                return True
-            return False
-
-    def backward(self, loss, retain_graph=False):
-        """
-        :attr:`backward` performs the following steps:
-
-        1. fp32_loss = loss.float()
-        2. scaled_loss = fp32_loss*loss_scale
-        3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's fp16 leaves
-        """
-        self.micro_step_id += 1
-        if self.verbose:
-            print_rank_0(
-                f"Total fully available parameters {self.param_coordinator.total_available_parameter_numel}")
-
-        if self.swap_optimizer:
-            self.optimizer_swapper.pre_backward()
-
-        if self.verbose:
-            report_memory_usage("Before backward")
-
-        if self.contiguous_gradients:
-            self.ipg_buffer = []
-            buf_0 = torch.empty(self.reduce_bucket_size,
-                                dtype=self.dtype,
-                                device=torch.cuda.current_device())
-            self.ipg_buffer.append(buf_0)
-
-            # Use double buffers to avoid data access conflict when overlap_comm is enabled.
-            if self.overlap_comm:
-                buf_1 = torch.empty(self.reduce_bucket_size,
-                                    dtype=self.dtype,
-                                    device=torch.cuda.current_device())
-                self.ipg_buffer.append(buf_1)
-            self.ipg_index = 0
-
-        self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
-
-        # Partition any parameters that were not partitioned by the backward pass:
-        # modules whose inputs do not require grad never trigger the post-backward
-        # hook, so their parameters would otherwise remain unpartitioned.
-        self._partition_all_parameters()
-
-        if self.swap_optimizer:
-            self.optimizer_swapper.post_backward()
-
-    def _partition_all_parameters(self):
-        for name, param in self.module.named_parameters(recurse=True):
-            self.param_coordinator.release_and_reset_parameter(param)
-
-    def check_overflow(self, partition_gradients=True):
-        self._check_overflow(partition_gradients)
-
-    def _update_scale(self, has_overflow=False):
-        self.loss_scaler.update_scale(has_overflow)
-
-    # Promote state so it can be retrieved or set via "fp16_optimizer_instance.state"
-    def _get_state(self):
-        return self.optimizer.state
-
-    def _set_state(self, value):
-        self.optimizer.state = value
-
-    state = property(_get_state, _set_state)
-
-    # Promote param_groups so it can be retrieved or set via "fp16_optimizer_instance.param_groups"
-    # (for example, to adjust the learning rate)
-    def _get_param_groups(self):
-        return self.optimizer.param_groups
-
-    def _set_param_groups(self, value):
-        self.optimizer.param_groups = value
-
-    param_groups = property(_get_param_groups, _set_param_groups)
-
-    # Promote loss scale so it can be retrieved or set via "fp16_optimizer_instance.loss_scale"
-    def _get_loss_scale(self):
-        return self.loss_scaler.loss_scale
-
-    def _set_loss_scale(self, value):
-        self.loss_scaler.cur_scale = value
-
-    loss_scale = property(_get_loss_scale, _set_loss_scale)
-    cur_scale = property(_get_loss_scale, _set_loss_scale)
-
-    def _get_lean_tensors(self, padded_flattened_tensor, group_tensors, paddings):
-        # Remove paddings from flattened tensor
-        individual_tensors = self.unflatten(padded_flattened_tensor, group_tensors)
-        lean_lengths = [t.numel() - pad for t, pad in zip(group_tensors, paddings)]
-        # avoid shadowing the builtin `len` when slicing off the padding
-        lean_tensors = [t[:length] for t, length in zip(individual_tensors, lean_lengths)]
-        return lean_tensors
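
Since `_get_lean_tensors` is the workhorse for checkpoint shaping, a self-contained sketch of the same unflatten-then-strip-padding step may help; it uses `torch._utils` directly, which the class-level `flatten`/`unflatten` helpers are assumed to wrap:

```python
# Standalone sketch of the pad-stripping round trip used by _get_lean_tensors.
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

# Two partitions, each padded up to a multiple of 4 (trailing pads of 1 and 2).
parts = [torch.randn(8), torch.randn(12)]      # padded partition tensors
paddings = [1, 2]                              # trailing pad elements per tensor

flat = _flatten_dense_tensors(parts)           # one contiguous buffer
views = _unflatten_dense_tensors(flat, parts)  # views matching `parts` shapes
lean = [v[:v.numel() - p] for v, p in zip(views, paddings)]
assert [t.numel() for t in lean] == [7, 10]    # padding removed
```
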
-    # TODO REVISIT this for stage 3
-    def get_lean_optimizer_state(self):
-        # Return optimizer states after removing paddings.
-        # This method assumes that each param group contains a single flattened tensor.
-        optimizer_groups_state = []
-
-        for i, group in enumerate(self.optimizer.param_groups):
-            p = group['params'][0]
-            lean_state = {}
-            for key, value in self.optimizer.state[p].items():
-                if torch.is_tensor(value):
-                    lean_state[key] = self._get_lean_tensors(
-                        value,
-                        self.fp16_partitioned_groups[i],
-                        self.groups_padding[i])
-                else:
-                    lean_state[key] = value
-
-            optimizer_groups_state.append(lean_state)
-
-        return optimizer_groups_state
-
-    def get_groups_without_padding(self, groups_with_padding):
-        # Return group tensor after removing paddings added for alignment to DP world size.
-        groups_without_padding = []
-        for i, group in enumerate(groups_with_padding):
-            lean_group = self._get_lean_tensors(group,
-                                                self.fp16_partitioned_groups[i],
-                                                self.groups_padding[i])
-            groups_without_padding.append(lean_group)
-
-        return groups_without_padding
-
-    def _set_fp32_optimizer_param_groups(self):
-        for sub_group_id, _ in enumerate(self.fp16_groups):
-            param_group_id = self.sub_group_to_group_id[sub_group_id]
-            self.optimizer.param_groups[param_group_id]['params'].append(
-                self.fp32_partitioned_groups_flat[sub_group_id])
-
-    def _clear_fp32_optimizer_param_groups(self):
-        for param_group in self.optimizer.param_groups:
-            param_group['params'] = []
-
-    def _rigid_state_dict(self):
-        state_dict = {}
-        state_dict['zero_stage'] = ZERO_OPTIMIZATION_WEIGHTS
-        state_dict['loss_scaler'] = self.loss_scaler
-        state_dict['dynamic_loss_scale'] = self.dynamic_loss_scale
-        state_dict['overflow'] = self.overflow
-        state_dict['partition_count'] = self.partition_count
-
-        self._set_fp32_optimizer_param_groups()
-        state_dict['optimizer_state_dict'] = self.optimizer.state_dict()
-        state_dict['fp32_flat_groups'] = self.fp32_partitioned_groups_flat
-        self._clear_fp32_optimizer_param_groups()
-
-        return state_dict
-
-    def state_dict(self):
-        """
-        Returns a dict containing the current state of this :class:`FP16_Optimizer` instance.
-        This dict contains attributes of :class:`FP16_Optimizer`, as well as the state_dict
-        of the contained PyTorch optimizer.
-
-        Example::
-
-            checkpoint = {}
-            checkpoint['model'] = model.state_dict()
-            checkpoint['optimizer'] = optimizer.state_dict()
-            torch.save(checkpoint, "saved.pth")
-        """
-        if self.elastic_checkpoint:
-            raise NotImplementedError(
-                "ZeRO-3 does not yet support elastic checkpointing, please disable for now.")
-
-        if self.swap_optimizer or self.params_in_nvme_and_cpu:
-            raise NotImplementedError(
-                "ZeRO-3 does not yet support checkpointing with NVMe offloading, please disable for now.")
-
-        return self._rigid_state_dict()
-
-    # Restore base optimizer fp32 weights from checkpoint by:
-    # 1) Merging fp32 weights from checkpoints of all partitions
-    # 2) Extracting fp32 weights for current partition from merged weights
-    # 3) Using extracted weights to update base optimizer weights directly.
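
The three restore steps above can be pictured with a toy example. This is a simplified sketch assuming equal shard sizes and an unchanged DP world size; the real code path below additionally re-flattens with alignment padding via `_get_flattened_partition`:

```python
# Illustrative sketch of merge-then-repartition, not the exact code path below.
import torch

def merge_and_extract(saved_shards, rank, world_size):
    # 1) merge the flat fp32 shards saved by every rank, in rank order
    merged = torch.cat(saved_shards)
    # 2) slice out the contiguous piece owned by this rank
    part = merged.numel() // world_size
    return merged.narrow(0, rank * part, part)

# 3) copy the extracted slice into the live fp32 partition:
#    current.data.copy_(merge_and_extract(shards, rank, world_size))
```
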
-
-    def _restore_from_fp32_weights(self, all_state_dict):
-
-        flat_local_partition = []
-        for i in range(len(self.fp32_partitioned_groups_flat)):
-            merged_partitions = [sd['fp32_groups'][i] for sd in all_state_dict]
-            flat_local_partition.append(self._get_flattened_partition(merged_partitions))
-
-        for current, saved in zip(self.fp32_partitioned_groups_flat, flat_local_partition):
-            current.data.copy_(saved.data)
-
-    # Restore base optimizer fp32 weights from ZeRO fp16 weights
-    def _restore_from_fp16_weights(self):
-        for fp16_partitions, fp32_partition in zip(self.fp16_partitioned_groups_flat,
-                                                   self.fp32_partitioned_groups_flat):
-            fp32_partition.data.copy_(fp16_partitions.data)
-
-    # Refresh the fp32 master params from the fp16 copies.
-    def refresh_fp32_params(self):
-        self._restore_from_fp16_weights()
-
-    # Extract the flattened partition for the current rank from all partitions
-    def _get_flattened_partition(self, all_partition_states):
-        partition_id = dist.get_rank(group=self.dp_process_group)
-        alignment = dist.get_world_size(group=self.dp_process_group)
-
-        param_partitions = [[] for _ in range(len(all_partition_states[0]))]
-        for i, partition in enumerate(all_partition_states):
-            for j, param in enumerate(partition):
-                param_partitions[j].append(param)
-
-        local_state_partitions = []
-        for param_index, param_slices in enumerate(param_partitions):
-            flattened_merged_tensor = self.flatten_dense_tensors_aligned(
-                param_slices,
-                alignment)
-            new_partitions = self.get_data_parallel_partitions(flattened_merged_tensor)
-            local_state_partitions.append(new_partitions[partition_id])
-
-        if torch.is_tensor(local_state_partitions[0]):
-            return self.flatten_dense_tensors_aligned(local_state_partitions, alignment)
-
-        # Assume non-tensor states are not partitioned and equal across ranks, so return the first one
-        return local_state_partitions[0]
-
-    # Restore base optimizer state from checkpoint by
-    # 1) Merging optimizer state from checkpoints of all partitions
-    # 2) Extracting optimizer state for current partition from the merged state
-    # 3) Using the extracted value to directly update the base optimizer.
-    def _restore_base_optimizer_state(self, all_state_dict):
-        base_optimizer_group_states = []
-        for i in range(len(self.optimizer.param_groups)):
-            partition_states = {}
-            all_partition_group_states = [
-                sd['base_optimizer_state'][i] for sd in all_state_dict
-            ]
-            for key in all_partition_group_states[0].keys():
-                all_partition_states = [
-                    all_states[key] for all_states in all_partition_group_states
-                ]
-                partition_states[key] = self._get_flattened_partition(all_partition_states)
-            base_optimizer_group_states.append(partition_states)
-
-        for i, group in enumerate(self.optimizer.param_groups):
-            p = group['params'][0]
-            for key, saved in base_optimizer_group_states[i].items():
-                if torch.is_tensor(self.optimizer.state[p][key]):
-                    self.optimizer.state[p][key].data.copy_(saved.data)
-                else:
-                    self.optimizer.state[p][key] = saved
-
-    def _rigid_load_state_dict(self, state_dict, load_optimizer_states=True):
-        # I think it should actually be ok to reload the optimizer before the model.
- self.loss_scaler = state_dict['loss_scaler'] - self.dynamic_loss_scale = state_dict['dynamic_loss_scale'] - self.overflow = state_dict['overflow'] - - if load_optimizer_states: - self._set_fp32_optimizer_param_groups() - self.optimizer.load_state_dict(state_dict['optimizer_state_dict']) - self._clear_fp32_optimizer_param_groups() - - # restore fp32 partitions - for curr_param, saved_param in zip(self.fp32_partitioned_groups_flat, state_dict['fp32_flat_groups']): - curr_param.data.copy_(saved_param.data) - - # restore fp16 partitions from fp32 - for sub_group_id in range(len(self.fp32_partitioned_groups_flat)): - fp32_param = self.fp32_partitioned_groups_flat[sub_group_id] - fp16_param = self.fp16_partitioned_groups_flat[sub_group_id] - fp16_param.data.copy_(fp32_param.data) - - # update fp16 unflattened params - for sub_group_id in range(len(self.fp16_partitioned_groups_flat)): - updated_params = self.unflatten( - self.fp16_partitioned_groups_flat[sub_group_id], - self.fp16_partitioned_groups[sub_group_id]) - - for partitioned_param, q in zip(self.fp16_partitioned_groups[sub_group_id], updated_params): - partitioned_param.data = q.data - - # TODO: Support different/changing load/save DP degree. - def load_state_dict(self, - state_dict_list, - load_optimizer_states=True, - load_from_fp32_weights=False): - r"""Loading a ZeRO checkpoint - - Loads a state_dict created by an earlier call to state_dict(). - If ``fp16_optimizer_instance`` was constructed from some ``init_optimizer``, - whose parameters in turn came from ``model``, it is expected that the user - will call ``model.load_state_dict()`` before - ``fp16_optimizer_instance.load_state_dict()`` is called. - - Arguments: - state_dict_list: List of all saved ZeRO checkpoints, one for each saved partition. - Note that the number of saved partitions may differ from number of loading partitions to support - changing GPU count, specifically DP world size, between saving and loading checkpoints. - load_optimizer_states: Boolean indicating whether or not to load base optimizer states - load_from_fp32_weights: Boolean indicating whether to initialize fp32 master weights from fp32 - copies in checkpoints (no precision loss) or from model's fp16 copies (with precision loss). - - Example:: - - model = torch.nn.Linear(D_in, D_out).cuda().half() - optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) - optimizer = FP16_Optimizer(optimizer, static_loss_scale = 128.0) - ... - checkpoint = torch.load("saved.pth") - model.load_state_dict(checkpoint['model']) - optimizer.load_state_dict(checkpoint['optimizer']) - """ - - if self.elastic_checkpoint: - raise NotImplementedError( - "ZeRO-3 does not yet support elastic checkpointing, please disable for now." - ) - - if self.swap_optimizer or self.params_in_nvme_and_cpu: - raise NotImplementedError( - "ZeRO-3 does not yet support checkpointing with NVMe offloading, please disable for now." 
- ) - - self._rigid_load_state_dict( - state_dict_list[dist.get_rank(group=self.dp_process_group)], - load_optimizer_states=load_optimizer_states) - - if len(self.persistent_parameters) > 0: - self.persistent_parameters[0].partition(self.persistent_parameters) - self.persistent_parameters[0].all_gather( - self.persistent_parameters) - - def save_checkpoint_prologue(self): - self._partition_all_parameters() - - def save_checkpoint_epilogue(self): - if len(self.persistent_parameters) > 0: - self.persistent_parameters[0].all_gather( - self.persistent_parameters) - - -def _handle_overflow(cpu_sum, x, i): - import math - rank = torch.distributed.get_rank() - if rank == 0: - t_i = -1 - for v_i, v in enumerate(x.data.contiguous().view(-1)): - if not math.isfinite(float(v)): - t_i = v_i - break - print( - f"rank {rank} detected overflow {cpu_sum} in tensor {i}:{t_i} shape {x.shape}" - ) - - -def estimate_zero3_model_states_mem_needs(total_params, - largest_layer_params, - num_gpus_per_node=1, - num_nodes=1, - cpu_offload=True, - cpu_offload_params=True, - zero_init=True, - additional_buffer_factor=1.5): - total_gpus = num_nodes * num_gpus_per_node - gpus_factor = 1 / num_nodes - largest_layer_memory = (4 * largest_layer_params) - - if cpu_offload: - if cpu_offload_params: - gpu_mem = largest_layer_memory - - if zero_init: - cpu_mem = total_params * 18 * gpus_factor * additional_buffer_factor - else: - cpu_mem = total_params * max(4 * num_gpus_per_node, - 18 * gpus_factor) * additional_buffer_factor - - else: - gpu_mem = largest_layer_memory + int(2 * total_params / total_gpus) - - if zero_init: - cpu_mem = total_params * 16 * gpus_factor * additional_buffer_factor - else: - cpu_mem = total_params * max(4 * num_gpus_per_node, - 16 * gpus_factor) * additional_buffer_factor - else: - gpu_mem = largest_layer_memory + int(18 * total_params / total_gpus) - if zero_init: - cpu_mem = largest_layer_params * 4 * num_gpus_per_node * additional_buffer_factor - else: - cpu_mem = total_params * 4 * num_gpus_per_node * additional_buffer_factor - - return int(cpu_mem), int(gpu_mem), largest_layer_memory - - -def model_to_params(model): - # shared params calculated only once - total_params = sum( - dict((p.data_ptr(), - p.numel()) for p in model.parameters()).values()) - - largest_layer_params = 0 - for m in model.modules(): - # assuming no shared params within a single layer - layer_params = sum(p.numel() for p in m.parameters(recurse=False)) - largest_layer_params = max(largest_layer_params, layer_params) - - return total_params, largest_layer_params - - -def estimate_zero3_model_states_mem_needs_all_live(model, - num_gpus_per_node=1, - num_nodes=1, - additional_buffer_factor=1.5): - """ - Print out estimates on memory usage requirements for ZeRO 3 params, optim states and gradients - for a given ``model`` and hardware setup. - - If you have an actual model object, use this function and everything will be derived - automatically. - - If it's a hypothetical model, use ``estimate_zero3_model_states_mem_needs_all_cold`` where you have to pass - the ``total_params`` and ``largest_layer_params`` explicitly. 
-
-    Args:
-        - ``model``: ``nn.Module`` object
-        - ``num_gpus_per_node``: how many gpus per node (defaults to 1)
-        - ``num_nodes``: how many nodes (defaults to 1)
-        - ``additional_buffer_factor``: estimation factor (defaults to 1.5)
-
-    """
-
-    total_params, largest_layer_params = model_to_params(model)
-
-    estimate_zero3_model_states_mem_needs_all_cold(
-        total_params=total_params,
-        largest_layer_params=largest_layer_params,
-        num_gpus_per_node=num_gpus_per_node,
-        num_nodes=num_nodes,
-        additional_buffer_factor=additional_buffer_factor)
-
-
-def estimate_zero3_model_states_mem_needs_all_cold(total_params,
-                                                   largest_layer_params,
-                                                   num_gpus_per_node=1,
-                                                   num_nodes=1,
-                                                   additional_buffer_factor=1.5):
-    """
-    Print out estimates on memory usage requirements for ZeRO 3 params, optim states and gradients
-    for a given ``model`` and hardware setup.
-
-    If it's a hypothetical model, use this function where you have to pass
-    the ``total_params`` and ``largest_layer_params`` explicitly.
-
-    If you have an actual model object, use ``estimate_zero3_model_states_mem_needs_all_live`` and everything
-    will be derived automatically.
-
-    Args:
-        - ``total_params``: total model params
-        - ``largest_layer_params``: largest layer's params
-        - ``num_gpus_per_node``: how many gpus per node (defaults to 1)
-        - ``num_nodes``: how many nodes (defaults to 1)
-        - ``additional_buffer_factor``: estimation factor (defaults to 1.5)
-
-    """
-
-    def format_options(cpu_offload, cpu_offload_params, zero_init):
-        enabled = []
-        enabled.append(f"cpu_offload={1 if cpu_offload else 0}")
-        enabled.append(f"cpu_offload_params={1 if cpu_offload_params else 0}")
-        enabled.append(f"zero_init={1 if zero_init else 0}")
-        return ", ".join(enabled)
-
-    nodes_str = "nodes" if num_nodes > 1 else "node"
-    gpus_str = "GPUs" if num_gpus_per_node > 1 else "GPU"
-    print(
-        "Estimated memory needed for params, optim states and gradients for a:\n"
-        f"HW: Setup with {num_nodes} {nodes_str}, {num_gpus_per_node} {gpus_str} per node.\n"
-        f"SW: Model with {int(total_params / 1e6)}M total params, {int(largest_layer_params / 1e6)}M largest layer params."
-    )
-    print(" per CPU | per GPU | Options")
-    for cpu_offload in [True, False]:
-        for cpu_offload_params in [True, False]:
-            if not cpu_offload and cpu_offload_params:
-                continue
-            for zero_init in [True, False]:
-                cpu_mem, gpu_mem, largest_layer_memory = estimate_zero3_model_states_mem_needs(
-                    total_params=total_params,
-                    largest_layer_params=largest_layer_params,
-                    num_gpus_per_node=num_gpus_per_node,
-                    num_nodes=num_nodes,
-                    cpu_offload=cpu_offload,
-                    cpu_offload_params=cpu_offload_params,
-                    zero_init=zero_init,
-                    additional_buffer_factor=additional_buffer_factor)
-
-                options_str = format_options(cpu_offload=cpu_offload,
-                                             cpu_offload_params=cpu_offload_params,
-                                             zero_init=zero_init)
-                print(
-                    f" {cpu_mem / 2 ** 30:7.2f}GB | {gpu_mem / 2 ** 30:6.2f}GB | {options_str}")
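
For reference, a hypothetical invocation of the live estimator; the toy model below is an assumption for illustration, and any `nn.Module` works:

```python
import torch.nn as nn

# Hypothetical toy model; in practice pass your real (possibly uninitialized) model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
estimate_zero3_model_states_mem_needs_all_live(model,
                                               num_gpus_per_node=8,
                                               num_nodes=1)
# Prints one "per CPU | per GPU | Options" row per offload/zero_init combination.
```
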
diff --git a/compile.sh b/compile.sh
deleted file mode 100755
index 55d8dd04f1d326e9845f9e6da44611f1565f913a..0000000000000000000000000000000000000000
--- a/compile.sh
+++ /dev/null
@@ -1,8 +0,0 @@
-source /opt/dtk-22.04.2/env.sh
-export CPLUS_INCLUDE_PATH=${ROCM_PATH}/include:${ROCM_PATH}:/opencl/include
-export CPLUS_INCLUDE_PATH=${ROCM_PATH}/hiprand/include/:${ROCM_PATH}/rocrand/include:$CPLUS_INCLUDE_PATH
-export LD_LIBRARY_PATH=${ROCM_PATH}/hipblas/lib/:$LD_LIBRARY_PATH
-export LD_LIBRARY_PATH=${ROCM_PATH}/llvm/lib:$LD_LIBRARY_PATH
-
-MAX_JOBS=32 python3 setup.py -v bdist_wheel
-
diff --git a/dist/colossalai-0.0.2-cp37-cp37m-linux_x86_64.whl b/dist/colossalai-0.0.2-cp37-cp37m-linux_x86_64.whl
deleted file mode 100644
index 1ba8b60f6c9c7a84d204cb19a1bfb51596eb7bf6..0000000000000000000000000000000000000000
Binary files a/dist/colossalai-0.0.2-cp37-cp37m-linux_x86_64.whl and /dev/null differ
diff --git a/docker/Dockerfile b/docker/Dockerfile
deleted file mode 100644
index da851a1562d19606d378f3cebd7a9d3d768d8aa5..0000000000000000000000000000000000000000
--- a/docker/Dockerfile
+++ /dev/null
@@ -1,11 +0,0 @@
-FROM nvcr.io/nvidia/pytorch:21.07-py3
-
-# install dependencies
-RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple \
-    && pip install -U pip setuptools wheel \
-    && pip install pytest tensorboard deepspeed apex
-
-# install colossalai
-RUN git clone https://github.com/hpcaitech/ColossalAI.git \
-    && cd ./ColossalAI \
-    && pip install -v --no-cache-dir .
diff --git a/docs/Makefile b/docs/Makefile
deleted file mode 100644
index 9f43a48d64206ee9af88cb4fef87363921998174..0000000000000000000000000000000000000000
--- a/docs/Makefile
+++ /dev/null
@@ -1,26 +0,0 @@
-# Minimal makefile for Sphinx documentation
-#
-
-# You can set these variables from the command line, and also
-# from the environment for the first two.
-SPHINXOPTS    ?=
-SPHINXBUILD   ?= sphinx-build
-SOURCEDIR     = .
-BUILDDIR      = .build
-SPHINXAPIDOC  ?= sphinx-apidoc
-SPHINX_APIDOC_OPTIONS = members
-SPHINX_APIDOC_TEMPLATEDIR = _templates/apidoc
-
-# Put it first so that "make" without argument is like "make help".
-help:
-	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
-
-.PHONY: help Makefile apidoc
-
-apidoc:
-	@SPHINX_APIDOC_OPTIONS=$(SPHINX_APIDOC_OPTIONS) $(SPHINXAPIDOC) -f -T -e -M -d 2 -t $(SPHINX_APIDOC_TEMPLATEDIR) -o ./colossalai ../colossalai
-# 	@$(SPHINXAPIDOC) -f -o ./model_zoo ../model_zoo
-# Catch-all target: route all unknown targets to Sphinx using the new
-# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
-%: Makefile - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/_static/css/rtd_theme.css b/docs/_static/css/rtd_theme.css deleted file mode 100644 index caf42dc5aaab93f26d417dd132001b5e03e849e0..0000000000000000000000000000000000000000 --- a/docs/_static/css/rtd_theme.css +++ /dev/null @@ -1,3 +0,0 @@ -.wy-nav-content { - max-width: 80%; -} \ No newline at end of file diff --git a/docs/_templates/apidoc/module.rst_t b/docs/_templates/apidoc/module.rst_t deleted file mode 100644 index d9a50e6b9752a1b04ef1317c33075e8c19fc97cd..0000000000000000000000000000000000000000 --- a/docs/_templates/apidoc/module.rst_t +++ /dev/null @@ -1,9 +0,0 @@ -{%- if show_headings %} -{{- basename | e | heading }} - -{% endif -%} -.. automodule:: {{ qualname }} -{%- for option in automodule_options %} - :{{ option }}: -{%- endfor %} - diff --git a/docs/_templates/apidoc/package.rst_t b/docs/_templates/apidoc/package.rst_t deleted file mode 100644 index 83742b3f7c66c10e0ebbe78718dea91e34d050a5..0000000000000000000000000000000000000000 --- a/docs/_templates/apidoc/package.rst_t +++ /dev/null @@ -1,52 +0,0 @@ -{%- macro automodule(modname, options) -%} -.. automodule:: {{ modname }} -{%- for option in options %} - :{{ option }}: -{%- endfor %} -{%- endmacro %} - -{%- macro toctree(docnames) -%} -.. toctree:: - :maxdepth: {{ maxdepth }} -{% for docname in docnames %} - {{ docname }} -{%- endfor %} -{%- endmacro %} - -{%- if is_namespace %} -{{- pkgname | e | heading }} -{% else %} -{{- pkgname | e | heading }} -{% endif %} - -{%- if is_namespace %} -.. py:module:: {{ pkgname }} -{% endif %} - -{%- if modulefirst and not is_namespace %} -{{ automodule(pkgname, automodule_options) }} -{% endif %} - -{%- if subpackages %} -{{ toctree(subpackages) }} -{% endif %} - -{%- if submodules %} -{% if separatemodules %} -{{ toctree(submodules) }} -{% else %} -{%- for submodule in submodules %} -{% if show_headings %} -{{- submodule | e | heading(2) }} -{% endif %} -{{ automodule(submodule, automodule_options) }} -{% endfor %} -{%- endif %} -{%- endif %} - -{%- if not modulefirst and not is_namespace %} -Module contents ---------------- - -{{ automodule(pkgname, automodule_options) }} -{% endif %} diff --git a/docs/_templates/apidoc/toc.rst_t b/docs/_templates/apidoc/toc.rst_t deleted file mode 100644 index f0877eeb2f85324a48eb63d793a536a8cfdb4a00..0000000000000000000000000000000000000000 --- a/docs/_templates/apidoc/toc.rst_t +++ /dev/null @@ -1,8 +0,0 @@ -{{ header | heading }} - -.. toctree:: - :maxdepth: {{ maxdepth }} -{% for docname in docnames %} - {{ docname }} -{%- endfor %} - diff --git a/docs/colossalai/colossalai.amp.amp_type.rst b/docs/colossalai/colossalai.amp.amp_type.rst deleted file mode 100644 index 067af7d8c51a88ca94140b5b79fbbce7beccf41f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.amp_type.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.amp.amp\_type -======================== - -.. automodule:: colossalai.amp.amp_type - :members: diff --git a/docs/colossalai/colossalai.amp.apex_amp.apex_amp.rst b/docs/colossalai/colossalai.amp.apex_amp.apex_amp.rst deleted file mode 100644 index cba7e00625a4d6d018e1416cbdc984e659a4f345..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.apex_amp.apex_amp.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.amp.apex\_amp.apex\_amp -================================== - -.. 
automodule:: colossalai.amp.apex_amp.apex_amp - :members: diff --git a/docs/colossalai/colossalai.amp.apex_amp.rst b/docs/colossalai/colossalai.amp.apex_amp.rst deleted file mode 100644 index 7116a538b4c1d227354d1a16a64ce00165427cc3..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.apex_amp.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.amp.apex\_amp -======================== - -.. automodule:: colossalai.amp.apex_amp - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.amp.apex_amp.apex_amp diff --git a/docs/colossalai/colossalai.amp.naive_amp.naive_amp.rst b/docs/colossalai/colossalai.amp.naive_amp.naive_amp.rst deleted file mode 100644 index e20f22b2e386effc3f68c5ef49c490dbac75aaea..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.naive_amp.naive_amp.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.amp.naive\_amp.naive\_amp -==================================== - -.. automodule:: colossalai.amp.naive_amp.naive_amp - :members: diff --git a/docs/colossalai/colossalai.amp.naive_amp.rst b/docs/colossalai/colossalai.amp.naive_amp.rst deleted file mode 100644 index 15917e174995368e39a7e0aeffd2a56a8d602866..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.naive_amp.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.amp.naive\_amp -========================= - -.. automodule:: colossalai.amp.naive_amp - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.amp.naive_amp.naive_amp diff --git a/docs/colossalai/colossalai.amp.rst b/docs/colossalai/colossalai.amp.rst deleted file mode 100644 index 5ef4f36c13ac30b6accda7435a9e6d5b30f49a4e..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.rst +++ /dev/null @@ -1,18 +0,0 @@ -colossalai.amp -============== - -.. automodule:: colossalai.amp - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.amp.apex_amp - colossalai.amp.naive_amp - colossalai.amp.torch_amp - - -.. toctree:: - :maxdepth: 2 - - colossalai.amp.amp_type diff --git a/docs/colossalai/colossalai.amp.torch_amp.rst b/docs/colossalai/colossalai.amp.torch_amp.rst deleted file mode 100644 index f10095f136e091bad583e643151a1c6eae56351a..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.torch_amp.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.amp.torch\_amp -========================= - -.. automodule:: colossalai.amp.torch_amp - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.amp.torch_amp.torch_amp diff --git a/docs/colossalai/colossalai.amp.torch_amp.torch_amp.rst b/docs/colossalai/colossalai.amp.torch_amp.torch_amp.rst deleted file mode 100644 index 5f1549cb8d48aac1c0b51d03c9bd05aac0f16f46..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.amp.torch_amp.torch_amp.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.amp.torch\_amp.torch\_amp -==================================== - -.. automodule:: colossalai.amp.torch_amp.torch_amp - :members: diff --git a/docs/colossalai/colossalai.builder.builder.rst b/docs/colossalai/colossalai.builder.builder.rst deleted file mode 100644 index 85da78ab9e3de33e5eb4e7fcc9a659a7d3fa5952..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.builder.builder.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.builder.builder -========================== - -.. 
automodule:: colossalai.builder.builder - :members: diff --git a/docs/colossalai/colossalai.builder.pipeline.rst b/docs/colossalai/colossalai.builder.pipeline.rst deleted file mode 100644 index 7b8c960bb4ce0dcd32949539b69f9acc0b006784..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.builder.pipeline.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.builder.pipeline -=========================== - -.. automodule:: colossalai.builder.pipeline - :members: diff --git a/docs/colossalai/colossalai.builder.rst b/docs/colossalai/colossalai.builder.rst deleted file mode 100644 index 60b8501c8f5fcd41e0da7bec2122046c2f598dd1..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.builder.rst +++ /dev/null @@ -1,12 +0,0 @@ -colossalai.builder -================== - -.. automodule:: colossalai.builder - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.builder.builder - colossalai.builder.pipeline diff --git a/docs/colossalai/colossalai.communication.collective.rst b/docs/colossalai/colossalai.communication.collective.rst deleted file mode 100644 index 5015edf98901fa4077b6b11e4ab81a9979ae84c4..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.communication.collective.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.communication.collective -=================================== - -.. automodule:: colossalai.communication.collective - :members: diff --git a/docs/colossalai/colossalai.communication.p2p.rst b/docs/colossalai/colossalai.communication.p2p.rst deleted file mode 100644 index 79135bb8630f6dfa57f2f2857e4efaac046e0b5c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.communication.p2p.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.communication.p2p -============================ - -.. automodule:: colossalai.communication.p2p - :members: diff --git a/docs/colossalai/colossalai.communication.ring.rst b/docs/colossalai/colossalai.communication.ring.rst deleted file mode 100644 index c218d4bed350f7af9e81cf4bfccb3bf94e273d94..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.communication.ring.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.communication.ring -============================= - -.. automodule:: colossalai.communication.ring - :members: diff --git a/docs/colossalai/colossalai.communication.rst b/docs/colossalai/colossalai.communication.rst deleted file mode 100644 index 5086fa663ec7e09ee12eb8393454dac783453354..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.communication.rst +++ /dev/null @@ -1,14 +0,0 @@ -colossalai.communication -======================== - -.. automodule:: colossalai.communication - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.communication.collective - colossalai.communication.p2p - colossalai.communication.ring - colossalai.communication.utils diff --git a/docs/colossalai/colossalai.communication.utils.rst b/docs/colossalai/colossalai.communication.utils.rst deleted file mode 100644 index 19a36cc9ff6f75448dc31bbdcbe41f1755bbbe83..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.communication.utils.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.communication.utils -============================== - -.. 
automodule:: colossalai.communication.utils - :members: diff --git a/docs/colossalai/colossalai.context.config.rst b/docs/colossalai/colossalai.context.config.rst deleted file mode 100644 index 2fb1b99d3e7af8af7cafb5f1ea7dd744aa888fc4..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.config.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.config -========================= - -.. automodule:: colossalai.context.config - :members: diff --git a/docs/colossalai/colossalai.context.parallel_context.rst b/docs/colossalai/colossalai.context.parallel_context.rst deleted file mode 100644 index d1c82c5178451e954115425e0c52620250371ccb..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.parallel_context.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.parallel\_context -==================================== - -.. automodule:: colossalai.context.parallel_context - :members: diff --git a/docs/colossalai/colossalai.context.parallel_mode.rst b/docs/colossalai/colossalai.context.parallel_mode.rst deleted file mode 100644 index f7ac137493fb4ad0f476c9ec82af719368bc1124..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.parallel_mode.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.parallel\_mode -================================= - -.. automodule:: colossalai.context.parallel_mode - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_1d.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_1d.rst deleted file mode 100644 index 88cbf3ebadb3845028d3cc004981e47443a657fb..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_1d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_1d -============================================================== - -.. automodule:: colossalai.context.process_group_initializer.initializer_1d - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_2d.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_2d.rst deleted file mode 100644 index d99a2e1c31775187ce5db8239a04c749e750acb8..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_2d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_2d -============================================================== - -.. automodule:: colossalai.context.process_group_initializer.initializer_2d - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_2p5d.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_2p5d.rst deleted file mode 100644 index 73d80e4431bbbefb094459cb53ff866239bc49b0..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_2p5d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_2p5d -================================================================ - -.. 
automodule:: colossalai.context.process_group_initializer.initializer_2p5d - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_3d.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_3d.rst deleted file mode 100644 index 5cfba5ce0870e973930bcb5ea925185561b8509b..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_3d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_3d -============================================================== - -.. automodule:: colossalai.context.process_group_initializer.initializer_3d - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_data.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_data.rst deleted file mode 100644 index 55ad05f32b143b768d4d6b46add4e513f07a57fa..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_data.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_data -================================================================ - -.. automodule:: colossalai.context.process_group_initializer.initializer_data - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_model.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_model.rst deleted file mode 100644 index 8f2d79369915a0c0a5b76e17a145be9e98311ab5..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_model.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_model -================================================================= - -.. automodule:: colossalai.context.process_group_initializer.initializer_model - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_moe.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_moe.rst deleted file mode 100644 index be2314629604440b9bca554204887189ee6cb81d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_moe.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_moe -=============================================================== - -.. automodule:: colossalai.context.process_group_initializer.initializer_moe - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_pipeline.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_pipeline.rst deleted file mode 100644 index 466d5143a02b58c86b8cb3adbf6461f2d59a759f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_pipeline.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_pipeline -==================================================================== - -.. 
automodule:: colossalai.context.process_group_initializer.initializer_pipeline - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_sequence.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_sequence.rst deleted file mode 100644 index dab71cc3c3917c416e1d35ca6c5a6dafe4fdc1b9..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_sequence.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_sequence -==================================================================== - -.. automodule:: colossalai.context.process_group_initializer.initializer_sequence - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_tensor.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_tensor.rst deleted file mode 100644 index 0c2d8d1e9daaa9a7392c048ffa7e7d3bf9e59342..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.initializer_tensor.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.initializer\_tensor -================================================================== - -.. automodule:: colossalai.context.process_group_initializer.initializer_tensor - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.process_group_initializer.rst b/docs/colossalai/colossalai.context.process_group_initializer.process_group_initializer.rst deleted file mode 100644 index 3f98723c170b56c1b8b12dce96edb611cee1dc66..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.process_group_initializer.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.process\_group\_initializer.process\_group\_initializer -========================================================================== - -.. automodule:: colossalai.context.process_group_initializer.process_group_initializer - :members: diff --git a/docs/colossalai/colossalai.context.process_group_initializer.rst b/docs/colossalai/colossalai.context.process_group_initializer.rst deleted file mode 100644 index b5e261195eef5ec2438400098d75f7b4dd2945fe..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.process_group_initializer.rst +++ /dev/null @@ -1,21 +0,0 @@ -colossalai.context.process\_group\_initializer -============================================== - -.. automodule:: colossalai.context.process_group_initializer - :members: - - -.. 
toctree:: - :maxdepth: 2 - - colossalai.context.process_group_initializer.initializer_1d - colossalai.context.process_group_initializer.initializer_2d - colossalai.context.process_group_initializer.initializer_2p5d - colossalai.context.process_group_initializer.initializer_3d - colossalai.context.process_group_initializer.initializer_data - colossalai.context.process_group_initializer.initializer_model - colossalai.context.process_group_initializer.initializer_moe - colossalai.context.process_group_initializer.initializer_pipeline - colossalai.context.process_group_initializer.initializer_sequence - colossalai.context.process_group_initializer.initializer_tensor - colossalai.context.process_group_initializer.process_group_initializer diff --git a/docs/colossalai/colossalai.context.random.rst b/docs/colossalai/colossalai.context.random.rst deleted file mode 100644 index 8d4b9c56af3cbeb64d14e7891a282a6f75ce7fa9..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.random.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.context.random -========================= - -.. automodule:: colossalai.context.random - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.context.random.seed_manager diff --git a/docs/colossalai/colossalai.context.random.seed_manager.rst b/docs/colossalai/colossalai.context.random.seed_manager.rst deleted file mode 100644 index b71f35c2750c973eb9664bd96d3210fb0722c005..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.random.seed_manager.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.context.random.seed\_manager -======================================= - -.. automodule:: colossalai.context.random.seed_manager - :members: diff --git a/docs/colossalai/colossalai.context.rst b/docs/colossalai/colossalai.context.rst deleted file mode 100644 index babab509945eb08fbef1dcbf41062bcb0bf53b50..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.context.rst +++ /dev/null @@ -1,19 +0,0 @@ -colossalai.context -================== - -.. automodule:: colossalai.context - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.context.process_group_initializer - colossalai.context.random - - -.. toctree:: - :maxdepth: 2 - - colossalai.context.config - colossalai.context.parallel_context - colossalai.context.parallel_mode diff --git a/docs/colossalai/colossalai.engine.gradient_handler.rst b/docs/colossalai/colossalai.engine.gradient_handler.rst deleted file mode 100644 index d7d1633a60831b74187ea3f86bf17d47d1f14ecb..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.engine.gradient_handler.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.engine.gradient\_handler -=================================== - -.. automodule:: colossalai.engine.gradient_handler - :members: diff --git a/docs/colossalai/colossalai.engine.rst b/docs/colossalai/colossalai.engine.rst deleted file mode 100644 index f41c21e67abed1a4a07f4fa2f84f129a8cef500f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.engine.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.engine -================= - -.. automodule:: colossalai.engine - :members: - -.. 
toctree:: - :maxdepth: 2 - - colossalai.engine.gradient_handler - colossalai.engine.schedule diff --git a/docs/colossalai/colossalai.engine.schedule.rst b/docs/colossalai/colossalai.engine.schedule.rst deleted file mode 100644 index 2909373f00020afea52509e8c92d3563f6a128ce..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.engine.schedule.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.engine.schedule -========================== - -.. automodule:: colossalai.engine.schedule - :members: diff --git a/docs/colossalai/colossalai.initialize.rst b/docs/colossalai/colossalai.initialize.rst deleted file mode 100644 index d3f65076a795876a34d3bcbcc03a4b4b96a28e79..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.initialize.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.initialize -===================== - -.. automodule:: colossalai.initialize - :members: diff --git a/docs/colossalai/colossalai.kernel.cuda_native.layer_norm.rst b/docs/colossalai/colossalai.kernel.cuda_native.layer_norm.rst deleted file mode 100644 index b8bff51bef34d1cd5d515f1fb36da8d5634af13c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.cuda_native.layer_norm.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.kernel.cuda\_native.layer\_norm -========================================== - -.. automodule:: colossalai.kernel.cuda_native.layer_norm - :members: diff --git a/docs/colossalai/colossalai.kernel.cuda_native.multihead_attention.rst b/docs/colossalai/colossalai.kernel.cuda_native.multihead_attention.rst deleted file mode 100644 index de7577d195cd70de7054af3d68d1b33275f90f5c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.cuda_native.multihead_attention.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.kernel.cuda\_native.multihead\_attention -=================================================== - -.. automodule:: colossalai.kernel.cuda_native.multihead_attention - :members: diff --git a/docs/colossalai/colossalai.kernel.cuda_native.rst b/docs/colossalai/colossalai.kernel.cuda_native.rst deleted file mode 100644 index d88e4cfdb761f37266ce9ee2433ef440a8923105..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.cuda_native.rst +++ /dev/null @@ -1,13 +0,0 @@ -colossalai.kernel.cuda\_native -============================== - -.. automodule:: colossalai.kernel.cuda_native - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.kernel.cuda_native.layer_norm - colossalai.kernel.cuda_native.multihead_attention - colossalai.kernel.cuda_native.scaled_softmax diff --git a/docs/colossalai/colossalai.kernel.cuda_native.scaled_softmax.rst b/docs/colossalai/colossalai.kernel.cuda_native.scaled_softmax.rst deleted file mode 100644 index 474fcd3349bd79c9db707ef462cd31ac3ccde249..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.cuda_native.scaled_softmax.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.kernel.cuda\_native.scaled\_softmax -============================================== - -.. automodule:: colossalai.kernel.cuda_native.scaled_softmax - :members: diff --git a/docs/colossalai/colossalai.kernel.jit.bias_dropout_add.rst b/docs/colossalai/colossalai.kernel.jit.bias_dropout_add.rst deleted file mode 100644 index d61550928bc8742060ce912cf55725d29a3168e4..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.jit.bias_dropout_add.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.kernel.jit.bias\_dropout\_add -======================================== - -.. 
automodule:: colossalai.kernel.jit.bias_dropout_add - :members: diff --git a/docs/colossalai/colossalai.kernel.jit.bias_gelu.rst b/docs/colossalai/colossalai.kernel.jit.bias_gelu.rst deleted file mode 100644 index 7db184b4ce3bd96d692debe06f86ef7d000318ed..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.jit.bias_gelu.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.kernel.jit.bias\_gelu -================================ - -.. automodule:: colossalai.kernel.jit.bias_gelu - :members: diff --git a/docs/colossalai/colossalai.kernel.jit.option.rst b/docs/colossalai/colossalai.kernel.jit.option.rst deleted file mode 100644 index 15ebfc83aa7744444d2b60471264cb7d415b26dc..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.jit.option.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.kernel.jit.option -============================ - -.. automodule:: colossalai.kernel.jit.option - :members: diff --git a/docs/colossalai/colossalai.kernel.jit.rst b/docs/colossalai/colossalai.kernel.jit.rst deleted file mode 100644 index 8b2f728d34d55d326979827a6cb3e23022ebb99e..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.jit.rst +++ /dev/null @@ -1,13 +0,0 @@ -colossalai.kernel.jit -===================== - -.. automodule:: colossalai.kernel.jit - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.kernel.jit.bias_dropout_add - colossalai.kernel.jit.bias_gelu - colossalai.kernel.jit.option diff --git a/docs/colossalai/colossalai.kernel.rst b/docs/colossalai/colossalai.kernel.rst deleted file mode 100644 index dcbac8c1de76167ee5dcaccc8dde8c3805759ae1..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.kernel.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.kernel -================= - -.. automodule:: colossalai.kernel - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.kernel.cuda_native - colossalai.kernel.jit diff --git a/docs/colossalai/colossalai.logging.logging.rst b/docs/colossalai/colossalai.logging.logging.rst deleted file mode 100644 index 05374b8f41773401915de6b7f94a3ee56661171f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.logging.logging.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.logging.logging -========================== - -.. automodule:: colossalai.logging.logging - :members: diff --git a/docs/colossalai/colossalai.logging.rst b/docs/colossalai/colossalai.logging.rst deleted file mode 100644 index a7a5cec72b81000dd942e2f837caabf19217dd38..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.logging.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.logging -================== - -.. automodule:: colossalai.logging - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.logging.logging diff --git a/docs/colossalai/colossalai.nn.init.rst b/docs/colossalai/colossalai.nn.init.rst deleted file mode 100644 index d0ab993126d5b3b63b7d0aeab86031d716f1b301..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.init.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.init -================== - -.. automodule:: colossalai.nn.init - :members: diff --git a/docs/colossalai/colossalai.nn.layer.base_layer.rst b/docs/colossalai/colossalai.nn.layer.base_layer.rst deleted file mode 100644 index c2a22f04d3f37c22b54aaacaf89b962e947d7d80..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.base_layer.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.base\_layer -=============================== - -.. 
automodule:: colossalai.nn.layer.base_layer - :members: diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.dropout.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.dropout.rst deleted file mode 100644 index ec1dfd395f1709ddb696b3809c39a19d7a5efd13..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.colossalai_layer.dropout.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.colossalai\_layer.dropout -============================================= - -.. automodule:: colossalai.nn.layer.colossalai_layer.dropout - :members: diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.embedding.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.embedding.rst deleted file mode 100644 index 8438b3a077879e7722e3eb47414ef06adc3d30c3..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.colossalai_layer.embedding.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.colossalai\_layer.embedding -=============================================== - -.. automodule:: colossalai.nn.layer.colossalai_layer.embedding - :members: diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.linear.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.linear.rst deleted file mode 100644 index 3213282549eaca72275d6a300813d937f3348526..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.colossalai_layer.linear.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.colossalai\_layer.linear -============================================ - -.. automodule:: colossalai.nn.layer.colossalai_layer.linear - :members: diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.normalization.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.normalization.rst deleted file mode 100644 index f94dd27b86e43f31719f54d90c033a2c3bd85e6e..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.colossalai_layer.normalization.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.colossalai\_layer.normalization -=================================================== - -.. automodule:: colossalai.nn.layer.colossalai_layer.normalization - :members: diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.rst deleted file mode 100644 index 0f685e6c2dc3a463f237d19764e9c8297360c5a2..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.colossalai_layer.rst +++ /dev/null @@ -1,14 +0,0 @@ -colossalai.nn.layer.colossalai\_layer -===================================== - -.. automodule:: colossalai.nn.layer.colossalai_layer - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.colossalai_layer.dropout - colossalai.nn.layer.colossalai_layer.embedding - colossalai.nn.layer.colossalai_layer.linear - colossalai.nn.layer.colossalai_layer.normalization diff --git a/docs/colossalai/colossalai.nn.layer.moe.layers.rst b/docs/colossalai/colossalai.nn.layer.moe.layers.rst deleted file mode 100644 index d109d47b8174375f561b717236bc9020c2e3675d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.moe.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.moe.layers -============================== - -.. 
automodule:: colossalai.nn.layer.moe.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.moe.rst b/docs/colossalai/colossalai.nn.layer.moe.rst deleted file mode 100644 index 403d39817c84058b7c53768e39dc0753de7557e4..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.moe.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.moe -======================= - -.. automodule:: colossalai.nn.layer.moe - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.moe.layers diff --git a/docs/colossalai/colossalai.nn.layer.parallel_1d.layers.rst b/docs/colossalai/colossalai.nn.layer.parallel_1d.layers.rst deleted file mode 100644 index 380f6bf8d134d55482902d9ea6c1b18f09955bb0..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_1d.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.parallel\_1d.layers -======================================= - -.. automodule:: colossalai.nn.layer.parallel_1d.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.parallel_1d.rst b/docs/colossalai/colossalai.nn.layer.parallel_1d.rst deleted file mode 100644 index 3a8ed620672189e6178925d46a21753b7d4f79e3..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_1d.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.parallel\_1d -================================ - -.. automodule:: colossalai.nn.layer.parallel_1d - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.parallel_1d.layers diff --git a/docs/colossalai/colossalai.nn.layer.parallel_2d.layers.rst b/docs/colossalai/colossalai.nn.layer.parallel_2d.layers.rst deleted file mode 100644 index b64d402bdf3e608fc522f0567319a37b4bd35b2a..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_2d.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.parallel\_2d.layers -======================================= - -.. automodule:: colossalai.nn.layer.parallel_2d.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.parallel_2d.rst b/docs/colossalai/colossalai.nn.layer.parallel_2d.rst deleted file mode 100644 index f5ad41a1b450ae721809fe912a762006bd77e8ad..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_2d.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.parallel\_2d -================================ - -.. automodule:: colossalai.nn.layer.parallel_2d - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.parallel_2d.layers diff --git a/docs/colossalai/colossalai.nn.layer.parallel_2p5d.layers.rst b/docs/colossalai/colossalai.nn.layer.parallel_2p5d.layers.rst deleted file mode 100644 index ebc99d56ccdc58675d99e6d25c6e5867f9144e4f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_2p5d.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.parallel\_2p5d.layers -========================================= - -.. automodule:: colossalai.nn.layer.parallel_2p5d.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst b/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst deleted file mode 100644 index 5869bdee9928d1843b171011fb7f9e86169d6fe8..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.parallel\_2p5d -================================== - -.. automodule:: colossalai.nn.layer.parallel_2p5d - :members: - - -.. 
toctree:: - :maxdepth: 2 - - colossalai.nn.layer.parallel_2p5d.layers diff --git a/docs/colossalai/colossalai.nn.layer.parallel_3d.layers.rst b/docs/colossalai/colossalai.nn.layer.parallel_3d.layers.rst deleted file mode 100644 index a1702f1fcf627cd5d49996af5d65dd1388007e7a..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_3d.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.parallel\_3d.layers -======================================= - -.. automodule:: colossalai.nn.layer.parallel_3d.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.parallel_3d.rst b/docs/colossalai/colossalai.nn.layer.parallel_3d.rst deleted file mode 100644 index bb55a63e507d60e010c34c2edcc7a90fd21b0dc1..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_3d.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.parallel\_3d -================================ - -.. automodule:: colossalai.nn.layer.parallel_3d - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.parallel_3d.layers diff --git a/docs/colossalai/colossalai.nn.layer.parallel_sequence.layers.rst b/docs/colossalai/colossalai.nn.layer.parallel_sequence.layers.rst deleted file mode 100644 index 54929d2e71690bda6a390e8ff64ad61f4f077e25..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_sequence.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.parallel\_sequence.layers -============================================= - -.. automodule:: colossalai.nn.layer.parallel_sequence.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst b/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst deleted file mode 100644 index 24e8941d4ec4e6c64fb30c377055dbe8176283b2..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.parallel\_sequence -====================================== - -.. automodule:: colossalai.nn.layer.parallel_sequence - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.parallel_sequence.layers diff --git a/docs/colossalai/colossalai.nn.layer.rst b/docs/colossalai/colossalai.nn.layer.rst deleted file mode 100644 index 32a93128f2a40431768048f2394f8427955eca7d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.rst +++ /dev/null @@ -1,25 +0,0 @@ -colossalai.nn.layer -=================== - -.. automodule:: colossalai.nn.layer - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.colossalai_layer - colossalai.nn.layer.moe - colossalai.nn.layer.parallel_1d - colossalai.nn.layer.parallel_2d - colossalai.nn.layer.parallel_2p5d - colossalai.nn.layer.parallel_3d - colossalai.nn.layer.parallel_sequence - colossalai.nn.layer.utils - colossalai.nn.layer.vanilla - colossalai.nn.layer.wrapper - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.base_layer diff --git a/docs/colossalai/colossalai.nn.layer.utils.common.rst b/docs/colossalai/colossalai.nn.layer.utils.common.rst deleted file mode 100644 index 6a552830f8f56652fd7f721087dfcefc278cffef..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.utils.common.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.utils.common -================================ - -.. 
automodule:: colossalai.nn.layer.utils.common - :members: diff --git a/docs/colossalai/colossalai.nn.layer.utils.rst b/docs/colossalai/colossalai.nn.layer.utils.rst deleted file mode 100644 index 16c3d718286a10211684de54b36a474defef771d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.utils.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.utils -========================= - -.. automodule:: colossalai.nn.layer.utils - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.utils.common diff --git a/docs/colossalai/colossalai.nn.layer.vanilla.layers.rst b/docs/colossalai/colossalai.nn.layer.vanilla.layers.rst deleted file mode 100644 index f993b1f50e5bb5a57eb4699175aab3964cc68647..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.vanilla.layers.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.vanilla.layers -================================== - -.. automodule:: colossalai.nn.layer.vanilla.layers - :members: diff --git a/docs/colossalai/colossalai.nn.layer.vanilla.rst b/docs/colossalai/colossalai.nn.layer.vanilla.rst deleted file mode 100644 index fe1ea5c6c53e4ead032a0e9707059454804dfd7d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.vanilla.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.layer.vanilla -=========================== - -.. automodule:: colossalai.nn.layer.vanilla - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.vanilla.layers diff --git a/docs/colossalai/colossalai.nn.layer.wrapper.lambda_wrapper.rst b/docs/colossalai/colossalai.nn.layer.wrapper.lambda_wrapper.rst deleted file mode 100644 index f2ced672594c56398265cd1d4fd7349cacb8a277..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.wrapper.lambda_wrapper.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.wrapper.lambda\_wrapper -=========================================== - -.. automodule:: colossalai.nn.layer.wrapper.lambda_wrapper - :members: diff --git a/docs/colossalai/colossalai.nn.layer.wrapper.pipeline_wrapper.rst b/docs/colossalai/colossalai.nn.layer.wrapper.pipeline_wrapper.rst deleted file mode 100644 index e5648873d34b9c618700047ba7f3d08fd007dcfe..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.wrapper.pipeline_wrapper.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.layer.wrapper.pipeline\_wrapper -============================================= - -.. automodule:: colossalai.nn.layer.wrapper.pipeline_wrapper - :members: diff --git a/docs/colossalai/colossalai.nn.layer.wrapper.rst b/docs/colossalai/colossalai.nn.layer.wrapper.rst deleted file mode 100644 index 4e66651dca5b1e9b17a0d9da63458f9c0d99ce7d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.layer.wrapper.rst +++ /dev/null @@ -1,12 +0,0 @@ -colossalai.nn.layer.wrapper -=========================== - -.. automodule:: colossalai.nn.layer.wrapper - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer.wrapper.lambda_wrapper - colossalai.nn.layer.wrapper.pipeline_wrapper diff --git a/docs/colossalai/colossalai.nn.loss.loss_2d.rst b/docs/colossalai/colossalai.nn.loss.loss_2d.rst deleted file mode 100644 index 14d1585e3e0fe42943b40b75280fab2bf5993300..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.loss.loss_2d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.loss.loss\_2d -=========================== - -.. 
automodule:: colossalai.nn.loss.loss_2d - :members: diff --git a/docs/colossalai/colossalai.nn.loss.loss_2p5d.rst b/docs/colossalai/colossalai.nn.loss.loss_2p5d.rst deleted file mode 100644 index fc3714da36301a65e88bf1856cf74375580cce19..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.loss.loss_2p5d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.loss.loss\_2p5d -============================= - -.. automodule:: colossalai.nn.loss.loss_2p5d - :members: diff --git a/docs/colossalai/colossalai.nn.loss.loss_3d.rst b/docs/colossalai/colossalai.nn.loss.loss_3d.rst deleted file mode 100644 index a593324fb4f16741383477f3b44233e54c354859..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.loss.loss_3d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.loss.loss\_3d -=========================== - -.. automodule:: colossalai.nn.loss.loss_3d - :members: diff --git a/docs/colossalai/colossalai.nn.loss.loss_moe.rst b/docs/colossalai/colossalai.nn.loss.loss_moe.rst deleted file mode 100644 index ef2851ace83a0fb1c98464c761f0bf6d1063234f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.loss.loss_moe.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.loss.loss\_moe -============================ - -.. automodule:: colossalai.nn.loss.loss_moe - :members: diff --git a/docs/colossalai/colossalai.nn.loss.rst b/docs/colossalai/colossalai.nn.loss.rst deleted file mode 100644 index 5677b74483c094d87d85061a77182e531e7df84c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.loss.rst +++ /dev/null @@ -1,14 +0,0 @@ -colossalai.nn.loss -================== - -.. automodule:: colossalai.nn.loss - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.loss.loss_2d - colossalai.nn.loss.loss_2p5d - colossalai.nn.loss.loss_3d - colossalai.nn.loss.loss_moe diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.cosine.rst b/docs/colossalai/colossalai.nn.lr_scheduler.cosine.rst deleted file mode 100644 index a7c636ad3a364ed85e105df4102960d081edb434..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.cosine.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.cosine -================================== - -.. automodule:: colossalai.nn.lr_scheduler.cosine - :members: diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.delayed.rst b/docs/colossalai/colossalai.nn.lr_scheduler.delayed.rst deleted file mode 100644 index 2a86c4b2a20c4e7f4db1c719d45db68ca475eea5..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.delayed.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.delayed -=================================== - -.. automodule:: colossalai.nn.lr_scheduler.delayed - :members: diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.linear.rst b/docs/colossalai/colossalai.nn.lr_scheduler.linear.rst deleted file mode 100644 index 5e917edc2faf84b49f2025bc2aee4cae8b5fd422..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.linear.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.linear -================================== - -.. 
automodule:: colossalai.nn.lr_scheduler.linear - :members: diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.multistep.rst b/docs/colossalai/colossalai.nn.lr_scheduler.multistep.rst deleted file mode 100644 index 4248a638637543a0196616fa2addc17cc79a2f6d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.multistep.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.multistep -===================================== - -.. automodule:: colossalai.nn.lr_scheduler.multistep - :members: diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.onecycle.rst b/docs/colossalai/colossalai.nn.lr_scheduler.onecycle.rst deleted file mode 100644 index 7f2fd47586fea3ef2114b23654c56f64827b49bd..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.onecycle.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.onecycle -==================================== - -.. automodule:: colossalai.nn.lr_scheduler.onecycle - :members: diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.poly.rst b/docs/colossalai/colossalai.nn.lr_scheduler.poly.rst deleted file mode 100644 index c1618812aa0c34b31deb3276a1c671ba142c2b80..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.poly.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.poly -================================ - -.. automodule:: colossalai.nn.lr_scheduler.poly - :members: diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.rst b/docs/colossalai/colossalai.nn.lr_scheduler.rst deleted file mode 100644 index 427a3ee4529e45ec0adff7707a8f90e6650ec5ce..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.rst +++ /dev/null @@ -1,17 +0,0 @@ -colossalai.nn.lr\_scheduler -=========================== - -.. automodule:: colossalai.nn.lr_scheduler - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.lr_scheduler.cosine - colossalai.nn.lr_scheduler.delayed - colossalai.nn.lr_scheduler.linear - colossalai.nn.lr_scheduler.multistep - colossalai.nn.lr_scheduler.onecycle - colossalai.nn.lr_scheduler.poly - colossalai.nn.lr_scheduler.torch diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.torch.rst b/docs/colossalai/colossalai.nn.lr_scheduler.torch.rst deleted file mode 100644 index f8d552bf1d62d069923ae713f159d0b5eeefd10a..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.lr_scheduler.torch.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.lr\_scheduler.torch -================================= - -.. automodule:: colossalai.nn.lr_scheduler.torch - :members: diff --git a/docs/colossalai/colossalai.nn.metric.accuracy_2d.rst b/docs/colossalai/colossalai.nn.metric.accuracy_2d.rst deleted file mode 100644 index 63bcb834976384874049930eb21da742a5d1835b..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.metric.accuracy_2d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.metric.accuracy\_2d -================================= - -.. automodule:: colossalai.nn.metric.accuracy_2d - :members: diff --git a/docs/colossalai/colossalai.nn.metric.accuracy_2p5d.rst b/docs/colossalai/colossalai.nn.metric.accuracy_2p5d.rst deleted file mode 100644 index dd4358fbff72eb5df642168cc4674a85de041387..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.metric.accuracy_2p5d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.metric.accuracy\_2p5d -=================================== - -.. 
automodule:: colossalai.nn.metric.accuracy_2p5d - :members: diff --git a/docs/colossalai/colossalai.nn.metric.accuracy_3d.rst b/docs/colossalai/colossalai.nn.metric.accuracy_3d.rst deleted file mode 100644 index 95143444b945e35c46fa94528e4e7bddf27a19dd..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.metric.accuracy_3d.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.metric.accuracy\_3d -================================= - -.. automodule:: colossalai.nn.metric.accuracy_3d - :members: diff --git a/docs/colossalai/colossalai.nn.metric.rst b/docs/colossalai/colossalai.nn.metric.rst deleted file mode 100644 index 28f5568eb84696011e46fdfaad59aed00964b094..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.metric.rst +++ /dev/null @@ -1,13 +0,0 @@ -colossalai.nn.metric -==================== - -.. automodule:: colossalai.nn.metric - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.metric.accuracy_2d - colossalai.nn.metric.accuracy_2p5d - colossalai.nn.metric.accuracy_3d diff --git a/docs/colossalai/colossalai.nn.model.model_from_config.rst b/docs/colossalai/colossalai.nn.model.model_from_config.rst deleted file mode 100644 index fadb5fd0f7bb15d65aa443aa3c4d22551600cc3d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.model.model_from_config.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.model.model\_from\_config -======================================= - -.. automodule:: colossalai.nn.model.model_from_config - :members: diff --git a/docs/colossalai/colossalai.nn.model.rst b/docs/colossalai/colossalai.nn.model.rst deleted file mode 100644 index 5756e11cdc2b793a2df209e68d6e402754f75964..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.model.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.nn.model -=================== - -.. automodule:: colossalai.nn.model - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.model.model_from_config diff --git a/docs/colossalai/colossalai.nn.optimizer.colossalai_optimizer.rst b/docs/colossalai/colossalai.nn.optimizer.colossalai_optimizer.rst deleted file mode 100644 index 35515c374f3360452f74ed95a4750f255ef4ba56..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.colossalai_optimizer.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.optimizer.colossalai\_optimizer -============================================= - -.. automodule:: colossalai.nn.optimizer.colossalai_optimizer - :members: diff --git a/docs/colossalai/colossalai.nn.optimizer.fused_adam.rst b/docs/colossalai/colossalai.nn.optimizer.fused_adam.rst deleted file mode 100644 index 60af624cb6c12f78d488d651cc612c82bf55ad8c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.fused_adam.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.optimizer.fused\_adam -=================================== - -.. automodule:: colossalai.nn.optimizer.fused_adam - :members: diff --git a/docs/colossalai/colossalai.nn.optimizer.fused_lamb.rst b/docs/colossalai/colossalai.nn.optimizer.fused_lamb.rst deleted file mode 100644 index 66c0fa4ca1c7c8880ceb48b564a621c55be4687d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.fused_lamb.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.optimizer.fused\_lamb -=================================== - -.. 
automodule:: colossalai.nn.optimizer.fused_lamb - :members: diff --git a/docs/colossalai/colossalai.nn.optimizer.fused_sgd.rst b/docs/colossalai/colossalai.nn.optimizer.fused_sgd.rst deleted file mode 100644 index 2ecc77c33d88cf11e8075bc04dcc17026eeadc75..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.fused_sgd.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.optimizer.fused\_sgd -================================== - -.. automodule:: colossalai.nn.optimizer.fused_sgd - :members: diff --git a/docs/colossalai/colossalai.nn.optimizer.lamb.rst b/docs/colossalai/colossalai.nn.optimizer.lamb.rst deleted file mode 100644 index 57199ea3695132e4a6e76b5cf41da81ce7a37bd8..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.lamb.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.optimizer.lamb -============================ - -.. automodule:: colossalai.nn.optimizer.lamb - :members: diff --git a/docs/colossalai/colossalai.nn.optimizer.lars.rst b/docs/colossalai/colossalai.nn.optimizer.lars.rst deleted file mode 100644 index f935950f8b5a2a1a8a5a757d7454a875d4db69c6..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.lars.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.nn.optimizer.lars -============================ - -.. automodule:: colossalai.nn.optimizer.lars - :members: diff --git a/docs/colossalai/colossalai.nn.optimizer.rst b/docs/colossalai/colossalai.nn.optimizer.rst deleted file mode 100644 index 7fbd814066abf427fe389aef4db524d53a98193e..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.optimizer.rst +++ /dev/null @@ -1,16 +0,0 @@ -colossalai.nn.optimizer -======================= - -.. automodule:: colossalai.nn.optimizer - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.optimizer.colossalai_optimizer - colossalai.nn.optimizer.fused_adam - colossalai.nn.optimizer.fused_lamb - colossalai.nn.optimizer.fused_sgd - colossalai.nn.optimizer.lamb - colossalai.nn.optimizer.lars diff --git a/docs/colossalai/colossalai.nn.rst b/docs/colossalai/colossalai.nn.rst deleted file mode 100644 index 32e5eae2fbf1cf7c06906015641af99c82b67961..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.nn.rst +++ /dev/null @@ -1,21 +0,0 @@ -colossalai.nn -============= - -.. automodule:: colossalai.nn - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.layer - colossalai.nn.loss - colossalai.nn.lr_scheduler - colossalai.nn.metric - colossalai.nn.model - colossalai.nn.optimizer - - -.. toctree:: - :maxdepth: 2 - - colossalai.nn.init diff --git a/docs/colossalai/colossalai.registry.registry.rst b/docs/colossalai/colossalai.registry.registry.rst deleted file mode 100644 index e942d7969b60beb309f38e3ea1b5e82614941a4c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.registry.registry.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.registry.registry -============================ - -.. automodule:: colossalai.registry.registry - :members: diff --git a/docs/colossalai/colossalai.registry.rst b/docs/colossalai/colossalai.registry.rst deleted file mode 100644 index 0f294f6d15a7285709b69e6c3cddaa2cc2e47833..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.registry.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.registry -=================== - -.. automodule:: colossalai.registry - :members: - - -.. 
toctree:: - :maxdepth: 2 - - colossalai.registry.registry diff --git a/docs/colossalai/colossalai.rst b/docs/colossalai/colossalai.rst deleted file mode 100644 index eca3e273aff71dc35ba07403cec80ed347b1c76f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.rst +++ /dev/null @@ -1,27 +0,0 @@ -colossalai -========== - -.. automodule:: colossalai - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.amp - colossalai.builder - colossalai.communication - colossalai.context - colossalai.engine - colossalai.kernel - colossalai.logging - colossalai.nn - colossalai.registry - colossalai.trainer - colossalai.utils - colossalai.zero - - -.. toctree:: - :maxdepth: 2 - - colossalai.initialize diff --git a/docs/colossalai/colossalai.trainer.hooks.rst b/docs/colossalai/colossalai.trainer.hooks.rst deleted file mode 100644 index 84cc6797b83138669f216162e7d19ff024a5e21f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.trainer.hooks.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.trainer.hooks -======================== - -.. automodule:: colossalai.trainer.hooks - :members: diff --git a/docs/colossalai/colossalai.trainer.rst b/docs/colossalai/colossalai.trainer.rst deleted file mode 100644 index abc636e623737ab86669521c6262c9352e504d2c..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.trainer.rst +++ /dev/null @@ -1,10 +0,0 @@ -colossalai.trainer -================== - -.. automodule:: colossalai.trainer - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.trainer.hooks diff --git a/docs/colossalai/colossalai.utils.activation_checkpoint.rst b/docs/colossalai/colossalai.utils.activation_checkpoint.rst deleted file mode 100644 index 671b5fe9e9c452ea608e4fd9e74b2046a13073d3..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.activation_checkpoint.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.activation\_checkpoint -======================================= - -.. automodule:: colossalai.utils.activation_checkpoint - :members: diff --git a/docs/colossalai/colossalai.utils.checkpointing.rst b/docs/colossalai/colossalai.utils.checkpointing.rst deleted file mode 100644 index 534a581d536406a26a288f39d6f761d60c16869f..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.checkpointing.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.checkpointing -============================== - -.. automodule:: colossalai.utils.checkpointing - :members: diff --git a/docs/colossalai/colossalai.utils.common.rst b/docs/colossalai/colossalai.utils.common.rst deleted file mode 100644 index cb9f9c14ef4fb14cda1058ee9783a970c5365a74..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.common.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.common -======================= - -.. automodule:: colossalai.utils.common - :members: diff --git a/docs/colossalai/colossalai.utils.cuda.rst b/docs/colossalai/colossalai.utils.cuda.rst deleted file mode 100644 index ec428c5ef6ea2e3f4fe9b3ce0def3fe2417fd1f3..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.cuda.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.cuda -===================== - -.. 
automodule:: colossalai.utils.cuda - :members: diff --git a/docs/colossalai/colossalai.utils.data_sampler.base_sampler.rst b/docs/colossalai/colossalai.utils.data_sampler.base_sampler.rst deleted file mode 100644 index 199e8fcf83c35c9303baad559a0e10da27197d52..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.data_sampler.base_sampler.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.data\_sampler.base\_sampler -============================================ - -.. automodule:: colossalai.utils.data_sampler.base_sampler - :members: diff --git a/docs/colossalai/colossalai.utils.data_sampler.data_parallel_sampler.rst b/docs/colossalai/colossalai.utils.data_sampler.data_parallel_sampler.rst deleted file mode 100644 index 85e1b121c682310dc8f9930df90f06e1ed32ae80..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.data_sampler.data_parallel_sampler.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.data\_sampler.data\_parallel\_sampler -====================================================== - -.. automodule:: colossalai.utils.data_sampler.data_parallel_sampler - :members: diff --git a/docs/colossalai/colossalai.utils.data_sampler.rst b/docs/colossalai/colossalai.utils.data_sampler.rst deleted file mode 100644 index 61dde070bad445a582bf9e198402b76b1768623d..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.data_sampler.rst +++ /dev/null @@ -1,12 +0,0 @@ -colossalai.utils.data\_sampler -============================== - -.. automodule:: colossalai.utils.data_sampler - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.utils.data_sampler.base_sampler - colossalai.utils.data_sampler.data_parallel_sampler diff --git a/docs/colossalai/colossalai.utils.gradient_accumulation.rst b/docs/colossalai/colossalai.utils.gradient_accumulation.rst deleted file mode 100644 index 6ad2ca3ae2f3aeb38ff5b81351de134ed25b1661..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.gradient_accumulation.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.gradient\_accumulation -======================================= - -.. automodule:: colossalai.utils.gradient_accumulation - :members: diff --git a/docs/colossalai/colossalai.utils.memory.rst b/docs/colossalai/colossalai.utils.memory.rst deleted file mode 100644 index 67c5d60022dddf5293b66ea048cdc13b6bc6bdaa..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.memory.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.memory -======================= - -.. automodule:: colossalai.utils.memory - :members: diff --git a/docs/colossalai/colossalai.utils.multi_tensor_apply.multi_tensor_apply.rst b/docs/colossalai/colossalai.utils.multi_tensor_apply.multi_tensor_apply.rst deleted file mode 100644 index 493b9530e0f614409ce33c3a3c6f013c261e546b..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.multi_tensor_apply.multi_tensor_apply.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.multi\_tensor\_apply.multi\_tensor\_apply -========================================================== - -.. 
automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply - :members: diff --git a/docs/colossalai/colossalai.utils.multi_tensor_apply.rst b/docs/colossalai/colossalai.utils.multi_tensor_apply.rst deleted file mode 100644 index d5749cfa8801c4ad5f38b6037e1621e6ea011ab8..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.multi_tensor_apply.rst +++ /dev/null @@ -1,11 +0,0 @@ -colossalai.utils.multi\_tensor\_apply -===================================== - -.. automodule:: colossalai.utils.multi_tensor_apply - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.utils.multi_tensor_apply.multi_tensor_apply diff --git a/docs/colossalai/colossalai.utils.rst b/docs/colossalai/colossalai.utils.rst deleted file mode 100644 index 5a7d2ea5c8126126fcce50fa9aa30aeb1bafc415..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.rst +++ /dev/null @@ -1,23 +0,0 @@ -colossalai.utils -================ - -.. automodule:: colossalai.utils - :members: - -.. toctree:: - :maxdepth: 2 - - colossalai.utils.data_sampler - colossalai.utils.gradient_accumulation - colossalai.utils.multi_tensor_apply - - -.. toctree:: - :maxdepth: 2 - - colossalai.utils.activation_checkpoint - colossalai.utils.checkpointing - colossalai.utils.common - colossalai.utils.cuda - colossalai.utils.memory - colossalai.utils.timer diff --git a/docs/colossalai/colossalai.utils.timer.rst b/docs/colossalai/colossalai.utils.timer.rst deleted file mode 100644 index 2014c85f548f6e6d4211bba74b95656d2fe30ef8..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.utils.timer.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.utils.timer -====================== - -.. automodule:: colossalai.utils.timer - :members: diff --git a/docs/colossalai/colossalai.zero.loss_scaler.rst b/docs/colossalai/colossalai.zero.loss_scaler.rst deleted file mode 100644 index 71c4d4446e98ef0e7b7f48b05c16223406619b50..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.zero.loss_scaler.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.zero.loss\_scaler -============================ - -.. automodule:: colossalai.zero.loss_scaler - :members: diff --git a/docs/colossalai/colossalai.zero.rst b/docs/colossalai/colossalai.zero.rst deleted file mode 100644 index 136c3c51eea9ef7b704dc633f26ea567ffb7223e..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.zero.rst +++ /dev/null @@ -1,13 +0,0 @@ -colossalai.zero -=============== - -.. automodule:: colossalai.zero - :members: - - -.. toctree:: - :maxdepth: 2 - - colossalai.zero.loss_scaler - colossalai.zero.zero_redundancy_optimizer_level_2 - colossalai.zero.zero_redundancy_optimizer_level_3 diff --git a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst deleted file mode 100644 index 5929d5c1253c16ed15994593bea3d4179bbb9364..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.zero.zero\_redundancy\_optimizer\_level\_2 -===================================================== - -.. 
automodule:: colossalai.zero.zero_redundancy_optimizer_level_2 - :members: diff --git a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst deleted file mode 100644 index 063dba60b2c1c13904687d73bc314406ea6b31b8..0000000000000000000000000000000000000000 --- a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst +++ /dev/null @@ -1,5 +0,0 @@ -colossalai.zero.zero\_redundancy\_optimizer\_level\_3 -===================================================== - -.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_3 - :members: diff --git a/docs/conf.py b/docs/conf.py deleted file mode 100644 index 4d45312f523404768c22893635e71384796f1a6d..0000000000000000000000000000000000000000 --- a/docs/conf.py +++ /dev/null @@ -1,89 +0,0 @@ -# Configuration file for the Sphinx documentation builder. -# -# This file only contains a selection of the most common options. For a full -# list see the documentation: -# https://www.sphinx-doc.org/en/master/usage/configuration.html - -# -- Path setup -------------------------------------------------------------- - -import datetime -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -import os -import sys - -sys.path.insert(0, os.path.abspath('..')) - -# -- Project information ----------------------------------------------------- - -project = 'Colossal-AI' -copyright = f'{datetime.datetime.now().year}, HPC-AI Tech' -author = 'HPC-AI Technology Inc.' - -# The full version, including alpha/beta/rc tags -release = '0.0.1' - - -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc', - 'sphinx.ext.mathjax', - 'sphinx.ext.napoleon', - 'myst_parser', -] - -# Disable docstring inheritance -autodoc_inherit_docstrings = False - -# Disable displaying type annotations, these can be very verbose -autodoc_typehints = 'none' - -# Enable overriding of function signatures in the first line of the docstring. -autodoc_docstring_signature = True -autodoc_default_options = { - 'member-order': 'bysource' -} - -# Add any paths that contain templates here, relative to this directory. -templates_path = ['_templates'] - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -# This pattern also affects html_static_path and html_extra_path. -exclude_patterns = ['.build', 'Thumbs.db', '.DS_Store'] - -# -- Options for HTML output ------------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -# -html_theme = 'sphinx_rtd_theme' -html_show_sourcelink = False -html_theme_options = { - 'navigation_depth': 2, -} - -html_context = { - 'display_github': False, - 'github_user': 'hpcaitech', - 'github_repo': 'ColossalAI', - # 'github_version': 'master/docs/', -} - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". 
-html_static_path = ['_static'] - -html_css_files = [ - 'css/rtd_theme.css', -] - -# -- Extension configuration ------------------------------------------------- -source_suffix = ['.rst', '.md', '.MD'] diff --git a/docs/images/Colossal-AI_logo.png b/docs/images/Colossal-AI_logo.png deleted file mode 100644 index 886f35bebd056c3a53244f3cd9e131e04c5ad2bd..0000000000000000000000000000000000000000 Binary files a/docs/images/Colossal-AI_logo.png and /dev/null differ diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index b29450f58d551d0d31e270f6e966a19fd3b20864..0000000000000000000000000000000000000000 --- a/docs/index.rst +++ /dev/null @@ -1,18 +0,0 @@ -.. Colossal-AI documentation master file, created by - sphinx-quickstart on Mon Oct 11 17:05:05 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Colossal-AI API documentation -====================================== -.. toctree:: - :maxdepth: 2 - :caption: API REFERENCE - - colossalai/colossalai - - -Indices and tables -================== - -* :ref:`genindex` \ No newline at end of file diff --git a/docs/make.bat b/docs/make.bat deleted file mode 100644 index cf73214110f2aa6d830a1d40ff6c5a7125fb1d0c..0000000000000000000000000000000000000000 --- a/docs/make.bat +++ /dev/null @@ -1,35 +0,0 @@ -@ECHO OFF - -pushd %~dp0 - -REM Command file for Sphinx documentation - -if "%SPHINXBUILD%" == "" ( - set SPHINXBUILD=sphinx-build -) -set SOURCEDIR=. -set BUILDDIR=.build - -if "%1" == "" goto help - -%SPHINXBUILD% >NUL 2>NUL -if errorlevel 9009 ( - echo. - echo.The 'sphinx-build' command was not found. Make sure you have Sphinx - echo.installed, then set the SPHINXBUILD environment variable to point - echo.to the full path of the 'sphinx-build' executable. Alternatively you - echo.may add the Sphinx directory to PATH. - echo. - echo.If you don't have Sphinx installed, grab it from - echo.https://www.sphinx-doc.org/ - exit /b 1 -) - -%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% -goto end - -:help -%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% - -:end -popd diff --git a/docs/requirements.txt b/docs/requirements.txt deleted file mode 100644 index ae216364c65863f54fa2ddabfc3ca4597297fa65..0000000000000000000000000000000000000000 --- a/docs/requirements.txt +++ /dev/null @@ -1,6 +0,0 @@ -tensorboard -deepspeed -apex -sphinx -sphinx-rtd-theme -myst-parser \ No newline at end of file diff --git a/hc.log b/hc.log deleted file mode 100644 index 4b9800ff62bf5edd94e2712ff023d8a88a4cc6c4..0000000000000000000000000000000000000000 --- a/hc.log +++ /dev/null @@ -1,909 +0,0 @@ - -Compiling cuda extensions with -HIP version: 4.3.22313-cccb3896 -clang version 14.0.0 (http://10.8.150.239/dcutoolkit/driverruntime/llvm-project.git 458573e609dd35aac1fa72e6136853de2b7651c8) -Target: x86_64-unknown-linux-gnu -Thread model: posix -InstalledDir: /opt/dtk-22.04.2/llvm/bin -from /opt/dtk-22.04.2/bin - - - -torch.__version__ = 1.10.0a0+gitc7f69d6-dtk22042 - - -nvcc was not found. CUDA extension will not be installed. If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc. 
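The hc.log listing that follows records the hipify pass translating each CUDA source under `colossalai/kernel/cuda_native/` into a HIP counterpart under `colossalai/kernel/hip_native/`. As a reading aid for the path mappings in the log, here is a minimal Python sketch of the renaming convention inferred purely from the log entries themselves; the helper name `to_hip_path` is hypothetical and not part of the repository:

```python
# Illustrative sketch only: the file-renaming convention that the hipify
# listing below appears to follow, reconstructed from the log entries.
# `to_hip_path` is a hypothetical helper, not ColossalAI or hipify API.
import re

def to_hip_path(path: str) -> str:
    """Map a cuda_native source path to its hip_native counterpart."""
    path = path.replace("/cuda_native/", "/hip_native/")
    path = path.replace("cuda_util", "hip_util")    # e.g. kernels/cuda_util.cu
    path = re.sub(r"_cuda(?=\.|_)", "_hip", path)   # e.g. layer_norm_cuda.cpp
    path = re.sub(r"\.cu$", ".hip", path)           # .cu sources become .hip
    return path

print(to_hip_path("colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp"))
# colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp
```

Note that headers keep their extensions in the log (`compat.h`, `multi_tensor_apply.cuh` pass through unchanged), which the sketch preserves by rewriting only a trailing `.cu`.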
-/public/home/huchen/colossalAI/ColossalAI/MANIFEST.in -> /public/home/huchen/colossalAI/ColossalAI/MANIFEST.in ok -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/compat.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/compat.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/type_shim.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/type_shim.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax_hip.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_adam.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax_hip.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.cpp -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/type_shim.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/type_shim.h skipped 
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/layer_norm_hip_kernel.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multihead_attention_1d.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/compat.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/compat.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/cross_entropy.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/normalize_kernels.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/hip_util.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/dropout_kernels.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/cublas_wrappers.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/cublas_wrappers.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/transform_kernels.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/softmax_kernels.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/general_kernels.hip skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/cublas_wrappers.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h skipped 
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/strided_batch_gemm.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/strided_batch_gemm.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/kernels.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/feed_forward.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/cublas_wrappers.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/dropout.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/normalize_layer.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/normalize_layer.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/block_reduce.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/context.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/context.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/ls_cub.cuh -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/ls_cub.cuh skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/cross_entropy_layer.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/softmax.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h -> /public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/kernels.h skipped -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.h -> None ignored -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.h -> None ignored -/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax.cpp -> None ignored 
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/colossal_C_frontend.cpp -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/type_shim.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multihead_attention_1d.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/compat.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multihead_attention_1d.cpp -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax.cpp -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_apply.cuh -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/layer_norm_hip.cpp -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/strided_batch_gemm.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/feed_forward.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/cublas_wrappers.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/dropout.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/normalize_layer.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/block_reduce.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/context.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/ls_cub.cuh -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/cross_entropy_layer.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/hip_util.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/softmax.h -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/include/kernels.h -> None ignored
-Total number of unsupported CUDA function calls: 0
-
-
-Total number of replaced kernel launches: 139
-[each of the five hipify passes below re-lists the same csrc/ sources and headers, all "-> None ignored"; only the newly generated .hip files and the per-pass totals are kept here]
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_adam.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.hip -> None ignored
-Total number of unsupported CUDA function calls: 0
-Total number of replaced kernel launches: 0
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_upper_triang_masked_softmax_hip.hip -> None ignored
-Total number of unsupported CUDA function calls: 0
-Total number of replaced kernel launches: 0
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/scaled_masked_softmax_hip.hip -> None ignored
-Total number of unsupported CUDA function calls: 0
-Total number of replaced kernel launches: 0
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/layer_norm_hip_kernel.hip -> None ignored
-Total number of unsupported CUDA function calls: 0
-Total number of replaced kernel launches: 0
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/cublas_wrappers.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/transform_kernels.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/dropout_kernels.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/normalize_kernels.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/softmax_kernels.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/general_kernels.hip -> None ignored
-/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/kernels/hip_util.hip -> None ignored
-Total number of unsupported CUDA function calls: 0
-Total number of replaced kernel launches: 0
-running bdist_wheel
-running build
-running build_py
-not copying model_zoo/helper.py (output up-to-date)
-[~190 further 'not copying <module>.py (output up-to-date)' lines: every model_zoo/ and colossalai/ Python module is already present in build/lib, so build_py has no work to do]
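The passes above are PyTorch's `hipify` translator rewriting the CUDA sources under `colossalai/kernel/hip_native/csrc` into HIP before compilation: files that need no changes are reported as `-> None ignored`, and `Total number of replaced kernel launches` counts the CUDA `<<<...>>>` launch sites rewritten to HIP launch calls. The log only shows hipify's output, so the invocation below is a minimal sketch under assumed arguments (the `includes` pattern and flags are not taken from this log):

```shell
# Hypothetical standalone hipify pass over the same source tree.
cd /public/home/huchen/colossalAI/ColossalAI
python -c "
from torch.utils.hipify import hipify_python
hipify_python.hipify(
    project_directory='.',
    output_directory='.',
    includes=['colossalai/kernel/hip_native/csrc/*'],
    is_pytorch_extension=True,  # use PyTorch's CUDA->HIP header mappings
    show_detailed=True,         # print a per-file report like the one in this log
)"
```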
-running build_ext
-building 'colossal_C' extension
-Emitting ninja build file /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/build.ninja...
-Compiling objects...
-Using envvar MAX_JOBS (32) as the number of workers...
-Successfully preprocessed all matching files.
-Successfully preprocessed all matching files.
-Successfully preprocessed all matching files.
-Successfully preprocessed all matching files.
-Successfully preprocessed all matching files.
-Successfully preprocessed all matching files.
-ninja: no work to do.
-g++ -pthread -shared /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_l2norm_kernel.o /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_sgd_kernel.o /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_scale_kernel.o /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/colossal_C_frontend.o /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_adam.o /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/public/home/huchen/colossalAI/ColossalAI/colossalai/kernel/hip_native/csrc/multi_tensor_lamb.o -L/usr/local/lib/python3.7/site-packages/torch/lib -L/opt/dtk-22.04.2/lib -L/usr/local/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lamdhip64 -lc10_hip -ltorch_hip -lpython3.7m -o build/lib.linux-x86_64-3.7/colossal_C.cpython-37m-x86_64-linux-gnu.so
-building 'colossal_scaled_upper_triang_masked_softmax' extension
-Emitting ninja build file /public/home/huchen/colossalAI/ColossalAI/build/temp.linux-x86_64-3.7/build.ninja...
-Compiling objects...
-Using envvar MAX_JOBS (32) as the number of workers...
-ninja: no work to do.
-[g++ links scaled_upper_triang_masked_softmax.o and scaled_upper_triang_masked_softmax_hip.o with the same library flags into build/lib.linux-x86_64-3.7/colossal_scaled_upper_triang_masked_softmax.cpython-37m-x86_64-linux-gnu.so]
-building 'colossal_scaled_masked_softmax' extension
-[same ninja preamble ("Emitting ninja build file ...", MAX_JOBS (32), "ninja: no work to do."); g++ links scaled_masked_softmax.o and scaled_masked_softmax_hip.o into build/lib.linux-x86_64-3.7/colossal_scaled_masked_softmax.cpython-37m-x86_64-linux-gnu.so]
-building 'colossal_layer_norm_cuda' extension
-[same ninja preamble; g++ links layer_norm_hip.o and layer_norm_hip_kernel.o into build/lib.linux-x86_64-3.7/colossal_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so]
-building 'colossal_multihead_attention' extension
-[same ninja preamble; g++ links multihead_attention_1d.o with the kernels/ objects (softmax_kernels.o, cublas_wrappers.o, normalize_kernels.o, general_kernels.o, hip_util.o, transform_kernels.o, dropout_kernels.o) into build/lib.linux-x86_64-3.7/colossal_multihead_attention.cpython-37m-x86_64-linux-gnu.so]
-installing to build/bdist.linux-x86_64/wheel
-running install
-running install_lib
-creating build/bdist.linux-x86_64/wheel
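The five `building '...' extension` passes above are driven by setuptools and ninja; `ninja: no work to do.` means the object files from an earlier run were reused, so only the final g++ link against the DTK/ROCm libraries (`-lamdhip64 -lc10_hip -ltorch_hip`, from `/opt/dtk-22.04.2/lib`) is repeated. The exact command that produced this log is not shown; a hedged sketch of reproducing this stage, assuming the same checkout and toolchain:

```shell
# Hypothetical rebuild of the HIP extensions and wheel shown in this log.
cd /public/home/huchen/colossalAI/ColossalAI
MAX_JOBS=32 python setup.py bdist_wheel   # matches "Using envvar MAX_JOBS (32)"
ls build/lib.linux-x86_64-3.7/*.so        # the five compiled colossal_* modules
```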
-copying build/lib.linux-x86_64-3.7/colossal_multihead_attention.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
-copying build/lib.linux-x86_64-3.7/colossal_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
-copying build/lib.linux-x86_64-3.7/colossal_scaled_masked_softmax.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
-copying build/lib.linux-x86_64-3.7/colossal_scaled_upper_triang_masked_softmax.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
-copying build/lib.linux-x86_64-3.7/colossal_C.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
-[~200 further "creating"/"copying" lines stage every model_zoo/ and colossalai/ package (amp, zero, engine, trainer, registry, utils, nn, builder, context, communication, kernel) from build/lib.linux-x86_64-3.7 into build/bdist.linux-x86_64/wheel]
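At this point `install_lib` has staged the pure-Python packages and the five compiled modules under `build/bdist.linux-x86_64/wheel`; the metadata steps below then turn that staging tree into the final wheel. A quick sanity check on the staged extensions (a sketch, not a command from the log):

```shell
# Hypothetical check that all five HIP extensions were staged at the wheel root.
find build/bdist.linux-x86_64/wheel -maxdepth 1 -name '*.so'
```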
build/lib.linux-x86_64-3.7/colossalai/kernel/jit/__init__.py -> build/bdist.linux-x86_64/wheel/colossalai/kernel/jit -copying build/lib.linux-x86_64-3.7/colossalai/kernel/__init__.py -> build/bdist.linux-x86_64/wheel/colossalai/kernel -running install_egg_info -running egg_info -writing colossalai.egg-info/PKG-INFO -writing dependency_links to colossalai.egg-info/dependency_links.txt -writing requirements to colossalai.egg-info/requires.txt -writing top-level names to colossalai.egg-info/top_level.txt -'license_file' option was not specified -reading manifest file 'colossalai.egg-info/SOURCES.txt' -reading manifest template 'MANIFEST.in' -warning: no files found matching '*.txt' -warning: no files found matching '*.tr' under directory 'colossalai' -warning: no files found matching '*.cc' under directory 'colossalai' -writing manifest file 'colossalai.egg-info/SOURCES.txt' -Copying colossalai.egg-info to build/bdist.linux-x86_64/wheel/colossalai-0.0.2-py3.7.egg-info -Copying top_level.txt to build/bdist.linux-x86_64/wheel/colossalai-0.0.2-py3.7.egg-info/top_level.txt -Copying PKG-INFO to build/bdist.linux-x86_64/wheel/colossalai-0.0.2-py3.7.egg-info/PKG-INFO -Copying dependency_links.txt to build/bdist.linux-x86_64/wheel/colossalai-0.0.2-py3.7.egg-info/dependency_links.txt -Copying requires.txt to build/bdist.linux-x86_64/wheel/colossalai-0.0.2-py3.7.egg-info/requires.txt -Copying SOURCES.txt to build/bdist.linux-x86_64/wheel/colossalai-0.0.2-py3.7.egg-info/SOURCES.txt -running install_scripts -adding license file "LICENSE" (matched pattern "LICEN[CS]E*") -creating build/bdist.linux-x86_64/wheel/colossalai-0.0.2.dist-info/WHEEL -creating 'dist/colossalai-0.0.2-cp37-cp37m-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it -adding 'colossal_C.cpython-37m-x86_64-linux-gnu.so' -adding 'colossal_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so' -adding 'colossal_multihead_attention.cpython-37m-x86_64-linux-gnu.so' -adding 'colossal_scaled_masked_softmax.cpython-37m-x86_64-linux-gnu.so' -adding 'colossal_scaled_upper_triang_masked_softmax.cpython-37m-x86_64-linux-gnu.so' -adding 'colossalai/__init__.py' -adding 'colossalai/constants.py' -adding 'colossalai/core.py' -adding 'colossalai/global_variables.py' -adding 'colossalai/initialize.py' -adding 'colossalai/amp/__init__.py' -adding 'colossalai/amp/amp_type.py' -adding 'colossalai/amp/apex_amp/__init__.py' -adding 'colossalai/amp/apex_amp/apex_amp.py' -adding 'colossalai/amp/naive_amp/__init__.py' -adding 'colossalai/amp/naive_amp/_fp16_optimizer.py' -adding 'colossalai/amp/naive_amp/naive_amp.py' -adding 'colossalai/amp/torch_amp/__init__.py' -adding 'colossalai/amp/torch_amp/_grad_scaler.py' -adding 'colossalai/amp/torch_amp/torch_amp.py' -adding 'colossalai/builder/__init__.py' -adding 'colossalai/builder/builder.py' -adding 'colossalai/builder/pipeline.py' -adding 'colossalai/communication/__init__.py' -adding 'colossalai/communication/collective.py' -adding 'colossalai/communication/p2p.py' -adding 'colossalai/communication/ring.py' -adding 'colossalai/communication/utils.py' -adding 'colossalai/context/__init__.py' -adding 'colossalai/context/config.py' -adding 'colossalai/context/parallel_context.py' -adding 'colossalai/context/parallel_mode.py' -adding 'colossalai/context/process_group_initializer/__init__.py' -adding 'colossalai/context/process_group_initializer/initializer_1d.py' -adding 'colossalai/context/process_group_initializer/initializer_2d.py' -adding 
'colossalai/context/process_group_initializer/initializer_2p5d.py' -adding 'colossalai/context/process_group_initializer/initializer_3d.py' -adding 'colossalai/context/process_group_initializer/initializer_data.py' -adding 'colossalai/context/process_group_initializer/initializer_model.py' -adding 'colossalai/context/process_group_initializer/initializer_moe.py' -adding 'colossalai/context/process_group_initializer/initializer_pipeline.py' -adding 'colossalai/context/process_group_initializer/initializer_sequence.py' -adding 'colossalai/context/process_group_initializer/initializer_tensor.py' -adding 'colossalai/context/process_group_initializer/process_group_initializer.py' -adding 'colossalai/context/random/__init__.py' -adding 'colossalai/context/random/_helper.py' -adding 'colossalai/context/random/seed_manager.py' -adding 'colossalai/engine/__init__.py' -adding 'colossalai/engine/_base_engine.py' -adding 'colossalai/engine/gradient_handler/__init__.py' -adding 'colossalai/engine/gradient_handler/_base_gradient_handler.py' -adding 'colossalai/engine/gradient_handler/_data_parallel_gradient_handler.py' -adding 'colossalai/engine/gradient_handler/_moe_gradient_handler.py' -adding 'colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py' -adding 'colossalai/engine/gradient_handler/_sequence_parallel_gradient_handler.py' -adding 'colossalai/engine/gradient_handler/_zero_gradient_handler.py' -adding 'colossalai/engine/ophooks/__init__.py' -adding 'colossalai/engine/ophooks/_base_ophook.py' -adding 'colossalai/engine/ophooks/_memtracer_ophook.py' -adding 'colossalai/engine/schedule/__init__.py' -adding 'colossalai/engine/schedule/_base_schedule.py' -adding 'colossalai/engine/schedule/_non_pipeline_schedule.py' -adding 'colossalai/engine/schedule/_pipeline_schedule.py' -adding 'colossalai/kernel/__init__.py' -adding 'colossalai/kernel/cuda_native/__init__.py' -adding 'colossalai/kernel/cuda_native/layer_norm.py' -adding 'colossalai/kernel/cuda_native/multihead_attention.py' -adding 'colossalai/kernel/cuda_native/scaled_softmax.py' -adding 'colossalai/kernel/jit/__init__.py' -adding 'colossalai/kernel/jit/bias_dropout_add.py' -adding 'colossalai/kernel/jit/bias_gelu.py' -adding 'colossalai/kernel/jit/option.py' -adding 'colossalai/logging/__init__.py' -adding 'colossalai/logging/logging.py' -adding 'colossalai/nn/__init__.py' -adding 'colossalai/nn/init.py' -adding 'colossalai/nn/layer/__init__.py' -adding 'colossalai/nn/layer/base_layer.py' -adding 'colossalai/nn/layer/colossalai_layer/__init__.py' -adding 'colossalai/nn/layer/colossalai_layer/_utils.py' -adding 'colossalai/nn/layer/colossalai_layer/dropout.py' -adding 'colossalai/nn/layer/colossalai_layer/embedding.py' -adding 'colossalai/nn/layer/colossalai_layer/linear.py' -adding 'colossalai/nn/layer/colossalai_layer/normalization.py' -adding 'colossalai/nn/layer/moe/__init__.py' -adding 'colossalai/nn/layer/moe/_operation.py' -adding 'colossalai/nn/layer/moe/layers.py' -adding 'colossalai/nn/layer/parallel_1d/__init__.py' -adding 'colossalai/nn/layer/parallel_1d/_operation.py' -adding 'colossalai/nn/layer/parallel_1d/_utils.py' -adding 'colossalai/nn/layer/parallel_1d/layers.py' -adding 'colossalai/nn/layer/parallel_2d/__init__.py' -adding 'colossalai/nn/layer/parallel_2d/_operation.py' -adding 'colossalai/nn/layer/parallel_2d/_utils.py' -adding 'colossalai/nn/layer/parallel_2d/layers.py' -adding 'colossalai/nn/layer/parallel_2p5d/__init__.py' -adding 'colossalai/nn/layer/parallel_2p5d/_operation.py' -adding 
'colossalai/nn/layer/parallel_2p5d/_utils.py' -adding 'colossalai/nn/layer/parallel_2p5d/layers.py' -adding 'colossalai/nn/layer/parallel_3d/__init__.py' -adding 'colossalai/nn/layer/parallel_3d/_operation.py' -adding 'colossalai/nn/layer/parallel_3d/_utils.py' -adding 'colossalai/nn/layer/parallel_3d/layers.py' -adding 'colossalai/nn/layer/parallel_sequence/__init__.py' -adding 'colossalai/nn/layer/parallel_sequence/_operation.py' -adding 'colossalai/nn/layer/parallel_sequence/_utils.py' -adding 'colossalai/nn/layer/parallel_sequence/layers.py' -adding 'colossalai/nn/layer/utils/__init__.py' -adding 'colossalai/nn/layer/utils/common.py' -adding 'colossalai/nn/layer/vanilla/__init__.py' -adding 'colossalai/nn/layer/vanilla/layers.py' -adding 'colossalai/nn/layer/wrapper/__init__.py' -adding 'colossalai/nn/layer/wrapper/lambda_wrapper.py' -adding 'colossalai/nn/layer/wrapper/pipeline_wrapper.py' -adding 'colossalai/nn/loss/__init__.py' -adding 'colossalai/nn/loss/loss_1d.py' -adding 'colossalai/nn/loss/loss_2d.py' -adding 'colossalai/nn/loss/loss_2p5d.py' -adding 'colossalai/nn/loss/loss_3d.py' -adding 'colossalai/nn/loss/loss_moe.py' -adding 'colossalai/nn/lr_scheduler/__init__.py' -adding 'colossalai/nn/lr_scheduler/cosine.py' -adding 'colossalai/nn/lr_scheduler/delayed.py' -adding 'colossalai/nn/lr_scheduler/linear.py' -adding 'colossalai/nn/lr_scheduler/multistep.py' -adding 'colossalai/nn/lr_scheduler/onecycle.py' -adding 'colossalai/nn/lr_scheduler/poly.py' -adding 'colossalai/nn/lr_scheduler/torch.py' -adding 'colossalai/nn/metric/__init__.py' -adding 'colossalai/nn/metric/_utils.py' -adding 'colossalai/nn/metric/accuracy_2d.py' -adding 'colossalai/nn/metric/accuracy_2p5d.py' -adding 'colossalai/nn/metric/accuracy_3d.py' -adding 'colossalai/nn/model/__init__.py' -adding 'colossalai/nn/model/model_from_config.py' -adding 'colossalai/nn/optimizer/__init__.py' -adding 'colossalai/nn/optimizer/colossalai_optimizer.py' -adding 'colossalai/nn/optimizer/fused_adam.py' -adding 'colossalai/nn/optimizer/fused_lamb.py' -adding 'colossalai/nn/optimizer/fused_sgd.py' -adding 'colossalai/nn/optimizer/lamb.py' -adding 'colossalai/nn/optimizer/lars.py' -adding 'colossalai/registry/__init__.py' -adding 'colossalai/registry/registry.py' -adding 'colossalai/trainer/__init__.py' -adding 'colossalai/trainer/_trainer.py' -adding 'colossalai/trainer/hooks/__init__.py' -adding 'colossalai/trainer/hooks/_base_hook.py' -adding 'colossalai/trainer/hooks/_checkpoint_hook.py' -adding 'colossalai/trainer/hooks/_log_hook.py' -adding 'colossalai/trainer/hooks/_lr_scheduler_hook.py' -adding 'colossalai/trainer/hooks/_metric_hook.py' -adding 'colossalai/utils/__init__.py' -adding 'colossalai/utils/activation_checkpoint.py' -adding 'colossalai/utils/checkpointing.py' -adding 'colossalai/utils/common.py' -adding 'colossalai/utils/cuda.py' -adding 'colossalai/utils/memory.py' -adding 'colossalai/utils/timer.py' -adding 'colossalai/utils/data_sampler/__init__.py' -adding 'colossalai/utils/data_sampler/base_sampler.py' -adding 'colossalai/utils/data_sampler/data_parallel_sampler.py' -adding 'colossalai/utils/gradient_accumulation/__init__.py' -adding 'colossalai/utils/gradient_accumulation/_gradient_accumulation.py' -adding 'colossalai/utils/multi_tensor_apply/__init__.py' -adding 'colossalai/utils/multi_tensor_apply/multi_tensor_apply.py' -adding 'colossalai/zero/__init__.py' -adding 'colossalai/zero/loss_scaler.py' -adding 'colossalai/zero/zero_redundancy_optimizer_level_2.py' -adding 
'colossalai/zero/zero_redundancy_optimizer_level_3.py' -adding 'model_zoo/__init__.py' -adding 'model_zoo/helper.py' -adding 'model_zoo/bert/__init__.py' -adding 'model_zoo/gpt/__init__.py' -adding 'model_zoo/gpt/gpt.py' -adding 'model_zoo/mlp_mixer/__init__.py' -adding 'model_zoo/mlp_mixer/parallel_3d/__init__.py' -adding 'model_zoo/mlp_mixer/parallel_3d/mlp_mixer.py' -adding 'model_zoo/moe/__init__.py' -adding 'model_zoo/moe/models.py' -adding 'model_zoo/moe/util.py' -adding 'model_zoo/vit/__init__.py' -adding 'model_zoo/vit/vision_transformer_from_config.py' -adding 'model_zoo/vit/vit.py' -adding 'colossalai-0.0.2.dist-info/LICENSE' -adding 'colossalai-0.0.2.dist-info/METADATA' -adding 'colossalai-0.0.2.dist-info/WHEEL' -adding 'colossalai-0.0.2.dist-info/top_level.txt' -adding 'colossalai-0.0.2.dist-info/RECORD' -removing build/bdist.linux-x86_64/wheel diff --git a/model_zoo/__init__.py b/model_zoo/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/__pycache__/__init__.cpython-37.pyc b/model_zoo/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 99b97890f10c4c0ae8e79fbd3fe887dcf396cdc6..0000000000000000000000000000000000000000 Binary files a/model_zoo/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/model_zoo/bert/__init__.py b/model_zoo/bert/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/bert/parallel_1d/.init b/model_zoo/bert/parallel_1d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/bert/parallel_2d/.init b/model_zoo/bert/parallel_2d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/bert/parallel_2p5d/.init b/model_zoo/bert/parallel_2p5d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/bert/parallel_3d/.init b/model_zoo/bert/parallel_3d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/gpt/__init__.py b/model_zoo/gpt/__init__.py deleted file mode 100644 index 5a20f0f818c77df8ef69e2f5574631137775d885..0000000000000000000000000000000000000000 --- a/model_zoo/gpt/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from .gpt import * \ No newline at end of file diff --git a/model_zoo/gpt/gpt.py b/model_zoo/gpt/gpt.py deleted file mode 100644 index b5413f6b8b3c746ddd02ffdb4d89de9e9483b1c7..0000000000000000000000000000000000000000 --- a/model_zoo/gpt/gpt.py +++ /dev/null @@ -1,450 +0,0 @@ -import math -from typing import Callable - -import torch -from colossalai import nn as col_nn -from colossalai.builder.pipeline import partition_uniform -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.nn.layer.utils import CheckpointModule, divide -from colossalai.nn.layer.wrapper import PipelineSharedModuleWrapper -from colossalai.registry import LAYERS, LOSSES, MODELS -from colossalai.utils import get_current_device -from torch import dtype, nn - -__all__ = [ - 'GPT', 'GPTLMLoss', 'gpt2_small', 'gpt2_medium', 'gpt2_large', 'gpt2_xl', 'gpt2_8B', 'gpt2_xl_pipeline', - 'gpt2_8B_pipeline', 'gpt3', 
'gpt3_pipeline' -] - - -@LAYERS.register_module -class GPTEmbedding(nn.Module): - def __init__(self, - embedding_dim: int, - vocab_size: int, - max_position_embeddings: int, - num_tokentypes: int = 0, - padding_idx: int = None, - dropout: float = 0., - dtype: dtype = None) -> None: - super().__init__() - self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype) - self.position_embeddings = col_nn.Embedding(max_position_embeddings, embedding_dim, dtype=dtype) - if num_tokentypes > 0: - self.tokentype_embeddings = col_nn.Embedding(num_tokentypes, embedding_dim, dtype=dtype) - else: - self.tokentype_embeddings = None - self.dropout = col_nn.Dropout(dropout) - - @property - def word_embedding_weight(self): - return self.word_embeddings.weight - - def forward(self, input_ids, attention_mask=None, position_ids=None, tokentype_ids=None): - seq_length = input_ids.size(1) - if position_ids is None: - position_ids = torch.arange(seq_length, dtype=torch.long, device=get_current_device()).unsqueeze(0) - x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids) - if self.tokentype_embeddings is not None and tokentype_ids is not None: - x = x + self.tokentype_embeddings(tokentype_ids) - x = self.dropout(x) - - # We create a 3D attention mask from a 2D tensor mask. - # Sizes are [batch_size, 1, 1, to_seq_length] - # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] - # Adapted from huggingface - if attention_mask is not None: - batch_size = input_ids.shape[0] - attention_mask = attention_mask.view(batch_size, -1) - attention_mask = col_nn.partition_batch(attention_mask) - attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) - attention_mask = attention_mask.to(dtype=x.dtype) # fp16 compatibility - attention_mask = (1.0 - attention_mask) * -10000.0 - - return x, attention_mask - - -@LAYERS.register_module -class GPTSelfAttention(nn.Module): - def __init__(self, - dim: int, - num_heads: int, - attention_dropout: float, - dropout: float, - bias: bool = True, - fuse_scale_mask_softmax: bool = False, - dtype: dtype = None) -> None: - super().__init__() - self.fuse_scale_mask_softmax = fuse_scale_mask_softmax - self.attention_head_size = divide(dim, num_heads) - self.query_key_value = col_nn.Linear(dim, 3 * dim, dtype=dtype, bias=bias) - if fuse_scale_mask_softmax: - from colossalai.kernel import FusedScaleMaskSoftmax - from colossalai.kernel.cuda_native.scaled_softmax import AttnMaskType - self.softmax = FusedScaleMaskSoftmax(input_in_fp16=True, - input_in_bf16=False, - attn_mask_type=AttnMaskType.causal, - scaled_masked_softmax_fusion=True, - mask_func=None, - softmax_in_fp32=True, - scale=math.sqrt(self.attention_head_size)) - else: - self.softmax = nn.Softmax(dim=-1) - self.attention_dropout = col_nn.Dropout(attention_dropout) - self.dense = col_nn.Linear(dim, dim, dtype=dtype, bias=True) - self.dropout = col_nn.Dropout(dropout) - - def forward(self, x, attention_mask=None): - qkv = self.query_key_value(x) - all_head_size = qkv.shape[-1] // 3 - num_attention_heads = divide(all_head_size, self.attention_head_size) - new_qkv_shape = qkv.shape[:-1] + \ - (num_attention_heads, 3 * self.attention_head_size) - qkv = qkv.view(new_qkv_shape) - qkv = qkv.permute((0, 2, 1, 3)) - q, k, v = torch.chunk(qkv, 3, dim=-1) - - x = torch.matmul(q, k.transpose(-1, -2)) - - if self.fuse_scale_mask_softmax: - x = self.softmax(x, attention_mask) - else: - x = x / math.sqrt(self.attention_head_size) - # causal mask - q_len, k_len 
= q.size(-2), k.size(-2) - causal_mask = torch.tril(torch.ones((q_len, k_len), dtype=torch.uint8, - device=get_current_device())).view(1, 1, q_len, k_len).bool() - x = torch.where(causal_mask, x, torch.tensor(-1e4, dtype=x.dtype, device=get_current_device())) - if attention_mask is not None: - x = x + attention_mask - x = self.softmax(x) - - x = self.attention_dropout(x) - - x = torch.matmul(x, v) - x = x.transpose(1, 2) - new_context_layer_shape = x.size()[:-2] + (all_head_size, ) - x = x.reshape(new_context_layer_shape) - - x = self.dense(x) - x = self.dropout(x) - - return x - - -@LAYERS.register_module -class GPTMLP(nn.Module): - def __init__(self, - dim: int, - mlp_ratio: float, - activation: Callable, - dropout: float, - dtype: dtype = None, - bias: bool = True): - super().__init__() - intermediate_dim = int(dim * mlp_ratio) - self.dense_1 = col_nn.Linear(dim, intermediate_dim, dtype=dtype, bias=bias) - self.activation = activation - self.dense_2 = col_nn.Linear(intermediate_dim, dim, dtype=dtype, bias=bias) - self.dropout = col_nn.Dropout(dropout) - - def forward(self, x): - x = self.dense_1(x) - x = self.activation(x) - x = self.dense_2(x) - x = self.dropout(x) - return x - - -@LAYERS.register_module -class GPTBlock(CheckpointModule): - def __init__(self, - dim: int, - num_heads: int, - mlp_ratio: float, - activation: Callable, - attention_dropout: float = 0., - dropout: float = 0., - layernorm_epsilon: float = 1e-5, - dtype: dtype = None, - bias: bool = True, - apply_post_layernorm: bool = False, - fuse_scale_mask_softmax: bool = False, - checkpoint: bool = False): - super().__init__(checkpoint) - self.apply_post_layernorm = apply_post_layernorm - self.norm1 = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - self.attn = GPTSelfAttention(dim=dim, - num_heads=num_heads, - attention_dropout=attention_dropout, - dropout=dropout, - bias=bias, - fuse_scale_mask_softmax=fuse_scale_mask_softmax, - dtype=dtype) - self.norm2 = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - self.mlp = GPTMLP(dim=dim, mlp_ratio=mlp_ratio, activation=activation, dropout=dropout, dtype=dtype, bias=bias) - - def _forward(self, x, attention_mask=None): - if not self.apply_post_layernorm: - residual = x - x = self.norm1(x) - if self.apply_post_layernorm: - residual = x - x = residual + self.attn(x, attention_mask) - - if not self.apply_post_layernorm: - residual = x - x = self.norm2(x) - if self.apply_post_layernorm: - residual = x - x = residual + self.mlp(x) - - return x, attention_mask - - -@LAYERS.register_module -class GPTLMHead(nn.Module): - def __init__(self, - dim: int, - vocab_size: int, - word_embeeding_weight: nn.Parameter = None, - bias: bool = False, - dtype: dtype = None) -> None: - super().__init__() - self.dense = col_nn.Classifier(dim, vocab_size, word_embeeding_weight, bias=bias, dtype=dtype) - - @property - def weight(self): - return self.dense.weight - - def forward(self, x): - x = self.dense(x) - return x - - -@LOSSES.register_module -class GPTLMLoss(nn.Module): - def __init__(self): - super().__init__() - self.loss = col_nn.CrossEntropyLoss() - - def forward(self, logits, labels): - shift_logits = logits[..., :-1, :].contiguous() - shift_labels = labels[..., 1:].contiguous() - # Flatten the tokens - return self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) - - -@MODELS.register_module -class GPT(nn.Module): - def __init__(self, - vocab_size: int = 50304, - max_position_embeddings: int = 1024, - dim: int = 
768, - num_heads: int = 12, - depth: int = 12, - mlp_ratio: float = 4.0, - dropout: float = 0.1, - embedding_dropout: float = 0.1, - attention_dropout: float = 0.1, - layernorm_epsilon: float = 1e-5, - activation: Callable = nn.functional.gelu, - padding_idx: int = None, - dtype: dtype = None, - bias: bool = True, - apply_post_layernorm: bool = False, - fuse_scale_mask_softmax: bool = False, - checkpoint: bool = False) -> None: - super().__init__() - self.embed = GPTEmbedding(embedding_dim=dim, - vocab_size=vocab_size, - max_position_embeddings=max_position_embeddings, - padding_idx=padding_idx, - dropout=embedding_dropout, - dtype=dtype) - self.blocks = nn.ModuleList([ - GPTBlock( - dim=dim, - num_heads=num_heads, - mlp_ratio=mlp_ratio, - activation=activation, - attention_dropout=attention_dropout, - dropout=dropout, - layernorm_epsilon=layernorm_epsilon, - dtype=dtype, - bias=bias, - apply_post_layernorm=apply_post_layernorm, - fuse_scale_mask_softmax=fuse_scale_mask_softmax, - checkpoint=checkpoint, - ) for _ in range(depth) - ]) - - self.norm = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - - self.head = GPTLMHead(dim=dim, - vocab_size=vocab_size, - word_embeeding_weight=self.embed.word_embedding_weight, - dtype=dtype) - - def forward(self, input_ids, attention_mask=None): - x, attention_mask = self.embed(input_ids, attention_mask) - - for block in self.blocks: - x, attention_mask = block(x, attention_mask) - - x = self.head(self.norm(x)) - - return x - - -class PipelineGPT(nn.Module): - def __init__(self, - vocab_size: int = 50304, - max_position_embeddings: int = 1024, - dim: int = 768, - num_heads: int = 12, - depth: int = 12, - mlp_ratio: float = 4.0, - dropout: float = 0.1, - embedding_dropout: float = 0.1, - attention_dropout: float = 0.1, - layernorm_epsilon: float = 1e-5, - activation: Callable = nn.functional.gelu, - padding_idx: int = None, - dtype: dtype = None, - bias: bool = True, - apply_post_layernorm: bool = False, - fuse_scale_mask_softmax: bool = False, - checkpoint: bool = False, - first: bool = False, - last: bool = False): - super().__init__() - self.checkpoint = checkpoint - self.first = first - self.last = last - if first: - self.embed = GPTEmbedding(embedding_dim=dim, - vocab_size=vocab_size, - max_position_embeddings=max_position_embeddings, - padding_idx=padding_idx, - dropout=embedding_dropout, - dtype=dtype) - self.blocks = nn.ModuleList([ - GPTBlock( - dim=dim, - num_heads=num_heads, - mlp_ratio=mlp_ratio, - activation=activation, - attention_dropout=attention_dropout, - dropout=dropout, - layernorm_epsilon=layernorm_epsilon, - dtype=dtype, - bias=bias, - apply_post_layernorm=apply_post_layernorm, - fuse_scale_mask_softmax=fuse_scale_mask_softmax, - checkpoint=checkpoint, - ) for _ in range(depth) - ]) - if self.last: - self.norm = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - self.head = GPTLMHead(dim=dim, vocab_size=vocab_size, dtype=dtype) - - def forward(self, x=None, input_ids=None, attention_mask=None): - if self.first: - x, attention_mask = self.embed(input_ids, attention_mask) - - for block in self.blocks: - x, attention_mask = block(x, attention_mask) - - if self.last: - x = self.head(self.norm(x)) - - return x - - -def _create_gpt_model(**model_kwargs): - model = GPT(**model_kwargs) - return model - - -def _create_gpt_pipeline_model(depth=48, num_chunks=1, layer_partitions=None, **model_kwargs): - logger = get_dist_logger() - pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE) - 
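- # The loop below builds one PipelineGPT stage per (start, end) range produced by
- # partition_uniform(depth, pipeline_size, num_chunks): the first stage owns the
- # embedding and the last stage owns the LM head, and their weights are registered
- # with PipelineSharedModuleWrapper so the tied parameter stays synchronized
- # across the first and last pipeline ranks.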
pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE) - rank = gpc.get_global_rank() - wrapper = PipelineSharedModuleWrapper([0, pipeline_size - 1]) - parts = partition_uniform(depth, pipeline_size, - num_chunks)[pipeline_rank] if layer_partitions is None else layer_partitions - models = [] - for start, end in parts: - model_kwargs['first'] = start == 0 - model_kwargs['last'] = end == depth - model_kwargs['depth'] = end - start - chunk = PipelineGPT(**model_kwargs).to(get_current_device()) - if start == 0: - wrapper.register_parameter(chunk.embed.word_embedding_weight) - elif end == depth: - wrapper.register_parameter(chunk.head.weight) - models.append(chunk) - logger.info(f'==> Rank {rank} built layer {start}-{end} / total {depth}') - if len(models) == 1: - model = models[0] - else: - model = nn.ModuleList(models) - return model - - -@MODELS.register_module -def gpt2_small(**kwargs): - model_kwargs = dict(dim=768, depth=12, num_heads=12, **kwargs) - return _create_gpt_model(**model_kwargs) - - -@MODELS.register_module -def gpt2_medium(**kwargs): - model_kwargs = dict(dim=1024, depth=24, num_heads=8, **kwargs) - return _create_gpt_model(**model_kwargs) - - -@MODELS.register_module -def gpt2_large(**kwargs): - model_kwargs = dict(dim=1536, depth=36, num_heads=12, **kwargs) - return _create_gpt_model(**model_kwargs) - - -@MODELS.register_module -def gpt2_xl(**kwargs): - model_kwargs = dict(dim=1600, depth=48, num_heads=16, **kwargs) - return _create_gpt_model(**model_kwargs) - - -@MODELS.register_module -def gpt2_8B(**kwargs): - model_kwargs = dict(dim=3072, depth=72, num_heads=24, **kwargs) - return _create_gpt_model(**model_kwargs) - - -@MODELS.register_module -def gpt2_xl_pipeline(**kwargs): - model_kwargs = dict(dim=1600, depth=48, num_heads=20, **kwargs) - return _create_gpt_pipeline_model(**model_kwargs) - - -@MODELS.register_module -def gpt2_8B_pipeline(**kwargs): - model_kwargs = dict(dim=3072, depth=72, num_heads=24, **kwargs) - return _create_gpt_pipeline_model(**model_kwargs) - - -@MODELS.register_module -def gpt3(**kwargs): - model_kwargs = dict(dim=12288, depth=96, num_heads=96, **kwargs) - return _create_gpt_model(**model_kwargs) - - -@MODELS.register_module -def gpt3_pipeline(**kwargs): - model_kwargs = dict(dim=12288, depth=96, num_heads=96, **kwargs) - return _create_gpt_pipeline_model(**model_kwargs) diff --git a/model_zoo/helper.py b/model_zoo/helper.py deleted file mode 100644 index 0f4fac17c742c55b81155109a2bca9a027f9f099..0000000000000000000000000000000000000000 --- a/model_zoo/helper.py +++ /dev/null @@ -1,26 +0,0 @@ -import torch -import torch.nn as nn -from colossalai.nn.layer import WrappedDropPath as DropPath - - -class TransformerLayer(nn.Module): - """Transformer layer builder. 
- """ - def __init__(self, - att: nn.Module, - ffn: nn.Module, - norm1: nn.Module, - norm2: nn.Module, - droppath=None, - droppath_rate: float = 0): - super().__init__() - self.att = att - self.ffn = ffn - self.norm1 = norm1 - self.norm2 = norm2 - self.droppath = DropPath(droppath_rate) if droppath is None else droppath - - def forward(self, x): - x = x + self.droppath(self.att(self.norm1(x))) - x = x + self.droppath(self.ffn(self.norm2(x))) - return x diff --git a/model_zoo/mlp_mixer/__init__.py b/model_zoo/mlp_mixer/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/mlp_mixer/parallel_1d/.init b/model_zoo/mlp_mixer/parallel_1d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/mlp_mixer/parallel_2d/.init b/model_zoo/mlp_mixer/parallel_2d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/mlp_mixer/parallel_2p5d/.init b/model_zoo/mlp_mixer/parallel_2p5d/.init deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/mlp_mixer/parallel_3d/__init__.py b/model_zoo/mlp_mixer/parallel_3d/__init__.py deleted file mode 100644 index 4beba8761d5eb73652784e5ce12d3f07f02ff990..0000000000000000000000000000000000000000 --- a/model_zoo/mlp_mixer/parallel_3d/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from .mlp_mixer import * diff --git a/model_zoo/mlp_mixer/parallel_3d/mlp_mixer.py b/model_zoo/mlp_mixer/parallel_3d/mlp_mixer.py deleted file mode 100644 index 3aa2b731785722af406dd2148cb64e1e22e89dde..0000000000000000000000000000000000000000 --- a/model_zoo/mlp_mixer/parallel_3d/mlp_mixer.py +++ /dev/null @@ -1,63 +0,0 @@ -# modified from https://github.com/lucidrains/mlp-mixer-pytorch/blob/main/mlp_mixer_pytorch/mlp_mixer_pytorch.py -from functools import partial -from colossalai.context import ParallelMode -from colossalai.registry import MODELS -from torch import nn -from colossalai import nn as col_nn -from colossalai.nn.layer.parallel_3d._utils import get_depth_from_env -from einops.layers.torch import Rearrange, Reduce - -__all__ = [ - 'MLPMixer', -] - - -class PreNormResidual(nn.Module): - def __init__(self, dim, fn, depth_3d): - super().__init__() - self.fn = fn - self.norm = col_nn.LayerNorm3D( - dim, depth_3d, ParallelMode.PARALLEL_3D_INPUT, ParallelMode.PARALLEL_3D_WEIGHT) - - def forward(self, x): - return self.fn(self.norm(x)) + x - - -def FeedForward(dim, depth_3d, expansion_factor=4, dropout=0., dense=None): - if dense is None: - dense = partial(col_nn.Linear3D, depth=depth_3d, input_parallel_mode=ParallelMode.PARALLEL_3D_INPUT, - weight_parallel_mode=ParallelMode.PARALLEL_3D_WEIGHT) - return nn.Sequential( - dense(dim, dim * expansion_factor), - nn.GELU(), - nn.Dropout(dropout), - dense(dim * expansion_factor, dim), - nn.Dropout(dropout) - ) - - -@MODELS.register_module -def MLPMixer(image_size, channels, patch_size, dim, depth, num_classes, expansion_factor=4, dropout=0.): - assert (image_size % patch_size) == 0, 'image must be divisible by patch size' - num_patches = (image_size // patch_size) ** 2 - depth_3d = get_depth_from_env() - linear = partial(col_nn.Linear3D, depth=depth_3d, input_parallel_mode=ParallelMode.PARALLEL_3D_INPUT, - weight_parallel_mode=ParallelMode.PARALLEL_3D_WEIGHT) - norm_layer = partial(col_nn.LayerNorm3D, 
depth=depth_3d, input_parallel_mode=ParallelMode.PARALLEL_3D_INPUT, - weight_parallel_mode=ParallelMode.PARALLEL_3D_WEIGHT) - chan_first, chan_last = partial(nn.Conv1d, kernel_size=1), linear - - return nn.Sequential( - Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', - p1=patch_size, p2=patch_size), - linear((patch_size ** 2) * channels, dim), - *[nn.Sequential( - PreNormResidual(dim, FeedForward( - num_patches, depth_3d, expansion_factor, dropout, chan_first), depth_3d), - PreNormResidual(dim, FeedForward( - dim, depth_3d, expansion_factor, dropout, chan_last), depth_3d) - ) for _ in range(depth)], - norm_layer(dim), - Reduce('b n c -> b c', 'mean'), - linear(dim, num_classes) - ) diff --git a/model_zoo/moe/__init__.py b/model_zoo/moe/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/model_zoo/moe/models.py b/model_zoo/moe/models.py deleted file mode 100644 index 2be7b21aec3a7d116e5ba02a87bf70a484c49e27..0000000000000000000000000000000000000000 --- a/model_zoo/moe/models.py +++ /dev/null @@ -1,147 +0,0 @@ -import math -import torch -import torch.nn as nn -from colossalai.context import ParallelMode -from colossalai.nn.layer import VanillaPatchEmbedding, VanillaClassifier, \ - WrappedDropout as Dropout, WrappedDropPath as DropPath -from colossalai.nn.layer.moe import Experts, MoeLayer, Top2Router, NormalNoiseGenerator -from .util import moe_sa_args, moe_mlp_args -from ..helper import TransformerLayer -from colossalai.global_variables import moe_env -from colossalai.utils import get_current_device - - -class VanillaSelfAttention(nn.Module): - """Standard ViT self-attention. - """ - - def __init__(self, - d_model: int, - n_heads: int, - d_kv: int, - attention_drop: float = 0, - drop_rate: float = 0, - bias: bool = True, - dropout1=None, - dropout2=None): - super().__init__() - self.n_heads = n_heads - self.d_kv = d_kv - self.scale = 1.0 / math.sqrt(self.d_kv) - - self.dense1 = nn.Linear(d_model, 3 * n_heads * d_kv, bias, device=get_current_device()) - self.softmax = nn.Softmax(dim=-1) - self.atten_drop = nn.Dropout(attention_drop) if dropout1 is None else dropout1 - self.dense2 = nn.Linear(n_heads * d_kv, d_model, device=get_current_device()) - self.dropout = nn.Dropout(drop_rate) if dropout2 is None else dropout2 - - def forward(self, x): - qkv = self.dense1(x) - new_shape = qkv.shape[:2] + (3, self.n_heads, self.d_kv) - qkv = qkv.view(*new_shape) - qkv = qkv.permute(2, 0, 3, 1, 4) - q, k, v = qkv[:] - - x = torch.matmul(q, k.transpose(-2, -1)) * self.scale - x = self.atten_drop(self.softmax(x)) - - x = torch.matmul(x, v) - x = x.transpose(1, 2) - new_shape = x.shape[:2] + (self.n_heads * self.d_kv,) - x = x.reshape(*new_shape) - x = self.dense2(x) - x = self.dropout(x) - - return x - - -class VanillaFFN(nn.Module): - """FFN composed of two linear layers, also called MLP. 
- """ - - def __init__(self, - d_model: int, - d_ff: int, - activation=None, - drop_rate: float = 0, - bias: bool = True, - dropout1=None, - dropout2=None): - super().__init__() - dense1 = nn.Linear(d_model, d_ff, bias, device=get_current_device()) - act = nn.GELU() if activation is None else activation - dense2 = nn.Linear(d_ff, d_model, bias, device=get_current_device()) - drop1 = nn.Dropout(drop_rate) if dropout1 is None else dropout1 - drop2 = nn.Dropout(drop_rate) if dropout2 is None else dropout2 - - self.ffn = nn.Sequential(dense1, act, drop1, dense2, drop2) - - def forward(self, x): - return self.ffn(x) - - -class Widenet(nn.Module): - def __init__(self, - num_experts: int, - capacity_factor: float, - img_size: int = 224, - patch_size: int = 16, - in_chans: int = 3, - num_classes: int = 1000, - depth: int = 12, - d_model: int = 768, - num_heads: int = 12, - d_kv: int = 64, - d_ff: int = 4096, - attention_drop: float = 0., - drop_rate: float = 0.1, - drop_path: float = 0.): - super().__init__() - - embedding = VanillaPatchEmbedding( - img_size=img_size, - patch_size=patch_size, - in_chans=in_chans, - embed_size=d_model) - embed_dropout = Dropout(p=drop_rate, mode=ParallelMode.TENSOR) - - shared_sa = VanillaSelfAttention(**moe_sa_args( - d_model=d_model, n_heads=num_heads, d_kv=d_kv, - attention_drop=attention_drop, drop_rate=drop_rate)) - - noisy_func = NormalNoiseGenerator(num_experts) - shared_router = Top2Router(capacity_factor, noisy_func=noisy_func) - shared_experts = Experts(expert=VanillaFFN, - num_experts=num_experts, - **moe_mlp_args( - d_model=d_model, - d_ff=d_ff, - drop_rate=drop_rate - )) - - # stochastic depth decay rule - dpr = [x.item() for x in torch.linspace(0, drop_path, depth)] - blocks = [ - TransformerLayer( - att=shared_sa, - ffn=MoeLayer(dim_model=d_model, num_experts=num_experts, - router=shared_router, experts=shared_experts), - norm1=nn.LayerNorm(d_model, eps=1e-6), - norm2=nn.LayerNorm(d_model, eps=1e-6), - droppath=DropPath(p=dpr[i], mode=ParallelMode.TENSOR) - ) - for i in range(depth) - ] - norm = nn.LayerNorm(d_model, eps=1e-6) - self.linear = VanillaClassifier(in_features=d_model, - num_classes=num_classes) - nn.init.zeros_(self.linear.weight) - nn.init.zeros_(self.linear.bias) - self.widenet = nn.Sequential(embedding, embed_dropout, *blocks, norm) - - def forward(self, x): - moe_env.reset_loss() - x = self.widenet(x) - x = torch.mean(x, dim=1) - x = self.linear(x) - return x diff --git a/model_zoo/moe/util.py b/model_zoo/moe/util.py deleted file mode 100644 index 60028656eea5753f00a20826541b9e2c24412be6..0000000000000000000000000000000000000000 --- a/model_zoo/moe/util.py +++ /dev/null @@ -1,41 +0,0 @@ -from colossalai.context import ParallelMode -from colossalai.nn.layer import WrappedDropout as Dropout - - -def moe_sa_args(d_model: int, - n_heads: int, - d_kv: int, - attention_drop: float = 0, - drop_rate: float = 0, - bias: bool = True): - """This is an example for args in moe self attention, since lots of modules should be - adapted before putting them in experts. - """ - dropout1 = Dropout(attention_drop, mode=ParallelMode.TENSOR) - dropout2 = Dropout(drop_rate, mode=ParallelMode.TENSOR) - return dict( - d_model=d_model, - n_heads=n_heads, - d_kv=d_kv, - bias=bias, - dropout1=dropout1, - dropout2=dropout2 - ) - - -def moe_mlp_args(d_model: int, - d_ff: int, - drop_rate: float, - bias: bool = True): - """This is an example for args of MLP in Experts, since lots of modules should be adapted - before putting them in experts. 
- """ - dropout1 = Dropout(drop_rate, mode=ParallelMode.TENSOR) - dropout2 = Dropout(drop_rate, mode=ParallelMode.TENSOR) - return dict( - d_model=d_model, - d_ff=d_ff, - bias=bias, - dropout1=dropout1, - dropout2=dropout2 - ) diff --git a/model_zoo/vit/__init__.py b/model_zoo/vit/__init__.py deleted file mode 100644 index 5e5f1941de61363c1a411d38cd12b654787509a8..0000000000000000000000000000000000000000 --- a/model_zoo/vit/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from .vit import * \ No newline at end of file diff --git a/model_zoo/vit/__pycache__/__init__.cpython-37.pyc b/model_zoo/vit/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 9b7d28c11520274675f00ba95bb86f6e1b4571c5..0000000000000000000000000000000000000000 Binary files a/model_zoo/vit/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/model_zoo/vit/__pycache__/vision_transformer_from_config.cpython-37.pyc b/model_zoo/vit/__pycache__/vision_transformer_from_config.cpython-37.pyc deleted file mode 100644 index 43bed730eb71d76fee0dcc178be93e19d5f2d9c7..0000000000000000000000000000000000000000 Binary files a/model_zoo/vit/__pycache__/vision_transformer_from_config.cpython-37.pyc and /dev/null differ diff --git a/model_zoo/vit/__pycache__/vit.cpython-37.pyc b/model_zoo/vit/__pycache__/vit.cpython-37.pyc deleted file mode 100644 index 9dfad24b7287ab769cc4c227452a2b5dd56e55d6..0000000000000000000000000000000000000000 Binary files a/model_zoo/vit/__pycache__/vit.cpython-37.pyc and /dev/null differ diff --git a/model_zoo/vit/vision_transformer_from_config.py b/model_zoo/vit/vision_transformer_from_config.py deleted file mode 100644 index af1e320914684462d52ca016ebdf8025f434b473..0000000000000000000000000000000000000000 --- a/model_zoo/vit/vision_transformer_from_config.py +++ /dev/null @@ -1,87 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - -from colossalai.registry import MODELS -from colossalai.nn.model.model_from_config import ModelFromConfig - - -@MODELS.register_module -class VisionTransformerFromConfig(ModelFromConfig): - """Vision Transformer from - `"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" `_. 
- - """ - - def __init__(self, - embedding_cfg: dict, - norm_cfg: dict, - block_cfg: dict, - head_cfg: dict, - token_fusion_cfg: dict = None, - embed_dim=768, - depth=12, - drop_path_rate=0., - tensor_splitting_cfg: dict = None): - super().__init__() - self.embed_dim = embed_dim - self.num_tokens = 1 - self.tensor_splitting_cfg = tensor_splitting_cfg - dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth) - ] # stochastic depth decay rule - if token_fusion_cfg is None: - token_fusion_cfg = [] - else: - token_fusion_cfg = [token_fusion_cfg] - - self.layers_cfg = [ - embedding_cfg, - - # input tensor splitting - *self._generate_tensor_splitting_cfg(), - *token_fusion_cfg, - - # blocks - *self._generate_block_cfg( - dpr=dpr, block_cfg=block_cfg, depth=depth), - - # norm - norm_cfg, - - # head - head_cfg - ] - - def _fuse_tokens(self, x): - cls_token = self.cls_token.expand(x.shape[0], -1, -1) - x = torch.cat((cls_token, x), dim=1) - return x - - def _generate_block_cfg(self, dpr, depth, block_cfg): - blocks_cfg = [] - - for i in range(depth): - _cfg = block_cfg.copy() - _cfg['droppath_cfg']['drop_path'] = dpr[i] - blocks_cfg.append(_cfg) - - return blocks_cfg - - def _generate_tensor_splitting_cfg(self): - if self.tensor_splitting_cfg: - return [self.tensor_splitting_cfg] - else: - return [] - - def forward(self, x): # [512, 3, 32, 32] - for layer in self.layers: - if isinstance(x, tuple): - x = layer(*x) - else: - x = layer(x) - return x # [256, 5] - - def init_weights(self): - # TODO: add init weights - pass diff --git a/model_zoo/vit/vit.py b/model_zoo/vit/vit.py deleted file mode 100644 index 9bdcbfd388b36d7ba35524a90dd340db35544967..0000000000000000000000000000000000000000 --- a/model_zoo/vit/vit.py +++ /dev/null @@ -1,415 +0,0 @@ -import math -from typing import Callable - -import torch -from colossalai import nn as col_nn -from colossalai.nn.layer.utils import CheckpointModule -from colossalai.registry import LAYERS, MODELS -from torch import dtype, nn - -__all__ = [ - 'VisionTransformer', - 'vit_lite_depth7_patch4_32', - 'vit_tiny_patch4_32', - 'vit_tiny_patch16_224', - 'vit_tiny_patch16_384', - 'vit_small_patch16_224', - 'vit_small_patch16_384', - 'vit_small_patch32_224', - 'vit_small_patch32_384', - 'vit_base_patch16_224', - 'vit_base_patch16_384', - 'vit_base_patch32_224', - 'vit_base_patch32_384', - 'vit_large_patch16_224', - 'vit_large_patch16_384', - 'vit_large_patch32_224', - 'vit_large_patch32_384', -] - -_init_rules = dict( - torch=dict( - embed=dict( - weight_initializer=col_nn.init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer=col_nn.init.xavier_uniform_(a=1, scale=1), - position_embed_initializer=col_nn.init.zeros_(), - ), - transformer=dict( - weight_initializer=col_nn.init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer=col_nn.init.xavier_uniform_(a=1, scale=1), - ), - head=dict( - weight_initializer=col_nn.init.kaiming_uniform_(a=math.sqrt(5)), - bias_initializer=col_nn.init.xavier_uniform_(a=1, scale=1), - ), - ), - jax=dict( - embed=dict( - weight_initializer=col_nn.init.lecun_normal_(), - bias_initializer=col_nn.init.zeros_(), - position_embed_initializer=col_nn.init.trunc_normal_(std=.02), - ), - transformer=dict( - weight_initializer=col_nn.init.xavier_uniform_(), - bias_initializer=col_nn.init.normal_(std=1e-6), - ), - head=dict( - weight_initializer=col_nn.init.zeros_(), - bias_initializer=col_nn.init.zeros_(), - ), - ), -) - - -@LAYERS.register_module -class ViTEmbedding(nn.Module): - def __init__(self, - img_size: int, - patch_size: int, 
- in_chans: int, - embedding_dim: int, - dropout: float, - dtype: dtype = None, - flatten: bool = True, - init_method: str = 'torch'): - super().__init__() - self.patch_embed = col_nn.PatchEmbedding(img_size, - patch_size, - in_chans, - embedding_dim, - dtype=dtype, - flatten=flatten, - **_init_rules[init_method]['embed']) - self.dropout = col_nn.Dropout(dropout) - - def forward(self, x): - x = self.patch_embed(x) - x = self.dropout(x) - return x - - -@LAYERS.register_module -class ViTSelfAttention(nn.Module): - def __init__(self, - dim: int, - num_heads: int, - attention_dropout: float, - dropout: float, - bias: bool = True, - dtype: dtype = None, - init_method: str = 'torch'): - super().__init__() - self.attention_head_size = dim // num_heads - self.query_key_value = col_nn.Linear(dim, - 3 * dim, - dtype=dtype, - bias=bias, - **_init_rules[init_method]['transformer']) - self.attention_dropout = col_nn.Dropout(attention_dropout) - self.dense = col_nn.Linear(dim, dim, dtype=dtype, bias=True, **_init_rules[init_method]['transformer']) - self.dropout = col_nn.Dropout(dropout) - self.softmax = nn.Softmax(dim=-1) - - def forward(self, x): - qkv = self.query_key_value(x) - all_head_size = qkv.shape[-1] // 3 - num_attention_heads = all_head_size // self.attention_head_size - new_qkv_shape = qkv.shape[:-1] + \ - (num_attention_heads, 3 * self.attention_head_size) - qkv = qkv.view(new_qkv_shape) - qkv = qkv.permute((0, 2, 1, 3)) - q, k, v = torch.chunk(qkv, 3, dim=-1) - - x = torch.matmul(q, k.transpose(-1, -2)) - x = x / math.sqrt(self.attention_head_size) - x = self.softmax(x) - x = self.attention_dropout(x) - - x = torch.matmul(x, v) - x = x.transpose(1, 2) - new_context_layer_shape = x.size()[:-2] + (all_head_size, ) - x = x.reshape(new_context_layer_shape) - - x = self.dense(x) - x = self.dropout(x) - - return x - - -@LAYERS.register_module -class ViTMLP(nn.Module): - def __init__(self, - dim: int, - mlp_ratio: int, - activation: Callable, - dropout: float, - dtype: dtype = None, - bias: bool = True, - init_method: str = 'torch'): - super().__init__() - self.dense_1 = col_nn.Linear(dim, - mlp_ratio * dim, - dtype=dtype, - bias=bias, - **_init_rules[init_method]['transformer']) - self.activation = activation - self.dropout_1 = col_nn.Dropout(dropout) - self.dense_2 = col_nn.Linear(mlp_ratio * dim, - dim, - dtype=dtype, - bias=bias, - **_init_rules[init_method]['transformer']) - self.dropout_2 = col_nn.Dropout(dropout) - - def forward(self, x): - x = self.dense_1(x) - x = self.activation(x) - x = self.dropout_1(x) - x = self.dense_2(x) - x = self.dropout_2(x) - return x - - -@LAYERS.register_module -class ViTHead(nn.Module): - def __init__(self, - dim: int, - num_classes: int, - representation_size: int = None, - dtype: dtype = None, - bias: bool = True, - init_method: str = 'torch'): - super().__init__() - if representation_size: - self.representation = col_nn.Linear(dim, - representation_size, - bias=bias, - dtype=dtype, - **_init_rules[init_method]['head']) - else: - self.representation = None - representation_size = dim - - self.dense = col_nn.Classifier(representation_size, - num_classes, - dtype=dtype, - bias=bias, - **_init_rules[init_method]['head']) - - def forward(self, x): - x = x[:, 0] - if self.representation is not None: - x = self.representation(x) - x = self.dense(x) - return x - - -@LAYERS.register_module -class ViTBlock(CheckpointModule): - def __init__(self, - dim: int, - num_heads: int, - mlp_ratio: int, - activation: Callable, - attention_dropout: float = 0., - dropout: 
float = 0., - drop_path: float = 0., - layernorm_epsilon: float = 1e-6, - dtype: dtype = None, - bias: bool = True, - checkpoint: bool = False, - init_method: str = 'torch'): - super().__init__(checkpoint) - self.norm1 = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - self.attn = ViTSelfAttention(dim=dim, - num_heads=num_heads, - attention_dropout=attention_dropout, - dropout=dropout, - bias=bias, - dtype=dtype, - init_method=init_method) - self.drop_path = col_nn.DropPath(drop_path) if drop_path > 0. else nn.Identity() - self.norm2 = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - self.mlp = ViTMLP(dim=dim, - mlp_ratio=mlp_ratio, - activation=activation, - dropout=dropout, - dtype=dtype, - bias=bias, - init_method=init_method) - - def _forward(self, x): - x = x + self.drop_path(self.attn(self.norm1(x))) - x = x + self.drop_path(self.mlp(self.norm2(x))) - return x - - -@MODELS.register_module -class VisionTransformer(nn.Module): - def __init__(self, - img_size: int = 224, - patch_size: int = 16, - in_chans: int = 3, - num_classes: int = 1000, - depth: int = 12, - num_heads: int = 12, - dim: int = 768, - mlp_ratio: int = 4, - attention_dropout: float = 0., - dropout: float = 0.1, - drop_path: float = 0., - layernorm_epsilon: float = 1e-6, - activation: Callable = nn.functional.gelu, - representation_size: int = None, - dtype: dtype = None, - bias: bool = True, - checkpoint: bool = False, - init_method: str = 'torch'): - super().__init__() - - embed = ViTEmbedding(img_size=img_size, - patch_size=patch_size, - in_chans=in_chans, - embedding_dim=dim, - dropout=dropout, - dtype=dtype, - init_method=init_method) - - # stochastic depth decay rule - dpr = [x.item() for x in torch.linspace(0, drop_path, depth)] - blocks = [ - ViTBlock( - dim=dim, - num_heads=num_heads, - mlp_ratio=mlp_ratio, - attention_dropout=attention_dropout, - dropout=dropout, - drop_path=dpr[i], - activation=activation, - dtype=dtype, - bias=bias, - checkpoint=checkpoint, - init_method=init_method, - ) for i in range(depth) - ] - - norm = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype) - - head = ViTHead(dim=dim, - num_classes=num_classes, - representation_size=representation_size, - dtype=dtype, - bias=bias, - init_method=init_method) - - self.layers = nn.Sequential( - embed, - *blocks, - norm, - head, - ) - - def forward(self, x): - x = self.layers(x) - return x - - -def _create_vit_model(**model_kwargs): - model = VisionTransformer(**model_kwargs) - return model - - -@MODELS.register_module -def vit_lite_depth7_patch4_32(**kwargs): - model_kwargs = dict(img_size=32, patch_size=4, dim=256, depth=7, num_heads=4, mlp_ratio=2, num_classes=10, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_tiny_patch4_32(**kwargs): - model_kwargs = dict(img_size=32, patch_size=4, dim=512, depth=6, num_heads=8, mlp_ratio=1, num_classes=10, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_tiny_patch16_224(**kwargs): - model_kwargs = dict(img_size=224, patch_size=16, dim=192, depth=12, num_heads=3, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_tiny_patch16_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=16, dim=192, depth=12, num_heads=3, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_small_patch16_224(**kwargs): - model_kwargs = dict(img_size=224, 
patch_size=16, dim=384, depth=12, num_heads=6, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_small_patch16_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=16, dim=384, depth=12, num_heads=6, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_small_patch32_224(**kwargs): - model_kwargs = dict(img_size=224, patch_size=32, dim=384, depth=12, num_heads=6, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_small_patch32_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=32, dim=384, depth=12, num_heads=6, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_base_patch16_224(**kwargs): - model_kwargs = dict(img_size=224, patch_size=16, dim=768, depth=12, num_heads=12, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_base_patch16_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=16, dim=768, depth=12, num_heads=12, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_base_patch32_224(**kwargs): - model_kwargs = dict(img_size=224, patch_size=32, dim=768, depth=12, num_heads=12, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_base_patch32_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=32, dim=768, depth=12, num_heads=12, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_large_patch16_224(**kwargs): - model_kwargs = dict(img_size=224, patch_size=16, dim=1024, depth=24, num_heads=16, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_large_patch16_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=16, dim=1024, depth=24, num_heads=16, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_large_patch32_224(**kwargs): - model_kwargs = dict(img_size=224, patch_size=32, dim=1024, depth=24, num_heads=16, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) - - -@MODELS.register_module -def vit_large_patch32_384(**kwargs): - model_kwargs = dict(img_size=384, patch_size=32, dim=1024, depth=24, num_heads=16, mlp_ratio=4, **kwargs) - return _create_vit_model(**model_kwargs) diff --git a/pytest.ini b/pytest.ini deleted file mode 100644 index ac31ace4bfae025025b1098719aba873db615d1c..0000000000000000000000000000000000000000 --- a/pytest.ini +++ /dev/null @@ -1,6 +0,0 @@ -[pytest] -markers = - cpu: tests which can run on CPU - gpu: tests which requires a single GPU - dist: tests which are run in a multi-GPU or multi-machine environment - experiment: tests for experimental features \ No newline at end of file diff --git a/requirements/requirements-test.txt b/requirements/requirements-test.txt deleted file mode 100644 index 69b82ff84ee524ef7ae6d85ad73aff01c09348d4..0000000000000000000000000000000000000000 --- a/requirements/requirements-test.txt +++ /dev/null @@ -1,3 +0,0 @@ -pytest -rpyc -matplotlib \ No newline at end of file diff --git a/requirements/requirements-zero.txt b/requirements/requirements-zero.txt deleted file mode 100644 index 816211e728eeb5fb84edd358cb20eac9c9a69c2c..0000000000000000000000000000000000000000 --- a/requirements/requirements-zero.txt +++ /dev/null @@ -1 +0,0 @@ -deepspeed \ No newline 
at end of file diff --git a/requirements/requirements.txt b/requirements/requirements.txt deleted file mode 100644 index 469b4d32324bc99b5dd3d1816b1d843123658ef7..0000000000000000000000000000000000000000 --- a/requirements/requirements.txt +++ /dev/null @@ -1,8 +0,0 @@ -torch>=1.8 -torchvision>=0.9 -numpy -tqdm -psutil -tensorboard -packaging -pre-commit diff --git a/scripts/slurm_dist_train.sh b/scripts/slurm_dist_train.sh deleted file mode 100644 index 1d3d505f368f3cbb8257284c3628da98761e693f..0000000000000000000000000000000000000000 --- a/scripts/slurm_dist_train.sh +++ /dev/null @@ -1,11 +0,0 @@ -#!/usr/bin/env sh - - -main_file=$1 -config_file=$2 - -python $main_file --local_rank $SLURM_PROCID --world_size $SLURM_NPROCS --host $HOST --port 29500 --config $config_file - -# how to run this script -# example: -# HOST=IP_ADDR srun ./scripts/slurm_dist_train.sh ./examples/train_vit_2d.py ./configs/vit/vit_2d.py \ No newline at end of file diff --git a/setup.py b/setup.py deleted file mode 100644 index 6d4dd54d4de16c8653adf5aed2f0173d8dc41020..0000000000000000000000000000000000000000 --- a/setup.py +++ /dev/null @@ -1,301 +0,0 @@ -import os -import subprocess -import sys - -from setuptools import find_packages, setup - -# HC -import torch - -# ninja build does not work unless include_dirs are abs path -this_dir = os.path.dirname(os.path.abspath(__file__)) -build_cuda_ext = True -# HC -build_hip_ext = False -ext_modules = [] - -if '--no_cuda_ext' in sys.argv: - sys.argv.remove('--no_cuda_ext') - build_cuda_ext = False - -# HC CUDA_HOME is ROCM_HOME in hip branch -if tuple(int(v) for v in torch.__version__.split('.')[:2]) >= (1, 5): - from torch.utils.cpp_extension import ROCM_HOME - if ((torch.version.hip is not None) and (ROCM_HOME is not None)): - build_hip_ext = True - build_cuda_ext = False - CUDA_HOME = ROCM_HOME - -if '--no_hip_ext' in sys.argv: - sys.argv.remove('--no_hip_ext') - build_hip_ext = False - -# HC -def get_cuda_bare_metal_version(cuda_dir): - if build_cuda_ext == True: - raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True) - output = raw_output.split() - release_idx = output.index("release") + 1 - else: - raw_output = subprocess.check_output([cuda_dir + "/bin/hipcc", "--version"], universal_newlines=True) - output = raw_output.split() - release_idx = output.index("version:") + 1 - release = output[release_idx].split(".") - bare_metal_major = release[0] - bare_metal_minor = release[1][0] - - return raw_output, bare_metal_major, bare_metal_minor - -# HC -def check_cuda_torch_binary_vs_bare_metal(cuda_dir): - raw_output, bare_metal_major, bare_metal_minor = get_cuda_bare_metal_version(cuda_dir) - if build_cuda_ext == True: - torch_binary_major = torch.version.cuda.split(".")[0] - torch_binary_minor = torch.version.cuda.split(".")[1] - else: - torch_binary_major = torch.version.hip.split(".")[0] - torch_binary_minor = torch.version.hip.split(".")[1] - - print("\nCompiling cuda extensions with") - print(raw_output + "from " + cuda_dir + "/bin\n") - - if bare_metal_major != torch_binary_major: - print( - f'The detected CUDA version ({raw_output}) mismatches the version that was used to compile PyTorch ({torch.version.cuda}). CUDA extension will not be installed.') - return False - - if bare_metal_minor != torch_binary_minor: - print("\nWarning: Cuda extensions are being compiled with a version of Cuda that does 
" - + "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) - + "In some cases, a minor-version mismatch will not cause later errors: " - + "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. ") - return True - -check_cuda_torch_binary_vs_bare_metal(CUDA_HOME) -#print("+++++++++++++++++++++++++", bare_metal_major, bare_metal_minor) -#exit() - -def check_cuda_availability(cuda_dir): - if not torch.cuda.is_available(): - # https://github.com/NVIDIA/apex/issues/486 - # Extension builds after https://github.com/pytorch/pytorch/pull/23408 attempt to query torch.cuda.get_device_capability(), - # which will fail if you are compiling in an environment without visible GPUs (e.g. during an nvidia-docker build command). - print('\nWarning: Torch did not find available GPUs on this system.\n', - 'If your intention is to cross-compile, this is not an error.\n' - 'By default, Colossal-AI will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),\n' - 'Volta (compute capability 7.0), Turing (compute capability 7.5),\n' - 'and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).\n' - 'If you wish to cross-compile for a single specific architecture,\n' - 'export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.\n') - if os.environ.get("TORCH_CUDA_ARCH_LIST", None) is None: - _, bare_metal_major, _ = get_cuda_bare_metal_version(cuda_dir) - if int(bare_metal_major) == 11: - os.environ["TORCH_CUDA_ARCH_LIST"] = "6.0;6.1;6.2;7.0;7.5;8.0" - else: - os.environ["TORCH_CUDA_ARCH_LIST"] = "6.0;6.1;6.2;7.0;7.5" - return False - - if cuda_dir is None: - print( - "nvcc was not found. CUDA extension will not be installed. If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.") - return False - return True - - -def append_nvcc_threads(nvcc_extra_args): - _, bare_metal_major, bare_metal_minor = get_cuda_bare_metal_version(CUDA_HOME) - if int(bare_metal_major) >= 11 and int(bare_metal_minor) >= 2: - return nvcc_extra_args + ["--threads", "4"] - return nvcc_extra_args - - -def fetch_requirements(path): - with open(path, 'r') as fd: - return [r.strip() for r in fd.readlines()] - -# HC -if build_cuda_ext or build_hip_ext: - try: - import torch - from torch.utils.cpp_extension import (CUDA_HOME, BuildExtension, - CUDAExtension) - print("\n\ntorch.__version__ = {}\n\n".format(torch.__version__)) - TORCH_MAJOR = int(torch.__version__.split('.')[0]) - TORCH_MINOR = int(torch.__version__.split('.')[1]) - - if TORCH_MAJOR < 1 or (TORCH_MAJOR == 1 and TORCH_MINOR < 8): - raise RuntimeError("Colossal-AI requires Pytorch 1.8 or newer.\n" - + "The latest stable release can be obtained from https://pytorch.org/") - except ImportError: - print('torch is not found. 
CUDA extension will not be installed') - build_cuda_ext = False - build_hip_ext = False - -if build_cuda_ext or build_hip_ext: - build_cuda_ext = check_cuda_availability(CUDA_HOME) and check_cuda_torch_binary_vs_bare_metal(CUDA_HOME) - -# HC - - -if build_hip_ext: - # Set up macros for forward/backward compatibility hack around - # https://github.com/pytorch/pytorch/commit/4404762d7dd955383acee92e6f06b48144a0742e - # and - # https://github.com/NVIDIA/apex/issues/456 - # https://github.com/pytorch/pytorch/commit/eb7b39e02f7d75c26d8a795ea8c7fd911334da7e#diff-4632522f237f1e4e728cb824300403ac - version_dependent_macros = ['-DVERSION_GE_1_1', '-DVERSION_GE_1_3', '-DVERSION_GE_1_5'] - if build_hip_ext: - hip_macros = ['-DCOLOSSAL_HIP'] - - def cuda_ext_helper(name, sources, extra_cuda_flags): - return CUDAExtension(name=name, - sources=[os.path.join('colossalai/kernel/hip_native/csrc', path) for path in sources], - include_dirs=[os.path.join( - this_dir, 'colossalai/kernel/hip_native/csrc/kernels/include')] + [os.path.join(this_dir, 'colossalai/kernel/hip_native/csrc')] + ['/opt/dtk-21.04/hiprand/include'] + ['/opt/dtk-21.04/rocrand/include'], - extra_compile_args={'cxx': ['-O3'] + version_dependent_macros + hip_macros, - 'nvcc': ['-O3'] + version_dependent_macros + hip_macros + extra_cuda_flags}) - - from torch.utils.hipify import hipify_python - hipify_python.hipify( - project_directory=this_dir, - output_directory=this_dir, - includes="colossalai/kernel/cuda_native/*", - show_detailed=True, - is_pytorch_extension=True, - ) - - ext_modules.append(cuda_ext_helper('colossal_C', - ['colossal_C_frontend.cpp', - 'multi_tensor_sgd_kernel.hip', - 'multi_tensor_scale_kernel.hip', - 'multi_tensor_adam.hip', - 'multi_tensor_l2norm_kernel.hip', - 'multi_tensor_lamb.hip'], - ['-lineinfo'])) - - cc_flag = [] - extra_cuda_flags = ['-U__HIP_NO_HALF_OPERATORS__', - '-U__HIP_NO_HALF_CONVERSIONS__'] - - ext_modules.append(cuda_ext_helper('colossal_scaled_upper_triang_masked_softmax', - ['scaled_upper_triang_masked_softmax.cpp', - 'scaled_upper_triang_masked_softmax_hip.hip'], - extra_cuda_flags + cc_flag)) - - ext_modules.append(cuda_ext_helper('colossal_scaled_masked_softmax', - ['scaled_masked_softmax.cpp', 'scaled_masked_softmax_hip.hip'], - extra_cuda_flags + cc_flag)) - - extra_cuda_flags = [] - - ext_modules.append(cuda_ext_helper('colossal_layer_norm_cuda', - ['layer_norm_hip.cpp', 'layer_norm_hip_kernel.hip'], - extra_cuda_flags + cc_flag)) - - extra_cuda_flags = ['-std=c++14', - '-U__HIP_NO_HALF_OPERATORS__', - '-U__HIP_NO_HALF_CONVERSIONS__', - '-U__HIP_NO_HALF2_OPERATORS__', - '-DTHRUST_IGNORE_CUB_VERSION_CHECK'] - - ext_modules.append(cuda_ext_helper('colossal_multihead_attention', - ['multihead_attention_1d.cpp', - 'kernels/cublas_wrappers.hip', - 'kernels/transform_kernels.hip', - 'kernels/dropout_kernels.hip', - 'kernels/normalize_kernels.hip', - 'kernels/softmax_kernels.hip', - 'kernels/general_kernels.hip', - 'kernels/hip_util.hip'], - extra_cuda_flags + cc_flag)) - -if build_cuda_ext: - # Set up macros for forward/backward compatibility hack around - # https://github.com/pytorch/pytorch/commit/4404762d7dd955383acee92e6f06b48144a0742e - # and - # https://github.com/NVIDIA/apex/issues/456 - # https://github.com/pytorch/pytorch/commit/eb7b39e02f7d75c26d8a795ea8c7fd911334da7e#diff-4632522f237f1e4e728cb824300403ac - version_dependent_macros = ['-DVERSION_GE_1_1', '-DVERSION_GE_1_3', '-DVERSION_GE_1_5'] - - def cuda_ext_helper(name, sources, extra_cuda_flags): - return CUDAExtension(name=name, - 
sources=[os.path.join('colossalai/kernel/cuda_native/csrc', path) for path in sources], - include_dirs=[os.path.join( - this_dir, 'colossalai/kernel/cuda_native/csrc/kernels/include')], - extra_compile_args={'cxx': ['-O3'] + version_dependent_macros, - 'nvcc': append_nvcc_threads(['-O3', - '--use_fast_math'] + version_dependent_macros + extra_cuda_flags)}) - - ext_modules.append(cuda_ext_helper('colossal_C', - ['colossal_C_frontend.cpp', - 'multi_tensor_sgd_kernel.cu', - 'multi_tensor_scale_kernel.cu', - 'multi_tensor_adam.cu', - 'multi_tensor_l2norm_kernel.cu', - 'multi_tensor_lamb.cu'], - ['-lineinfo'])) - - cc_flag = ['-gencode', 'arch=compute_70,code=sm_70'] - _, bare_metal_major, _ = get_cuda_bare_metal_version(CUDA_HOME) - if int(bare_metal_major) >= 11: - cc_flag.append('-gencode') - cc_flag.append('arch=compute_80,code=sm_80') - - extra_cuda_flags = ['-U__CUDA_NO_HALF_OPERATORS__', - '-U__CUDA_NO_HALF_CONVERSIONS__', - '--expt-relaxed-constexpr', - '--expt-extended-lambda'] - - ext_modules.append(cuda_ext_helper('colossal_scaled_upper_triang_masked_softmax', - ['scaled_upper_triang_masked_softmax.cpp', - 'scaled_upper_triang_masked_softmax_cuda.cu'], - extra_cuda_flags + cc_flag)) - - ext_modules.append(cuda_ext_helper('colossal_scaled_masked_softmax', - ['scaled_masked_softmax.cpp', 'scaled_masked_softmax_cuda.cu'], - extra_cuda_flags + cc_flag)) - - extra_cuda_flags = ['-maxrregcount=50'] - - ext_modules.append(cuda_ext_helper('colossal_layer_norm_cuda', - ['layer_norm_cuda.cpp', 'layer_norm_cuda_kernel.cu'], - extra_cuda_flags + cc_flag)) - - extra_cuda_flags = ['-std=c++14', - '-U__CUDA_NO_HALF_OPERATORS__', - '-U__CUDA_NO_HALF_CONVERSIONS__', - '-U__CUDA_NO_HALF2_OPERATORS__', - '-DTHRUST_IGNORE_CUB_VERSION_CHECK'] - - ext_modules.append(cuda_ext_helper('colossal_multihead_attention', - ['multihead_attention_1d.cpp', - 'kernels/cublas_wrappers.cu', - 'kernels/transform_kernels.cu', - 'kernels/dropout_kernels.cu', - 'kernels/normalize_kernels.cu', - 'kernels/softmax_kernels.cu', - 'kernels/general_kernels.cu', - 'kernels/cuda_util.cu'], - extra_cuda_flags + cc_flag)) - -setup( - name='colossalai', - version='0.0.2', - packages=find_packages(exclude=('benchmark', - 'docker', - 'tests', - 'docs', - 'examples', - 'tests', - 'scripts', - 'requirements', - '*.egg-info',)), - description='An integrated large-scale model training system with efficient parallelization techniques', - ext_modules=ext_modules, - cmdclass={'build_ext': BuildExtension} if ext_modules else {}, - install_requires=fetch_requirements('requirements/requirements.txt'), - extras_require={ - 'zero': fetch_requirements('requirements/requirements-zero.txt'), - } -) diff --git a/test_gpu.sh b/test_gpu.sh deleted file mode 100755 index b2268dd1b7771b826ed366278efd24458954e75d..0000000000000000000000000000000000000000 --- a/test_gpu.sh +++ /dev/null @@ -1,6 +0,0 @@ -source /opt/dtk-22.04.2/env.sh -export LD_LIBRARY_PATH=/usr/local/lib/python3.7/site-packages/torch/lib/:$LD_LIBRARY_PATH - -export HIP_VISIBLE_DEVICES=0,1,2,3 -#DATA=./cifar_dataset pytest tests -DATA=./cifar_dataset pytest -v tests/test_zero_tensor_parallel/ diff --git a/tests/test_comm/__pycache__/test_comm.cpython-36-pytest-7.0.1.pyc b/tests/test_comm/__pycache__/test_comm.cpython-36-pytest-7.0.1.pyc deleted file mode 100644 index 46dbd394965ade2d8ed07277dfca9289a93e483a..0000000000000000000000000000000000000000 Binary files a/tests/test_comm/__pycache__/test_comm.cpython-36-pytest-7.0.1.pyc and /dev/null differ diff --git 
a/tests/test_comm/__pycache__/test_comm.cpython-36.pyc b/tests/test_comm/__pycache__/test_comm.cpython-36.pyc deleted file mode 100644 index 3e3f598b1cd2c0041f1b37471cd0f4989c028edd..0000000000000000000000000000000000000000 Binary files a/tests/test_comm/__pycache__/test_comm.cpython-36.pyc and /dev/null differ diff --git a/tests/test_comm/__pycache__/test_comm.cpython-37-pytest-7.1.3.pyc b/tests/test_comm/__pycache__/test_comm.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 7b3b7e7ed719308b5aafb5c5d6faf6b68ddcf81e..0000000000000000000000000000000000000000 Binary files a/tests/test_comm/__pycache__/test_comm.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_comm/__pycache__/test_comm.cpython-37.pyc b/tests/test_comm/__pycache__/test_comm.cpython-37.pyc deleted file mode 100644 index fd2bb92b9553c8edeb2f61df74761eb063057ea0..0000000000000000000000000000000000000000 Binary files a/tests/test_comm/__pycache__/test_comm.cpython-37.pyc and /dev/null differ diff --git a/tests/test_comm/test_comm.py b/tests/test_comm/test_comm.py deleted file mode 100644 index 4316e1a56ac535710bfb4867831b75a7c507782b..0000000000000000000000000000000000000000 --- a/tests/test_comm/test_comm.py +++ /dev/null @@ -1,73 +0,0 @@ -from functools import partial - -import pytest -import torch -import torch.distributed as dist -import torch.multiprocessing as mp -from colossalai.communication import all_gather, all_reduce, reduce_scatter -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.utils import free_port, get_current_device - -CONFIG = dict(parallel=dict(data=8, pipeline=1, tensor=dict(mode=None, size=1))) - -SIZE = 8 - - -def check_all_gather(): - tensor = torch.tensor([dist.get_rank() * SIZE + j for j in range(SIZE)]) - tensor = tensor.to(get_current_device()) - print('Before: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - tensor, op = all_gather(tensor, 0, ParallelMode.GLOBAL, async_op=True) - print('After: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - op.wait() - print('Complete: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - torch.cuda.synchronize() - - -def check_reduce_scatter(): - tensor = torch.tensor([dist.get_rank() * SIZE + j for j in range(SIZE)]) - tensor = tensor.to(get_current_device()) - print('Before: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - tensor, op = reduce_scatter(tensor, 0, ParallelMode.GLOBAL, async_op=True) - print('After: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - op.wait() - print('Complete: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - torch.cuda.synchronize() - - -def check_all_reduce(): - tensor = torch.tensor([dist.get_rank() * SIZE + j for j in range(SIZE)]) - tensor = tensor.to(get_current_device()) - print('Before: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - tensor, op = all_reduce(tensor, ParallelMode.GLOBAL, async_op=True) - print('After: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - op.wait() - print('Complete: Rank {0} - {1}'.format(dist.get_rank(), tensor)) - torch.cuda.synchronize() - - -def check_layer(rank, world_size, port): - launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - assert dist.get_rank() == gpc.get_global_rank() - print('Rank {} / {}'.format(dist.get_rank(), dist.get_world_size())) - - check_all_gather() - check_reduce_scatter() - check_all_reduce() - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def 
test_comm(): - world_size = 4 - run_func = partial(check_layer, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_comm() diff --git a/tests/test_config/__pycache__/sample_config.cpython-37-pytest-7.1.3.pyc b/tests/test_config/__pycache__/sample_config.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index f728c7c06443014b3b524e601d9fb5555c616ad1..0000000000000000000000000000000000000000 Binary files a/tests/test_config/__pycache__/sample_config.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_config/__pycache__/sample_config.cpython-37.pyc b/tests/test_config/__pycache__/sample_config.cpython-37.pyc deleted file mode 100644 index 6a6dfb1afa7c59403f6ff2980b98416a36cafc88..0000000000000000000000000000000000000000 Binary files a/tests/test_config/__pycache__/sample_config.cpython-37.pyc and /dev/null differ diff --git a/tests/test_config/__pycache__/test_load_config.cpython-37-pytest-7.1.3.pyc b/tests/test_config/__pycache__/test_load_config.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index c4c1a04f3561049a9676086c61a9ebb03c63bdc6..0000000000000000000000000000000000000000 Binary files a/tests/test_config/__pycache__/test_load_config.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_config/sample_config.py b/tests/test_config/sample_config.py deleted file mode 100644 index 08ca108281b9c0700fef4ecb2c14416ccbabfd9f..0000000000000000000000000000000000000000 --- a/tests/test_config/sample_config.py +++ /dev/null @@ -1,25 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -train_data = dict( - dataset=dict( - type='CIFAR10Dataset', - root='/path/to/data', - download=True, - transform_pipeline=[ - dict(type='RandomResizedCrop', size=224), - dict(type='RandomHorizontalFlip'), - dict(type='ToTensor'), - dict(type='Normalize', mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ), - dataloader=dict( - batch_size=64, - pin_memory=True, - num_workers=4, - sampler=dict( - type='DataParallelSampler', - shuffle=True, - ) - ) -) diff --git a/tests/test_config/test_load_config.py b/tests/test_config/test_load_config.py deleted file mode 100644 index 2c4543b750d5c11286d2dba7ece4de91b9fdf3cd..0000000000000000000000000000000000000000 --- a/tests/test_config/test_load_config.py +++ /dev/null @@ -1,27 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from pathlib import Path - -import pytest - -from colossalai.context.config import Config -from colossalai.builder import build_ophooks - - -@pytest.mark.cpu -def test_load_config(): - filename = Path(__file__).parent.joinpath('sample_config.py') - config = Config.from_file(filename) - - assert config.train_data, 'cannot access train data as attribute' - assert config.train_data.dataset, 'cannot access grandchild attribute' - assert isinstance(config.train_data.dataset.transform_pipeline[0], dict), \ - f'expected attribute transform_pipeline elements to be a dict, but found {type(config.train_data.dataset.transform_pipeline[0])}' - - -@pytest.mark.cpu -def test_load_ophooks(): - ophook_cfg = {'type': 'MemTracerOpHook', 'niter': 2} - ophook = build_ophooks(ophook_cfg) - assert ophook.niter() == 2 diff --git a/tests/test_context/__pycache__/test_2d_init.cpython-37-pytest-7.1.3.pyc b/tests/test_context/__pycache__/test_2d_init.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index ce7645369f1ad4bdecda970d2a7c08b9486bbc2f..0000000000000000000000000000000000000000 Binary files
a/tests/test_context/__pycache__/test_2d_init.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_context/__pycache__/test_2d_init.cpython-37.pyc b/tests/test_context/__pycache__/test_2d_init.cpython-37.pyc deleted file mode 100644 index 62e029150d269f2cbeed73c19b2968743c7d38bf..0000000000000000000000000000000000000000 Binary files a/tests/test_context/__pycache__/test_2d_init.cpython-37.pyc and /dev/null differ diff --git a/tests/test_context/__pycache__/test_2p5d_init.cpython-37-pytest-7.1.3.pyc b/tests/test_context/__pycache__/test_2p5d_init.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 464acf3e63e0460481a1a8ae80c0e16e0ad45eb6..0000000000000000000000000000000000000000 Binary files a/tests/test_context/__pycache__/test_2p5d_init.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_context/__pycache__/test_2p5d_init.cpython-37.pyc b/tests/test_context/__pycache__/test_2p5d_init.cpython-37.pyc deleted file mode 100644 index 57bb594b4d9d24473a3dd0f2ac24ef461adedafb..0000000000000000000000000000000000000000 Binary files a/tests/test_context/__pycache__/test_2p5d_init.cpython-37.pyc and /dev/null differ diff --git a/tests/test_context/__pycache__/test_3d_init.cpython-37-pytest-7.1.3.pyc b/tests/test_context/__pycache__/test_3d_init.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 3d4ba569d2e2271a5170e2828573cda54a9b1120..0000000000000000000000000000000000000000 Binary files a/tests/test_context/__pycache__/test_3d_init.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_context/__pycache__/test_3d_init.cpython-37.pyc b/tests/test_context/__pycache__/test_3d_init.cpython-37.pyc deleted file mode 100644 index e8bd375eaeee228c2441817480a8d96fe0332572..0000000000000000000000000000000000000000 Binary files a/tests/test_context/__pycache__/test_3d_init.cpython-37.pyc and /dev/null differ diff --git a/tests/test_context/configs/__pycache__/parallel_2d_init.cpython-37.pyc b/tests/test_context/configs/__pycache__/parallel_2d_init.cpython-37.pyc deleted file mode 100644 index 7f054e043756c40a8a4cd866dd91e61b87a36b9f..0000000000000000000000000000000000000000 Binary files a/tests/test_context/configs/__pycache__/parallel_2d_init.cpython-37.pyc and /dev/null differ diff --git a/tests/test_context/configs/__pycache__/parallel_2p5d_init.cpython-37.pyc b/tests/test_context/configs/__pycache__/parallel_2p5d_init.cpython-37.pyc deleted file mode 100644 index 0a7305b0d5d82e3243ce439afce639c1dfabf359..0000000000000000000000000000000000000000 Binary files a/tests/test_context/configs/__pycache__/parallel_2p5d_init.cpython-37.pyc and /dev/null differ diff --git a/tests/test_context/configs/__pycache__/parallel_3d_init.cpython-37.pyc b/tests/test_context/configs/__pycache__/parallel_3d_init.cpython-37.pyc deleted file mode 100644 index 8e1bfaba073023c4f04544311aa9be085b3d0651..0000000000000000000000000000000000000000 Binary files a/tests/test_context/configs/__pycache__/parallel_3d_init.cpython-37.pyc and /dev/null differ diff --git a/tests/test_context/configs/parallel_2d_init.py b/tests/test_context/configs/parallel_2d_init.py deleted file mode 100644 index 6af884450ad0fee42d86fd1ad7ee950d576dd7da..0000000000000000000000000000000000000000 --- a/tests/test_context/configs/parallel_2d_init.py +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -parallel = dict( - pipeline=dict(size=2), - tensor=dict( - size=4, - mode='2d' - ) -) diff --git a/tests/test_context/configs/parallel_2p5d_init.py 
b/tests/test_context/configs/parallel_2p5d_init.py deleted file mode 100644 index c2d896d383e26d1530bd05d4127dfdafec57d826..0000000000000000000000000000000000000000 --- a/tests/test_context/configs/parallel_2p5d_init.py +++ /dev/null @@ -1,11 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -parallel = dict( - pipeline=dict(size=2), - tensor=dict( - size=8, - depth=2, - mode='2.5d' - ) -) diff --git a/tests/test_context/configs/parallel_3d_init.py b/tests/test_context/configs/parallel_3d_init.py deleted file mode 100644 index 0ec724f8bb4f2513457568eaeb221727e4da2ff1..0000000000000000000000000000000000000000 --- a/tests/test_context/configs/parallel_3d_init.py +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -parallel = dict( - pipeline=dict(size=2), - tensor=dict( - size=8, - mode='3d' - ) -) diff --git a/tests/test_context/test_2d_init.py b/tests/test_context/test_2d_init.py deleted file mode 100644 index 117b6e0d6603523d9ce924f216a1c2ef7a88a8d0..0000000000000000000000000000000000000000 --- a/tests/test_context/test_2d_init.py +++ /dev/null @@ -1,105 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from functools import partial -from pathlib import Path - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai import launch -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.utils import free_port - -CONFIG_PATH = Path(__file__).parent.joinpath('configs/parallel_2d_init.py').absolute() - - -def check_data_parallel_rank(rank): - if rank in [0, 1, 2, 3, 4, 5, 6, 7]: - assert gpc.get_local_rank(ParallelMode.DATA) == 0 - elif rank in [8, 9, 10, 11, 12, 13, 14, 15]: - assert gpc.get_local_rank(ParallelMode.DATA) == 1 - - -def check_pipeline_parallel_rank(rank): - if rank in [0, 1, 2, 3]: - assert gpc.get_local_rank(ParallelMode.PIPELINE) == 0 - elif rank in [4, 5, 6, 7]: - assert gpc.get_local_rank(ParallelMode.PIPELINE) == 1 - elif rank in [8, 9, 10, 11]: - assert gpc.get_local_rank(ParallelMode.PIPELINE) == 0 - elif rank in [12, 13, 14, 15]: - assert gpc.get_local_rank(ParallelMode.PIPELINE) == 1 - - -def check_model_parallel_rank(rank): - for i in range(8): - if rank in [i, i+8]: - assert gpc.get_local_rank(ParallelMode.MODEL) == i - - -def check_tensor_parallel_rank(rank): - if rank in [0, 4, 8, 12]: - assert gpc.get_local_rank(ParallelMode.TENSOR) == 0 - elif rank in [1, 5, 9, 13]: - assert gpc.get_local_rank(ParallelMode.TENSOR) == 1 - elif rank in [2, 6, 10, 14]: - assert gpc.get_local_rank(ParallelMode.TENSOR) == 2 - elif rank in [3, 7, 11, 15]: - assert gpc.get_local_rank(ParallelMode.TENSOR) == 3 - - -def check_2d_parallel_rank(rank): - if rank in [0, 4, 8, 12]: - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) == 0 - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) == 0 - elif rank in [1, 5, 9, 13]: - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) == 0 - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) == 1 - elif rank in [2, 6, 10, 14]: - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) == 1 - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) == 0 - elif rank in [3, 7, 11, 15]: - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) == 1 - assert gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) == 1 - - -def init_2d(rank, world_size, backend, port, host): - dist_args = dict( - config=CONFIG_PATH, - rank=rank, - world_size=world_size, - backend=backend, - port=port, - host=host, - 
verbose=True - ) - launch(**dist_args) - - check_tensor_parallel_rank(rank) - check_data_parallel_rank(rank) - check_2d_parallel_rank(rank) - check_pipeline_parallel_rank(rank) - check_model_parallel_rank(rank) - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.cpu -def test_2d_init(): - """ - As no computation or communication is done, we can run this test on CPU. - """ - world_size = 16 - test_fn = partial(init_2d, - world_size=world_size, - backend='gloo', - port=free_port(), - host='localhost' - ) - mp.spawn(test_fn, nprocs=world_size) - - -if __name__ == '__main__': - test_2d_init() diff --git a/tests/test_context/test_2p5d_init.py b/tests/test_context/test_2p5d_init.py deleted file mode 100644 index ef67897100b8a964d07afbe90e6ac2c2efd70996..0000000000000000000000000000000000000000 --- a/tests/test_context/test_2p5d_init.py +++ /dev/null @@ -1,128 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from functools import partial -from pathlib import Path - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.utils import free_port - -CONFIG_PATH = Path(__file__).parent.joinpath('configs/parallel_2p5d_init.py').absolute() - - -def check_data_parallel_rank(rank): - dp_rank = gpc.get_local_rank(ParallelMode.DATA) - - if rank in list(range(16)): - assert dp_rank == 0 - elif rank in list(range(16, 32)): - assert dp_rank == 1 - - -def check_pipeline_parallel_rank(rank): - ppr = gpc.get_local_rank(ParallelMode.PIPELINE) - - if rank in list(range(8)): - assert ppr == 0 - elif rank in list(range(8, 16)): - assert ppr == 1 - elif rank in list(range(16, 24)): - assert ppr == 0 - elif rank in list(range(24, 32)): - assert ppr == 1 - - -def check_model_parallel_rank(rank): - for i in range(16): - if rank in [i, i+16]: - assert gpc.get_local_rank(ParallelMode.MODEL) == i - - -def check_tensor_parallel_rank(rank): - tp_rank = gpc.get_local_rank(ParallelMode.TENSOR) - - for i in range(8): - ranks = list(range(i, 32, 8)) - if rank in ranks: - assert tp_rank == i, f'{rank}:{tp_rank}' - - -def check_2p5d_parallel_rank(rank): - rp_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - cp_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - dp_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - xp_rank = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_XZ) - - # check for row parallel group - for i in range(2): - ranks = list(range(i, 32, 2)) - if rank in ranks: - assert rp_rank == i - - # check for col parallel group - for i in range(2): - ranks = list(range(i * 2, 32, 4)) - ranks_plus_ones = [val + 1 for val in ranks] - ranks.extend(ranks_plus_ones) - if rank in ranks: - assert cp_rank == i - - # check for depth parallel group - for i in range(2): - ranks = [] - for j in range(i * 4, 32, 8): - ranks.extend([j + k for k in range(4)]) - if rank in ranks: - assert dp_rank == i - - # check for xz parallel group - for i in range(2): - ranks = list(range(i * 2, 32, 8)) - ranks_plus_one = [val + 1 for val in ranks] - ranks.extend(ranks_plus_one) - if rank in ranks: - assert xp_rank == i - - -def init_2halfd(rank, world_size, backend, port, host): - dist_args = dict( - config=CONFIG_PATH, - rank=rank, - world_size=world_size, - backend=backend, - port=port, - host=host, - verbose=True - ) - launch(**dist_args) - check_data_parallel_rank(rank) - check_pipeline_parallel_rank(rank) - 
check_tensor_parallel_rank(rank) - check_2p5d_parallel_rank(rank) - check_model_parallel_rank(rank) - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.cpu -def test_2halfd_init(): - """ - As no computation or communication is done, we can run this test on CPU. - """ - world_size = 32 - test_fn = partial(init_2halfd, - world_size=world_size, - backend='gloo', - port=free_port(), - host='localhost' - ) - mp.spawn(test_fn, nprocs=world_size) - - -if __name__ == '__main__': - test_2halfd_init() diff --git a/tests/test_context/test_3d_init.py b/tests/test_context/test_3d_init.py deleted file mode 100644 index 12f0f1ea5c7dbc04a1bed61f7359280fa0c33d7a..0000000000000000000000000000000000000000 --- a/tests/test_context/test_3d_init.py +++ /dev/null @@ -1,120 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from functools import partial -from pathlib import Path - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.utils import free_port - -CONFIG_PATH = Path(__file__).parent.joinpath('configs/parallel_3d_init.py').absolute() - - -def check_data_parallel_rank(rank): - dp_rank = gpc.get_local_rank(ParallelMode.DATA) - - if rank in list(range(16)): - assert dp_rank == 0 - elif rank in list(range(16, 32)): - assert dp_rank == 1 - - -def check_pipeline_parallel_rank(rank): - ppr = gpc.get_local_rank(ParallelMode.PIPELINE) - - if rank in list(range(8)): - assert ppr == 0 - elif rank in list(range(8, 16)): - assert ppr == 1 - elif rank in list(range(16, 24)): - assert ppr == 0 - elif rank in list(range(24, 32)): - assert ppr == 1 - - -def check_model_parallel_rank(rank): - for i in range(16): - if rank in [i, i+16]: - assert gpc.get_local_rank(ParallelMode.MODEL) == i - - -def check_tensor_parallel_rank(rank): - tp_rank = gpc.get_local_rank(ParallelMode.TENSOR) - - for i in range(8): - ranks = list(range(i, 32, 8)) - if rank in ranks: - assert tp_rank == i - - -def check_3d_parallel_rank(rank): - ip_rank = gpc.get_local_rank(ParallelMode.PARALLEL_3D_INPUT) - wp_rank = gpc.get_local_rank(ParallelMode.PARALLEL_3D_WEIGHT) - op_rank = gpc.get_local_rank(ParallelMode.PARALLEL_3D_OUTPUT) - - # check for input parallel group - for i in range(2): - _ranks = list(range(i * 2, 32, 4)) - _ranks_plus_one = [val + 1 for val in _ranks] - input_ranks = _ranks + _ranks_plus_one - if rank in input_ranks: - assert ip_rank == i - - # check for weight parallel group - for i in range(2): - ranks = list(range(i, 32, 2)) - - if rank in ranks: - assert wp_rank == i - - # check for output parallel group - for i in range(2): - ranks = [] - for j in range(i * 4, 32, 8): - ranks.extend([j + k for k in range(4)]) - if rank in ranks: - assert op_rank == i - - -def init_3d(rank, world_size, backend, port, host): - dist_args = dict( - config=CONFIG_PATH, - rank=rank, - world_size=world_size, - backend=backend, - port=port, - host=host, - verbose=True - ) - launch(**dist_args) - check_tensor_parallel_rank(rank) - check_3d_parallel_rank(rank) - check_data_parallel_rank(rank) - check_pipeline_parallel_rank(rank) - check_model_parallel_rank(rank) - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.cpu -def test_3d_init(): - """ - As no computation or communication is done, we can run this test on CPU. 
- """ - world_size = 32 - test_fn = partial(init_3d, - world_size=world_size, - backend='gloo', - port=free_port(), - host='localhost' - ) - mp.spawn(test_fn, nprocs=world_size) - - -if __name__ == '__main__': - test_3d_init() diff --git a/tests/test_data/__pycache__/test_cifar10_dataset.cpython-37-pytest-7.1.3.pyc b/tests/test_data/__pycache__/test_cifar10_dataset.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 48b988cec86c28e6106684d7e36fdb58ac63d4f3..0000000000000000000000000000000000000000 Binary files a/tests/test_data/__pycache__/test_cifar10_dataset.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_data/__pycache__/test_data_parallel_sampler.cpython-37-pytest-7.1.3.pyc b/tests/test_data/__pycache__/test_data_parallel_sampler.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index fd2299bd50659a2dbe434548cf94b5156f972c97..0000000000000000000000000000000000000000 Binary files a/tests/test_data/__pycache__/test_data_parallel_sampler.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_data/__pycache__/test_data_parallel_sampler.cpython-37.pyc b/tests/test_data/__pycache__/test_data_parallel_sampler.cpython-37.pyc deleted file mode 100644 index 46156a319a47328c7e07e519eb62c2ac0b311f7a..0000000000000000000000000000000000000000 Binary files a/tests/test_data/__pycache__/test_data_parallel_sampler.cpython-37.pyc and /dev/null differ diff --git a/tests/test_data/__pycache__/test_deterministic_dataloader.cpython-37-pytest-7.1.3.pyc b/tests/test_data/__pycache__/test_deterministic_dataloader.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 19226b4b413e568f37a6c3915fd1e76eed7e07c7..0000000000000000000000000000000000000000 Binary files a/tests/test_data/__pycache__/test_deterministic_dataloader.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_data/__pycache__/test_deterministic_dataloader.cpython-37.pyc b/tests/test_data/__pycache__/test_deterministic_dataloader.cpython-37.pyc deleted file mode 100644 index f192232944d6627a04141162f1e6dc3883bba92e..0000000000000000000000000000000000000000 Binary files a/tests/test_data/__pycache__/test_deterministic_dataloader.cpython-37.pyc and /dev/null differ diff --git a/tests/test_data/test_cifar10_dataset.py b/tests/test_data/test_cifar10_dataset.py deleted file mode 100644 index 569cea2ca1edfc2c2f3cc5725a7c69dd7cbee842..0000000000000000000000000000000000000000 --- a/tests/test_data/test_cifar10_dataset.py +++ /dev/null @@ -1,54 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from pathlib import Path - -import pytest -from torchvision import transforms -from torch.utils.data import DataLoader - -from colossalai.builder import build_dataset, build_transform -from colossalai.context import Config - -TRAIN_DATA = dict( - dataset=dict( - type='CIFAR10', - root=Path(os.environ['DATA']), - train=True, - download=True - ), - dataloader=dict(batch_size=4, shuffle=True, num_workers=2), - transform_pipeline=[ - dict(type='ToTensor'), - dict(type='Normalize', - mean=(0.5, 0.5, 0.5), - std=(0.5, 0.5, 0.5) - ) - ] -) - - -@pytest.mark.cpu -def test_cifar10_dataset(): - config = Config(TRAIN_DATA) - dataset_cfg = config.dataset - dataloader_cfg = config.dataloader - transform_cfg = config.transform_pipeline - - # build transform - transform_pipeline = [build_transform(cfg) for cfg in transform_cfg] - transform_pipeline = transforms.Compose(transform_pipeline) - dataset_cfg['transform'] = transform_pipeline - - # build dataset - dataset = 
build_dataset(dataset_cfg) - - # build dataloader - dataloader = DataLoader(dataset=dataset, **dataloader_cfg) - data_iter = iter(dataloader) - img, label = next(data_iter) - - -if __name__ == '__main__': - test_cifar10_dataset() diff --git a/tests/test_data/test_data_parallel_sampler.py b/tests/test_data/test_data_parallel_sampler.py deleted file mode 100644 index 18d3e1b35400abbfe299c0343ac45c2504507ddd..0000000000000000000000000000000000000000 --- a/tests/test_data/test_data_parallel_sampler.py +++ /dev/null @@ -1,87 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import pytest -import torch -import torch.distributed as dist -import torch.multiprocessing as mp -from torch.utils.data import DataLoader - -import colossalai -from colossalai.builder import build_dataset, build_data_sampler, build_transform -from torchvision import transforms -from colossalai.context import ParallelMode, Config -from colossalai.core import global_context as gpc -from colossalai.utils import get_dataloader - -CONFIG = Config( - dict( - train_data=dict( - dataset=dict( - type='CIFAR10', - root=Path(os.environ['DATA']), - train=True, - download=True, - ), - dataloader=dict( - batch_size=8, - ), - transform_pipeline=[ - dict(type='ToTensor'), - dict(type='Normalize', mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ), - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None), - ), - seed=1024, - )) - - -def run_data_sampler(rank, world_size): - dist_args = dict( - config=CONFIG, - rank=rank, - world_size=world_size, - backend='gloo', - port='29903', - host='localhost' - ) - colossalai.launch(**dist_args) - print('finished initialization') - - transform_pipeline = [build_transform(cfg) for cfg in gpc.config.train_data.transform_pipeline] - transform_pipeline = transforms.Compose(transform_pipeline) - gpc.config.train_data.dataset['transform'] = transform_pipeline - dataset = build_dataset(gpc.config.train_data.dataset) - dataloader = get_dataloader(dataset, **gpc.config.train_data.dataloader) - data_iter = iter(dataloader) - img, label = next(data_iter) - img = img[0] - - if gpc.get_local_rank(ParallelMode.DATA) != 0: - img_to_compare = img.clone() - else: - img_to_compare = img - dist.broadcast(img_to_compare, src=0, group=gpc.get_group(ParallelMode.DATA)) - - if gpc.get_local_rank(ParallelMode.DATA) != 0: - assert not torch.equal(img, - img_to_compare), 'Same image was distributed across ranks but expected it to be different' - torch.cuda.empty_cache() - - -@pytest.mark.cpu -def test_data_sampler(): - world_size = 4 - test_func = partial(run_data_sampler, world_size=world_size) - mp.spawn(test_func, nprocs=world_size) - - -if __name__ == '__main__': - test_data_sampler() diff --git a/tests/test_data/test_deterministic_dataloader.py b/tests/test_data/test_deterministic_dataloader.py deleted file mode 100644 index c96a3210f07100dcfa03c3b0ee5376de48d23e15..0000000000000000000000000000000000000000 --- a/tests/test_data/test_deterministic_dataloader.py +++ /dev/null @@ -1,101 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import pytest -import torch -import torch.distributed as dist -import torch.multiprocessing as mp -from torchvision import transforms -from torch.utils.data import DataLoader - -import colossalai -from colossalai.builder import build_dataset, build_transform -from colossalai.context import ParallelMode, Config -from
colossalai.core import global_context as gpc - -CONFIG = Config( - dict( - train_data=dict( - dataset=dict( - type='CIFAR10', - root=Path(os.environ['DATA']), - train=True, - download=True, - ), - dataloader=dict( - num_workers=2, - batch_size=2, - shuffle=True - ), - transform_pipeline=[ - dict(type='ToTensor'), - dict(type='RandomCrop', size=32), - dict(type='Normalize', mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ), - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None), - ), - seed=1024, - ) -) - - -def run_data_sampler(rank, world_size): - dist_args = dict( - config=CONFIG, - rank=rank, - world_size=world_size, - backend='gloo', - port='29904', - host='localhost' - ) - colossalai.launch(**dist_args) - - dataset_cfg = gpc.config.train_data.dataset - dataloader_cfg = gpc.config.train_data.dataloader - transform_cfg = gpc.config.train_data.transform_pipeline - - # build transform - transform_pipeline = [build_transform(cfg) for cfg in transform_cfg] - transform_pipeline = transforms.Compose(transform_pipeline) - dataset_cfg['transform'] = transform_pipeline - - # build dataset - dataset = build_dataset(dataset_cfg) - - # build dataloader - dataloader = DataLoader(dataset=dataset, **dataloader_cfg) - - data_iter = iter(dataloader) - img, label = next(data_iter) - img = img[0] - - if gpc.get_local_rank(ParallelMode.DATA) != 0: - img_to_compare = img.clone() - else: - img_to_compare = img - dist.broadcast(img_to_compare, src=0, group=gpc.get_group(ParallelMode.DATA)) - - if gpc.get_local_rank(ParallelMode.DATA) != 0: - # this dataloader is built without a data parallel sampler - # this assertion would fail if a data parallel sampler were given to the dataloader - assert torch.equal(img, - img_to_compare), 'Same image was distributed across ranks and expected it to be the same' - torch.cuda.empty_cache() - - -@pytest.mark.cpu -def test_data_sampler(): - world_size = 4 - test_func = partial(run_data_sampler, world_size=world_size) - mp.spawn(test_func, nprocs=world_size) - - -if __name__ == '__main__': - test_data_sampler() diff --git a/tests/test_data_pipeline_tensor_parallel/__pycache__/test_cifar_with_data_pipeline_tensor.cpython-37-pytest-7.1.3.pyc b/tests/test_data_pipeline_tensor_parallel/__pycache__/test_cifar_with_data_pipeline_tensor.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index ab00153b83e84e1cea0c88d8f076239727430f79..0000000000000000000000000000000000000000 Binary files a/tests/test_data_pipeline_tensor_parallel/__pycache__/test_cifar_with_data_pipeline_tensor.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_data_pipeline_tensor_parallel/__pycache__/test_cifar_with_data_pipeline_tensor.cpython-37.pyc b/tests/test_data_pipeline_tensor_parallel/__pycache__/test_cifar_with_data_pipeline_tensor.cpython-37.pyc deleted file mode 100644 index a35c5eb73b6d14659931afc13db202f47b3a2402..0000000000000000000000000000000000000000 Binary files a/tests/test_data_pipeline_tensor_parallel/__pycache__/test_cifar_with_data_pipeline_tensor.cpython-37.pyc and /dev/null differ diff --git a/tests/test_data_pipeline_tensor_parallel/test_cifar_with_data_pipeline_tensor.py b/tests/test_data_pipeline_tensor_parallel/test_cifar_with_data_pipeline_tensor.py deleted file mode 100644 index d35937f3fc5ee05862ccb0b891a5967b7d6979b2..0000000000000000000000000000000000000000 --- a/tests/test_data_pipeline_tensor_parallel/test_cifar_with_data_pipeline_tensor.py +++ /dev/null @@ -1,104 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest
-import torch -import torch.multiprocessing as mp -from colossalai.amp.amp_type import AMP_TYPE -from colossalai.builder import build_pipeline_model -from colossalai.engine.schedule import PipelineSchedule -from colossalai.logging import get_dist_logger -from colossalai.nn import Accuracy, LinearWarmupLR -from colossalai.nn.loss import CrossEntropyLoss -from colossalai.trainer import Trainer, hooks -from colossalai.utils import MultiTimer, free_port, get_dataloader -from colossalai.utils.gradient_accumulation import GradAccumLrSchedulerByStep -from model_zoo.vit import vit_tiny_patch4_32 -from torchvision import transforms -from torchvision.datasets import CIFAR10 - -BATCH_SIZE = 16 -NUM_EPOCHS = 60 -WARMUP_EPOCHS = 5 -CONFIG = dict(parallel=dict(pipeline=2, tensor=dict(size=2, mode='1d')), - fp16=dict(mode=AMP_TYPE.NAIVE), - gradient_accumulation=2) - - -def run_trainer(rank, world_size, port): - colossalai.launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - logger = get_dist_logger() - - model = vit_tiny_patch4_32() - pipe_model = build_pipeline_model(model.layers, num_chunks=1) - - # build dataloaders - transform_train = transforms.Compose([ - transforms.RandomCrop(32, padding=4), - transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10), - transforms.ToTensor(), - transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), - ]) - transform_test = transforms.Compose([ - transforms.Resize(32), - transforms.ToTensor(), - transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), - ]) - - train_dataset = CIFAR10(root=Path(os.environ['DATA']), train=True, download=True, transform=transform_train) - test_dataset = CIFAR10(root=Path(os.environ['DATA']), train=False, transform=transform_test) - train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=BATCH_SIZE, pin_memory=True) - test_dataloader = get_dataloader(dataset=test_dataset, batch_size=BATCH_SIZE, pin_memory=True) - - # build criterion - criterion = CrossEntropyLoss() - - # optimizer - optimizer = torch.optim.Adam(pipe_model.parameters(), lr=0.001, weight_decay=0) - - # lr_scheduler - steps_per_epoch = GradAccumLrSchedulerByStep.compute_effective_steps_per_epoch(train_dataloader, accumulate_size=2) - total_steps = steps_per_epoch * NUM_EPOCHS - warmup_steps = steps_per_epoch * WARMUP_EPOCHS - lr_scheduler = LinearWarmupLR(optimizer, total_steps=total_steps, warmup_steps=warmup_steps) - - engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(pipe_model, optimizer, criterion, - train_dataloader, test_dataloader, - lr_scheduler) - - timer = MultiTimer() - - schedule = PipelineSchedule(num_microbatches=4) - - trainer = Trainer(engine=engine, timer=timer, logger=logger, schedule=schedule) - - hook_list = [ - hooks.LossHook(), - hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False), - hooks.LogMetricByEpochHook(logger), - ] - - trainer.fit(train_dataloader=train_dataloader, - epochs=NUM_EPOCHS, - max_steps=5, - test_dataloader=test_dataloader, - test_interval=1, - hooks=hook_list, - display_progress=True) - - -@pytest.mark.dist -# @pytest.mark.skip("This test requires more than 8 GPUs, you should invoke this test script using test.sh provided manually") -def test_hybrid_parallel(): - ## HC - #world_size = 8 - world_size = 4 - run_func = partial(run_trainer, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - 
test_hybrid_parallel() diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_apex_amp.cpython-37-pytest-7.1.3.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_apex_amp.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 56e16e2a21b2e74ef1661efb7af6ae48091c320c..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_apex_amp.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_apex_amp.cpython-37.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_apex_amp.cpython-37.pyc deleted file mode 100644 index 9d1dc5f5aa59dd1d6de10a692c45f9bd645efaee..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_apex_amp.cpython-37.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_naive_amp.cpython-37-pytest-7.1.3.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_naive_amp.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index da1781c74b9ce4c2ed50831eac9c620bee062cca..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_naive_amp.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_naive_amp.cpython-37.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_naive_amp.cpython-37.pyc deleted file mode 100644 index af3459e9f321075bd416d054dae8049731115b91..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_naive_amp.cpython-37.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_no_amp.cpython-37-pytest-7.1.3.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_no_amp.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 93d83b302b4e109e51b6a3329ba5cd4bbea3a523..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_no_amp.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_no_amp.cpython-37.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_no_amp.cpython-37.pyc deleted file mode 100644 index 665a3a67d6a14360b08505cf343f55c25ec62f1e..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_no_amp.cpython-37.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_torch_amp.cpython-37-pytest-7.1.3.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_torch_amp.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 3f29777bb9dabdbc4720a5f7506d341852fad2c4..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_torch_amp.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/__pycache__/test_engine_torch_amp.cpython-37.pyc b/tests/test_engine/test_engine/__pycache__/test_engine_torch_amp.cpython-37.pyc deleted file mode 100644 index c6189ba090356309438f4c205d1427d17cd28375..0000000000000000000000000000000000000000 Binary files a/tests/test_engine/test_engine/__pycache__/test_engine_torch_amp.cpython-37.pyc and /dev/null differ diff --git a/tests/test_engine/test_engine/test_engine_apex_amp.py b/tests/test_engine/test_engine/test_engine_apex_amp.py deleted file mode 100644 index 
164ae54bb7146a6c65f939c11f065fd07419e337..0000000000000000000000000000000000000000 --- a/tests/test_engine/test_engine/test_engine_apex_amp.py +++ /dev/null @@ -1,110 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.amp import AMP_TYPE -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import free_port, get_dataloader, report_memory_usage -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -# Config -BATCH_SIZE = 128 -IMG_SIZE = 224 -DIM = 768 -NUM_CLASSES = 10 -NUM_ATTN_HEADS = 12 - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ), - fp16=dict(mode=AMP_TYPE.APEX), - clip_grad_norm=1.0 -) - - -def run_engine(rank, world_size, port): - # init dist env - colossalai.launch( - config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - drop_last=True) - - # build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize( - model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader - ) - logger = get_dist_logger() - rank = torch.distributed.get_rank() - - engine.train() - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - output = engine(img) - loss = engine.criterion(output, label) - engine.backward(loss) - engine.step() - break - - logger.info('Rank {} returns: {}'.format(rank, loss.item())) - - gpc.destroy() - logger.info('Test engine finished') - report_memory_usage("After testing") - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_engine(): - world_size = 4 - run_func = partial(run_engine, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_engine() diff --git a/tests/test_engine/test_engine/test_engine_naive_amp.py b/tests/test_engine/test_engine/test_engine_naive_amp.py deleted file mode 100644 index 95c6203683174ed4a46e9f4dd738e54e9cb5dd1d..0000000000000000000000000000000000000000 --- a/tests/test_engine/test_engine/test_engine_naive_amp.py +++ /dev/null @@ -1,109 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.amp import AMP_TYPE -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import free_port, get_dataloader, report_memory_usage -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -# Config -BATCH_SIZE = 128
-IMG_SIZE = 224 -DIM = 768 -NUM_CLASSES = 10 -NUM_ATTN_HEADS = 12 - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ), - fp16=dict( - mode=AMP_TYPE.NAIVE, - clip_grad=1.0 - ) -) - - -def run_engine(rank, world_size, port): - # init dist env - colossalai.launch( - config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - drop_last=True) - - # build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize( - model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader - ) - logger = get_dist_logger() - rank = torch.distributed.get_rank() - - engine.train() - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - output = engine(img) - loss = engine.criterion(output, label) - engine.backward(loss) - engine.step() - break - - logger.info('Rank {} returns: {}'.format(rank, loss.item())) - - gpc.destroy() - logger.info('Test engine finished') - report_memory_usage("After testing") - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_engine(): - world_size = 4 - run_func = partial(run_engine, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_engine() diff --git a/tests/test_engine/test_engine/test_engine_no_amp.py b/tests/test_engine/test_engine/test_engine_no_amp.py deleted file mode 100644 index 13668e251274e372da5167c895d98dc32e865111..0000000000000000000000000000000000000000 --- a/tests/test_engine/test_engine/test_engine_no_amp.py +++ /dev/null @@ -1,105 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import free_port, get_dataloader, report_memory_usage -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -# Config -BATCH_SIZE = 128 -IMG_SIZE = 224 -DIM = 768 -NUM_CLASSES = 10 -NUM_ATTN_HEADS = 12 - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ), - clip_grad_norm=1.0 -) - - -def run_engine(rank, world_size, port): - # init dist env - colossalai.launch( - config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - drop_last=True) - - # 
build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize( - model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader - ) - logger = get_dist_logger() - rank = torch.distributed.get_rank() - - engine.train() - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - output = engine(img) - loss = engine.criterion(output, label) - engine.backward(loss) - engine.step() - break - - logger.info('Rank {} returns: {}'.format(rank, loss.item())) - - gpc.destroy() - logger.info('Test engine finished') - report_memory_usage("After testing") - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_engine(): - world_size = 4 - run_func = partial(run_engine, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_engine() diff --git a/tests/test_engine/test_engine/test_engine_torch_amp.py b/tests/test_engine/test_engine/test_engine_torch_amp.py deleted file mode 100644 index 435df81dcdfb7a408ef9c62035339d3d5c260051..0000000000000000000000000000000000000000 --- a/tests/test_engine/test_engine/test_engine_torch_amp.py +++ /dev/null @@ -1,107 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.amp import AMP_TYPE -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import free_port, get_dataloader, report_memory_usage -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -# Config -BATCH_SIZE = 128 -IMG_SIZE = 224 -DIM = 768 -NUM_CLASSES = 10 -NUM_ATTN_HEADS = 12 - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ), - fp16=dict(mode=AMP_TYPE.TORCH), - clip_grad_norm=1.0 -) - - -def run_engine(rank, world_size, port): - # init dist env - colossalai.launch( - config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - drop_last=True) - - # build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize( - model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader - ) - logger = get_dist_logger() - rank = torch.distributed.get_rank() - - engine.train() - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - output = engine(img) - loss = engine.criterion(output, label) - engine.backward(loss) - engine.step() - break - - logger.info('Rank {} returns: {}'.format(rank, loss.item())) - - gpc.destroy() - logger.info('Test engine finished') - report_memory_usage("After testing") - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_engine(): - 
world_size = 4 - run_func = partial(run_engine, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_engine() diff --git a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-36-pytest-7.0.1.pyc b/tests/test_layers/test_1d/__pycache__/test_1d.cpython-36-pytest-7.0.1.pyc deleted file mode 100644 index 5e1b8422e469e4c4884cb4b771a3a5360952e646..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-36-pytest-7.0.1.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-36.pyc b/tests/test_layers/test_1d/__pycache__/test_1d.cpython-36.pyc deleted file mode 100644 index a080f1553613fe375de6c6427806fa99574ca4f3..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-36.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-37-pytest-7.1.3.pyc b/tests/test_layers/test_1d/__pycache__/test_1d.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index a45888b75df9a3fbef001dbb59e188d51e569abd..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-37.pyc b/tests/test_layers/test_1d/__pycache__/test_1d.cpython-37.pyc deleted file mode 100644 index 8b6cfc37ba2b3aa9d7b5202cbe5a7fe0a4e6a64c..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/__pycache__/test_1d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/__init__.py b/tests/test_layers/test_1d/checks_1d/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/tests/test_layers/test_1d/checks_1d/__pycache__/__init__.cpython-36.pyc b/tests/test_layers/test_1d/checks_1d/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index 023255fdd62c589d7d0a0d0fc0304f54e215316d..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/checks_1d/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/__pycache__/__init__.cpython-37.pyc b/tests/test_layers/test_1d/checks_1d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 7f156462a4e3b55119d70c9704d1b8d4df32a8a3..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/checks_1d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/__pycache__/check_layer_1d.cpython-36.pyc b/tests/test_layers/test_1d/checks_1d/__pycache__/check_layer_1d.cpython-36.pyc deleted file mode 100644 index 8a8d04d7dcaf0e6c5a519efed838fa6de338b07f..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/checks_1d/__pycache__/check_layer_1d.cpython-36.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/__pycache__/check_layer_1d.cpython-37.pyc b/tests/test_layers/test_1d/checks_1d/__pycache__/check_layer_1d.cpython-37.pyc deleted file mode 100644 index 3de061450b20637bcc1601cd36c3c8c835af1f88..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/checks_1d/__pycache__/check_layer_1d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/__pycache__/common.cpython-36.pyc 
b/tests/test_layers/test_1d/checks_1d/__pycache__/common.cpython-36.pyc deleted file mode 100644 index 2a8bb54ec81d11b1cf90f60f3f9ef45623026076..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/checks_1d/__pycache__/common.cpython-36.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/__pycache__/common.cpython-37.pyc b/tests/test_layers/test_1d/checks_1d/__pycache__/common.cpython-37.pyc deleted file mode 100644 index 7cd805d10f8d45a13d29bfbcd0137ccac6a367f1..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_1d/checks_1d/__pycache__/common.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_1d/checks_1d/check_layer_1d.py b/tests/test_layers/test_1d/checks_1d/check_layer_1d.py deleted file mode 100644 index 5e1681da9c76db6733a352bb3c86de3d7ff1721f..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_1d/checks_1d/check_layer_1d.py +++ /dev/null @@ -1,496 +0,0 @@ -import torch -import torch.distributed as dist -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.global_variables import tensor_parallel_env as env -from colossalai.nn import (Classifier1D, Embedding1D, Linear1D_Col, Linear1D_Row, VanillaClassifier, - VocabParallelClassifier1D, VocabParallelCrossEntropyLoss1D, VocabParallelEmbedding1D) -from colossalai.utils import get_current_device, print_rank_0 -from torch.nn import Parameter - -from .common import BATCH_SIZE, DEPTH, HIDDEN_SIZE, NUM_CLASSES, SEQ_LENGTH, VOCAB_SIZE, check_equal - - -def check_linear_col(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = 2 * HIDDEN_SIZE - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - layer = Linear1D_Col(INPUT_SIZE, OUTPUT_SIZE) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - dist.broadcast(A_master, src=0) - A = A_master.clone() - A.requires_grad = True - - W_shape = (OUTPUT_SIZE, INPUT_SIZE) - W_master = torch.randn(W_shape, dtype=dtype, device=device) - dist.broadcast(W_master, src=0) - W = torch.chunk(W_master, DEPTH, dim=0)[i] - W = W.clone() - W.requires_grad = True - - B_shape = (OUTPUT_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=device) - dist.broadcast(B_master, src=0) - B = torch.chunk(B_master, DEPTH, dim=0)[i] - B = B.clone() - B.requires_grad = True - - layer.weight = Parameter(W) - layer.bias = Parameter(B) - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - W_master = W_master.clone() - W_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, W_master.transpose(0, 1)) + B_master - C = torch.chunk(C_master, DEPTH, dim=-1)[i] - - check_equal(out, C) - print_rank_0('linear_col forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - dist.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=-1)[i] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - check_equal(A_grad, A.grad) - - W_grad = W_master.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=0)[i] - check_equal(W_grad, layer.weight.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - check_equal(B_grad, layer.bias.grad) - - 
print_rank_0('linear_col backward: pass') - - -def check_linear_row(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = 2 * HIDDEN_SIZE - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - layer = Linear1D_Row(OUTPUT_SIZE, INPUT_SIZE) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, OUTPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - dist.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=-1)[i] - A = A.clone() - A.requires_grad = True - - W_shape = (INPUT_SIZE, OUTPUT_SIZE) - W_master = torch.randn(W_shape, dtype=dtype, device=device) - dist.broadcast(W_master, src=0) - W = torch.chunk(W_master, DEPTH, dim=-1)[i] - W = W.clone() - W.requires_grad = True - - B_shape = (INPUT_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=device) - dist.broadcast(B_master, src=0) - B = B_master.clone() - B.requires_grad = True - - layer.weight = Parameter(W) - layer.bias = Parameter(B) - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - W_master = W_master.clone() - W_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, W_master.transpose(0, 1)) + B_master - C = C_master.clone() - - check_equal(out, C) - print_rank_0('linear_row forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - dist.broadcast(grad_master, src=0) - grad = grad_master.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[i] - check_equal(A_grad, A.grad) - - W_grad = W_master.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[i] - check_equal(W_grad, layer.weight.grad) - - B_grad = B_master.grad - check_equal(B_grad, layer.bias.grad) - - print_rank_0('linear_row backward: pass') - - -def check_embed(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - embed = Embedding1D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[i] - embed.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = embed(A) - - A_master = A_master.clone() - C_master = embed_master(A_master) - C = C_master.clone() - check_equal(out, C) - print_rank_0('embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = grad_master.clone() - out.backward(grad) - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[i] - check_equal(B_grad, embed.weight.grad) - print_rank_0('embed backward: pass') - - -def check_vocab_parallel_embed(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - embed = VocabParallelEmbedding1D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - 
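# --- Editor's annotation (not part of the original patch) ---
# Note the split axis: check_embed() above shards the embedding weight along
# the hidden dimension (torch.chunk(weight_master, DEPTH, dim=-1)), whereas
# this vocab-parallel check shards along the vocabulary dimension (dim=0
# below), so each rank owns a contiguous block of token rows rather than a
# slice of every token's features.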
embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=0)[i] - embed.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = embed(A) - - A_master = A_master.clone() - C_master = embed_master(A_master) - C = C_master.clone() - check_equal(out, C) - print_rank_0('vocab parallel embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = grad_master.clone() - out.backward(grad) - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - check_equal(B_grad, embed.weight.grad) - print_rank_0('vocab parallel embed backward: pass') - - -def check_classifier_no_given_weight(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - env.parallel_input_1d = False - parallel_input_1d = env.parallel_input_1d - layer = Classifier1D(HIDDEN_SIZE, NUM_CLASSES, bias=True) - layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, NUM_CLASSES, bias=True) - layer_master = layer_master.to(dtype).to(device) - - W_master = layer_master.weight.data - dist.broadcast(W_master, src=0) - W = torch.chunk(W_master, DEPTH, dim=-1)[i] - layer.weight.data.copy_(W) - B_master = layer_master.bias.data - dist.broadcast(B_master, src=0) - B = B_master.clone() - layer.bias.data.copy_(B) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - dist.broadcast(A_master, src=0) - if parallel_input_1d: - A = torch.chunk(A_master, DEPTH, dim=-1)[i] - A = A.clone() - else: - A = A_master.clone() - A.requires_grad = True - - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = C_master.clone() - - check_equal(out, C) - print_rank_0('classifier (no given weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - dist.broadcast(grad_master, src=0) - grad = grad_master.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - if parallel_input_1d: - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[i] - check_equal(A_grad, A.grad) - - W_grad = layer_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[i] - check_equal(W_grad, layer.weight.grad) - - B_grad = layer_master.bias.grad - check_equal(B_grad, layer.bias.grad) - - print_rank_0('classifier (no given weight) backward: pass') - - -def check_vocab_parallel_classifier_no_given_weight(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - layer = VocabParallelClassifier1D(HIDDEN_SIZE, VOCAB_SIZE, bias=True) - layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, bias=True) - layer_master = layer_master.to(dtype).to(device) - - W_master = layer_master.weight.data - dist.broadcast(W_master, src=0) - W = torch.chunk(W_master, DEPTH, dim=0)[i] - layer.weight.data.copy_(W) - B_master = layer_master.bias.data - dist.broadcast(B_master, src=0) - B = 
torch.chunk(B_master, DEPTH, dim=0)[i] - layer.bias.data.copy_(B) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - dist.broadcast(A_master, src=0) - A = A_master.clone() - A.requires_grad = True - - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=-1)[i] - - check_equal(out, C) - print_rank_0('vocab parallel classifier (no given weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - dist.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=-1)[i] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - check_equal(A_grad, A.grad) - - W_grad = layer_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=0)[i] - check_equal(W_grad, layer.weight.grad) - - B_grad = layer_master.bias.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - check_equal(B_grad, layer.bias.grad) - - print_rank_0('vocab parallel classifier (no given weight) backward: pass') - - -def check_classifier_given_embed_weight(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - embed = Embedding1D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[i] - embed.weight.data.copy_(weight) - - env.parallel_input_1d = False - layer = Classifier1D(HIDDEN_SIZE, NUM_CLASSES, weight=embed.weight, bias=False) - layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, NUM_CLASSES, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(embed(A)) - - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = C_master.clone() - check_equal(out, C) - print_rank_0('classifier (given embed weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - dist.broadcast(grad_master, src=0) - grad = grad_master.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - W_grad = embed_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[i] - check_equal(W_grad, embed.weight.grad) - - print_rank_0('classifier (given embed weight) backward: pass') - - -def check_vocab_parallel_classifier_given_embed_weight(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - embed = VocabParallelEmbedding1D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=0)[i] - embed.weight.data.copy_(weight) - - env.parallel_input_1d = False - layer = VocabParallelClassifier1D(HIDDEN_SIZE, 
NUM_CLASSES, weight=embed.weight, bias=False) - layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, NUM_CLASSES, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(embed(A)) - - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, DEPTH, dim=-1)[i] - check_equal(out, C) - print_rank_0('vocab parallel classifier (given embed weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - dist.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=-1)[i] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - W_grad = embed_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=0)[i] - check_equal(W_grad, embed.weight.grad) - - print_rank_0('vocab parallel classifier (given embed weight) backward: pass') - - -def check_vocab_parallel_loss(): - device = get_current_device() - dtype = torch.float32 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_1D) - - criterion = VocabParallelCrossEntropyLoss1D() - criterion_master = torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, SEQ_LENGTH, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, SEQ_LENGTH), dtype=torch.long, device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, DEPTH, dim=-1)[i] - out = out.clone() - out.requires_grad = True - - loss = criterion(out, target_master) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - check_equal(loss, loss_master) - print_rank_0('vocab parallel loss forward: pass') - - loss.backward() - loss_master.backward() - - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, DEPTH, dim=-1)[i] - check_equal(out_grad, out.grad) - print_rank_0('vocab parallel loss backward: pass') diff --git a/tests/test_layers/test_1d/checks_1d/common.py b/tests/test_layers/test_1d/checks_1d/common.py deleted file mode 100644 index 8b7b28613d223bfb0ab249dee01869446a08e406..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_1d/checks_1d/common.py +++ /dev/null @@ -1,15 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - -DEPTH = 4 -BATCH_SIZE = 8 -SEQ_LENGTH = 8 -IMG_SIZE = 16 -HIDDEN_SIZE = 8 -NUM_CLASSES = 8 -VOCAB_SIZE = 16 - -def check_equal(A, B): - assert torch.allclose(A, B, rtol=1e-3, atol=1e-1) == True diff --git a/tests/test_layers/test_1d/test_1d.py b/tests/test_layers/test_1d/test_1d.py deleted file mode 100644 index 58b914b90c5e696e2ae63fba32defcba456b3405..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_1d/test_1d.py +++ /dev/null @@ -1,58 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from functools import partial - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.logging import disable_existing_loggers -from colossalai.initialize import launch -from colossalai.utils import free_port - -from checks_1d.check_layer_1d import * - -CONFIG = dict( - 
parallel=dict( - pipeline=dict(size=1), - tensor=dict( - size=4, - mode='1d' - ) - ), -) - - -def check_layer(rank, world_size, port): - disable_existing_loggers() - launch(config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl') - - check_linear_col() - check_linear_row() - check_embed() - check_vocab_parallel_embed() - check_classifier_no_given_weight() - check_vocab_parallel_classifier_no_given_weight() - check_classifier_given_embed_weight() - check_vocab_parallel_classifier_given_embed_weight() - check_vocab_parallel_loss() - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_1d(): - world_size = 4 - run_func = partial(check_layer, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_1d() diff --git a/tests/test_layers/test_2d/__pycache__/test_2d.cpython-37-pytest-7.1.3.pyc b/tests/test_layers/test_2d/__pycache__/test_2d.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index f307b2d8f8dbc693b719fe3908355a57344e6121..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2d/__pycache__/test_2d.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_layers/test_2d/__pycache__/test_2d.cpython-37.pyc b/tests/test_layers/test_2d/__pycache__/test_2d.cpython-37.pyc deleted file mode 100644 index 053f4e0cb75e42a34b1a7e6da10c964d1324cf5b..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2d/__pycache__/test_2d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2d/checks_2d/__init__.py b/tests/test_layers/test_2d/checks_2d/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/tests/test_layers/test_2d/checks_2d/__pycache__/__init__.cpython-37.pyc b/tests/test_layers/test_2d/checks_2d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 0748a4394dc27df2ec7743fab79b8cb1a7efd2c9..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2d/checks_2d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2d/checks_2d/__pycache__/check_layer_2d.cpython-37.pyc b/tests/test_layers/test_2d/checks_2d/__pycache__/check_layer_2d.cpython-37.pyc deleted file mode 100644 index 7b2795caf1ad2658c9cc8d4a38f3eaaf84f4b845..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2d/checks_2d/__pycache__/check_layer_2d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2d/checks_2d/__pycache__/check_operation_2d.cpython-37.pyc b/tests/test_layers/test_2d/checks_2d/__pycache__/check_operation_2d.cpython-37.pyc deleted file mode 100644 index d4f3da4ddfa060e5377b569c2e9afbe35a2cbcbd..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2d/checks_2d/__pycache__/check_operation_2d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2d/checks_2d/__pycache__/common.cpython-37.pyc b/tests/test_layers/test_2d/checks_2d/__pycache__/common.cpython-37.pyc deleted file mode 100644 index 1a453b526f3b393248e1a2eb096cdbfc51835121..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2d/checks_2d/__pycache__/common.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2d/checks_2d/check_layer_2d.py b/tests/test_layers/test_2d/checks_2d/check_layer_2d.py deleted file mode 100644 index 
e030e473a36311250bd0310b94261a46b77d561a..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2d/checks_2d/check_layer_2d.py +++ /dev/null @@ -1,741 +0,0 @@ -import torch -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn import (Classifier2D, CrossEntropyLoss2D, Embedding2D, LayerNorm2D, Linear2D, PatchEmbedding2D, - VanillaClassifier, VanillaPatchEmbedding, VocabParallelClassifier2D, - VocabParallelCrossEntropyLoss2D, VocabParallelEmbedding2D) -from colossalai.utils import get_current_device, print_rank_0 - -from .common import (BATCH_SIZE, DEPTH, HIDDEN_SIZE, IMG_SIZE, NUM_CLASSES, SEQ_LENGTH, VOCAB_SIZE, check_equal) - - -def check_linear(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = HIDDEN_SIZE - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - layer = Linear2D(INPUT_SIZE, OUTPUT_SIZE) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - W_shape = (INPUT_SIZE, OUTPUT_SIZE) - W_master = torch.randn(W_shape, dtype=dtype, device=device) - torch.distributed.broadcast(W_master, src=0) - W = torch.chunk(W_master, DEPTH, dim=0)[i] - W = torch.chunk(W, DEPTH, dim=-1)[j] - W = W.clone() - W.requires_grad = True - - B_shape = (OUTPUT_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=device) - torch.distributed.broadcast(B_master, src=0) - B = torch.chunk(B_master, DEPTH, dim=-1)[j] - B = torch.chunk(B, DEPTH, dim=-1)[i] - B = B.clone() - B.requires_grad = True - - layer.weight.data.copy_(W) - layer.bias.data.copy_(B) - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - W_master = W_master.clone() - W_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, W_master) + B_master - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - - check_equal(out, C) - print_rank_0('linear forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[j] - check_equal(A_grad, A.grad) - - W_grad = W_master.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=0)[i] - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[j] - check_equal(W_grad, layer.weight.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[i] - # if i == 0: - check_equal(B_grad, layer.bias.grad) - - print_rank_0('linear backward: pass') - - -def check_layernorm(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - EPS = 1e-12 - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - layernorm = LayerNorm2D(INPUT_SIZE) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - 
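# --- Editor's annotation (not part of the original patch) ---
# Throughout these 2D checks, each rank owns one tile of every tensor: row
# index i comes from ParallelMode.PARALLEL_2D_COL, column index j from
# PARALLEL_2D_ROW, and tiles are carved with two chained torch.chunk calls.
# A self-contained sketch of the tiling, assuming DEPTH = 2 (the 2x2 mesh
# declared in checks_2d/common.py):
#
#     full = torch.randn(8, 8)                   # replicated master tensor
#     tile = torch.chunk(full, 2, dim=0)[i]      # i-th row block
#     tile = torch.chunk(tile, 2, dim=-1)[j]     # j-th column block
#     # tile.shape == (4, 4); the four (i, j) ranks cover `full` exactly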
A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - out = layernorm(A) - - A_master = A_master.clone() - A_master.requires_grad = True - E_master = torch.sum(A_master, dim=-1, keepdim=True) - E_master /= INPUT_SIZE - V_master = torch.sum(A_master * A_master, dim=-1, keepdim=True) - V_master /= INPUT_SIZE - V_master = V_master - E_master * E_master - V_master = 1.0 / torch.sqrt(V_master + EPS) - C_master = (A_master - E_master) * V_master - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - - check_equal(out, C) - print_rank_0('layer norm forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - out.backward(grad) - - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[j] - check_equal(A_grad, A.grad) - print_rank_0('layer norm backward: pass') - - -def check_embed(): - device = get_current_device() - dtype = torch.float32 - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - embed = Embedding2D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[j] - weight = torch.chunk(weight, DEPTH, dim=-1)[i] - embed.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = embed(A) - - A_master = A_master.clone() - C_master = embed_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - check_equal(out, C) - print_rank_0('embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[i] - check_equal(B_grad, embed.weight.grad) - print_rank_0('embed backward: pass') - - -def check_patch_embed(): - device = get_current_device() - dtype = torch.float32 - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - layer = PatchEmbedding2D(IMG_SIZE, 4, 3, HIDDEN_SIZE, dtype=dtype) - torch.nn.init.ones_(layer.cls_token) - torch.nn.init.ones_(layer.pos_embed) - layer = layer.to(device) - - layer_master = VanillaPatchEmbedding(IMG_SIZE, 4, 3, HIDDEN_SIZE, dtype=dtype) - torch.nn.init.ones_(layer_master.cls_token) - torch.nn.init.ones_(layer_master.pos_embed) - layer_master = layer_master.to(device) - - proj_weight_master = layer_master.weight.data - 
torch.distributed.broadcast(proj_weight_master, src=0) - proj_weight = torch.chunk(proj_weight_master, DEPTH, dim=0)[j] - proj_weight = torch.chunk(proj_weight, DEPTH, dim=0)[i] - layer.weight.data.copy_(proj_weight) - proj_bias_master = layer_master.bias.data - torch.distributed.broadcast(proj_bias_master, src=0) - proj_bias = torch.chunk(proj_bias_master, DEPTH, dim=0)[j] - proj_bias = torch.chunk(proj_bias, DEPTH, dim=0)[i] - layer.bias.data.copy_(proj_bias) - - A_shape = (BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(A) - - A_master = A_master.clone() - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - check_equal(out, C) - print_rank_0('patch embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - cls_grad_master = layer_master.cls_token.grad - cls_grad = torch.chunk(cls_grad_master, DEPTH, dim=-1)[j] - cls_grad = torch.chunk(cls_grad, DEPTH, dim=-1)[i] - check_equal(cls_grad, layer.cls_token.grad) - - pos_grad_master = layer_master.pos_embed.grad - pos_grad = torch.chunk(pos_grad_master, DEPTH, dim=-1)[j] - pos_grad = torch.chunk(pos_grad, DEPTH, dim=-1)[i] - check_equal(pos_grad, layer.pos_embed.grad) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - check_equal(B_grad, layer.weight.grad) - - bias_grad = layer_master.bias.grad - bias_grad = torch.chunk(bias_grad, DEPTH)[j] - bias_grad = torch.chunk(bias_grad, DEPTH)[i] - check_equal(bias_grad, layer.bias.grad) - print_rank_0('patch embed backward: pass') - - -def check_vocab_parallel_embed(): - device = get_current_device() - dtype = torch.float32 - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - embed = VocabParallelEmbedding2D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[j] - weight = torch.chunk(weight, DEPTH, dim=0)[i] - embed.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = embed(A) - - A_master = A_master.clone() - C_master = embed_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - check_equal(out, C) - print_rank_0('vocab parallel embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, 
DEPTH, dim=-1)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - check_equal(B_grad, embed.weight.grad) - print_rank_0('vocab parallel embed backward: pass') - - -def check_classifier_no_given_weight(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = NUM_CLASSES - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - layer = Classifier2D(INPUT_SIZE, OUTPUT_SIZE) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randint(5, A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - W_shape = (OUTPUT_SIZE, INPUT_SIZE) - W_master = torch.randint(5, W_shape, dtype=dtype, device=device) - torch.distributed.broadcast(W_master, src=0) - W = torch.chunk(W_master, DEPTH, dim=-1)[j] - W = torch.chunk(W, DEPTH, dim=-1)[i] - W = W.clone() - layer.weight.data.copy_(W) - # W.requires_grad = True - - B_shape = (OUTPUT_SIZE, ) - B_master = torch.randint(5, B_shape, dtype=dtype, device=device) - torch.distributed.broadcast(B_master, src=0) - # B = torch.chunk(B_master, DEPTH, dim=0)[j] - B = B_master.clone() - layer.bias.data.copy_(B) - - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - W_master = W_master.clone() - W_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, W_master.transpose(0, 1)) + B_master - C = torch.chunk(C_master, DEPTH, dim=0)[i] - # C = torch.chunk(C, DEPTH, dim=-1)[j] - - check_equal(out, C) - print_rank_0('classifier (no given weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - # grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[j] - check_equal(A_grad, A.grad) - - W_grad = W_master.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[j] - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[i] - check_equal(W_grad, layer.weight.grad) - - B_grad = B_master.grad - # B_grad = torch.chunk(B_grad, DEPTH, dim=0)[j] - # if i == 0: - check_equal(B_grad, layer.bias.grad) - - print_rank_0('classifier (no given weight) backward: pass') - - -def check_vocab_parallel_classifier_no_given_weight(): - device = get_current_device() - dtype = torch.float32 - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - layer = VocabParallelClassifier2D(HIDDEN_SIZE, VOCAB_SIZE, bias=True) - layer = layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, bias=True) - layer_master = layer_master.to(dtype).to(device) - - weight_master = layer_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=0)[i] - weight = torch.chunk(weight, DEPTH, dim=-1)[j] - layer.weight.data.copy_(weight) - bias_master = layer_master.bias.data - torch.distributed.broadcast(bias_master, src=0) - bias = torch.chunk(bias_master, DEPTH)[j] - bias = torch.chunk(bias, DEPTH)[i] - 
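# --- Editor's annotation (not part of the original patch) ---
# The 1-D bias has only one axis to split, so both mesh axes chunk the same
# dimension -- by j, then by i -- leaving each of the DEPTH**2 ranks a
# VOCAB_SIZE / DEPTH**2 shard; the backward check below applies the same
# double chunk to the master bias gradient before comparing.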
layer.bias.data.copy_(bias) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - A = A.clone() - A.requires_grad = True - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - check_equal(out, C) - print_rank_0('vocab parallel classifier (no given weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[j] - check_equal(A_grad, A.grad) - - W_grad = layer_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=0)[i] - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[j] - check_equal(W_grad, layer.weight.grad) - - B_grad = layer_master.bias.grad - B_grad = torch.chunk(B_grad, DEPTH)[j] - B_grad = torch.chunk(B_grad, DEPTH)[i] - check_equal(B_grad, layer.bias.grad) - print_rank_0('vocab parallel classifier (no given weight) backward: pass') - - -def check_classifier_given_embed_weight(): - device = get_current_device() - dtype = torch.float32 - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - embed = Embedding2D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[j] - weight = torch.chunk(weight, DEPTH, dim=-1)[i] - embed.weight.data.copy_(weight) - - layer = Classifier2D(HIDDEN_SIZE, VOCAB_SIZE, weight=embed.weight, bias=False) - layer = layer.to(dtype).to(device) - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(embed(A)) - - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - check_equal(out, C) - print_rank_0('classifier (given embed weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - W_grad = embed_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[j] - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[i] - check_equal(W_grad, embed.weight.grad) - print_rank_0('classifier (given embed weight) backward: pass') - - -def check_vocab_parallel_classifier_given_embed_weight(): - device = get_current_device() - dtype = torch.float32 - - j = 
gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - embed = VocabParallelEmbedding2D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[j] - weight = torch.chunk(weight, DEPTH, dim=0)[i] - embed.weight.data.copy_(weight) - - layer = VocabParallelClassifier2D(HIDDEN_SIZE, VOCAB_SIZE, weight=embed.weight, bias=False) - layer = layer.to(dtype).to(device) - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(embed(A)) - - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - check_equal(out, C) - print_rank_0('vocab parallel classifier (given embed weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - W_grad = embed_master.weight.grad - W_grad = torch.chunk(W_grad, DEPTH, dim=-1)[j] - W_grad = torch.chunk(W_grad, DEPTH, dim=0)[i] - check_equal(W_grad, embed.weight.grad) - print_rank_0('vocab parallel classifier (given embed weight) backward: pass') - - -def check_loss(): - device = get_current_device() - dtype = torch.float32 - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - criterion = CrossEntropyLoss2D() - criterion_master = torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, ), dtype=torch.long, device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, DEPTH, dim=0)[i] - out = out.clone() - out.requires_grad = True - loss = criterion(out, target_master) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - check_equal(loss, loss_master) - print_rank_0('cross entropy loss forward: pass') - - loss.backward() - loss_master.backward() - - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, DEPTH, dim=0)[i] - check_equal(out_grad, out.grad) - print_rank_0('cross entropy loss backward: pass') - - -def check_vocab_parallel_loss(): - device = get_current_device() - dtype = torch.float32 - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - criterion = VocabParallelCrossEntropyLoss2D() - criterion_master = torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, ), dtype=torch.long, 
device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, DEPTH, dim=0)[i] - out = torch.chunk(out, DEPTH, dim=-1)[j] - out = out.clone() - out.requires_grad = True - loss = criterion(out, target_master) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - check_equal(loss, loss_master) - print_rank_0('vocab parallel cross entropy loss forward: pass') - - loss.backward() - loss_master.backward() - - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, DEPTH, dim=0)[i] - out_grad = torch.chunk(out_grad, DEPTH, dim=-1)[j] - check_equal(out_grad, out.grad) - print_rank_0('vocab parallel cross entropy loss backward: pass') - - -# def check_attention(): -# device = get_current_device() -# dtype = torch.float32 -# INPUT_SIZE = HIDDEN_SIZE -# NUM_ATTENTION_HEADS = 2 - -# j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) -# i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - -# layer = TransformerSelfAttention2D( -# HIDDEN_SIZE, -# NUM_ATTENTION_HEADS, -# attention_dropout_prob=0.5, -# hidden_dropout_prob=0.5, -# ) - -# A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) -# A_master = torch.randn(A_shape, dtype=dtype, device=device) -# torch.distributed.broadcast(A_master, src=0) -# A = torch.chunk(A_master, DEPTH, dim=0)[i] -# A = torch.chunk(A, DEPTH, dim=-1)[j] -# A = A.clone() -# A.requires_grad = True - -# mask_shape = (BATCH_SIZE // DEPTH, NUM_ATTENTION_HEADS // DEPTH, SEQ_LENGTH, SEQ_LENGTH) -# attention_mask = torch.zeros(mask_shape, dtype=dtype, device=device) - -# out = layer(A, attention_mask) -# assert out.shape == (BATCH_SIZE // DEPTH, SEQ_LENGTH, INPUT_SIZE // DEPTH) -# print_rank_0('self attention forward: pass') - -# grad_shape = out.shape -# grad = torch.randn(grad_shape, dtype=dtype, device=device) - -# out.backward(grad) -# assert A.grad.shape == A.shape -# print_rank_0('self attention backward: pass') - -# def check_mlp(): -# device = get_current_device() -# dtype = torch.float32 -# INPUT_SIZE = HIDDEN_SIZE - -# j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) -# i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - -# layer = TransformerMLP2D( -# HIDDEN_SIZE, -# dropout_prob=0.5, -# act_func='gelu', -# ) - -# A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) -# A_master = torch.randn(A_shape, dtype=dtype, device=device) -# torch.distributed.broadcast(A_master, src=0) -# A = torch.chunk(A_master, DEPTH, dim=0)[i] -# A = torch.chunk(A, DEPTH, dim=-1)[j] -# A = A.clone() -# A.requires_grad = True - -# out = layer(A) -# assert out.shape == (BATCH_SIZE // DEPTH, SEQ_LENGTH, INPUT_SIZE // DEPTH) -# print_rank_0('mlp forward: pass') - -# grad_shape = out.shape -# grad = torch.randn(grad_shape, dtype=dtype, device=device) - -# out.backward(grad) -# assert A.grad.shape == A.shape -# print_rank_0('mlp backward: pass') - -# def check_transformerlayer(): -# device = get_current_device() -# dtype = torch.float32 -# INPUT_SIZE = HIDDEN_SIZE -# NUM_ATTENTION_HEADS = 2 - -# j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) -# i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - -# layer = TransformerLayer2D(HIDDEN_SIZE, -# NUM_ATTENTION_HEADS, -# act_func='gelu', -# attention_dropout_prob=0.5, -# hidden_dropout_prob=0.5) - -# A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) -# A_master = torch.randn(A_shape, dtype=dtype, device=device) -# torch.distributed.broadcast(A_master, src=0) -# A = torch.chunk(A_master, 
DEPTH, dim=0)[i] -# A = torch.chunk(A, DEPTH, dim=-1)[j] -# A = A.clone() -# A.requires_grad = True - -# mask_shape = (BATCH_SIZE // DEPTH, NUM_ATTENTION_HEADS // DEPTH, SEQ_LENGTH, SEQ_LENGTH) -# attention_mask = torch.zeros(mask_shape, dtype=dtype, device=device) - -# out = layer(A, attention_mask) -# assert out.shape == (BATCH_SIZE // DEPTH, SEQ_LENGTH, INPUT_SIZE // DEPTH) -# print_rank_0('transformerlayer forward: pass') - -# grad_shape = out.shape -# grad = torch.randn(grad_shape, dtype=dtype, device=device) - -# out.backward(grad) -# assert A.grad.shape == A.shape -# print_rank_0('transformerlayer backward: pass') diff --git a/tests/test_layers/test_2d/checks_2d/check_operation_2d.py b/tests/test_layers/test_2d/checks_2d/check_operation_2d.py deleted file mode 100644 index 83442df70720a89aa912984898e869beaa8970d9..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2d/checks_2d/check_operation_2d.py +++ /dev/null @@ -1,240 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_2d._operation import Matmul_AB_2D, Matmul_ABT_2D, Matmul_ATB_2D -from colossalai.utils import get_current_device -from colossalai.utils import print_rank_0 -from .common import check_equal, BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE, DEPTH - - -def check_AB(): - data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA) - pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) - tensor_parallel_size = gpc.get_world_size(ParallelMode.TENSOR) - - dtype = torch.float - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - B_shape = (HIDDEN_SIZE, 4 * HIDDEN_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(B_master, src=0) - B = torch.chunk(B_master, DEPTH, dim=0)[i] - B = torch.chunk(B, DEPTH, dim=-1)[j] - B = B.clone() - B.requires_grad = True - - out_shape = (BATCH_SIZE // DEPTH, SEQ_LENGTH, 4 * HIDDEN_SIZE // DEPTH) - - out = Matmul_AB_2D.apply( - A, B, - DEPTH, - out_shape, - i, j, - ParallelMode.PARALLEL_2D_ROW, - ParallelMode.PARALLEL_2D_COL, - data_parallel_rank, - pipeline_parallel_rank, - pipeline_parallel_size, - tensor_parallel_size - ) - - C_shape = (BATCH_SIZE, SEQ_LENGTH, 4 * HIDDEN_SIZE) - A_master = A_master.clone() - A_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, B_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - # check forward correctness - check_equal(out, C) - print_rank_0('AB forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, 
dim=-1)[j] - - out.backward(grad) - - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[j] - # check backward correctness - check_equal(A_grad, A.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[j] - # check backward correctness - check_equal(B_grad, B.grad) - print_rank_0('AB backward: pass') - - -def check_ABT(): - data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA) - pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) - tensor_parallel_size = gpc.get_world_size(ParallelMode.TENSOR) - - dtype = torch.float - device = get_current_device() - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - C_shape = (BATCH_SIZE, SEQ_LENGTH, 4 * HIDDEN_SIZE) - C_master = torch.randn(C_shape, dtype=dtype, device=device) - torch.distributed.broadcast(C_master, src=0) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - C = C.clone() - C.requires_grad = True - - B_shape = (HIDDEN_SIZE, 4 * HIDDEN_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=device) - torch.distributed.broadcast(B_master, src=0) - B = torch.chunk(B_master, DEPTH, dim=0)[i] - B = torch.chunk(B, DEPTH, dim=-1)[j] - B = B.clone() - B.requires_grad = True - - out = Matmul_ABT_2D.apply( - C, B, - DEPTH, (BATCH_SIZE // DEPTH, SEQ_LENGTH, HIDDEN_SIZE // DEPTH), - i, j, - ParallelMode.PARALLEL_2D_ROW, - ParallelMode.PARALLEL_2D_COL, - data_parallel_rank, - pipeline_parallel_rank, - pipeline_parallel_size, - tensor_parallel_size - ) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - C_master = C_master.clone() - C_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - A_master = torch.matmul(C_master, B_master.transpose(0, 1)) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - check_equal(out, A) - print_rank_0('ABT forward: pass') - - grad_shape = A_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - - # backward - out.backward(grad) - - A_master.backward(grad_master) - C_grad = C_master.grad - C_grad = torch.chunk(C_grad, DEPTH, dim=0)[i] - C_grad = torch.chunk(C_grad, DEPTH, dim=-1)[j] - check_equal(C_grad, C.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[i] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[j] - check_equal(B_grad, B.grad) - print_rank_0('ABT backward: pass') - - -def check_ATB(): - data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA) - pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) - tensor_parallel_size = gpc.get_world_size(ParallelMode.TENSOR) - - device = get_current_device() - dtype = torch.float - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW) - 
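# --- Editor's annotation (not part of the original patch) ---
# check_ATB() validates the transposed matmul used for weight gradients: the
# distributed Matmul_ATB_2D result is compared against a plain A^T @ C
# contraction over the flattened (batch * seq) token dimension. A reference
# form, assuming A is (B, S, H) and C is (B, S, 4H) as below:
#
#     ref = torch.matmul(A.reshape(-1, A.shape[-1]).transpose(0, 1),
#                        C.reshape(-1, C.shape[-1]))   # shape (H, 4H)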
i = gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - C_shape = (BATCH_SIZE, SEQ_LENGTH, 4 * HIDDEN_SIZE) - C_master = torch.randn(C_shape, dtype=dtype, device=device) - torch.distributed.broadcast(C_master, src=0) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - C = C.clone() - C.requires_grad = True - - out = Matmul_ATB_2D.apply( - A, C, - DEPTH, (HIDDEN_SIZE // DEPTH, 4 * HIDDEN_SIZE // DEPTH), - i, j, - ParallelMode.PARALLEL_2D_ROW, - ParallelMode.PARALLEL_2D_COL, - data_parallel_rank, - pipeline_parallel_rank, - pipeline_parallel_size, - tensor_parallel_size - ) - - B_shape = (HIDDEN_SIZE, 4 * HIDDEN_SIZE) - A_master = A_master.clone() - A_master.requires_grad = True - C_master = C_master.clone() - C_master.requires_grad = True - B_master = torch.matmul( - A_master.view(-1, A_master.shape[-1]).transpose(0, 1), - C_master.view(-1, C_master.shape[-1])) - B = torch.chunk(B_master, DEPTH, dim=0)[i] - B = torch.chunk(B, DEPTH, dim=-1)[j] - check_equal(out, B) - print_rank_0('ATB forward: pass') - - grad_shape = B_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - - out.backward(grad) - - B_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[j] - check_equal(A_grad, A.grad) - - C_grad = C_master.grad - C_grad = torch.chunk(C_grad, DEPTH, dim=0)[i] - C_grad = torch.chunk(C_grad, DEPTH, dim=-1)[j] - check_equal(C_grad, C.grad) - print_rank_0('ATB backward: pass') diff --git a/tests/test_layers/test_2d/checks_2d/common.py b/tests/test_layers/test_2d/checks_2d/common.py deleted file mode 100644 index 8c855c18bc26c7e06507f2b180d87cb0cae8b67f..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2d/checks_2d/common.py +++ /dev/null @@ -1,16 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - -DEPTH = 2 -BATCH_SIZE = 8 -SEQ_LENGTH = 8 -HIDDEN_SIZE = 8 -NUM_CLASSES = 8 -VOCAB_SIZE = 16 -IMG_SIZE = 16 - - -def check_equal(A, B): - assert torch.allclose(A, B, rtol=1e-3, atol=1e-2) diff --git a/tests/test_layers/test_2d/test_2d.py b/tests/test_layers/test_2d/test_2d.py deleted file mode 100644 index 5401510108b499ae2d76de92add32a34ce5f0e01..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2d/test_2d.py +++ /dev/null @@ -1,65 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from functools import partial - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.logging import disable_existing_loggers -from colossalai.utils import free_port - -from checks_2d.check_layer_2d import (check_classifier_given_embed_weight, check_classifier_no_given_weight, - check_embed, check_layernorm, check_linear, check_loss, check_patch_embed, - check_vocab_parallel_classifier_given_embed_weight, - check_vocab_parallel_classifier_no_given_weight, check_vocab_parallel_embed, - check_vocab_parallel_loss) -from checks_2d.check_operation_2d import 
check_AB, check_ABT, check_ATB - -CONFIG = dict(parallel=dict(pipeline=dict(size=1), tensor=dict(size=4, mode='2d')), ) - - -def check_operations(): - check_AB() - check_ABT() - check_ATB() - - -def check_layer(): - check_linear() - check_layernorm() - check_embed() - check_patch_embed() - check_vocab_parallel_embed() - check_classifier_no_given_weight() - check_vocab_parallel_classifier_no_given_weight() - check_classifier_given_embed_weight() - check_vocab_parallel_classifier_given_embed_weight() - check_loss() - check_vocab_parallel_loss() - - -def check_layer_and_operation(rank, world_size, port): - disable_existing_loggers() - launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - torch.backends.cuda.matmul.allow_tf32 = False - torch.backends.cudnn.allow_tf32 = False - torch.backends.cudnn.deterministic = True - # check_operations() - check_layer() - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_2d(): - world_size = 4 - run_func = partial(check_layer_and_operation, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_2d() diff --git a/tests/test_layers/test_2p5d/__pycache__/test_2p5d.cpython-37-pytest-7.1.3.pyc b/tests/test_layers/test_2p5d/__pycache__/test_2p5d.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index fa2ace8a2e36dd59fc9f10711d1a4a8b6c261f66..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2p5d/__pycache__/test_2p5d.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_layers/test_2p5d/__pycache__/test_2p5d.cpython-37.pyc b/tests/test_layers/test_2p5d/__pycache__/test_2p5d.cpython-37.pyc deleted file mode 100644 index 6cc9aa2664473ac49ad3d5d44f2c5dbe2e53d031..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2p5d/__pycache__/test_2p5d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2p5d/checks_2p5d/__init__.py b/tests/test_layers/test_2p5d/checks_2p5d/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/__init__.cpython-37.pyc b/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index a08efd38ddd3e40fb7d92c8e717d1c6fbd2893eb..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/check_layer_2p5d.cpython-37.pyc b/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/check_layer_2p5d.cpython-37.pyc deleted file mode 100644 index 808164ee2f7fe5d451f0b4bcd90a50cc28646297..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/check_layer_2p5d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/check_operation_2p5d.cpython-37.pyc b/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/check_operation_2p5d.cpython-37.pyc deleted file mode 100644 index 60a54ca3462c6bc6756f1168fa1391b6c9fea9b9..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/check_operation_2p5d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/common.cpython-37.pyc 
b/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/common.cpython-37.pyc deleted file mode 100644 index 8eae0ed7861bfb9ba21bd07c5015970f449674e1..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_2p5d/checks_2p5d/__pycache__/common.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_2p5d/checks_2p5d/check_layer_2p5d.py b/tests/test_layers/test_2p5d/checks_2p5d/check_layer_2p5d.py deleted file mode 100644 index a8f551093b1ef782f8bef64d4241c1400aa6bdde..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2p5d/checks_2p5d/check_layer_2p5d.py +++ /dev/null @@ -1,754 +0,0 @@ -import torch -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn import (Classifier2p5D, CrossEntropyLoss2p5D, Embedding2p5D, LayerNorm2p5D, Linear2p5D, - PatchEmbedding2p5D, VanillaClassifier, VanillaPatchEmbedding, VocabParallelClassifier2p5D, - VocabParallelCrossEntropyLoss2p5D, VocabParallelEmbedding2p5D) -from colossalai.utils import get_current_device, print_rank_0 -from torch.nn import Parameter - -from .common import * - - -def check_linear(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = 2 * HIDDEN_SIZE - - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - layer = Linear2p5D(INPUT_SIZE, OUTPUT_SIZE, dtype=dtype, skip_bias_add=False) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - W_shape = (INPUT_SIZE, OUTPUT_SIZE) - W_master = torch.randn(W_shape, dtype=dtype, device=device) - torch.distributed.broadcast(W_master, src=0) - W = torch.chunk(W_master, TESSERACT_DIM, dim=0)[i] - W = torch.chunk(W, TESSERACT_DIM, dim=-1)[j] - W = W.clone() - W.requires_grad = True - - B_shape = (OUTPUT_SIZE, ) - B_master = torch.randn(B_shape, dtype=dtype, device=device) - torch.distributed.broadcast(B_master, src=0) - B = torch.chunk(B_master, TESSERACT_DIM, dim=0)[j] - B = B.clone() - B.requires_grad = True - - layer.weight = Parameter(W) - layer.bias = Parameter(B) - out = layer(A) - bias = layer.bias - - A_master = A_master.clone() - A_master.requires_grad = True - W_master = W_master.clone() - W_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, W_master) + B_master - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - - check_equal(out, C) - print_rank_0('linear forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=0)[i] - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(A_grad, A.grad) - - W_grad = W_master.grad - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=0)[i] - W_grad = torch.chunk(W_grad, 
TESSERACT_DIM, dim=-1)[j] - check_equal(W_grad, layer.weight.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[j] - if i == 0: - check_equal(B_grad, layer.bias.grad) - - print_rank_0('linear backward: pass') - - -def check_layernorm(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - EPS = 1e-12 - - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - layernorm = LayerNorm2p5D(INPUT_SIZE, dtype=dtype) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - out = layernorm(A) - - A_master = A_master.clone() - A_master.requires_grad = True - E_master = torch.sum(A_master, dim=-1, keepdim=True) - E_master /= INPUT_SIZE - V_master = torch.sum(A_master * A_master, dim=-1, keepdim=True) - V_master /= INPUT_SIZE - V_master = V_master - E_master * E_master - V_master = 1.0 / torch.sqrt(V_master + EPS) - C_master = (A_master - E_master) * V_master - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - - check_equal(out, C) - print_rank_0('layer norm forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - out.backward(grad) - - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=0)[i] - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(A_grad, A.grad) - print_rank_0('layer norm backward: pass') - - -def check_embed(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - embed = Embedding2p5D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, TESSERACT_DIM, dim=-1)[j] - weight = torch.chunk(weight, TESSERACT_DIM, dim=-1)[i] - embed.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = embed(A) - - A_master = A_master.clone() - C_master = embed_master(A_master) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - check_equal(out, C) - print_rank_0('embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = 
torch.chunk(B_grad, TESSERACT_DIM, dim=-1)[j] - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=-1)[i] - check_equal(B_grad, embed.weight.grad) - print_rank_0('embed backward: pass') - - -def check_patch_embed(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - layer = PatchEmbedding2p5D(IMG_SIZE, 4, 3, HIDDEN_SIZE, dtype=dtype) - torch.nn.init.ones_(layer.cls_token) - torch.nn.init.ones_(layer.pos_embed) - layer = layer.to(device) - - layer_master = VanillaPatchEmbedding(IMG_SIZE, 4, 3, HIDDEN_SIZE, dtype=dtype) - torch.nn.init.ones_(layer_master.cls_token) - torch.nn.init.ones_(layer_master.pos_embed) - layer_master = layer_master.to(device) - - proj_weight_master = layer_master.weight.data - torch.distributed.broadcast(proj_weight_master, src=0) - proj_weight = torch.chunk(proj_weight_master, TESSERACT_DIM, dim=0)[j] - proj_weight = torch.chunk(proj_weight, TESSERACT_DIM, dim=0)[i] - layer.weight.data.copy_(proj_weight) - proj_bias_master = layer_master.bias.data - torch.distributed.broadcast(proj_bias_master, src=0) - proj_bias = torch.chunk(proj_bias_master, TESSERACT_DIM, dim=0)[j] - proj_bias = torch.chunk(proj_bias, TESSERACT_DIM, dim=0)[i] - layer.bias.data.copy_(proj_bias) - - A_shape = (BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(A) - - A_master = A_master.clone() - C_master = layer_master(A_master) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - check_equal(out, C) - print_rank_0('patch embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - cls_grad_master = layer_master.cls_token.grad - cls_grad = torch.chunk(cls_grad_master, TESSERACT_DIM, dim=-1)[j] - cls_grad = torch.chunk(cls_grad, TESSERACT_DIM, dim=-1)[i] - check_equal(cls_grad, layer.cls_token.grad) - - pos_grad_master = layer_master.pos_embed.grad - pos_grad = torch.chunk(pos_grad_master, TESSERACT_DIM, dim=-1)[j] - pos_grad = torch.chunk(pos_grad, TESSERACT_DIM, dim=-1)[i] - check_equal(pos_grad, layer.pos_embed.grad) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[j] - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[i] - check_equal(B_grad, layer.weight.grad) - - bias_grad = layer_master.bias.grad - bias_grad = torch.chunk(bias_grad, TESSERACT_DIM)[j] - bias_grad = torch.chunk(bias_grad, TESSERACT_DIM)[i] - check_equal(bias_grad, layer.bias.grad) - print_rank_0('patch embed backward: pass') - - -def check_vocab_parallel_embed(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - embed = VocabParallelEmbedding2p5D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - 
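For reference, the two `torch.chunk` calls that follow implement the standard 2.5D sharding of the embedding table: the vocabulary dimension is split across the column axis (local rank `i`) and the hidden dimension across the row axis (local rank `j`). A minimal plain-PyTorch sketch of that partition and its reassembly (illustrative names and sizes only, not Colossal-AI code):

```python
# Sketch: shard a (VOCAB_SIZE, HIDDEN_SIZE) embedding table over one
# TESSERACT_DIM x TESSERACT_DIM plane, exactly as the test below does.
import torch

TESSERACT_DIM, VOCAB_SIZE, HIDDEN_SIZE = 2, 16, 8
master = torch.randn(VOCAB_SIZE, HIDDEN_SIZE)

shards = {}
for i in range(TESSERACT_DIM):        # column rank: splits the vocab dim
    for j in range(TESSERACT_DIM):    # row rank: splits the hidden dim
        shard = torch.chunk(master, TESSERACT_DIM, dim=-1)[j]
        shard = torch.chunk(shard, TESSERACT_DIM, dim=0)[i]
        shards[i, j] = shard          # each rank holds an (8, 4) tile

# Concatenating the tiles back together recovers the master table exactly,
# which is why comparing sharded outputs against chunked master outputs works.
rows = [torch.cat([shards[i, j] for j in range(TESSERACT_DIM)], dim=-1)
        for i in range(TESSERACT_DIM)]
assert torch.equal(torch.cat(rows, dim=0), master)
```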
weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, TESSERACT_DIM, dim=-1)[j] - weight = torch.chunk(weight, TESSERACT_DIM, dim=0)[i] - embed.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = embed(A) - - A_master = A_master.clone() - C_master = embed_master(A_master) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - check_equal(out, C) - print_rank_0('vocab parallel embed forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=-1)[j] - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[i] - check_equal(B_grad, embed.weight.grad) - print_rank_0('vocab parallel embed backward: pass') - - -def check_classifier_no_given_weight(): - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = NUM_CLASSES - - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - - layer = Classifier2p5D(INPUT_SIZE, OUTPUT_SIZE) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randint(5, A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - W_shape = (OUTPUT_SIZE, INPUT_SIZE) - W_master = torch.randint(5, W_shape, dtype=dtype, device=device) - torch.distributed.broadcast(W_master, src=0) - # W = torch.chunk(W_master, TESSERACT_DIM, dim=-1)[j] - W = torch.chunk(W_master, TESSERACT_DIM, dim=-1)[j] - W = torch.chunk(W, TESSERACT_DIM, dim=-1)[i] - W = W.clone() - layer.weight.data.copy_(W) - # W.requires_grad = True - - B_shape = (OUTPUT_SIZE, ) - B_master = torch.randint(5, B_shape, dtype=dtype, device=device) - torch.distributed.broadcast(B_master, src=0) - # B = torch.chunk(B_master, TESSERACT_DIM, dim=0)[j] - B = B_master.clone() - layer.bias.data.copy_(B) - - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - W_master = W_master.clone() - W_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, W_master.transpose(0, 1)) + B_master - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - # C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - - check_equal(out, C) - print_rank_0('classifier (no given weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - # grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=0)[i] - A_grad = torch.chunk(A_grad, TESSERACT_DIM, 
dim=-1)[j] - check_equal(A_grad, A.grad) - - W_grad = W_master.grad - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=-1)[j] - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=-1)[i] - check_equal(W_grad, layer.weight.grad) - - B_grad = B_master.grad - # B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[j] - # if i == 0: - check_equal(B_grad, layer.bias.grad) - - print_rank_0('classifier (no given weight) backward: pass') - - -def check_vocab_parallel_classifier_no_given_weight(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - layer = VocabParallelClassifier2p5D(HIDDEN_SIZE, VOCAB_SIZE, bias=True) - layer = layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, bias=True) - layer_master = layer_master.to(dtype).to(device) - - weight_master = layer_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, TESSERACT_DIM, dim=0)[i] - weight = torch.chunk(weight, TESSERACT_DIM, dim=-1)[j] - layer.weight.data.copy_(weight) - bias_master = layer_master.bias.data - torch.distributed.broadcast(bias_master, src=0) - bias = torch.chunk(bias_master, TESSERACT_DIM)[j] - layer.bias.data.copy_(bias) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - A = A.clone() - A.requires_grad = True - out = layer(A) - - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - check_equal(out, C) - print_rank_0('vocab parallel classifier (no given weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=0)[i] - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(A_grad, A.grad) - - W_grad = layer_master.weight.grad - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=0)[i] - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(W_grad, layer.weight.grad) - - B_grad = layer_master.bias.grad - B_grad = torch.chunk(B_grad, TESSERACT_DIM)[j] - if i == 0: - check_equal(B_grad, layer.bias.grad) - print_rank_0('vocab parallel classifier (no given weight) backward: pass') - - -def check_classifier_given_embed_weight(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - embed = Embedding2p5D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, 
TESSERACT_DIM, dim=-1)[j] - weight = torch.chunk(weight, TESSERACT_DIM, dim=-1)[i] - embed.weight.data.copy_(weight) - - layer = Classifier2p5D(HIDDEN_SIZE, VOCAB_SIZE, weight=embed.weight, bias=False) - layer = layer.to(dtype).to(device) - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(embed(A)) - - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - check_equal(out, C) - print_rank_0('classifier (given embed weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - W_grad = embed_master.weight.grad - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=-1)[j] - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=-1)[i] - check_equal(W_grad, embed.weight.grad) - print_rank_0('classifier (given embed weight) backward: pass') - - -def check_vocab_parallel_classifier_given_embed_weight(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - embed = VocabParallelEmbedding2p5D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, TESSERACT_DIM, dim=-1)[j] - weight = torch.chunk(weight, TESSERACT_DIM, dim=0)[i] - embed.weight.data.copy_(weight) - - layer = VocabParallelClassifier2p5D(HIDDEN_SIZE, VOCAB_SIZE, weight=embed.weight, bias=False) - layer = layer.to(dtype).to(device) - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - out = layer(embed(A)) - - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - check_equal(out, C) - print_rank_0('vocab parallel classifier (given embed weight) forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - grad = grad.clone() - out.backward(grad) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - W_grad = embed_master.weight.grad - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=-1)[j] - W_grad = torch.chunk(W_grad, TESSERACT_DIM, dim=0)[i] - check_equal(W_grad, embed.weight.grad) - print_rank_0('vocab parallel classifier (given embed weight) backward: pass') 
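The two loss checks that follow rely on a simple identity: when a batch is split into equal-sized shards, the mean of the per-shard cross-entropy means equals the cross-entropy of the full batch. A minimal plain-PyTorch sketch of that identity (the distributed `CrossEntropyLoss2p5D` kernel itself is not reproduced here):

```python
# Sketch: batch-sharded cross entropy averages to the full-batch loss
# when the shards have equal size, as in these checks (BATCH_SIZE = 8,
# TESSERACT_DIM = 2).
import torch
import torch.nn.functional as F

BATCH_SIZE, NUM_CLASSES, TESSERACT_DIM = 8, 8, 2
logits = torch.randn(BATCH_SIZE, NUM_CLASSES)
target = torch.randint(NUM_CLASSES, (BATCH_SIZE, ))

full_loss = F.cross_entropy(logits, target)  # default: mean over the batch

shard_losses = [
    F.cross_entropy(torch.chunk(logits, TESSERACT_DIM, dim=0)[i],
                    torch.chunk(target, TESSERACT_DIM, dim=0)[i])
    for i in range(TESSERACT_DIM)
]
assert torch.allclose(torch.stack(shard_losses).mean(), full_loss)
```

The vocab-parallel variant additionally splits the class dimension, so each rank sees only a slice of the logits and the reduction over classes has to be carried out collectively; the batch identity above still holds on top of that.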
- - -def check_loss(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - criterion = CrossEntropyLoss2p5D() - criterion_master = torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, ), dtype=torch.long, device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, TESSERACT_DIM, dim=0)[i] - out = out.clone() - out.requires_grad = True - loss = criterion(out, target_master) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - check_equal(loss, loss_master) - print_rank_0('cross entropy loss forward: pass') - - loss.backward() - loss_master.backward() - - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, TESSERACT_DIM, dim=0)[i] - check_equal(out_grad, out.grad) - print_rank_0('cross entropy loss backward: pass') - - -def check_vocab_parallel_loss(): - device = get_current_device() - dtype = torch.float32 - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - criterion = VocabParallelCrossEntropyLoss2p5D() - criterion_master = torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, ), dtype=torch.long, device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, TESSERACT_DIM, dim=0)[i] - out = torch.chunk(out, TESSERACT_DIM, dim=-1)[j] - out = out.clone() - out.requires_grad = True - loss = criterion(out, target_master) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - check_equal(loss, loss_master) - print_rank_0('vocab parallel cross entropy loss forward: pass') - - loss.backward() - loss_master.backward() - - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, TESSERACT_DIM, dim=0)[i] - out_grad = torch.chunk(out_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(out_grad, out.grad) - print_rank_0('vocab parallel cross entropy loss backward: pass') - - -# def check_attention(): -# device = get_current_device() -# dtype = torch.float32 -# INPUT_SIZE = HIDDEN_SIZE -# NUM_ATTENTION_HEADS = 2 - -# i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) -# j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) -# k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - -# layer = TransformerSelfAttention2p5D( -# HIDDEN_SIZE, NUM_ATTENTION_HEADS, -# attention_dropout_prob=0.5, -# hidden_dropout_prob=0.5, -# dtype=dtype, -# ) - -# A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) -# A_master = torch.randn(A_shape, dtype=dtype, device=device) -# torch.distributed.broadcast(A_master, src=0) -# A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] -# A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] -# A = A.clone() -# A.requires_grad = True - -# mask_shape = (BATCH_SIZE // TESSERACT_DIM, NUM_ATTENTION_HEADS // TESSERACT_DIM, SEQ_LENGTH, SEQ_LENGTH) -# attention_mask = torch.zeros(mask_shape, 
dtype=dtype, device=device) - -# out = layer(A, attention_mask) -# assert out.shape == (BATCH_SIZE // TESSERACT_DIM, SEQ_LENGTH, INPUT_SIZE // TESSERACT_DIM) -# print_rank_0('self attention forward: pass') - -# grad_shape = out.shape -# grad = torch.randn(grad_shape, dtype=dtype, device=device) - -# out.backward(grad) -# assert A.grad.shape == A.shape -# print_rank_0('self attention backward: pass') - -# def check_mlp(): -# device = get_current_device() -# dtype = torch.float32 -# INPUT_SIZE = HIDDEN_SIZE - -# i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) -# j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) -# k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - -# layer = TransformerMLP2p5D( -# HIDDEN_SIZE, -# mlp_ratio=1, -# dropout_prob=0.5, -# act_func='gelu', -# dtype=dtype, -# ) - -# A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) -# A_master = torch.randn(A_shape, dtype=dtype, device=device) -# torch.distributed.broadcast(A_master, src=0) -# A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] -# A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] -# A = A.clone() -# A.requires_grad = True - -# out = layer(A) -# assert out.shape == (BATCH_SIZE // TESSERACT_DIM, SEQ_LENGTH, INPUT_SIZE // TESSERACT_DIM) -# print_rank_0('mlp forward: pass') - -# grad_shape = out.shape -# grad = torch.randn(grad_shape, dtype=dtype, device=device) - -# out.backward(grad) -# assert A.grad.shape == A.shape -# print_rank_0('mlp backward: pass') - -# def check_transformerlayer(): -# device = get_current_device() -# dtype = torch.float32 -# INPUT_SIZE = HIDDEN_SIZE -# NUM_ATTENTION_HEADS = 2 - -# i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) -# j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) -# k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - -# layer = TransformerLayer2p5D( -# HIDDEN_SIZE, -# NUM_ATTENTION_HEADS, -# act_func='gelu', -# attention_dropout_prob=0.5, -# hidden_dropout_prob=0.5, -# dtype=dtype, -# ) - -# A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) -# A_master = torch.randn(A_shape, dtype=dtype, device=device) -# torch.distributed.broadcast(A_master, src=0) -# A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] -# A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] -# A = A.clone() -# A.requires_grad = True - -# mask_shape = (BATCH_SIZE // TESSERACT_DIM, NUM_ATTENTION_HEADS // TESSERACT_DIM, SEQ_LENGTH, SEQ_LENGTH) -# attention_mask = torch.zeros(mask_shape, dtype=dtype, device=device) - -# out = layer(A, attention_mask) -# assert out.shape == (BATCH_SIZE // TESSERACT_DIM, SEQ_LENGTH, INPUT_SIZE // TESSERACT_DIM) -# print_rank_0('transformerlayer forward: pass') - -# grad_shape = out.shape -# grad = torch.randn(grad_shape, dtype=dtype, device=device) - -# out.backward(grad) -# assert A.grad.shape == A.shape -# print_rank_0('transformerlayer backward: pass') diff --git a/tests/test_layers/test_2p5d/checks_2p5d/check_operation_2p5d.py b/tests/test_layers/test_2p5d/checks_2p5d/check_operation_2p5d.py deleted file mode 100644 index f2b7ffe17139bcadaffa4d8be2944abfba1316af..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2p5d/checks_2p5d/check_operation_2p5d.py +++ /dev/null @@ -1,236 +0,0 @@ -import torch - -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn.layer.parallel_2p5d._operation import Matmul_AB_2p5D, Matmul_ABT_2p5D, \ - Matmul_ATB_2p5D -from colossalai.utils import get_current_device -from colossalai.utils import print_rank_0 -from .common import * - - -def check_AB(): - 
data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA) - pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) - tensor_parallel_size = gpc.get_world_size(ParallelMode.TENSOR) - - dtype = torch.float - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - B_shape = (HIDDEN_SIZE, 4 * HIDDEN_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(B_master, src=0) - B = torch.chunk(B_master, TESSERACT_DIM, dim=0)[i] - B = torch.chunk(B, TESSERACT_DIM, dim=-1)[j] - B = B.clone() - B.requires_grad = True - - out_shape = (BATCH_SIZE // TESSERACT_DIM, SEQ_LENGTH, 4 * HIDDEN_SIZE // TESSERACT_DIM) - out = Matmul_AB_2p5D.apply( - A, B, - TESSERACT_DIM, out_shape, - i, j, k, - ParallelMode.PARALLEL_2P5D_ROW, - ParallelMode.PARALLEL_2P5D_COL, - data_parallel_rank, - pipeline_parallel_rank, - pipeline_parallel_size, - tensor_parallel_size) - - C_shape = (BATCH_SIZE, SEQ_LENGTH, 4 * HIDDEN_SIZE) - A_master = A_master.clone() - A_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - C_master = torch.matmul(A_master, B_master) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - # check forward correctness - check_equal(out, C) - print_rank_0('AB forward: pass') - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - - out.backward(grad) - - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=0)[i] - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=-1)[j] - # check backward correctness - check_equal(A_grad, A.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[i] - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=-1)[j] - # check backward correctness - check_equal(B_grad, B.grad) - print_rank_0('AB backward: pass') - - -def check_ABT(): - data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA) - pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) - tensor_parallel_size = gpc.get_world_size(ParallelMode.TENSOR) - - dtype = torch.float - device = get_current_device() - - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - C_shape = (BATCH_SIZE, SEQ_LENGTH, 4 * HIDDEN_SIZE) - C_master = torch.randn(C_shape, 
dtype=dtype, device=device) - torch.distributed.broadcast(C_master, src=0) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - C = C.clone() - C.requires_grad = True - - B_shape = (HIDDEN_SIZE, 4 * HIDDEN_SIZE) - B_master = torch.randn(B_shape, dtype=dtype, device=device) - torch.distributed.broadcast(B_master, src=0) - B = torch.chunk(B_master, TESSERACT_DIM, dim=0)[i] - B = torch.chunk(B, TESSERACT_DIM, dim=-1)[j] - B = B.clone() - B.requires_grad = True - - out = Matmul_ABT_2p5D.apply( - C, B, - TESSERACT_DIM, (BATCH_SIZE // TESSERACT_DIM, SEQ_LENGTH, HIDDEN_SIZE // TESSERACT_DIM), - i, j, k, - ParallelMode.PARALLEL_2P5D_ROW, - ParallelMode.PARALLEL_2P5D_COL, - data_parallel_rank, - pipeline_parallel_rank, - pipeline_parallel_size, - tensor_parallel_size) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - C_master = C_master.clone() - C_master.requires_grad = True - B_master = B_master.clone() - B_master.requires_grad = True - A_master = torch.matmul(C_master, B_master.transpose(0, 1)) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - check_equal(out, A) - print_rank_0('ABT forward: pass') - - grad_shape = A_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - - # backward - out.backward(grad) - - A_master.backward(grad_master) - C_grad = C_master.grad - C_grad = torch.chunk(C_grad, TESSERACT_DIM, dim=0)[i] - C_grad = torch.chunk(C_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(C_grad, C.grad) - - B_grad = B_master.grad - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=0)[i] - B_grad = torch.chunk(B_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(B_grad, B.grad) - print_rank_0('ABT backward: pass') - - -def check_ATB(): - data_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.DATA) else gpc.get_local_rank(ParallelMode.DATA) - pipeline_parallel_rank = 0 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_local_rank( - ParallelMode.PIPELINE) - pipeline_parallel_size = 1 if not gpc.is_initialized(ParallelMode.PIPELINE) else gpc.get_world_size( - ParallelMode.PIPELINE) - tensor_parallel_size = gpc.get_world_size(ParallelMode.TENSOR) - - device = get_current_device() - dtype = torch.float - - i = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL) - j = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW) - k = gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, TESSERACT_DIM, dim=0)[i] - A = torch.chunk(A, TESSERACT_DIM, dim=-1)[j] - A = A.clone() - A.requires_grad = True - - C_shape = (BATCH_SIZE, SEQ_LENGTH, 4 * HIDDEN_SIZE) - C_master = torch.randn(C_shape, dtype=dtype, device=device) - torch.distributed.broadcast(C_master, src=0) - C = torch.chunk(C_master, TESSERACT_DIM, dim=0)[i] - C = torch.chunk(C, TESSERACT_DIM, dim=-1)[j] - C = C.clone() - C.requires_grad = True - - out = Matmul_ATB_2p5D.apply( - A, C, - TESSERACT_DIM, (HIDDEN_SIZE // TESSERACT_DIM, 4 * HIDDEN_SIZE // TESSERACT_DIM), - i, j, k, - ParallelMode.PARALLEL_2P5D_ROW, - ParallelMode.PARALLEL_2P5D_COL, - data_parallel_rank, - pipeline_parallel_rank, - pipeline_parallel_size, - tensor_parallel_size) - - B_shape = (HIDDEN_SIZE, 4 * 
HIDDEN_SIZE) - A_master = A_master.clone() - A_master.requires_grad = True - C_master = C_master.clone() - C_master.requires_grad = True - B_master = torch.matmul( - A_master.view(-1, A_master.shape[-1]).transpose(0, 1), - C_master.view(-1, C_master.shape[-1])) - B = torch.chunk(B_master, TESSERACT_DIM, dim=0)[i] - B = torch.chunk(B, TESSERACT_DIM, dim=-1)[j] - check_equal(out, B) - print_rank_0('ATB forward: pass') - - grad_shape = B_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, TESSERACT_DIM, dim=0)[i] - grad = torch.chunk(grad, TESSERACT_DIM, dim=-1)[j] - - out.backward(grad) - - B_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=0)[i] - A_grad = torch.chunk(A_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(A_grad, A.grad) - - C_grad = C_master.grad - C_grad = torch.chunk(C_grad, TESSERACT_DIM, dim=0)[i] - C_grad = torch.chunk(C_grad, TESSERACT_DIM, dim=-1)[j] - check_equal(C_grad, C.grad) - print_rank_0('ATB backward: pass') diff --git a/tests/test_layers/test_2p5d/checks_2p5d/common.py b/tests/test_layers/test_2p5d/checks_2p5d/common.py deleted file mode 100644 index aff85f109666d7cdf9e65173eda851368c39694c..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2p5d/checks_2p5d/common.py +++ /dev/null @@ -1,14 +0,0 @@ -import torch - -TESSERACT_DIM = 2 -TESSERACT_DEP = 2 -BATCH_SIZE = 8 -SEQ_LENGTH = 8 -HIDDEN_SIZE = 8 -NUM_CLASSES = 8 -VOCAB_SIZE = 16 -IMG_SIZE = 16 - - -def check_equal(A, B): - assert torch.allclose(A, B, rtol=1e-5, atol=1e-2) \ No newline at end of file diff --git a/tests/test_layers/test_2p5d/test_2p5d.py b/tests/test_layers/test_2p5d/test_2p5d.py deleted file mode 100644 index da0848d060b9940cd7ae3939a95f6b712fce3214..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_2p5d/test_2p5d.py +++ /dev/null @@ -1,68 +0,0 @@ -from functools import partial - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.logging import disable_existing_loggers -from colossalai.utils import free_port - -from checks_2p5d.check_layer_2p5d import * -from checks_2p5d.check_operation_2p5d import check_AB, check_ABT, check_ATB - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=4, mode='2.5d', depth=1), - ), -) - - -def check_operations(): - check_AB() - check_ABT() - check_ATB() - - -def check_layer(): - check_linear() - check_layernorm() - check_embed() - check_patch_embed() - check_vocab_parallel_embed() - check_classifier_no_given_weight() - check_vocab_parallel_classifier_no_given_weight() - check_classifier_given_embed_weight() - check_vocab_parallel_classifier_given_embed_weight() - check_loss() - check_vocab_parallel_loss() - - -def check_layer_and_operation(rank, world_size, port): - disable_existing_loggers() - launch(config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl') - - torch.backends.cuda.matmul.allow_tf32 = False - torch.backends.cudnn.allow_tf32 = False - torch.backends.cudnn.deterministic = True - check_operations() - check_layer() - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_2p5d(): - world_size = 4 - run_func = partial(check_layer_and_operation, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ 
== '__main__': - test_2p5d() diff --git a/tests/test_layers/test_3d/__pycache__/test_3d.cpython-37-pytest-7.1.3.pyc b/tests/test_layers/test_3d/__pycache__/test_3d.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index e8ae1b383432e46ed034f8b2926e6f4fa86b0a6e..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_3d/__pycache__/test_3d.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_layers/test_3d/__pycache__/test_3d.cpython-37.pyc b/tests/test_layers/test_3d/__pycache__/test_3d.cpython-37.pyc deleted file mode 100644 index fb820c82abb130d0c2962116cf197c13c26ad62c..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_3d/__pycache__/test_3d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_3d/checks_3d/__init__.py b/tests/test_layers/test_3d/checks_3d/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/tests/test_layers/test_3d/checks_3d/__pycache__/__init__.cpython-37.pyc b/tests/test_layers/test_3d/checks_3d/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index da44fbb7d6039095eed7d2eaa7db7038effc4d8e..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_3d/checks_3d/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_3d/checks_3d/__pycache__/check_layer_3d.cpython-37.pyc b/tests/test_layers/test_3d/checks_3d/__pycache__/check_layer_3d.cpython-37.pyc deleted file mode 100644 index 8b171ebb9523042c936d6d84d43f90ab4f884789..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_3d/checks_3d/__pycache__/check_layer_3d.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_3d/checks_3d/__pycache__/common.cpython-37.pyc b/tests/test_layers/test_3d/checks_3d/__pycache__/common.cpython-37.pyc deleted file mode 100644 index 3e051f2488311b89c1b94e154572c376c11d6148..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_3d/checks_3d/__pycache__/common.cpython-37.pyc and /dev/null differ diff --git a/tests/test_layers/test_3d/checks_3d/check_layer_3d.py b/tests/test_layers/test_3d/checks_3d/check_layer_3d.py deleted file mode 100644 index 087bb07816cd78972398114b1ed7fed02eb8df40..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_3d/checks_3d/check_layer_3d.py +++ /dev/null @@ -1,865 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import time - -import torch -from colossalai.constants import INPUT_GROUP_3D, OUTPUT_GROUP_3D, WEIGHT_GROUP_3D -from colossalai.core import global_context -from colossalai.logging import get_dist_logger -from colossalai.nn import (Classifier3D, CrossEntropyLoss3D, Embedding3D, LayerNorm3D, Linear3D, PatchEmbedding3D, - VanillaClassifier, VanillaPatchEmbedding, VocabParallelClassifier3D, - VocabParallelCrossEntropyLoss3D, VocabParallelEmbedding3D) -from colossalai.nn.layer.parallel_3d._utils import get_parallel_mode_from_env -from colossalai.utils import get_current_device, print_rank_0 - -from .common import BATCH_SIZE, DEPTH, HIDDEN_SIZE, IMG_SIZE, NUM_CLASSES, SEQ_LENGTH, VOCAB_SIZE, check_equal - - -def check_linear(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - OUTPUT_SIZE = 2 * HIDDEN_SIZE - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = 
get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - layer = Linear3D(INPUT_SIZE, OUTPUT_SIZE, dtype=dtype, bias=True) - layer = layer.to(device) - layer_master = torch.nn.Linear(INPUT_SIZE, OUTPUT_SIZE) - layer_master = layer_master.to(device) - - weight_master = layer_master.weight.data.transpose(0, 1) - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=0)[k] - weight = torch.chunk(weight, DEPTH, dim=-1)[j] - layer.weight.data.copy_(weight) - bias_master = layer_master.bias.data - torch.distributed.broadcast(bias_master, src=0) - bias = torch.chunk(bias_master, DEPTH)[j] - layer.bias.data.copy_(bias) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[k] - A = torch.chunk(A, DEPTH, dim=0)[j] - A = A.clone() - A.requires_grad = True - - fwd_start = time.time() - out = layer(A) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'linear forward: {0} --> {1} | {2:.3f} s'.format(tuple(A.shape), tuple(out.shape), fwd_end - fwd_start), logger) - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - C = torch.chunk(C, DEPTH, dim=0)[k] - logger.info('Rank {} linear forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = torch.chunk(grad, DEPTH, dim=0)[k] - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('linear backward: {:.3f} s'.format(bwd_end - bwd_start), logger) - - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[k] - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[j] - logger.info('Rank {} linear backward (input_grad): {}'.format(rank, check_equal(A_grad, A.grad))) - - B_grad = layer_master.weight.grad.transpose(0, 1) - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[k] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[j] - logger.info('Rank {} linear backward (weight_grad): {}'.format(rank, check_equal(B_grad, layer.weight.grad))) - - bias_grad = layer_master.bias.grad - bias_grad = torch.chunk(bias_grad, DEPTH)[j] - logger.info('Rank {} linear backward (bias_grad): {}'.format(rank, check_equal(bias_grad, layer.bias.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_layernorm(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = 
global_context.get_local_rank(output_parallel_mode) - - norm = LayerNorm3D(INPUT_SIZE, eps=1e-6, dtype=dtype) - norm = norm.to(device) - norm_master = torch.nn.LayerNorm(INPUT_SIZE, eps=1e-6) - norm_master = norm_master.to(device) - - weight_master = norm_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH)[k] - norm.weight.data.copy_(weight) - bias_master = norm_master.bias.data - torch.distributed.broadcast(bias_master, src=0) - bias = torch.chunk(bias_master, DEPTH)[k] - norm.bias.data.copy_(bias) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[k] - A = torch.chunk(A, DEPTH, dim=0)[j] - A = A.clone() - A.requires_grad = True - - fwd_start = time.time() - out = norm(A) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'layer norm forward: pass | {0} --> {1} | {2:.3f} s'.format(tuple(A.shape), tuple(out.shape), - fwd_end - fwd_start), logger) - - A_master = A_master.clone() - A_master.requires_grad = True - C_master = norm_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[k] - C = torch.chunk(C, DEPTH, dim=0)[j] - logger.info('Rank {} layernorm forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[k] - grad = torch.chunk(grad, DEPTH, dim=0)[j] - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('layer norm backward: pass | {:.3f} s'.format(bwd_end - bwd_start), logger) - - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[k] - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[j] - logger.info('Rank {} layernorm backward (input_grad): {}'.format(rank, check_equal(A_grad, A.grad))) - - weight_grad = norm_master.weight.grad - weight_grad = torch.chunk(weight_grad, DEPTH)[k] - logger.info('Rank {} layernorm backward (weight_grad): {}'.format(rank, check_equal(weight_grad, norm.weight.grad))) - - bias_grad = norm_master.bias.grad - bias_grad = torch.chunk(bias_grad, DEPTH)[k] - logger.info('Rank {} layernorm backward (bias_grad): {}'.format(rank, check_equal(bias_grad, norm.bias.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_classifier_no_given_weight(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - layer = Classifier3D(INPUT_SIZE, NUM_CLASSES, dtype=dtype, bias=True) - layer = layer.to(device) - - layer_master = VanillaClassifier(INPUT_SIZE, NUM_CLASSES, bias=True, dtype=dtype) - layer_master = layer_master.to(device) - - weight_master = layer_master.weight.data - 
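The classifier check continuing below copies only an input-dimension shard of the master weight to each rank (`dim=-1`, indexed by `k`), because the row-parallel matmul it exercises produces partial products that sum to the full logits. A minimal plain-PyTorch sketch of that identity, with illustrative sizes:

```python
# Sketch: summing the partial products A_k @ W_k^T over input-dimension
# shards recovers the full A @ W^T, which is why a single reduction over
# those shards completes the sharded classifier forward pass.
import torch

DEPTH, BATCH, HIDDEN, CLASSES = 2, 4, 8, 8
A = torch.randn(BATCH, HIDDEN)      # activations
W = torch.randn(CLASSES, HIDDEN)    # classifier weight

partial = sum(
    torch.chunk(A, DEPTH, dim=-1)[k] @ torch.chunk(W, DEPTH, dim=-1)[k].t()
    for k in range(DEPTH)
)
assert torch.allclose(partial, A @ W.t(), rtol=1e-5, atol=1e-5)
```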
torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[k] - layer.weight.data.copy_(weight) - bias_master = layer_master.bias.data - torch.distributed.broadcast(bias_master, src=0) - layer.bias.data.copy_(bias_master) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[k] - A = torch.chunk(A, DEPTH, dim=0)[j] - A = A.clone() - A.requires_grad = True - - fwd_start = time.time() - out = layer(A) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'classifier (no given weight) forward: pass | {0} --> {1} | {2:.3f} s'.format( - tuple(A.shape), tuple(out.shape), fwd_end - fwd_start), logger) - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=0)[j] - logger.info('Rank {} classifier (no given weight) forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=0)[j] - grad = grad.clone() - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('classifier (no given weight) backward: pass | {:.3f} s'.format(bwd_end - bwd_start), logger) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[k] - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[j] - logger.info('Rank {} classifier (no given weight) backward (input_grad): {}'.format( - rank, check_equal(A_grad, A.grad))) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[k] - if j == k: - logger.info('Rank {} classifier (no given weight) backward (weight_grad): {}'.format( - rank, check_equal(B_grad, layer.weight.grad))) - else: - logger.info('Rank {} classifier (no given weight) backward (weight_grad): {}'.format( - rank, layer.weight.grad is None)) - - bias_grad = layer_master.bias.grad - logger.info('Rank {} classifier (no given weight) backward (bias_grad): {}'.format( - rank, check_equal(bias_grad, layer.bias.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_vocab_parallel_classifier_no_given_weight(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - INPUT_SIZE = HIDDEN_SIZE - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - layer = VocabParallelClassifier3D(INPUT_SIZE, VOCAB_SIZE, bias=True) - layer = layer.to(dtype).to(device) - - layer_master = VanillaClassifier(INPUT_SIZE, VOCAB_SIZE, bias=True) - layer_master = layer_master.to(dtype).to(device) - - weight_master = layer_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, 
dim=0)[j] - weight = torch.chunk(weight, DEPTH, dim=-1)[k] - layer.weight.data.copy_(weight) - bias_master = layer_master.bias.data - torch.distributed.broadcast(bias_master, src=0) - bias = torch.chunk(bias_master, DEPTH)[j] - layer.bias.data.copy_(bias) - - A_shape = (BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = torch.chunk(A_master, DEPTH, dim=0)[i] - A = torch.chunk(A, DEPTH, dim=-1)[k] - A = torch.chunk(A, DEPTH, dim=0)[j] - A = A.clone() - A.requires_grad = True - - fwd_start = time.time() - out = layer(A) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'vocab parallel classifier (no given weight) forward: pass | {0} --> {1} | {2:.3f} s'.format( - tuple(A.shape), tuple(out.shape), fwd_end - fwd_start), logger) - A_master = A_master.clone() - A_master.requires_grad = True - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - C = torch.chunk(C, DEPTH, dim=0)[k] - logger.info('Rank {} vocab parallel classifier (no given weight) forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = torch.chunk(grad, DEPTH, dim=0)[k] - grad = grad.clone() - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('vocab parallel classifier (no given weight) backward: pass | {:.3f} s'.format(bwd_end - bwd_start), - logger) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - A_grad = A_master.grad - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[i] - A_grad = torch.chunk(A_grad, DEPTH, dim=-1)[k] - A_grad = torch.chunk(A_grad, DEPTH, dim=0)[j] - logger.info('Rank {} vocab parallel classifier (no given weight) backward (input_grad): {}'.format( - rank, check_equal(A_grad, A.grad))) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[k] - logger.info('Rank {} vocab parallel classifier (no given weight) backward (weight_grad): {}'.format( - rank, check_equal(B_grad, layer.weight.grad))) - - bias_grad = layer_master.bias.grad - bias_grad = torch.chunk(bias_grad, DEPTH)[j] - logger.info('Rank {} vocab parallel classifier (no given weight) backward (bias_grad): {}'.format( - rank, check_equal(bias_grad, layer.bias.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_classifier_given_embed_weight(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - embed = Embedding3D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = 
torch.chunk(weight_master, DEPTH, dim=-1)[k] - embed.weight.data.copy_(weight) - - layer = Classifier3D(HIDDEN_SIZE, VOCAB_SIZE, weight=embed.weight, bias=False) - layer = layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - - fwd_start = time.time() - out = layer(embed(A)) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'classifier (given embed weight) forward: pass | {0} --> {1} | {2:.3f} s'.format( - tuple(A.shape), tuple(out.shape), fwd_end - fwd_start), logger) - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=0)[j] - logger.info('Rank {} classifier (given embed weight) forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=get_current_device()) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=0)[j] - grad = grad.clone() - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('classifier (given embed weight) backward: pass | {:.3f} s'.format(bwd_end - bwd_start), logger) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[k] - if j == k: - logger.info('Rank {} classifier (given embed weight) backward (weight_grad): {}'.format( - rank, check_equal(B_grad, embed.weight.grad))) - else: - logger.info('Rank {} classifier (given embed weight) backward (weight_grad): {}'.format( - rank, embed.weight.grad is None)) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_vocab_parallel_classifier_given_embed_weight(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - embed = VocabParallelEmbedding3D(VOCAB_SIZE, HIDDEN_SIZE) - embed = embed.to(dtype).to(device) - - embed_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - embed_master = embed_master.to(dtype).to(device) - - weight_master = embed_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=0)[j] - weight = torch.chunk(weight, DEPTH, dim=-1)[k] - embed.weight.data.copy_(weight) - - layer = VocabParallelClassifier3D(HIDDEN_SIZE, VOCAB_SIZE, weight=embed.weight, bias=False) - layer = layer.to(dtype).to(device) - - layer_master = VanillaClassifier(HIDDEN_SIZE, VOCAB_SIZE, weight=embed_master.weight, bias=False) - layer_master = layer_master.to(dtype).to(device) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - - fwd_start = time.time() - 
out = layer(embed(A)) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'vocab parallel classifier (given embed weight) forward: pass | {0} --> {1} | {2:.3f} s'.format( - tuple(A.shape), tuple(out.shape), fwd_end - fwd_start), logger) - A_master = A_master.clone() - C_master = layer_master(embed_master(A_master)) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[j] - C = torch.chunk(C, DEPTH, dim=0)[k] - logger.info('Rank {} vocab parallel classifier (given embed weight) forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[j] - grad = torch.chunk(grad, DEPTH, dim=0)[k] - grad = grad.clone() - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('vocab parallel classifier (given embed weight) backward: pass | {:.3f} s'.format(bwd_end - bwd_start), - logger) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = embed_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[k] - logger.info('Rank {} vocab parallel embed backward (weight_grad): {}'.format(rank, - check_equal(B_grad, - embed.weight.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_patch_embed(): - rank = torch.distributed.get_rank() - device = get_current_device() - logger = get_dist_logger() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - layer = PatchEmbedding3D(IMG_SIZE, 4, 3, HIDDEN_SIZE, dtype=dtype) - torch.nn.init.ones_(layer.cls_token) - torch.nn.init.ones_(layer.pos_embed) - layer = layer.to(device) - - layer_master = VanillaPatchEmbedding(IMG_SIZE, 4, 3, HIDDEN_SIZE, dtype=dtype) - torch.nn.init.ones_(layer_master.cls_token) - torch.nn.init.ones_(layer_master.pos_embed) - layer_master = layer_master.to(device) - - proj_weight_master = layer_master.weight.data - torch.distributed.broadcast(proj_weight_master, src=0) - proj_weight = torch.chunk(proj_weight_master, DEPTH, dim=0)[k] - layer.weight.data.copy_(proj_weight) - proj_bias_master = layer_master.bias.data - torch.distributed.broadcast(proj_bias_master, src=0) - proj_bias = torch.chunk(proj_bias_master, DEPTH)[k] - layer.bias.data.copy_(proj_bias) - - A_shape = (BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE) - A_master = torch.randn(A_shape, dtype=dtype, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - - fwd_start = time.time() - out = layer(A) - torch.cuda.synchronize() - fwd_end = time.time() - print_rank_0( - 'patch embed forward: pass | {0} --> {1} | {2:.3f} s'.format(tuple(A.shape), tuple(out.shape), - fwd_end - fwd_start), logger) - - A_master = A_master.clone() - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[k] - C = torch.chunk(C, DEPTH, dim=0)[j] - logger.info('Rank {} patch embed forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = 
C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[k] - grad = torch.chunk(grad, DEPTH, dim=0)[j] - grad = grad.clone() - - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - print_rank_0('patch embed backward: pass | {:.3f} s'.format(bwd_end - bwd_start), logger) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - cls_grad_master = layer_master.cls_token.grad - cls_grad = torch.chunk(cls_grad_master, DEPTH, dim=-1)[k] - logger.info('Rank {} patch embed backward (cls_grad): {}'.format(rank, check_equal(cls_grad, layer.cls_token.grad))) - - pos_grad_master = layer_master.pos_embed.grad - pos_grad = torch.chunk(pos_grad_master, DEPTH, dim=-1)[k] - logger.info('Rank {} patch embed backward (pos_embed_grad): {}'.format(rank, - check_equal(pos_grad, layer.pos_embed.grad))) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[k] - logger.info('Rank {} patch embed backward (proj_weight_grad): {}'.format(rank, - check_equal(B_grad, layer.weight.grad))) - - bias_grad = layer_master.bias.grad - bias_grad = torch.chunk(bias_grad, DEPTH)[k] - logger.info('Rank {} patch embed backward (proj_bias_grad): {}'.format(rank, - check_equal(bias_grad, layer.bias.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_embed(): - rank = torch.distributed.get_rank() - device = get_current_device() - logger = get_dist_logger() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - layer = Embedding3D(VOCAB_SIZE, HIDDEN_SIZE) - layer = layer.to(dtype).to(device) - layer_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - layer_master = layer_master.to(dtype).to(device) - - weight_master = layer_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=-1)[k] - layer.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - - fwd_start = time.time() - out = layer(A) - torch.cuda.synchronize() - fwd_end = time.time() - logger.info('embed forward: pass | {0} --> {1} | {2:.3f} s'.format(tuple(A.shape), tuple(out.shape), - fwd_end - fwd_start), - ranks=[0]) - - A_master = A_master.clone() - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[k] - C = torch.chunk(C, DEPTH, dim=0)[j] - logger.info('Rank {} embed forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[k] - grad = torch.chunk(grad, DEPTH, dim=0)[j] - grad = grad.clone() - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - logger.info('embed backward: pass | {:.3f} 
s'.format(bwd_end - bwd_start), ranks=[0]) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[k] - if j == k: - logger.info('Rank {} embed backward (weight_grad): {}'.format(rank, check_equal(B_grad, layer.weight.grad))) - else: - logger.info('Rank {} embed backward (weight_grad): {}'.format(rank, layer.weight.grad is None)) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_vocab_parallel_embed(): - rank = torch.distributed.get_rank() - device = get_current_device() - logger = get_dist_logger() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - layer = VocabParallelEmbedding3D(VOCAB_SIZE, HIDDEN_SIZE) - layer = layer.to(dtype).to(device) - layer_master = torch.nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE) - layer_master = layer_master.to(dtype).to(device) - - weight_master = layer_master.weight.data - torch.distributed.broadcast(weight_master, src=0) - weight = torch.chunk(weight_master, DEPTH, dim=0)[j] - weight = torch.chunk(weight, DEPTH, dim=-1)[k] - layer.weight.data.copy_(weight) - - A_shape = (BATCH_SIZE, SEQ_LENGTH) - A_master = torch.randint(VOCAB_SIZE, A_shape, device=device) - torch.distributed.broadcast(A_master, src=0) - A = A_master.clone() - - fwd_start = time.time() - out = layer(A) - torch.cuda.synchronize() - fwd_end = time.time() - logger.info('vocab parallel embed forward: pass | {0} --> {1} | {2:.3f} s'.format( - tuple(A.shape), tuple(out.shape), fwd_end - fwd_start), - ranks=[0]) - - A_master = A_master.clone() - C_master = layer_master(A_master) - C = torch.chunk(C_master, DEPTH, dim=0)[i] - C = torch.chunk(C, DEPTH, dim=-1)[k] - C = torch.chunk(C, DEPTH, dim=0)[j] - logger.info('Rank {} vocab parallel embed forward: {}'.format(rank, check_equal(out, C))) - - grad_shape = C_master.shape - grad_master = torch.randn(grad_shape, dtype=dtype, device=device) - torch.distributed.broadcast(grad_master, src=0) - grad = torch.chunk(grad_master, DEPTH, dim=0)[i] - grad = torch.chunk(grad, DEPTH, dim=-1)[k] - grad = torch.chunk(grad, DEPTH, dim=0)[j] - grad = grad.clone() - bwd_start = time.time() - out.backward(grad) - torch.cuda.synchronize() - bwd_end = time.time() - logger.info('vocab parallel embed backward: pass | {:.3f} s'.format(bwd_end - bwd_start), ranks=[0]) - - grad_master = grad_master.clone() - C_master.backward(grad_master) - - B_grad = layer_master.weight.grad - B_grad = torch.chunk(B_grad, DEPTH, dim=0)[j] - B_grad = torch.chunk(B_grad, DEPTH, dim=-1)[k] - logger.info('Rank {} vocab parallel embed backward (weight_grad): {}'.format(rank, - check_equal(B_grad, - layer.weight.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_loss(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - - criterion = CrossEntropyLoss3D() - criterion_master = 
torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, ), dtype=torch.long, device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, DEPTH, dim=0)[i] - out = torch.chunk(out, DEPTH, dim=0)[j] - out = out.clone() - out.requires_grad = True - - fwd_start = time.time() - loss = criterion(out, target_master) - fwd_end = time.time() - logger.info('cross entropy loss forward: pass | {0} --> {1} | {2:.3f} s'.format(tuple(out.shape), tuple(loss.shape), - fwd_end - fwd_start), - ranks=[0]) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - logger.info('Rank {} cross entropy loss forward: {}'.format(rank, check_equal(loss, loss_master))) - - bwd_start = time.time() - loss.backward() - bwd_end = time.time() - logger.info('cross entropy loss backward: pass | {:.3f} s'.format(bwd_end - bwd_start), ranks=[0]) - - loss_master.backward() - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, DEPTH, dim=0)[i] - out_grad = torch.chunk(out_grad, DEPTH, dim=0)[j] - logger.info('Rank {} cross entropy loss backward: {}'.format(rank, check_equal(out_grad, out.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start - - -def check_vocab_parallel_loss(): - rank = torch.distributed.get_rank() - logger = get_dist_logger() - device = get_current_device() - dtype = torch.float32 - - input_parallel_mode = get_parallel_mode_from_env(INPUT_GROUP_3D) - weight_parallel_mode = get_parallel_mode_from_env(WEIGHT_GROUP_3D) - output_parallel_mode = get_parallel_mode_from_env(OUTPUT_GROUP_3D) - - j = global_context.get_local_rank(input_parallel_mode) - i = global_context.get_local_rank(weight_parallel_mode) - k = global_context.get_local_rank(output_parallel_mode) - - criterion = VocabParallelCrossEntropyLoss3D() - criterion_master = torch.nn.CrossEntropyLoss() - - out_shape = (BATCH_SIZE, NUM_CLASSES) - out_master = torch.randn(out_shape, dtype=dtype, device=device) - target_master = torch.randint(NUM_CLASSES, (BATCH_SIZE, ), dtype=torch.long, device=device) - torch.distributed.broadcast(out_master, src=0) - torch.distributed.broadcast(target_master, src=0) - out = torch.chunk(out_master, DEPTH, dim=0)[i] - out = torch.chunk(out, DEPTH, dim=-1)[k] - out = torch.chunk(out, DEPTH, dim=0)[j] - out = out.clone() - out.requires_grad = True - - fwd_start = time.time() - loss = criterion(out, target_master) - fwd_end = time.time() - logger.info('vocab parallel cross entropy loss forward: pass | {0} --> {1} | {2:.3f} s'.format( - tuple(out.shape), tuple(loss.shape), fwd_end - fwd_start), - ranks=[0]) - - out_master = out_master.clone() - out_master.requires_grad = True - loss_master = criterion_master(out_master, target_master) - logger.info('Rank {} vocab parallel cross entropy loss forward: {}'.format(rank, check_equal(loss, loss_master))) - - bwd_start = time.time() - loss.backward() - bwd_end = time.time() - logger.info('vocab parallel cross entropy loss backward: pass | {:.3f} s'.format(bwd_end - bwd_start), ranks=[0]) - - loss_master.backward() - out_grad = out_master.grad - out_grad = torch.chunk(out_grad, DEPTH, dim=0)[i] - out_grad = torch.chunk(out_grad, DEPTH, dim=-1)[k] - out_grad = torch.chunk(out_grad, DEPTH, dim=0)[j] - logger.info('Rank {} vocab parallel cross entropy loss backward: {}'.format(rank, 
check_equal(out_grad, out.grad))) - - return fwd_end - fwd_start, bwd_end - bwd_start diff --git a/tests/test_layers/test_3d/checks_3d/common.py b/tests/test_layers/test_3d/checks_3d/common.py deleted file mode 100644 index 43a04f649145c5843e97e2de4ad7d6bef4e4ead1..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_3d/checks_3d/common.py +++ /dev/null @@ -1,18 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch - -DEPTH = 2 -BATCH_SIZE = 8 -SEQ_LENGTH = 8 -HIDDEN_SIZE = 8 -NUM_CLASSES = 8 -NUM_BLOCKS = 2 -IMG_SIZE = 16 -VOCAB_SIZE = 16 - -def check_equal(A, B): - eq = torch.allclose(A, B, rtol=1e-3, atol=1e-2) - assert eq - return eq diff --git a/tests/test_layers/test_3d/test_3d.py b/tests/test_layers/test_3d/test_3d.py deleted file mode 100644 index 131013b8d827bcab8de9a5adbc59110c692d70c7..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_3d/test_3d.py +++ /dev/null @@ -1,63 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from functools import partial - -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.logging import disable_existing_loggers -from colossalai.utils import free_port - -from checks_3d.check_layer_3d import (check_classifier_given_embed_weight, check_classifier_no_given_weight, - check_embed, check_layernorm, check_linear, check_loss, check_patch_embed, - check_vocab_parallel_classifier_given_embed_weight, - check_vocab_parallel_classifier_no_given_weight, check_vocab_parallel_embed, - check_vocab_parallel_loss) - -CONFIG = dict( - parallel=dict( - pipeline=1, - tensor=dict(mode='3d', size=8), - ), - seed=42, -) - - -def check_layer(): - check_linear() - check_layernorm() - check_classifier_no_given_weight() - check_vocab_parallel_classifier_no_given_weight() - check_classifier_given_embed_weight() - check_vocab_parallel_classifier_given_embed_weight() - check_embed() - check_patch_embed() - check_vocab_parallel_embed() - check_loss() - check_vocab_parallel_loss() - - -def check_layer_and_operation(rank, world_size, port): - disable_existing_loggers() - launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - torch.backends.cuda.matmul.allow_tf32 = False - torch.backends.cudnn.allow_tf32 = False - torch.backends.cudnn.deterministic = True - check_layer() - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_3d(): - # the config above sets 3D tensor parallel size to 8, so the test needs 8 processes - world_size = 8 - run_func = partial(check_layer_and_operation, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_3d() diff --git a/tests/test_layers/test_sequence/__pycache__/test_sequence.cpython-37-pytest-7.1.3.pyc b/tests/test_layers/test_sequence/__pycache__/test_sequence.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index df6ed1e5a026f5d53b412913b38c054ad8f0c0a6..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_sequence/__pycache__/test_sequence.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_layers/test_sequence/__pycache__/test_sequence.cpython-37.pyc b/tests/test_layers/test_sequence/__pycache__/test_sequence.cpython-37.pyc deleted file mode 100644 index 0357a9e3b450df49508c6b812d7c9e12bb5e0b63..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_sequence/__pycache__/test_sequence.cpython-37.pyc and /dev/null
differ diff --git a/tests/test_layers/test_sequence/checks_seq/__init__.py b/tests/test_layers/test_sequence/checks_seq/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/tests/test_layers/test_sequence/checks_seq/__pycache__/__init__.cpython-36.pyc b/tests/test_layers/test_sequence/checks_seq/__pycache__/__init__.cpython-36.pyc deleted file mode 100644 index b93d0958e54ef1fcf737d14c18cc116163c1dfed..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_sequence/checks_seq/__pycache__/__init__.cpython-36.pyc and /dev/null differ diff --git a/tests/test_layers/test_sequence/checks_seq/__pycache__/check_layer_seq.cpython-36-pytest-7.0.1.pyc b/tests/test_layers/test_sequence/checks_seq/__pycache__/check_layer_seq.cpython-36-pytest-7.0.1.pyc deleted file mode 100644 index 7476da567a90ffdc89fa38f55090dd16c8eff62b..0000000000000000000000000000000000000000 Binary files a/tests/test_layers/test_sequence/checks_seq/__pycache__/check_layer_seq.cpython-36-pytest-7.0.1.pyc and /dev/null differ diff --git a/tests/test_layers/test_sequence/checks_seq/check_layer_seq.py b/tests/test_layers/test_sequence/checks_seq/check_layer_seq.py deleted file mode 100644 index 156e60333bf7af2a4053f362ce6567cc7d940e21..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_sequence/checks_seq/check_layer_seq.py +++ /dev/null @@ -1,26 +0,0 @@ -import torch - -from colossalai.context import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.nn import TransformerSelfAttentionRing -from colossalai.utils import get_current_device - - -def check_selfattention(): - WORLD_SIZE = gpc.get_world_size(ParallelMode.SEQUENCE) - SUB_SEQ_LENGTH = 8 - BATCH = 4 - HIDDEN_SIZE = 16 - - layer = TransformerSelfAttentionRing( - 16, - 8, - 8, - 0.1 - ) - layer = layer.to(get_current_device()) - - hidden_states = torch.rand(SUB_SEQ_LENGTH, BATCH, HIDDEN_SIZE).to(get_current_device()) - attention_mask = torch.randint(low=0, high=2, size=(BATCH, 1, 1, 1, SUB_SEQ_LENGTH * WORLD_SIZE)).to( - get_current_device()) - out = layer(hidden_states, attention_mask) diff --git a/tests/test_layers/test_sequence/test_sequence.py b/tests/test_layers/test_sequence/test_sequence.py deleted file mode 100644 index b482757c8cb1bc80fa0f2bd51958896bf497b752..0000000000000000000000000000000000000000 --- a/tests/test_layers/test_sequence/test_sequence.py +++ /dev/null @@ -1,153 +0,0 @@ -import colossalai -import colossalai.nn as col_nn -import torch -import torch.distributed as dist -import torch.multiprocessing as mp -import pytest - -from colossalai.core import global_context as gpc -from colossalai.context import ParallelMode -from functools import partial - - -CONFIG = dict( - parallel=dict( - tensor=dict(size=4, mode='sequence') - ) -) - - -def check_ring_qk(rank, world_size): - # params - batch_size = 4 - num_heads = 4 - seq_length = 32 - attention_head_size = 32 - sub_seq_length = seq_length // world_size - - # create master tensors - q = torch.rand(batch_size*num_heads, seq_length, attention_head_size).cuda() - k = torch.rand(batch_size*num_heads, seq_length, attention_head_size).cuda() - dist.broadcast(q, src=0, group=gpc.get_group(ParallelMode.SEQUENCE)) - dist.broadcast(k, src=0, group=gpc.get_group(ParallelMode.SEQUENCE)) - - # create distributed tensors - sub_q = q.clone()[:, rank*sub_seq_length:(rank+1)*sub_seq_length].contiguous() - sub_k = k.clone()[:, 
rank*sub_seq_length:(rank+1)*sub_seq_length].contiguous() - - # set autograd attributes - q.requires_grad = True - k.requires_grad = True - q.retain_grad() - k.retain_grad() - sub_q.requires_grad = True - sub_k.requires_grad = True - sub_q.retain_grad() - sub_k.retain_grad() - - # compute master attention scores - a = torch.matmul(q, k.transpose(2, 1)) - - # compute distributed attention scores - ring_qk = colossalai.nn.layer.parallel_sequence.RingQK.apply - sub_a = ring_qk(sub_q, sub_k, batch_size, num_heads, sub_seq_length) - - # check master and distributed attention scores - sub_master_a = a[:, rank*sub_seq_length:(rank+1)*sub_seq_length] - assert torch.allclose(sub_a, sub_master_a, rtol=1e-5, atol=1e-2) - - # run master backward - a.retain_grad() - a.mean().backward() - - # run distributed backward - partial_master_a_grad = a.grad[:, rank*sub_seq_length:(rank+1)*sub_seq_length] - torch.autograd.backward(sub_a, partial_master_a_grad) - - # check master and distributed grads - partial_master_q_grad = q.grad[:, rank*sub_seq_length:(rank+1)*sub_seq_length] - assert torch.allclose(sub_q.grad, partial_master_q_grad, rtol=1e-5, atol=1e-2), \ - 'query gradients do not match' - - -def check_ring_av(rank, world_size): - # params - batch_size = 4 - num_heads = 4 - seq_length = 16 - attention_head_size = 32 - sub_seq_length = seq_length // world_size - - # create master tensors - a = torch.rand(batch_size*num_heads, seq_length, seq_length).cuda() - v = torch.rand(batch_size*num_heads, seq_length, attention_head_size).cuda() - dist.broadcast(a, src=0, group=gpc.get_group(ParallelMode.SEQUENCE)) - dist.broadcast(v, src=0, group=gpc.get_group(ParallelMode.SEQUENCE)) - - # create distributed tensors - sub_a = a.clone()[:, rank*sub_seq_length:(rank+1)*sub_seq_length].contiguous() - sub_v = v.clone()[:, rank*sub_seq_length:(rank+1)*sub_seq_length].contiguous() - - # set autograd attributes - a.requires_grad = True - v.requires_grad = True - a.retain_grad() - v.retain_grad() - sub_a.requires_grad = True - sub_v.requires_grad = True - sub_a.retain_grad() - sub_v.retain_grad() - - # compute master attention output - out = torch.matmul(a, v) - - # compute distributed attention output - ring_av = colossalai.nn.layer.parallel_sequence.RingAV.apply - sub_out = ring_av(sub_a, sub_v, batch_size, num_heads, attention_head_size, sub_seq_length) - - # check master and distributed output - sub_master_out = out[:, rank*sub_seq_length:(rank+1)*sub_seq_length] - assert torch.allclose(sub_out, sub_master_out, rtol=1e-5, atol=1e-2) - - # run master backward - out.retain_grad() - out.mean().backward() - - # run distributed backward - partial_master_out_grad = out.grad[:, rank*sub_seq_length:(rank+1)*sub_seq_length] - torch.autograd.backward(sub_out, partial_master_out_grad) - - # check master and distributed grads - partial_master_a_grad = a.grad[:, rank*sub_seq_length:(rank+1)*sub_seq_length] - assert torch.allclose(sub_a.grad, partial_master_a_grad, rtol=1e-5, atol=1e-2), \ - 'attention score gradients do not match' - - -def run_test(rank, world_size): - colossalai.launch( - rank=rank, - world_size=world_size, - config=CONFIG, - host='localhost', - port=29501 - ) - - check_ring_qk(rank, world_size) - check_ring_av(rank, world_size) - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_sequence(): - world_size = 4 - run_func = partial(run_test, world_size=world_size) - mp.spawn(run_func, nprocs=world_size)
- - -if __name__ == '__main__': - test_sequence() diff --git a/tests/test_trainer/__pycache__/test_trainer_with_non_pipe_schedule.cpython-37-pytest-7.1.3.pyc b/tests/test_trainer/__pycache__/test_trainer_with_non_pipe_schedule.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 1f2b55349c99a0d99357bda5532df7f3f25e3021..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/__pycache__/test_trainer_with_non_pipe_schedule.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_trainer/__pycache__/test_trainer_with_non_pipe_schedule.cpython-37.pyc b/tests/test_trainer/__pycache__/test_trainer_with_non_pipe_schedule.cpython-37.pyc deleted file mode 100644 index 8135b43752059563f771f0738ed9a3b5c6e81335..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/__pycache__/test_trainer_with_non_pipe_schedule.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/__pycache__/test_trainer_with_pipe_schedule.cpython-37-pytest-7.1.3.pyc b/tests/test_trainer/__pycache__/test_trainer_with_pipe_schedule.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 31fbabd926795c20266ff81c058f2cbbe8624611..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/__pycache__/test_trainer_with_pipe_schedule.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_trainer/__pycache__/test_trainer_with_pipe_schedule.cpython-37.pyc b/tests/test_trainer/__pycache__/test_trainer_with_pipe_schedule.cpython-37.pyc deleted file mode 100644 index 03c5efafc224c97e7430312b6d8fae311f5cb5e0..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/__pycache__/test_trainer_with_pipe_schedule.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/resnet_config.cpython-37.pyc b/tests/test_trainer/test_pipeline/__pycache__/resnet_config.cpython-37.pyc deleted file mode 100644 index 3aa0312345751c5e6c4e7d2311723941b94f2812..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/resnet_config.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/test_p2p.cpython-37-pytest-7.1.3.pyc b/tests/test_trainer/test_pipeline/__pycache__/test_p2p.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 2ff03fd8fbb59ca1f98d313201a86757a69a2a3c..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/test_p2p.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/test_p2p.cpython-37.pyc b/tests/test_trainer/test_pipeline/__pycache__/test_p2p.cpython-37.pyc deleted file mode 100644 index 9e297d08945ec6b8b7c1596b1a8972854f7f3846..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/test_p2p.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/test_partition.cpython-37-pytest-7.1.3.pyc b/tests/test_trainer/test_pipeline/__pycache__/test_partition.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index a40108e26d667c71184efccd0906527227e75968..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/test_partition.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/test_partition.cpython-37.pyc b/tests/test_trainer/test_pipeline/__pycache__/test_partition.cpython-37.pyc deleted file mode 100644 index 
2bd62beb628a12bb38bb349107439232dcd5ee62..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/test_partition.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/test_pipeline_schedule.cpython-37-pytest-7.1.3.pyc b/tests/test_trainer/test_pipeline/__pycache__/test_pipeline_schedule.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 4f9f8b8cbd3df7696aea4faad59620c174befa71..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/test_pipeline_schedule.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/__pycache__/test_pipeline_schedule.cpython-37.pyc b/tests/test_trainer/test_pipeline/__pycache__/test_pipeline_schedule.cpython-37.pyc deleted file mode 100644 index 533d50a745cb7116badf4c4fb968afe10d751c09..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/__pycache__/test_pipeline_schedule.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/__init__.py b/tests/test_trainer/test_pipeline/model/__init__.py deleted file mode 100644 index 2bf880f41ebc0e1b2234b926845c211c25f29c32..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ -from .layers import * -from .resnet import VanillaResNet diff --git a/tests/test_trainer/test_pipeline/model/__pycache__/__init__.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 8ec577a8919375e4b16f3cfd61bb5f6f97cde986..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/__pycache__/resnet.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/__pycache__/resnet.cpython-37.pyc deleted file mode 100644 index fad13f8c723886adb37684868fe012271306f6b9..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/__pycache__/resnet.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/layers/__init__.py b/tests/test_trainer/test_pipeline/model/layers/__init__.py deleted file mode 100644 index aa553b73754d34ce0b9278ac48114c2fe3b25285..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/layers/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from .basic_block import ResNetBasicBlock -from .bottleneck import ResNetBottleneck -from .reslayer import ResLayer \ No newline at end of file diff --git a/tests/test_trainer/test_pipeline/model/layers/__pycache__/__init__.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/layers/__pycache__/__init__.cpython-37.pyc deleted file mode 100644 index 3d55bab038e36740ea64e543ad6131b7ffa09862..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/layers/__pycache__/__init__.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/layers/__pycache__/basic_block.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/layers/__pycache__/basic_block.cpython-37.pyc deleted file mode 100644 index a4aa94d25039a045e035a73e68768a4424bbaf01..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/layers/__pycache__/basic_block.cpython-37.pyc and /dev/null differ diff --git 
a/tests/test_trainer/test_pipeline/model/layers/__pycache__/bottleneck.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/layers/__pycache__/bottleneck.cpython-37.pyc deleted file mode 100644 index 53be42c19284b5546f04e823ac99067179338d19..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/layers/__pycache__/bottleneck.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/layers/__pycache__/conv.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/layers/__pycache__/conv.cpython-37.pyc deleted file mode 100644 index a653be2d2ce10ae9555bc17079e9f026fe54e2a3..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/layers/__pycache__/conv.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/layers/__pycache__/reslayer.cpython-37.pyc b/tests/test_trainer/test_pipeline/model/layers/__pycache__/reslayer.cpython-37.pyc deleted file mode 100644 index 0a79e57cac9b83cb1dd2450cf54db40b13eb72bc..0000000000000000000000000000000000000000 Binary files a/tests/test_trainer/test_pipeline/model/layers/__pycache__/reslayer.cpython-37.pyc and /dev/null differ diff --git a/tests/test_trainer/test_pipeline/model/layers/basic_block.py b/tests/test_trainer/test_pipeline/model/layers/basic_block.py deleted file mode 100644 index 320dac2fde591fffa98af12447a972010bceab93..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/layers/basic_block.py +++ /dev/null @@ -1,64 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from typing import Optional, Callable - -import torch.nn as nn -from torch import Tensor - -from colossalai.registry import LAYERS -from .conv import conv3x3 - - -@LAYERS.register_module -class ResNetBasicBlock(nn.Module): - """Basic ResNet block - """ - expansion: int = 1 - - def __init__( - self, - inplanes: int, - planes: int, - stride: int = 1, - downsample: Optional[nn.Module] = None, - groups: int = 1, - base_width: int = 64, - dilation: int = 1, - norm_layer: Optional[Callable[..., nn.Module]] = None - ) -> None: - super().__init__() - if norm_layer is None: - norm_layer = nn.BatchNorm2d - if groups != 1 or base_width != 64: - raise ValueError( - 'BasicBlock only supports groups=1 and base_width=64') - if dilation > 1: - raise NotImplementedError( - "Dilation > 1 not supported in BasicBlock") - # Both self.conv1 and self.downsample layers downsample the input when stride != 1 - self.conv1 = conv3x3(inplanes, planes, stride) - self.bn1 = norm_layer(planes) - self.relu = nn.ReLU(inplace=True) - self.conv2 = conv3x3(planes, planes) - self.bn2 = norm_layer(planes) - self.downsample = downsample - self.stride = stride - - def forward(self, x: Tensor) -> Tensor: - identity = x - - out = self.conv1(x) - out = self.bn1(out) - out = self.relu(out) - - out = self.conv2(out) - out = self.bn2(out) - - if self.downsample is not None: - identity = self.downsample(x) - - out += identity - out = self.relu(out) - - return out diff --git a/tests/test_trainer/test_pipeline/model/layers/bottleneck.py b/tests/test_trainer/test_pipeline/model/layers/bottleneck.py deleted file mode 100644 index d75f9534b0f7dbbd99a69544ef3c8c5edba1f77b..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/layers/bottleneck.py +++ /dev/null @@ -1,69 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from typing import Optional, Callable - -import torch.nn as nn -from torch import Tensor - -from 
colossalai.registry import LAYERS -from .conv import conv3x3, conv1x1 - - -@LAYERS.register_module -class ResNetBottleneck(nn.Module): - # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2) - # while original implementation places the stride at the first 1x1 convolution(self.conv1) - # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385. - # This variant is also known as ResNet V1.5 and improves accuracy according to - # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch. - - expansion: int = 4 - - def __init__( - self, - inplanes: int, - planes: int, - stride: int = 1, - downsample: Optional[nn.Module] = None, - groups: int = 1, - base_width: int = 64, - dilation: int = 1, - norm_layer: Optional[Callable[..., nn.Module]] = None - ) -> None: - super().__init__() - if norm_layer is None: - norm_layer = nn.BatchNorm2d - width = int(planes * (base_width / 64.)) * groups - # Both self.conv2 and self.downsample layers downsample the input when stride != 1 - self.conv1 = conv1x1(inplanes, width) - self.bn1 = norm_layer(width) - self.conv2 = conv3x3(width, width, stride, groups, dilation) - self.bn2 = norm_layer(width) - self.conv3 = conv1x1(width, planes * self.expansion) - self.bn3 = norm_layer(planes * self.expansion) - self.relu = nn.ReLU(inplace=True) - self.downsample = downsample - self.stride = stride - - def forward(self, x: Tensor) -> Tensor: - identity = x - - out = self.conv1(x) - out = self.bn1(out) - out = self.relu(out) - - out = self.conv2(out) - out = self.bn2(out) - out = self.relu(out) - - out = self.conv3(out) - out = self.bn3(out) - - if self.downsample is not None: - identity = self.downsample(x) - - out += identity - out = self.relu(out) - - return out diff --git a/tests/test_trainer/test_pipeline/model/layers/conv.py b/tests/test_trainer/test_pipeline/model/layers/conv.py deleted file mode 100644 index c918d94c4e1a54f32e85b792da542085e70fa2b8..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/layers/conv.py +++ /dev/null @@ -1,15 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn - - -def conv3x3(in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1) -> nn.Conv2d: - """3x3 convolution with padding""" - return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, - padding=dilation, groups=groups, bias=False, dilation=dilation) - - -def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d: - """1x1 convolution""" - return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False) diff --git a/tests/test_trainer/test_pipeline/model/layers/reslayer.py b/tests/test_trainer/test_pipeline/model/layers/reslayer.py deleted file mode 100644 index 4e1b48c5e8b57f8a6274c9a976591eca18b9661a..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/layers/reslayer.py +++ /dev/null @@ -1,63 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch.nn as nn - -from colossalai.registry import LAYERS -from .conv import conv1x1 - - -@LAYERS.register_module -class ResLayer(nn.Module): - - def __init__(self, - block_type: str, - norm_layer_type: str, - inplanes: int, - planes: int, - blocks: int, - groups: int, - base_width: int, - stride: int = 1, - dilation: int = 1, - dilate: bool = False, - ): - super().__init__() - self.block = LAYERS.get_module(block_type) - self.norm_layer = 
LAYERS.get_module(norm_layer_type) - self.inplanes = inplanes - self.planes = planes - self.blocks = blocks - self.groups = groups - self.dilation = dilation - self.base_width = base_width - self.dilate = dilate - self.stride = stride - self.layer = self._make_layer() - - def _make_layer(self): - norm_layer = self.norm_layer - downsample = None - previous_dilation = self.dilation - if self.dilate: - self.dilation *= self.stride - self.stride = 1 - if self.stride != 1 or self.inplanes != self.planes * self.block.expansion: - downsample = nn.Sequential( - conv1x1(self.inplanes, self.planes * self.block.expansion, self.stride), - norm_layer(self.planes * self.block.expansion), - ) - - layers = [] - layers.append(self.block(self.inplanes, self.planes, self.stride, downsample, self.groups, - self.base_width, previous_dilation, norm_layer)) - self.inplanes = self.planes * self.block.expansion - for _ in range(1, self.blocks): - layers.append(self.block(self.inplanes, self.planes, groups=self.groups, - base_width=self.base_width, dilation=self.dilation, - norm_layer=norm_layer)) - - return nn.Sequential(*layers) - - def forward(self, x): - return self.layer(x) diff --git a/tests/test_trainer/test_pipeline/model/resnet.py b/tests/test_trainer/test_pipeline/model/resnet.py deleted file mode 100644 index 11d964943942a907dfddfc40812d1f47c1cd057e..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/model/resnet.py +++ /dev/null @@ -1,163 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from typing import List, Optional - -import torch -import torch.nn as nn -from torch import Tensor - -from colossalai.registry import LAYERS -from colossalai.registry import MODELS -from colossalai.nn.model import ModelFromConfig - - -@MODELS.register_module -class VanillaResNet(ModelFromConfig): - """ResNet from - `"Deep Residual Learning for Image Recognition" <https://arxiv.org/abs/1512.03385>`_.
- """ - - def __init__( - self, - num_cls: int, - block_type: str, - layers: List[int], - norm_layer_type: str = 'BatchNorm2d', - in_channels: int = 3, - groups: int = 1, - width_per_group: int = 64, - zero_init_residual: bool = False, - replace_stride_with_dilation: Optional[List[bool]] = None, - dilations=(1, 1, 1, 1) - ) -> None: - super().__init__() - - self.inplanes = 64 - self.zero_init_residual = zero_init_residual - self.blocks = layers - self.block_expansion = LAYERS.get_module(block_type).expansion - self.dilations = dilations - self.reslayer_common_cfg = dict( - type='ResLayer', - block_type=block_type, - norm_layer_type=norm_layer_type, - groups=groups, - base_width=width_per_group - ) - - if replace_stride_with_dilation is None: - # each element in the tuple indicates if we should replace - # the 2x2 stride with a dilated convolution instead - replace_stride_with_dilation = [False, False, False] - - if len(replace_stride_with_dilation) != 3: - raise ValueError("replace_stride_with_dilation should be None " - "or a 3-element tuple, got {}".format(replace_stride_with_dilation)) - - self.layers_cfg = [ - # conv1 - dict(type='Conv2d', - in_channels=in_channels, - out_channels=self.inplanes, - kernel_size=7, - stride=2, - padding=3, - bias=False), - # bn1 - dict( - type=norm_layer_type, - num_features=self.inplanes - ), - # relu - dict( - type='ReLU', - inplace=True - ), - # maxpool - dict( - type='MaxPool2d', - kernel_size=3, - stride=2, - padding=1 - ), - # layer 1 - dict( - inplanes=self.inplanes, - planes=64, - blocks=self.blocks[0], - dilation=self.dilations[0], - **self.reslayer_common_cfg - ), - # layer 2 - dict( - inplanes=64 * self.block_expansion, - planes=128, - blocks=self.blocks[1], - stride=2, - dilate=replace_stride_with_dilation[0], - dilation=self.dilations[1], - **self.reslayer_common_cfg - ), - # layer 3 - dict( - inplanes=128 * self.block_expansion, - planes=256, - blocks=layers[2], - stride=2, - dilate=replace_stride_with_dilation[1], - dilation=self.dilations[2], - **self.reslayer_common_cfg - ), - # layer 4 - dict( - inplanes=256 * self.block_expansion, - planes=512, - blocks=layers[3], stride=2, - dilate=replace_stride_with_dilation[2], - dilation=self.dilations[3], - **self.reslayer_common_cfg - ), - # avg pool - dict( - type='AdaptiveAvgPool2d', - output_size=(1, 1) - ), - # flatten - dict( - type='LambdaWrapper', - func=lambda mod, x: torch.flatten(x, 1) - ), - # linear - dict( - type='Linear', - in_features=512 * self.block_expansion, - out_features=num_cls - ) - ] - - def forward(self, x: Tensor): - for layer in self.layers: - x = layer(x) - return x - - def init_weights(self): - for m in self.modules(): - if isinstance(m, nn.Conv2d): - nn.init.kaiming_normal_( - m.weight, mode='fan_out', nonlinearity='relu') - elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): - nn.init.constant_(m.weight, 1) - nn.init.constant_(m.bias, 0) - - # Zero-initialize the last BN in each residual branch, - # so that the residual branch starts with zeros, and each residual block behaves like an identity. 
- # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677 - if self.zero_init_residual: - for m in self.modules(): - if isinstance(m, LAYERS.get_module('ResNetBottleneck')): - # type: ignore[arg-type] - nn.init.constant_(m.bn3.weight, 0) - elif isinstance(m, LAYERS.get_module('ResNetBasicBlock')): - # type: ignore[arg-type] - nn.init.constant_(m.bn2.weight, 0) diff --git a/tests/test_trainer/test_pipeline/resnet_config.py b/tests/test_trainer/test_pipeline/resnet_config.py deleted file mode 100644 index cbf7dd266192d46a365380fe7a21aa0613e51ec7..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/resnet_config.py +++ /dev/null @@ -1,20 +0,0 @@ -import os -import model -from pathlib import Path - -BATCH_SIZE = 128 -IMG_SIZE = 224 -DIM = 768 -NUM_CLASSES = 10 -NUM_ATTN_HEADS = 12 - -# resnet 18 -model = dict(type='VanillaResNet', - block_type='ResNetBasicBlock', - layers=[2, 2, 2, 2], - num_cls=10) - -parallel = dict( - pipeline=dict(size=4), - tensor=dict(size=1, mode=None) -) diff --git a/tests/test_trainer/test_pipeline/test_p2p.py b/tests/test_trainer/test_pipeline/test_p2p.py deleted file mode 100644 index 5258b42a5830b90ef8bfadf5ff34e601a59dfe28..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/test_p2p.py +++ /dev/null @@ -1,123 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -from functools import partial - -import pytest -import torch -import torch.distributed as dist -import torch.multiprocessing as mp -from colossalai.communication import (recv_backward, recv_forward, - recv_tensor_meta, send_backward, - send_backward_recv_forward, send_forward, - send_forward_recv_backward, - send_tensor_meta) -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.initialize import launch -from colossalai.logging import get_dist_logger -from colossalai.utils import free_port, get_current_device - -BATCH_SIZE = 16 -SEQ_LENGTH = 64 -HIDDEN_SIZE = 128 - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=4), - tensor=dict(size=1, mode=None) - ), - seed=1024 -) - - -def check_equal(A, B): - return torch.allclose(A, B, rtol=1e-5, atol=1e-3) - - -def check_forward(output_tensor, rank, logger): - dist.barrier() - if gpc.is_first_rank(ParallelMode.PIPELINE): - tensor = output_tensor.clone() - else: - tensor = recv_forward(output_tensor.shape) - logger.info('Rank {} received forward. Correct tensor: {}'.format( - rank, check_equal(tensor, output_tensor))) - if not gpc.is_last_rank(ParallelMode.PIPELINE): - send_forward(tensor) - logger.info('Rank {} sent forward.'.format(rank)) - - -def check_backward(output_grad, rank, logger): - dist.barrier() - if gpc.is_last_rank(ParallelMode.PIPELINE): - grad = output_grad.clone() - else: - grad = recv_backward(output_grad.shape) - logger.info('Rank {} received backward. Correct grad: {}'.format( - rank, check_equal(grad, output_grad))) - if not gpc.is_first_rank(ParallelMode.PIPELINE): - send_backward(grad) - logger.info('Rank {} sent backward.'.format(rank)) - - -def check_forward_backward(output_tensor, output_grad, rank, logger): - dist.barrier() - if not gpc.is_first_rank(ParallelMode.PIPELINE): - tensor = send_backward_recv_forward(output_grad, output_tensor.shape) - logger.info( - 'Rank {} sent backward received forward. Correct tensor: {}'. 
- format(rank, check_equal(tensor, output_tensor))) - if not gpc.is_last_rank(ParallelMode.PIPELINE): - grad = send_forward_recv_backward(output_tensor, output_grad.shape) - logger.info( - 'Rank {} sent forward received backward. Correct grad: {}'.format( - rank, check_equal(grad, output_grad))) - - -def check_comm(size, rank, prev_rank, next_rank, logger): - dtype = torch.float32 - device = get_current_device() - tensor_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - grad_shape = (BATCH_SIZE, SEQ_LENGTH, HIDDEN_SIZE) - tensor = torch.randn(tensor_shape, dtype=dtype, device=device) - dist.all_reduce(tensor) - grad = torch.randn(grad_shape, dtype=dtype, device=device) - dist.all_reduce(grad) - check_forward(tensor, rank, logger) - check_backward(grad, rank, logger) - check_forward_backward(tensor, grad, rank, logger) - - -def run_check(rank, world_size, port): - launch( - config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - logger = get_dist_logger() - rank = gpc.get_global_rank() - prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE) - next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE) - logger.info( - 'Rank {0}: prev rank {1}, next rank {2}'.format( - rank, prev_rank, next_rank)) - logger.info('Distributed environment is initialized.') - - check_comm(world_size, rank, prev_rank, next_rank, logger) - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_p2p(): - world_size = 4 - run_func = partial(run_check, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_p2p() diff --git a/tests/test_trainer/test_pipeline/test_partition.py b/tests/test_trainer/test_pipeline/test_partition.py deleted file mode 100644 index 61e7e707b16b63fb4d7c2f7dc24f066cae1f8efe..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/test_partition.py +++ /dev/null @@ -1,47 +0,0 @@ -import os.path as osp - -import pytest -import torch -import torch.multiprocessing as mp - -from colossalai.builder.pipeline import build_pipeline_model_from_cfg -from colossalai.core import global_context -from colossalai.initialize import launch -from colossalai.logging import get_dist_logger -from functools import partial -from colossalai.utils import free_port - -DIR_PATH = osp.dirname(osp.realpath(__file__)) -CONFIG_PATH = osp.join(DIR_PATH, 'resnet_config.py') - - -def run_partition(rank, world_size, port): - launch(config=CONFIG_PATH, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - logger = get_dist_logger() - logger.info('finished initialization') - - # build model - model = build_pipeline_model_from_cfg(global_context.config.model, 1, verbose=True) - assert isinstance(model, torch.nn.Module) - logger.info('model is created') - - global_context.destroy() - logger.info('partition test finished') - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_partition(): - world_size = 4 - run_func = partial(run_partition, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_partition() diff --git a/tests/test_trainer/test_pipeline/test_pipeline_schedule.py b/tests/test_trainer/test_pipeline/test_pipeline_schedule.py deleted file mode 100644 index d3c876c9c8cfe940df67de641f499517954e2873..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_pipeline/test_pipeline_schedule.py +++ /dev/null @@ -1,91 +0,0 @@ -# referenced from Megatron
and used to verify communication - -import os -import os.path as osp -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.builder import build_pipeline_model_from_cfg -from colossalai.core import global_context as gpc -from colossalai.engine.schedule import PipelineSchedule -from colossalai.initialize import launch -from colossalai.utils import free_port, get_dataloader, print_rank_0 -from torchvision import transforms -from torchvision.datasets import CIFAR10 - -import model - -BATCH_SIZE = 32 -NUM_MICRO = 8 - - -DIR_PATH = osp.dirname(osp.realpath(__file__)) -CONFIG_PATH = osp.join(DIR_PATH, './resnet_config.py') - - -def run_schedule(rank, world_size, port): - launch(config=CONFIG_PATH, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl') - - # build model - model = build_pipeline_model_from_cfg(gpc.config.model, 1) - print_rank_0('model is created') - - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.RandomCrop(size=32, padding=4), - transforms.RandomHorizontalFlip(), - transforms.ToTensor(), - transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[ - 0.2023, 0.1994, 0.2010]), - ] - ) - ) - - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - add_sampler=True, - batch_size=BATCH_SIZE, - pin_memory=True, - ) - - # build criterion - criterion = torch.nn.CrossEntropyLoss() - - # optimizer - optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0) - - # initialize - engine, train_dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, train_dataloader) - - # build pipeline schedule - schedule = PipelineSchedule(num_microbatches=NUM_MICRO) - - # run schedule - data_iter = iter(train_dataloader) - schedule.forward_backward_step(engine, data_iter) - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_pipeline_schedule(): - world_size = 4 - run_func = partial(run_schedule, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_pipeline_schedule() diff --git a/tests/test_trainer/test_trainer_with_non_pipe_schedule.py b/tests/test_trainer/test_trainer_with_non_pipe_schedule.py deleted file mode 100644 index 599efd883aa17e38012dc4866ff8c87d4c254b72..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_trainer_with_non_pipe_schedule.py +++ /dev/null @@ -1,96 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.amp.amp_type import AMP_TYPE -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.trainer import Trainer -from colossalai.utils import MultiTimer, free_port, get_dataloader -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -BATCH_SIZE = 16 -IMG_SIZE = 32 -NUM_EPOCHS = 200 - -CONFIG = dict( - # train with PyTorch AMP mixed precision - fp16=dict(mode=AMP_TYPE.TORCH)) - - -def run_trainer_no_pipeline(rank, world_size, port): - colossalai.launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset =
CIFAR10(root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose([ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ])) - - test_dataset = CIFAR10(root=Path(os.environ['DATA']), - train=False, - download=True, - transform=transforms.Compose([ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ])) - - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - test_dataloader = get_dataloader(dataset=test_dataset, batch_size=BATCH_SIZE, pin_memory=True, drop_last=True) - - # build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize(model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader) - - logger = get_dist_logger() - logger.info("engine is built", ranks=[0]) - - timer = MultiTimer() - trainer = Trainer(engine=engine, logger=logger, timer=timer) - logger.info("trainer is built", ranks=[0]) - - logger.info("start training", ranks=[0]) - trainer.fit(train_dataloader=train_dataloader, - test_dataloader=test_dataloader, - epochs=NUM_EPOCHS, - max_steps=100, - display_progress=True, - test_interval=5) - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_trainer_no_pipeline(): - world_size = 4 - run_func = partial(run_trainer_no_pipeline, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_trainer_no_pipeline() diff --git a/tests/test_trainer/test_trainer_with_pipe_schedule.py b/tests/test_trainer/test_trainer_with_pipe_schedule.py deleted file mode 100644 index 8dffc3fc748bfe6d9c911e8ab0d8e1263f6dc049..0000000000000000000000000000000000000000 --- a/tests/test_trainer/test_trainer_with_pipe_schedule.py +++ /dev/null @@ -1,107 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.context.parallel_mode import ParallelMode -from colossalai.core import global_context as gpc -from colossalai.engine.schedule import PipelineSchedule -from colossalai.logging import get_dist_logger -from colossalai.trainer import Trainer -from colossalai.utils import MultiTimer, free_port, get_dataloader -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -BATCH_SIZE = 16 -IMG_SIZE = 32 -NUM_EPOCHS = 200 - -CONFIG = dict(parallel=dict(pipeline=2, ), ) - - -def run_trainer_with_pipeline(rank, world_size, port): - colossalai.launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - # build model - model = resnet18(num_classes=10) - - # split resnet18 into two pipeline stages by local pipeline rank - if gpc.get_local_rank(ParallelMode.PIPELINE) == 0: - model = nn.Sequential(model.conv1, model.bn1, model.relu, model.maxpool, model.layer1, model.layer2) - elif gpc.get_local_rank(ParallelMode.PIPELINE) == 1: - - class Flatten(nn.Module): - def forward(self, x): - return torch.flatten(x, 1) - - model = nn.Sequential(model.layer3, model.layer4, model.avgpool, Flatten(), model.fc) - - # build dataloaders - train_dataset = CIFAR10(root=Path(os.environ['DATA']), -
download=True, - transform=transforms.Compose([ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ])) - - test_dataset = CIFAR10(root=Path(os.environ['DATA']), - train=False, - download=True, - transform=transforms.Compose([ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ])) - - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - test_dataloader = get_dataloader(dataset=test_dataset, batch_size=BATCH_SIZE, pin_memory=True, drop_last=True) - - # build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize(model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader) - - logger = get_dist_logger() - logger.info("engine is built", ranks=[0]) - pipe_schedule = PipelineSchedule(num_microbatches=4) - timer = MultiTimer() - trainer = Trainer(engine=engine, schedule=pipe_schedule, logger=logger, timer=timer) - logger.info("trainer is built", ranks=[0]) - - logger.info("start training", ranks=[0]) - - trainer.fit(train_dataloader=train_dataloader, - test_dataloader=test_dataloader, - epochs=NUM_EPOCHS, - max_steps=100, - display_progress=True, - test_interval=5) - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_trainer_with_pipeline(): - world_size = 4 - run_func = partial(run_trainer_with_pipeline, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_trainer_with_pipeline() diff --git a/tests/test_utils/__pycache__/test_activation_checkpointing.cpython-37-pytest-7.1.3.pyc b/tests/test_utils/__pycache__/test_activation_checkpointing.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 5679cb12ade7b17e0010c762253bef288d09cb3f..0000000000000000000000000000000000000000 Binary files a/tests/test_utils/__pycache__/test_activation_checkpointing.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_utils/__pycache__/test_gradient_accumluation.cpython-37-pytest-7.1.3.pyc b/tests/test_utils/__pycache__/test_gradient_accumluation.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 85b86372345a5587121dacf2aab3e11484c69f26..0000000000000000000000000000000000000000 Binary files a/tests/test_utils/__pycache__/test_gradient_accumluation.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_utils/__pycache__/test_gradient_accumluation.cpython-37.pyc b/tests/test_utils/__pycache__/test_gradient_accumluation.cpython-37.pyc deleted file mode 100644 index a5d11e6138fcd3cf7fbfeaff4ba2857ed25e90be..0000000000000000000000000000000000000000 Binary files a/tests/test_utils/__pycache__/test_gradient_accumluation.cpython-37.pyc and /dev/null differ diff --git a/tests/test_utils/test_activation_checkpointing.py b/tests/test_utils/test_activation_checkpointing.py deleted file mode 100644 index 1adc548fbd0a2ac8bc17bd5ddf3400d2c33677c6..0000000000000000000000000000000000000000 --- a/tests/test_utils/test_activation_checkpointing.py +++ /dev/null @@ -1,61 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import pytest -import torch -import torch.nn.functional as F - -from colossalai.context.parallel_mode import ParallelMode -from 
colossalai.context.random import add_seed, seed, set_mode -from colossalai.utils import checkpoint - - -def forward(x, weight): - out = torch.matmul(x, weight) - with seed(ParallelMode.DATA): - out_ = F.dropout(out, p=0.4, training=True) - return out_ - - -@pytest.mark.gpu -def test_activation_checkpointing(): - add_seed(ParallelMode.GLOBAL, 1024) - set_mode(ParallelMode.GLOBAL) - global_cuda_rng_state = torch.cuda.get_rng_state() - add_seed(ParallelMode.DATA, 1026) - set_mode(ParallelMode.DATA) - data_parallel_cuda_rng_state = torch.cuda.get_rng_state() - set_mode(ParallelMode.GLOBAL) - - # normal - data = torch.rand(2, 2, requires_grad=True).cuda() - data.retain_grad() - weight = torch.rand(2, 4, requires_grad=True).cuda() - - data_ = data.clone().detach() - data_.requires_grad = True - data_.retain_grad() - weight_ = weight.clone().detach() - weight_.requires_grad = True - - out = forward(data, weight) - loss = out.sum() - loss.backward() - - # checkpoint - set_mode(ParallelMode.GLOBAL) - torch.cuda.set_rng_state(global_cuda_rng_state) - set_mode(ParallelMode.DATA) - torch.cuda.set_rng_state(data_parallel_cuda_rng_state) - set_mode(ParallelMode.GLOBAL) - out = checkpoint(forward, data_, weight_) - loss = out.sum() - loss.backward() - - assert torch.all(data.grad == data_.grad), 'Gradient of the input does not match' - torch.cuda.empty_cache() - - -if __name__ == '__main__': - test_activation_checkpointing() diff --git a/tests/test_utils/test_gradient_accumluation.py b/tests/test_utils/test_gradient_accumluation.py deleted file mode 100644 index c7471d77ccb97dd1a0241ad5522f77f1d6bcfb55..0000000000000000000000000000000000000000 --- a/tests/test_utils/test_gradient_accumluation.py +++ /dev/null @@ -1,116 +0,0 @@ -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -import torch.nn as nn -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.utils import free_port, get_dataloader -from torch.optim import Adam -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -# Config -BATCH_SIZE = 16 -IMG_SIZE = 224 -NUM_CLASSES = 10 - -CONFIG = dict( - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ), - clip_grad_norm=1.0, - gradient_accumulation=4 -) - - -def run_no_pipeline(rank, world_size, port): - - # init dist env - colossalai.launch( - config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl' - ) - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - # build optimizer - optimizer = Adam(model.parameters(), lr=0.001) - criterion = nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize( - model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader - ) - logger = get_dist_logger() - rank = torch.distributed.get_rank() - param_track = [] - grad_track = [] - next(model.parameters()).retain_grad() - - engine.train() - step = 
0 - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - output = engine(img) - loss = engine.criterion(output, label) - engine.backward(loss) - engine.step() - - # check - param_track.append(next(model.parameters())[0].clone()) - grad_track.append(next(model.parameters()).grad[0].clone()) - step += 1 - if step == CONFIG['gradient_accumulation']: - break - - assert not torch.all(grad_track[0] == grad_track[-1]), 'grad should be different in different iterations' - assert torch.all(param_track[0] == param_track[1]) and not torch.all(param_track[0] == param_track[-1]), \ - 'param should be the same in the first few iterations and only changed in the last iteration' - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_engine(): - world_size = 4 - func = partial(run_no_pipeline, world_size=world_size, port=free_port()) - mp.spawn(func, nprocs=world_size) - - -if __name__ == '__main__': - test_engine() diff --git a/tests/test_zero_data_parallel/__pycache__/test_zero_level_2.cpython-37-pytest-7.1.3.pyc b/tests/test_zero_data_parallel/__pycache__/test_zero_level_2.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 3682cec22f12433ff440c13ab1e95b0e06c6886c..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_data_parallel/__pycache__/test_zero_level_2.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_zero_data_parallel/__pycache__/test_zero_level_2.cpython-37.pyc b/tests/test_zero_data_parallel/__pycache__/test_zero_level_2.cpython-37.pyc deleted file mode 100644 index 92c36f228514b11744189cb62b52660eca75665c..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_data_parallel/__pycache__/test_zero_level_2.cpython-37.pyc and /dev/null differ diff --git a/tests/test_zero_data_parallel/__pycache__/test_zero_level_3.cpython-37-pytest-7.1.3.pyc b/tests/test_zero_data_parallel/__pycache__/test_zero_level_3.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 2104e179467d526f311d5383aeb83cb8a4313b26..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_data_parallel/__pycache__/test_zero_level_3.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_zero_data_parallel/__pycache__/test_zero_level_3.cpython-37.pyc b/tests/test_zero_data_parallel/__pycache__/test_zero_level_3.cpython-37.pyc deleted file mode 100644 index c2f501e52ded5cbaeda777cd9d6c936769771fb6..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_data_parallel/__pycache__/test_zero_level_3.cpython-37.pyc and /dev/null differ diff --git a/tests/test_zero_data_parallel/test_zero_level_2.py b/tests/test_zero_data_parallel/test_zero_level_2.py deleted file mode 100644 index 9bdd1b12496be676d0fce6b215b5bfe0dd4711b9..0000000000000000000000000000000000000000 --- a/tests/test_zero_data_parallel/test_zero_level_2.py +++ /dev/null @@ -1,102 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.utils import free_port, get_dataloader -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -BATCH_SIZE = 16 -IMG_SIZE = 224 - -CONFIG = dict( - fp16=dict( - mode=None, - ), - zero=dict( - level=2, - cpu_offload=True, - verbose=False, - ), - parallel=dict( - 
pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ) -) - - -def run_dist(rank, world_size, port): - colossalai.launch(config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl') - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - # build optimizer and loss - # optimizer = build_optimizer(global_context.config.optimizer, model) - optimizer = torch.optim.Adam(model.parameters(), lr=0.001) - criterion = torch.nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize(model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader) - - # train - model.train() - for idx, (data, label) in enumerate(train_dataloader): - engine.zero_grad() - data = data.cuda() - label = label.cuda() - - output = engine(data) - loss = engine.criterion(output, label) - - engine.backward(loss) - engine.step() - break - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_zero_level_2(): - world_size = 4 - run_func = partial(run_dist, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_zero_level_2() diff --git a/tests/test_zero_data_parallel/test_zero_level_3.py b/tests/test_zero_data_parallel/test_zero_level_3.py deleted file mode 100644 index 2655210dbffa5dd5d1a9bfa378e25a53a1139ced..0000000000000000000000000000000000000000 --- a/tests/test_zero_data_parallel/test_zero_level_3.py +++ /dev/null @@ -1,114 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.utils import free_port, get_dataloader -from torchvision import transforms -from torchvision.datasets import CIFAR10 -from torchvision.models import resnet18 - -BATCH_SIZE = 16 -IMG_SIZE = 224 - -CONFIG = dict( - fp16=dict( - mode=None, - ), - zero=dict( - level=3, - verbose=False, - offload_optimizer_config=dict( - device='cpu', - pin_memory=True, - buffer_count=5, - fast_init=False - ), - offload_param_config=dict( - device='cpu', - pin_memory=True, - buffer_count=5, - buffer_size=1e8, - max_in_cpu=1e9 - ) - ), - parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=1, mode=None) - ) -) - - -def run_dist(rank, world_size, port): - colossalai.launch(config=CONFIG, - rank=rank, - world_size=world_size, - host='localhost', - port=port, - backend='nccl') - - # build model - model = resnet18(num_classes=10) - - # build dataloaders - train_dataset = CIFAR10( - root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose( - [ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ] - ) - ) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - # build optimizer and loss - # optimizer = 
build_optimizer(global_context.config.optimizer, model) - optimizer = torch.optim.Adam(model.parameters(), lr=0.001) - criterion = torch.nn.CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize(model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader) - - # train - model.train() - for idx, (data, label) in enumerate(train_dataloader): - engine.zero_grad() - data = data.cuda() - label = label.cuda() - - output = engine(data) - loss = engine.criterion(output, label) - - engine.backward(loss) - engine.step() - break - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_zero_level_3(): - world_size = 4 - run_func = partial(run_dist, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_zero_level_3() diff --git a/tests/test_zero_tensor_parallel/__pycache__/components.cpython-37.pyc b/tests/test_zero_tensor_parallel/__pycache__/components.cpython-37.pyc deleted file mode 100644 index 8da4ee5ed35638609801d527c0b1f96b74ba50c6..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_tensor_parallel/__pycache__/components.cpython-37.pyc and /dev/null differ diff --git a/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_2.cpython-37-pytest-7.1.3.pyc b/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_2.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index 551689bd01fd7a006943aab8d2c2cd361a28577f..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_2.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_2.cpython-37.pyc b/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_2.cpython-37.pyc deleted file mode 100644 index 73f751b71baa55412b06b4249fd1d3ba4c79dd42..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_2.cpython-37.pyc and /dev/null differ diff --git a/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_3.cpython-37-pytest-7.1.3.pyc b/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_3.cpython-37-pytest-7.1.3.pyc deleted file mode 100644 index e12772b8c57711765879bbff7d0c9442abb13ad3..0000000000000000000000000000000000000000 Binary files a/tests/test_zero_tensor_parallel/__pycache__/test_vit_2d_level_3.cpython-37-pytest-7.1.3.pyc and /dev/null differ diff --git a/tests/test_zero_tensor_parallel/components.py b/tests/test_zero_tensor_parallel/components.py deleted file mode 100644 index 69a4c9a95617651924b9af85d497e5a33aeea75b..0000000000000000000000000000000000000000 --- a/tests/test_zero_tensor_parallel/components.py +++ /dev/null @@ -1,19 +0,0 @@ - -import sys -from pathlib import Path -repo_path = Path(__file__).absolute().parents[2] -sys.path.append(str(repo_path)) - -try: - import model_zoo.vit.vision_transformer_from_config -except ImportError: - raise ImportError("model_zoo is not found, please check your path") - -BATCH_SIZE = 8 -IMG_SIZE = 32 -PATCH_SIZE = 4 -DIM = 512 -NUM_ATTENTION_HEADS = 8 -SUMMA_DIM = 2 -NUM_CLASSES = 10 -DEPTH = 6 diff --git a/tests/test_zero_tensor_parallel/test_vit_2d_level_2.py b/tests/test_zero_tensor_parallel/test_vit_2d_level_2.py deleted file mode 100644 index f1c7ba1c1aa92950d62137145ea0606decc834fe..0000000000000000000000000000000000000000 --- a/tests/test_zero_tensor_parallel/test_vit_2d_level_2.py +++ /dev/null @@ -1,100 
+0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.autograd -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.nn import CrossEntropyLoss -from colossalai.utils import free_port, get_dataloader -from model_zoo.vit import vit_lite_depth7_patch4_32 -from torchvision import transforms -from torchvision.datasets import CIFAR10 - -from components import * - -CONFIG = dict(parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=4, mode='2d'), -), - fp16=dict(mode=None, ), - zero=dict(level=2)) - - -def train_epoch(engine, train_dataloader): - engine.train() - accumulated_loss = 0 - num_steps = len(train_dataloader) - data_iter = iter(train_dataloader) - for i in range(num_steps): - output, label, loss = engine.step(data_iter) - accumulated_loss += loss.detach().cpu().numpy() - avg_loss = accumulated_loss / num_steps - return avg_loss - - -def run_2d_parallel_vision_transformer_level_2(rank, world_size, port): - colossalai.launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - # build model - model = vit_lite_depth7_patch4_32() - - # build dataloaders - train_dataset = CIFAR10(root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose([ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ])) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - # build optimizer and loss - optimizer = torch.optim.Adam(model.parameters(), lr=0.001) - criterion = CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize(model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader) - logger = get_dist_logger() - - logger.info('start training') - engine.train() - - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - out = engine(img) - loss = engine.criterion(out, label) - engine.backward(loss) - engine.step() - break - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -def test_2d_vit_zero_level_2(): - world_size = 4 - run_func = partial(run_2d_parallel_vision_transformer_level_2, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_2d_vit_zero_level_2() diff --git a/tests/test_zero_tensor_parallel/test_vit_2d_level_3.py b/tests/test_zero_tensor_parallel/test_vit_2d_level_3.py deleted file mode 100644 index 6101ce556ab5f59eebe46856426d9bffc4bdfc8d..0000000000000000000000000000000000000000 --- a/tests/test_zero_tensor_parallel/test_vit_2d_level_3.py +++ /dev/null @@ -1,101 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import os -from functools import partial -from pathlib import Path - -import colossalai -import pytest -import torch -import torch.autograd -import torch.multiprocessing as mp -from colossalai.core import global_context as gpc -from colossalai.logging import get_dist_logger -from colossalai.nn import CrossEntropyLoss -from colossalai.utils import free_port, get_dataloader -from model_zoo.vit import vit_lite_depth7_patch4_32 -from torchvision import transforms -from 
torchvision.datasets import CIFAR10 - -from components import * - -CONFIG = dict(parallel=dict( - pipeline=dict(size=1), - tensor=dict(size=4, mode='2d'), -), - fp16=dict(mode=None, ), - zero=dict(level=3)) - - -def train_epoch(engine, train_dataloader): - engine.train() - accumulated_loss = 0 - num_steps = len(train_dataloader) - data_iter = iter(train_dataloader) - for i in range(num_steps): - output, label, loss = engine.step(data_iter) - accumulated_loss += loss.detach().cpu().numpy() - avg_loss = accumulated_loss / num_steps - return avg_loss - - -def run_2d_parallel_vision_transformer_level_3(rank, world_size, port): - colossalai.launch(config=CONFIG, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl') - - # build model - model = vit_lite_depth7_patch4_32() - - # build dataloaders - train_dataset = CIFAR10(root=Path(os.environ['DATA']), - download=True, - transform=transforms.Compose([ - transforms.Resize(size=(IMG_SIZE, IMG_SIZE)), - transforms.ToTensor(), - transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) - ])) - train_dataloader = get_dataloader(dataset=train_dataset, - shuffle=True, - batch_size=BATCH_SIZE, - pin_memory=True, - drop_last=True) - - # build optimizer and loss - optimizer = torch.optim.Adam(model.parameters(), lr=0.001) - criterion = CrossEntropyLoss() - - engine, train_dataloader, *args = colossalai.initialize(model=model, - optimizer=optimizer, - criterion=criterion, - train_dataloader=train_dataloader) - logger = get_dist_logger() - - logger.info('start training') - engine.train() - - for img, label in train_dataloader: - engine.zero_grad() - img = img.cuda() - label = label.cuda() - out = engine(img) - loss = engine.criterion(out, label) - engine.backward(loss) - engine.step() - break - - gpc.destroy() - torch.cuda.empty_cache() - - -@pytest.mark.dist -@pytest.mark.skip("Level 3 has an unknown bug, so this test is skipped for now") -def test_2d_vit_zero_level_3(): - world_size = 4 - run_func = partial(run_2d_parallel_vision_transformer_level_3, world_size=world_size, port=free_port()) - mp.spawn(run_func, nprocs=world_size) - - -if __name__ == '__main__': - test_2d_vit_zero_level_3()
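
Every test file removed in this patch shares the same bootstrap pattern: reserve a free TCP port, spawn one process per rank with `mp.spawn`, have each worker join the rendezvous through `colossalai.launch(..., host='localhost', backend='nccl')`, run its check, then tear down with `gpc.destroy()` and `torch.cuda.empty_cache()`. The sketch below is a minimal distillation of that shared harness, using only the APIs the removed tests themselves import; `CONFIG` and the body of `run_worker` are illustrative placeholders, not a reconstruction of any one deleted test.

```python
from functools import partial

import pytest
import torch
import torch.multiprocessing as mp

import colossalai
from colossalai.core import global_context as gpc
from colossalai.utils import free_port

# placeholder: each removed test supplied its own config dict here
CONFIG = dict()


def run_worker(rank, world_size, port):
    # every worker joins the same localhost rendezvous over NCCL
    colossalai.launch(config=CONFIG,
                      rank=rank,
                      world_size=world_size,
                      host='localhost',
                      port=port,
                      backend='nccl')

    # ... per-test logic goes here: build a model, run a step, assert ...

    # tear down the process groups and release cached CUDA memory so the
    # next spawned test starts from a clean slate
    gpc.destroy()
    torch.cuda.empty_cache()


@pytest.mark.dist
def test_example():
    world_size = 4
    run_func = partial(run_worker, world_size=world_size, port=free_port())
    mp.spawn(run_func, nprocs=world_size)
```

Binding `world_size` and `port` with `functools.partial` leaves `rank` as the sole positional argument, which is exactly the signature `mp.spawn` passes to each child process.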